Welcome to the 48th edition of the data engineering newsletter. This week's release is a new set of articles that focus on Pedram Navid’s For SQL, Benn Stancil’s Analytics is at a crossroads, Continual’s The Future of the Modern Data Stack, NuBank's Scaling data analytics with software engineering best practices, DoorDash's indexing infrastructure with Apache Kafka and Elasticsearch, Pinterest's unified Flink data source, PayPal's introduction to DataFu-Spark, and Pinterest's interactive querying with Apache Spark SQL.
Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.
Event: Airbnb - The Journey Toward High-Quality Data
Airbnb is hosting its first virtual tech talk focusing on data quality on Wednesday, July 28th, 12:00 PM-1:00 PM PST. Sign up here:
https://journeytowardhighqualitydata.splashthat.com/
Pedram Navid: For SQL
Last week we saw Jamie Brandon's manifesto against SQL, which argued that SQL's problems boil down to its inexpressiveness, incompressibility, and non-porousness. Pedram Navid writes a well-thought-out article in favor of SQL, saying these problems are not a concern in most cases, and that when it comes to composability, tools like dbt have helped bridge the gap by bringing the power of Jinja templating to SQL. The author raises a valid question: do all these arguments against SQL result from an almost class divide between "Software Engineering" and "Data People"?
https://pedram.substack.com/p/for-sql
Benn Stancil: Analytics is at a crossroads
Benn Stancil beautifully summarizes his thoughts on For SQL, Against SQL, and The data team: a short story, starting by asking, "The world is full of great analysts. Will we have the courage to go looking for them?"
The author rightly points out that the most challenging and most essential problems analysts work on aren't technical or even mathematical, highlighting the challenges for analytics engineering.
https://benn.substack.com/p/analytics-is-at-a-crossroads
The blog triggered some exciting and much-needed debate on Twitter, and Josh Wills summarized it beautifully.
Continual: The Future of the Modern Data Stack
Data infrastructure has come a long way from in-house Hadoop clusters to the increasing adoption of cloud-native solutions. The article narrates the emerging focus areas on top of the modern data stack, such as AI, data sharing, data governance, streaming, and application serving.
https://continual.ai/post/the-future-of-the-modern-data-stack
NuBank: Scaling data analytics with software engineering best practices
Self-serve data analytics is a primary goal for an organization that wants to scale data usage and stop the data team from becoming a bottleneck. A well-defined process, and tools that enable it, are essential for self-serve analytics. NuBank writes an exciting article sharing its path to self-serve analytics by scaling data analytics with software engineering best practices.
https://building.nubank.com.br/scaling-data-analytics-with-software-engineering-best-practices/
Sponsored - RudderStack: Real-Time Personalization with Redis and RudderStack
Nailing personalization can mean increasing revenue by 15%, but technical challenges keep many companies stuck using basic methods. RudderStack writes a step-by-step guide on designing and implementing a real-time personalization engine using Redis and RudderStack.
https://rudderstack.com/blog/real-time-personalization-with-redis-and-rudderstack
DoorDash: Building Faster Indexing with Apache Kafka and Elasticsearch
DoorDash writes about its search indexing infrastructure built on Apache Kafka, Apache Flink, and Elasticsearch. The adoption of incremental indexing to support both CDC and ETL data, the Assembler design to connect with the ETL DB, and windowed API lookups to enrich the entities are some of the design highlights of the indexing infrastructure.
https://doordash.engineering/2021/07/14/open-source-search-indexing/
Pinterest: Unified Flink Source at Pinterest - Streaming Data Processing
Pinterest writes about its streaming infrastructure, Xenon, focusing on a unified Flink data source that combines Kafka and data on S3. The approach abstracts the complexity of data storage from the consumer yet delivers all the streaming guarantees. The article captures a trend in data infrastructure: closing the gap between batch processing and stream processing.
PayPal: Introducing DataFu-Spark
Apache DataFu™ is a collection of libraries for working with large-scale data in Hadoop. It provides well-tested solutions to common big data processing problems such as data deduplication and skewed joins. PayPal writes about DataFu's integration with Spark, with examples of finding the most recent update of a record, skewed joins, joins with ranges, counting distinct values, and calling Python code from Scala.
https://medium.com/paypal-tech/introducing-datafu-spark-ba67faf1933a
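For readers new to the "most recent update of a record" problem mentioned above, here is a minimal plain-Python sketch of the idea. It is not DataFu-Spark's API and does not reflect how PayPal implements it; the function name `latest_per_key`, the field names, and the sample rows are all hypothetical and only illustrate keeping one row per key, chosen by an ordering column.

```python
from typing import Any, Dict, Hashable, Iterable, List


def latest_per_key(
    rows: Iterable[Dict[str, Any]], key: str, order_by: str
) -> List[Dict[str, Any]]:
    """Keep, for each value of `key`, the row with the largest `order_by`."""
    latest: Dict[Hashable, Dict[str, Any]] = {}
    for row in rows:
        current = latest.get(row[key])
        # Replace only on a strictly newer row, so the first row seen wins ties.
        if current is None or row[order_by] > current[order_by]:
            latest[row[key]] = row
    return list(latest.values())


# Hypothetical sample data: several updates per user_id.
updates = [
    {"user_id": 1, "email": "a@old.example", "updated_at": 10},
    {"user_id": 2, "email": "b@example.com", "updated_at": 12},
    {"user_id": 1, "email": "a@new.example", "updated_at": 20},
]

result = latest_per_key(updates, key="user_id", order_by="updated_at")
```

On a single machine this is one pass over the rows with a dictionary. A distributed engine like Spark has to achieve the same result across partitions, which is the kind of boilerplate a library of tested helpers aims to remove.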
Pinterest: Interactive Querying with Apache Spark SQL at Pinterest
Though Presto remains the most popular query engine choice for quick interactive querying with limited resource requirements, we often end up requiring Hive or Spark SQL to query extensive data for ad-hoc exploration. Pinterest shares its experience of building Spark SQL as an interactive query engine using Apache Livy and Remote Spark Context.
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.