Data Engineering Weekly #48

Weekly Data Engineering Newsletter

Jul 18, 2021

Welcome to the 48th edition of the data engineering newsletter. This week's release is a new set of articles that focus on Pedram Navid's For SQL, Benn Stancil's Analytics is at a crossroads, Continual's The Future of the Modern Data Stack, NuBank's Scaling data analytics with software engineering best practices, DoorDash's indexing infrastructure with Apache Kafka and Elasticsearch, Pinterest's unified Flink data source, PayPal's introduction to DataFu-Spark, and Pinterest's interactive querying with Apache Spark SQL.

Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.

Event: Airbnb - The Journey Toward High-Quality Data

Airbnb is hosting its first virtual tech talk focusing on data quality on Wednesday, July 28th, 12:00 PM-1:00 PM PST. Sign up here:

https://journeytowardhighqualitydata.splashthat.com/

Pedram Navid: For SQL

Last week we saw Jamie Brandon's manifesto against SQL, which argues that SQL's problems boil down to its inexpressiveness, incompressibility, and non-porousness. Pedram Navid writes a well-thought-out article in defense of SQL, saying these problems are not a concern in most cases, and that when it comes to composability, tools like dbt have helped bridge the gap by bringing the power of Jinja templating to SQL. The author raises a valid question: do all these arguments against SQL result from an almost class divide between "Software Engineering" and "Data People"?
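
As a rough illustration of the composability point, the sketch below renders and nests SQL from reusable templates. It uses Python's standard-library string.Template as a stand-in for Jinja, and the ref helper, the MODELS mapping, and all table names are hypothetical. It is not how dbt itself is implemented.

```python
from string import Template

# Hypothetical mapping from logical model names to physical tables.
MODELS = {
    "orders": "analytics.stg_orders",
    "customers": "analytics.stg_customers",
}


def ref(model: str) -> str:
    """Resolve a logical model name to a table name (illustrative helper)."""
    return MODELS[model]


# A reusable query fragment; $source and $min_amount are filled in at render time.
LARGE_ORDERS = Template(
    "SELECT customer_id, SUM(amount) AS total "
    "FROM $source WHERE amount >= $min_amount GROUP BY customer_id"
)


def render_large_orders(min_amount: int) -> str:
    """Render the fragment against the 'orders' model."""
    return LARGE_ORDERS.substitute(source=ref("orders"), min_amount=min_amount)


def top_customers_sql(min_amount: int) -> str:
    """Compose a larger query by wrapping the rendered fragment in a CTE."""
    inner = render_large_orders(min_amount)
    return (
        f"WITH large_orders AS ({inner}) "
        f"SELECT c.name, l.total FROM large_orders l "
        f"JOIN {ref('customers')} c ON c.customer_id = l.customer_id"
    )
```

The fragment is written once and reused inside a larger query, which is the kind of composition plain SQL makes awkward.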

https://pedram.substack.com/p/for-sql

Benn Stancil: Analytics is at a crossroads

Benn Stancil beautifully summarizes his thoughts on For SQL, Against SQL, and The Data Team: A Short Story, starting by asking, "The world is full of great analysts. Will we have the courage to go looking for them?" The author rightly points out that the most challenging and most essential problems analysts work on aren't technical or even mathematical, highlighting the challenges for analytics engineering.

https://benn.substack.com/p/analytics-is-at-a-crossroads

The blog triggered some exciting and much-needed debate on Twitter, and Josh Wills summarized it beautifully.

Josh Wills @josh_wills

@bennstancil Second, this entire debate assumes a world in which software engineering is the highest status profession and is therefore the one worth emulating, which is true of our world to an extent but isn't generally true in, like, most of reality.

Continual: The Future of the Modern Data Stack

Data infrastructure has come a long way, from in-house Hadoop clusters to increasingly adopting cloud-native solutions. The article narrates the emerging focus areas on top of the modern data stack, such as AI, data sharing, data governance, streaming, and application serving.

https://continual.ai/post/the-future-of-the-modern-data-stack

NuBank: Scaling data analytics with software engineering best practices

Self-serve data analytics is a primary goal for an organization that wants to scale data usage and remove the data team as a bottleneck. A well-defined process, and tools that enable the process, are essential for self-serve analytics. NuBank writes an exciting article sharing its path to self-serve analytics by scaling data analytics with software engineering best practices.

https://building.nubank.com.br/scaling-data-analytics-with-software-engineering-best-practices/

DoorDash: Building Faster Indexing with Apache Kafka and Elasticsearch

DoorDash writes about its search indexing infrastructure built on Apache Kafka, Apache Flink, and Elasticsearch. The adoption of incremental indexing to support both CDC and ETL data, the Assembler design to connect with the ETL DB, and windowed API lookups to enrich the entities are some of the design highlights of the indexing infrastructure.

https://doordash.engineering/2021/07/14/open-source-search-indexing/

Pinterest: Unified Flink Source at Pinterest - Streaming Data Processing

Pinterest writes about its streaming infrastructure, Xenon, focusing on a unified Flink data source approach that combines Kafka and data on S3, abstracting the complexity of data storage from the consumer while still delivering all the streaming guarantees. The article captures the trend in data infrastructure of closing the gap between batch processing and stream processing.
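
To make the idea of a unified source concrete, here is a minimal plain-Python sketch in which a consumer reads one iterator without knowing whether an event came from archived storage or the live stream. The function name, the event shape, and the single cutover timestamp are assumptions for illustration. This is neither Pinterest's implementation nor the Flink API.

```python
from itertools import chain
from typing import Dict, Iterable, Iterator

Event = Dict[str, int]  # assumed shape: {"ts": <event time>, "v": <payload>}


def unified_source(
    archived: Iterable[Event],
    live: Iterable[Event],
    cutover_ts: int,
) -> Iterator[Event]:
    """Yield archived events before the cutover, then live events from it onward.

    Filtering both sides on the same timestamp avoids emitting an event twice
    when the archive and the live stream overlap around the cutover.
    """
    older = (e for e in archived if e["ts"] < cutover_ts)
    newer = (e for e in live if e["ts"] >= cutover_ts)
    return chain(older, newer)
```

A consumer iterates over the result the same way whether it is backfilling from the archive or reading fresh events.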

https://medium.com/pinterest-engineering/unified-flink-source-at-pinterest-streaming-data-processing-c9d4e89f2ed6

PayPal: Introducing DataFu-Spark

Apache DataFu™ is a collection of libraries for working with large-scale data in Hadoop. It provides well-tested solutions to common big data processing problems such as data deduplication and skewed joins. PayPal writes about DataFu's integration with Spark, with examples of finding the most recent update to a record, skewed joins, joins with ranges, counting distinct values, and calling Python code from Scala.
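
The "most recent update" pattern mentioned above amounts to keeping one row per key, chosen by an ordering column. The sketch below shows that logic in plain Python on in-memory dictionaries. The function and field names are hypothetical, and it does not use the DataFu-Spark API.

```python
from typing import Any, Dict, Hashable, Iterable, List

Record = Dict[str, Any]


def dedup_latest(records: Iterable[Record], key: str, order_by: str) -> List[Record]:
    """Keep, for each value of `key`, the record with the largest `order_by`."""
    latest: Dict[Hashable, Record] = {}
    for rec in records:
        k = rec[key]
        # Replace the stored record only if this one is strictly newer.
        if k not in latest or rec[order_by] > latest[k][order_by]:
            latest[k] = rec
    return list(latest.values())
```

On a distributed engine the same result is usually obtained with a window or an aggregation per key, so that the work is spread across partitions.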

https://medium.com/paypal-tech/introducing-datafu-spark-ba67faf1933a

Pinterest: Interactive Querying with Apache Spark SQL at Pinterest

Though Presto remains the most popular query engine choice for quick interactive querying with limited resource requirements, we often end up requiring Hive or Spark SQL to query extensive data for ad-hoc exploration. Pinterest shares its experience of building Spark SQL as an interactive query engine using Apache Livy and Remote Spark Context.
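
For readers unfamiliar with Apache Livy, the sketch below builds the two REST requests typically used to run a SQL statement through it: one to create an interactive session and one to submit a statement. The host name is hypothetical, the requests are only constructed and never sent, and this is not a description of Pinterest's setup.

```python
import json
from urllib.request import Request

# Hypothetical Livy endpoint; 8998 is Livy's default port.
LIVY_URL = "http://livy.example.internal:8998"


def _post(path: str, payload: dict) -> Request:
    """Build (but do not send) a JSON POST request to the Livy server."""
    return Request(
        LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_sql_session_request() -> Request:
    """Request that asks Livy to start an interactive session of kind 'sql'."""
    return _post("/sessions", {"kind": "sql"})


def run_statement_request(session_id: int, sql: str) -> Request:
    """Request that submits one SQL statement to an existing session."""
    return _post(f"/sessions/{session_id}/statements", {"code": sql})
```

Passing either request to urllib.request.urlopen would send it. A client would then poll the statement's URL until the result is available.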

https://medium.com/pinterest-engineering/interactive-querying-with-apache-spark-sql-at-pinterest-2a3eaf60ac1b

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.
