Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Uber: ML Education at Uber: Program Design and Outcomes
Uber writes the second part of its internal ML education program series, focusing on content delivery, usage observability, and marketing & reach. The blog post is a good reminder that platform engineering is not only about building abstractions and driving standardization, but also about marketing and selling those abstractions to internal customers.
https://www.uber.com/blog/ml-education-at-uber-program-design-and-outcomes/
First part: https://www.uber.com/blog/ml-education-at-uber/
PayPal: The next generation of Data Platforms is the Data Mesh
PayPal writes about its strategy to adopt data mesh principles. The blog acknowledges that there is no standard implementation yet, but it establishes the business case for why PayPal needs data mesh principles in its data strategy. It is an exciting space to watch at PayPal.
https://medium.com/paypal-tech/the-next-generation-of-data-platforms-is-the-data-mesh-b7df4b825522
The accompanying talk makes a much more systematic case for data mesh.
InfoQ: AI, ML, and Data Engineering InfoQ Trends Report—August 2022
InfoQ released its 2022 AI, ML & Data Engineering Trends report. Resource managers like YARN and stream processing have now moved to the late-adopter stage, and there are exciting new entrants such as knowledge graphs, AI pair programmers (like GitHub Copilot), and synthetic data generation.
https://www.infoq.com/articles/ai-ml-data-engineering-trends-2022/
DoorDash: Building Scalable Real Time Event Processing with Kafka and Flink
DoorDash writes about its real-time data infrastructure on top of Kafka, Flink & Pinot. The blog narrates how a producer proxy in front of Kafka helped scale the pipelines, the Flink SQL abstraction layer, the use of the Kafka schema registry, and the data warehouse integration.
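For intuition, here is a minimal Python sketch of the producer-proxy idea: a small HTTP service owns the Kafka producer and centralizes batching, retries, and serialization so application teams don't have to tune clients themselves. This is purely illustrative (the endpoint, topic, and config are my own assumptions), not DoorDash's implementation.

```python
# A minimal sketch of the "producer proxy" idea: applications POST JSON events
# to a small HTTP service, which owns the Kafka producer and handles batching
# and retries centrally. Broker, topic, and port are illustrative assumptions.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",  # assumption: local broker
    "enable.idempotence": True,             # safe retries
    "linger.ms": 50,                        # batch small events together
})


class EventProxyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        event = json.loads(body)
        # The proxy, not the caller, decides on keys, serialization, retries, etc.
        producer.produce(
            topic="app-events",                      # hypothetical topic
            key=str(event.get("user_id", "")),
            value=json.dumps(event).encode("utf-8"),
        )
        producer.poll(0)  # serve delivery callbacks without blocking
        self.send_response(202)
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), EventProxyHandler).serve_forever()
```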
Sponsored: Firebolt is a proud sponsor of Data Engineering Weekly.
Firebolt is the cloud data warehouse for builders of next-gen analytics experiences.
Combining the benefits and ease of use of modern architecture with sub-second performance at terabyte scale, Firebolt helps data engineering and dev teams deliver data applications that end users love.
Airbnb: Airbnb’s Approach to Access Management at Scale
Airbnb writes about the design of its Access Management system. The blog narrates various focus points of the design, system guarantees, and the impact of the system after the rollout.
https://medium.com/airbnb-engineering/airbnbs-approach-to-access-management-at-scale-cfa66c32f03c
Etsy: Faster ML Experimentation at Etsy with Interleaving
Online experimentation plays a central role in product development. Etsy writes about how it uses interleaving experiments to capture user preference at the individual level, rather than comparing the average behavior of two groups that see distinct experiences, as in a typical A/B test.
https://www.etsy.com/codeascraft/faster-ml-experimentation-at-etsy-with-interleaving
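For readers new to the technique, here is a rough Python sketch of team-draft interleaving, the general family Etsy's approach belongs to; it is illustrative only and not Etsy's code.

```python
# A minimal sketch of team-draft interleaving: merge two rankings so each
# ranker contributes results in alternating "drafts", then credit clicks to
# whichever ranker supplied the clicked item. Illustrative only.
import random

random.seed(7)  # for a reproducible example


def team_draft_interleave(ranking_a, ranking_b):
    """Return the interleaved list and an item -> contributing team map."""
    interleaved, team_of, used = [], {}, set()
    it_a, it_b = iter(ranking_a), iter(ranking_b)

    def next_unused(it):
        for item in it:
            if item not in used:
                return item
        return None

    while True:
        # Randomize which team picks first in each round to avoid position bias.
        order = [("A", it_a), ("B", it_b)]
        random.shuffle(order)
        picked_any = False
        for team, it in order:
            item = next_unused(it)
            if item is not None:
                interleaved.append(item)
                team_of[item] = team
                used.add(item)
                picked_any = True
        if not picked_any:
            return interleaved, team_of


# Example: more clicks credited to A's picks means the user preferred ranker A.
merged, team_of = team_draft_interleave(["x", "y", "z"], ["y", "w", "z"])
clicks = ["y", "w"]
score = {"A": 0, "B": 0}
for item in clicks:
    score[team_of[item]] += 1
print(merged, score)
```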
Sponsored: Why data engineering needs a DSL to check data as-code
Data reliability needs its own language - a language that is specific enough to address the problems data engineers and data teams face when they find themselves firefighting data issues as reports, dashboards, or machine learning models break, yet accessible enough for non-engineers to use. (Im)Possible? Check out Soda Checks Language.
https://www.soda.io/resources/introducing-a-new-domain-specific-language-for-data-reliability
Astrafy: dbt at scale on Google Cloud
Astrafy writes a three-part series on building dbt pipelines on Google Cloud. The series covers dbt in a Google Cloud architecture, dbt versioning, data quality, orchestration with Cloud Composer, and DataOps; a minimal Composer orchestration sketch follows the part links below.
Part 1: https://www.astrafy.io/articles/dbt-at-scale-on-google-cloud-part-1
Part 2: https://www.astrafy.io/articles/dbt-at-scale-on-google-cloud-part-2
Part 3: https://www.astrafy.io/articles/dbt-at-scale-on-google-cloud-part-3
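For the orchestration piece, here is a minimal sketch of running dbt from Cloud Composer (managed Airflow) via a BashOperator. The project path, schedule, and task layout are my own assumptions, not Astrafy's setup.

```python
# A minimal sketch of orchestrating dbt from Cloud Composer (Airflow).
# Paths, project name, and schedule are illustrative assumptions only.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_PROJECT_DIR = "/home/airflow/gcs/data/dbt_project"  # hypothetical location

with DAG(
    dag_id="dbt_daily_run",
    start_date=datetime(2022, 8, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"dbt run --project-dir {DBT_PROJECT_DIR} --profiles-dir {DBT_PROJECT_DIR}",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"dbt test --project-dir {DBT_PROJECT_DIR} --profiles-dir {DBT_PROJECT_DIR}",
    )
    dbt_run >> dbt_test  # only test models after a successful build
```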
Dominik Golebiewski: Snowflake query optimiser: unoptimised
dbt established the case that CTEs (Common Table Expressions) are passthroughs and that their performance impact is negligible because modern data warehouse optimizers recognize the pattern. The blog shows that this is not always the case with Snowflake by comparing an import CTE against referencing the base table directly in the downstream CTE. The change reduced one table's build time from over 30 minutes to less than 10 minutes, roughly a $600 saving for that table. A simplified sketch of the two query shapes follows the references below.
https://medium.com/@AtheonAnalytics/snowflake-query-optimiser-unoptimised-cf0223bdd136
References from the article:
https://discourse.getdbt.com/t/ctes-are-passthroughs-some-research/155
https://discourse.getdbt.com/t/why-the-fishtown-sql-style-guide-uses-so-many-ctes/1091
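To make the comparison concrete, here is a rough, simplified sketch of the two query shapes, timed with the Snowflake Python connector. The tables, columns, and connection details are made up, and the article's actual dbt models are more involved than this.

```python
# A rough sketch of timing an "import CTE" vs. a direct base-table reference
# on Snowflake. Connection parameters, table, and column names are made up.
import time

import snowflake.connector

IMPORT_CTE_SQL = """
with orders as (             -- dbt-style import CTE: a plain alias of the base table
    select * from analytics.raw.orders
)
select customer_id, sum(amount) as total
from orders
where order_date >= '2022-01-01'
group by customer_id
"""

DIRECT_SQL = """
select customer_id, sum(amount) as total
from analytics.raw.orders    -- reference the base table directly
where order_date >= '2022-01-01'
group by customer_id
"""

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...", warehouse="transforming"
)

for label, sql in [("import CTE", IMPORT_CTE_SQL), ("direct reference", DIRECT_SQL)]:
    cur = conn.cursor()
    start = time.time()
    cur.execute(sql)
    cur.fetchall()
    print(f"{label}: {time.time() - start:.1f}s")
```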
Sponsored: How Allbirds solves identity resolution in the warehouse with dbt Labs, Snowflake, and RudderStack
Join this live webinar on August 17th for a deep dive on identity resolution in the warehouse with Allbirds Staff Data Engineer, Chandra Gangireddy. Chandra will detail the architecture and end-to-end data flow they use to break down data silos, build customer profiles, and operationalize customer data throughout the company.
HubSpot: Building a Fast, Thread-safe Hotspot Tracking Library
LMAX Disruptor is one of the best Java libraries for building a bounded queue with lock-free enqueues. HubSpot writes about how the LMAX Disruptor helped it build a fast, thread-safe hotspot tracking library.
https://product.hubspot.com/blog/hotspot-tracking-library
LMAX Disruptor Paper: LMAX Disruptor: High performance alternative to bounded queues for exchanging data between concurrent threads
Games24x7: ksqlDB in Data Engineering at Games24x7
Games24x7 shares its experience running ksqlDB and tuning it for optimal performance. The walkthrough of state-store tuning, horizontal vs. vertical scaling, and GC optimization reveals a lot about the internal workings of ksqlDB.
https://medium.com/@anupsdtiwari/ksqldb-in-data-engineering-at-games24x7-9c66b7cf5aa0
Yelp: Spark Data Lineage
Yelp writes about its attempt to capture the Spark lineage from metadata description files.
I'm curious whether there is any automatic way to capture Spark lineage beyond manual input, perhaps by parsing the Spark logical plan. Please comment if you know any library that parses Spark's logical plan for lineage.
https://engineeringblog.yelp.com/2022/08/spark-data-lineage.html
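For what it's worth, recent PySpark versions expose a couple of starting points short of full plan parsing: DataFrame.inputFiles() lists the files backing file-based sources, and explain() dumps the parsed/analyzed/optimized plans as text that a lineage tool could scrape. A rough sketch with made-up paths and column names:

```python
# A rough sketch of pulling coarse lineage hints out of PySpark itself.
# The input path and column names are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-sketch").getOrCreate()

orders = spark.read.parquet("s3://my-bucket/raw/orders/")  # hypothetical source
daily = (
    orders.groupBy("order_date")
    .sum("amount")
    .withColumnRenamed("sum(amount)", "total_amount")
)

# 1) File-level lineage for file-based sources: which files feed this DataFrame?
print(daily.inputFiles())

# 2) The logical/physical plans as text -- a crude starting point for scraping
#    relation names; dedicated tools parse the plan tree instead.
daily.explain(mode="extended")
```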
Cloudera: Speeding up Queries With Z-Order
Z-Order indexing is a popular index optimization implemented in many data management systems. Cloudera writes about the Z-Order implementation in Impala and how it improves query efficiency; a tiny bit-interleaving sketch follows the references below.
https://blog.cloudera.com/speeding-up-queries-with-z-order/
References:
Z-order indexing for multifaceted queries in Amazon DynamoDB
Performance Tuning Apache Spark with Z-Ordering and Data Skipping in Azure Databricks
Hudi Z-Order and Hilbert Space Filling Curves
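The core trick behind Z-ordering is mapping several columns onto a single sort key by interleaving their bits, so rows that are close in multiple dimensions land close together on disk and min/max statistics can skip more data. A tiny Python illustration (not Impala's implementation):

```python
# A tiny illustration of the bit-interleaving behind Z-order (Morton) keys.
# Real engines (Impala, Delta, Hudi) work on normalized column encodings;
# this simply interleaves the low bits of two small unsigned integers.
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y: x -> even positions, y -> odd."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # bit i of x -> position 2i
        key |= ((y >> i) & 1) << (2 * i + 1)    # bit i of y -> position 2i+1
    return key


# Rows sorted by this key cluster points that are near each other in (x, y),
# which is what lets multi-column predicates skip row groups.
points = [(3, 7), (4, 2), (3, 6), (12, 1)]
for x, y in sorted(points, key=lambda p: z_order_key(*p)):
    print(x, y, bin(z_order_key(x, y)))
```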
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
>Please comment if you know any library that parses Spark's logical plan for lineage.
Spark can send lineage to Apache Atlas via Kafka using the spark-atlas-connector: https://github.com/hortonworks-spark/spark-atlas-connector