Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Uber: ML Education at Uber: Program Design and Outcomes
Uber writes the second part of its internal ML education program series, focusing on content delivery, usage observability, and marketing & reach. The blog post is a good reminder that platform engineering is not only about building abstractions and driving standardization, but also about marketing and selling those abstractions to internal customers.
https://www.uber.com/blog/ml-education-at-uber-program-design-and-outcomes/
First part: https://www.uber.com/blog/ml-education-at-uber/
PayPal: The next generation of Data Platforms is the Data Mesh
PayPal writes about its strategy to adopt data mesh principles. The blog acknowledges that there is no standard implementation yet, but it establishes the business case for why PayPal needs data mesh principles in its data strategy. It is an exciting space to watch at PayPal.
https://medium.com/paypal-tech/the-next-generation-of-data-platforms-is-the-data-mesh-b7df4b825522
The accompanying talk makes a much more systematic case for data mesh.
InfoQ: AI, ML, and Data Engineering InfoQ Trends Report—August 2022
InfoQ released its 2022 AI, ML & Data Engineering Trends report. Resource managers like YARN and stream processing have now moved to the late-adopter stage, and there are exciting new entrants such as knowledge graphs, AI pair programmers (like GitHub Copilot), and synthetic data generation.
https://www.infoq.com/articles/ai-ml-data-engineering-trends-2022/
DoorDash: Building Scalable Real Time Event Processing with Kafka and Flink
DoorDash writes about its real-time data infrastructure on top of Kafka, Flink & Pinot. The blog narrates how a producer proxy in front of Kafka helped scale the pipelines, the Flink SQL abstraction layer, the use of the Kafka schema registry, and the data warehouse integration.
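For intuition, here is a minimal Python sketch of the producer-proxy idea: a small HTTP service owns the Kafka producer and centralizes batching, retries, and serialization so application teams don't have to tune clients themselves. This is purely illustrative (the endpoint, topic, and config are my own assumptions), not DoorDash's implementation.

```python
# A minimal sketch of the "producer proxy" idea: applications POST JSON events
# to a small HTTP service, which owns the Kafka producer and handles batching
# and retries centrally. Broker, topic, and port are illustrative assumptions.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",  # assumption: local broker
    "enable.idempotence": True,             # safe retries
    "linger.ms": 50,                        # batch small events together
})


class EventProxyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        event = json.loads(body)
        # The proxy, not the caller, decides on keys, serialization, retries, etc.
        producer.produce(
            topic="app-events",                      # hypothetical topic
            key=str(event.get("user_id", "")),
            value=json.dumps(event).encode("utf-8"),
        )
        producer.poll(0)  # serve delivery callbacks without blocking
        self.send_response(202)
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), EventProxyHandler).serve_forever()
```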
Sponsored: Firebolt is a proud sponsor of Data Engineering Weekly.
Firebolt is the cloud data warehouse for builders of next-gen analytics experiences.
Combining the benefits and ease of use of modern architecture with sub-second performance at terabyte scale, Firebolt helps data engineering and dev teams deliver data applications that end users love.
Airbnb: Airbnb’s Approach to Access Management at Scale
Airbnb writes about the design of its Access Management system. The blog narrates various focus points of the design, system guarantees, and the impact of the system after the rollout.
https://medium.com/airbnb-engineering/airbnbs-approach-to-access-management-at-scale-cfa66c32f03c
Etsy: Faster ML Experimentation at Etsy with Interleaving
Online experimentation plays a central role in product development. Etsy writes about how it uses interleaving experiments to capture user preference at the individual level, rather than comparing the average behavior of two groups that see distinct experiences, as in a typical A/B test.
https://www.etsy.com/codeascraft/faster-ml-experimentation-at-etsy-with-interleaving
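For readers new to the technique, here is a rough Python sketch of team-draft interleaving, the general family Etsy's approach belongs to; it is illustrative only and not Etsy's code.

```python
# A minimal sketch of team-draft interleaving: merge two rankings so each
# ranker contributes results in alternating "drafts", then credit clicks to
# whichever ranker supplied the clicked item. Illustrative only.
import random

random.seed(7)  # for a reproducible example


def team_draft_interleave(ranking_a, ranking_b):
    """Return the interleaved list and an item -> contributing team map."""
    interleaved, team_of, used = [], {}, set()
    it_a, it_b = iter(ranking_a), iter(ranking_b)

    def next_unused(it):
        for item in it:
            if item not in used:
                return item
        return None

    while True:
        # Randomize which team picks first in each round to avoid position bias.
        order = [("A", it_a), ("B", it_b)]
        random.shuffle(order)
        picked_any = False
        for team, it in order:
            item = next_unused(it)
            if item is not None:
                interleaved.append(item)
                team_of[item] = team
                used.add(item)
                picked_any = True
        if not picked_any:
            return interleaved, team_of


# Example: more clicks credited to A's picks means the user preferred ranker A.
merged, team_of = team_draft_interleave(["x", "y", "z"], ["y", "w", "z"])
clicks = ["y", "w"]
score = {"A": 0, "B": 0}
for item in clicks:
    score[team_of[item]] += 1
print(merged, score)
```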
Sponsored: Why data engineering needs a DSL to check data as-code
Data reliability needs its own language - a language that is specific enough to address the problems data engineers and data teams face when they find themselves firefighting data issues as reports, dashboards, or machine learning models break, yet accessible enough for non-engineers to use. (Im)Possible? Check out Soda Checks Language.
https://www.soda.io/resources/introducing-a-new-domain-specific-language-for-data-reliability
Astrafy: dbt at scale on Google Cloud
Astrafy writes a three-part series on building dbt pipelines on Google Cloud. The series covers dbt in a Google Cloud architecture, dbt versioning, data quality, orchestration with Cloud Composer, and DataOps; a minimal Composer orchestration sketch follows the part links below.
Part 1: https://www.astrafy.io/articles/dbt-at-scale-on-google-cloud-part-1
Part 2: https://www.astrafy.io/articles/dbt-at-scale-on-google-cloud-part-2
Part 3: https://www.astrafy.io/articles/dbt-at-scale-on-google-cloud-part-3
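For the orchestration piece, here is a minimal sketch of running dbt from Cloud Composer (managed Airflow) via a BashOperator. The project path, schedule, and task layout are my own assumptions, not Astrafy's setup.

```python
# A minimal sketch of orchestrating dbt from Cloud Composer (Airflow).
# Paths, project name, and schedule are illustrative assumptions only.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_PROJECT_DIR = "/home/airflow/gcs/data/dbt_project"  # hypothetical location

with DAG(
    dag_id="dbt_daily_run",
    start_date=datetime(2022, 8, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"dbt run --project-dir {DBT_PROJECT_DIR} --profiles-dir {DBT_PROJECT_DIR}",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"dbt test --project-dir {DBT_PROJECT_DIR} --profiles-dir {DBT_PROJECT_DIR}",
    )
    dbt_run >> dbt_test  # only test models after a successful build
```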
Dominik Golebiewski: Snowflake query optimiser: unoptimised
dbt established the case that CTEs (Common Table Expressions) are passthroughs and that their performance impact is negligible because modern data warehouse optimizers recognize the pattern. The blog shows that this is not always the case with Snowflake by comparing an import CTE against referencing the base table directly in the downstream CTE. The change reduced one table's build time from over 30 minutes to less than 10 minutes, roughly a $600 saving for that table. A simplified sketch of the two query shapes follows the references below.
https://medium.com/@AtheonAnalytics/snowflake-query-optimiser-unoptimised-cf0223bdd136
References from the article:
https://discourse.getdbt.com/t/ctes-are-passthroughs-some-research/155
https://discourse.getdbt.com/t/why-the-fishtown-sql-style-guide-uses-so-many-ctes/1091
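To make the comparison concrete, here is a rough, simplified sketch of the two query shapes, timed with the Snowflake Python connector. The tables, columns, and connection details are made up, and the article's actual dbt models are more involved than this.

```python
# A rough sketch of timing an "import CTE" vs. a direct base-table reference
# on Snowflake. Connection parameters, table, and column names are made up.
import time

import snowflake.connector

IMPORT_CTE_SQL = """
with orders as (             -- dbt-style import CTE: a plain alias of the base table
    select * from analytics.raw.orders
)
select customer_id, sum(amount) as total
from orders
where order_date >= '2022-01-01'
group by customer_id
"""

DIRECT_SQL = """
select customer_id, sum(amount) as total
from analytics.raw.orders    -- reference the base table directly
where order_date >= '2022-01-01'
group by customer_id
"""

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...", warehouse="transforming"
)

for label, sql in [("import CTE", IMPORT_CTE_SQL), ("direct reference", DIRECT_SQL)]:
    cur = conn.cursor()
    start = time.time()
    cur.execute(sql)
    cur.fetchall()
    print(f"{label}: {time.time() - start:.1f}s")
```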
Sponsored: How Allbirds solves identity resolution in the warehouse with dbt Labs, Snowflake, and RudderStack
Join this live webinar on August 17th for a deep dive on identity resolution in the warehouse with Allbirds Staff Data Engineer, Chandra Gangireddy. Chandra will detail the architecture and end-to-end data flow they use to break down data silos, build customer profiles, and operationalize customer data throughout the company.
HubSpot: Building a Fast, Thread-safe Hotspot Tracking Library
LMAX Disruptor is one of the best Java libraries for building a bounded queue with lock-free enqueues. HubSpot writes about how the LMAX Disruptor helped it build a fast, thread-safe hotspot tracking library.
https://product.hubspot.com/blog/hotspot-tracking-library
LMAX Disruptor Paper: LMAX Disruptor: High performance alternative to bounded queues for exchanging data between concurrent threads
Games24x7: ksqlDB in Data Engineering at Games24x7
Games24x7 shares its experience running ksqlDB and tuning it for optimal performance. The walkthrough of state-store tuning, horizontal vs. vertical scaling, and GC optimization reveals a lot about the internal workings of ksqlDB.
https://medium.com/@anupsdtiwari/ksqldb-in-data-engineering-at-games24x7-9c66b7cf5aa0
Yelp: Spark Data Lineage
Yelp writes about its attempt to capture the Spark lineage from metadata description files.
I'm curious whether there is any automatic way to capture Spark lineage beyond manual input, perhaps by parsing the Spark logical plan. Please comment if you know any library that parses Spark's logical plan for lineage.
https://engineeringblog.yelp.com/2022/08/spark-data-lineage.html
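For what it's worth, recent PySpark versions expose a couple of starting points short of full plan parsing: DataFrame.inputFiles() lists the files backing file-based sources, and explain() dumps the parsed/analyzed/optimized plans as text that a lineage tool could scrape. A rough sketch with made-up paths and column names:

```python
# A rough sketch of pulling coarse lineage hints out of PySpark itself.
# The input path and column names are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-sketch").getOrCreate()

orders = spark.read.parquet("s3://my-bucket/raw/orders/")  # hypothetical source
daily = (
    orders.groupBy("order_date")
    .sum("amount")
    .withColumnRenamed("sum(amount)", "total_amount")
)

# 1) File-level lineage for file-based sources: which files feed this DataFrame?
print(daily.inputFiles())

# 2) The logical/physical plans as text -- a crude starting point for scraping
#    relation names; dedicated tools parse the plan tree instead.
daily.explain(mode="extended")
```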
Cloudera: Speeding up Queries With Z-Order
Z-Order indexing is a popular index optimization implemented in many data management systems. Cloudera writes about the Z-Order implementation in Impala and how it improves query efficiency; a tiny bit-interleaving sketch follows the references below.
https://blog.cloudera.com/speeding-up-queries-with-z-order/
References:
Z-order indexing for multifaceted queries in Amazon DynamoDB
Performance Tuning Apache Spark with Z-Ordering and Data Skipping in Azure Databricks
Hudi Z-Order and Hilbert Space Filling Curves
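The core trick behind Z-ordering is mapping several columns onto a single sort key by interleaving their bits, so rows that are close in multiple dimensions land close together on disk and min/max statistics can skip more data. A tiny Python illustration (not Impala's implementation):

```python
# A tiny illustration of the bit-interleaving behind Z-order (Morton) keys.
# Real engines (Impala, Delta, Hudi) work on normalized column encodings;
# this simply interleaves the low bits of two small unsigned integers.
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y: x -> even positions, y -> odd."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # bit i of x -> position 2i
        key |= ((y >> i) & 1) << (2 * i + 1)    # bit i of y -> position 2i+1
    return key


# Rows sorted by this key cluster points that are near each other in (x, y),
# which is what lets multi-column predicates skip row groups.
points = [(3, 7), (4, 2), (3, 6), (12, 1)]
for x, y in sorted(points, key=lambda p: z_order_key(*p)):
    print(x, y, bin(z_order_key(x, y)))
```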
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
>Please comment if you know any library that parses Spark's logical plan for lineage.
Spark can send lineage to Apache Atlas via Kafka using the spark-atlas-connector: https://github.com/hortonworks-spark/spark-atlas-connector