Data Engineering Weekly

Data Engineering Weekly #96

The Weekly Data Engineering Newsletter

Ananth Packkildurai
Aug 14, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.


Uber: ML Education at Uber: Program Design and Outcomes

Uber publishes the second part of its series on its internal ML education program, focusing on content delivery, usage observability, and marketing & reach. The blog post is a good reminder that platform engineering is not only about building abstractions and driving standardization; it is even more about marketing and selling those abstractions to internal customers.

https://www.uber.com/blog/ml-education-at-uber-program-design-and-outcomes/

First part: https://www.uber.com/blog/ml-education-at-uber/


PayPal: The next generation of Data Platforms is the Data Mesh

PayPal writes about its strategy to adopt data mesh principles. The blog acknowledges that there is no standard implementation yet, but it establishes a business case for why PayPal needs Data Mesh principles in its data strategy. It is an exciting space to watch at PayPal.

https://medium.com/paypal-tech/the-next-generation-of-data-platforms-is-the-data-mesh-b7df4b825522

The talk gives a much more systematic case for DataMesh.


InfoQ: AI, ML, and Data Engineering InfoQ Trends Report—August 2022

InfoQ released its 2022 AI, ML & Data Engineering Trends report. Resource managers like YARN and stream processing have now moved to the late adoption stage. There are many exciting new entrants, including Knowledge Graphs, AI pair programmers (like GitHub Copilot), and Synthetic Data Generation.

https://www.infoq.com/articles/ai-ml-data-engineering-trends-2022/


DoorDash: Building Scalable Real Time Event Processing with Kafka and Flink

DoorDash writes about its real-time data infrastructure on top of Kafka, Flink & Pinot. The blog narrates how a producer proxy for Kafka helped to scale the pipeline, Flink SQL abstraction, the usage of Kafka schema registry, and data warehouse integration.

https://doordash.engineering/2022/08/02/building-scalable-real-time-event-processing-with-kafka-and-flink/


Sponsored: Firebolt is a proud sponsor of Data Engineering Weekly.

Firebolt is the cloud data warehouse for builders of next-gen analytics experiences. Combining the benefits and ease of use of modern architecture with sub-second performance at terabyte scale, Firebolt helps data engineering and dev teams deliver data applications that end-users love.

https://www.firebolt.io/


Airbnb: Airbnb’s Approach to Access Management at Scale

Airbnb writes about the design of its Access Management system. The blog narrates various focus points of the design, system guarantees, and the impact of the system after the rollout.

https://medium.com/airbnb-engineering/airbnbs-approach-to-access-management-at-scale-cfa66c32f03c


Etsy: Faster ML Experimentation at Etsy with Interleaving

Online experimentation plays a central role in product development. Etsy writes about how it uses interleaving experiments to capture user preference at the individual level, rather than comparing the average behavior of two groups that see distinct experiences, as typical A/B tests do.

https://www.etsy.com/codeascraft/faster-ml-experimentation-at-etsy-with-interleaving
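
To make the idea of interleaving concrete, here is a minimal Python sketch of team-draft interleaving, one common variant of the technique. The summary above does not say which variant Etsy uses, so treat this as an illustration under that assumption; the function names and sample rankings are hypothetical.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng):
    """Merge two rankings into one list; return (merged, {item: 'A' or 'B'})."""
    merged, team = [], {}
    count = {"A": 0, "B": 0}
    rankings = {"A": ranking_a, "B": ranking_b}
    total = len(set(ranking_a) | set(ranking_b))
    while len(merged) < total:
        # The ranker with fewer picks goes next; ties are broken by a coin flip.
        if count["A"] != count["B"]:
            picker = "A" if count["A"] < count["B"] else "B"
        else:
            picker = rng.choice(["A", "B"])
        # Each ranker contributes its highest-ranked item not yet shown.
        candidate = next((i for i in rankings[picker] if i not in team), None)
        if candidate is None:
            # This ranker has nothing left, so the other one picks instead.
            picker = "B" if picker == "A" else "A"
            candidate = next(i for i in rankings[picker] if i not in team)
        merged.append(candidate)
        team[candidate] = picker
        count[picker] += 1
    return merged, team


def credit_clicks(team, clicked_items):
    """Count clicks per ranker; the ranker with more clicks wins this session."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in team:
            wins[team[item]] += 1
    return wins


# Hypothetical rankings from a control (A) and a candidate (B) model.
rng = random.Random(7)
merged, team = team_draft_interleave(["a", "b", "c", "d"], ["b", "e", "a", "f"], rng)
session_result = credit_clicks(team, clicked_items=[merged[0]])
```

Because every user sees items from both rankers in a single list, each session yields a direct preference signal, which is why interleaving can need less traffic than splitting users into two groups.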


Sponsored: Why data engineering needs a DSL to check data as-code

Data reliability needs its own language: one that is specific enough to address the problems of data engineers, who often find themselves firefighting data issues when reports, dashboards, or machine learning models break, and of data teams more broadly, yet accessible enough for non-engineers to use. (Im)Possible? Check out Soda Checks Language.

https://www.soda.io/resources/introducing-a-new-domain-specific-language-for-data-reliability


Astrafy: dbt at scale on Google Cloud

Astrafy writes a three-part series on building dbt pipelines on Google Cloud. The series covers dbt architecture on Google Cloud, dbt versioning, data quality, orchestration with Google Cloud Composer, and data ops.

Part 1: https://www.astrafy.io/articles/dbt-at-scale-on-google-cloud-part-1

Part 2: https://www.astrafy.io/articles/dbt-at-scale-on-google-cloud-part-2

Part 3: https://www.astrafy.io/articles/dbt-at-scale-on-google-cloud-part-3


Dominik Golebiewski: Snowflake query optimiser: unoptimised

dbt made the case that CTEs (Common Table Expressions) are passthroughs and that their performance impact is negligible because modern data warehouse optimizers recognize the pattern. The blog narrates how that is not the case with Snowflake by comparing an import CTE with referencing the base table directly in the CTE. The change reduced build time from over 30 minutes to less than 10 minutes, a saving of roughly $600 on one table.

https://medium.com/@AtheonAnalytics/snowflake-query-optimiser-unoptimised-cf0223bdd136

References in the article:

https://discourse.getdbt.com/t/ctes-are-passthroughs-some-research/155

https://discourse.getdbt.com/t/why-the-fishtown-sql-style-guide-uses-so-many-ctes/1091
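
For readers unfamiliar with the two query shapes being compared, here is a minimal sketch using Python's built-in sqlite3 module. It is not Snowflake, and the table and column names are hypothetical. It only shows that an import CTE and a CTE that reads the base table directly return the same rows; it says nothing about how any particular optimizer plans them, which is the point the article investigates.

```python
import sqlite3

# In-memory database with a small, hypothetical base table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'EU', 10.0), (2, 'US', 25.0), (3, 'EU', 5.0);
    """
)

# Shape 1: "import" CTE that selects everything, with the filter applied later.
import_cte = """
    WITH src AS (SELECT * FROM orders)
    SELECT region, SUM(amount) FROM src WHERE region = 'EU' GROUP BY region
"""

# Shape 2: the CTE references the base table directly and filters up front.
direct_cte = """
    WITH eu_orders AS (SELECT region, amount FROM orders WHERE region = 'EU')
    SELECT region, SUM(amount) FROM eu_orders GROUP BY region
"""

rows_import = conn.execute(import_cte).fetchall()
rows_direct = conn.execute(direct_cte).fetchall()
```

The two queries are logically equivalent; whether they cost the same depends on whether the warehouse pushes the filter and column pruning through the import CTE.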


Sponsored: How Allbirds solves identity resolution in the warehouse with dbt Labs, Snowflake, and RudderStack

Join this live webinar on August 17th for a deep dive on identity resolution in the warehouse with Allbirds Staff Data Engineer, Chandra Gangireddy. Chandra will detail the architecture and end-to-end data flow they use to break down data silos, build customer profiles, and operationalize customer data throughout the company.

https://www.rudderstack.com/events/how-allbirds-solves-identity-resolution-in-the-warehouse-with-dbt-labs-snowflake-and-rudderstack


HubSpot: Building a Fast, Thread-safe Hotspot Tracking Library

LMAX Disruptor is one of the best libraries in Java for building a bounded queue with lock-free enqueues. HubSpot writes about how LMAX Disruptor helped it build a fast, thread-safe tracking library.

https://product.hubspot.com/blog/hotspot-tracking-library

LMAX Disruptor Paper: LMAX Disruptor: High performance alternative to bounded queues for exchanging data between concurrent threads


Games24x7: ksqlDB in Data Engineering at Games24x7

Games24x7 shares its experience running ksqlDB and tuning for optimal performance. Tuning state storage, horizontal vs. vertical scaling and GC optimization reveal the internal functioning of ksqlDB.

https://medium.com/@anupsdtiwari/ksqldb-in-data-engineering-at-games24x7-9c66b7cf5aa0


Yelp: Spark Data Lineage

Yelp writes about its attempt to capture the Spark lineage from metadata description files.

I'm curious whether there is an automatic way to capture Spark lineage rather than relying on manual input, perhaps by parsing Spark's logical plan. Please comment if you know of any library that parses Spark's logical plan for lineage.

https://engineeringblog.yelp.com/2022/08/spark-data-lineage.html


Cloudera: Speeding up Queries With Z-Order

Z-Order indexing is a popular index optimization implemented in many data management systems. Cloudera writes about Z-Order implementation in Impala and how it improves query efficiency.

https://blog.cloudera.com/speeding-up-queries-with-z-order/

References:

Z-order indexing for multifaceted queries in Amazon DynamoDB

Performance Tuning Apache Spark with Z-Ordering and Data Skipping in Azure Databricks

Hudi Z-Order and Hilbert Space Filling Curves
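
As a quick illustration of the idea behind Z-ordering, here is a minimal Python sketch that computes a Z-order (Morton) key by interleaving the bits of two integer columns. It is a generic textbook formulation, not Impala's implementation, and the function name and sample grid are hypothetical.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return the Z-order (Morton) code of two non-negative integers."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # bits of x land on even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # bits of y land on odd positions
    return code


# Sorting a small 4x4 grid by the Z-order key keeps neighbors in both
# dimensions close together in the sorted order.
points = [(x, y) for x in range(4) for y in range(4)]
z_sorted = sorted(points, key=lambda p: interleave_bits(*p))
```

Because rows that are close in either column end up near each other on disk, per-file or per-block min/max statistics stay tight for both columns, which lets a query engine skip more data when filtering on either one.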


All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.

2 Comments
Shinji
Aug 15, 2022·edited Aug 15, 2022

>Please comment if you know any library that parses Spark's logical plan for lineage.

Spark can send lineage to Apache Atlas via Kafka using this connector: https://github.com/hortonworks-spark/spark-atlas-connector


© 2023 Ananth Packkildurai