Data Engineering Weekly #97
The Weekly Data Engineering Newsletter
Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
OneHouse: Apache Hudi vs. Delta Lake vs. Apache Iceberg - Lakehouse Feature Comparison
Performance benchmarks often take center stage in lakehouse and data warehouse comparisons, while feature richness and developer friendliness go unspoken and are left for individuals to compare. OneHouse writes an excellent feature-by-feature comparison of open-source Apache Hudi, Delta Lake, and Apache Iceberg.
Dagster: Roman roads in data engineering - stop starting from scratch whenever you write a data pipeline
Reusability in data pipelines eliminates duplication and improves the consistency of data assets. Dagster writes about how it enables reusability of data assets and operations. It would be great to see how these reusable components translate into target runtimes like Snowflake.
InfoQ: Debezium and Quarkus - Change Data Capture Patterns to Avoid Dual-Writes Problems
Event-driven architectures often encounter the dual-write problem, where the producer mutates state in a relational database and also sends an event notification about the change. This introduces two-phase-commit complexity to keep the system in a consistent state. The author writes about how the Outbox pattern with Debezium can potentially solve this problem.
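The core of the Outbox pattern is that the state change and the event record are committed in one local transaction. A minimal sketch using SQLite (the table and column names are illustrative; in the article's setup, Debezium tails the outbox table via CDC and publishes to Kafka, rather than the polling shown here):

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("""CREATE TABLE outbox (
    id TEXT PRIMARY KEY, aggregate_id TEXT, type TEXT, payload TEXT)""")

def place_order(order_id: str) -> None:
    # One local transaction: the business row and the event row commit
    # atomically, so there is no dual-write inconsistency window.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, "PLACED"))
        conn.execute(
            "INSERT INTO outbox VALUES (?, ?, ?, ?)",
            (str(uuid.uuid4()), order_id, "OrderPlaced",
             json.dumps({"order_id": order_id, "status": "PLACED"})),
        )

place_order("o-1")
# A relay (Debezium in the article) would read these rows and publish them.
events = conn.execute("SELECT type, payload FROM outbox").fetchall()
print(events)
```

The key point is that the message broker is never written to directly by the producer; the event only ever leaves the database via the CDC relay, which can retry safely.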
Stan Lin: Apache Spark on Kubernetes-Lessons Learned from Launching Millions of Spark Executors (Databricks Data+AI Summit 2022)
An excellent summary of the Data+AI Summit 2022 talk on running Spark on Kubernetes. You can watch the talk here.
Sponsored: Firebolt is a proud sponsor of Data Engineering Weekly.
Firebolt is the cloud data warehouse for builders of next-gen analytics experiences.
Combining the benefits and ease of use of modern architecture with
sub-second performance at terabyte scale, Firebolt helps data engineering
and dev teams deliver data applications that end-users love.
Adobe: Wins from Effective Kafka Monitoring at Adobe: Stability, Performance, and Cost Savings
Adobe shares practical tips to improve the operational efficiency of its Kafka infrastructure. The blog walks through valuable metrics for measuring the dynamics of producer and consumer performance.
Stitch Fix: Validating your data with Hamilton
In late 2021, Stitch Fix open-sourced Hamilton, a general-purpose micro-framework for creating dataflows from Python functions. Stitch Fix writes an exciting blog on implementing data quality checks as part of the Hamilton workflow. TIL about the Pandera data validation library.
Sponsored: Why data engineering needs a DSL to check data as-code
Data reliability needs its own language - a language specific enough to address the problems of data engineers, who often find themselves firefighting data issues when reports, dashboards, or machine learning models break, yet accessible enough for non-engineers to use. (I’m)Possible? Check out Soda Checks Language.
Sean Gallagher: Machine learning, concluded: Did the “no-code” tools beat manual analysis?
Can low-code/no-code ML tools beat manual analysis? Can a relative novice use them effectively and accurately? Are they more cost-effective than hiring an expert data scientist? The author runs through a simulation to find out. It is an exciting experiment; read for yourself to find the conclusion.
eBay: Sherlock.io - An Upgraded Machine Learning Monitoring System
Anomaly detection is a critical function in the observability & reliability of a system. eBay writes about how it incorporates a machine learning approach on top of Prometheus to deliver an intelligent alerting & monitoring system.
Sponsored: Why Business Applications Create Data Integration Debt
This article from Ben Rogojan explores the challenges of data integration in a world where more teams need access to more data for more complex use cases, and it outlines the pitfalls of attacking data integration without a thoughtful strategy.
Stas Sajin: Why is Snowflake so expensive?
Simplicity and cost in the modern data stack are often inversely proportional to each other :-( It's common to hear about the expensive billing of cloud data warehouses. The author explains why Snowflake is expensive and what it could do better.
Endeavor: Machine Learning Platform at Endeavor
Endeavor writes about its ML platform built on Prefect, dbt & Snowflake. The overarching principles behind the platform, the dbt model organization, and the model export ops make for an exciting read.
AfroInfoTech: Time Travel versus Slowly Changing Dimension Type 2
Maintaining historical versions with slowly changing dimensions increases the complexity of a data management system. Slowly changing dimension types emerged in an era of storage scarcity. What are the advantages of slowly changing dimension techniques vs. time travel? Find out more in the blog.
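To make the trade-off concrete, here is a plain-Python sketch of the SCD Type 2 bookkeeping a pipeline has to do by hand (the customer dimension and its columns are invented for illustration): every change closes the current row and appends a new one. Table-format time travel gives you the same history implicitly via snapshots, with none of this upkeep.

```python
from datetime import date

def scd2_update(rows, key, new_attrs, as_of):
    """Apply an SCD Type 2 change: close the open row, append a new version."""
    for row in rows:
        if row["key"] == key and row["valid_to"] is None:
            row["valid_to"] = as_of  # close out the current version
    rows.append({"key": key, **new_attrs,
                 "valid_from": as_of, "valid_to": None})

# A one-row customer dimension; the open row has valid_to = None.
dim = [{"key": "cust-1", "city": "Austin",
        "valid_from": date(2020, 1, 1), "valid_to": None}]

scd2_update(dim, "cust-1", {"city": "Denver"}, date(2022, 7, 1))
current = [r for r in dim if r["valid_to"] is None]
print(len(dim), current[0]["city"])  # two versions kept; current city is Denver
```

Point-in-time queries then filter on `valid_from <= ts < valid_to`, which is exactly the logic a `SELECT ... AS OF` time-travel query hides from you.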
Riskified Technology: Know Your Limits - Cluster Benchmarks
Benchmarking a stateful system to find its cluster capacity is critical for capacity planning and supporting the company's future growth. As the author rightly points out, relying on vendor benchmarks is unreliable. Riskified writes about using the OpenMessaging Benchmarks (OMB) to gain benchmark insights.
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.