Data Engineering Weekly #97
The Weekly Data Engineering Newsletter
Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
OneHouse: Apache Hudi vs. Delta Lake vs. Apache Iceberg - Lakehouse Feature Comparison
Performance benchmarks often take center stage in lakehouse and data warehouse comparisons, while feature richness and developer friendliness go unspoken and are left for individuals to compare. OneHouse writes an excellent feature-by-feature comparison of open-source Apache Hudi, Delta Lake, and Apache Iceberg.
Dagster: Roman roads in data engineering - stop starting from scratch whenever you write a data pipeline
Reusability in data pipelines eliminates duplication and improves the consistency of data assets. Dagster writes about how it enables reusability of data assets and operations. It would be great to see how these reusable components translate into target runtimes like Snowflake.
InfoQ: Debezium and Quarkus - Change Data Capture Patterns to Avoid Dual-Writes Problems
Event-driven architectures often encounter the dual-write problem, where the producer mutates state in a relational database and also sends an event notification about the change. This introduces two-phase-commit complexity to keep the system in a consistent state. The author writes about how the Outbox pattern with Debezium can potentially solve this problem.
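The core of the Outbox pattern is that the state change and the event record are committed in one local transaction. A minimal sketch using SQLite (the table and column names are illustrative; in the article's setup, Debezium tails the outbox table via CDC and publishes to Kafka, rather than the polling shown here):

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("""CREATE TABLE outbox (
    id TEXT PRIMARY KEY, aggregate_id TEXT, type TEXT, payload TEXT)""")

def place_order(order_id: str) -> None:
    # One local transaction: the business row and the event row commit
    # atomically, so there is no dual-write inconsistency window.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, "PLACED"))
        conn.execute(
            "INSERT INTO outbox VALUES (?, ?, ?, ?)",
            (str(uuid.uuid4()), order_id, "OrderPlaced",
             json.dumps({"order_id": order_id, "status": "PLACED"})),
        )

place_order("o-1")
# A relay (Debezium in the article) would read these rows and publish them.
events = conn.execute("SELECT type, payload FROM outbox").fetchall()
print(events)
```

The key point is that the message broker is never written to directly by the producer; the event only ever leaves the database via the CDC relay, which can retry safely.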
Stan Lin: Apache Spark on Kubernetes-Lessons Learned from Launching Millions of Spark Executors (Databricks Data+AI Summit 2022)
An excellent summary of the Data+AI Summit 2022 talk on running Spark on Kubernetes. You can watch the talk here.
Sponsored: Firebolt is a proud sponsor of Data Engineering Weekly.
Firebolt is the cloud data warehouse for builders of next-gen analytics experiences.
Combining the benefits and ease of use of modern architecture with
sub-second performance at terabyte scale, Firebolt helps data engineering
and dev teams deliver data applications that end-users love.
Adobe: Wins from Effective Kafka Monitoring at Adobe: Stability, Performance, and Cost Savings
Adobe shares practical tips to improve the operational efficiency of its Kafka infrastructure. The blog walks through valuable metrics for measuring the dynamics of producer and consumer performance.
Stitch Fix: Validating your data with Hamilton
In late 2021, Stitch Fix open-sourced Hamilton, a general-purpose micro-framework for creating dataflows from Python functions. Stitch Fix writes an exciting blog on implementing data quality checks as part of the Hamilton workflow. TIL about the Pandera data validation library.
Sponsored: Why data engineering needs a DSL to check data as-code
Data reliability needs its own language - a language specific enough to address the problems of data engineers, who often find themselves firefighting data issues when reports, dashboards, or machine learning models break, yet accessible enough for non-engineers to use. (I’m)Possible? Check out Soda Checks Language.
Sean Gallagher: Machine learning, concluded: Did the “no-code” tools beat manual analysis?
Can low-code/no-code ML tools beat manual analysis? Can a relative novice use them effectively and accurately? Are they more cost-effective than hiring an expert data scientist? The author runs through a simulation to find out. It is an exciting experiment; read for yourself to find the conclusion.
eBay: Sherlock.io - An Upgraded Machine Learning Monitoring System
Anomaly detection is a critical function in the observability & reliability of a system. eBay writes about how it incorporates a machine learning approach on top of Prometheus to deliver an intelligent alerting & monitoring system.
Sponsored: Why Business Applications Create Data Integration Debt
This article from Ben Rogojan explores the challenges of data integration in a world where more teams need access to more data for more complex use cases, and it outlines the pitfalls of attacking data integration without a thoughtful strategy.
Stas Sajin: Why is Snowflake so expensive?
Simplicity and cost in the modern data stack are often inversely proportional to each other :-( It's common to hear about the expensive billing of cloud data warehouses. The author explains why Snowflake is expensive and what it could do better.
Endeavor: Machine Learning Platform at Endeavor
Endeavor writes about its ML platform built on Prefect, dbt & Snowflake. The overarching principles behind the platform, the dbt model organization, and the model export ops make for an exciting read.
AfroInfoTech: Time Travel versus Slowly Changing Dimension Type 2
Maintaining historical versions with slowly changing dimensions increases the complexity of a data management system. Slowly changing dimension types emerged in an era of storage scarcity. What are the advantages of slowly changing dimension techniques vs. time travel? Find out more in the blog.
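To make the trade-off concrete, here is a plain-Python sketch of the SCD Type 2 bookkeeping a pipeline has to do by hand (the customer dimension and its columns are invented for illustration): every change closes the current row and appends a new one. Table-format time travel gives you the same history implicitly via snapshots, with none of this upkeep.

```python
from datetime import date

def scd2_update(rows, key, new_attrs, as_of):
    """Apply an SCD Type 2 change: close the open row, append a new version."""
    for row in rows:
        if row["key"] == key and row["valid_to"] is None:
            row["valid_to"] = as_of  # close out the current version
    rows.append({"key": key, **new_attrs,
                 "valid_from": as_of, "valid_to": None})

# A one-row customer dimension; the open row has valid_to = None.
dim = [{"key": "cust-1", "city": "Austin",
        "valid_from": date(2020, 1, 1), "valid_to": None}]

scd2_update(dim, "cust-1", {"city": "Denver"}, date(2022, 7, 1))
current = [r for r in dim if r["valid_to"] is None]
print(len(dim), current[0]["city"])  # two versions kept; current city is Denver
```

Point-in-time queries then filter on `valid_from <= ts < valid_to`, which is exactly the logic a `SELECT ... AS OF` time-travel query hides from you.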
Riskified Technology: Know Your Limits - Cluster Benchmarks
Benchmarking a stateful system to find its cluster capacity is critical for capacity planning and supporting the company's future growth. As the author rightly points out, relying on vendor benchmarks is unreliable. Riskified writes about using the OpenMessaging Benchmarks (OMB) to gain benchmark insights.
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.