Data Engineering Weekly

Data Engineering Weekly #97

The Weekly Data Engineering Newsletter

Ananth Packkildurai
Aug 22, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.


OneHouse: Apache Hudi vs. Delta Lake vs. Apache Iceberg - Lakehouse Feature Comparison

Performance benchmarks often take center stage in lakehouse and data warehouse comparisons. Rich feature support and developer friendliness often go undiscussed and are left to individuals to compare. OneHouse writes an excellent feature-by-feature comparison of open-source Apache Hudi with Delta Lake and Apache Iceberg.

https://www.onehouse.ai/blog/apache-hudi-vs-delta-lake-vs-apache-iceberg-lakehouse-feature-comparison


Dagster: Roman roads in data engineering - stop starting from scratch whenever you write a data pipeline

Reusability in the data pipeline eliminates duplication and improves the consistency of data assets. Dagster writes about how it enables the reuse of data assets and operations. It would be great to see a follow-up on how these reusable components translate into target runtimes such as Snowflake.

https://dagster.io/blog/roman-roads-assets-ops


InfoQ: Debezium and Quarkus - Change Data Capture Patterns to Avoid Dual-Writes Problems

Event-driven architectures often encounter the dual-write problem, where the producer mutates state in a relational database and also sends an event notification about the change. Keeping the two consistent brings the typical complexity of a two-phase commit. The author writes about how the Outbox pattern with Debezium can potentially solve this problem.
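
To make the idea concrete, here is a minimal sketch of the Outbox pattern, not the article's implementation. It assumes SQLite as a stand-in for the relational database, and the `orders` and `outbox` tables, the `place_order` and `drain_outbox` functions, and the `OrderPlaced` event name are all hypothetical. Debezium would capture new outbox rows from the database's change log; the polling relay below only stands in for that step.

```python
import json
import sqlite3


def create_schema(conn):
    # Hypothetical tables: one for business state, one for pending events.
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "event_type TEXT NOT NULL, payload TEXT NOT NULL)"
    )


def place_order(conn, order_id, item):
    # A single local transaction covers both the state change and the event
    # record, so either both are committed or neither is.
    with conn:
        conn.execute("INSERT INTO orders (id, item) VALUES (?, ?)", (order_id, item))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderPlaced", json.dumps({"order_id": order_id, "item": item})),
        )


def drain_outbox(conn, publish):
    # Stand-in for the CDC step: read pending events in insertion order,
    # publish each one, then delete it. A crash between publish and delete
    # would re-send the event, so consumers should tolerate duplicates.
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))
        with conn:
            conn.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
    return len(rows)
```

The producer never talks to the message broker directly, which is what removes the second write from the request path.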

https://www.infoq.com/articles/change-data-capture-debezium/


Stan Lin: Apache Spark on Kubernetes - Lessons Learned from Launching Millions of Spark Executors (Databricks Data+AI Summit 2022)

An excellent summary of the Data+AI Summit 2022 talk on running Spark on Kubernetes. A recording of the talk is also available to watch.

https://medium.com/@coderstan/apache-spark-on-kubernetes-lessons-learned-from-launching-millions-of-spark-executors-databricks-9187890f0dc3


Sponsored: Firebolt is a proud sponsor of Data Engineering Weekly.

Firebolt is the cloud data warehouse for builders of next-gen analytics experiences.
Combining the benefits and ease of use of modern architecture with
sub-second performance at terabyte scale, Firebolt helps data engineering
and dev teams deliver data applications that end-users love.

https://www.firebolt.io/


Adobe: Wins from Effective Kafka Monitoring at Adobe: Stability, Performance, and Cost Savings

Adobe shares some practical tips to improve the operational efficiency of its Kafka infrastructure. The blog describes some valuable metrics for measuring the dynamics of producer and consumer performance.

https://medium.com/adobetech/wins-from-effective-kafka-monitoring-at-adobe-stability-performance-and-cost-savings-a3ecb701ee5b


Stitch Fix: Validating your data with Hamilton

In late 2021, Stitch Fix open-sourced Hamilton, a general-purpose micro-framework for creating dataflows from Python functions. Stitch Fix writes an exciting blog on implementing data quality checks as part of the Hamilton workflow. TIL about the Pandera data validation library.

https://multithreaded.stitchfix.com/blog/2022/07/26/hamilton-data-quality/


Sponsored: Why data engineering needs a DSL to check data as-code

Data reliability needs its own language: a language specific enough to address the problems of data engineers, who often find themselves firefighting data issues when reports, dashboards, or machine learning models break, and the problems that data teams face, yet accessible enough for non-engineers to use. (I’m)Possible? Check out Soda Checks Language.

https://www.soda.io/resources/introducing-a-new-domain-specific-language-for-data-reliability


Sean Gallagher: Machine learning, concluded: Did the “no-code” tools beat manual analysis?

Can low-code/no-code ML tools beat manual analysis? Can a relative novice use these tools effectively and accurately? Are these tools more cost-effective than hiring an expert data scientist? The author runs through a simulation to find out. It is an exciting read; see for yourself to find the conclusion.

https://arstechnica.com/information-technology/2022/08/no-code-wrapped-our-ml-experiment-concludes-but-did-the-machine-win/


eBay: Sherlock.io - An Upgraded Machine Learning Monitoring System

Anomaly detection is a critical function in the observability & reliability of the system. eBay writes about how it incorporates a machine learning approach on Prometheus to deliver an intelligent alerting & monitoring system.

https://tech.ebayinc.com/engineering/sherlock.io-an-upgraded-machine-learning-monitoring-system/


Sponsored: Why Business Applications Create Data Integration Debt

This article from Ben Rogojan explores the challenges of data integration in a world where more teams need access to more data for more complex use cases, and it outlines the pitfalls of attacking data integration without a thoughtful strategy.

https://www.rudderstack.com/blog/why-business-applications-create-data-integration-debt


Stas Sajin: Why is Snowflake so expensive?

Simplicity and cost in the modern data stack are often inversely proportional to each other :-( It's common to hear about the expensive bills from cloud data warehouses. The author explains why Snowflake is expensive and what it can do better.

https://blog.devgenius.io/why-is-snowflake-so-expensive-92b67203945


Endeavor: Machine Learning Platform at Endeavor

Endeavor writes about its ML platform built on Prefect, dbt, and Snowflake. The overarching principles for building the platform, the dbt model organization, and the model export ops make for an exciting read.

https://medium.com/@endeavordata/machine-learning-platform-at-endeavor-93ba88b66986


AfroInfoTech: Time Travel versus Slowly Changing Dimension Type 2

Keeping historical versions, or slowly changing dimensions, increases the complexity of a data management system. The slowly changing dimension types came from an era of storage scarcity. What is the advantage of slowly changing dimension techniques vs. time travel? Find out more in the blog.
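
For readers new to the technique, here is a minimal in-memory sketch of what a Type 2 slowly changing dimension tracks. It is not taken from the blog; the `DimRow` fields, the `apply_change` and `as_of` helpers, and the sample values are hypothetical. Each change closes the current row and appends a new one, and a point-in-time lookup filters on the validity window. Time travel typically answers the same question by reading an older snapshot of the table instead of modeling the validity columns by hand.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DimRow:
    customer_id: int
    city: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_change(history: List[DimRow], customer_id: int, city: str, effective: date) -> None:
    """Close the current version (if any) and append the new version."""
    for row in history:
        if row.customer_id == customer_id and row.valid_to is None:
            if row.city == city:
                return  # attribute unchanged, so no new version is needed
            row.valid_to = effective
    history.append(DimRow(customer_id, city, effective))


def as_of(history: List[DimRow], customer_id: int, when: date) -> Optional[DimRow]:
    """Return the version valid on `when` (valid_from inclusive, valid_to exclusive)."""
    for row in history:
        if (
            row.customer_id == customer_id
            and row.valid_from <= when
            and (row.valid_to is None or when < row.valid_to)
        ):
            return row
    return None
```

Calling `apply_change` twice for the same customer with different cities leaves two rows in `history`, and `as_of` picks whichever one covers the requested date.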

https://afroinfotech.medium.com/time-travel-versus-slowly-changing-dimension-type-2-a37ac1538e0d


Riskified Technology: Know Your Limits - Cluster Benchmarks

Benchmarking a stateful system to find the cluster capacity is critical for capacity planning and for supporting the company's future growth. As the author rightly points out, vendor benchmarks alone are not reliable. Riskified writes about using the OpenMessaging Benchmark (OMB) to gain benchmark insights.

https://medium.com/riskified-technology/know-your-limits-cluster-benchmarks-ecc6c3c77574


All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.

