Data Engineering Weekly #95

The Weekly Data Engineering Newsletter

Jul 31, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

Uber: ML Education at Uber: Frameworks Inspired by Engineering Principles

Uber wins by scaling machine learning. We recognize org-wide that a powerful way to scale machine learning adoption is by educating. That’s why we created the Machine Learning Education Program.

Educating internal customers is a vital part of running a platform. An engineering org is a mesh of various skills, and not everyone reads Data Engineering Weekly or goes to all the cool Data & ML conferences. Uber writes an excellent article on its ML education program and how engineering principles shaped its approach.

https://eng.uber.com/ml-education-at-uber/

Amplify Partners: Building Modern Data Teams

The Amplify team published an excellent collection of articles to help build modern data teams. The collection covers building a data org, data strategy, the focus of a data team, and establishing a career in data, among other topics.

https://amplifypartners.com/moderndatateamshub/

Stephen Bailey: Airflow's Problem

Oh wow! Shots fired. Possibly one of the best articles clearly articulating the problems with Airflow.

So what is my problem with Airflow? My problem is that Airflow was not designed to address these problems — it lacks the ambition we need, even while occupying the critical pedestal as the foundational execution engine.
In fact, Airflow is already displaced. Airflow qua Airflow is already obsolete, and it happened right within the Airflow ecosystem. It’s called Astronomer.

I had a similar experience with the Astronomer team when I explained most of the problems the author mentions and sketched why dbt is Airflow's missed opportunity. I wrote the gist of it here:

https://www.dataengineeringweekly.com/p/bundling-vs-unbundling-the-tale-of.

The response I got: our customers struggle to spin up and maintain Airflow, and we are going after the cloud infrastructure. From a business perspective, I understand and agree with the strategy. However, I can't stop thinking: is commercialization killing the mission of an open source system?

https://stkbailey.substack.com/p/airflows-problem

Benn Stancil: The powder keg of the modern data stack

Another classic narration from Benn about the current state of the data landscape and the promised land of what it could be.

For the last few years, most data startups have followed Peter Thiel’s advice: Avoid competition. This positioning, however, can’t last forever. Eventually, as startups grow, companies that see one another as polite partners will start to jockey for the same space.

I can’t second this thought enough.

Ananth Packkildurai @ananthdurai

🎯 "Having this many tools without a coherent, centralized control plane is lunacy and a terrible endstate for data practitioners and their stakeholders."🎯 I hear a comparison of modern data stack(MDS) with Unix philosophy!!! Who is the Unix terminal for MDS is a Billion $??

Nick Schrock @schrockn

4/ I don’t think anyone believes that this is an ideal end state. The post itself advocates for consolidation. Having this many tools without a coherent, centralized control plane is lunacy, and a terrible endstate for data practitioners and their stakeholders. We see two trends:

https://benn.substack.com/p/powder-keg

George Kozlov: Databricks usage and cost analysis

Cost optimization is a critical engineering function in system design. The author explains strategies and data sources for optimizing Databricks cloud costs.

https://medium.com/@progeorgek/databricks-usage-and-cost-analysis-e974e380916a

99.co: The Evolution of Transformation Layer Architecture in 99 Group (DBT, Airflow, and Kubernetes)

99.co writes an excellent overview of the infrastructure setup of Airflow, dbt & Kubernetes with code examples.

https://medium.com/99dotco/the-evolution-of-transformation-layer-architecture-in-99-group-dbt-airflow-and-kubernetes-cb46900f3662

Confluent: 4 Must-Have Tests for Your Apache Kafka CI/CD with GitHub Actions

Confluent writes an excellent article about the approaches one can take to apply unit & integration testing for Apache Kafka pipelines.

https://www.confluent.io/blog/apache-kafka-ci-cd-with-github/

Niels Claeys: Make Spark resilient against spot interruptions on Kubernetes

Running Apache Spark workloads on spot instances can reduce infrastructure costs by up to 60-90% compared to on-demand instances. The author shares tips and design strategies for making Spark applications spot-instance ready.

https://blog.dataminded.com/make-spark-resilient-against-spot-interruptions-on-kubernetes-a2d6403399b0

Alexey Grishchenko: Lakehouse

The CIDR 2021 paper "Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics" is an informative read on modern Lakehouse design. The author writes an excellent summary of the paper, narrating the historical developments that led up to Lakehouse systems.

https://0x0fff.com/lakehouse/

Kevin Hu: A Framework to Understand How Low-Quality Data Hurts Business Performance

Does data quality impact business operations? The author walks through the maturity stages of data adoption in an organization and a framework for measuring the impact of data quality at each stage.

https://towardsdatascience.com/a-framework-to-understand-how-low-quality-data-hurts-business-performance-386c10c4fe1e

Criteo: Reporting Data at Criteo: How to Measure at Scale

Criteo writes about its reporting infrastructure built on top of Apache Kafka, HDFS, and Vertica. (🤔 is that what folks call the pre-modern data stack?). The blog narrates Criteo's approach to data quality, data modeling, and data partitioning.

https://medium.com/criteo-engineering/reporting-data-at-criteo-how-to-measure-at-scale-b315b0d8d78a

Google AI: ML-Enhanced Code Completion Improves Developer Productivity

I’m an active user of the Tabnine code completion plugin in IntelliJ, and I’m sure many of you use GitHub Copilot. Google writes about how it combines ML and SE to develop a novel Transformer-based hybrid semantic ML code completion for internal Google developers.

https://ai.googleblog.com/2022/07/ml-enhanced-code-completion-improves.html

Harvard: CS109A - Introduction to Data Science Lectures

Harvard University's CS109A: Introduction to Data Science lecture materials, Python notebooks, and video lectures are now freely available for download.

https://harvard-iacs.github.io/2019-CS109A/pages/materials.html

All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
