Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Uber: ML Education at Uber: Frameworks Inspired by Engineering Principles
Uber wins by scaling machine learning. We recognize org-wide that a powerful way to scale machine learning adoption is by educating. That’s why we created the Machine Learning Education Program.
Educating internal customers is a vital part of running a platform. An engineering org is a mesh of various skills, and not everyone reads Data Engineering Weekly or goes to all the cool Data & ML conferences. Uber writes an excellent article on its approach to ML education from an engineering-principles perspective.
https://eng.uber.com/ml-education-at-uber/
Amplify Partners: Building Modern Data Teams
The Amplify team published an excellent collection of articles to help build modern data teams. The collection covers building a data org, data strategy, the focus of a data team, and establishing a career in data, among other topics.
https://amplifypartners.com/moderndatateamshub/
Stephen Bailey: Airflow's Problem
Oh wow! Shots fired. Possibly one of the best articles clearly articulating the problems with Airflow.
So what is my problem with Airflow? My problem is that Airflow was not designed to address these problems — it lacks the ambition we need, even while occupying the critical pedestal as the foundational execution engine.
In fact, Airflow is already displaced. Airflow qua Airflow is already obsolete, and it happened right within the Airflow ecosystem. It’s called Astronomer.
I had a similar experience with the Astronomer team while explaining most of the problems the author mentions and sketching why dbt is Airflow's missed opportunity. I wrote the gist of it here:
https://www.dataengineeringweekly.com/p/bundling-vs-unbundling-the-tale-of.
The response I got: our customers struggle to spin up and maintain Airflow, so we are going after the cloud infrastructure. From a business perspective, I agree with and understand the strategy. However, I can't stop wondering: is commercialization killing the mission of an open source system?
https://stkbailey.substack.com/p/airflows-problem
Benn Stancil: The powder keg of the modern data stack
Another classic narration from Benn about the current state of the data landscape and the promised land of what it could be.
For the last few years, most data startups have followed Peter Thiel’s advice: Avoid competition. This positioning, however, can’t last forever. Eventually, as startups grow, companies that see one another as polite partners will start to jockey for the same space.
I can’t second this thought enough.
https://benn.substack.com/p/powder-keg
Sponsored: Firebolt - Firebolt is a proud sponsor of Data Engineering Weekly.
Firebolt is the cloud data warehouse for builders of next-gen analytics experiences.
Combining the benefits and ease of use of modern architecture with sub-second performance at terabyte scale, Firebolt helps data engineering and dev teams deliver data applications that end-users love.
George Kozlov: Databricks usage and cost analysis
Cost optimization is a critical engineering function in system design. The author explains strategies and data sources for optimizing Databricks cloud costs.
https://medium.com/@progeorgek/databricks-usage-and-cost-analysis-e974e380916a
99.co: The Evolution of Transformation Layer Architecture in 99 Group (DBT, Airflow, and Kubernetes)
99.co writes an excellent overview of the infrastructure setup of Airflow, dbt & Kubernetes with code examples.
Sponsored: Soda - Data engineers, your life is about to get easier.
Get hands-on with the new open-source framework to test and monitor data as code, across every data workload, from ingestion to transformation to production. Easy to set up, read, and maintain. Install and try out Soda Core to see how to stop firefighting data issues, maintain reliable pipelines, and deliver high-quality data products. Access the docs here:
https://docs.soda.io/soda-core/overview-main.html
Confluent: 4 Must-Have Tests for Your Apache Kafka CI/CD with GitHub Actions
Confluent writes an excellent article about the approaches one can take to apply unit & integration testing for Apache Kafka pipelines.
https://www.confluent.io/blog/apache-kafka-ci-cd-with-github/
Niels Claeys: Make Spark resilient against spot interruptions on Kubernetes
Running Apache Spark workloads on spot instances can reduce infrastructure costs by 60-90% compared to on-demand instances. The author shares tips and design strategies for making Spark applications ready for spot instances.
Sponsored: The Data Stack Show Live - The Future of ML
Join this live recording of The Data Stack Show with Continual CEO Tristan Zajonc and Tecton Engineering Manager Willem Pienaar to explore the democratization of ML, how it's impacting businesses today, and where things are headed next.
https://datastackshow.com/events/the-future-of-machine-learning/
Alexey Grishchenko: Lakehouse
The CIDR 2021 paper "Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics" is an informative read on modern Lakehouse design. The author writes an excellent summary of the paper, narrating the historical developments that led to Lakehouse systems.
Kevin Hu: A Framework to Understand How Low-Quality Data Hurts Business Performance
Does data quality impact business operations? The author walks through the maturity stages of data adoption in an organization and offers a framework for measuring the impact of data quality at each stage.
Criteo: Reporting Data at Criteo: How to Measure at Scale
Criteo writes about its reporting infrastructure built on top of Apache Kafka, HDFS, and Vertica. (🤔 is that what folks call the pre-modern data stack?) The blog narrates Criteo's approach to data quality, data modeling, and data partitioning.
https://medium.com/criteo-engineering/reporting-data-at-criteo-how-to-measure-at-scale-b315b0d8d78a
Google AI: ML-Enhanced Code Completion Improves Developer Productivity
I’m an active user of the Tabnine code completion plugin in IntelliJ, and I’m sure many of you use GitHub Copilot. Google writes about how it combines ML and semantic engines (SE) to develop a novel Transformer-based hybrid semantic ML code completion for internal Google developers.
https://ai.googleblog.com/2022/07/ml-enhanced-code-completion-improves.html
Harvard: CS109A - Introduction to Data Science Lectures
Harvard University's CS109A (Introduction to Data Science) lecture materials, Python notebooks, and video lectures are now freely available for download.
https://harvard-iacs.github.io/2019-CS109A/pages/materials.html
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.