Data Engineering Weekly


Data Engineering Weekly #95

The Weekly Data Engineering Newsletter

Ananth Packkildurai
Jul 31, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.


Uber: ML Education at Uber: Frameworks Inspired by Engineering Principles

Uber wins by scaling machine learning. We recognize org-wide that a powerful way to scale machine learning adoption is by educating. That’s why we created the Machine Learning Education Program.

Educating internal customers is a vital part of a platform. An Engineering org is a mesh of various skills, and not everyone reads Data Engineering Weekly or goes to all the cool Data & ML conferences. Uber writes an excellent article on ML education and its approaches from the engineering principles perspective.

https://eng.uber.com/ml-education-at-uber/


Amplify Partners: Building Modern Data Teams

The Amplify team published an excellent collection of articles to help build modern data teams. The collection covers building a data org, data strategy, the focus of a data team, and establishing a career in data, among other topics.

https://amplifypartners.com/moderndatateamshub/


Stephen Bailey: Airflow's Problem

Oh wow! Shots fired. Possibly one of the best articles clearly articulating the problems with Airflow.

So what is my problem with Airflow? My problem is that Airflow was not designed to address these problems — it lacks the ambition we need, even while occupying the critical pedestal as the foundational execution engine.

In fact, Airflow is already displaced. Airflow qua Airflow is already obsolete, and it happened right within the Airflow ecosystem. It’s called Astronomer.

I had a similar experience when I explained most of the problems the author mentions to the Astronomer team and sketched why dbt is Airflow's missed opportunity. I wrote the gist of it here:

https://www.dataengineeringweekly.com/p/bundling-vs-unbundling-the-tale-of

The response I got: our customers struggle to spin up and maintain Airflow, and we are going after the cloud infrastructure. From a business perspective, I understand and agree with the strategy. However, I can't stop thinking: is commercialization killing the mission of an open source system?

https://stkbailey.substack.com/p/airflows-problem


Benn Stancil: The powder keg of the modern data stack

Another classic narration from Benn about the current state of the data landscape and the promised land of what it could be.

For the last few years, most data startups have followed Peter Thiel’s advice: Avoid competition. This positioning, however, can’t last forever. Eventually, as startups grow, companies that see one another as polite partners will start to jockey for the same space.

I can’t second this thought enough.

Ananth Packkildurai (@ananthdurai):
🎯 "Having this many tools without a coherent, centralized control plane is lunacy and a terrible endstate for data practitioners and their stakeholders." 🎯 I hear a comparison of modern data stack (MDS) with Unix philosophy!!! Who is the Unix terminal for MDS is a Billion $??

Quoting Nick Schrock (@schrockn), 12:14 AM ∙ Feb 18, 2022:
4/ I don’t think anyone believes that this is an ideal end state. The post itself advocates for consolidation. Having this many tools without a coherent, centralized control plane is lunacy, and a terrible endstate for data practitioners and their stakeholders. We see two trends:

https://benn.substack.com/p/powder-keg


Sponsored: Firebolt - Firebolt is a proud sponsor of Data Engineering Weekly.

Firebolt is the cloud data warehouse for builders of next-gen analytics experiences. Combining the benefits and ease of use of modern architecture with sub-second performance at terabyte scale, Firebolt helps data engineering and dev teams deliver data applications that end-users love.

https://www.firebolt.io/


George Kozlov: Databricks usage and cost analysis

Cost optimization is a critical engineering function in system design. The author explains strategies for optimizing Databricks cloud costs and where to source the usage data needed to do so.

https://medium.com/@progeorgek/databricks-usage-and-cost-analysis-e974e380916a


99.co: The Evolution of Transformation Layer Architecture in 99 Group (DBT, Airflow, and Kubernetes)

99.co writes an excellent overview of the infrastructure setup of Airflow, dbt & Kubernetes with code examples.

https://medium.com/99dotco/the-evolution-of-transformation-layer-architecture-in-99-group-dbt-airflow-and-kubernetes-cb46900f3662


Sponsored: Soda - Data engineers, your life is about to get easier.

Get hands-on with the new open-source framework to test and monitor data as-code, across every data workload, from ingestion to transformation to production. Easy to set up, read, and maintain. Install and try out Soda Core to see how to stop firefighting data issues, maintain reliable pipelines, and deliver high-quality, reliable data products. Access the docs here:

https://docs.soda.io/soda-core/overview-main.html


Confluent: 4 Must-Have Tests for Your Apache Kafka CI/CD with GitHub Actions

Confluent writes an excellent article about the approaches one can take to apply unit & integration testing for Apache Kafka pipelines.

https://www.confluent.io/blog/apache-kafka-ci-cd-with-github/


Niels Claeys: Make Spark resilient against spot interruptions on Kubernetes

Running Apache Spark workloads on spot instances can reduce infrastructure cost by as much as 60-90% compared to on-demand instances. The author shares tips and design strategies for making Spark applications ready for spot instances.

https://blog.dataminded.com/make-spark-resilient-against-spot-interruptions-on-kubernetes-a2d6403399b0


Sponsored: The Data Stack Show Live - The Future of ML

Join this live recording of The Data Stack Show with Continual CEO, Tristan Zajonc and Tecton Engineering Manager, Willem Pienaar to explore the democratization of ML, how it's impacting businesses today, and where things are headed next.

https://datastackshow.com/events/the-future-of-machine-learning/


Alexey Grishchenko: Lakehouse

The CIDR 2021 paper "Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics" is an informative read on modern Lakehouse design. The author writes an excellent summary of the paper, narrating the historical development that led to Lakehouse systems.

https://0x0fff.com/lakehouse/


Kevin Hu: A Framework to Understand How Low-Quality Data Hurts Business Performance

Does data quality impact business operations? The author walks through the maturity stages of data adoption in an organization and a framework for measuring the impact of data quality at each stage.

https://towardsdatascience.com/a-framework-to-understand-how-low-quality-data-hurts-business-performance-386c10c4fe1e


Criteo: Reporting Data at Criteo: How to Measure at Scale

Criteo writes about its reporting infrastructure built on top of Apache Kafka, HDFS, and Vertica. (🤔 Is that what folks call the pre-modern data stack?) The blog narrates Criteo's approach to data quality, data modeling, and data partitioning.

https://medium.com/criteo-engineering/reporting-data-at-criteo-how-to-measure-at-scale-b315b0d8d78a


Google AI: ML-Enhanced Code Completion Improves Developer Productivity

I’m an active user of the Tabnine code completion plugin in IntelliJ, and I’m sure many of you use GitHub Copilot. Google writes about how it combines ML and semantic engines (SE) to develop a novel Transformer-based hybrid semantic ML code completion for internal Google developers.

https://ai.googleblog.com/2022/07/ml-enhanced-code-completion-improves.html


Harvard: CS109A - Introduction to Data Science Lectures

The Harvard University CS109A - Introduction to Data Science lecture materials, Python notebooks, and video lectures are now free to download.

https://harvard-iacs.github.io/2019-CS109A/pages/materials.html


All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.


© 2023 Ananth Packkildurai