Data Engineering Weekly #94
The Weekly Data Engineering Newsletter
Sponsored: Rudderstack - DROP the Modern Data Stack
It’s time to make sense of today’s data tooling ecosystem. Check out rudderstack.com/dmds to get a guide that will help you build a practical data stack for every phase of your company’s journey to data maturity. The guide includes architecture and tactical advice to help you progress through four stages: Starter, Growth, Machine Learning, and Real-Time. Visit RudderStack.com/dmds today to DROP the modern data stack and USE a practical data engineering framework.
Ponder: Pandas vs. SQL
Walmart recently wrote about the DataBathing framework, which transpiles SQL into Spark DataFrame code. It made me curious to dig deeper, and Rajesh Parikh shared his insights along with this excellent discussion about pandas and SQL. The vision of a layered architecture with a pluggable runtime that supports both SQL and DataFrame APIs is tempting. Apache Beam attempts a similar strategy, but it focuses primarily on unifying batch and real-time computing.
The next two posts happen to be on the same topic. Coincidence? 🤔
Databricks: Power to the SQL People - Introducing Python UDFs in Databricks SQL
Staying with SQL & pandas, Databricks announces support for Python UDFs in Databricks SQL. Macros and UDFs try to extend the declarative nature of SQL, but is that enough? Please share your thoughts in the comments or on Twitter @data_weekly.
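To make the idea of calling Python from SQL concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in and is not Databricks SQL syntax; the mask_email function and the users table are hypothetical names made up for illustration.

```python
import sqlite3


def mask_email(email):
    """Hypothetical UDF: keep the first character of the local part, hide the rest."""
    if email is None or "@" not in email:
        return email  # pass NULLs and malformed values through unchanged
    local, domain = email.split("@", 1)
    return local[:1] + "***@" + domain


# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")

# Register the Python function so SQL can call it by name with one argument.
conn.create_function("mask_email", 1, mask_email)

conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "alice@example.com"), (2, "bob@example.org"), (3, None)],
)

# The query stays declarative; the row-level logic lives in Python.
rows = conn.execute(
    "SELECT id, mask_email(email) FROM users ORDER BY id"
).fetchall()

for row in rows:
    print(row)
```

The query itself remains plain SQL, while the imperative logic sits in a named function. That is the trade-off the question above is asking about.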
Mimoune Djouallah: First Look at Google Malloy
We have seen a SQL-to-DataFrame transpiler and Python UDFs; Malloy instead approaches the problem from a data modeling perspective, describing data relationships and transformations. The author takes a first look at Malloy, walking through its SQL generation, ease of use, and BI integrations.
LinkedIn: Measuring downstream impact on social networks by using an attribution framework
Analyzing downstream impact across a network reveals unique insights, but how hard is it? What if the downstream relationship is 1:n and time-sensitive? LinkedIn writes about the CAMEL framework for downstream impact analytics, which handles 1:n relationships and time factors.
Sponsored: Soda - Data engineers, your life is about to get easier.
Get hands-on with the new open-source framework to test and monitor data as code, across every data workload, from ingestion to transformation to production. Easy to set up, read, and maintain. Try out and install Soda Core to see how to stop firefighting data issues, maintain reliable pipelines, and deliver high-quality, reliable data products. Access the docs here.
Ben Rogojan: Should You Use Apache Airflow?
Apache Airflow is a widely adopted data engineering orchestration engine that has also drawn its share of criticism. The author walks through the current sentiment around Airflow and its features.
David Jayatillake: Going with the Airflow
Staying with Airflow, we can’t help asking: does one need an Airflow-like orchestration engine when using only dbt? The author describes the scenarios in which an organization-wide orchestration engine is required.
It is also a good refresher on the evolution of Airflow & dbt.
Sponsored: Firebolt - Firebolt is a proud sponsor of Data Engineering Weekly.
Firebolt is the cloud data warehouse for builders of next-gen analytics experiences.
Combining the benefits and ease of use of modern architecture with sub-second performance at terabyte scale, Firebolt helps data engineering and dev teams deliver data applications that end users love.
Airbnb: Democratizing Metrics at Airbnb
Airbnb's Minerva blogs influenced many conversations around the metrics layer. dbt talked about its vision of the metric system, and there was some exciting discussion about the metric layer & metadata.
Airbnb describes the system design of the second generation of Minerva (2.0) and the challenges with the initial version.
Minerva’s previous blogs for reference:
Uber: Supercharging A/B Testing at Uber
Uber writes about its next-generation A/B testing infrastructure and its system design goals. The design focuses on user productivity and the flexibility to support a variety of experiments.
Shopify: Data-Centric Machine Learning - Building Shopify Inbox’s Message Classification Model
The model-centric vs. data-centric debate is an exciting development in ML infrastructure. Shopify writes about the data-centric ML approach it used to build the message classification model for Shopify Inbox.
Sponsored: Rudderstack - The Data Maturity Journey - Webinar July 27th at 10:30 AM PT / 1:30 PM ET
Join RudderStack live with the Seattle Data Guy, Ben Rogojan, and Max Werner, Owner of Obsessive Analytics Consulting, to learn about the four stages of The Data Maturity Journey. You'll come away with practical architectures you can use to drive better decision-making at every stage of your company’s growth.
Adobe: Exploring Kafka Consumer’s Internals
Adobe writes an excellent overview of the Kafka consumer’s internals, focusing on the Fetcher, consumer groups, rebalancing, and the partition assignor.
Afshine Amidi & Shervine Amidi: CS 229 - Machine Learning cheatsheet
If you’re starting to explore ML, the cheatsheet is an excellent guide.
The course-notes repo shares notes for AI, ML, and NLP-related courses.
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.