Data Engineering Weekly #94

The Weekly Data Engineering Newsletter

Jul 24, 2022

Ponder: Pandas vs. SQL

Discussion between Aditya Parameswaran & Sesh Seshadri

Walmart recently wrote about the DataBathing framework, which transpiles SQL into Spark DataFrame code. It made me curious to research more, and Rajesh Parikh shared his insights along with this excellent discussion about pandas and SQL. The vision of a layered architecture with a pluggable runtime supporting both SQL & DataFrame APIs is tempting. Apache Beam attempts a similar strategy but primarily focuses on unifying batch & real-time computing.

The next two posts happen to be on the same topic. Coincidence? 🤔

Databricks: Power to the SQL People - Introducing Python UDFs in Databricks SQL

Staying with SQL & Pandas, Databricks announces support for Python UDFs in SQL. Macros and UDFs aim to extend what declarative SQL can express, but is that enough? Please share your thoughts in the comments or on Twitter @data_weekly.

https://databricks.com/blog/2022/07/22/power-to-the-sql-people-introducing-python-udfs-in-databricks-sql.html
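To make the idea concrete, here is a minimal sketch of calling a Python function from a SQL query. It uses Python's built-in sqlite3 module rather than Databricks SQL, so the registration step is not the syntax from the Databricks post; the mask_email function and the users table are hypothetical names chosen for illustration.

```python
import sqlite3


def mask_email(email):
    """Hypothetical UDF: hide most of the local part of an email address."""
    if email is None or "@" not in email:
        return email  # pass NULLs and malformed values through unchanged
    local, domain = email.split("@", 1)
    return local[:1] + "***@" + domain


conn = sqlite3.connect(":memory:")

# Register the Python function so SQL can call it by name with one argument.
conn.create_function("mask_email", 1, mask_email)

conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "ananth@example.com"), (2, None)],
)

# The query stays declarative; the row-level logic lives in Python.
rows = conn.execute(
    "SELECT id, mask_email(email) FROM users ORDER BY id"
).fetchall()
print(rows)
```

The query itself stays plain SQL while the masking rule is ordinary Python, which is roughly the split the UDF approach aims for.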

Mimoune Djouallah: First Look at Google Malloy

We have seen a SQL-to-DataFrame transpiler & Python UDFs; Malloy approaches the problem from the data modeling perspective, describing data relationships & transformations. The author writes a first look at Malloy, walking through SQL generation, ease of use, and BI integrations.

https://datamonkeysite.com/2022/07/22/first-look-at-google-malloy/

LinkedIn: Measuring downstream impact on social networks by using an attribution framework

Network analytics of downstream impact can reveal unique insights, but how hard is it to measure? What if the downstream relationship is 1:n and time-sensitive? LinkedIn writes about the CAMEL framework for downstream impact analytics, which handles both 1:n relationships & time factors.

https://engineering.linkedin.com/blog/2022/measuring-downstream-impact-on-social-networks

Ben Rogojan: Should You Use Apache Airflow?

Apache Airflow is a widely adopted data engineering orchestration engine that has also drawn its share of criticism. The author walks through the current sentiment around Airflow and its features.

https://medium.com/coriers/should-you-use-apache-airflow-e71c6cf7c0c4

David Jayatillake: Going with the Airflow

Staying with Airflow, we can’t stop wondering: does one need an Airflow-like orchestration engine when using only dbt? The author narrates the scenarios where an organization-wide orchestration engine is required.

https://davidsj.substack.com/p/going-with-the-airflow-part-1

For readers, the post below is also a good refresher on the evolution of Airflow & dbt.

Bundling Vs. UnBundling: The Tale of Airflow Operator and dbt Ref

I started working on the data pipeline at the early stage of Hadoop/Big Data, when Big Data was a buzzword. Apache Oozie (anyone remember Oozie?) was a go-to tool to orchestrate the data pipeline, where you had to hand-code the workflow in an XML file (not surprisingly, the file name is workflow.xml…

Ananth Packkildurai

Airbnb: Democratizing Metrics at Airbnb

Airbnb's Minerva blogs influenced many conversations around the metrics layer. dbt talked about its vision of the metric system, and there was some exciting discussion about the metric layer & metadata.

Airbnb talks about the second generation of Minerva (2.0) system design, narrating the challenges with the initial version of Minerva.

Minerva’s previous blogs for reference:

How Airbnb Achieved Metric Consistency at Scale

How Airbnb Standardized Metric Computation at Scale

Uber: Supercharging A/B Testing at Uber

Uber writes about its next-generation A/B testing infrastructure and its system design goals. The design focuses on user productivity & the flexibility to support a variety of experiments.

https://eng.uber.com/supercharging-a-b-testing-at-uber/

Shopify: Data-Centric Machine Learning - Building Shopify Inbox’s Message Classification Model

Model-Centric vs. Data-Centric is an exciting development in ML infrastructure. Shopify writes about its Data-Centric ML approach to building Shopify Inbox's message classification system.

https://shopifyengineering.myshopify.com/blogs/engineering/shopify-inbox-message-classification-model

Adobe: Exploring Kafka Consumer’s Internals

Adobe writes an excellent overview of the Kafka consumer’s internals, focusing on the Fetcher, Consumer Group, Rebalancing & Partition Assigner.

https://medium.com/adobetech/exploring-kafka-consumers-internals-b0b9becaa106

Afshine Amidi & Shervine Amidi: CS 229 - Machine Learning cheatsheet

If you’re starting to explore ML, the cheatsheet is an excellent guide.

https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-unsupervised-learning

The ML Course Notes repo shares notes for AI, ML & NLP-related courses.

https://github.com/dair-ai/ML-Course-Notes

All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
