Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
SiliconANGLE theCUBE: Analyst Predictions 2023 - The Future of Data Management
By far one of the best analyses of trends in Data Management. The panel's 2023 predictions are:
Unified metadata becomes the kingmaker
Rethink the modern data stack
SQL is back
The definition of data is expanding
BI/Analytics gets embedded & automated
A few favorite quotes from the conversation
On Data Catalogs
Data Catalogs are very soon not going to be a standalone product; they are going to get embedded - Sanjeev Mohan.
Customers with data catalogs, whether embedded or not, express three times higher satisfaction with their analytical infrastructure - David Menninger.
I wrote about how Data Catalogs are failing in my recent article, Data Catalog - A Broken Promise. Data Catalogs are becoming a feature, not a product, and we are seeing embedded data catalog tools like Recap emerge to address these concerns.
On Modern Data Stack
The problem is that what we’re getting to, unfortunately, is what I would call lots of islands of simplicity, but it’s a complex toolchain - Tony Baer.
🎯 I defined the modern data stack sometime back as:
But I like the term “Lots of Islands of Simplicity, but it’s a complex toolchain.”
Julius Remigio - Why I moved my dbt workloads to GitHub and saved over $65,000
dbt Labs’ recent abrupt pricing change triggered interesting conversations about value creation and the per-user billing model. The author writes an exciting article about switching dbt workloads to GitHub Actions to save over $65,000.
Sponsored: Upsolver - Write a SQL Query, Get a Data-in-Motion Pipeline!
Pipelines for data in motion can quickly turn into DAG hell. Upsolver SQLake lets you process fast-moving data by simply writing a SQL query.
Streaming plus batch unified in a single platform.
Stateful processing at scale - joins, aggregations, upserts
Orchestration auto-generated from the data and SQL
Templates with sample data for Kafka/Kinesis/S3 sources -> S3/Athena/Snowflake/Redshift
Lars Kamp: How DoorDash built the greatest go-to-market playbook ever
What is the impact of an efficient data & analytical infrastructure, and how can it bring you a competitive advantage? The author writes about the success of DoorDash, which, despite being a late mover into the market and starting merely by supplying meals to Stanford graduates, uses data to beat the competition.
https://findingdistribution.substack.com/p/how-doordash-built-the-greatest-go
DoorDash: How DoorDash Upgraded a Heuristic with ML to Save Thousands of Canceled Orders
To validate the previous article on DoorDash’s efficient data & analytical platform usage, DoorDash writes about how it uses ML to save thousands of canceled orders. The article highlights the business need, the challenges of reflecting accurate store opening hours, and how the ML model replaces the heuristic approach.
Netflix: Causal Machine Learning for Creative Insights
Netflix is another flagship company that successfully powers its business operations & product features with machine learning. The blog narrates the causal machine-learning approach used to design promotional artwork for their TV shows.
https://netflixtechblog.medium.com/causal-machine-learning-for-creative-insights-4b0ce22a8a96
Sponsored: 10 Things to Consider Before Choosing a Data Observability Platform
Ready to stop fighting bad data and explore end-to-end coverage with Data Observability? Learn the 10 most important things to consider when choosing a data observability platform. Get the new platform guide, and take the next step in your journey to data trust.
Animesh Kumar: Optimizing Data Modeling for the Data-First Stack
The author captures the sad state, expectations, and disappointments of different data stakeholders. Data Modeling is increasingly becoming a topic of discussion, and I recently wrote about Data Modeling with Functional Data Engineering Principles. I like the layered approach the author expresses and how Data Contracts play a significant role in Data Modeling.
I recently started working on Schemata [https://schemata.app/], so keep watching this space for some exciting announcements soon!
https://moderndata101.substack.com/p/optimizing-data-modeling-for-the
Piethein Strengholt: Medallion architecture - best practices for managing Bronze, Silver, and Gold
I have always been uncomfortable with the naming convention of the medallion data architecture. The names carry little meaning about the outcome, though they sound fancy. However, the medallion architecture brings a clear bucketing of data that aligns with the organization's delivery strategy: from raw data → filtered & clean data → business metrics. The author writes a few best practices for managing a medallion-style architecture.
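As a toy illustration of the bronze → silver → gold layering (a pandas sketch of my own, not code from the article; the table and column names are made up):

```python
import pandas as pd

# Bronze: raw ingested events, kept as-is (duplicates, nulls and all).
bronze = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "amount": [10.0, 10.0, None, 5.0],
    "status": ["paid", "paid", "paid", "canceled"],
})

# Silver: filtered & cleaned - deduplicate and drop malformed rows.
silver = bronze.drop_duplicates().dropna(subset=["amount"])

# Gold: business-level metrics derived from the clean layer.
gold = silver.groupby("status")["amount"].sum().rename("revenue")
print(gold)
```

In a real lakehouse each layer would be a separate table (often Delta/Iceberg), but the raw → clean → metrics progression is the same.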
Sponsored: Take Control of Your Customer Data With RudderStack
Legacy CDPs charge you a premium to keep your data in a black box. RudderStack builds your CDP on top of your data warehouse, giving you a more secure and cost-effective solution. Plus, it gives you more technical controls, so you can fully unlock the power of your customer data.
Take control of your customer data today.
Cars24: Upgrading Data Flow Pipeline at CARS24
Cars24 writes a two-part series on upgrading and optimizing its data flow pipeline. The discussion of the cost of idle CPU in Snowflake is especially interesting. The author walks through various strategies, from sync to async job submission to a batch job submission strategy.
https://medium.com/cars24-data-science-blog/optimizing-data-flow-cars24-4c0a17b797d1
https://medium.com/cars24-data-science-blog/upgrading-data-flow-pipeline-cars24-1b6b8aea48e
Lysann Hesske: Behind data42 - (meta)data management
As the definition of data expands in scope, the author explains the complexity of managing metadata at data42. The blog makes the case for centralized data management efforts to collaborate and foster data research with the outside world.
https://medium.com/@lysann_hesske/behind-data42-meta-data-management-299c524407db
Aayush Agrawal: Model calibration for classification tasks using Python
Many machine learning models’ probabilistic outputs cannot be directly interpreted as the probability of an event happening. To achieve this, the model needs to be calibrated. The author explains what calibration is, in which applications it is important and why, and covers three different methods:
Isotonic regression
The sigmoid (Platt) method of calibration
The log loss metric for evaluating calibration
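The methods above can be sketched with scikit-learn's `CalibratedClassifierCV`, which supports both the isotonic and sigmoid methods (an illustrative sketch on synthetic data, not the author's code):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Uncalibrated baseline: raw scores from a random forest.
base = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("uncalibrated log loss:", log_loss(y_test, base.predict_proba(X_test)))

# Calibrate with each method and compare log loss on held-out data.
for method in ("isotonic", "sigmoid"):
    calibrated = CalibratedClassifierCV(
        RandomForestClassifier(random_state=0), method=method, cv=5
    ).fit(X_train, y_train)
    print(method, "log loss:", log_loss(y_test, calibrated.predict_proba(X_test)))
```

A lower log loss after calibration suggests the predicted probabilities are closer to the true event frequencies.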
Arli: Parquet Best Practices - Discover your Data without loading them
Parquet is the most popular columnar format used in data lakes. The author explains how to use a simple Python script to read a Parquet file's metadata and summary statistics for each column without loading the data.
All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.