Data Engineering Weekly #114

The Weekly Data Engineering Newsletter

Ananth Packkildurai
Jan 16

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.


SiliconANGLE theCUBE: Analyst Predictions 2023 - The Future of Data Management

By far one of the best analyses of trends in Data Management. The panel's 2023 predictions are:

  1. Unified metadata becomes kingmaker.

  2. Rethink the modern data stack

  3. SQL is back

  4. The definition of data is expanding

  5. BI/ Analytics get embedded & automated

A few favorite quotes from the conversation:

On Data Catalogs

Data Catalogs are very soon not going to be a standalone product; They are going to get embedded - Sanjeev Mohan.

Customers with data catalogs, whether embedded or not, express three times higher satisfaction with their analytical infrastructure - David Menninger.

I wrote about how Data Catalogs are failing in my recent article, Data Catalog - A Broken Promise. Data Catalogs are moving towards being a feature, not a product, and we are seeing embedded data catalog tools like Recap emerge to address these concerns.

On Modern Data Stack

The problem is that what we’re getting to, unfortunately, is what I would call lots of islands of simplicity, but it’s a complex toolchain - Tony Baer.

🎯 I defined the modern data stack some time back as:

@ananthdurai (Jan 6, 2022): @sarahmk125 MDS is a set of vendor tools that solve niche data problems (lineage, orchestration, quality) with the side effect of creating a disjointed data workflow that makes data folks lives more complicated.

But I like the term “Lots of Islands of Simplicity, but it’s a complex toolchain.”


Julius Remigio - Why I moved my dbt workloads to GitHub and saved over $65,000

dbt Labs' recent abrupt pricing change triggered interesting conversations about value creation and the per-user billing model. The author writes an exciting article about switching to GitHub Actions for running dbt, saving over $65,000.

https://medium.com/@datajuls/why-i-moved-my-dbt-workloads-to-github-and-saved-over-65-000-759b37486001
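
The core idea is scheduling dbt Core runs from a CI runner instead of a managed per-seat service. A minimal sketch of that pattern in Python, invoking the dbt CLI from a CI step (the project path and "prod" target name are hypothetical, and dbt Core plus warehouse credentials are assumed to be available in the runner environment):

```python
# Minimal sketch: run dbt Core from a CI step (e.g., a GitHub Actions job)
# instead of a managed scheduler. Assumes dbt Core is installed and
# credentials are supplied via environment variables; the project path
# and target name are hypothetical.
import subprocess
import sys


def run_dbt(*args: str) -> None:
    """Invoke the dbt CLI and fail the CI job if dbt returns a non-zero exit code."""
    result = subprocess.run(["dbt", *args], cwd="dbt_project")
    if result.returncode != 0:
        sys.exit(result.returncode)


if __name__ == "__main__":
    run_dbt("deps")                       # install package dependencies
    run_dbt("build", "--target", "prod")  # run and test models against the prod target
```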


Sponsored: Upsolver - Write a SQL Query, Get a Data-in-Motion Pipeline!

Pipelines for data in motion can quickly turn into DAG hell. Upsolver SQLake lets you process fast-moving data by simply writing a SQL query.

  • Streaming plus batch unified in a single platform.

  • Stateful processing at scale - joins, aggregations, upserts

  • Orchestration auto-generated from the data and SQL

  • Templates with sample data for Kafka/Kinesis/S3 sources -> S3/Athena/Snowflake/Redshift

Try now and get 30 Days Free


Lars Kamp: How DoorDash built the greatest go-to-market playbook ever

What is the impact of having an efficient data & analytical infrastructure, and how can it bring you a competitive advantage? The author writes about the success of DoorDash, despite it being a late mover that started out merely supplying meals to Stanford graduate students, and how it uses data to beat the competition.

https://findingdistribution.substack.com/p/how-doordash-built-the-greatest-go


DoorDash: How DoorDash Upgraded a Heuristic with ML to Save Thousands of Canceled Orders

To validate the previous article on DoorDash’s efficient data & analytical platform usage, DoorDash writes about how it uses ML to save thousands of canceled orders. The article highlights the business need, the challenges of reflecting accurate store opening hours, and how the ML model replaces the heuristic approach.

https://doordash.engineering/2023/01/10/how-doordash-upgraded-a-heuristic-with-ml-to-save-thousands-of-canceled-orders/


Netflix: Causal Machine Learning for Creative Insights

Netflix is another flagship company that successfully powers its business operations & product features with data and machine learning. The blog narrates the causal machine-learning approach to designing promotional artwork for their TV shows.

https://netflixtechblog.medium.com/causal-machine-learning-for-creative-insights-4b0ce22a8a96


Sponsored: 10 Things to Consider Before Choosing a Data Observability Platform

Ready to stop fighting bad data and explore end-to-end coverage with Data Observability? Learn the 10 most important things to consider when choosing a data observability platform. Get the new platform guide, and take the next step in your journey to data trust.

Get The Guide


Animesh Kumar: Optimizing Data Modeling for the Data-First Stack

The author captures different data stakeholders' sad states, expectations, and disappointments. Data Modeling has increasingly become a topic of discussion recently, and I've written about Data Modeling with Functional Data Engineering Principles. I like the layered approach the author lays out and how Data Contracts play a significant role in Data Modeling.

I recently started working on Schemata [https://schemata.app/], so keep watching this space for some exciting announcements soon!

https://moderndata101.substack.com/p/optimizing-data-modeling-for-the


Piethein Strengholt: Medallion architecture - best practices for managing Bronze, Silver, and Gold

I always find myself uncomfortable with the naming convention of the medallion data architecture. The names hold little meaning for the outcome, but they are fancy. However, the medallion architecture brings a clear bucketing of data to align with the organization's delivery strategy, from raw data → filtered & cleaned data → business metrics. The author writes a few best practices for managing a medallion-style architecture.

https://piethein.medium.com/medallion-architecture-best-practices-for-managing-bronze-silver-and-gold-486de7c90055
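
To make the raw → filtered & cleaned → business metrics bucketing concrete, here is a tiny, illustrative sketch in pandas. It is not from the article; the column names and cleaning rules are made up, and real medallion implementations typically sit on Delta Lake/Spark rather than pandas:

```python
# Illustrative only: a tiny bronze -> silver -> gold flow in pandas.
# Column names and cleaning rules are hypothetical, not from the article.
import pandas as pd

# Bronze: raw data landed as-is from the source system.
bronze = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": ["10.5", "n/a", "20.0", "7.25"],
    "country": ["US", "US", "US", "DE"],
})

# Silver: filtered & cleaned data - deduplicate, enforce types, drop bad rows.
silver = (
    bronze.drop_duplicates(subset="order_id", keep="last")
          .assign(amount=lambda df: pd.to_numeric(df["amount"], errors="coerce"))
          .dropna(subset=["amount"])
)

# Gold: business-level metrics ready for consumption (e.g., revenue by country).
gold = silver.groupby("country", as_index=False)["amount"].sum()
print(gold)
```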


Sponsored: Take Control of Your Customer Data With RudderStack

Legacy CDPs charge you a premium to keep your data in a black box. RudderStack builds your CDP on top of your data warehouse, giving you a more secure and cost-effective solution. Plus, it gives you more technical controls, so you can fully unlock the power of your customer data. 

Take control of your customer data today.


Cars24: Upgrading Data Flow Pipeline at CARS24

Cars24 writes a two-part series on upgrading and optimizing its data flow pipeline. The discussion of the cost of idle CPU in Snowflake is the most exciting part of the blog to me. The author walks through various strategies, from synchronous to asynchronous job submission to a batch job submission strategy.

https://medium.com/cars24-data-science-blog/optimizing-data-flow-cars24-4c0a17b797d1

https://medium.com/cars24-data-science-blog/upgrading-data-flow-pipeline-cars24-1b6b8aea48e
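
For context on the sync-vs-async point: the Snowflake Python connector supports submitting queries asynchronously, so the client does not block on a long-running statement and can poll for completion instead. A rough sketch of that pattern (connection parameters and the table name are placeholders, not taken from the article):

```python
# Rough sketch of asynchronous query submission with the Snowflake Python
# connector: submit, get a query ID, and poll instead of blocking.
# Connection parameters and the table name are placeholders.
import time
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...", warehouse="my_wh"
)
cur = conn.cursor()

# Submit the query without waiting for it to finish; Snowflake returns a query ID.
cur.execute_async("SELECT COUNT(*) FROM orders")  # 'orders' is a hypothetical table
query_id = cur.sfqid

# Poll for completion instead of holding a synchronous call open.
while conn.is_still_running(conn.get_query_status(query_id)):
    time.sleep(5)

# Attach to the finished query and fetch its results.
cur.get_results_from_sfqid(query_id)
print(cur.fetchall())
```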


Lysann Hesske: Behind data42 - (meta)data management

As the definition of data expands in scope, the author explains the complexity behind managing metadata at data42. The blog makes the case for a centralized data management effort to collaborate on and foster data research with the outside world.

https://medium.com/@lysann_hesske/behind-data42-meta-data-management-299c524407db


Aayush Agrawal: Model calibration for classification tasks using Python

Many machine learning models' probabilistic outputs cannot be directly interpreted as the probability of an event happening; to achieve that, the model needs to be calibrated. The author explains what calibration is, in which applications it is important and why, and walks through three different methods for calibration:

  1. Isotonic regression

  2. Sigmoid method of calibration

  3. Log loss metric for calibration

https://medium.com/data-science-at-microsoft/model-calibration-for-classification-tasks-using-python-1a7093b57a46
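
For a quick picture of what the sigmoid and isotonic methods look like in practice, here is a minimal scikit-learn sketch on a synthetic dataset (my own illustration, not the author's example), comparing raw and calibrated probabilities with log loss:

```python
# Minimal calibration sketch with scikit-learn on a synthetic dataset
# (not the author's example). Compares uncalibrated vs. calibrated
# probabilities using log loss.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Uncalibrated baseline.
base = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print("uncalibrated log loss:", log_loss(y_test, base.predict_proba(X_test)))

# Calibrate with the sigmoid (Platt) and isotonic methods via cross-validation.
for method in ("sigmoid", "isotonic"):
    calibrated = CalibratedClassifierCV(
        RandomForestClassifier(n_estimators=100, random_state=42),
        method=method,
        cv=3,
    )
    calibrated.fit(X_train, y_train)
    print(method, "log loss:", log_loss(y_test, calibrated.predict_proba(X_test)))
```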


Arli: Parquet Best Practices - Discover your Data without loading them

Parquet is the most popular columnar format used in data lakes. The author explains how to use a simple Python script to read Parquet file metadata and summary stats for each column without loading the data itself.

https://towardsdatascience.com/parquet-best-practices-discover-your-data-without-loading-them-f854c57a45b6
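
The gist of the approach - inspecting the schema and per-column statistics stored in the Parquet footer without reading the data - can be done with pyarrow. A small sketch (the file path is a placeholder):

```python
# Inspect a Parquet file's schema and per-column footer statistics
# without loading the data itself. The file path is a placeholder.
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile("data/example.parquet")

print(parquet_file.schema_arrow)  # column names and types

meta = parquet_file.metadata
print("rows:", meta.num_rows, "row groups:", meta.num_row_groups)

# Min/max/null-count statistics per column chunk in each row group.
for rg in range(meta.num_row_groups):
    row_group = meta.row_group(rg)
    for col in range(row_group.num_columns):
        column = row_group.column(col)
        stats = column.statistics
        if stats is not None:
            print(column.path_in_schema, stats.min, stats.max, stats.null_count)
```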


All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
