Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
SiliconANGLE theCUBE: Analyst Predictions 2023 - The Future of Data Management
By far one of the best analyses of trends in Data Management. The panel's 2023 predictions are:
Unified metadata becomes the kingmaker
Rethink the modern data stack
SQL is back
The definition of data is expanding
BI/Analytics gets embedded & automated
A few favorite quotes from the conversation
On Data Catalogs
Data Catalogs are very soon not going to be a standalone product; they are going to get embedded - Sanjeev Mohan.
Customers with data catalogs, whether embedded or not, express three times higher satisfaction with their analytical infrastructure - David Menninger.
I wrote about how Data Catalogs are failing in my recent article, Data Catalog - A Broken Promise. Data Catalogs are becoming a feature, not a product, and we are seeing embedded data catalog tools like Recap emerge to address these concerns.
On Modern Data Stack
The problem is that what we’re getting to, unfortunately, is what I would call lots of islands of simplicity, but it’s a complex toolchain - Tony Baer.
🎯 I defined the modern data stack sometime back as:
But I like the term “Lots of Islands of Simplicity, but it’s a complex toolchain.”
Julius Remigio - Why I moved my dbt workloads to GitHub and saved over $65,000
dbt Labs’ recent abrupt pricing change triggered interesting conversations about value creation and the per-user billing model. The author writes an exciting article about switching dbt workloads to GitHub Actions to save over $65,000.
Sponsored: Upsolver - Write a SQL Query, Get a Data-in-Motion Pipeline!
Pipelines for data in motion can quickly turn into DAG hell. Upsolver SQLake lets you process fast-moving data by simply writing a SQL query.
Streaming plus batch unified in a single platform.
Stateful processing at scale - joins, aggregations, upserts
Orchestration auto-generated from the data and SQL
Templates with sample data for Kafka/Kinesis/S3 sources -> S3/Athena/Snowflake/Redshift
Lars Kamp: How DoorDash built the greatest go-to-market playbook ever
What is the impact of an efficient data & analytical infrastructure, and how can it bring you a competitive advantage? The author writes about the success of DoorDash, which, despite being a late mover into the market and starting merely by supplying meals to Stanford graduates, uses data to beat the competition.
https://findingdistribution.substack.com/p/how-doordash-built-the-greatest-go
DoorDash: How DoorDash Upgraded a Heuristic with ML to Save Thousands of Canceled Orders
To validate the previous article on DoorDash’s efficient data & analytical platform usage, DoorDash writes about how it uses ML to save thousands of canceled orders. The article highlights the business need, the challenges of reflecting accurate store opening hours, and how the ML model replaces the heuristic approach.
Netflix: Causal Machine Learning for Creative Insights
Netflix is another flagship company that successfully powers its business operations & product features with machine learning. The blog narrates the causal machine-learning approach used to design promotional artwork for their TV shows.
https://netflixtechblog.medium.com/causal-machine-learning-for-creative-insights-4b0ce22a8a96
Sponsored: 10 Things to Consider Before Choosing a Data Observability Platform
Ready to stop fighting bad data and explore end-to-end coverage with Data Observability? Learn the 10 most important things to consider when choosing a data observability platform. Get the new platform guide, and take the next step in your journey to data trust.
Animesh Kumar: Optimizing Data Modeling for the Data-First Stack
The author captures the sad state, expectations, and disappointments of different data stakeholders. Data Modeling is increasingly becoming a topic of discussion, and I recently wrote about Data Modeling with Functional Data Engineering Principles. I like the layered approach the author expresses and how Data Contracts play a significant role in Data Modeling.
I recently started working on Schemata [https://schemata.app/], so keep watching this space for some exciting announcements soon!
https://moderndata101.substack.com/p/optimizing-data-modeling-for-the
Piethein Strengholt: Medallion architecture - best practices for managing Bronze, Silver, and Gold
I have always been uncomfortable with the naming convention of the medallion data architecture. The names carry little meaning about the outcome, though they sound fancy. However, the medallion architecture brings a clear bucketing of data that aligns with the organization's delivery strategy: from raw data → filtered & clean data → business metrics. The author writes a few best practices for managing a medallion-style architecture.
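As a toy illustration of the bronze → silver → gold layering (a pandas sketch of my own, not code from the article; the table and column names are made up):

```python
import pandas as pd

# Bronze: raw ingested events, kept as-is (duplicates, nulls and all).
bronze = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "amount": [10.0, 10.0, None, 5.0],
    "status": ["paid", "paid", "paid", "canceled"],
})

# Silver: filtered & cleaned - deduplicate and drop malformed rows.
silver = bronze.drop_duplicates().dropna(subset=["amount"])

# Gold: business-level metrics derived from the clean layer.
gold = silver.groupby("status")["amount"].sum().rename("revenue")
print(gold)
```

In a real lakehouse each layer would be a separate table (often Delta/Iceberg), but the raw → clean → metrics progression is the same.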
Sponsored: Take Control of Your Customer Data With RudderStack
Legacy CDPs charge you a premium to keep your data in a black box. RudderStack builds your CDP on top of your data warehouse, giving you a more secure and cost-effective solution. Plus, it gives you more technical controls, so you can fully unlock the power of your customer data.
Take control of your customer data today.
Cars24: Upgrading Data Flow Pipeline at CARS24
Cars24 writes a two-part series on upgrading and optimizing its data flow pipeline. The discussion of the cost of idle CPU in Snowflake is especially interesting. The author walks through various strategies, from sync to async job submission to a batch job submission strategy.
https://medium.com/cars24-data-science-blog/optimizing-data-flow-cars24-4c0a17b797d1
https://medium.com/cars24-data-science-blog/upgrading-data-flow-pipeline-cars24-1b6b8aea48e
Lysann Hesske: Behind data42 - (meta)data management
As the definition of data expands in scope, the author explains the complexity of managing metadata at data42. The blog makes the case for centralized data management efforts to collaborate and foster data research with the outside world.
https://medium.com/@lysann_hesske/behind-data42-meta-data-management-299c524407db
Aayush Agrawal: Model calibration for classification tasks using Python
Many machine learning models’ probabilistic outputs cannot be directly interpreted as the probability of an event happening. To achieve this, the model needs to be calibrated. The author explains what calibration is, in which applications it is important and why, and covers three different methods:
Isotonic regression
The sigmoid (Platt) method of calibration
The log loss metric for evaluating calibration
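The methods above can be sketched with scikit-learn's `CalibratedClassifierCV`, which supports both the isotonic and sigmoid methods (an illustrative sketch on synthetic data, not the author's code):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Uncalibrated baseline: raw scores from a random forest.
base = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("uncalibrated log loss:", log_loss(y_test, base.predict_proba(X_test)))

# Calibrate with each method and compare log loss on held-out data.
for method in ("isotonic", "sigmoid"):
    calibrated = CalibratedClassifierCV(
        RandomForestClassifier(random_state=0), method=method, cv=5
    ).fit(X_train, y_train)
    print(method, "log loss:", log_loss(y_test, calibrated.predict_proba(X_test)))
```

A lower log loss after calibration suggests the predicted probabilities are closer to the true event frequencies.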
Arli: Parquet Best Practices - Discover your Data without loading them
Parquet is the most popular columnar format used in data lakes. The author explains how to use a simple Python script to read a Parquet file's metadata and summary statistics for each column without loading the data.
All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.