Data Engineering Weekly #98
The Weekly Data Engineering Newsletter
Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Wei Jian: Lessons learned after 1 year with dbt
An excellent reflection on a year of running dbt. Growing from 0 to 700 dbt models is impressive, and the growth aligns with the strategy the team adopted: messy to start, clean afterward.
There are two approaches to handle the demand: build dbt models carefully while meeting all best practices (more proper), or build them fast and messy and worry about tech debt later (as long as the data is correct).
ManyChat: Data Modeling Today - launching cost-effective analytics for ManyChat
“Proper data modeling will take less time and money to add new features or new capabilities to the analytics. Your data will increase over time, even if your business is stable. But proper data modeling can help you avoid proportional growth of costs.”
ManyChat, at the same time, writes about its experience using data modeling as a catalyst for growth.
My take on both approaches.
Balancing growth and clean, reusable datasets against cost concerns will be the next hot data problem to solve. Whether you choose top-down data modeling or bottom-up tech-debt handling, data modeling plays a significant role in the health of a company's data.
John Foley: The Final Days of the Legacy Data Warehouse
For the first time, I heard of a company moving from Snowflake to Teradata, citing unpredictable costs. It surprised me, since I assumed legacy data warehouses were reaching their final days, as the author points out.
The Twitter thread sparked an exciting conversation, in which Bobby Neelon shared some practical tips for optimizing Snowflake.
Wise Engineering: 3 Main Elements in Your Snowflake Bill
Continuing the Snowflake cost debate, Wise Engineering writes a timely article on the three main elements of your Snowflake bill and how to understand and optimize them for cost efficiency.
Someone should launch a Udemy course: Understanding Snowflake billing
Sponsored: Firebolt is a proud sponsor of Data Engineering Weekly.
Firebolt is the cloud data warehouse for builders of next-gen analytics experiences.
Combining the benefits and ease of use of modern architecture with sub-second performance at terabyte scale, Firebolt helps data engineering and dev teams deliver data applications that end-users love.
Walmart Global Tech: Implementation of SCD-2 (Slowly Changing Dimension) with Apache Hudi
Throwing everything into a data lake is a long-gone architecture approach, as modern lakehouses bring data modeling back to the mainstream. Walmart writes an excellent article on handling SCD-2 (Slowly Changing Dimension) with Apache Hudi.
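For readers new to the pattern, here is a minimal sketch of the SCD-2 bookkeeping itself, in plain Python. It is not Walmart's implementation and does not use Apache Hudi; the row layout (key, value, valid_from, valid_to, is_current) and the function name apply_scd2 are assumptions made up for illustration.

```python
from datetime import date


def apply_scd2(dimension, updates, effective):
    """Return a new list of dimension rows with SCD-2 history applied.

    dimension: list of dicts with keys key, value, valid_from, valid_to, is_current
    updates:   dict mapping key -> latest value (hypothetical input shape)
    effective: date on which the new values take effect
    """
    result = []
    keys_with_current_row = set()

    for row in dimension:
        key = row["key"]
        if row["is_current"] and key in updates:
            keys_with_current_row.add(key)
            if updates[key] != row["value"]:
                # Value changed: close the old version and open a new one.
                result.append({**row, "valid_to": effective, "is_current": False})
                result.append({
                    "key": key,
                    "value": updates[key],
                    "valid_from": effective,
                    "valid_to": None,
                    "is_current": True,
                })
                continue
        # Unchanged or historical rows are copied as-is.
        result.append(dict(row))

    # Keys without a current row are inserted as fresh current versions.
    for key, value in updates.items():
        if key not in keys_with_current_row:
            result.append({
                "key": key,
                "value": value,
                "valid_from": effective,
                "valid_to": None,
                "is_current": True,
            })
    return result


dim = [{
    "key": "c1",
    "value": "Austin",
    "valid_from": date(2022, 1, 1),
    "valid_to": None,
    "is_current": True,
}]
new_dim = apply_scd2(dim, {"c1": "Dallas", "c2": "Boston"}, date(2022, 9, 1))
for row in new_dim:
    print(row)
```

The old "Austin" row is kept but closed, so history stays queryable. In a lakehouse table the same logic would typically be expressed as an upsert or merge, which is the part the Walmart article covers.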
Rina Diane Caballar: The Rise of SQL - It’s become the second programming language everyone needs to know
SQL everywhere has become the norm, from transactional databases to data warehouses to streaming databases. Several NoSQL systems have formally adopted SQL-like syntax as an interface. The article reflects the same sentiment about why SQL has become the second programming language everyone needs to know.
Sponsored: Insights and Musings from Holden Karau, Open Source Engineer at Netflix, in this Data Dream Team podcast episode
If you’re a data engineer and you want to become a better data engineer, or you’re switching from software engineering to data engineering...I think a great thing that you can do is code reviews and open-source to the tools that you’re going to be using. You’re going to understand how they’re built. You’re going to see how the different pieces interact. And of course, you can do this by writing code in the tools that you’re using as well.
Paul Ramsey: Rise of the Anti-Join
SQL provides multiple ways to express set operations between two or more entities. The author runs an experiment comparing the various ways to frame the query "Find me all the things in set A that are not in set B."
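As a rough illustration of what "multiple ways" means here, the sketch below runs four common anti-join formulations against an in-memory SQLite database from Python. The tables a and b and their contents are made up, and SQLite is used only because it ships with Python; it says nothing about how any particular database plans or optimizes these queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id INTEGER);
    INSERT INTO a VALUES (1), (2), (3), (4);
    INSERT INTO b VALUES (2), (4), (5);
""")

# Four ways to ask: which ids are in a but not in b?
queries = {
    "not_exists": """
        SELECT a.id FROM a
        WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.id = a.id)
        ORDER BY a.id
    """,
    "left_join": """
        SELECT a.id FROM a
        LEFT JOIN b ON a.id = b.id
        WHERE b.id IS NULL
        ORDER BY a.id
    """,
    # EXCEPT also removes duplicate rows from the result.
    "except": "SELECT id FROM a EXCEPT SELECT id FROM b ORDER BY id",
    # NOT IN returns no rows if b.id contains a NULL; there are none here.
    "not_in": "SELECT id FROM a WHERE id NOT IN (SELECT id FROM b) ORDER BY id",
}

results = {name: [row[0] for row in conn.execute(sql)]
           for name, sql in queries.items()}

for name, ids in results.items():
    print(name, ids)
```

All four forms return ids 1 and 3 on this data. They diverge once NULLs or duplicate rows are involved, and they can perform very differently, which is the kind of comparison the article explores.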
Chip Huyen: Streaming-First Infrastructure for Real-Time Machine Learning
Variability and unpredictability with late-arriving features make the real-time machine learning pipeline challenging. The author writes about various infrastructure strategies to build streaming-first infrastructure for real-time machine learning.
Sponsored: Why Business Applications Create Data Integration Debt
This article from Ben Rogojan explores the challenges of data integration in a world where more teams need access to more data for more complex use cases, and it outlines the pitfalls of attacking data integration without a thoughtful strategy.
Uber: Uber Freight Carrier Metrics with Near-Real-Time Analytics
Uber writes an exciting blog about the generation of Uber Freight's carrier scoreboard analytical system. The blog narrates the iteration from querying MySQL DB to establishing a real-time streaming system using Apache Pinot & Flink.
Etsy: Towards Machine Learning Observability at Etsy
Etsy writes an exciting article about its observability approach for the Machine Learning Platform. The blog narrates the motivation to build a centralized ML observability platform, challenges, and a high-level design.
Instacart: How Instacart Uses Machine Learning-Driven Autocomplete to Help People Fill Their Carts
Stories of adopting machine learning to power product features and enrich the user experience are always a delight to read. Instacart shares one such experience: how it uses ML to improve autocomplete.
Cory Maklin: Data Governance Checklist
As a data engineer, you can't escape data governance. What is data governance? The author explains it with a checklist covering the process, the roles, and the people involved.
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.