Data Engineering Weekly #104
The Weekly Data Engineering Newsletter
Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Editor’s Note: DEW is the reader’s choice & Is Data Catalog living up to the hype?
Hello Data Friends, welcome to another edition of Data Engineering Weekly. First, I’m thrilled to see this poll from Airbyte: 48.5% of Data Engineers say they read Data Engineering Weekly to keep up with the data engineering landscape. Thank you all for your kind support. ❤️❤️❤️
Top of mind for me this week is the Data Catalog. I'm one of the early advocates for Data Catalogs and am excited about their possibilities. Data Engineering Weekly even published a special Metadata Edition focusing on the historical development of the Data Catalog.
It has been almost two years since we published the Metadata Edition, and I keep thinking back to it. Do Data Catalogs live up to the promise? To find out, I published an open poll on LinkedIn.
We will talk more about Data Catalogs in the coming weeks. Meanwhile, share your thoughts about Data Catalogs in the poll & comments section. Oh, and a humble request to the Data Catalog vendors: please abstain from the poll ❤️
Rittman Analytics: The dbt Semantic Layer, Data Orchestration, and the Modern Enterprise Data Stack
Last week was an eventful one for dbt with the Coalesce conference. I missed attending in person this time but caught tech talks via live streaming. My reaction to the conference,
You can watch all the recordings of the talks here
Two significant announcements at the dbt conference:
Python language support in dbt Core
Public preview of the semantic layer
The author gives an in-depth view of the dbt semantic layer.
Ben Rogojan: The Next Generation Of All-In-One Data Stacks
We have debated bundling vs. unbundling a lot. Are all-in-one data stacks the future? The article from Ben is timely, as dbt unveils the semantic layer to act as the hub of the analytical ecosystem. The author compares five available all-in-one data platforms and discusses their pros & cons.
[LAST CALL] There's still time to RSVP for IMPACT 2022 The Annual Data Observability Summit on October 25-26, 2022!
Don't miss a chance to get candid with your data peers on the hottest topics in data, learn about 2023 trends, and hear from the biggest names in data and analytics about the ideas and technologies pioneering our industry. Featuring founders and data leaders from dbt Labs, Fivetran, The New York Times, GitLab, and Fox Corporation.
Criteo: Highlights of RecSys 2022
RecSys is a leading conference focusing on industrial recommender engine implementation. Criteo published its key takeaways from the 2022 RecSys conference. TIL about AI-Mediated Communication and its impact.
Netflix: Orchestrating Data/ML Workflows at Scale With Netflix Maestro
Netflix writes about Maestro, its workflow orchestrator that can schedule and manage workflows at a massive scale. The article is a fantastic system design read on how to build a scalable orchestration engine. It is one of the very few systems I wish were open-sourced soon.
Sponsored: Soda - 🗣 Podcast: How To Build A Common Understanding Of Your Data Reliability Rules
Regardless of how data is being used, it is critical that the information is trusted. The practice of data reliability engineering has gained momentum recently to address that need. Soda Checks Language helps support the efforts of data teams with the corresponding Soda Core utility that acts on this new DSL. In this episode of the Data Engineering Podcast by Tobias Macey, Tom Baeyens explains their reasons for creating a new syntax for expressing and validating checks for data assets and processes, as well as how to incorporate it into your own projects.
Checkout.com: Testing & Monitoring the Data Platform at Scale
Data testing and data observability are vital to maintaining data quality in a complex data pipeline. Checkout.com writes about how it uses dbt tests, Monte Carlo, and Datadog to test & monitor the data pipeline.
Kumu: ML Engineering at Kumu - Turning Models into Products
Kumu writes about its ML platform journey with the end-to-end Machine Learning Platform lifecycle. The ML platform is overwhelmingly complex, and the Kumu team suggests focusing on three basics to scale.
Automated Tests and Deployment
Sponsored - RudderStack: How Shippit Achieved a Unified View of Customers with Snowflake and RudderStack
Join us live on October 25th for a free deep-dive webinar featuring Nitt Chuenprateep, Business Systems and Data Manager at Shippit. Learn from the experts as they share the secrets of Shippit’s success and how they successfully utilized Snowflake and RudderStack to become warehouse-first.
Don’t miss this opportunity to gain expert advice on how to build your ideal data stack and achieve a unified view of your customers.
Shopify: How to Structure Your Data Team for Maximum Influence
We often joke that the Data Team is the backend of the backend. Gaining visibility in an org is the first significant challenge any data team faces in trying to influence a data-driven culture. Shopify writes an exciting blog narrating how to structure the data team to maximize its influence in an org.
Analytics @ Meta: Analytics and Product-Market Fit
When developing new products, the big question we seek to answer is, “Does this product have product-market fit?” Analytics plays a central role in addressing this question. The Meta team writes an exciting blog on how to approach PMF (Product-Market Fit) through analytical engineering.
Dan Frank: Experimentation Platform in a Day
Experimentation plays a vital role in analytical engineering. Should one buy expensive software to run experiments? Is it complex to build an in-house experimentation platform? The author writes a simple enough hack to start experimenting without waiting to build a platform or buy experimentation software.
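The linked post's own hack is not reproduced here. As a rough illustration of how little is needed to get started, one common technique (an assumption on my part, not necessarily the author's method) is deterministic hash-based assignment: hash the user id together with the experiment name and use the result to pick a variant. The function name `assign_variant` and the experiment name below are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a user to a variant for a given experiment.

    Hypothetical helper for illustration only; it is not taken from the
    linked article. The same (experiment, user_id) pair always yields the
    same variant, so no assignment table has to be stored.
    """
    # Salting with the experiment name keeps assignments independent
    # across different experiments.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Interpret the hash as an integer and bucket it across the variants.
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


if __name__ == "__main__":
    # "new-checkout-flow" is a made-up experiment name.
    for uid in ("user-1", "user-2", "user-3"):
        print(uid, assign_variant(uid, "new-checkout-flow"))
```

Because the assignment is a pure function of the inputs, it can be recomputed at analysis time by joining event data on the user id, which is often enough for a first experiment before investing in a platform.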
Data Engineering in 2022: Exploring dbt with DuckDB
DuckDB, an in-process database management system, has gained good traction recently. Selectively mixing DuckDB into Pandas workloads improves data join performance significantly. The author writes a step-by-step guide on using dbt and DuckDB.
Prabhuk Karthi STB: 10 Key Takeaways From Google Cloud Next22
Bye Bye Google Data Studio!! Google is going all in on Looker as the default BI layer for Google BigQuery. Google also announced support for unstructured data analytics & BigQuery ML pipeline integration with Vertex AI.
Fast.ai: 1st Two Lessons of From Deep Learning Foundations to Stable Diffusion
Fast.ai published its course content on From Deep Learning Foundations to Stable Diffusion. Since its introduction, Stable Diffusion has gained a lot of momentum, all the way to the possibility of rendering Cinema 4D scenes natively. I’m looking forward to taking this course.
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.