Data Engineering Weekly #104

The Weekly Data Engineering Newsletter

Oct 23, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

Editor’s Note: DEW is the reader’s choice & Is Data Catalog living up to the hype?

Hello Data Friends, welcome to another edition of Data Engineering Weekly. First, I’m thrilled to see this poll from Airbyte: 48.5% of data engineers say they read Data Engineering Weekly to keep up with the data engineering landscape. Thank you all for your kind support. ❤️❤️❤️

Ananth Packkildurai@ananthdurai

Super thrilled to see that 48.5% of my dear data friends say that they use @data_weekly to keep up with the data engineering landscape ❤️❤️❤️

8:04 PM · Oct 18, 2022

20 Likes

Top of mind for me this week is the Data Catalog. I'm one of the early advocates for Data Catalogs and am excited about their possibilities. Data Engineering Weekly even published a special Metadata Edition focusing on the historical development of the Data Catalog.

https://www.dataengineeringweekly.com/p/data-engineering-weekly-21-metadata

It has been almost two years since we published the Metadata Edition, and I keep coming back to one question: do Data Catalogs live up to their promise? So I published an open poll on LinkedIn to find out.

https://www.linkedin.com/posts/ananthdurai_dataengineering-datacatalog-activity-6989772780862873600-YFmA/

We will talk more about Data Catalogs in the coming weeks. Meanwhile, share your thoughts about Data Catalogs in the poll & comments section. Oh, and a humble request to the Data Catalog vendors: please abstain from the poll ❤️

Rittman Analytics: The dbt Semantic Layer, Data Orchestration, and the Modern Enterprise Data Stack

Last week was an eventful one for dbt with the Coalesce conference. I missed attending in person this time but caught some of the tech talks via live stream. My reaction to the conference:

Ananth Packkildurai@ananthdurai

@pdrmnvd I wish there were more dbt internals tech talks :-(, but some great case studies by practitioners compensated for it.

4:36 PM · Oct 22, 2022

4 Likes

Ananth Packkildurai@ananthdurai

@pdrmnvd Top ones for me are, Json Schema from @aerialfly & @emilyhawkins__ data indigestion from @Spotify, @HubSpot design as a daily activity, outgrowing dbt run by @voxdotcom & is Kimball still relevant from @JayPeeDevlin

11:29 PM · Oct 22, 2022

10 Likes

You can watch all the recordings of the talks here

Two significant announcements came out of the dbt conference:

Python language support in dbt Core
Public preview of the semantic layer

The author narrates an in-depth view of the dbt semantic layer.

https://blog.rittmananalytics.com/the-dbt-semantic-layer-data-orchestration-and-the-modern-enterprise-data-stack-78d9d9ed5c18

Ben Rogojan: The Next Generation Of All-In-One Data Stacks

We have debated bundling vs. unbundling a lot. Are all-in-one data stacks the future? Ben's article is timely, as dbt unveils the semantic layer to act as the hub of the analytical ecosystem. The author compares five available all-in-one data platforms and discusses their pros & cons.

https://medium.com/coriers/the-next-generation-of-all-in-one-data-stacks-f46069ad10fd

[LAST CALL] There's still time to RSVP for IMPACT 2022 The Annual Data Observability Summit on October 25-26, 2022!

Don't miss a chance to get candid with your data peers on the hottest topics in data, learn about 2023 trends, and hear from the biggest names in data and analytics about the ideas and technologies pioneering our industry. Featuring founders and data leaders from dbt Labs, Fivetran, The New York Times, GitLab, and Fox Corporation.

Data Engineering Weekly readers, Get Your Free Ticket!!!

Criteo: Highlights of RecSys 2022

RecSys is a leading conference focusing on industrial recommender engine implementation. Criteo published its key takeaways from the 2022 RecSys conference. TIL about AI-Mediated Communication and its impact.

https://medium.com/criteo-engineering/highlights-of-recsys-2022-c136a9b6fbd0

Netflix: Orchestrating Data/ML Workflows at Scale With Netflix Maestro

Netflix writes about Maestro, its workflow orchestrator that can schedule and manage workflows at a massive scale. The article is a fantastic system design read on how to build a scalable orchestration engine. It is one of the very few systems I wish were open-sourced soon.

https://netflixtechblog.com/orchestrating-data-ml-workflows-at-scale-with-netflix-maestro-aaa2b41b800c

Checkout.com: Testing & Monitoring the Data Platform at Scale

Data testing and data observability are vital to maintaining data quality in a complex data pipeline. Checkout.com writes about how it uses dbt tests, Monte Carlo, and Datadog to test & monitor the data pipeline.

https://medium.com/checkout-com-techblog/testing-monitoring-the-data-platform-at-scale-e22d9cf433e8
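To make the idea of a data test concrete: a dbt test is, at its core, a SQL query that selects failing rows, and the test passes when the query returns none. The sketch below mimics that idea with Python's built-in sqlite3 module rather than dbt itself. The `payments` table, its columns, and the `failing_rows` helper are hypothetical and are not taken from the Checkout.com article.

```python
import sqlite3

# Hypothetical in-memory table standing in for a warehouse model.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (payment_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?)",
    [(1, 10.0), (2, 25.5), (2, 25.5), (3, None)],
)

# Each check is a query that selects the rows violating an expectation.
CHECKS = {
    "payment_id_is_unique": """
        SELECT payment_id FROM payments
        GROUP BY payment_id HAVING COUNT(*) > 1
    """,
    "amount_is_not_null": """
        SELECT payment_id FROM payments WHERE amount IS NULL
    """,
}


def failing_rows(connection, sql):
    """Return the rows selected by a check query; an empty list means the check passed."""
    return connection.execute(sql).fetchall()


results = {name: failing_rows(conn, sql) for name, sql in CHECKS.items()}
for name, rows in results.items():
    status = "PASS" if not rows else f"FAIL ({len(rows)} failing row(s))"
    print(f"{name}: {status}")
```

In a real dbt project, these two checks would more likely be declared with dbt's built-in `unique` and `not_null` tests than written by hand.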

Kumu: ML Engineering at Kumu - Turning Models into Products

Kumu writes about its ML platform journey across the end-to-end machine learning lifecycle. An ML platform is overwhelmingly complex, and the Kumu team suggests focusing on three basics to scale:

Code Maintainability
Automated Tests and Deployment
Deployment Governance

https://medium.com/@karlitodata/ml-engineering-at-kumu-turning-models-into-products-b2b4faeb2b40

Shopify: How to Structure Your Data Team for Maximum Influence

We often joke that the data team is the backend of the backend. Gaining visibility in an org is the first significant challenge any data team faces in fostering a data-driven culture. Shopify writes an exciting blog on how to structure the data team to maximize its influence in an org.

https://shopifyengineering.myshopify.com/blogs/engineering/how-to-structure-data-teams

Analytics @ Meta: Analytics and Product-Market Fit

When developing new products, the big question we seek to answer is, “Does this product have product-market fit?” Analytics plays a central role in addressing this question. The Meta team writes an exciting blog on how to approach PMF (Product-Market Fit) through analytical engineering.

https://medium.com/@AnalyticsAtMeta/analytics-and-product-market-fit-11efaea403cd

Dan Frank: Experimentation Platform in a Day

Experimentation plays a vital role in analytical engineering. Should one buy expensive software to run experiments? Is it complex to build an in-house experimentation platform? The author describes a simple enough hack to start experimenting without waiting to build a platform or buy experimentation software.

https://medium.com/deliberate-data-science/experimentation-platform-in-a-day-c60646ef1a2
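One building block that keeps a lightweight experimentation setup simple is deterministic, hash-based variant assignment: no assignment table is needed because the same user always hashes to the same bucket. The sketch below illustrates that general technique and is not a reproduction of the article's method. The `assign_variant` function, the experiment name, and the user IDs are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically map a user to a variant for a given experiment."""
    # Salt with the experiment name so assignments differ across experiments.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Interpret the hash as an integer and bucket it across the variants.
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


# Hypothetical experiment name and user IDs.
users = [f"user-{i}" for i in range(10_000)]
assignments = {u: assign_variant(u, "new_checkout_button") for u in users}
share = sum(v == "treatment" for v in assignments.values()) / len(users)
print(f"treatment share: {share:.3f}")
```

Because assignment is a pure function of the user ID and experiment name, it can be recomputed at analysis time and joined against event data without storing exposure records up front.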

Data Engineering in 2022: Exploring dbt with DuckDB

DuckDB, an in-process database management system, has gained good traction recently. Selectively mixing DuckDB into Pandas workloads improves data join performance significantly. The author writes a step-by-step guide on using dbt and DuckDB.

https://rmoff.net/2022/10/20/data-engineering-in-2022-exploring-dbt-with-duckdb/

Prabhuk Karthi STB: 10 Key Takeaways From Google Cloud Next22

Bye Bye Google Data Studio!! Google is going all in on Looker as the default BI layer for Google BigQuery. Google also announced support for unstructured data analytics & BigQuery ML pipeline integration with Vertex AI.

https://medium.com/google-cloud/10-key-take-aways-from-google-cloud-next22-d5def84a3cf4

Fast.ai: 1st Two Lessons of From Deep Learning Foundations to Stable Diffusion

Fast.ai published its course content for From Deep Learning Foundations to Stable Diffusion. Since its introduction, Stable Diffusion has gained a lot of momentum, all the way to the possibility of rendering Cinema 4D scenes natively. I’m looking forward to taking this course.

https://www.fast.ai/posts/part2-2022-preview.html

All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
