Data Engineering Weekly #132

The Weekly Data Engineering Newsletter

May 29, 2023

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

Editor’s Note: DEW featured in Airbyte’s State of the Data & Slack’s usage of Kafka

DEW has been recognized as the number one individually run data newsletter in the industry, according to the latest Airbyte poll! I am truly humbled and overwhelmed by your continuous support and encouragement.

Over the last six months, many readers have contacted me to ask about growing in their careers as data engineers or how to switch to data engineering. There are people more qualified than me who can write for and guide our audience. If you want to write a career guidance series for Data Engineering Weekly, please DM me on LinkedIn. I'm more than happy to collaborate and help the community.

Over the weekend, I found an excellent thread on how Slack uses Kafka, and I want to highlight one piece of it.

I’m thrilled to see this, and to see the design choices we made at that time still echoing. The fundamental design principles that made the cut-over operational model possible are:

The producer [Murron] is built with dynamic routing capabilities to reroute traffic to multiple destinations.
Treat Kafka as immutable infra
Adopt multi-instance over multi-tenant

We discussed Murron's design in detail here if anyone wants to know more about it.
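To make the first principle concrete, here is a minimal sketch of a producer with a routing table that can be changed at runtime. It is not Murron's actual implementation: the `DynamicRouter` class, the cluster names, and the recording senders that stand in for real Kafka clients are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical stand-in for a real Kafka producer client:
# a callable that takes (stream name, payload).
Sender = Callable[[str, bytes], None]


class DynamicRouter:
    """Routes each logical stream to one or more destination clusters."""

    def __init__(self, senders: Dict[str, Sender]) -> None:
        self._senders = senders                    # cluster name -> sender
        self._routes: Dict[str, List[str]] = {}    # stream -> cluster names

    def set_route(self, stream: str, clusters: List[str]) -> None:
        """Replace the destinations for a stream without restarting anything."""
        unknown = [c for c in clusters if c not in self._senders]
        if unknown:
            raise ValueError(f"unknown clusters: {unknown}")
        self._routes[stream] = list(clusters)

    def send(self, stream: str, payload: bytes) -> None:
        """Deliver the payload to every cluster currently routed for the stream."""
        for cluster in self._routes.get(stream, []):
            self._senders[cluster](stream, payload)


def make_recording_sender(log: dict, cluster: str) -> Sender:
    """Build a fake sender that records what each cluster would receive."""
    def send(stream: str, payload: bytes) -> None:
        log[cluster].append((stream, payload))
    return send


received = defaultdict(list)
router = DynamicRouter({
    "kafka-old": make_recording_sender(received, "kafka-old"),
    "kafka-new": make_recording_sender(received, "kafka-new"),
})

router.set_route("clicks", ["kafka-old"])               # steady state
router.send("clicks", b"e1")
router.set_route("clicks", ["kafka-old", "kafka-new"])  # dual-write during migration
router.send("clicks", b"e2")
router.set_route("clicks", ["kafka-new"])               # cut over to the new cluster
router.send("clicks", b"e3")
```

Because routing lives in the producer, a cluster never has to be modified in place: a new one is stood up, traffic is shifted to it, and the old one is retired, which fits the "immutable infra" and "multi-instance" principles above.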

Cowboy Ventures: The New Generative AI Infra Stack

Generative AI has taken the tech industry by storm. In Q1 2023, a whopping $1.7B was invested into gen AI startups. Cowboy Ventures unbundles the various categories of the generative AI infra stack here.

https://medium.com/cowboy-ventures/the-new-infra-stack-for-generative-ai-9db8f294dc3f

Honeycomb: All the Hard Stuff Nobody Talks About when Building Products with LLMs

Continuing the focus on LLMs: every company in the world is trying to figure out how LLMs fit into its product offering and user experience. Honeycomb writes an excellent article from a developer's perspective showing the hard parts of integrating an LLM into a product experience.

https://www.honeycomb.io/blog/hard-stuff-nobody-talks-about-llm

Coinbase: Databricks cost management at Coinbase

Effective cost management in data engineering is crucial as it maximizes the value gained from data insights while minimizing expenses. It ensures sustainable and scalable data operations, fostering a balanced business growth path in the data-driven era. Coinbase shares a case study on cost management for Databricks and how it uses the open-source Overwatch tool to manage Databricks costs.

https://www.coinbase.com/blog/databricks-cost-management-at-coinbase

Walmart: Exploring an Entity Resolution Framework Across Various Use Cases

Entity resolution, a crucial process that identifies and links records representing the same entity across various data sources, is indispensable for generating powerful insights about relationships and identities. This process, often leveraging fuzzy matching techniques, not only enhances data quality but also facilitates nuanced decision-making by effectively managing relationships and tracking potential matches among data records. Walmart writes about the pros and cons of rule-based versus ML-based approaches to fuzzy matching.

https://medium.com/walmartglobaltech/exploring-an-entity-resolution-framework-across-various-use-cases-cb172632e4ae
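As a rough illustration of the rule-based side of that comparison, the sketch below links two records when their emails match exactly, or when they share a postal code and have similar names. It is not Walmart's framework: the record fields, the two rules, and the 0.85 threshold are assumptions chosen for the example, and the similarity score comes from Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, drop commas and periods, and collapse whitespace."""
    cleaned = text.lower().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_same_entity(rec_a: dict, rec_b: dict, name_threshold: float = 0.85) -> bool:
    """Apply two hand-written rules to decide whether records match."""
    # Rule 1: a non-empty, case-insensitive email match is decisive.
    email_a = rec_a.get("email", "").strip().lower()
    email_b = rec_b.get("email", "").strip().lower()
    if email_a and email_a == email_b:
        return True

    # Rule 2: same postal code and sufficiently similar names.
    zip_a, zip_b = rec_a.get("zip", ""), rec_b.get("zip", "")
    if not zip_a or zip_a != zip_b:
        return False
    return similarity(rec_a["name"], rec_b["name"]) >= name_threshold


# Illustrative records (made up for this example).
jon = {"name": "Jon Smith", "email": "", "zip": "94107"}
john = {"name": "John Smith", "email": "", "zip": "94107"}
mary = {"name": "Mary Jones", "email": "", "zip": "94107"}

print(is_same_entity(jon, john))  # similar names, same zip
print(is_same_entity(jon, mary))  # same zip, dissimilar names
```

Rules like these are easy to explain and audit, but every threshold has to be tuned by hand. An ML-based matcher would instead learn how to weigh such similarity features from labeled pairs.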

Zendesk: dbt at Zendesk — Part I: Setting foundations for scalability

Adoption of every new technology starts small and grows over time, especially in complex systems like Zendesk's. It is exciting to see a dbt case study coming out of Zendesk, focusing on foundations and scalability.

https://zendesk.engineering/dbt-at-zendesk-part-i-setting-foundations-for-scalability-34b55e6a6aa1

Instawork: Unlocking the Power of Data: How we scaled our analytics with in-house Event Logging

Event collection at scale brings its own challenges. Instawork writes about its in-house event logging system. The blog walks through the workings of the event collector, writing to Kafka, and S3 storage with Lambda triggers.

https://engineering.instawork.com/unlocking-the-power-of-data-how-we-scaled-our-analytics-with-an-in-house-event-logging-platform-520d5b58f651

Florian Trehaut: Ensuring GDPR Compliance on GCP BigQuery: Efficiently Managing the Right to Be Forgotten

I have always been concerned that, as a community, we seldom discuss designing data engineering systems for regulatory requirements. I'm glad to see this article, in which the author explains what the Right to Be Forgotten (RTBF) is and discusses an architectural pattern for it in Google BigQuery.

https://medium.com/@florian.trehaut/ensuring-gdpr-compliance-on-gcp-bigquery-efficiently-managing-the-right-to-be-forgotten-a76137944633

Matt Palmer: What's the hype behind DuckDB?

So, is DuckDB hype, or does it have the real potential to bring architectural changes to the data warehouse? The author explains how DuckDB works and its potential impact on data engineering.

https://mattpalmer.io/posts/whats-the-hype-duckdb/

Hugo Lu: Why Orchestration is the next hot thing in Data

If I put on a purist data engineer's hat, data orchestration, data lineage, data testing, and data catalogs should all be one system. They are not separate categories.

I’m glad to read a take on orchestration engines that expresses similar thoughts and questions why there is so little innovation in the orchestration space.

https://medium.com/@hugolu87/why-orchestration-is-the-next-hot-thing-in-data-69bc32402446

Sam Moris: Crafting your DBT development workflow

Adopting a technology is not only about the individual tool; you need an ecosystem of supporting tools. The author writes about such an ecosystem for dbt, covering the usage of:

dbt-project-evaluator
sqlfluff
pre-commit-dbt
dbt-coverage
PR Template

https://medium.com/cts-technologies/crafting-your-dbt-development-workflow-35577d3b573d

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
