Data Engineering Weekly

Data Engineering Weekly #132

www.dataengineeringweekly.com

The Weekly Data Engineering Newsletter

Ananth Packkildurai
May 29, 2023
Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make collecting data from every application, website, and SaaS platform easy, then activating it in your warehouse and business tools. Sign up free to test out the tool today.


Editor’s Note: DEW featured in Airbyte’s State of the Data & Slack’s usage of Kafka

DEW has been recognized as the number one individually run data newsletter in the industry, according to the latest Airbyte poll! I am truly humbled and overwhelmed by your continuous support and encouragement.

Over the last six months, many readers have contacted me to ask about growing in their careers as data engineers or how to switch to data engineering. There are people more qualified than me to write and guide our audience. If you want to write a career guidance series for Data Engineering Weekly, please DM me on LinkedIn. I'm more than happy to collaborate and help the community.

Over the weekend, I found an excellent thread on how Slack uses Kafka, and I want to highlight one piece of it.

I’m thrilled to see this and the design choices we made at that time still echoing. The fundamental design principles that made the cut-over operational model possible are:

  1. The producer [Murron] is built with dynamic routing capabilities to reroute traffic to multiple destinations.

  2. Treat Kafka as immutable infra

  3. Adopt multi-instance over multi-tenant

We discussed Murron's design in detail here if anyone wants to know more about it.
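
To make the first principle concrete, here is a minimal Python sketch of a producer-side routing table that can be changed at runtime to cut traffic over from one cluster to another. It is an illustration only, not Murron's actual implementation: the `DynamicRouter` class, the cluster names, the dual-write step, and the in-memory `sent` log are all assumptions, and a real producer would hand records to a Kafka client instead.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class DynamicRouter:
    """Hypothetical producer-side router: maps a stream name to destination clusters."""

    def __init__(self, routes: Dict[str, List[str]]) -> None:
        # Copy the lists so later updates never mutate the caller's config.
        self._routes = {stream: list(dests) for stream, dests in routes.items()}
        # Stand-in for real Kafka clients: records "sent" per cluster.
        self.sent: Dict[str, List[Tuple[str, bytes]]] = defaultdict(list)

    def update_route(self, stream: str, destinations: List[str]) -> None:
        """Swap the destinations for a stream without restarting the producer."""
        self._routes[stream] = list(destinations)

    def send(self, stream: str, payload: bytes) -> List[str]:
        """Deliver the payload to every cluster currently routed for the stream."""
        destinations = self._routes.get(stream, [])
        for cluster in destinations:
            self.sent[cluster].append((stream, payload))
        return list(destinations)


router = DynamicRouter({"clicks": ["kafka-old"]})
router.send("clicks", b"event-1")

# Cut-over step 1 (assumed): dual-write to the old and the new cluster.
router.update_route("clicks", ["kafka-old", "kafka-new"])
router.send("clicks", b"event-2")

# Cut-over step 2 (assumed): stop writing to the old cluster.
router.update_route("clicks", ["kafka-new"])
router.send("clicks", b"event-3")
```

Because the routing lives in the producer, the old cluster never has to be modified in place. It is simply drained and retired, which fits the "immutable infra" and "multi-instance" principles in the list.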


Cowboy Ventures: The New Generative AI Infra Stack

Generative AI has taken the tech industry by storm. In Q1 2023, a whopping $1.7B was invested into gen AI startups. Cowboy Ventures unbundles the various categories of the generative AI infra stack here.

https://medium.com/cowboy-ventures/the-new-infra-stack-for-generative-ai-9db8f294dc3f


Honeycomb: All the Hard Stuff Nobody Talks About when Building Products with LLMs

Continuing the focus on LLMs: every company in the world is trying to figure out how LLMs fit into its product offering and user experience. Honeycomb writes an excellent article from a developer's perspective, showing the hard parts of integrating LLMs into a product experience.

https://www.honeycomb.io/blog/hard-stuff-nobody-talks-about-llm


Coinbase: Databricks cost management at Coinbase

Effective cost management in data engineering is crucial as it maximizes the value gained from data insights while minimizing expenses. It ensures sustainable and scalable data operations, fostering a balanced business growth path in the data-driven era. Coinbase writes a case study on Databricks cost management and how it uses the open-source Overwatch tool to manage Databricks costs.

https://www.coinbase.com/blog/databricks-cost-management-at-coinbase


Walmart: Exploring an Entity Resolution Framework Across Various Use Cases

Entity resolution, a crucial process that identifies and links records representing the same entity across various data sources, is indispensable for generating powerful insights about relationships and identities. This process, often leveraging fuzzy matching techniques, not only enhances data quality but also facilitates nuanced decision-making by effectively managing relationships and tracking potential matches among data records. Walmart writes about the pros and cons of approaching fuzzy matching with rule-based and ML-based matching.

https://medium.com/walmartglobaltech/exploring-an-entity-resolution-framework-across-various-use-cases-cb172632e4ae
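
As a rough illustration of the rule-based side of that trade-off, the sketch below pairs an exact rule (same ZIP code) with a fuzzy name comparison from Python's standard library `difflib`. It is not Walmart's framework: the record fields, the sample data, and the 0.85 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(value.lower().split())


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_match(left: Dict[str, str], right: Dict[str, str], threshold: float = 0.85) -> bool:
    """Rule: ZIP codes must be identical, and names similar above the threshold."""
    if left["zip"] != right["zip"]:
        return False
    return similarity(left["name"], right["name"]) >= threshold


def find_matches(records: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Compare every pair of records and return the IDs of likely duplicates."""
    pairs = []
    for i, left in enumerate(records):
        for right in records[i + 1:]:
            if is_match(left, right):
                pairs.append((left["id"], right["id"]))
    return pairs


# Hypothetical sample records; only the first two should be linked.
records = [
    {"id": "a1", "name": "Jonathan Smith", "zip": "94105"},
    {"id": "b7", "name": "jonathan  smyth", "zip": "94105"},
    {"id": "c3", "name": "Jonathan Smith", "zip": "10001"},
]
matches = find_matches(records)
```

Rules like this are easy to explain and audit, but every threshold is hand-tuned, and comparing every pair is quadratic in the number of records. Both limits are part of why ML-based matching becomes attractive at scale.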


Sponsored: [Virtual Data Panel] Measuring Data Team ROI

As data leaders, one of our top priorities is to measure ROI. From tracking the efficacy of marketing campaigns to understanding the root cause of new spikes in user engagement, we’re tasked with keeping tabs on the business's health at all levels. But what about the ROI of our own teams? Watch a panel of data leaders as they discuss how to build strategies for measuring data team ROI.

Watch On-demand


Zendesk: dbt at Zendesk — Part I: Setting foundations for scalability

Every new technology starts small and grows in adoption over time, especially in complex systems like Zendesk's. It is exciting to see a dbt case study coming out of Zendesk, focusing on foundations and scalability.

https://zendesk.engineering/dbt-at-zendesk-part-i-setting-foundations-for-scalability-34b55e6a6aa1


Instawork: Unlocking the Power of Data: How we scaled our analytics with in-house Event Logging

Event collection at scale brings its own challenges. Instawork writes about its in-house event logging system. The blog narrates the workings of the event collector, writing to Kafka, and S3 storage with Lambda triggers.

https://engineering.instawork.com/unlocking-the-power-of-data-how-we-scaled-our-analytics-with-an-in-house-event-logging-platform-520d5b58f651


Sponsored: Your Google Analytics Account Needs Immediate Attention 😱

Get this email recently? Love it or hate it, GA4 is a fact of life for many of us. Getting the most out of the tool requires a hybrid implementation to capture data server-side and client-side, but the tools Google provides make this setup complicated and unfulfilling. That’s why RudderStack built a hybrid deployment option for their GA4 integration. It’s a single-step deployment that makes capturing all the data you need for attribution easy while ensuring optimal site performance and ad blocker resiliency.

Learn how to implement GA4 for ad blocker resilience and accurate attribution.


Florian Trehaut: Ensuring GDPR Compliance on GCP BigQuery: Efficiently Managing the Right to Be Forgotten

I have always been concerned that, as a community, we seldom discuss designing data engineering systems for regulatory requirements. I'm glad to see this article, where the author explains what the Right to Be Forgotten (RTBF) is and discusses an architectural pattern for it in Google BigQuery.

https://medium.com/@florian.trehaut/ensuring-gdpr-compliance-on-gcp-bigquery-efficiently-managing-the-right-to-be-forgotten-a76137944633


Matt Palmer: What's the hype behind DuckDB?

So, DuckDB: is it hype, or does it have real potential to bring architectural changes to the data warehouse? The author explains how DuckDB works and its potential impact on data engineering.

https://mattpalmer.io/posts/whats-the-hype-duckdb/


Hugo Lu: Why Orchestration is the next hot thing in Data

If I put on a purist data engineer hat, data orchestration, data lineage, data testing, and data catalogs should all be one system. They are not separate categories.

I’m glad to read this take on orchestration engines, which expresses similar thoughts and questions why there is so little innovation in the orchestration space.

https://medium.com/@hugolu87/why-orchestration-is-the-next-hot-thing-in-data-69bc32402446


Sam Moris: Crafting your DBT development workflow

Adopting technology is not only about the individual tool; you need an ecosystem of supporting tools. The author writes about such an ecosystem of tools for dbt, narrating the usage of

  1. dbt-project-evaluator

  2. SQLFluff

  3. pre-commit-dbt

  4. dbt-coverage

  5. PR Template

https://medium.com/cts-technologies/crafting-your-dbt-development-workflow-35577d3b573d


All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.


Expand full comment
Reply
Share
Top
New
Community

No posts

Ready for more?

© 2023 Ananth Packkildurai
Privacy ∙ Terms ∙ Collection notice
Start WritingGet the app
Substack is the home for great writing