Data Engineering Weekly #151

The Weekly Data Engineering Newsletter

Dec 04, 2023

RudderStack, one of the leading alternatives to Segment, is the Warehouse Native CDP, built to help data teams deliver value across the entire data activation lifecycle, from collection to unification and activation. Visit rudderstack.com to learn more.

GitHub: The architecture of today’s LLM applications

LLMs are steadily changing the application architecture landscape as they become integral to app development. GitHub has written an excellent blog post capturing the current state of LLM integration architecture.

Flow chart that reads from right to left, showing components of a large language model application and how they all work together. Data source for diagram is detailed here: https://github.blog/?p=74969&preview=true#the-emerging-architecture-of-llm-apps

https://github.blog/2023-10-30-the-architecture-of-todays-llm-applications/

Microsoft: Generative AI for Beginners

Understanding Gen-AI is becoming a mandatory skill for application developers and data engineers. I found this GitHub tutorial from Microsoft to be an excellent starting point if you’re beginning your journey to understand the landscape.

https://github.com/microsoft/generative-ai-for-beginners

Airbnb: Data Quality Score: The next chapter of data quality at Airbnb

If it can be measured, it can be improved

Another excellent post from Airbnb discusses how to measure data quality and offers suggestions for improving it. In the classic carrot-and-stick framing, a thoughtful system design with an incentive to improve goes much further than the stick, as the author notes.

We made the decision that we could no longer rely on enforcement to scale data quality at Airbnb, and we instead needed to rely on the incentivization of both the data producer and consumer.

https://medium.com/airbnb-engineering/data-quality-score-the-next-chapter-of-data-quality-at-airbnb-851dccda19c3

Netflix: Incremental Processing using Netflix Maestro and Apache Iceberg

Netflix writes about its incremental processing design, built with its orchestration engine Maestro on top of Apache Iceberg. The post is an excellent read for understanding the complications of late-arriving data, backfilling, and incremental processing.

New data pipeline with incremental processing

https://netflixtechblog.com/incremental-processing-using-netflix-maestro-and-apache-iceberg-b8ba072ddeeb
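To make the late-arriving data problem concrete, here is a minimal Python sketch. It is not Netflix's implementation and uses neither Maestro nor Iceberg; the row layout, the `arrival_batch` watermark, and the `incremental_refresh` helper are all assumptions made for illustration. The idea it shows is that each run looks only at rows that landed since the last run, then recomputes just the event-date partitions those rows touch.

```python
from datetime import date

# Hypothetical rows: (event_date, arrival_batch, amount). arrival_batch is an
# increasing id for the load in which the row landed (an assumed watermark).
ROWS = [
    (date(2023, 12, 1), 1, 10),
    (date(2023, 12, 1), 1, 5),
    (date(2023, 12, 2), 2, 7),
    (date(2023, 12, 1), 3, 4),  # late arrival: Dec 1 event that landed in batch 3
    (date(2023, 12, 3), 3, 8),
]


def incremental_refresh(rows, totals, last_batch):
    """Refresh only the event_date partitions touched after last_batch.

    rows       -- every row currently visible in the source table
    totals     -- dict of event_date -> aggregated amount, updated in place
    last_batch -- watermark returned by the previous run (0 for the first run)
    """
    new_rows = [r for r in rows if r[1] > last_batch]
    touched = {r[0] for r in new_rows}
    # Recompute each touched partition from all of its rows, so late data is
    # folded into the existing aggregate rather than dropped or double counted.
    for day in touched:
        totals[day] = sum(r[2] for r in rows if r[0] == day)
    new_watermark = max((r[1] for r in new_rows), default=last_batch)
    return touched, new_watermark


totals = {}
# First run sees batches 1 and 2 only.
visible = [r for r in ROWS if r[1] <= 2]
touched_1, watermark = incremental_refresh(visible, totals, 0)
# Second run sees batch 3, which includes the late Dec 1 row.
touched_2, watermark = incremental_refresh(ROWS, totals, watermark)
```

In the second run only Dec 1 and Dec 3 are recomputed; Dec 2 is left untouched, which is the saving incremental processing is after compared with a full backfill.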

Whatnot: Managing a Dynamic dbt Project: Guardrails, Guidelines, Gadgets

Everything breaks at scale, and that is a fine position to be in.

Whatnot shares one such case study on how the growth of a dbt project leads to longer build times, unstable builds, and a bloat of unused models. I like the 3G model of Guardrails, Guidelines, and Gadgets, which I’m sure I will use more often :-). The solution is simple but highly effective: adopt incremental data processing and apply ownership and linting style conventions.

https://medium.com/whatnot-engineering/managing-a-dynamic-dbt-project-929db0a134fb

Lyft: Druid Deprecation and ClickHouse Adoption at Lyft

Some legacy OLAP engines, like Druid, have high operational costs and low ROI. I have run into drawbacks with Druid similar to the ones Lyft describes. Lyft writes about its refined architecture with ClickHouse and some of the challenges it encountered during the migration.

One of the highlights for me in the blog is the next step of ClickHouse adoption; as a believer in bringing “query to the data” rather than “data to query,” I love it.

Move Flink SQL to ClickHouse—certain Flink transformations can be directly done in the destination in ClickHouse. We plan to leverage multiple new use cases in ClickHouse.

https://eng.lyft.com/druid-deprecation-and-clickhouse-adoption-at-lyft-120af37651fd

Netflix: Streamlining Membership Data Engineering at Netflix with Psyberg

Seamless lookback support, aka a reconciliation pipeline, is a must-have for any data infrastructure that runs data pipelines. Netflix writes about its membership data pipeline and how it supports the lookback approach.

https://netflixtechblog.com/1-streamlining-membership-data-engineering-at-netflix-with-psyberg-f68830617dd1

DoorDash: Transforming MLOps at DoorDash with Machine Learning Workbench

DoorDash writes about ML Workbench, a centralized hub for accomplishing tasks throughout the machine learning lifecycle, such as building, training, tuning, and deploying machine learning models in a production-ready environment. The evolution of the workspace is an exciting read on incremental architectural improvement.

https://doordash.engineering/2023/11/28/transforming-mlops-at-doordash-with-machine-learning-workbench/

Sophie Blee-Goldman: Kafka Streams and Rebalancing through the Ages

Consumers come and go.

Partitions, ever-present.

Rebalancing, the awkward middle child.

Kafka rebalancing has come a long way since its early days, and the author walks us down memory lane through its history and the advancements made along the way.

https://www.responsive.dev/blog/kafka-streams-history-of-rebalancing

byte[array]: Doing range gets on cloud storage for fun and profit

Cloud blob storage like S3 has become the standard for storing large volumes of data, yet we rarely talk about how well its interfaces serve that role. The author explores the current complications of accessing blob storage and what could be improved to optimize access.

https://substack.com/@bytearray/posts
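For readers new to range gets, here is a minimal Python sketch of the general idea. It is not taken from the post. The `plan_ranges`, `range_header`, and `parallel_read` names are made up for illustration, and `fake_get` is a stand-in that slices an in-memory byte string instead of calling a real object store. The one fact it relies on is that an HTTP `Range` header uses inclusive byte offsets, such as `bytes=0-3`.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_ranges(size, chunk):
    """Split [0, size) into inclusive (start, end) ranges of at most chunk bytes."""
    return [(s, min(s + chunk, size) - 1) for s in range(0, size, chunk)]


def range_header(start, end):
    # HTTP Range headers use inclusive byte offsets, e.g. "bytes=0-3".
    return f"bytes={start}-{end}"


def fake_get(blob, header):
    """Hypothetical stand-in for a ranged GET; slices an in-memory blob."""
    start, end = (int(x) for x in header.split("=")[1].split("-"))
    return blob[start:end + 1]  # +1 because the header's end offset is inclusive


def parallel_read(blob, chunk, workers=4):
    """Fetch every range concurrently and stitch the parts back in order."""
    headers = [range_header(s, e) for s, e in plan_ranges(len(blob), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so a plain join is enough.
        parts = list(pool.map(lambda h: fake_get(blob, h), headers))
    return b"".join(parts)
```

Against a real store, `fake_get` would be replaced by a client call that sends the same header; the planning and reassembly logic would stay the same.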

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
