Data Engineering Weekly #151
The Weekly Data Engineering Newsletter
RudderStack, one of the leading alternatives to Segment, is the Warehouse Native CDP, built to help data teams deliver value across the entire data activation lifecycle, from collection to unification and activation. Visit rudderstack.com to learn more.
GitHub: The architecture of today’s LLM applications
LLMs are slowly changing the application architecture landscape as they become integral to app development. GitHub writes an excellent blog post capturing the current state of LLM integration architecture.
Microsoft: Generative AI for Beginners
Understanding Gen-AI is becoming a mandatory skill for application developers and data engineers. I found this GitHub tutorial from Microsoft to be an excellent resource if you’re beginning your journey to understand the Gen-AI landscape.
Airbnb: Data Quality Score: The next chapter of data quality at Airbnb
If it can be measured, it can be improved
Another excellent post from Airbnb discusses how to measure data quality and offers suggestions for improving it. In the classic carrot-and-stick framing, a thoughtfully designed system that gives people an incentive to improve goes much further than enforcement alone, as the author notes.
We made the decision that we could no longer rely on enforcement to scale data quality at Airbnb, and we instead needed to rely on the incentivization of both the data producer and consumer.
Netflix: Incremental Processing using Netflix Maestro and Apache Iceberg
Netflix writes about its incremental processing design with its orchestration engine Maestro on top of Iceberg. The blog is an excellent read to understand late-arriving data, backfilling, and incremental processing complications.
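To make the late-arriving data problem concrete, here is a minimal sketch in Python. It is not Maestro's or Iceberg's API, and it is not taken from the Netflix post; the `Record` type and the `partitions_to_reprocess` helper are hypothetical names introduced only for illustration. The idea is that an incremental run looks at rows that landed since the previous run and reprocesses only the event-date partitions those rows touch, which may include old partitions when data arrives late.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Iterable, Set


@dataclass(frozen=True)
class Record:
    """Hypothetical row: when the event happened vs. when it was ingested."""
    event_date: date       # partition the row belongs to (event time)
    ingested_at: datetime  # when the row actually landed in the table


def partitions_to_reprocess(records: Iterable[Record],
                            last_run_at: datetime) -> Set[date]:
    """Return the event-date partitions touched by rows ingested after the last run.

    Late-arriving rows have an old event_date but a recent ingested_at, so
    their partitions are picked up even though they are in the past.
    """
    return {r.event_date for r in records if r.ingested_at > last_run_at}


rows = [
    Record(date(2023, 11, 1), datetime(2023, 11, 1, 23, 0)),  # seen by the previous run
    Record(date(2023, 11, 2), datetime(2023, 11, 2, 10, 0)),  # new, on time
    Record(date(2023, 11, 1), datetime(2023, 11, 2, 11, 0)),  # new, but late for Nov 1
]
changed = partitions_to_reprocess(rows, last_run_at=datetime(2023, 11, 2, 0, 0))
print(sorted(changed))
```

In this toy run, both the November 1 and November 2 partitions are selected: the first because of a late row, the second because of an on-time row. A full reprocess would instead rebuild every partition on every run.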
Sponsored: RudderStack - Your AI/ML success starts with data quality
When data science teams get bogged down with data quality issues, pressure to show some kind of value increases, so they begin to prioritize projects based on data availability instead of impact and business need.
Lackluster AI/ML results often stem from poor data quality. Here, the team at RudderStack unpacks the problem, shares best practices for solving it at the source, and details how superior data quality enables data science teams to do their best work.
Whatnot: Managing a Dynamic dbt Project: Guardrails, Guidelines, Gadgets
Everything breaks at scale, and that is a fine position to be in.
Whatnot shares a case study of exactly this: as its dbt project grew, builds became longer and less stable, and unused models piled up. I like the 3G model of Guardrails, Guidelines & Gadgets, which I’m sure I will use more often :-). The solution is simple but highly effective: adopt incremental data processing and apply ownership and linting style conventions.
Lyft: Druid Deprecation and ClickHouse Adoption at Lyft
Some legacy OLAP engines, like Druid, have high operational costs and low ROI. I have experienced drawbacks with Druid similar to those Lyft describes. Lyft writes about its refined architecture with ClickHouse and some of the challenges it found with ClickHouse during the migration.
One of the highlights for me in the blog is the next step of ClickHouse adoption; as a believer in bringing “query to the data” rather than “data to query,” I love it.
Move Flink SQL to ClickHouse—certain Flink transformations can be directly done in the destination in ClickHouse. We plan to leverage multiple new use cases in ClickHouse.
Netflix: Streamlining Membership Data Engineering at Netflix with Psyberg
Seamless support for lookback (aka reconciliation) pipelines is a must-have in any data infrastructure. Netflix writes about its membership data pipeline and how it supports the lookback approach.
DoorDash: Transforming MLOps at DoorDash with Machine Learning Workbench
DoorDash writes about ML Workbench, a centralized hub providing space for accomplishing tasks throughout the machine learning lifecycle, such as building, training, tuning, and deploying machine learning models in a production-ready environment. The account of how the workbench evolved is an exciting read on incremental architectural improvement.
Sophie Blee-Goldman: Kafka Streams and Rebalancing through the Ages
Consumers come and go.
Rebalancing, the awkward middle child.
Kafka rebalancing has come a long way, and the author walks us down memory lane through its history and the advancements made along the way.
byte[array]: Doing range gets on cloud storage for fun and profit
Cloud blob storage like S3 has become the standard for storing large volumes of data, yet we rarely talk about how well its interfaces serve that role. The author explores the current complications of accessing blob storage and what can be improved to optimize access.
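For readers new to range gets, here is a minimal sketch of the basic mechanics in Python. It is not code from the linked article, and the helper names `byte_ranges` and `range_header` are hypothetical. It only relies on the standard HTTP `Range` header syntax, `bytes=start-end` with inclusive offsets, which object stores such as S3 accept on GET requests. The sketch splits an object of known size into fixed-size chunks that could then be fetched independently or in parallel.

```python
from typing import Iterator, Tuple


def byte_ranges(object_size: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) byte offsets that cover the whole object."""
    if object_size < 0 or chunk_size <= 0:
        raise ValueError("object_size must be >= 0 and chunk_size must be > 0")
    start = 0
    while start < object_size:
        # HTTP byte ranges are inclusive, so the last byte is start + chunk - 1,
        # clamped to the final byte of the object.
        end = min(start + chunk_size, object_size) - 1
        yield start, end
        start = end + 1


def range_header(start: int, end: int) -> str:
    """Format an inclusive byte range as an HTTP Range header value."""
    return f"bytes={start}-{end}"


# A 10-byte object read in 4-byte chunks needs three requests.
headers = [range_header(s, e) for s, e in byte_ranges(10, 4)]
print(headers)  # ['bytes=0-3', 'bytes=4-7', 'bytes=8-9']
```

Each header value would be sent as the `Range` header of a separate GET request against the same object. Picking the chunk size is a trade-off between per-request overhead and parallelism.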
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.