Notion: A brief history of Notion’s data catalog
Notion writes about its journey adopting a data catalog and describes how a vanilla data catalog solution is only effective when built on a strong data platform foundation. Adopting TypeScript rather than a specialized IDL is a good strategy, although I wonder how it works in cross-language systems like Android & iOS. I presume the TypeScript-to-JS conversion helps here.
https://www.notion.so/blog/a-brief-history-of-notions-data-catalog
Zach Wilson: How GenAI will impact data engineering
An excellent overview of GenAI's potential impact on the data engineering workflow, which can change how we author, ship, and maintain pipelines. Suppose you think of data pipeline building as a manufacturing process that runs from sourcing the raw material to delivering a finished (data) product. There are tons of optimizations we can make to improve its efficiency. The problem statement is something I'm excited about, and I write about it here as something I'm working on.
https://blog.dataengineer.io/p/how-genai-will-impact-data-engineering
Uber: Open Source and In-House - How Uber Optimizes LLM Training
Uber's latest blog post dives into how the company trains large language models (LLMs) by blending open-source models with domain-specific expertise. Uber leverages open-source tools and infrastructure to streamline the training process while fine-tuning and optimizing the models with in-house techniques. The post outlines Uber's infrastructure stack and training pipeline, emphasizing the balance of open-source and custom solutions to drive generative AI advancements.
https://www.uber.com/blog/open-source-and-in-house-how-uber-optimizes-llm-training/
Sponsored: IMPACT Summit
If you haven't registered for the IMPACT Summit yet, now's the perfect time 🔈
Here’s what we’ve got in store:
- A half-day virtual event created to elevate your 2025 data strategy
- Sessions jam-packed with industry experts sharing how they're driving data and AI adoption
- Practical tips and best practices from Monte Carlo customers
- Opportunities to connect and network with other data professionals
- Giveaways and raffles for attendees, including three All-Access subscriptions to DataExpert.io!
- And more!
What are you waiting for? Register for IMPACT today!
Apurva Mehta: Stop embedding RocksDB in your Stream Processor!
A big part of what makes stream processing difficult is embedding the state (RocksDB) in the stream processor itself. The blog narrates the drawbacks, including increased downtime due to state rebuilds, inflexibility in scaling compute and storage resources, limited visibility into the state, and a lack of advanced functionality like time-to-live (TTL) management and easy state inspection and patching.
Life is too short to scale storage and computing together, and this space requires a fundamental rethinking.
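To make the coupling concrete, here is a minimal Python sketch, my own illustration rather than Kafka Streams or Responsive code, contrasting embedded state with a remote state store; `RemoteStore` is a hypothetical client interface (dict-backed here so the sketch runs):

```python
# Toy contrast of embedded vs. remote state in a stream processor.
# My own illustration -- not Kafka Streams or Responsive code.

class EmbeddedStateProcessor:
    """State lives in-process, like RocksDB embedded in the processor.

    On restart or rescale the local state is gone and must be rebuilt
    (e.g., by replaying a changelog) -- the downtime the post complains
    about -- and compute and storage can only scale as one unit.
    """

    def __init__(self):
        self.counts = {}  # stand-in for an embedded RocksDB instance

    def process(self, key, value):
        self.counts[key] = self.counts.get(key, 0) + value
        return key, self.counts[key]


class RemoteStore:
    """Hypothetical client for a remote state service (dict-backed here)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value


class RemoteStateProcessor:
    """State lives in a separate storage service.

    The processor is stateless, so compute scales independently of
    storage, and features like TTL or state inspection can live in the
    store rather than in the stream processor.
    """

    def __init__(self, store: RemoteStore):
        self.store = store

    def process(self, key, value):
        count = (self.store.get(key) or 0) + value
        self.store.put(key, count)
        return key, count
```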
https://www.responsive.dev/blog/stop-embedding-rocksdb-in-kafka-streams
Pinterest: Ray Batch Inference at Pinterest
The search quality team at Pinterest saw over a 30x decrease in annual cost for one of their inference jobs after migrating it from Spark to Ray.
There is an increasing desire to find a compute engine that is less costly than, but as robust as, Apache Spark. Pinterest's three-part series on Ray batch inference is an exciting read about running batch jobs on Ray.
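For readers new to Ray Data, batch inference looks roughly like the sketch below. This is my own illustration, not Pinterest's pipeline: the S3 paths are hypothetical, and the model is a trivial stand-in.

```python
# Rough sketch of batch inference with Ray Data -- not Pinterest's code.
import numpy as np
import ray


class Predictor:
    """Class-based UDF so the model loads once per actor, not per batch."""

    def __init__(self):
        # Stand-in for loading a real model checkpoint.
        self.model = lambda features: features.sum(axis=1)

    def __call__(self, batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
        batch["prediction"] = self.model(batch["features"])
        return batch


ds = ray.data.read_parquet("s3://my-bucket/features/")   # hypothetical path
preds = ds.map_batches(Predictor, batch_size=256, concurrency=4)
preds.write_parquet("s3://my-bucket/predictions/")       # hypothetical path
```

The appeal over Spark for this workload is that batches stream through a pool of GPU-ready actors with the model resident in memory, and `batch_size` and `concurrency` are explicit knobs on throughput versus resource usage.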
https://medium.com/pinterest-engineering/last-mile-data-processing-with-ray-629affbf34ff
https://medium.com/pinterest-engineering/ray-infrastructure-at-pinterest-0248efe4fd52
https://medium.com/pinterest-engineering/ray-batch-inference-at-pinterest-part-3-4faeb652e385
Confluent: Shift Left: Bad Data in Event Streams
Confluent writes about the challenges of handling bad data in event streams, specifically focusing on the differences in dealing with bad data in batch processing compared to event streaming. State events, which represent the entire state of an entity, are much easier to correct as they can be compacted, allowing for the deletion of older versions and the propagation of updated data. Delta events, which describe changes or actions, are significantly harder to fix due to their immutable nature.
The blog concludes with a final strategy, "rewind, rebuild, and retry," highlighting the importance of preventative measures like schemas, data quality checks, and robust testing to avoid such costly interventions.
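To illustrate why state events are the easier case, here is a toy Python sketch, my own illustration rather than Confluent's code, of how compaction lets a corrected state event supersede a bad one, while a bad delta has already been folded into downstream aggregates:

```python
# Toy illustration of why state events compact away bad data.
# My own sketch -- not Confluent's code.

events = [
    {"key": "user-1", "email": "wrong@example.com"},  # bad state event
    {"key": "user-2", "email": "ok@example.com"},
    {"key": "user-1", "email": "right@example.com"},  # correction: same key
]

# Compaction keeps only the latest event per key, so publishing a
# corrected state event makes the bad record disappear downstream.
compacted = {}
for event in events:
    compacted[event["key"]] = event
print(list(compacted.values()))

# Delta events have no such escape hatch: a bad "+100 points" delta has
# already been folded into every consumer's running total, which is why
# fixing it requires the "rewind, rebuild, and retry" strategy.
bad_delta = {"key": "user-1", "points": +100}  # immutable once consumed
```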
https://www.confluent.io/blog/shift-left-bad-data-in-event-streams-part-2/
Christophe Oudar: Microbatch on dbt?
The article is an excellent narration of the recent dbt support for microbatch incremental workloads. I'm happy about the support coming to dbt, as I worked out a similar incremental batch processing strategy to the one the author explains here, using dbt-core with Airflow. I still remember a veteran data engineer at work telling me how dbt had gotten away without this feature until now :-) I'm sure we are not alone here. Many companies would have done something similar, so I'm glad it is finally coming.
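The core idea, stripped of dbt specifics, is to split an incremental load into fixed time windows and process each one idempotently. A minimal Python sketch of the pattern, my simplification rather than dbt's implementation, with `run_model` as a hypothetical stand-in:

```python
# Sketch of the microbatch pattern: one idempotent run per time window.
# My simplification of the idea -- not dbt's actual implementation.
from datetime import date, timedelta


def run_model(window_start: date, window_end: date) -> None:
    # Hypothetical stand-in for "run the model filtered to this window",
    # overwriting exactly that window's partition (delete + insert).
    print(f"processing events in [{window_start}, {window_end})")


def microbatch_backfill(start: date, end: date, batch_days: int = 1) -> None:
    """Process [start, end) as independent, retryable day-sized batches."""
    current = start
    while current < end:
        batch_end = current + timedelta(days=batch_days)
        # A failed batch can be retried without touching its neighbors.
        run_model(window_start=current, window_end=batch_end)
        current = batch_end


microbatch_backfill(date(2024, 1, 1), date(2024, 1, 4))
```

Because each window overwrites only its own partition, retries and backfills are safe by construction, which is exactly what makes the pattern worth building even without native dbt support.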
https://medium.com/@kayrnt/microbatch-on-dbt-93d600ced394
Barbara Galiza: You don’t have a “data problem”
We marketers sometimes think of data as a boolean: either we have it or don’t. However, data problems come in many shapes and forms.
The blog is an excellent summary of data problems. The author breaks down “data problems” into five distinct categories: measurement, quality, accessibility, literacy, and activation. The author further narrates the symptoms of each and possible remedies.
https://www.021newsletter.com/p/you-dont-have-a-data-problem
Shopify: How Shopify improved consumer search intent with real-time ML
Shopify writes about its real-time pipeline design for building the image embedding engine that improves the consumer search experience. The blog describes the trade-off between parallel data processing and memory usage when processing images.
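The trade-off generalizes beyond Shopify's pipeline: every additional parallel worker holds another decoded image in memory. A minimal Python sketch of capping that memory with a bounded worker pool, my own illustration rather than Shopify's code, with `embed_image` as a hypothetical stand-in:

```python
# Sketch of the parallelism-vs-memory trade-off when embedding images.
# My own illustration -- not Shopify's pipeline code.
from concurrent.futures import ThreadPoolExecutor


def embed_image(url: str) -> list[float]:
    # Hypothetical stand-in: download, decode, and embed one image.
    # Every in-flight call holds a decoded image in memory.
    return [0.0] * 128


def embed_all(urls: list[str], max_workers: int = 8) -> list[list[float]]:
    # max_workers is the knob: more workers raise throughput, but also
    # the peak number of decoded images resident in memory at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(embed_image, urls))


embeddings = embed_all([f"https://example.com/img/{i}.jpg" for i in range(100)])
```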
https://shopify.engineering/how-shopify-improved-consumer-search-intent-with-real-time-ml
Netflix: Investigation of a Workbench UI Latency Issue
Internal tools, especially data tools, should not neglect performance and user experience. Netflix writes about how the data platform team investigated a UI latency issue in Jupyter Notebooks. What I like most about the blog is not only the investigation but also the authors' explanation of how they used an LLM in their debugging workflow to understand the issue in more depth.
“Moreover, after several rounds of discussion with ChatGPT, we learned more about the architecture and realized that, in theory, the usage of pystan and nest_asyncio should not cause the slowness in handling the UI WebSocket.”
https://netflixtechblog.com/investigation-of-a-workbench-ui-latency-issue-faa017b4653d
All rights reserved ProtoGrowth Inc., India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent the opinions of my current, former, or future employers.