Data Engineering Weekly #176

The Weekly Data Engineering Newsletter

Jun 16, 2024

Experience Enterprise-Grade Apache Airflow

Astro augments Airflow with enterprise-grade features to enhance productivity, meet scalability and availability demands across your data pipelines, and more.


Databricks: Open Sourcing Unity Catalog

This week brought many exciting developments, with Snowflake and Databricks announcing open-source catalogs. Unity Catalog's source code is available on GitHub, and people have already run interesting experiments with it, such as Unity Catalog OSS with Hudi, Delta, Iceberg, and EMR + DuckDB.

One of the big benefits of the Hive Metastore catalog is that many query engines can build on top of it, which creates a strong ecosystem. I’m excited to see what Unity Catalog and Polaris Catalog bring.

https://www.databricks.com/blog/open-sourcing-unity-catalog

NVIDIA: NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models

[Figure: Nemotron synthetic data generation pipeline diagram]

High-quality training data plays a critical role in the performance, accuracy, and quality of responses from a custom LLM. Regulatory requirements and privacy concerns are often a big hurdle to training on context-rich data. The paper Generative AI for Synthetic Data Generation: Methods, Challenges, and the Future highlights the current challenges and opportunities in using synthetic data generation to train LLMs. Along the same lines, NVIDIA open-sources Nemotron-4, giving developers a free, scalable way to generate synthetic data that can help build powerful LLMs.

https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/

Jack Vanlightly: A Cost Analysis Of Replication Vs. S3 Express One Zone In Transactional Data Systems

S3 Express One Zone, with its low latency and write ops certainty, is promising. The author demonstrates a few emerging architecture patterns around S3 Express One Zone and points out that, although it offers low-latency writes, it becomes cost-effective mainly in high-throughput scenarios. Replication-based systems remain more economical at low to medium throughput, especially with significant cross-AZ data transfer discounts.

https://jack-vanlightly.com/blog/2024/6/10/a-cost-analysis-of-replication-vs-s3-express-one-zone-in-transactional-data-systems

Walmart: Reliably Processing Trillions of Kafka Messages Per Day

Consumer rebalancing and head-of-line (HOL) blocking are among the most common challenges when operating Kafka at scale. The blog describes a message proxy service (MPS) for Kafka that decouples the Kafka message_reader thread (i.e., a group of 1 thread) from the message_processing_writer threads, and it discusses the recent KIP-932 proposal and its potential benefits. I’ve not read through the KIP-932 proposal, but I remember Uber doing a similar design in the past with Enabling Seamless Kafka Async Queuing with Consumer Proxy.

https://medium.com/walmartglobaltech/reliably-processing-trillions-of-kafka-messages-per-day-23494f553ef9
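To make the reader/processor decoupling above concrete, here is a minimal sketch using only the Python standard library. It is not Walmart's MPS and uses no Kafka client: `run_pipeline`, `num_workers`, and `max_buffered` are hypothetical names, and the `messages` iterable merely stands in for a partition being read.

```python
import queue
import threading


def run_pipeline(messages, process, num_workers=4, max_buffered=8):
    """Decouple one reader thread from a pool of processing threads.

    `messages` is a stand-in for a Kafka partition (assumption: no real broker).
    """
    # Bounded buffer: the reader blocks when workers fall behind (backpressure).
    buffer = queue.Queue(maxsize=max_buffered)
    results = []
    results_lock = threading.Lock()
    stop = object()  # sentinel that tells a worker to exit

    def reader():
        # The only thread that touches the source; it never runs user logic.
        for msg in messages:
            buffer.put(msg)
        for _ in range(num_workers):
            buffer.put(stop)

    def worker():
        while True:
            msg = buffer.get()
            if msg is stop:
                break
            out = process(msg)
            with results_lock:
                results.append(out)

    threads = [threading.Thread(target=reader)]
    threads += [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


doubled = run_pipeline(range(20), lambda m: m * 2)
```

A slow message ties up only one worker while the others keep draining the buffer, which is the intuition behind reducing head-of-line blocking. The trade-off in this sketch is that results are no longer produced in input order.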

Yousry Mohamed: Delta Lake Liquid Clustering — A visual explanation

Liquid clustering frees tables from Hive-style static partitioning and organizes the data layout based on access patterns. The author explains the available pattern and provides an in-depth view of the Hilbert curve assignment. The official design document for liquid clustering is here.

https://levelup.gitconnected.com/delta-lake-liquid-clustering-a-visual-explanation-b9d8782a9f33
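As a rough illustration of the Hilbert curve idea mentioned above, the sketch below uses the textbook two-dimensional mapping from a grid cell to its position along the curve. It is not Delta Lake's implementation; `hilbert_index` and the 8x8 grid of pre-bucketed clustering columns are assumptions made for the example.

```python
def hilbert_index(n, x, y):
    """Map cell (x, y) of an n-by-n grid (n a power of two) to its
    position along the Hilbert curve."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has a consistent orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Hypothetical rows keyed by two clustering columns already bucketed to 0..7.
rows = [(x, y) for x in range(8) for y in range(8)]
ordered = sorted(rows, key=lambda r: hilbert_index(8, r[0], r[1]))
```

Sorting by the Hilbert index keeps rows that are close in both columns close together in the output order, so files written from consecutive slices of `ordered` cover compact regions of the two-column space.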

Meta: Maintaining large-scale AI capacity at Meta

Meta uses bespoke training hardware with the newest chips possible and high-performance backend networks that are highly speed-optimized. The article discusses the maintenance challenges this creates and OpsPlanner, the orchestrator Meta uses to manage overlapping work on its GPU fleet.

https://engineering.fb.com/2024/06/12/production-engineering/maintaining-large-scale-ai-capacity-meta/

Sören Brunk: Using DuckDB for Embeddings and Vector Search

DuckDB's simplicity, combined with its advanced capabilities, is what makes it exciting. The author highlights the power and flexibility of DuckDB for handling embeddings and performing vector searches.

https://blog.brunk.io/posts/similarity-search-with-duckdb
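For readers new to vector search, the computation behind it can be sketched in a few lines of plain Python. This deliberately does not use DuckDB or the author's code; `cosine_similarity`, `top_k`, and the toy three-dimensional "embeddings" are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, items, k=2):
    """Brute-force search: score every stored vector, return the k best keys."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in items.items()]
    scored.sort(reverse=True)
    return [key for _, key in scored[:k]]


# Toy embeddings (assumption: real ones have hundreds of dimensions).
docs = {
    "duck": [1.0, 0.0, 0.0],
    "goose": [0.9, 0.1, 0.0],
    "car": [0.0, 0.0, 1.0],
}
nearest = top_k([1.0, 0.0, 0.0], docs, k=2)
```

A database-backed version expresses the same scoring and ordering as a query over a table of stored vectors, usually with an index to avoid scanning every row.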

Picnic: Open-sourcing dbt-score: lint model metadata with ease!

The more metadata a model has, the more readable it is. Maintaining that metadata is often challenging, since developers are not incentivized to produce quality metadata. Picnic writes about dbt-score, a non-opinionated, configurable linting tool to measure the completeness of dbt model metadata.

https://blog.picnic.nl/picnic-open-sources-dbt-score-linting-model-metadata-with-ease-428278f9f05b

Benchling: A behind-the-scenes look at building interactive analysis capabilities in Benchling

Interactive Analysis in Benchling allows scientists to perform real-time data transformation, visualization, and analysis without transferring data into other systems. The article highlights the architectural pattern and how it handles scalability and data sharing among apps.

https://benchling.engineering/a-behind-the-scenes-look-at-building-interactive-analysis-capabilities-in-benchling-fa6ec1bab1e5

Alibaba: In-depth Application of Flink in Ant Group Real-time Feature Store

Alibaba talks about Ant Group’s real-time feature store, SkyLine. SkyLine involves three key stages: computing inference, normalization, and deployment. The article goes in depth on each stage and on the optimizations that make SkyLine's feature serving efficient.

https://www.alibabacloud.com/blog/in-depth-application-of-flink-in-ant-group-real-time-feature-store_601288

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
