Contribute to the RudderStack Transformations Library, Win $1,000
RudderStack Transformations lets you customize event data in real-time with your own JavaScript or Python code. Now you can win $1,000 cash by contributing a Transformation to our open-source library.
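If you're wondering what a Transformation actually looks like: it's just a function that receives each event and returns the modified event (or nothing to drop it). Here's a minimal Python sketch following the transformEvent convention from RudderStack's docs; treat the exact signature and event shape as assumptions and check the docs before submitting.

def transformEvent(event, metadata):
    # Drop internal test traffic instead of forwarding it downstream.
    email = event.get("context", {}).get("traits", {}).get("email", "")
    if email.endswith("@example.com"):
        return None  # returning nothing drops the event
    # Mask a sensitive field before it reaches any destination.
    if "ip" in event.get("context", {}):
        event["context"]["ip"] = "0.0.0.0"
    return event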
https://www.rudderstack.com/blog/join-the-transformations-challenge-for-a-chance-to-win/
Conference Alert: Shape the future of real-time analytics
The Real-Time Analytics Summit is on April 25-26 in downtown San Francisco, CA. Come hear talks from companies like StarTree, Confluent, LinkedIn, DoorDash, Imply, and Uber on how they are advancing the state of the art in user-facing analytics delivered instantly.
Go to rtasummit.com and register with code DEW30 for 30% off.
Meta: Presto - A Decade of SQL Analytics at Meta
Presto and Kafka are two of the systems that most impacted data infrastructure in the last decade. As with any good system, Presto went through many optimizations. Meta has published an exciting paper detailing how its Presto infrastructure is evolving, focusing on three areas:
Latency & Efficiency
Scalability & Reliability
Going Beyond Data Analytics Use Cases
https://research.facebook.com/publications/presto-a-decade-of-sql-analytics-at-meta/
Twitter: Twitter's Recommendation Algorithm
Twitter open-sourced its recommendation engine code. There are some interesting threads on Twitter, but the highlight for me is the design of the Tweet search system. The cluster-split approach to storing real-time, protected, and archive tweets is an excellent reference model for designing enterprise search engines.
https://blog.twitter.com/engineering/en_us/topics/open-source/2023/twitter-recommendation-algorithm
Tweet Search System (EarlyBird) Design
https://github.com/twitter/the-algorithm/blob/main/src/java/com/twitter/search/README.md
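To make the cluster-split idea concrete, here is a hypothetical Python sketch of the pattern: fan the query out to per-tier indexes (real-time, protected, archive) and merge the ranked hits. The cluster names, documents, and scoring below are mine for illustration, not Twitter's code.

from concurrent.futures import ThreadPoolExecutor

# Illustrative per-tier indexes mirroring the EarlyBird split.
CLUSTERS = {
    "realtime": [{"id": 1, "text": "fresh tweet about presto", "score": 0.9}],
    "protected": [{"id": 2, "text": "protected tweet about presto", "score": 0.7}],
    "archive": [{"id": 3, "text": "old tweet about presto", "score": 0.4}],
}

def search_cluster(name, query):
    # Stand-in for a real per-cluster index lookup.
    return [doc for doc in CLUSTERS[name] if query in doc["text"]]

def federated_search(query, limit=10):
    # Query every tier in parallel, then merge and rank the hits.
    with ThreadPoolExecutor() as pool:
        per_cluster = pool.map(lambda name: search_cluster(name, query), CLUSTERS)
    hits = [doc for cluster_hits in per_cluster for doc in cluster_hits]
    return sorted(hits, key=lambda doc: doc["score"], reverse=True)[:limit]

print(federated_search("presto"))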
Google AI: Data-centric ML benchmarking - Announcing DataPerf’s 2023 challenges
Data is the new code: it is the training data that determines the maximum possible quality of an ML solution. The model only determines the degree to which that maximum quality is realized; in a sense, the model is a lossy compiler for the data.
Google announces DataPerf, the first community and platform for building leaderboards for data benchmarks.
I echoed a similar statement here; data creation is a big untapped market in data engineering.
https://ai.googleblog.com/2023/03/data-centric-ml-benchmarking-announcing.html
Microsoft: Building a collaboration platform for a data science team
Microsoft writes about its open-source data science collaboration package. The approach focuses on standardizing folder structure and file names to simplify collaboration.
.
└── .ds_team/
    ├── data_class.py
    ├── features_class.py
    ├── model_class.py
    ├── evaluator_class.py
    ├── experiment_class.py
    ├── error_analysis.py
    └── experiment.py
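The value here is the contract rather than the code: every project exposes the same stage classes, so a teammate always knows where to look. As a rough illustration, one of those standardized modules might look like the sketch below; the class and method names are my guesses, not the actual package's API.

import numpy as np
import pandas as pd

# Hypothetical contents of features_class.py under the standardized
# layout; the real package defines its own class contracts.
class Features:
    """Single place where every project computes model features."""

    def __init__(self, raw: pd.DataFrame):
        self.raw = raw

    def build(self) -> pd.DataFrame:
        # Same entry point in every repo, so downstream model and
        # evaluator code never changes between projects.
        out = self.raw.copy()
        out["amount_log"] = np.log1p(out["amount"].clip(lower=0))
        return out

print(Features(pd.DataFrame({"amount": [0, 10, 100]})).build())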
Sponsored: [Virtual Data Panel] Measuring Data Team ROI
As data leaders, one of our top priorities is to measure ROI. From tracking the efficacy of marketing campaigns to understanding the root cause of new spikes in user engagement, we’re tasked with keeping tabs on the health of the business at all levels. But what about the ROI of our own teams? Watch a panel of data leaders as they discuss how to build strategies for measuring data team ROI.
Watch On-demand
Picnic: Using Change Data Capture for Warehouse Analytics
Picnic writes about its Change Data Capture pipeline and the lessons learned while integrating Debezium with Postgres. The highlight of the blog for me is:
However, the JSON data the connector produces is a self-describing JSON, meaning that each event has its schema definition attached to it. This blows up the messages massively in size, which increases storage costs.
After the initial phase of running Debezium with JSON, we migrated the data to Avro as serialization format, because — being a binary format — it is much more compact, efficient, and supports the use of a schema registry.
There is a fair criticism of how JSON loses context as it travels through pipelines and of the need to ship a self-describing JSON schema along with the payload. As the author points out, that is simply not a scalable approach. A schema control plane and an event-sourcing data plane are a more scalable architecture pattern.
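To make the trade-off concrete: in Kafka Connect, schemas.enable=true on the JsonConverter is exactly what attaches a schema to every message, and switching to the Avro converter with a schema registry moves the schema out of the payload. Here's a sketch of the relevant connector settings, registered via the Connect REST API; the hosts, credentials, and connector name are placeholders, not Picnic's setup.

import json
import requests

connector = {
    "name": "inventory-connector",  # placeholder name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.dbname": "inventory",
        "database.user": "debezium",
        "database.password": "secret",
        "topic.prefix": "shop",
        # The JSON alternative embeds a schema in every event:
        #   "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        #   "value.converter.schemas.enable": "true",
        # Avro keeps the schema in the registry, not the payload:
        "key.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "key.converter.schema.registry.url": "http://schema-registry:8081",
        "value.converter.schema.registry.url": "http://schema-registry:8081",
    },
}

resp = requests.post(
    "http://connect:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()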
https://blog.picnic.nl/using-change-data-capture-for-warehouse-analytics-a1b23c074781
Data Engineering Weekly recently published An Engineering Guide to Data Creation. If you want to chat about Event Creation, please reach out via LinkedIn.
Kaltura: Moving from Redshift-based architecture to Databricks Delta Lake
Well, another week, another moving-away-from-Redshift blog featured in Data Engineering Weekly. Seriously, come on, Redshift team!!
Kaltura writes about its challenges in maintaining Redshift and its migration strategy for moving to Databricks. The blog highlights some immediate wins but also the new set of challenges that come with Databricks.
Sponsored: RudderStack Transformations - Move Faster and Build Data Trust
RudderStack Product Manager Badri Veeraragavan details a few big updates to RudderStack's beloved data transformation feature. New features include Python Transformations (including Libraries and the Transformations API), Transformation Templates, and Device Mode Transformations. 75% of RudderStack users already leverage Transformations, and now they're even more powerful.
With Device Mode Transformations, you can transform data sent to downstream integrations running in device mode. When destination integrations are set up in device mode, RudderStack loads that tool's native SDK asynchronously and sends event data directly to the destination from the device itself (i.e., from the browser or mobile app).
https://www.rudderstack.com/blog/transformations-move-faster-and-build-data-trust/
Nordnet: Our first steps towards Data Mesh on Google Cloud Platform
Nordnet writes about its first steps adopting the Data Mesh concept with a streaming-first, event-driven approach. The blog rightly calls out the challenges of an events-only approach and mitigates them with a "state dump," aka bootstrapping, approach.
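The bootstrapping idea is worth spelling out: a new consumer first loads a state dump (snapshot) and then replays only the events recorded after it, instead of rebuilding state from the beginning of the stream. Here's a generic, hedged sketch of that sequence; the snapshot source, offsets, and event shape are all hypothetical.

# 1) Load a state dump, 2) tail only the events newer than it.
def load_snapshot():
    # e.g., read a periodic export of current state; returns the
    # state plus the last event offset covered by the snapshot.
    return {"account-1": {"balance": 100}}, 41_997

def read_events(after_offset):
    # e.g., poll a Kafka topic starting right after the snapshot.
    yield {"offset": 41_998, "key": "account-1", "delta": -25}

state, snapshot_offset = load_snapshot()
for event in read_events(snapshot_offset):
    state[event["key"]]["balance"] += event["delta"]

print(state)  # snapshot + incremental events = current state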
Jun Zhang - Tencent Data Engineer - Why We Go from ClickHouse to Apache Doris?
TIL about Apache Doris; it seems like an exciting system to explore. The author highlights the following reasons why Tencent migrated from ClickHouse to Apache Doris:
Partial Update
High storage cost
High maintenance cost
I recently had a chance to evaluate ClickHouse and came to a similar conclusion. The "upsert" operation is not well supported in ClickHouse, which increases cost and reduces production cluster stability, directly impacting the system's reliability.
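For anyone who hasn't hit this: ClickHouse has no native upsert, so the usual workaround is a ReplacingMergeTree table where duplicate keys collapse only at background merge time, forcing reads through FINAL (or argMax) for correct results; that is where the cost and instability creep in. A small sketch with the clickhouse_connect Python client, with an illustrative host and table:

import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

client.command("""
    CREATE TABLE IF NOT EXISTS user_profile (
        user_id UInt64,
        email   String,
        version UInt64
    ) ENGINE = ReplacingMergeTree(version)
    ORDER BY user_id
""")

# An "update" is just another insert with a higher version.
client.command("INSERT INTO user_profile VALUES (1, 'old@x.com', 1)")
client.command("INSERT INTO user_profile VALUES (1, 'new@x.com', 2)")

# Without FINAL you may read both rows until a merge happens.
print(client.query("SELECT * FROM user_profile FINAL").result_rows)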
Across these design patterns, I found that Apache Pinot optimizes much better than many other OLAP engines.
Black Square: Rethinking Whisky Search in the World of ChatGPT
The possibility of semantic search in the product discovery experience is something I'm looking forward to, and that is why I found this article very interesting. The author writes about how BlackSquare is rethinking whisky discovery using ChatGPT.
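For intuition, semantic search here boils down to embedding both the catalog and the query into one vector space and ranking by similarity, which is what turns "something smoky for a winter evening" into a usable query. Below is a minimal sketch with an open-source embedding model standing in for the ChatGPT/OpenAI setup the article describes; the model choice and catalog are mine.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

catalog = [
    "Heavily peated Islay single malt with maritime smoke",
    "Honeyed Speyside whisky with orchard fruit and vanilla",
    "Sherry-cask Highland malt, dried fruit and baking spice",
]
catalog_vectors = model.encode(catalog, convert_to_tensor=True)

# Rank the catalog against a natural-language query.
query_vector = model.encode("something smoky for a winter evening",
                            convert_to_tensor=True)
scores = util.cos_sim(query_vector, catalog_vectors)[0]
print(catalog[int(scores.argmax())])  # the peated Islay ranks first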
https://data.blacksquare.io/rethinking-product-search-in-the-world-of-chatgpt-57b435b5ce3c
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.