Contribute to the RudderStack Transformations Library, Win $1,000
RudderStack Transformations lets you customize event data in real-time with your own JavaScript or Python code. Now you can win $1,000 cash by contributing a Transformation to our open-source library.
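If you're wondering what a Transformation actually looks like: it's just a function that receives each event and returns the modified event (or nothing to drop it). Here's a minimal Python sketch following the transformEvent convention from RudderStack's docs; treat the exact signature and event shape as assumptions and check the docs before submitting.

def transformEvent(event, metadata):
    # Drop internal test traffic instead of forwarding it downstream.
    email = event.get("context", {}).get("traits", {}).get("email", "")
    if email.endswith("@example.com"):
        return None  # returning nothing drops the event
    # Mask a sensitive field before it reaches any destination.
    if "ip" in event.get("context", {}):
        event["context"]["ip"] = "0.0.0.0"
    return event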
https://www.rudderstack.com/blog/join-the-transformations-challenge-for-a-chance-to-win/
Conference Alert: Shape the future of real-time analytics
The Real-Time Analytics Summit is on April 25-26 in downtown San Francisco, CA. Come hear talks from companies like StarTree, Confluent, LinkedIn, DoorDash, Imply, and Uber on how they are advancing the state of the art in user-facing analytics delivered instantly.
Go to rtasummit.com and register with code DEW30 for 30% off.
Meta: Presto - A Decade of SQL Analytics at Meta
Presto and Kafka are two of the systems that most impacted data infrastructure in the last decade. As with any good system, Presto went through many optimizations. Meta has published an exciting paper detailing how its Presto infrastructure is evolving, focusing on three areas:
Latency & Efficiency
Scalability & Reliability
Going Beyond Data Analytics Use Cases
https://research.facebook.com/publications/presto-a-decade-of-sql-analytics-at-meta/
Twitter: Twitter's Recommendation Algorithm
Twitter open-sourced its recommendation engine code. There are some interesting threads on Twitter, but the highlight for me is the design of the Tweet search system. The cluster-split approach to storing real-time, protected, and archive tweets is an excellent reference model for designing enterprise search engines.
https://blog.twitter.com/engineering/en_us/topics/open-source/2023/twitter-recommendation-algorithm
Tweet Search System (EarlyBird) Design
https://github.com/twitter/the-algorithm/blob/main/src/java/com/twitter/search/README.md
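To make the cluster-split idea concrete, here is a hypothetical Python sketch of the pattern: fan the query out to per-tier indexes (real-time, protected, archive) and merge the ranked hits. The cluster names, documents, and scoring below are mine for illustration, not Twitter's code.

from concurrent.futures import ThreadPoolExecutor

# Illustrative per-tier indexes mirroring the EarlyBird split.
CLUSTERS = {
    "realtime": [{"id": 1, "text": "fresh tweet about presto", "score": 0.9}],
    "protected": [{"id": 2, "text": "protected tweet about presto", "score": 0.7}],
    "archive": [{"id": 3, "text": "old tweet about presto", "score": 0.4}],
}

def search_cluster(name, query):
    # Stand-in for a real per-cluster index lookup.
    return [doc for doc in CLUSTERS[name] if query in doc["text"]]

def federated_search(query, limit=10):
    # Query every tier in parallel, then merge and rank the hits.
    with ThreadPoolExecutor() as pool:
        per_cluster = pool.map(lambda name: search_cluster(name, query), CLUSTERS)
    hits = [doc for cluster_hits in per_cluster for doc in cluster_hits]
    return sorted(hits, key=lambda doc: doc["score"], reverse=True)[:limit]

print(federated_search("presto"))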
Google AI: Data-centric ML benchmarking - Announcing DataPerf’s 2023 challenges
Data is the new code: it is the training data that determines the maximum possible quality of an ML solution. The model only determines the degree to which that maximum quality is realized; in a sense, the model is a lossy compiler for the data.
Google announces DataPerf, the first community and platform for building leaderboards for data benchmarks.
I echoed a similar statement here; data creation is a big untapped market in data engineering.
https://ai.googleblog.com/2023/03/data-centric-ml-benchmarking-announcing.html
Microsoft: Building a collaboration platform for a data science team
Microsoft writes about its open-source data science collaboration package. The approach focuses on standardizing folder structure and file names to simplify collaboration.
.
└── .ds_team/
    ├── data_class.py
    ├── features_class.py
    ├── model_class.py
    ├── evaluator_class.py
    ├── experiment_class.py
    ├── error_analysis.py
    └── experiment.py
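The value here is the contract rather than the code: every project exposes the same stage classes, so a teammate always knows where to look. As a rough illustration, one of those standardized modules might look like the sketch below; the class and method names are my guesses, not the actual package's API.

import numpy as np
import pandas as pd

# Hypothetical contents of features_class.py under the standardized
# layout; the real package defines its own class contracts.
class Features:
    """Single place where every project computes model features."""

    def __init__(self, raw: pd.DataFrame):
        self.raw = raw

    def build(self) -> pd.DataFrame:
        # Same entry point in every repo, so downstream model and
        # evaluator code never changes between projects.
        out = self.raw.copy()
        out["amount_log"] = np.log1p(out["amount"].clip(lower=0))
        return out

print(Features(pd.DataFrame({"amount": [0, 10, 100]})).build())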
Sponsored: [Virtual Data Panel] Measuring Data Team ROI
As data leaders, one of our top priorities is to measure ROI. From tracking the efficacy of marketing campaigns to understanding the root cause of new spikes in user engagement, we’re tasked with keeping tabs on the health of the business at all levels. But what about the ROI of our own teams? Watch a panel of data leaders as they discuss how to build strategies for measuring data team ROI.
Watch On-demand
Picnic: Using Change Data Capture for Warehouse Analytics
Picnic writes about its Change Data Capture pipeline and the lessons learned while integrating Debezium with Postgres. The highlight of the blog for me is:
However, the JSON data the connector produces is a self-describing JSON, meaning that each event has its schema definition attached to it. This blows up the messages massively in size, which increases storage costs.
After the initial phase of running Debezium with JSON, we migrated the data to Avro as serialization format, because — being a binary format — it is much more compact, efficient, and supports the use of a schema registry.
There is a fair criticism of how JSON loses context as it travels through pipelines and of the need to ship a self-describing JSON schema along with the payload. As the author points out, that is simply not a scalable approach. A schema control plane and an event-sourcing data plane are a more scalable architecture pattern.
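To make the trade-off concrete: in Kafka Connect, schemas.enable=true on the JsonConverter is exactly what attaches a schema to every message, and switching to the Avro converter with a schema registry moves the schema out of the payload. Here's a sketch of the relevant connector settings, registered via the Connect REST API; the hosts, credentials, and connector name are placeholders, not Picnic's setup.

import json
import requests

connector = {
    "name": "inventory-connector",  # placeholder name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.dbname": "inventory",
        "database.user": "debezium",
        "database.password": "secret",
        "topic.prefix": "shop",
        # The JSON alternative embeds a schema in every event:
        #   "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        #   "value.converter.schemas.enable": "true",
        # Avro keeps the schema in the registry, not the payload:
        "key.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "key.converter.schema.registry.url": "http://schema-registry:8081",
        "value.converter.schema.registry.url": "http://schema-registry:8081",
    },
}

resp = requests.post(
    "http://connect:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()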
https://blog.picnic.nl/using-change-data-capture-for-warehouse-analytics-a1b23c074781
Data Engineering Weekly recently published An Engineering Guide to Data Creation. If you want to chat about Event Creation, please reach out via LinkedIn.
Kaltura: Moving from Redshift-based architecture to Databricks Delta Lake
Well, another week, another moving-away-from-Redshift blog featured in Data Engineering Weekly. Seriously, come on, Redshift team!!
Kaltura writes about its challenges in maintaining Redshift and its migration strategy for moving to Databricks. The blog highlights some immediate wins but also the new set of challenges that come with Databricks.
Sponsored: RudderStack Transformations - Move Faster and Build Data Trust
RudderStack Product Manager Badri Veeraragavan details a few big updates to RudderStack's beloved data transformation feature. New features include Python Transformations (including Libraries and the Transformations API), Transformation Templates, and Device Mode Transformations. 75% of RudderStack users already leverage Transformations, and now they're even more powerful.
With Device Mode Transformations, you can transform data sent to downstream integrations running in device mode. When destination integrations are set up in device mode, RudderStack loads that tool's native SDK asynchronously and sends event data directly to the destination from the device itself (i.e., from the browser or mobile app).
https://www.rudderstack.com/blog/transformations-move-faster-and-build-data-trust/
Nordnet: Our first steps towards Data Mesh on Google Cloud Platform
Nordnet writes about its first steps adopting the Data Mesh concept with a streaming-first, event-driven approach. The blog rightly calls out the challenges of an events-only approach and mitigates them with a "state dump," aka bootstrapping, approach.
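The bootstrapping idea is worth spelling out: a new consumer first loads a state dump (snapshot) and then replays only the events recorded after it, instead of rebuilding state from the beginning of the stream. Here's a generic, hedged sketch of that sequence; the snapshot source, offsets, and event shape are all hypothetical.

# 1) Load a state dump, 2) tail only the events newer than it.
def load_snapshot():
    # e.g., read a periodic export of current state; returns the
    # state plus the last event offset covered by the snapshot.
    return {"account-1": {"balance": 100}}, 41_997

def read_events(after_offset):
    # e.g., poll a Kafka topic starting right after the snapshot.
    yield {"offset": 41_998, "key": "account-1", "delta": -25}

state, snapshot_offset = load_snapshot()
for event in read_events(snapshot_offset):
    state[event["key"]]["balance"] += event["delta"]

print(state)  # snapshot + incremental events = current state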
Jun Zhang - Tencent Data Engineer - Why We Go from ClickHouse to Apache Doris?
TIL about Apache Doris; it seems like an exciting system to explore. The author highlights the following reasons why Tencent migrated from ClickHouse to Apache Doris:
Partial Update
High storage cost
High maintenance cost
I recently had a chance to evaluate ClickHouse and came to a similar conclusion. The "upsert" operation is not well supported in ClickHouse, which increases cost and reduces production cluster stability, directly impacting the system's reliability.
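For anyone who hasn't hit this: ClickHouse has no native upsert, so the usual workaround is a ReplacingMergeTree table where duplicate keys collapse only at background merge time, forcing reads through FINAL (or argMax) for correct results; that is where the cost and instability creep in. A small sketch with the clickhouse_connect Python client, with an illustrative host and table:

import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

client.command("""
    CREATE TABLE IF NOT EXISTS user_profile (
        user_id UInt64,
        email   String,
        version UInt64
    ) ENGINE = ReplacingMergeTree(version)
    ORDER BY user_id
""")

# An "update" is just another insert with a higher version.
client.command("INSERT INTO user_profile VALUES (1, 'old@x.com', 1)")
client.command("INSERT INTO user_profile VALUES (1, 'new@x.com', 2)")

# Without FINAL you may read both rows until a merge happens.
print(client.query("SELECT * FROM user_profile FINAL").result_rows)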
Across these design patterns, I found that Apache Pinot optimizes much better than many other OLAP engines.
Black Square: Rethinking Whisky Search in the World of ChatGPT
The possibility of semantic search in the product discovery experience is something I'm looking forward to, and that is why I found this article very interesting. The author writes about how BlackSquare is rethinking whisky discovery using ChatGPT.
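For intuition, semantic search here boils down to embedding both the catalog and the query into one vector space and ranking by similarity, which is what turns "something smoky for a winter evening" into a usable query. Below is a minimal sketch with an open-source embedding model standing in for the ChatGPT/OpenAI setup the article describes; the model choice and catalog are mine.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

catalog = [
    "Heavily peated Islay single malt with maritime smoke",
    "Honeyed Speyside whisky with orchard fruit and vanilla",
    "Sherry-cask Highland malt, dried fruit and baking spice",
]
catalog_vectors = model.encode(catalog, convert_to_tensor=True)

# Rank the catalog against a natural-language query.
query_vector = model.encode("something smoky for a winter evening",
                            convert_to_tensor=True)
scores = util.cos_sim(query_vector, catalog_vectors)[0]
print(catalog[int(scores.argmax())])  # the peated Islay ranks first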
https://data.blacksquare.io/rethinking-product-search-in-the-world-of-chatgpt-57b435b5ce3c
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.