Data Engineering Weekly

Data Engineering Weekly #125

www.dataengineeringweekly.com

The Weekly Data Engineering Newsletter

Ananth Packkildurai
Apr 3, 2023

Contribute to the RudderStack Transformations Library, Win $1,000

RudderStack Transformations lets you customize event data in real time with your own JavaScript or Python code. Now you can win $1,000 cash by contributing a Transformation to our open-source library.

https://www.rudderstack.com/blog/join-the-transformations-challenge-for-a-chance-to-win/


Conference Alert: Shape the future of real-time analytics

The Real-Time Analytics Summit is on April 25-26 in downtown San Francisco, CA. Come and hear talks from companies like StarTree, Confluent, LinkedIn, DoorDash, Imply, and Uber on how they are advancing the state of the art in user-facing analytics delivered instantly.

Go to rtasummit.com and register with DEW30 for 30% off.


Meta: Presto - A Decade of SQL Analytics at Meta

Presto and Kafka are the two systems that most shaped data infrastructure in the last decade. As with any good system, Presto has gone through many optimizations. Meta has published an exciting paper detailing how its Presto infrastructure is evolving, focusing on three areas:

  1. Latency & Efficiency

  2. Scalability & Reliability

  3. Going Beyond Data Analytics use cases

https://research.facebook.com/publications/presto-a-decade-of-sql-analytics-at-meta/


Twitter: Twitter's Recommendation Algorithm

Twitter open-sourced its recommendation engine code. There are some interesting threads about it on Twitter, but the highlight for me is the design of the Tweet search system. Splitting the index into separate clusters for real-time, protected, and archived tweets is an excellent reference model for designing enterprise search engines.

https://blog.twitter.com/engineering/en_us/topics/open-source/2023/twitter-recommendation-algorithm

Tweet Search System (EarlyBird) Design

https://github.com/twitter/the-algorithm/blob/main/src/java/com/twitter/search/README.md


Google AI: Data-centric ML benchmarking - Announcing DataPerf’s 2023 challenges

Data is the new code: it is the training data that determines the maximum possible quality of an ML solution. The model only determines the degree to which that maximum quality is realized; in a sense, the model is a lossy compiler for the data.

Google announces DataPerf, the first community and platform for building leaderboards for data benchmarks.

I echoed a similar statement here. Data creation is a big untapped market in data engineering.

@ananthdurai, 5:09 PM ∙ Mar 16, 2023:
With the recent @Microsoft announcement, LLM will hugely impact the last-mile delivery of analytics. It will be interesting how the semantic layer emerges with the rise of GPT models. However, High-Quality Data Creation and Data collaboration going to remain challenging.

https://ai.googleblog.com/2023/03/data-centric-ml-benchmarking-announcing.html


Microsoft: Building a collaboration platform for a data science team

Microsoft writes about its open-source Data Science collaboration package. The approach focuses on standardizing the folder and the file names to simplify the collaboration.

. 
└── .ds_team/ 
    ├── data_class.py 
    ├── features_class.py 
    ├── model_class.py 
    ├── evaluator_class.py 
    ├── experiment_class.py 
    ├── error_analysis.py 
    ├── experiment.py 
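
A convention like this is easy to check automatically. The following is a minimal sketch, not part of the Microsoft package: the file names are copied from the layout above, while the `missing_files` helper and the assumption that every file is required are hypothetical.

```python
from pathlib import Path

# File names copied from the layout above. Treating all of them as
# required is an assumption made for this sketch.
EXPECTED_FILES = [
    "data_class.py",
    "features_class.py",
    "model_class.py",
    "evaluator_class.py",
    "experiment_class.py",
    "error_analysis.py",
    "experiment.py",
]


def missing_files(project_root: str, team_dir: str = ".ds_team") -> list[str]:
    """Return the expected file names that are absent from the team folder."""
    folder = Path(project_root) / team_dir
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]
```

Calling `missing_files(".")` from a project root returns an empty list when the layout is complete, which makes it a natural fit for a pre-commit or CI check.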

https://medium.com/data-science-at-microsoft/building-a-collaboration-platform-for-a-data-science-team-b37d1d4e3a31


Sponsored: [Virtual Data Panel] Measuring Data Team ROI

As data leaders, one of our top priorities is to measure ROI. From tracking the efficacy of marketing campaigns to understanding the root cause of new spikes in user engagement, we’re tasked with keeping tabs on the health of the business at all levels. But what about the ROI of our own teams? Watch a panel of data leaders as they discuss how to build strategies for measuring data team ROI.

Watch On-demand


Picnic: Using Change Data Capture for Warehouse Analytics

Picnic writes about its Change Data Capture pipeline and the lessons learned while integrating Debezium with Postgres. The highlight of the blog for me is:

However, the JSON data the connector produces is a self-describing JSON, meaning that each event has its schema definition attached to it. This blows up the messages massively in size, which increases storage costs.

After the initial phase of running Debezium with JSON, we migrated the data to Avro as serialization format, because — being a binary format — it is much more compact, efficient, and supports the use of a schema registry.

A common criticism of JSON is that it loses context as it travels through pipelines, which creates the need to ship a self-describing JSON schema along with the payload. As the author points out, that is simply not a scalable approach. A schema control plane combined with an event-sourcing data plane is a more scalable architecture pattern.
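
To make the size argument concrete, here is a minimal sketch that compares an event carrying its schema with one carrying only a schema ID. The schema, the row, and the in-memory `SCHEMA_REGISTRY` dict are hypothetical and invented for illustration. The sketch stays in JSON for readability; Picnic's actual fix was Avro with a schema registry, which this does not reproduce.

```python
import json

# Hypothetical schema and row, invented for illustration only.
schema = {
    "type": "struct",
    "name": "orders.Value",
    "fields": [
        {"field": "order_id", "type": "int64", "optional": False},
        {"field": "customer_id", "type": "int64", "optional": False},
        {"field": "status", "type": "string", "optional": True},
    ],
}
payload = {"order_id": 1001, "customer_id": 42, "status": "DELIVERED"}

# Self-describing event: the schema travels with every message.
self_describing = json.dumps({"schema": schema, "payload": payload})

# Registry-style event: the message carries only a schema ID,
# and the schema itself is stored once (here, in a plain dict).
SCHEMA_REGISTRY = {1: schema}
with_schema_id = json.dumps({"schema_id": 1, "payload": payload})


def decode(message: str) -> dict:
    """Look up the schema by ID and return it together with the payload."""
    event = json.loads(message)
    return {
        "schema": SCHEMA_REGISTRY[event["schema_id"]],
        "payload": event["payload"],
    }


print(len(self_describing.encode()), "bytes with embedded schema")
print(len(with_schema_id.encode()), "bytes with schema ID")
```

The consumer recovers the same information in both cases, but the per-message overhead of the second form stays constant no matter how wide the table grows.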

https://blog.picnic.nl/using-change-data-capture-for-warehouse-analytics-a1b23c074781

Data Engineering Weekly recently published An Engineering Guide to Data Creation. If you want to chat about event creation, please reach out via LinkedIn.

Data Engineering Weekly
An Engineering Guide to Data Creation - A Data Contract perspective - Part 1
Why should we care about Data Creation Process? All Successful Data-Driven organizations have one thing in common; They have a high-quality & efficient data creation process. Data creation is often the differentiator between the success & the failure of a data team…

Kaltura: Moving from Redshift-based architecture to Databricks Delta Lake

Well, another week, another moving-away-from-Redshift blog featured in Data Engineering Weekly. Seriously, come on, Redshift team!!

Kaltura writes about its challenges in maintaining Redshift and its strategy for migrating to Databricks. The blog covers some of the immediate wins as well as the new set of challenges that came with Databricks.

https://medium.com/kaltura-tech/moving-from-redshift-based-architecture-to-databricks-delta-lake-7a17be6449d7


Sponsored: RudderStack Transformations - Move Faster and Build Data Trust

With Device Mode Transformations, you can transform data sent to downstream integrations running in device mode. When destination integrations are set up in device mode, RudderStack loads that tool's native SDK asynchronously and sends event data directly to the destination from the device itself (i.e., from the browser or mobile app).

RudderStack Product manager, Badri Veeraragavan, details a few big updates to RudderStack's beloved data transformation feature. New features include Python Transformations (including Libraries and Transformations API), Transformation Templates, and Device Mode Transformations. 75% of RudderStack users already leverage Transformations, and now they're even more powerful.


https://www.rudderstack.com/blog/transformations-move-faster-and-build-data-trust/


Nordnet: Our first steps towards Data Mesh on Google Cloud Platform

Nordnet writes about its first steps in adopting the Data Mesh concept with a streaming-first, event-driven approach. The blog rightly calls out the challenges of an events-only approach and how Nordnet mitigates them with a “state dump,” aka a bootstrapping approach.
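
One common reading of the bootstrapping pattern: a new consumer first loads a dump of the current state, then applies only the events that arrived after the dump was taken. The sketch below illustrates that idea under stated assumptions. The `Event` shape, the offset-based cutoff, and the `bootstrap` function are hypothetical and are not taken from Nordnet's implementation.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class Event:
    offset: int             # position in the event stream
    key: str                # entity identifier
    value: Optional[dict]   # new state; None means the entity was deleted


def bootstrap(
    snapshot: Dict[str, dict],
    snapshot_offset: int,
    events: Iterable[Event],
) -> Dict[str, dict]:
    """Start from the state dump, then apply only events newer than it."""
    state = dict(snapshot)
    for event in events:
        if event.offset <= snapshot_offset:
            continue  # already reflected in the state dump
        if event.value is None:
            state.pop(event.key, None)
        else:
            state[event.key] = event.value
    return state


snapshot = {"acct-1": {"balance": 100}, "acct-2": {"balance": 50}}
events = [
    Event(9, "acct-1", {"balance": 100}),  # older than the dump, skipped
    Event(11, "acct-2", {"balance": 75}),
    Event(12, "acct-3", {"balance": 10}),
    Event(13, "acct-1", None),
]
state = bootstrap(snapshot, snapshot_offset=10, events=events)
print(state)
```

Recording the stream position at which the dump was taken is the key design choice here: it lets the consumer skip events the dump already contains instead of applying them twice.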

https://medium.com/nordnet-tech/our-first-steps-towards-data-mesh-on-google-cloud-platform-e4aa8eb70236


Jun Zhang - Tencent Data Engineer - Why We Go from ClickHouse to Apache Doris?

TIL about Apache Doris; it seems an exciting system to explore. The author highlights the following reasons why Tencent migrated from ClickHouse to Apache Doris.

  1. Partial Update

  2. High storage cost

  3. High maintenance cost

I recently had a chance to evaluate ClickHouse and came to a similar conclusion. The “upsert” operation is not well supported in ClickHouse, which increases the cost and decreases the production cluster stability, directly impacting the system's reliability.

I found that Apache Pinot handles these design patterns better than many other OLAP engines.
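
For readers unfamiliar with the term, the sketch below contrasts an in-place partial update with an append-only store that has to reconcile versions at read time. It is a generic illustration of the semantics only, with invented rows and hypothetical helper names. It does not model how ClickHouse, Doris, or Pinot implement updates internally.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def upsert(table: Dict[str, Row], key: str, changes: Row) -> None:
    """Partial update in place: merge only the changed columns into the row."""
    table.setdefault(key, {}).update(changes)


def resolve_latest(log: List[Tuple[str, Row]]) -> Dict[str, Row]:
    """Append-only alternative: replay every version to rebuild current rows."""
    current: Dict[str, Row] = {}
    for key, changes in log:
        current.setdefault(key, {}).update(changes)
    return current


table: Dict[str, Row] = {}
log: List[Tuple[str, Row]] = []
updates = [
    ("user-1", {"name": "Ada", "plan": "free"}),
    ("user-2", {"name": "Lin", "plan": "pro"}),
    ("user-1", {"plan": "pro"}),  # partial update: only one column changes
]
for key, changes in updates:
    upsert(table, key, changes)
    log.append((key, changes))

print(table)
print(len(log), "versions stored for", len(table), "rows")
```

Both paths end with the same rows, but the append-only path stores every version and pays the reconciliation cost on each read or in a background merge, which is where extra storage and operational cost can come from.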

https://medium.com/geekculture/tencent-data-engineer-why-we-go-from-clickhouse-to-apache-doris-db120f324290


Black Square: Rethinking Whisky Search in the World of ChatGPT

I’m looking forward to semantic search becoming part of the product discovery experience, which is why I found this article very interesting. The author writes about how BlackSquare is rethinking whisky discovery using ChatGPT.
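
One common way to build semantic search is to rank products by the similarity of embedding vectors rather than by keyword match. The sketch below shows only that ranking step. The catalog entries, the three-dimensional vectors, and the query vector are hand-written stand-ins for what an embedding model would produce, and the article itself may take a different approach.

```python
import math
from typing import Dict, List

# Hand-written toy vectors standing in for embeddings from a real model.
# The dimensions (smoky, sweet, fruity) and all values are invented.
CATALOG: Dict[str, List[float]] = {
    "Peated Islay single malt": [0.9, 0.1, 0.1],
    "Sherry cask Speyside": [0.1, 0.8, 0.6],
    "Bourbon barrel blend": [0.2, 0.9, 0.2],
}


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def search(query_vector: List[float], top_k: int = 2) -> List[str]:
    """Return the top_k catalog items most similar to the query vector."""
    ranked = sorted(
        CATALOG,
        key=lambda name: cosine(query_vector, CATALOG[name]),
        reverse=True,
    )
    return ranked[:top_k]


# A query such as "something smoky" would be embedded by the same model;
# here the query vector is written by hand.
print(search([1.0, 0.0, 0.1]))
```

In a real system the vectors would come from an embedding model and live in a vector index, but the ranking logic stays the same.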

https://data.blacksquare.io/rethinking-product-search-in-the-world-of-chatgpt-57b435b5ce3c


All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
