Data Engineering Weekly #188

The Weekly Data Engineering Newsletter

Sep 09, 2024

Try Fully Managed Apache Airflow for FREE

Run Airflow without the hassle and management complexity. Take Astro (the fully managed Airflow solution) for a test drive today and unlock a suite of features designed to simplify, optimize, and scale your data pipelines. For a limited time, new sign-ups will receive a complimentary Airflow Fundamentals Certification exam (normally $150).

Try For Free →

Gradient Flow: Rethinking Analyst Roles in the Age of Generative AI

The future of analysis isn't about replacement; it's about augmentation.

I saw someone quoted in X that using LLM to code is like using Google Maps to drive. At some point, we would wonder how people coded without code agents. The author explores what the future holds for the analyst and their relationship with GenAI.

https://gradientflow.substack.com/p/rethinking-analyst-roles-in-the-age

Financial Times: Putting the (Gen AI) Puzzle Together

In April 2024, Nikkei(the parent company of FT) issued a press release announcing internally built and trained Nikkei Language Models (NiLM). NiLM is a series of LLMs specializing in economic and financial information, trained on 40 years’ content of Nikkei group titles and among the largest and highest-quality LLMs in Japanese. The blog, like the previous one, explores the role of Gen AI in the publishing industry.

https://medium.com/ft-product-technology/putting-the-gen-ai-puzzle-together-9470c3aaef7a

Sebastian Raschka: New LLM Pre-training and Post-training Paradigms

The previous two blogs show a glimpse of the next wave in Gen-AI: the ability to train internal business/ content with LLM and its business impact. The LLM pre-training and post-training paradigms are rolling out the foundational future to enable organizations to build with internal knowledge. The author explores how leading language models in the market support pre and post-training paradigms.

https://magazine.sebastianraschka.com/p/new-llm-pre-training-and-post-training

Spotify: Are You a Dalia? How We Created Data Science Personas for Spotify’s Analytics Platform

It takes a village to engineer data. It requires balanced teamwork with multiple organizational personalities to cultivate a data-driven culture. Spotify writes an excellent article highlighting the key Data Science persona.

https://engineering.atspotify.com/2024/09/are-you-a-dalia-how-we-created-data-science-personas-for-spotifys-analytics-platform/

Kiran Prakash: Governing data products using fitness functions

How do we know all the published data products met the production quality? The obvious answer is running the fitness function. A fitness function is a test to evaluate how close a given implementation is to its stated design objectives. The blog narrates key design objectives of data products, such as accuracy, completeness, and timeliness, and explains how to run the data product fitness functions with data catalogs.

https://martinfowler.com/articles/fitness-functions-data-products.html

RazorPay: Enhancing Customer Support through AI-powered ticket issue categorization

The traditional ticket routing algorithm often yields a complex rule engine with a combination of customer-miscategorized topics. This inefficiency often results in poor customer experience. RazorPay writes about how it enhances the customer experience with an AI-powered ticket category classification system.

https://engineering.razorpay.com/enhancing-customer-support-through-ai-powered-ticket-issue-categorization-a1795d9f0f3c

Wix: AI for Revolutionizing Customer Care Routing System at Wix

Wix shares the same case study as RazorPay about enriching the routing system to improve the customer experience. The blog narrates the usage of Reinforcement learning-based solutions for customer care routing. It is interesting to see how the trade-off between a higher resolution rate and wait time impacts each other. Wix concluded that a higher resolution rate at the cost of wait time yields more customer satisfaction.

https://www.wix.engineering/post/ai-for-revolutionizing-customer-care-routing-system-at-wix

Tangbao: Flink SQL Development Experience Sharing

It is very rare to see open and honest learning about using technology, and this time, the author shares the development of Flink SQL. The author’s key learnings from working with Flink SQL include optimizing join and aggregation performance using lookup joins and two-phase aggregation to resolve data skew and minimize state overhead. Additionally, handling distinct aggregations more efficiently through Partial/Final aggregation was crucial, as was leveraging Flink's stream-batch integration for unified real-time and batch processing

https://www.alibabacloud.com/blog/flink-sql-development-experience-sharing_601569

StackOverflow: Best practices for cost-efficient Kafka clusters

We often discussed the cost of the data warehouse but missed the elephant in the room: Kafka. The fundamental function of Kafka handling millions of events per second, in combination with replication and tight coupling of the producer and the consumers to the cluster, puts more pressure on network bandwidth. All that leads to operating Kafka being a complex and expensive process. The StackOverflow blog narrates best design practices for running cost-efficient Kafka clusters.

https://stackoverflow.blog/2024/09/04/best-practices-for-cost-efficient-kafka-clusters

Tom Moertel: Sampling with SQL

Sampling is one of the most powerful tools for extracting meaning from large datasets. TABLESAMPLE is my fav, but it has its cons. The author explores various other sampling techniques to implement in SQL.

1. Weighted Sampling Without Replacement (A-ES Algorithm): This technique assigns each row a random key and selects the top rows based on these keys, ensuring fair sampling in proportion to row weights.

2. Deterministic Sampling: Uses pseudorandom functions to generate repeatable random samples.

3. Sampling With Replacement: This variation of the A-ES algorithm allows multiple selections of the same row by considering all possible arrivals from a Poisson process.

https://blog.moertel.com/posts/2024-08-23-sampling-with-sql.html

Xuanwo: Rewrite BigData in Rust

Rust is gaining popularity in data infrastructure, with more new systems built using it. The blog holds the index of popular rust tools in the data infrastructure, and I believe the trend continues to grow from here.

https://xuanwo.io/2024/07-rewrite-bigdata-in-rust/

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employer” opinions.

Data Engineering Weekly