Data Engineering Weekly #188

The Weekly Data Engineering Newsletter

Ananth Packkildurai
Sep 09, 2024
Try Fully Managed Apache Airflow for FREE

Run Airflow without the hassle and management complexity. Take Astro (the fully managed Airflow solution) for a test drive today and unlock a suite of features designed to simplify, optimize, and scale your data pipelines. For a limited time, new sign-ups will receive a complimentary Airflow Fundamentals Certification exam (normally $150).

Try For Free →


Gradient Flow: Rethinking Analyst Roles in the Age of Generative AI

The future of analysis isn't about replacement; it's about augmentation.

I saw someone on X say that using an LLM to code is like using Google Maps to drive: at some point, we will wonder how people ever coded without coding agents. The author explores what the future holds for analysts and their relationship with GenAI.

https://gradientflow.substack.com/p/rethinking-analyst-roles-in-the-age


Financial Times: Putting the (Gen AI) Puzzle Together

In April 2024, Nikkei (the parent company of the FT) issued a press release announcing its internally built and trained Nikkei Language Models (NiLM). NiLM is a series of LLMs specializing in economic and financial information, trained on 40 years of content from Nikkei group titles, and is among the largest and highest-quality Japanese LLMs. The blog, like the previous one, explores the role of Gen AI in the publishing industry.

https://medium.com/ft-product-technology/putting-the-gen-ai-puzzle-together-9470c3aaef7a


Sebastian Raschka: New LLM Pre-training and Post-training Paradigms

The previous two blogs offer a glimpse of the next wave of Gen AI: training LLMs on internal business content and the business impact that follows. The pre-training and post-training paradigms lay the foundation for organizations to build on their internal knowledge. The author explores how the leading language models on the market approach pre-training and post-training.

https://magazine.sebastianraschka.com/p/new-llm-pre-training-and-post-training


Sponsored: Introduction to GenAI with Apache Airflow®

Interested in building Generative AI applications but not sure where to get started? Astronomer's latest academy module, Introduction to GenAI with Airflow, covers just this! 🎓

This FREE module will cover how Apache Airflow helps power GenAI apps, then move into a hands-on project where you'll build a RAG pipeline that produces a content generation application. You’ll also use some of the most in-demand technologies, including Streamlit, OpenAI, LangChain, and Weaviate!

Start Learning →


Spotify: Are You a Dalia? How We Created Data Science Personas for Spotify’s Analytics Platform

It takes a village to engineer data: cultivating a data-driven culture requires balanced teamwork across multiple organizational personalities. Spotify writes an excellent article highlighting its key data science personas.

https://engineering.atspotify.com/2024/09/are-you-a-dalia-how-we-created-data-science-personas-for-spotifys-analytics-platform/


Kiran Prakash: Governing data products using fitness functions

How do we know that all published data products meet production-quality standards? The obvious answer is to run fitness functions. A fitness function is a test that evaluates how close a given implementation is to its stated design objectives. The blog walks through key design objectives of data products, such as accuracy, completeness, and timeliness, and explains how to run data product fitness functions with data catalogs.
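
To make the idea concrete, here is a minimal sketch of two fitness functions, assuming the data product is materialized as a pandas DataFrame; the column names and thresholds are illustrative, not from the article.

```python
# Minimal sketch: fitness functions for completeness and timeliness of a data
# product, assuming it is loaded as a pandas DataFrame. Column names and
# thresholds are hypothetical examples.
from datetime import timedelta

import pandas as pd


def completeness(df: pd.DataFrame, required_columns: list[str]) -> float:
    """Share of non-null values across the columns the data contract requires."""
    subset = df[required_columns]
    return 1.0 - subset.isna().sum().sum() / subset.size


def timeliness(df: pd.DataFrame, ts_column: str, max_lag: timedelta) -> bool:
    """True if the freshest record is within the agreed maximum lag."""
    newest = pd.to_datetime(df[ts_column], utc=True).max()
    return pd.Timestamp.now(tz="UTC") - newest <= max_lag


def fitness_report(df: pd.DataFrame) -> dict:
    """Evaluate the data product against its (hypothetical) design objectives."""
    return {
        "completeness_ok": completeness(df, ["order_id", "amount"]) >= 0.99,
        "timeliness_ok": timeliness(df, "updated_at", timedelta(hours=24)),
    }
```

In practice, these checks would run against the catalog or warehouse rather than an in-memory frame, and the results would be published as governance metadata for each product.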

https://martinfowler.com/articles/fitness-functions-data-products.html


Sponsored: Monte Carlo Demo - September

Data quality is a core pillar of any data governance strategy, but how do you build a data quality program, let alone operationalize it?

Join Monte Carlo for a deep dive session into how data governance leaders can take their data quality strategies to the next level with end-to-end data observability. Learn how data observability platforms like Monte Carlo automate detection and incident management workflows.

montecarlodata.com


RazorPay: Enhancing Customer Support through AI-powered ticket issue categorization

Traditional ticket routing often ends up as a complex rule engine layered on top of customer-miscategorized topics, and the resulting inefficiency leads to a poor customer experience. RazorPay writes about how it enhances the customer experience with an AI-powered ticket-category classification system.

https://engineering.razorpay.com/enhancing-customer-support-through-ai-powered-ticket-issue-categorization-a1795d9f0f3c


Wix: AI for Revolutionizing Customer Care Routing System at Wix

Wix shares a case study similar to RazorPay's about enriching the routing system to improve the customer experience. The blog describes a reinforcement learning-based solution for customer care routing. The trade-off between a higher resolution rate and longer wait times is particularly interesting: Wix concluded that a higher resolution rate, even at the cost of wait time, yields greater customer satisfaction.

https://www.wix.engineering/post/ai-for-revolutionizing-customer-care-routing-system-at-wix


Tangbao: Flink SQL Development Experience Sharing

It is rare to see open and honest lessons learned from using a technology; here, the author shares their experience developing with Flink SQL. The key learnings include optimizing join and aggregation performance using lookup joins and two-phase aggregation to resolve data skew and minimize state overhead. Handling distinct aggregations more efficiently through partial/final aggregation was also crucial, as was leveraging Flink's stream-batch integration for unified real-time and batch processing.
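
As a rough illustration of the two-phase aggregation and split-distinct optimizations the author mentions, the PyFlink snippet below enables the relevant Flink SQL optimizer options; it is a sketch under default assumptions, and the commented query and table are hypothetical.

```python
# Sketch: enabling two-phase (local/global) aggregation and split-distinct
# aggregation for Flink SQL jobs via PyFlink. Values are illustrative.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
config = t_env.get_config()

# Two-phase aggregation requires mini-batching to be enabled.
config.set("table.exec.mini-batch.enabled", "true")
config.set("table.exec.mini-batch.allow-latency", "5 s")
config.set("table.exec.mini-batch.size", "5000")
config.set("table.optimizer.agg-phase-strategy", "TWO_PHASE")

# Split skewed COUNT(DISTINCT ...) into partial/final aggregations.
config.set("table.optimizer.distinct-agg.split.enabled", "true")

# Hypothetical query over a hypothetical `clicks` table:
# t_env.execute_sql(
#     "SELECT user_id, COUNT(DISTINCT page_id) FROM clicks GROUP BY user_id"
# )
```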

https://www.alibabacloud.com/blog/flink-sql-development-experience-sharing_601569


StackOverflow: Best practices for cost-efficient Kafka clusters

We often discuss the cost of the data warehouse but miss the elephant in the room: Kafka. Kafka's fundamental job of handling millions of events per second, combined with replication and the tight coupling of producers and consumers to the cluster, puts heavy pressure on network bandwidth. All of that makes operating Kafka a complex and expensive undertaking. The StackOverflow blog describes design best practices for running cost-efficient Kafka clusters.
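
Producer-side batching and compression are among the usual levers for reducing network and broker load; the confluent-kafka settings below are an illustrative sketch rather than the blog's specific recommendations, and the broker address and topic are placeholders.

```python
# Sketch: a confluent-kafka producer tuned for batching and compression to
# reduce network bandwidth. Values and names are illustrative placeholders.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker address
    "compression.type": "lz4",              # compress batches on the wire
    "linger.ms": 50,                        # wait briefly to fill larger batches
    "batch.size": 131072,                   # allow up to 128 KiB per batch
    "acks": "all",                          # durability vs. throughput trade-off
})

producer.produce("events", value=b'{"user_id": 42}')  # hypothetical topic
producer.flush()
```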

https://stackoverflow.blog/2024/09/04/best-practices-for-cost-efficient-kafka-clusters


Tom Moertel: Sampling with SQL

Sampling is one of the most powerful tools for extracting meaning from large datasets. TABLESAMPLE is my favorite, but it has its drawbacks. The author explores several other sampling techniques to implement in SQL:

1. Weighted Sampling Without Replacement (A-ES Algorithm): This technique assigns each row a random key and selects the top rows based on these keys, ensuring fair sampling in proportion to row weights (see the sketch after this list).

2. Deterministic Sampling: Uses pseudorandom functions to generate repeatable random samples.

3. Sampling With Replacement: This variation of the A-ES algorithm allows multiple selections of the same row by considering all possible arrivals from a Poisson process.
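
Here is a minimal sketch of the first technique, A-ES-style weighted sampling without replacement, run as SQL against an in-memory SQLite database from Python; the table, weights, and the `aes_key` helper are hypothetical, and the blog's own SQL differs in detail.

```python
# Minimal sketch of A-ES weighted sampling without replacement in SQL.
# Each row gets the key u ** (1 / weight) with u ~ Uniform(0, 1); the rows
# with the top-k keys form the sample.
import random
import sqlite3

conn = sqlite3.connect(":memory:")
# Register a helper because SQLite's built-in random() is not Uniform(0, 1).
conn.create_function("aes_key", 1, lambda weight: random.random() ** (1.0 / weight))

conn.executescript("""
    CREATE TABLE items (name TEXT, weight REAL);
    INSERT INTO items VALUES ('a', 1.0), ('b', 2.0), ('c', 4.0), ('d', 8.0);
""")

# Sample 2 rows without replacement, with probability proportional to weight.
sample = conn.execute("""
    SELECT name
    FROM items
    ORDER BY aes_key(weight) DESC
    LIMIT 2
""").fetchall()
print(sample)
```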

https://blog.moertel.com/posts/2024-08-23-sampling-with-sql.html


Xuanwo: Rewrite BigData in Rust

Rust is gaining popularity in data infrastructure, with more new systems built using it. The blog maintains an index of popular Rust tools in data infrastructure, and I believe the trend will continue to grow from here.

https://xuanwo.io/2024/07-rewrite-bigdata-in-rust/


All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent my current, former, or future employers' opinions.

