Try Fully Managed Apache Airflow for FREE
Run Airflow without the hassle and management complexity. Take Astro (the fully managed Airflow solution) for a test drive today and unlock a suite of features designed to simplify, optimize, and scale your data pipelines. For a limited time, new sign-ups will receive a complimentary Airflow Fundamentals Certification exam (normally $150).
Gradient Flow: Rethinking Analyst Roles in the Age of Generative AI
The future of analysis isn't about replacement; it's about augmentation.
I saw someone quoted in X that using LLM to code is like using Google Maps to drive. At some point, we would wonder how people coded without code agents. The author explores what the future holds for the analyst and their relationship with GenAI.
https://gradientflow.substack.com/p/rethinking-analyst-roles-in-the-age
Financial Times: Putting the (Gen AI) Puzzle Together
In April 2024, Nikkei(the parent company of FT) issued a press release announcing internally built and trained Nikkei Language Models (NiLM). NiLM is a series of LLMs specializing in economic and financial information, trained on 40 years’ content of Nikkei group titles and among the largest and highest-quality LLMs in Japanese. The blog, like the previous one, explores the role of Gen AI in the publishing industry.
https://medium.com/ft-product-technology/putting-the-gen-ai-puzzle-together-9470c3aaef7a
Sebastian Raschka: New LLM Pre-training and Post-training Paradigms
The previous two blogs show a glimpse of the next wave in Gen-AI: the ability to train internal business/ content with LLM and its business impact. The LLM pre-training and post-training paradigms are rolling out the foundational future to enable organizations to build with internal knowledge. The author explores how leading language models in the market support pre and post-training paradigms.
https://magazine.sebastianraschka.com/p/new-llm-pre-training-and-post-training
Sponsored: Introduction to GenAI with Apache Airflow®
Interested in building Generative AI applications but not sure where to get started? Astronomer's latest academy module, Introduction to GenAI with Airflow, covers just this! 🎓
This FREE module will cover how Apache Airflow helps power GenAI apps, then move into a hands-on project where you'll build a RAG pipeline that produces a content generation application. You’ll also use some of the most in-demand technologies, including Streamlit, OpenAI, LangChain, and Weaviate!
Spotify: Are You a Dalia? How We Created Data Science Personas for Spotify’s Analytics Platform
It takes a village to engineer data. It requires balanced teamwork with multiple organizational personalities to cultivate a data-driven culture. Spotify writes an excellent article highlighting the key Data Science persona.
Kiran Prakash: Governing data products using fitness functions
How do we know all the published data products met the production quality? The obvious answer is running the fitness function. A fitness function is a test to evaluate how close a given implementation is to its stated design objectives. The blog narrates key design objectives of data products, such as accuracy, completeness, and timeliness, and explains how to run the data product fitness functions with data catalogs.
https://martinfowler.com/articles/fitness-functions-data-products.html
Sponsored: Monte Carlo Demo - September
Data quality is a core pillar of any data governance strategy, but how do you build a data quality program, let alone operationalize it?
Join Monte Carlo for a deep dive session into how data governance leaders can take their data quality strategies to the next level with end-to-end data observability. Learn how data observability platforms like Monte Carlo automate detection and incident management workflows.
RazorPay: Enhancing Customer Support through AI-powered ticket issue categorization
The traditional ticket routing algorithm often yields a complex rule engine with a combination of customer-miscategorized topics. This inefficiency often results in poor customer experience. RazorPay writes about how it enhances the customer experience with an AI-powered ticket category classification system.
Wix: AI for Revolutionizing Customer Care Routing System at Wix
Wix shares the same case study as RazorPay about enriching the routing system to improve the customer experience. The blog narrates the usage of Reinforcement learning-based solutions for customer care routing. It is interesting to see how the trade-off between a higher resolution rate and wait time impacts each other. Wix concluded that a higher resolution rate at the cost of wait time yields more customer satisfaction.
https://www.wix.engineering/post/ai-for-revolutionizing-customer-care-routing-system-at-wix
Tangbao: Flink SQL Development Experience Sharing
It is very rare to see open and honest learning about using technology, and this time, the author shares the development of Flink SQL. The author’s key learnings from working with Flink SQL include optimizing join and aggregation performance using lookup joins and two-phase aggregation to resolve data skew and minimize state overhead. Additionally, handling distinct aggregations more efficiently through Partial/Final aggregation was crucial, as was leveraging Flink's stream-batch integration for unified real-time and batch processing
https://www.alibabacloud.com/blog/flink-sql-development-experience-sharing_601569
StackOverflow: Best practices for cost-efficient Kafka clusters
We often discussed the cost of the data warehouse but missed the elephant in the room: Kafka. The fundamental function of Kafka handling millions of events per second, in combination with replication and tight coupling of the producer and the consumers to the cluster, puts more pressure on network bandwidth. All that leads to operating Kafka being a complex and expensive process. The StackOverflow blog narrates best design practices for running cost-efficient Kafka clusters.
https://stackoverflow.blog/2024/09/04/best-practices-for-cost-efficient-kafka-clusters
Tom Moertel: Sampling with SQL
Sampling is one of the most powerful tools for extracting meaning from large datasets. TABLESAMPLE is my fav, but it has its cons. The author explores various other sampling techniques to implement in SQL.
1. Weighted Sampling Without Replacement (A-ES Algorithm): This technique assigns each row a random key and selects the top rows based on these keys, ensuring fair sampling in proportion to row weights.
2. Deterministic Sampling: Uses pseudorandom functions to generate repeatable random samples.
3. Sampling With Replacement: This variation of the A-ES algorithm allows multiple selections of the same row by considering all possible arrivals from a Poisson process.
https://blog.moertel.com/posts/2024-08-23-sampling-with-sql.html
Xuanwo: Rewrite BigData in Rust
Rust is gaining popularity in data infrastructure, with more new systems built using it. The blog holds the index of popular rust tools in the data infrastructure, and I believe the trend continues to grow from here.
https://xuanwo.io/2024/07-rewrite-bigdata-in-rust/
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employer” opinions.