Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Data Contracts for SaaS Developers with Benn Stancil
The Thanksgiving break gave me enough time to catch up on a few podcasts. I'm a fan of the "SaaS Developer Community" podcast and Benn's writing, and I can't miss any conversation about Data Contracts.😎
Benn captured the overall goal of Data Contracts, and the skepticism around them, very well. I have a long list of thoughts on this conversation, which might need a blog post of its own. For now, I want to address one comment from the conversation.
“Developer’s job is to ship the application code, not to make your dashboard look good”
I agree that shipping the application code is the priority. But what is “application code”? Take a few Slack features as examples: “Compose a DM,” “Channel Selection,” “Invite Members,” or “Invite Reminder.” Machine learning powers every one of these features. Maybe Slack is among the 1% of companies that implement data engineering effectively to drive product features, but that is exactly the point of implementing data contracts and shifting left: an efficient data creation process.
If you think data is only for unknown dashboards and back-office needs, and data is not part of your product strategy, then sure, you don’t need Data Contracts. But if you want to be among the 1% of companies that differentiate their product experience and business operations with data, you need to focus on implementing Data Contracts.
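To make "shifting left" a little more concrete, here is a minimal sketch of a contract check that a producing service could run before it emits an event. The event name, the field names, and the `violations` helper are all hypothetical and chosen only for illustration; real data contracts usually live in a schema registry or a shared schema definition rather than a Python dict.

```python
# Hypothetical contract for an "invite_sent" event.
# Field names and types are assumptions made for this example.
CONTRACT = {
    "event": str,
    "workspace_id": str,
    "invitee_count": int,
}


def violations(record, contract=CONTRACT):
    """Return a list of human-readable contract violations (empty if valid)."""
    problems = []
    for field, expected in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


if __name__ == "__main__":
    good = {"event": "invite_sent", "workspace_id": "T1", "invitee_count": 3}
    bad = {"event": "invite_sent", "invitee_count": "3"}
    print(violations(good))
    print(violations(bad))
```

The point of the sketch is where the check runs: at the producer, before the data is created, rather than in a downstream pipeline after a dashboard has already broken.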
Ian Macomber: Data Systems Tend Towards Production
I've seen many data predictions year after year, but I'm always a fan of folks writing a look back at what happened in the industry to illuminate future trends. This is possibly one of the best reads I have had recently in data engineering. The author highlights three emerging patterns:
Systems Tend Towards Production
Systems Tend Towards Blind Federation
Systems Tend Towards Layerinitis
https://ian-macomber.medium.com/data-systems-tend-towards-production-be5a86f65561
Meta AI: CICERO - An AI agent that negotiates, persuades, and cooperates with people
Did Meta successfully privatize world peace? 🤔
Meta writes about CICERO – the first AI to achieve human-level performance in the popular strategy game Diplomacy. CICERO demonstrated this by playing on webDiplomacy.net, an online version of the game. CICERO achieved more than double the average score of the human players and ranked in the top 10 percent of participants who played more than one game!!!
https://ai.facebook.com/blog/cicero-ai-negotiates-persuades-and-cooperates-with-people/
Airbnb: How AI Text Generation Models Are Reshaping Customer Support at Airbnb
Airbnb writes about moving customer support from simple response templates to AI text generation models. The section on the real-time agent assistant model is an exciting read.
Sponsored: [Live Webinar] How JetBlue Builds Trust in Data and Improves Model Accuracy
The data team at JetBlue Airways, a leading carrier in the United States, is responsible for powering insights for the entire organization’s operational and customer service activities. Learn how JetBlue’s data engineering and data science teams leverage Monte Carlo and Snowflake together to accelerate data analysis and drive business value.
Myntra: Quicksilver - Near Real Time Platform at Myntra
Myntra writes about its near-real-time streaming platform built on top of Kafka, Flink & Spark. It is a great overview of streaming infrastructure characteristics.
https://medium.com/myntra-engineering/quicksilver-near-real-time-platform-at-myntra-9e8edf6ede91
Airbnb: Building Airbnb Categories with ML and Human-in-the-Loop
Data & Machine Learning are increasingly powering applications and driving user experience. Airbnb writes about one such case: building Airbnb Categories for travel with ML and human-in-the-loop.
Sponsored: It’s Time for the Headless CDP
In this piece, RudderStack CEO Soumyadeb Mitra makes the case for a new approach to the customer data platform: the headless CDP. He defines the headless CDP as a tool with open architecture, purpose-built for data and engineering teams, that makes it easy to collect customer data from every source, build your customer 360 in your own warehouse, then make that data available to your entire stack.
https://www.rudderstack.com/blog/it-s-time-for-the-headless-cdp/
Becket Qin: Apache Flink SQL - Past, Present, and Future
Flink SQL has made significant advancements in unifying batch and real-time computation. The blog captures the history of Flink SQL, its current state, and the challenges ahead of it. The stream-stream join is still expensive to operate; I’m excited to see the future progress of Flink SQL and how it can simplify operating streaming infrastructure.
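One way to see why stream-stream joins are costly is that both inputs have to be buffered in state until they can no longer match. The toy interval join below is an illustrative sketch only, not how Flink implements joins; the `IntervalJoin` class, its method names, and the watermark handling are assumptions made for this example.

```python
from collections import defaultdict


class IntervalJoin:
    """Toy stream-stream join: pair left/right events that share a key and
    whose timestamps differ by at most `window` seconds."""

    def __init__(self, window):
        self.window = window
        # Both sides must be kept in state, which is what makes the join costly.
        self.state = {"left": defaultdict(list), "right": defaultdict(list)}

    def on_event(self, side, key, ts, value):
        """Buffer the event and return (left_value, right_value) matches."""
        other = "right" if side == "left" else "left"
        matches = []
        for other_ts, other_value in self.state[other].get(key, []):
            if abs(ts - other_ts) <= self.window:
                pair = (value, other_value) if side == "left" else (other_value, value)
                matches.append(pair)
        self.state[side][key].append((ts, value))
        return matches

    def expire(self, watermark):
        """Drop buffered events older than `watermark - window`."""
        for side in self.state.values():
            for key in list(side):
                side[key] = [(ts, v) for ts, v in side[key] if ts >= watermark - self.window]
                if not side[key]:
                    del side[key]

    def buffered(self):
        """Total number of events currently held in state."""
        return sum(len(events) for side in self.state.values() for events in side.values())


if __name__ == "__main__":
    join = IntervalJoin(window=10)
    join.on_event("left", "u1", 100, "click")
    print(join.on_event("right", "u1", 105, "order"))
    print(join.buffered())
```

Without the `expire` step the buffered state grows without bound, and with a wide window it still grows with the event rate on both sides, which is one source of the operating cost.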
https://www.ververica.com/blog/apache-flink-sql-past-present-and-future
LINE Engineering: A story of introducing data lineage into LINE's large-scale data platform
I thought Apache Atlas was largely forgotten at this stage, but LINE writes an exciting blog about its usage of Apache Atlas for data lineage. Too many data lineage visualizations can also confuse users, and it is good to see that the LINE data team highlights this edge case and how they solved it.
https://engineering.linecorp.com/en/blog/data-lineage-on-line-big-data-platform
AutoTrader: Real-Time Personalisation of Search Results with Auto Trader's Customer Data Platform
Feature Snippets are a vital technique to elevate the search & discovery experience for the users. AutoTrader writes about the system design of its customer segmentation to drive the Feature Snippet in its search experience.
Adrian Bednarz: DBT repository — to split or not to split?
Should an organization keep a single dbt monorepo or split it into multiple repos? Build systems like Bazel and Pants encourage monorepos, but they come with operational and implementation costs. The author narrates how dbt packages help minimize code duplication and encourage multi-repo patterns.
https://techwithadrian.medium.com/dbt-repository-to-split-or-not-to-split-909d366d0998
Nic Crane: Type inference in readr and arrow
Every engineer has their own horror stories about working with CSV files. We could write any number of blogs on why you don’t want to use CSV files, but CSV remains widely used in data science because it is a simple, human-readable format that is widely known and understood. The simplicity of CSV is also its drawback; one such drawback is the lack of a type system. The author narrates how readr and Apache Arrow infer types while reading a CSV file.
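Because a CSV file carries no type information, every reader has to guess. The sketch below shows the general idea with a deliberately naive rule (try int, then float, else string) using only the Python standard library. It is not the algorithm readr or Arrow use; the `infer_type` and `infer_schema` helpers are hypothetical names for this example.

```python
import csv
import io


def infer_type(values):
    """Pick the first type that parses every non-empty value: int, float, else str."""
    non_empty = [v for v in values if v != ""]
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "str"


def infer_schema(csv_text):
    """Read CSV text with a header row and guess a type for each column."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    columns = zip(*body)  # transpose rows into columns
    return {name: infer_type(col) for name, col in zip(header, columns)}


if __name__ == "__main__":
    sample = "id,price,city\n1,9.5,Paris\n2,10,Oslo\n"
    print(infer_schema(sample))
    # A column of zero-padded codes is guessed as int, which would drop the padding.
    print(infer_type(["007", "012"]))
```

The zero-padded case shows why guessing is risky: the values parse cleanly as integers, so a naive reader silently changes identifiers such as postal codes unless the column type is stated explicitly.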
https://thisisnic.github.io/2022/11/21/type-inference-in-readr-and-arrow/
All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.