Data Engineering Weekly #109

The Weekly Data Engineering Newsletter

Nov 28, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

Data Contracts for SaaS Developers with Benn Stancil

The Thanksgiving break gives me enough time to catch up on a few podcasts. I'm a fan of the "SaaS Developer Community" podcast and Benn's writing, and I can't miss any conversation about Data Contracts.😎

Benn captured the overall goal of Data Contracts, and the skepticism around them, very well. I have a long list of thoughts on this conversation, which might need a blog post of its own. For now, I want to address one comment from the conversation.

“A developer’s job is to ship the application code, not to make your dashboard look good.”

I agree that shipping the application code is the priority. But what is “application code”? Take Slack features such as “Compose a DM,” “Channel Selection,” “Invite Members,” or “Invite Reminder.” Machine Learning powers every application feature listed above. Maybe Slack is among the 1% of companies that implement data engineering effectively to drive product features, but that is exactly the point of implementing data contracts and shifting left for an efficient data creation process.

If you think data is only for unknown dashboards and back-office needs, and data is not part of your product strategy, then sure, you don’t need Data Contracts. But if you want to be among the 1% of companies that differentiate their product experience and business operations with data, you need to focus on implementing Data Contracts.
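To make "shifting left" concrete, here is a minimal sketch of a producer-side contract check in Python. Everything in it is an assumption for illustration: the `invite_sent` event, its field names, and the `ContractField` and `validate` helpers are hypothetical and do not come from the podcast or from any specific data contract tool.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ContractField:
    """One field in the contract: its name, expected type, and whether it is required."""
    name: str
    type: type
    required: bool = True


# Hypothetical contract for an illustrative "invite_sent" event.
INVITE_SENT_CONTRACT = [
    ContractField("event_id", str),
    ContractField("inviter_id", int),
    ContractField("invitee_email", str),
    ContractField("channel_id", int, required=False),
]


def validate(event: dict[str, Any], contract: list[ContractField]) -> list[str]:
    """Return a list of contract violations; an empty list means the event conforms."""
    errors = []
    for field in contract:
        value = event.get(field.name)
        if value is None:
            if field.required:
                errors.append(f"missing required field: {field.name}")
            continue
        if not isinstance(value, field.type):
            errors.append(
                f"{field.name}: expected {field.type.__name__}, "
                f"got {type(value).__name__}"
            )
    # Fields the producer added without updating the contract.
    known = {field.name for field in contract}
    for name in sorted(event.keys() - known):
        errors.append(f"unexpected field: {name}")
    return errors
```

Running a check like this in the application's own test suite or CI, rather than in a downstream pipeline, is one way to catch a breaking change before it ships. A real contract would likely also cover semantics, ownership, and versioning, which this sketch leaves out.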

Ian Macomber: Data Systems Tend Towards Production

I've seen many data predictions year after year, but I'm always a fan of folks writing a look back at what happened in the industry to shed light on future trends. This is possibly one of the best reads I have had recently in data engineering. The author highlights three emerging patterns:

Systems Tend Towards Production
Systems Tend Towards Blind Federation
Systems Tend Towards Layerinitis

https://ian-macomber.medium.com/data-systems-tend-towards-production-be5a86f65561

Meta AI: CICERO - An AI agent that negotiates, persuades, and cooperates with people

Did Meta successfully privatize world peace? 🤔

Meta writes about CICERO – the first AI to achieve human-level performance in the popular strategy game Diplomacy. CICERO demonstrated this by playing on webDiplomacy.net, an online version of the game. CICERO achieved more than double the average score of the human players and ranked in the top 10 percent of participants who played more than one game!!!

https://ai.facebook.com/blog/cicero-ai-negotiates-persuades-and-cooperates-with-people/

Airbnb: How AI Text Generation Models Are Reshaping Customer Support at Airbnb

Airbnb writes about moving its customer support from simple response templates to AI text generation models. The real-time agent assistant model is an exciting read.

https://medium.com/airbnb-engineering/how-ai-text-generation-models-are-reshaping-customer-support-at-airbnb-a851db0b4fa3

Myntra: Quicksilver - Near Real Time Platform at Myntra

Myntra writes about its near-real-time streaming platform built on top of Kafka, Flink & Spark. It is a great overview of streaming infrastructure characteristics.

https://medium.com/myntra-engineering/quicksilver-near-real-time-platform-at-myntra-9e8edf6ede91

Airbnb: Building Airbnb Categories with ML and Human-in-the-Loop

Data & Machine Learning are increasingly powering applications and driving the user experience. Airbnb writes about one such case: building its travel categories with ML and human-in-the-loop.

https://medium.com/airbnb-engineering/building-airbnb-categories-with-ml-and-human-in-the-loop-e97988e70ebb

Becket Qin: Apache Flink SQL - Past, Present, and Future

Flink SQL has made significant advancements in unifying batch and real-time computation. The blog captures the history of Flink SQL, its current state, and the challenges ahead of it. The stream-stream join is still expensive to operate, and I’m excited to see the future progress of Flink SQL and how it can simplify operating streaming infrastructure.

https://www.ververica.com/blog/apache-flink-sql-past-present-and-future

LINE Engineering: A story of introducing data lineage into LINE's large-scale data platform

I thought Apache Atlas was largely forgotten at this stage, but LINE writes an exciting blog about using Apache Atlas for data lineage. Too many data lineage visualizations can confuse users, and it is exciting that the LINE data team highlights this edge case and how they solved it.

https://engineering.linecorp.com/en/blog/data-lineage-on-line-big-data-platform

AutoTrader: Real-Time Personalisation of Search Results with Auto Trader's Customer Data Platform

Feature Snippets are a vital technique to elevate the search & discovery experience for the users. AutoTrader writes about the system design of its customer segmentation to drive the Feature Snippet in its search experience.

https://engineering.autotrader.co.uk/2022/11/23/enabling-real-time-personilsation-with-our-in-house-customer-data-platfom.html

Adrian Bednarz: DBT repository — to split or not to split?

Should an organization keep dbt in a monorepo or split it into multiple repos? Build systems like Bazel and Pants encourage monorepos, but that comes with operational and implementation costs. The author narrates how dbt packages help minimize code duplication and encourage multi-repo patterns.

https://techwithadrian.medium.com/dbt-repository-to-split-or-not-to-split-909d366d0998

Nic Crane: Type inference in readr and arrow

Every engineer has their own horror stories about working with CSV files. We could write any number of blogs on why you don’t want to use CSV files, but CSV is widely used in data science because it is a simple, human-readable format that is widely known and understood. That simplicity is also its drawback, and one such drawback is the lack of a type system. The author narrates how readr and Apache Arrow infer types when reading a CSV file.

https://thisisnic.github.io/2022/11/21/type-inference-in-readr-and-arrow/
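As a rough illustration of what type inference over a CSV file involves, here is a minimal standard-library sketch in Python. It is not the algorithm readr or Apache Arrow use; the `infer_type` and `infer_schema` helpers and the order of candidate types (int, float, bool, string) are assumptions made for this example, and it assumes every row has the same number of cells as the header.

```python
import csv
import io


def infer_type(values):
    """Pick the first candidate type that fits every non-empty value in a column."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "null"
    # Try the narrower numeric type first, then the wider one.
    for name, parse in (("int", int), ("float", float)):
        try:
            for v in non_empty:
                parse(v)
            return name
        except ValueError:
            continue
    if all(v.lower() in ("true", "false") for v in non_empty):
        return "bool"
    return "string"


def infer_schema(csv_text):
    """Infer a {column: type} mapping from CSV text that starts with a header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    # Transpose rows into columns; with no data rows, every column is empty.
    columns = zip(*data) if data else [()] * len(header)
    return {name: infer_type(col) for name, col in zip(header, columns)}
```

Even this toy version shows why inference is fragile: the result depends on which rows are inspected and on parsing rules, since Python's `int` and `float` accept inputs such as `1_000` or `nan` that another reader might treat as strings.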

All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
