Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Vicki Boykis: How I learn machine learning
Adopting a new skill is always challenging, but that is where we grow as programmers. Innovation happens at the intersection of applying learning from one domain to another. I'm a software engineer; how can I transition to a Machine Learning engineer? The author shares the experience of one such transition.
https://vickiboykis.com/2022/11/10/how-i-learn-machine-learning/
Meta: Tulip - Schematizing Meta’s data platform
Numerous heterogeneous services make up a data platform, such as warehouse data storage and various real-time systems. The schematization of data plays a vital role in a data platform. Meta writes about its internal schema management system at scale.
https://engineering.fb.com/2022/11/09/developer-tools/tulip-schematizing-metas-data-platform/
Luke Lin: A PM's thoughts on data contracts
Data Contracts are a hot topic in data engineering. They have moved from speculation to a stage where data engineers understand their benefits and are asking how soon they can get an implementation.
I met many data leaders to discuss Data Contracts, my project Schemata, and how the extended version we are building can help them create high-quality data. The conversation mostly centers on whether to adopt a carrot or a stick approach.
The author walks through various strategies a data contract platform can adopt to simplify adoption.
https://pmdata.substack.com/p/a-pms-thoughts-on-data-contracts
If you are a modern data leader and interested in adopting Data Contracts, or simply want to understand what they are, say hi on LinkedIn.
https://www.linkedin.com/in/ananthdurai/
Netflix: New Series - Creating Media with Machine Learning
Can ML replace creative content generators, or can it be an excellent assistant to take creativity to a new height? Netflix writes about its ML platform to assist its media production.
https://netflixtechblog.com/new-series-creating-media-with-machine-learning-5067ac110bcd
Sponsored: Build SQL Pipelines. Not Endless DAGs!
With Upsolver SQLake, you build a pipeline for data in motion simply by writing a SQL query defining your transformation.
Streaming and batch unified in a single platform
No Airflow - orchestration inferred from the data
$99 / TB of data ingested | transformations free
Wealthfront: Event Tracking System at Wealthfront
A robust event-tracking system is critical for an efficient data management platform. Wealthfront writes about the end-to-end system design of its event tracking system on top of Avro.
https://eng.wealthfront.com/2022/11/07/event-tracking-system-at-wealthfront/
Eric Broda: Data Mesh: Making Climate Data Easy to Find, Use, and Share
OS-C is establishing an Open Source collaboration community to build a data and software platform that will dramatically boost global capital flows into climate change mitigation and resilience. The author narrates how OS-C adopted Data Contract and federated data governance strategy to help fight against climate change.
https://towardsdatascience.com/making-climate-data-easy-to-find-use-and-share-5190a0926407
Sponsored: Why You Should Care About Dimensional Data Modeling
It's easy to overlook all of the magic that happens inside the data warehouse. Here, Brian Lu details the core concepts of dimensional data modeling to give us a better appreciation of all the work that goes on beneath the surface. He covers the speed vs. granularity tradeoff, highlighting the denormalized table and why it's today's technique of choice, and he offers some clarity on how to think about the benefits of immutability.
https://www.rudderstack.com/blog/why-you-should-care-about-dimensional-data-modeling/
Etsy: Deep Learning for Search Ranking at Etsy
Etsy writes about its journey from gradient-boosted decision tree-based search ranking to a neural ranking model. The blog highlights the need for a longer training window for the neural ranking model compared to the decision tree model, and the need for high-quality, backward-compatible historical events.
https://www.etsy.com/codeascraft/deep-learning-for-search-ranking-at-etsy
Uber: Uber Freight Near-Real-Time Analytics Architecture
Uber writes about its Uber Freight architecture, highlighting how it achieves data freshness, latency, reliability, and accuracy. The design is a good testimony to Apache Pinot's performance, with index optimization techniques like JSON indexes, sorted columns, and Star-tree indexes to accelerate query performance.
https://www.infoq.com/news/2022/11/uber-freight-analysis/
Sponsored: [New eBook] The Ultimate Data Observability Platform Evaluation Guide
Are you considering investing in a data quality solution? Before you add another tool to your data stack, check out our latest guide for 10 things to consider when evaluating data observability platforms, including scalability, time to value, and ease of setup.
Access Your Free Copy for Data Engineering Weekly Readers
Adam Stone: Using SQL to Summarize A/B Experiment Results
If you keep working on a problem, you soon discover patterns that let you abstract it. The author shares one such experience, showing how all A/B experiment analytics can be built on top of basic SQL queries.
https://medium.com/@foundinblank/using-sql-to-summarize-a-b-experiments-d30428edfb55
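The core idea of summarizing an experiment in SQL is a single GROUP BY over assignment-and-outcome events. As a minimal, self-contained sketch (using stdlib sqlite3; the table and column names here are illustrative, not taken from the article):

```python
import sqlite3

# Hypothetical assignment/conversion events for one experiment.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE experiment_events (
    user_id   INTEGER,
    variant   TEXT,     -- 'control' or 'treatment'
    converted INTEGER   -- 1 if the user converted, else 0
);
INSERT INTO experiment_events VALUES
    (1, 'control', 0), (2, 'control', 1), (3, 'control', 0), (4, 'control', 1),
    (5, 'treatment', 1), (6, 'treatment', 1), (7, 'treatment', 0), (8, 'treatment', 1);
""")

# One query summarizes the experiment: users, conversions, and rate per variant.
summary = conn.execute("""
    SELECT variant,
           COUNT(*)                 AS users,
           SUM(converted)           AS conversions,
           ROUND(AVG(converted), 2) AS conversion_rate
    FROM experiment_events
    GROUP BY variant
    ORDER BY variant
""").fetchall()

for row in summary:
    print(row)
# → ('control', 4, 2, 0.5)
# → ('treatment', 4, 3, 0.75)
```

The same shape extends to any metric that can be expressed as a per-user aggregate, which is what makes the SQL-first approach reusable across experiments.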
Astronomer: How to Keep Data Quality in Check with Airflow
Airflow has made significant improvements with TaskGroups and the dedicated SQLColumnCheckOperator and SQLTableCheckOperator. This blog is an excellent overview of incorporating data quality checks with Airflow.
https://medium.com/@astronomer.io/how-to-keep-data-quality-in-check-with-airflow-f7856443149a
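Both operators boil down to running SQL assertions against a table: column checks assert a predicate over every row, and table checks assert a single aggregate. A library-free sketch of that idea (plain sqlite3 rather than the Airflow operators themselves; the table and thresholds are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, amount REAL);
INSERT INTO orders VALUES (1, 9.99), (2, 24.50), (3, 5.00);
""")

def column_check(table, predicate):
    # Mirrors the spirit of SQLColumnCheckOperator:
    # fail if any row violates the per-row predicate.
    bad = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE NOT ({predicate})"
    ).fetchone()[0]
    return bad == 0

def table_check(table, check_statement):
    # Mirrors the spirit of SQLTableCheckOperator:
    # evaluate one aggregate assertion over the whole table.
    ok = conn.execute(
        f"SELECT {check_statement} FROM {table}"
    ).fetchone()[0]
    return bool(ok)

checks = {
    "amount_not_null": column_check("orders", "amount IS NOT NULL"),
    "amount_positive": column_check("orders", "amount > 0"),
    "row_count":       table_check("orders", "COUNT(*) >= 3"),
}
print(checks)
```

In Airflow, the equivalent checks would be declared as operator parameters (a `column_mapping` for column checks, a `checks` dict for table checks) and fail the task when any assertion is violated.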
McDonald’s: Enabling advanced sales decomposition at McDonald’s
McDonald's writes about its always-on reporting system that powers executive reporting, sales decomposition, and scenario forecasting. The blog narrates the design of the data collection, modeling, and visualization layers.
Swiggy: Architecture of CDC System
AWS DMS and DynamoDB Streams simplify building a Change Data Capture (CDC) pipeline from zero to one. Swiggy writes about its adoption of CDC with schema evolution and a reconciliation engine to handle late-arriving and unordered data.
https://bytes.swiggy.com/architecture-of-cdc-system-a975a081691f
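A common way a reconciliation engine handles late-arriving and unordered change events is last-writer-wins by source timestamp: an event only updates the materialized row if it is newer than what has already been applied. A minimal sketch of that pattern (field names and events are hypothetical, not Swiggy's actual schema):

```python
from datetime import datetime

# Hypothetical CDC events arriving out of order.
events = [
    {"key": "order-1", "ts": datetime(2022, 11, 1, 10, 0), "status": "created"},
    {"key": "order-1", "ts": datetime(2022, 11, 1, 10, 5), "status": "shipped"},
    # Late-arriving event with an *older* source timestamp:
    {"key": "order-1", "ts": datetime(2022, 11, 1, 10, 2), "status": "paid"},
]

def reconcile(events):
    """Last-writer-wins by source timestamp: a late event with an older
    timestamp never overwrites newer state already applied for that key."""
    state = {}
    for ev in events:
        current = state.get(ev["key"])
        if current is None or ev["ts"] > current["ts"]:
            state[ev["key"]] = ev
    return state

state = reconcile(events)
print(state["order-1"]["status"])  # the 10:05 'shipped' event wins
```

Ordering by the source-side timestamp rather than arrival order is what makes the pipeline tolerant of out-of-order delivery.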
Coinbase: Kafka infrastructure renovation at Coinbase
This is probably the first time I have heard the term "infrastructure renovation." Coinbase writes about its internal Kafka platform features that support multi-cluster management, a streaming SDK, ACL support, and enablement of Kafdrop, a web UI for viewing Kafka topics and browsing consumer groups.
https://www.coinbase.com/blog/kafka-infrastructure-renovation
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.