Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Vicki Boykis: How I learn machine learning
Adopting a new skill is always challenging, but that is where we grow as programmers. Innovation happens at the intersection of applying learning from one domain to another. I'm a software engineer; how can I transition to a Machine Learning engineer? The author shares the experience of one such transition.
https://vickiboykis.com/2022/11/10/how-i-learn-machine-learning/
Meta: Tulip - Schematizing Meta’s data platform
Numerous heterogeneous services make up a data platform, such as warehouse data storage and various real-time systems. The schematization of data plays a vital role in a data platform. Meta writes about its internal schema management system at scale.
https://engineering.fb.com/2022/11/09/developer-tools/tulip-schematizing-metas-data-platform/
Luke Lin: A PM's thoughts on data contracts
Data Contracts are a hot topic in data engineering. They have moved from speculation to a stage where data engineers understand their benefits and are asking how soon they can get an implementation.
I met many data leaders to discuss Data Contracts, my project Schemata, and how the extended version we are building can help them create high-quality data. The conversation mostly centers on whether to adopt a carrot or a stick approach.
The author walks through various strategies a data contract platform can adopt to simplify adoption.
https://pmdata.substack.com/p/a-pms-thoughts-on-data-contracts
If you are a modern data leader and interested in adopting Data Contracts, or simply want to understand what they are, say hi on LinkedIn.
https://www.linkedin.com/in/ananthdurai/
Netflix: New Series - Creating Media with Machine Learning
Can ML replace creative content generators, or can it be an excellent assistant to take creativity to a new height? Netflix writes about its ML platform to assist its media production.
https://netflixtechblog.com/new-series-creating-media-with-machine-learning-5067ac110bcd
Sponsored: Build SQL Pipelines. Not Endless DAGs!
With Upsolver SQLake, you build a pipeline for data in motion simply by writing a SQL query defining your transformation.
Streaming and batch unified in a single platform
No Airflow - orchestration inferred from the data
$99 / TB of data ingested | transformations free
Wealthfront: Event Tracking System at Wealthfront
A robust event-tracking system is critical for an efficient data management platform. Wealthfront writes about the end-to-end system design of its event tracking system on top of Avro.
https://eng.wealthfront.com/2022/11/07/event-tracking-system-at-wealthfront/
Eric Broda: Data Mesh: Making Climate Data Easy to Find, Use, and Share
OS-C is establishing an Open Source collaboration community to build a data and software platform that will dramatically boost global capital flows into climate change mitigation and resilience. The author narrates how OS-C adopted Data Contract and federated data governance strategy to help fight against climate change.
https://towardsdatascience.com/making-climate-data-easy-to-find-use-and-share-5190a0926407
Sponsored: Why You Should Care About Dimensional Data Modeling
It's easy to overlook all of the magic that happens inside the data warehouse. Here, Brian Lu details the core concepts of dimensional data modeling to give us a better appreciation of all the work that goes on beneath the surface. He covers the speed vs. granularity tradeoff, highlighting the denormalized table and why it's today's technique of choice, and he offers some clarity on how to think about the benefits of immutability.
https://www.rudderstack.com/blog/why-you-should-care-about-dimensional-data-modeling/
Etsy: Deep Learning for Search Ranking at Etsy
Etsy writes about its journey from gradient-boosted decision tree-based search ranking to a neural ranking model. The blog highlights the need for a longer training window for the neural ranking model compared to the decision tree model, and the need for high-quality, backward-compatible historical events.
https://www.etsy.com/codeascraft/deep-learning-for-search-ranking-at-etsy
Uber: Uber Freight Near-Real-Time Analytics Architecture
Uber writes about its Uber Freight architecture, highlighting how it achieves data freshness, latency, reliability, and accuracy. The design is a good testimony to Apache Pinot's performance, with index optimization techniques like JSON indexes, sorted columns, and Star-tree indexes to accelerate query performance.
https://www.infoq.com/news/2022/11/uber-freight-analysis/
Sponsored: [New eBook] The Ultimate Data Observability Platform Evaluation Guide
Are you considering investing in a data quality solution? Before you add another tool to your data stack, check out our latest guide for 10 things to consider when evaluating data observability platforms, including scalability, time to value, and ease of setup.
Access Your Free Copy for Data Engineering Weekly Readers
Adam Stone: Using SQL to Summarize A/B Experiment Results
If you keep working on a problem, you soon discover patterns that let you abstract it. The author shares one such experience, showing how all A/B experiment analytics can be built on top of basic SQL queries.
https://medium.com/@foundinblank/using-sql-to-summarize-a-b-experiments-d30428edfb55
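The core idea of summarizing an experiment in SQL is a single GROUP BY over assignment-and-outcome events. As a minimal, self-contained sketch (using stdlib sqlite3; the table and column names here are illustrative, not taken from the article):

```python
import sqlite3

# Hypothetical assignment/conversion events for one experiment.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE experiment_events (
    user_id   INTEGER,
    variant   TEXT,     -- 'control' or 'treatment'
    converted INTEGER   -- 1 if the user converted, else 0
);
INSERT INTO experiment_events VALUES
    (1, 'control', 0), (2, 'control', 1), (3, 'control', 0), (4, 'control', 1),
    (5, 'treatment', 1), (6, 'treatment', 1), (7, 'treatment', 0), (8, 'treatment', 1);
""")

# One query summarizes the experiment: users, conversions, and rate per variant.
summary = conn.execute("""
    SELECT variant,
           COUNT(*)                 AS users,
           SUM(converted)           AS conversions,
           ROUND(AVG(converted), 2) AS conversion_rate
    FROM experiment_events
    GROUP BY variant
    ORDER BY variant
""").fetchall()

for row in summary:
    print(row)
# → ('control', 4, 2, 0.5)
# → ('treatment', 4, 3, 0.75)
```

The same shape extends to any metric that can be expressed as a per-user aggregate, which is what makes the SQL-first approach reusable across experiments.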
Astronomer: How to Keep Data Quality in Check with Airflow
Airflow has made significant improvements with TaskGroups and the dedicated SQLColumnCheckOperator and SQLTableCheckOperator. This blog is an excellent overview of incorporating data quality checks with Airflow.
https://medium.com/@astronomer.io/how-to-keep-data-quality-in-check-with-airflow-f7856443149a
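Both operators boil down to running SQL assertions against a table: column checks assert a predicate over every row, and table checks assert a single aggregate. A library-free sketch of that idea (plain sqlite3 rather than the Airflow operators themselves; the table and thresholds are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, amount REAL);
INSERT INTO orders VALUES (1, 9.99), (2, 24.50), (3, 5.00);
""")

def column_check(table, predicate):
    # Mirrors the spirit of SQLColumnCheckOperator:
    # fail if any row violates the per-row predicate.
    bad = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE NOT ({predicate})"
    ).fetchone()[0]
    return bad == 0

def table_check(table, check_statement):
    # Mirrors the spirit of SQLTableCheckOperator:
    # evaluate one aggregate assertion over the whole table.
    ok = conn.execute(
        f"SELECT {check_statement} FROM {table}"
    ).fetchone()[0]
    return bool(ok)

checks = {
    "amount_not_null": column_check("orders", "amount IS NOT NULL"),
    "amount_positive": column_check("orders", "amount > 0"),
    "row_count":       table_check("orders", "COUNT(*) >= 3"),
}
print(checks)
```

In Airflow, the equivalent checks would be declared as operator parameters (a `column_mapping` for column checks, a `checks` dict for table checks) and fail the task when any assertion is violated.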
McDonald’s: Enabling advanced sales decomposition at McDonald’s
McDonald's writes about its always-on reporting system that powers executive reporting, sales decomposition, and scenario forecasting. The blog narrates the design of the data collection, modeling, and visualization layers.
Swiggy: Architecture of CDC System
AWS DMS and DynamoDB Streams simplify building a Change Data Capture (CDC) pipeline from zero to one. Swiggy writes about its adoption of CDC with schema evolution and a reconciliation engine to handle late-arriving and unordered data.
https://bytes.swiggy.com/architecture-of-cdc-system-a975a081691f
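A common way a reconciliation engine handles late-arriving and unordered change events is last-writer-wins by source timestamp: an event only updates the materialized row if it is newer than what has already been applied. A minimal sketch of that pattern (field names and events are hypothetical, not Swiggy's actual schema):

```python
from datetime import datetime

# Hypothetical CDC events arriving out of order.
events = [
    {"key": "order-1", "ts": datetime(2022, 11, 1, 10, 0), "status": "created"},
    {"key": "order-1", "ts": datetime(2022, 11, 1, 10, 5), "status": "shipped"},
    # Late-arriving event with an *older* source timestamp:
    {"key": "order-1", "ts": datetime(2022, 11, 1, 10, 2), "status": "paid"},
]

def reconcile(events):
    """Last-writer-wins by source timestamp: a late event with an older
    timestamp never overwrites newer state already applied for that key."""
    state = {}
    for ev in events:
        current = state.get(ev["key"])
        if current is None or ev["ts"] > current["ts"]:
            state[ev["key"]] = ev
    return state

state = reconcile(events)
print(state["order-1"]["status"])  # the 10:05 'shipped' event wins
```

Ordering by the source-side timestamp rather than arrival order is what makes the pipeline tolerant of out-of-order delivery.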
Coinbase: Kafka infrastructure renovation at Coinbase
This is probably the first time I have heard the term "infrastructure renovation." Coinbase writes about its internal Kafka platform features that support multi-cluster management, a streaming SDK, ACL support, and enablement of Kafdrop, a web UI for viewing Kafka topics and browsing consumer groups.
https://www.coinbase.com/blog/kafka-infrastructure-renovation
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.