Data Engineering Weekly #107

The Weekly Data Engineering Newsletter

Ananth Packkildurai
Nov 14, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.


Vicki Boykis: How I learn machine learning

Learning a new skill is always challenging, but that is where we grow as programmers. Innovation happens at the intersection of applying lessons from one domain to another. I'm a software engineer; how can I transition to being a machine learning engineer? The author shares the experience of one such transition.

https://vickiboykis.com/2022/11/10/how-i-learn-machine-learning/


Meta: Tulip - Schematizing Meta’s data platform

A data platform comprises numerous heterogeneous services, such as warehouse data storage and various real-time systems, and the schematization of data plays a vital role across them. Meta writes about its internal implementation of a schema management system at scale.

https://engineering.fb.com/2022/11/09/developer-tools/tulip-schematizing-metas-data-platform/


Luke Lin: A PM's thoughts on data contracts

Data contracts are a hot topic in data engineering. The conversation has moved from speculation to data engineers understanding their benefits and asking how soon they can get an implementation.

I've met many data leaders to discuss data contracts, my project Schemata, and how the extended version we are building can help them create high-quality data. The conversation mostly comes down to whether adoption should follow a carrot or a stick approach.

The author walks through various strategies a data contract platform can adopt to simplify adoption.

https://pmdata.substack.com/p/a-pms-thoughts-on-data-contracts

If you are a modern data leader interested in adopting data contracts, or simply want to talk through what they are, say hi on LinkedIn:

https://www.linkedin.com/in/ananthdurai/


Netflix: New Series - Creating Media with Machine Learning

Can ML replace creative content creators, or can it be an excellent assistant that takes creativity to new heights? Netflix writes about its ML platform for assisting its media production.

https://netflixtechblog.com/new-series-creating-media-with-machine-learning-5067ac110bcd


Sponsored: Build SQL Pipelines. Not Endless DAGs!

With Upsolver SQLake, you build a pipeline for data in motion simply by writing a SQL query defining your transformation.

  • Streaming and batch unified in a single platform

  • No Airflow - orchestration inferred from the data

  • $99 / TB of data ingested | transformations free

Start Your 30 Day Trial


Wealthfront: Event Tracking System at Wealthfront

A robust event-tracking system is critical for an efficient data management platform. Wealthfront writes about the end-to-end system design of its event tracking system on top of Avro.

https://eng.wealthfront.com/2022/11/07/event-tracking-system-at-wealthfront/
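
To make the idea concrete, here is a minimal sketch of schematized event tracking on top of Avro in Python, using the fastavro library. The TrackingEvent schema and its fields are hypothetical illustrations, not Wealthfront's actual event definitions.

```python
# A minimal sketch of schematized event tracking with Avro, using fastavro.
# The schema and field names are hypothetical, not Wealthfront's actual design.
import io
import time

from fastavro import parse_schema, schemaless_writer, schemaless_reader

# Hypothetical event schema: every tracked event carries a name, a user id,
# a timestamp, and a free-form string property map.
EVENT_SCHEMA = parse_schema({
    "type": "record",
    "name": "TrackingEvent",
    "namespace": "example.events",
    "fields": [
        {"name": "event_name", "type": "string"},
        {"name": "user_id", "type": "string"},
        {"name": "occurred_at_ms", "type": "long"},
        {"name": "properties", "type": {"type": "map", "values": "string"}},
    ],
})


def encode_event(event: dict) -> bytes:
    """Serialize an event to Avro binary; fails fast on schema violations."""
    buf = io.BytesIO()
    schemaless_writer(buf, EVENT_SCHEMA, event)
    return buf.getvalue()


def decode_event(payload: bytes) -> dict:
    """Deserialize an Avro-encoded event back into a dict."""
    return schemaless_reader(io.BytesIO(payload), EVENT_SCHEMA)


if __name__ == "__main__":
    raw = {
        "event_name": "page_view",
        "user_id": "u-123",
        "occurred_at_ms": int(time.time() * 1000),
        "properties": {"path": "/dashboard"},
    }
    print(decode_event(encode_event(raw)))
```

The payoff of schematizing at the producer is that malformed events are rejected at serialization time instead of silently polluting downstream tables.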


Eric Broda: Data Mesh: Making Climate Data Easy to Find, Use, and Share

OS-C is establishing an open-source collaboration community to build a data and software platform that will dramatically boost global capital flows into climate change mitigation and resilience. The author narrates how OS-C adopted data contracts and a federated data governance strategy to help fight climate change.

https://towardsdatascience.com/making-climate-data-easy-to-find-use-and-share-5190a0926407


Sponsored: Why You Should Care About Dimensional Data Modeling

It's easy to overlook all of the magic that happens inside the data warehouse. Here, Brian Lu details the core concepts of dimensional data modeling to give us a better appreciation of all the work that goes on beneath the surface. He covers the speed vs. granularity tradeoff, highlighting the denormalized table and why it's today's technique of choice, and he offers some clarity on how to think about the benefits of immutability.

https://www.rudderstack.com/blog/why-you-should-care-about-dimensional-data-modeling/


Etsy: Deep Learning for Search Ranking at Etsy

Etsy writes about its journey from gradient-boosted decision tree-based search ranking to a neural ranking model. The blog highlights the longer training window the neural ranking model needs compared to the decision tree model, as well as the need for high-quality, backward-compatible historical events.

https://www.etsy.com/codeascraft/deep-learning-for-search-ranking-at-etsy


Uber: Uber Freight Near-Real-Time Analytics Architecture

Uber writes about its Uber Freight architecture, highlighting how it achieves data freshness, latency, reliability, and accuracy. The design is a good testimony to Apache Pinot's performance, using index optimization techniques like JSON indexes, sorted columns, and star-tree indexes to accelerate query performance.

https://www.infoq.com/news/2022/11/uber-freight-analysis/
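
For readers unfamiliar with those indexing techniques, here is a rough sketch of how they appear in the tableIndexConfig section of an Apache Pinot table config, expressed as a Python dict. The table and column names are hypothetical and not Uber Freight's actual configuration.

```python
# A rough, illustrative sketch of Pinot's indexing knobs mentioned above.
# Column names (load_id, load_metadata, carrier_id, status, shipment_count)
# are hypothetical placeholders.
import json

table_index_config = {
    "tableIndexConfig": {
        # Keep segments physically sorted on the most selective filter column.
        "sortedColumn": ["load_id"],
        # JSON index enables filtering on fields inside a semi-structured column.
        "jsonIndexColumns": ["load_metadata"],
        # Star-tree index pre-aggregates along chosen dimensions to speed up
        # group-by and aggregation queries.
        "starTreeIndexConfigs": [
            {
                "dimensionsSplitOrder": ["carrier_id", "status"],
                "functionColumnPairs": ["SUM__shipment_count"],
                "maxLeafRecords": 10000,
            }
        ],
    }
}

print(json.dumps(table_index_config, indent=2))
```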


Sponsored: [New eBook] The Ultimate Data Observability Platform Evaluation Guide

Are you considering investing in a data quality solution? Before you add another tool to your data stack, check out our latest guide for 10 things to consider when evaluating data observability platforms, including scalability, time to value, and ease of setup.

Access Your Free Copy for Data Engineering Weekly Readers


Adam Stone: Using SQL to Summarize A/B Experiment Results

If you keep working on a problem, you soon discover the patterns that let you abstract it. The author shares one such experience, showing how A/B experiment analysis can be built on top of basic SQL queries.

https://medium.com/@foundinblank/using-sql-to-summarize-a-b-experiments-d30428edfb55
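
In that spirit, here is a minimal sketch of the kind of summarization query the post describes: one SQL statement that rolls raw exposure and conversion events up into per-variant metrics. The table and column names are hypothetical, and sqlite3 merely stands in for a warehouse.

```python
# A minimal, self-contained sketch of summarizing an A/B experiment in SQL.
# Hypothetical tables: exposures (who saw which variant) and conversions.
import sqlite3

SUMMARY_SQL = """
SELECT
    e.variant,
    COUNT(DISTINCT e.user_id) AS users,
    COUNT(DISTINCT c.user_id) AS converters,
    ROUND(1.0 * COUNT(DISTINCT c.user_id)
              / COUNT(DISTINCT e.user_id), 4) AS conversion_rate
FROM exposures e
LEFT JOIN conversions c
       ON c.user_id = e.user_id
      AND c.converted_at >= e.exposed_at
GROUP BY e.variant
ORDER BY e.variant;
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE exposures   (user_id TEXT, variant TEXT, exposed_at TEXT);
CREATE TABLE conversions (user_id TEXT, converted_at TEXT);
INSERT INTO exposures VALUES
    ('u1', 'control',   '2022-11-01'), ('u2', 'control',   '2022-11-01'),
    ('u3', 'treatment', '2022-11-01'), ('u4', 'treatment', '2022-11-01');
INSERT INTO conversions VALUES
    ('u2', '2022-11-02'), ('u3', '2022-11-02'), ('u4', '2022-11-03');
""")

for row in conn.execute(SUMMARY_SQL):
    print(row)  # (variant, users, converters, conversion_rate)
```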


Astronomer: How to Keep Data Quality in Check with Airflow

Airflow has made some leaps in improvement with TaskGroups and the dedicated SQLColumnCheckOperator and SQLTableCheckOperator. This blog is an excellent overview of incorporating data quality checks into Airflow.

https://medium.com/@astronomer.io/how-to-keep-data-quality-in-check-with-airflow-f7856443149a
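
As a taste of what the post covers, here is a minimal sketch of a DAG wiring up SQLColumnCheckOperator and SQLTableCheckOperator from the common-sql provider. The connection id, table name, and thresholds are placeholder assumptions, not values from the post.

```python
# A minimal sketch of Airflow data quality checks with the common-sql provider.
# The connection id ("warehouse") and table ("analytics.orders") are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import (
    SQLColumnCheckOperator,
    SQLTableCheckOperator,
)

with DAG(
    dag_id="orders_quality_checks",
    start_date=datetime(2022, 11, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Per-column assertions: null counts, uniqueness, and value ranges.
    column_checks = SQLColumnCheckOperator(
        task_id="column_checks",
        conn_id="warehouse",
        table="analytics.orders",
        column_mapping={
            "order_id": {
                "null_check": {"equal_to": 0},
                "unique_check": {"equal_to": 0},
            },
            "amount": {"min": {"geq_to": 0}},
        },
    )

    # Table-level assertions expressed as boolean SQL statements.
    table_checks = SQLTableCheckOperator(
        task_id="table_checks",
        conn_id="warehouse",
        table="analytics.orders",
        checks={"row_count_check": {"check_statement": "COUNT(*) > 0"}},
    )

    column_checks >> table_checks
```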


McDonald’s: Enabling advanced sales decomposition at McDonald’s

McDonald's writes about its always-on reporting system that powers executive reporting, sales decomposition, and scenario forecasting. The blog narrates the design of the data collection, modeling, and visualization layers.

https://medium.com/mcdonalds-technical-blog/enabling-advanced-sales-decomposition-at-mcdonalds-559a7311ac23


Swiggy: Architecture of CDC System

AWS DMS and DynamoDB Streams simplify taking a Change Data Capture (CDC) pipeline from zero to one. Swiggy writes about its adoption of CDC, with schema evolution and a reconciliation engine to handle late-arriving and unordered data.

https://bytes.swiggy.com/architecture-of-cdc-system-a975a081691f
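
To illustrate the reconciliation idea, here is a toy sketch that collapses late-arriving and unordered CDC records to the latest state per key. The record fields and logic are illustrative assumptions, not Swiggy's actual engine.

```python
# A toy sketch of CDC reconciliation: collapse out-of-order change records
# per primary key by keeping the version with the latest source timestamp.
# Field names are hypothetical; this is not Swiggy's actual implementation.
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class ChangeRecord:
    key: str            # primary key of the source row
    updated_at: int     # source-side update timestamp (epoch ms)
    payload: dict       # row image carried by the CDC event
    is_delete: bool = False


def reconcile(records: Iterable[ChangeRecord]) -> Dict[str, ChangeRecord]:
    """Return the latest known state per key, regardless of arrival order."""
    latest: Dict[str, ChangeRecord] = {}
    for rec in records:
        current = latest.get(rec.key)
        # Late-arriving records with an older timestamp are ignored.
        if current is None or rec.updated_at > current.updated_at:
            latest[rec.key] = rec
    # Deletes win if they carry the newest timestamp; drop them from the view.
    return {k: v for k, v in latest.items() if not v.is_delete}


if __name__ == "__main__":
    out_of_order = [
        ChangeRecord("order-1", 200, {"status": "DELIVERED"}),
        ChangeRecord("order-1", 100, {"status": "PLACED"}),  # late arrival
        ChangeRecord("order-2", 150, {"status": "CANCELLED"}, is_delete=True),
    ]
    print(reconcile(out_of_order))
```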


Coinbase: Kafka infrastructure renovation at Coinbase

This is probably the first time I have heard the term "infrastructure renovation." Coinbase writes about its internal Kafka platform, whose features include multi-cluster management, a streaming SDK, ACL support, and Kafdrop, a web UI for viewing Kafka topics and browsing consumer groups.

https://www.coinbase.com/blog/kafka-infrastructure-renovation


All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
