Data Engineering Weekly #113

The Weekly Data Engineering Newsletter

Jan 09, 2023

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

Editor’s Note: State Of Data 2023 Conference - 18-Jan-2023

The Seattle Data Guy team is running an excellent day-track online conference, “The State of Data 2023.” I’m giving a talk on “The Functional Data Engineering - A Blueprint,” based on my recent article of the same title. Don’t miss out; click the link below to register.

https://www.eventbrite.com/e/state-of-data-2023-tickets-468776622497

Max Illis: On Data Contracts, Data Products, and Muesli

The article is an excellent overview of the Data Contract platform and how it fosters collaboration among the many stakeholders involved in data creation and value generation. The author explains what data products are using muesli as an example, which makes for an excellent analogy.

https://medium.com/@maxillis/on-data-contracts-data-products-and-muesli-84fe2d143e2c

Balu Rama Chandra: Excelling at dbt - Jinja & Macros for modular and cleaner SQL Queries

The rise of Apache Airflow & dbt makes Jinja templating a must-know tool for analytics engineering. The two-part article is an excellent introduction to Jinja templating with dbt.

https://blog.devgenius.io/excelling-at-dbt-jinja-macros-for-modular-and-cleaner-sql-queries-part-1-2-55e29d4b29e2

https://blog.devgenius.io/excelling-at-dbt-jinja-macros-for-modular-and-cleaner-sql-queries-part-2-2-88949c1af46c

Data Science @Microsoft: A layered approach to MLOps

Optimizing the ML workflow is a vital goal for MLOps, as it is for any good developer platform. Microsoft's data science team writes about its observations of the ML workflow using a layered approach:

Data Science code layer
Specification layer
Orchestration layer

https://medium.com/data-science-at-microsoft/a-layered-approach-to-mlops-d935beefca2e

Furcy Pin: 2003–2023: A Brief History of Big Data

Technological advancement leaves traces along the way, and it is important to look back to understand how it evolved. Looking back not only reveals the pattern but also helps us recognize the Black Swan events we might otherwise miss. The author takes an excellent walk down the memory lane of Big Data, then and now.

https://towardsdatascience.com/2003-2023-a-brief-history-of-big-data-25712351a6bc

Anna Geller: What I learned from NormConf 2022

NormConf is an online tech conference about things that matter in data and ML but don’t get the spotlight. I enjoyed watching some of the thought-provoking talks from leading industry experts. The author recounts the NormConf experience and summarizes some of the excellent talks.

https://medium.com/the-prefect-blog/what-i-learned-from-normconf-2022-f8b3c88f0de7

Full Conference YouTube Playlist

https://www.youtube.com/playlist?list=PLYXaKIsOZBsu3h2SSKEovRn7rGy7wkUAV

Netflix: Data Reprocessing Pipeline in Asset Management Platform @ Netflix

Netflix writes about the natural evolution of its Asset Management Platform into a data processing pipeline. It is exciting to read about the common characteristics of the asset management platform, such as schema validation, versioning, access control, sharing, and triggering configured workflows, which naturally pave the way for a data processing pipeline.

https://netflixtechblog.medium.com/data-reprocessing-pipeline-in-asset-management-platform-netflix-46fe225c35c9

Shopify: Monte Carlo Simulations: Separating Signal from Noise in Sampled Success Metrics

Sometimes, we don’t have the luxury of processing all the data to compute success metrics, so we rely on sampled success metrics instead. How do we separate the noise from the success signal over time? Shopify writes about how Monte Carlo simulation helps separate signal from noise.

https://shopifyengineering.myshopify.com/blogs/engineering/monte-carlo-simulations-sampled-success-metrics
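
To make the idea concrete, here is a minimal sketch of using Monte Carlo simulation to judge whether a change in a sampled metric is signal or noise. It is an illustration, not Shopify's implementation: the baseline rate, the sample size, the number of simulations, and all function names are assumptions made up for this example. It repeatedly simulates the sampled metric under an unchanged baseline and treats the middle 95% of those results as the noise band.

```python
import random


def simulate_sampled_metric(true_rate, sample_size, n_simulations, seed=0):
    """Simulate the sampled success rate many times under a fixed true rate.

    All parameters are hypothetical inputs chosen for illustration.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    results = []
    for _ in range(n_simulations):
        # Draw one sample and count how many items "succeed".
        successes = sum(1 for _ in range(sample_size) if rng.random() < true_rate)
        results.append(successes / sample_size)
    return results


def noise_band(simulated, coverage=0.95):
    """Return the (low, high) empirical percentiles that bound sampling noise."""
    ordered = sorted(simulated)
    tail = (1.0 - coverage) / 2.0
    low_index = int(tail * (len(ordered) - 1))
    high_index = int((1.0 - tail) * (len(ordered) - 1))
    return ordered[low_index], ordered[high_index]


def is_signal(observed, band):
    """An observed sampled metric outside the noise band is likely a real change."""
    low, high = band
    return observed < low or observed > high


# Assumed baseline: 80% success rate, measured on samples of 500 items.
baseline_rate = 0.80
simulated = simulate_sampled_metric(baseline_rate, sample_size=500, n_simulations=2000)
band = noise_band(simulated)

print("noise band:", band)
print("0.80 is signal?", is_signal(0.80, band))
print("0.95 is signal?", is_signal(0.95, band))
```

A wider band means the sample is too small to detect small changes; increasing the assumed sample size narrows it.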

Expedia: AI, Personalization, and Openness: Exploring the Definitive Tech Trends of 2023

Expedia writes about 2023 tech trends and what they mean for the travel industry. The key trends from the predictions:

Tech gets (hyper) personal
Platforms become more open and accessible
Human-centric design is in the spotlight
AI might feel the frost

(Hyper) personalization is an excellent trend to watch; we see a pattern of internal recommendation APIs serving various personalization product features.

https://medium.com/expedia-group-tech/ai-personalization-and-openness-exploring-the-definitive-tech-trends-of-2023-d5ba45875c80

Twilio: Presto on AWS at Twilio - Lesson Learned and Optimization

I often feel the impact of Presto on data engineering is very underappreciated. Presto most likely helped systems like Snowflake overcome the perception problem.

Twilio shares its experience running Presto on AWS with some excellent optimization techniques.

https://prestodb.io/blog/2022/12/28/presto-at-twilio.html

Talk

The National Archives: CSV Schema Language

I recently learned that CSV has a schema specification standard. Maybe I’m slow to discover this, but when I found it, I was super excited for reasons I can’t quite explain. I’m still trying to process it 😊 Nonetheless, I enjoyed reading the draft version of the specification, which combines both the schema and validation specifications.

https://digital-preservation.github.io/

All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
