Data Engineering Weekly

Data Engineering Weekly #113

The Weekly Data Engineering Newsletter

Ananth Packkildurai
Jan 9

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.


Editor’s Note: State Of Data 2023 Conference - 18-Jan-2023

The Seattle Data Guy team is running an excellent one-day online conference, “The State of Data 2023”. I’m giving a talk on “The Functional Data Engineering - A Blueprint,” based on my recent article of the same title. Don’t miss out; click the link below to register.

https://www.eventbrite.com/e/state-of-data-2023-tickets-468776622497


Max Illis: On Data Contracts, Data Products, and Muesli

The article is an excellent overview of the Data Contract platform and how it brings collaboration among the multiple stakeholders in the data creation and value generation process. The author explains what data products are using muesli as an example, which is an excellent analogy.

https://medium.com/@maxillis/on-data-contracts-data-products-and-muesli-84fe2d143e2c


Balu Rama Chandra: Excelling at dbt - Jinja & Macros for modular and cleaner SQL Queries

The rise of Apache Airflow & dbt makes Jinja templating a must-know tool for analytics engineering. The article is an excellent introduction to Jinja templating with dbt.

https://blog.devgenius.io/excelling-at-dbt-jinja-macros-for-modular-and-cleaner-sql-queries-part-1-2-55e29d4b29e2

https://blog.devgenius.io/excelling-at-dbt-jinja-macros-for-modular-and-cleaner-sql-queries-part-2-2-88949c1af46c
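
To give a flavor of what Jinja templating buys you in SQL, here is a minimal sketch that renders a templated query with the jinja2 Python library directly, outside dbt. The payment methods, the column names, and the table name are hypothetical and chosen only for illustration; dbt layers its own functions and macros on top of plain Jinja, and none of those are used here.

```python
from jinja2 import Template

# Hypothetical payment methods and table name, chosen only for illustration.
payment_methods = ["card", "bank_transfer", "gift_card"]

# One templated line replaces three near-identical hand-written SQL lines.
# The "{%-" form strips the whitespace before the tag to keep the output tidy.
sql_template = Template(
    """select
    order_id,
{%- for method in payment_methods %}
    sum(case when payment_method = '{{ method }}' then amount else 0 end) as {{ method }}_amount{{ "," if not loop.last else "" }}
{%- endfor %}
from {{ table }}
group by order_id"""
)

rendered_sql = sql_template.render(payment_methods=payment_methods, table="payments")
print(rendered_sql)
```

Adding a payment method then means adding one item to the list rather than editing the query, which is the kind of modularity the articles above build on.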


Data Science @Microsoft: A layered approach to MLOps

Optimizing the ML workflow is a vital goal for MLOps, as it is for any good developer platform. Microsoft's data science team writes about their observations of the ML workflow using a layered approach:

  1. Data Science code layer

  2. Specification layer

  3. Orchestration layer

https://medium.com/data-science-at-microsoft/a-layered-approach-to-mlops-d935beefca2e


Sponsored: Upsolver - Write a SQL Query, Get a Data-in-Motion Pipeline!

Pipelines for data in motion can quickly turn into DAG hell. Upsolver SQLake lets you process fast-moving data by simply writing a SQL query.

  • Streaming plus batch unified in a single platform.

  • Stateful processing at scale - joins, aggregations, upserts

  • Orchestration auto-generated from the data and SQL

  • Templates with sample data for Kafka/Kinesis/S3 sources -> S3/Athena/Snowflake/Redshift

Try now and get 30 Days Free


Furcy Pin: 2003–2023: A Brief History of Big Data

Technology advancement leaves traces along the way, and it is important to look back to understand the pattern of evolution. Looking back not only helps us understand that pattern but also helps us recognize the Black Swan events we might otherwise miss. The author takes an excellent walk down the memory lane of Big Data, then and now.

https://towardsdatascience.com/2003-2023-a-brief-history-of-big-data-25712351a6bc


Anna Geller: What I learned from NormConf 2022

NormConf is an online tech conference about things that matter in data and ML but don’t get the spotlight. I enjoyed watching some of the thought-provoking talks from leading industry experts. The author recounts the NormConf experience and summarizes some of the excellent talks.

https://medium.com/the-prefect-blog/what-i-learned-from-normconf-2022-f8b3c88f0de7

Full Conference YouTube Playlist

https://www.youtube.com/playlist?list=PLYXaKIsOZBsu3h2SSKEovRn7rGy7wkUAV


Sponsored: 10 Things to Consider Before Choosing a Data Observability Platform

Ready to stop fighting bad data and explore end-to-end coverage with Data Observability? Learn the 10 most important things to consider when choosing a data observability platform. Get the new platform guide, and take the next step in your journey to data trust.

Get The Guide


Netflix: Data Reprocessing Pipeline in Asset Management Platform @ Netflix

Netflix writes about the natural evolution of its Asset Management Platform into a data processing pipeline. It is exciting to read about the common characteristics of an asset management platform, such as schema validation, versioning, access control, sharing, and triggering configured workflows, which naturally pave the way for a data processing pipeline.

https://netflixtechblog.medium.com/data-reprocessing-pipeline-in-asset-management-platform-netflix-46fe225c35c9


Shopify: Monte Carlo Simulations: Separating Signal from Noise in Sampled Success Metrics

Sometimes, we won’t have the luxury of processing all the data to compute success metrics, so we rely on sampled success metrics. How do we separate the noise from the success signal over time? Shopify writes about how Monte Carlo simulation helps separate signal from noise.

https://shopifyengineering.myshopify.com/blogs/engineering/monte-carlo-simulations-sampled-success-metrics
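
As a rough illustration of the idea (not Shopify's implementation), the sketch below simulates a sampled success rate many times under a fixed true rate to see how much the metric moves from sampling alone, then checks whether an observed value falls outside that band. The true rate, sample size, simulation count, and function names are all assumptions made for this example.

```python
import random

def simulate_sampled_rates(true_rate, sample_size, n_simulations, seed=0):
    """Simulate the sampled success rate many times under a fixed true rate."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_simulations):
        # Each sampled item succeeds independently with probability true_rate.
        successes = sum(1 for _ in range(sample_size) if rng.random() < true_rate)
        rates.append(successes / sample_size)
    return rates

def noise_band(rates, coverage=0.95):
    """Return the central interval holding roughly `coverage` of simulated rates."""
    ordered = sorted(rates)
    tail = (1.0 - coverage) / 2.0
    low_index = int(tail * (len(ordered) - 1))
    high_index = int((1.0 - tail) * (len(ordered) - 1))
    return ordered[low_index], ordered[high_index]

def is_signal(observed_rate, band):
    """Treat an observation as signal only if it falls outside the noise band."""
    low, high = band
    return observed_rate < low or observed_rate > high

# Assumed numbers: a 90% true success rate measured on samples of 200 items.
rates = simulate_sampled_rates(true_rate=0.90, sample_size=200, n_simulations=2000)
band = noise_band(rates)
print(band, is_signal(0.91, band), is_signal(0.80, band))
```

With these assumed numbers, a reading of 0.91 stays inside the band and looks like sampling noise, while 0.80 falls well outside it. A smaller sample widens the band, so the same change becomes harder to tell apart from noise.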


Sponsored: Take Control of Your Customer Data With RudderStack

Legacy CDPs charge you a premium to keep your data in a black box. RudderStack builds your CDP on top of your data warehouse, giving you a more secure and cost-effective solution. Plus, it gives you more technical controls, so you can fully unlock the power of your customer data. 

Take control of your customer data today.


Expedia: AI, Personalization, and Openness: Exploring the Definitive Tech Trends of 2023

Expedia writes about 2023 tech trends and what they mean for the travel industry. The key trends from the predictions are:

  1. Tech gets (hyper) personal

  2. Platforms become more open and accessible

  3. Human-centric design is in the spotlight

  4. AI might feel the frost

(Hyper) personalization is an excellent trend to watch; we see a pattern of internal recommendation APIs serving various personalization product features.

https://medium.com/expedia-group-tech/ai-personalization-and-openness-exploring-the-definitive-tech-trends-of-2023-d5ba45875c80


Twilio: Presto on AWS at Twilio - Lesson Learned and Optimization

I often feel the impact of Presto on data engineering is very underappreciated. Presto most likely helped systems like Snowflake overcome the perception problem.

Twilio shares its experience running Presto on AWS with some excellent optimization techniques.

https://prestodb.io/blog/2022/12/28/presto-at-twilio.html

Talk


The National Archives: CSV Schema Language

I recently learned that CSV has a schema specification standard. Maybe I’m slow to discover this, but when I found it, I was super excited for reasons I can’t quite explain. I’m still trying to process it 😊 Nonetheless, I enjoyed reading the draft version of the specification, which combines both the schema and validation specifications.

https://digital-preservation.github.io/
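
To make "schema plus validation" concrete, here is a small sketch in plain Python. It does not use the CSV Schema Language syntax or any tooling from the specification; the column names, the rules, and the `validate_csv` helper are hypothetical and only mirror the idea of declaring one rule per column and checking every row against it.

```python
import csv
import io

# Hypothetical schema: one rule (a predicate) per expected column.
# This mirrors the idea of schema plus validation, not the CSV Schema syntax.
SCHEMA = {
    "name": lambda value: value != "",
    "age": lambda value: value.isdecimal() and 0 <= int(value) <= 120,
    "country": lambda value: value in {"IN", "GB", "US"},
}

def validate_csv(text, schema):
    """Return (line_number, column, value) for every rule violation."""
    reader = csv.DictReader(io.StringIO(text))
    # The header must list exactly the columns the schema declares, in order.
    if reader.fieldnames != list(schema):
        raise ValueError(f"expected columns {list(schema)}, got {reader.fieldnames}")
    errors = []
    for row in reader:
        for column, rule in schema.items():
            if not rule(row[column]):
                errors.append((reader.line_num, column, row[column]))
    return errors

sample = "name,age,country\nAsha,34,IN\n,29,GB\nLee,abc,US\n"
errors = validate_csv(sample, SCHEMA)
print(errors)
```

The sample has an empty name on line 3 and a non-numeric age on line 4, so both rows are reported while the first data row passes.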


All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
