Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Editor’s Note: State Of Data 2023 Conference - 18-Jan-2023
The Seattle Data Guy team is running an excellent one-day online conference, “The State of Data 2023.” I’m giving a talk on “The Functional Data Engineering - A Blueprint,” based on my recent article of the same title. Don’t miss out; click the link below to register.
https://www.eventbrite.com/e/state-of-data-2023-tickets-468776622497
Max Illis: On Data Contracts, Data Products, and Muesli
The article is an excellent overview of the Data Contract platform and how it enables collaboration among the multiple stakeholders in the data creation and value generation process. The author explains what data products are through the example of muesli, which is an excellent analogy.
https://medium.com/@maxillis/on-data-contracts-data-products-and-muesli-84fe2d143e2c
Balu Rama Chandra: Excelling at dbt - Jinja & Macros for modular and cleaner SQL Queries
The rise of Apache Airflow and dbt makes Jinja templating a must-know tool for analytics engineering. The article is an excellent introduction to getting started with Jinja templating in dbt.
Data Science @Microsoft: A layered approach to MLOps
Optimizing the ML workflow is a vital goal for MLOps, as it is for any good developer platform. Microsoft's data science team shares its observations of the ML workflow using a layered approach:
Data science code layer
Specification layer
Orchestration layer
https://medium.com/data-science-at-microsoft/a-layered-approach-to-mlops-d935beefca2e
Sponsored: Upsolver - Write a SQL Query, Get a Data-in-Motion Pipeline!
Pipelines for data in motion can quickly turn into DAG hell. Upsolver SQLake lets you process fast-moving data by simply writing a SQL query.
Streaming plus batch unified in a single platform.
Stateful processing at scale - joins, aggregations, upserts
Orchestration auto-generated from the data and SQL
Templates with sample data for Kafka/Kinesis/S3 sources -> S3/Athena/Snowflake/Redshift
Furcy Pin: 2003–2023: A Brief History of Big Data
Technological advancement leaves traces along the way, and it is important to look back to understand how things evolved. Looking back not only helps us understand the pattern but also makes us notice the Black Swan events we might otherwise miss. The author takes an excellent walk down memory lane through Big Data, then and now.
https://towardsdatascience.com/2003-2023-a-brief-history-of-big-data-25712351a6bc
Anna Geller: What I learned from NormConf 2022
NormConf is an online tech conference about things that matter in data and ML but don’t get the spotlight. I enjoyed watching some of the thought-provoking talks from the leading industry experts. The author narrates the NormConf experience with a summary of some of the excellent talks.
https://medium.com/the-prefect-blog/what-i-learned-from-normconf-2022-f8b3c88f0de7
Full Conference YouTube Playlist
https://www.youtube.com/playlist?list=PLYXaKIsOZBsu3h2SSKEovRn7rGy7wkUAV
Sponsored: 10 Things to Consider Before Choosing a Data Observability Platform
Ready to stop fighting bad data and explore end-to-end coverage with Data Observability? Learn the 10 most important things to consider when choosing a data observability platform. Get the new platform guide, and take the next step in your journey to data trust.
Netflix: Data Reprocessing Pipeline in Asset Management Platform @ Netflix
Netflix writes about the natural evolution of its Asset Management Platform into a data processing pipeline. It is exciting to read some of the common characteristics of the asset management platform, such as schema validation, versioning, access control, sharing, and triggering configured workflows, which naturally pave the path for the data processing pipeline.
Shopify: Monte Carlo Simulations: Separating Signal from Noise in Sampled Success Metrics
Sometimes, we won’t have the luxury of processing all the data to compute success metrics, and we have to rely on sampled success metrics instead. How do we separate the noise from the success signal over time? Shopify writes about how Monte Carlo simulation helps separate signal from noise.
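As a rough illustration of the idea (not Shopify's implementation), the sketch below simulates many sampled measurements of a metric whose true rate is known, then uses the spread of those simulations as a noise band. The function names, the 0.8 success rate, the sample size of 200, and the 95% coverage are all assumptions made for the example.

```python
import random


def simulate_sampled_metric(true_rate, sample_size, n_trials, seed=0):
    """Simulate n_trials sampled measurements of a success-rate metric."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    results = []
    for _ in range(n_trials):
        # Each sampled unit succeeds with probability true_rate.
        successes = sum(1 for _ in range(sample_size) if rng.random() < true_rate)
        results.append(successes / sample_size)
    return results


def noise_band(values, coverage=0.95):
    """Return the (low, high) range that holds roughly `coverage` of the values."""
    ordered = sorted(values)
    tail = (1 - coverage) / 2
    low_index = int(tail * (len(ordered) - 1))
    high_index = int((1 - tail) * (len(ordered) - 1))
    return ordered[low_index], ordered[high_index]


def is_signal(observed, band):
    """Treat an observed sampled metric as signal only if it leaves the noise band."""
    low, high = band
    return observed < low or observed > high


if __name__ == "__main__":
    simulated = simulate_sampled_metric(true_rate=0.8, sample_size=200, n_trials=2000)
    band = noise_band(simulated)
    print("noise band:", band)
    print("0.95 is signal:", is_signal(0.95, band))
    print("0.80 is signal:", is_signal(0.80, band))
```

A sampled value that falls inside the band is what sampling alone could plausibly produce, so only movements outside it are worth investigating.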
Sponsored: Take Control of Your Customer Data With RudderStack
Legacy CDPs charge you a premium to keep your data in a black box. RudderStack builds your CDP on top of your data warehouse, giving you a more secure and cost-effective solution. Plus, it gives you more technical controls, so you can fully unlock the power of your customer data.
Take control of your customer data today.
Expedia: AI, Personalization, and Openness: Exploring the Definitive Tech Trends of 2023
Expedia writes about 2023 tech trends and what they mean for the travel industry. The key trends from the predictions are:
Tech gets (hyper) personal
Platforms become more open and accessible
Human-centric design is in the spotlight
AI might feel the frost
(Hyper) personalization is an excellent trend to watch; we are already seeing a pattern of internal recommendation APIs built to serve various personalization product features.
Twilio: Presto on AWS at Twilio - Lesson Learned and Optimization
I often feel that Presto's impact on data engineering is very underappreciated. Presto most likely helped systems like Snowflake overcome the perception problem.
Twilio shares its experience running Presto on AWS with some excellent optimization techniques.
https://prestodb.io/blog/2022/12/28/presto-at-twilio.html
The National Archives: CSV Schema Language
I recently learned that CSV has a schema specification standard. Maybe I’m slow to discover this, but when I found it, I was super excited for no reason I can explain. I’m still trying to process it 😊 Nonetheless, I enjoyed reading the draft version of the specification, which combines both the schema and validation specifications.
https://digital-preservation.github.io/
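To give a feel for what pairing a schema with validation rules means for a CSV file, here is a minimal sketch in Python. It does not use the CSV Schema Language syntax or any tool from the specification; the column names, the rules, and the `validate_csv` helper are hypothetical and exist only for illustration.

```python
import csv
import io
import re

# Hypothetical rules: column name -> check that returns True for a valid value.
# The key order also defines the expected header order.
RULES = {
    "id": lambda value: value.isdigit(),
    "email": lambda value: re.fullmatch(r"[^@\s]+@[^@\s]+", value) is not None,
    "status": lambda value: value in {"active", "inactive"},
}


def validate_csv(text, rules):
    """Return a list of (line_number, column, offending_value) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    errors = []
    # Schema check: the header must match the expected columns exactly.
    if reader.fieldnames != list(rules):
        errors.append((1, "header", str(reader.fieldnames)))
        return errors
    # Validation check: every cell must satisfy the rule for its column.
    for row in reader:
        for column, check in rules.items():
            value = row.get(column) or ""  # missing cells count as empty
            if not check(value):
                errors.append((reader.line_num, column, value))
    return errors


if __name__ == "__main__":
    sample = "id,email,status\n1,a@example.com,active\nx,bad,unknown\n"
    for error in validate_csv(sample, RULES):
        print(error)
```

Keeping the rules next to the column definitions is the appeal here: one file describes both what the columns are and what counts as valid data in them.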
All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.