Data Engineering Weekly #54

Weekly Data Engineering Newsletter

Aug 30, 2021

Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.

Event: Join Impact 2021 on November 3, 2021: The First-Ever Data Observability Summit. Join Today's Leading Data Pioneers

Hear from data leaders pioneering the technologies & processes shaping data engineering. Featuring the first Chief Data Scientist of the U.S., the founder of Data Mesh, and many more!

Click To Get Your Free Ticket For All Data Engineering Weekly Readers

Let’s start this week with a fun meme that shows the importance of data quality and contract-driven data asset creation.

Uber: Building Scalable Streaming Pipelines for Near Real-Time Features

Uber writes an exciting blog about tuning its Flink streaming platform. The blog narrates the business cases for real-time analytics with geospatial and temporal analysis and focuses on network, CPU, and memory optimization strategies.

https://eng.uber.com/building-scalable-streaming-pipelines/

Confluent: How ksqlDB Works: Internal Architecture and Advanced Features

ksqlDB is the streaming SQL engine for Kafka that enables stream processing tasks using SQL statements. Confluent writes a guide to the ksqlDB internal architecture, discussing how stream joins, stateful & stateless computing, and fault tolerance work.

https://www.confluent.io/blog/ksqldb-architecture-and-advanced-features/

Data Science at Microsoft: ML program management at scale

Data Science at Microsoft discusses the significance of ML program management and the role of the program manager in the end-to-end lifecycle of a data product. As the scale and breadth of ML application adoption grow, the Technical Program Manager for Machine Learning is an exciting job profile that will grow in the coming years.

Part 1: https://medium.com/data-science-at-microsoft/ml-program-management-at-scale-part-1-of-3-4816a99ad1bd

Part 2: https://medium.com/data-science-at-microsoft/ml-program-management-at-scale-part-2-of-2-3ab2cc54f36f

Great Expectations: Maximizing Productivity of Analytics Teams

Great Expectations writes a three-part series on maximizing the productivity of analytics teams, focusing on the debuggability of dashboards, reducing technical debt in the data pipeline, and the role of Great Expectations in the data engineering process.

https://greatexpectations.io/blog/maximizing-productivity-of-analytics-teams-pt1/

https://greatexpectations.io/blog/maximizing-productivity-of-analytics-teams-pt2/

https://greatexpectations.io/blog/maximizing-productivity-of-analytics-teams-pt3/

Shopify: 5 Steps for Building Machine Learning Models for Business

Shopify writes a five-step guideline article on building ML products. The first three steps focus on questioning whether an ML model is necessary at all, rather than a simple heuristic algorithm. The blog reemphasizes that simplicity is the best ML model strategy.

https://shopifyengineering.myshopify.com/blogs/engineering/building-business-machine-learning-models

Uber: How Data Shapes the Uber Rider App

Product data analytics is the core of lean product development. Uber writes an exciting blog on how its rider app uses such metrics-driven product development. The blog narrates the data acquisition lifecycle from mobile devices across different OS versions and emphasizes the importance of log standardization, anomaly detection, and data quality standards.

https://eng.uber.com/how-data-shapes-the-uber-rider-app/

DoorDash: Overcoming Rapid Growth Challenges for Datasets in Snowflake

DoorDash writes about the cost-driven optimization techniques it uses in its pipeline to optimize Snowflake usage. The techniques focus on deprecating unused ETL jobs, favoring incremental ETL processing over bulk processing, reducing the number of projections in SQL queries, using clustering keys, and maximizing the use of Snowflake native functions.

https://doordash.engineering/2021/06/22/overcoming-rapid-growth-challenges-for-datasets-in-snowflake/
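To make the "incremental over bulk" idea concrete, here is a minimal sketch of a watermark-based incremental load. It is not DoorDash's code and it does not talk to Snowflake: it uses Python's built-in sqlite3 module as a stand-in, and the table names (source_orders, target_orders), the integer updated_at column, and the incremental_load helper are all hypothetical, chosen only for illustration.

```python
import sqlite3


def incremental_load(conn, watermark):
    """Copy only source rows changed after `watermark` into the target table.

    Hypothetical helper: table and column names are assumptions for this sketch.
    Returns (number of rows processed, new watermark).
    """
    rows = conn.execute(
        "SELECT id, updated_at, amount FROM source_orders WHERE updated_at > ?",
        (watermark,),
    ).fetchall()
    # Upsert so that changed rows overwrite their earlier copies in the target.
    conn.executemany(
        "INSERT OR REPLACE INTO target_orders (id, updated_at, amount) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    # Advance the watermark to the newest change seen; keep it if nothing changed.
    new_watermark = max((row[1] for row in rows), default=watermark)
    return len(rows), new_watermark


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE source_orders "
        "(id INTEGER PRIMARY KEY, updated_at INTEGER, amount REAL)"
    )
    conn.execute(
        "CREATE TABLE target_orders "
        "(id INTEGER PRIMARY KEY, updated_at INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO source_orders VALUES (?, ?, ?)",
        [(1, 100, 9.5), (2, 110, 12.0)],
    )

    # First run: everything is newer than watermark 0, so both rows are copied.
    first = incremental_load(conn, 0)

    # One new row arrives and one existing row is updated.
    conn.execute("INSERT INTO source_orders VALUES (3, 120, 7.25)")
    conn.execute(
        "UPDATE source_orders SET updated_at = 125, amount = 10.0 WHERE id = 1"
    )

    # Second run: only the two changed rows are read, not the whole table.
    second = incremental_load(conn, first[1])
    total = conn.execute("SELECT COUNT(*) FROM target_orders").fetchone()[0]
    return first, second, total
```

The second run scans only rows past the stored watermark, which is where the cost savings over a full reload would come from. A real pipeline would also need to handle deletes and late-arriving data, which this sketch ignores.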

Sachin Bansal: Running Timeseries Anomaly Detection at Scale on SQL Data

Anomaly detection is a critical functionality in data engineering for reliable metrics, yet there is no shortage of challenges in implementing and running it at scale. The author narrates how CueObserve, an open-source metrics monitoring system, is solving anomaly detection at scale.

https://towardsdatascience.com/running-timeseries-anomaly-detection-at-scale-on-sql-data-4407eb3d3bd3
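For readers new to the topic, here is a minimal sketch of one common baseline for timeseries anomaly detection, a rolling z-score. This is a generic technique picked for illustration; it is not CueObserve's method, and the function name, window size, and threshold are assumptions.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return the indices of points that deviate strongly from the recent past.

    Each point is compared with the mean and sample standard deviation of the
    `window` points immediately before it. Illustrative baseline only.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as anomalous.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical daily metric with one obvious spike.
daily_orders = [10, 11, 10, 12, 11, 10, 50, 11, 10, 12]
spikes = rolling_zscore_anomalies(daily_orders)
```

Note that the spike itself inflates the standard deviation of the windows that follow it, so a baseline like this can miss anomalies that occur shortly after an earlier one.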

Picnic: Releasing diepvries, a Data Vault framework for Python

Picnic adopted Data Vault modeling techniques for its data warehouses. Continuing that adoption, Picnic open-sources diepvries, a simple Python library that automates the data loading process for Data Vault and avoids the maintenance of repetitive SQL queries for ETL jobs.

https://blog.picnic.nl/releasing-diepvries-a-data-vault-framework-for-python-3f01a5d46f84

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.
