Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers
RudderStack
Provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.
Event: Join Impact 2021 on November 3, 2021: The First-Ever Data Observability Summit. Join Today's Leading Data Pioneers
Hear from data leaders pioneering the technologies & processes shaping data engineering. Featuring First Chief Data Scientist of the U.S., founder of the Data Mesh, and many more!
Click To Get Your Free Ticket For All Data Engineering Weekly Readers
Let’s start this week with a fun meme that shows the importance of data quality and contract-driven data asset creation.
Uber: Building Scalable Streaming Pipelines for Near Real-Time Features
Uber writes an exciting blog about tuning its Flink streaming platform. The blog narrates the business cases for real-time analytics with geospatial and temporal analysis and focuses on network, CPU, and memory optimization strategies.
https://eng.uber.com/building-scalable-streaming-pipelines/
Confluent: How ksqlDB Works: Internal Architecture and Advanced Features
ksqlDB is the streaming SQL engine for Kafka that enables stream processing tasks using SQL statements. Confluent writes a guide to the ksqlDB internal architecture, discussing how stream joins, stateful and stateless computing, and fault tolerance work.
https://www.confluent.io/blog/ksqldb-architecture-and-advanced-features/
Data Science at Microsoft: ML program management at scale
Data Science at Microsoft discusses the significance of ML program management and the program manager's role across the end-to-end lifecycle of a data product. As ML application adoption grows in scale and breadth, Technical Program Manager for Machine Learning is an exciting job profile that will grow in the coming years.
Part 1:
https://medium.com/data-science-at-microsoft/ml-program-management-at-scale-part-1-of-3-4816a99ad1bd
Part 2:
https://medium.com/data-science-at-microsoft/ml-program-management-at-scale-part-2-of-2-3ab2cc54f36f
Great Expectations: Maximizing Productivity of Analytics Teams
Great Expectations writes a three-part series on maximizing the productivity of analytics teams, focusing on the debuggability of dashboards, reducing technical debt in the data pipeline, and the role of Great Expectations in the data engineering process.
https://greatexpectations.io/blog/maximizing-productivity-of-analytics-teams-pt1/
https://greatexpectations.io/blog/maximizing-productivity-of-analytics-teams-pt2/
https://greatexpectations.io/blog/maximizing-productivity-of-analytics-teams-pt3/
Shopify: 5 Steps for Building Machine Learning Models for Business
Shopify writes a five-step guide to building ML products. The first three steps focus on questioning whether an ML model is necessary at all, as opposed to a simple heuristic. The blog reemphasizes that simplicity is the best ML model strategy.
https://shopifyengineering.myshopify.com/blogs/engineering/building-business-machine-learning-models
Sponsored: RudderStack - Churn Prediction With BigQueryML to Increase Mobile Game Revenue
Here’s an interesting case study on how machine learning can directly impact the bottom line. RudderStack outlines how the app developer Torpedo Labs uses BigQuery ML to identify high-value mobile game players who are dangerously close to churning.
https://rudderstack.com/blog/churn-prediction-with-bigqueryml
Uber: How Data Shapes the Uber Rider App
Product data analytics is the core of lean product development. Uber writes an exciting blog on how its rider app uses such metrics-driven product development. The blog narrates the data acquisition lifecycle from mobile devices across different OS versions and emphasizes the importance of log standardization, anomaly detection, and data quality standards.
https://eng.uber.com/how-data-shapes-the-uber-rider-app/
DoorDash: Overcoming Rapid Growth Challenges for Datasets in Snowflake
DoorDash writes about the cost-driven optimization techniques it uses in its pipeline to optimize Snowflake usage. The techniques focus on deprecating unused ETL jobs, favoring incremental ETL processing over bulk processing, reducing the number of projections in SQL queries, using clustering keys, and maximizing the use of Snowflake native functions.
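To make the "incremental over bulk" idea concrete, here is a minimal sketch, not DoorDash's implementation. It uses Python's built-in sqlite3 as a stand-in for a warehouse, and the table names (orders_raw, orders_clean) and the integer updated_at column are hypothetical. The bulk job rescans every source row on each run, while the incremental job only reads rows newer than the target's high-water mark.

```python
import sqlite3


def make_db():
    """Create an in-memory database with hypothetical source and target tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders_raw (id INTEGER PRIMARY KEY, amount REAL, updated_at INTEGER)"
    )
    conn.execute(
        "CREATE TABLE orders_clean (id INTEGER PRIMARY KEY, amount REAL, updated_at INTEGER)"
    )
    conn.executemany(
        "INSERT INTO orders_raw VALUES (?, ?, ?)",
        [(1, 10.0, 100), (2, 20.0, 200)],
    )
    return conn


def bulk_load(conn):
    """Rebuild the target from scratch: every source row is rescanned on each run."""
    conn.execute("DELETE FROM orders_clean")
    cur = conn.execute(
        "INSERT INTO orders_clean (id, amount, updated_at) "
        "SELECT id, amount, updated_at FROM orders_raw"
    )
    return cur.rowcount  # number of rows written


def incremental_load(conn):
    """Process only source rows newer than the target's high-water mark."""
    (watermark,) = conn.execute(
        "SELECT COALESCE(MAX(updated_at), 0) FROM orders_clean"
    ).fetchone()
    cur = conn.execute(
        "INSERT OR REPLACE INTO orders_clean (id, amount, updated_at) "
        "SELECT id, amount, updated_at FROM orders_raw WHERE updated_at > ?",
        (watermark,),
    )
    return cur.rowcount  # number of rows written


if __name__ == "__main__":
    conn = make_db()
    print("first incremental run wrote", incremental_load(conn), "rows")
    conn.execute("INSERT INTO orders_raw VALUES (3, 30.0, 300)")
    print("second incremental run wrote", incremental_load(conn), "rows")
    print("bulk run wrote", bulk_load(conn), "rows")
```

One caveat of this simple watermark: a late-arriving row whose updated_at equals or predates the current maximum would be skipped, so real pipelines often reprocess a small overlap window.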
Sachin Bansal: Running Timeseries Anomaly Detection at Scale on SQL Data
Anomaly detection is a critical functionality in data engineering for reliable metrics, yet it has no shortage of challenges to implement and run at scale. The author narrates how CueObserve, an open-source metrics monitoring system, is solving anomaly detection at scale.
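For readers new to the topic, here is a minimal sketch of timeseries anomaly detection on a metric. It is not CueObserve's method; a plain trailing-window z-score is swapped in for illustration. The function name, window size, and threshold are assumptions, and the input is a list of values such as a daily metric returned by a SQL query.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=7, threshold=3.0):
    """Return indices of points that deviate from the trailing-window mean
    by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]  # the `window` points before index i
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all is treated as anomalous.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    daily_orders = [100, 102, 98, 101, 99, 100, 103, 250, 101, 100]
    print(rolling_zscore_anomalies(daily_orders))
```

Note that the spike itself inflates the window's standard deviation for the following points, which is one reason production systems tend to use more robust models than this.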
Picnic: Releasing diepvries, a Data Vault framework for Python
Picnic adopted Data Vault modeling techniques for its data warehouses. Continuing that adoption, Picnic open-sources diepvries, a simple Python library that automates the data loading process for Data Vault and avoids the maintenance of repetitive SQL queries for ETL jobs.
https://blog.picnic.nl/releasing-diepvries-a-data-vault-framework-for-python-3f01a5d46f84
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.