Welcome to the 37th edition of the data engineering newsletter. This week's release is a new set of articles that focus on Data Council's popular OSS data projects, Uber's real-time infrastructure, an introduction to Pinot's upsert, Facebook's self-supervised forecasting, Databricks' forecasting using Spark & Prophet, data engineering practices at Wikimedia, Yelp's data infrastructure, Salesforce's strongly consistent global secondary index for HBase, Auto Trader's event tracking validation, Monte Carlo Data's root cause analysis for data engineers, and AlayaLabs' production data pipeline.
Data Council: What are the most popular OSS data projects of 2021?
Data Council published its list of the most popular open source data engineering projects of 2021. It's no surprise that dbt leads the list. Notably, three orchestration engines (Airflow, Dagster & Prefect) are in the top six, which suggests the orchestration space still lacks consolidation. One vital takeaway from the survey is that data preparation is the hardest part of data analytics at any scale.
https://petesoder.medium.com/what-are-the-most-popular-oss-data-projects-of-2021-84ef021bb5a2
Wikipedia: Wikipedia data engineering practices with Nuria Ruiz
The conversation gives an excellent overview of Wikimedia's data infrastructure. It highlights the challenges of collecting data from the edge network, defining metrics based on principles rather than profit, and privacy. One eye-opening moment for me in the conversation: we have sophisticated data computation and management frameworks, yet none of them treat data privacy as a first-class citizen.
https://www.speedwins.tech/posts/some-words-with-nuria-ruiz
Uber: Real-time Data Infrastructure at Uber
Uber writes an exciting paper summarizing its real-time infrastructure, with Apache Kafka, Apache Flink, Apache Pinot & Presto as the foundational technology stack. The Kafka consumer proxy, the logical separation of Kafka topics, auto-scaling Flink applications, Pinot's upsert feature, and Pinot's integration with the rest of the data ecosystem are among the highlights.
https://arxiv.org/pdf/2104.00087.pdf
Apache Pinot: Introduction to Upserts in Apache Pinot
Pinot is an immutable data store, which means that there is no genuine concept of upsert as you stream data into it from Kafka. The blog summarizes the need for upsert support and how it differs from traditional database upserts.
https://medium.com/apache-pinot-developer-blog/introduction-to-upserts-in-apache-pinot-987c12149d93
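To make the idea of an upsert over immutable data concrete, here is a minimal Python sketch. It is not Pinot's implementation; the function name `latest_by_key` and the field names `order_id`, `status`, and `ts` are hypothetical and chosen only for illustration. The stored records are never modified: the "upserted" view is derived at read time by keeping, for each primary key, the record with the greatest value in a comparison field.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def latest_by_key(log: Iterable[Record], key: str, order_by: str) -> List[Record]:
    """Return one record per key: the one with the greatest order_by value.

    The input log is treated as append-only and is never mutated.
    """
    latest: Dict[Any, Record] = {}
    for record in log:
        current = latest.get(record[key])
        # Keep the newer record; on a tie, the later arrival wins.
        if current is None or record[order_by] >= current[order_by]:
            latest[record[key]] = record
    return list(latest.values())


# Hypothetical stream of events, as they might arrive from a Kafka topic.
events = [
    {"order_id": 1, "status": "placed", "ts": 100},
    {"order_id": 2, "status": "placed", "ts": 105},
    {"order_id": 1, "status": "delivered", "ts": 180},
]

view = latest_by_key(events, key="order_id", order_by="ts")
for row in view:
    print(row)
```

All three events stay in the log, while a query against the view sees only one row per `order_id`. That is the main contrast with a traditional database upsert, which overwrites the existing row in place.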
Facebook: Large-scale forecasting - self-supervised learning framework for hyperparameter tuning
Forecasting is one of the core data science and machine learning tasks. Providing fast, reliable, and accurate forecasting results with large amounts of time series data is vital for a business operation. Facebook writes about its framework, SSL-HPT, that takes time-series features as inputs and produces optimal hyperparameters in less time without sacrificing accuracy.
Databricks: Fine-Grained Time Series Forecasting at Scale With Facebook Prophet and Apache Spark: Updated for Spark 3
Facebook's Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. Databricks writes about training hundreds of time series forecasting models in parallel with Facebook Prophet and Spark.
Yelp: Powering Messaging Enabledness with Yelp's Data Infrastructure
Yelp writes about its data lake sink connector, which enables integrating analytical events. The data pipe CLI tool, the Pipeline Studio web UI, and the schema management tooling highlight what it takes to build self-service data ingestion.
Salesforce: The Design of Strongly Consistent Global Secondary Indexes in Apache Phoenix
Secondary indexing, which enables efficient queries on non-primary key fields, is central in many use cases. Apache HBase provides random, real-time read/write access, but at the cost that the access pattern depends on the key. Salesforce writes about how Apache Phoenix supports a strongly consistent global secondary index. The design approach for handling both the immutable case (the secondary index column is immutable) and the mutable case (the secondary index column is mutable) is an exciting read.
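The difference between the immutable and mutable cases can be seen in a toy, single-process Python sketch. This is not Phoenix's design: the class `IndexedTable` and its methods are hypothetical, and running in one process sidesteps the distributed consistency problem the article is actually about. It only shows why a mutable indexed column needs extra work on every write: the stale index entry has to be removed before the new one is added.

```python
from collections import defaultdict
from typing import Any, Dict, Set


class IndexedTable:
    """A key-value table with one secondary index, kept in sync on write."""

    def __init__(self, indexed_column: str) -> None:
        self.indexed_column = indexed_column
        self.rows: Dict[str, Dict[str, Any]] = {}
        # Maps an indexed value to the set of row keys that hold it.
        self.index: Dict[Any, Set[str]] = defaultdict(set)

    def put(self, row_key: str, row: Dict[str, Any]) -> None:
        old = self.rows.get(row_key)
        if old is not None:
            # Mutable case: drop the index entry for the previous value.
            # With an immutable indexed column this branch is never needed.
            self.index[old[self.indexed_column]].discard(row_key)
        self.rows[row_key] = dict(row)
        self.index[row[self.indexed_column]].add(row_key)

    def find_by_indexed(self, value: Any) -> Set[str]:
        # Look up row keys by the non-primary-key column without a full scan.
        return set(self.index.get(value, set()))


table = IndexedTable(indexed_column="city")
table.put("u1", {"city": "Paris"})
table.put("u2", {"city": "Paris"})
table.put("u1", {"city": "Rome"})  # u1 moves; its old index entry must go

paris = table.find_by_indexed("Paris")
rome = table.find_by_indexed("Rome")
```

In a distributed store, the data row and the index row can live on different servers, so the two writes inside `put` are exactly where consistency can break.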
Auto Trader: Reliable tracking: Validating Snowplow events using Cypress & Snowplow Micro
One of the challenging parts of event instrumentation is building automated test suites to ensure that new releases of websites, mobile apps, and server-side applications do not break tracking. Auto Trader writes an exciting article on how it automated Snowplow event validation with the Cypress test suite.
https://engineering.autotrader.co.uk/2021/04/09/cypress-snowplow-micro-blog.html
Monte Carlo Data: Root Cause Analysis for Data Engineers
Debugging and root cause analysis of a distributed system are very challenging, and the data pipeline is no exception. Standard techniques like error messages, reading the code, or unit & integration tests are often misleading. The blog narrates the manual steps required for one such root cause analysis in a data pipeline and emphasizes the need for automation to reach a faster resolution.
https://towardsdatascience.com/root-cause-analysis-for-data-engineers-782c02351697
AlayaLabs: From Jupyter Notebooks to Production Data Pipelines - Our Framework for Delivering Data Projects
AlayaLabs writes about its data infrastructure using Snowflake, S3, Airflow & Looker and how it turns prototypes built in Jupyter Notebooks into continuous data pipelines.
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.