Welcome to the 36th edition of the data engineering newsletter. This week's release is a new set of articles that focus on PayPal's thread on the customer churn problem, Pinterest's open-source Querybook, Capital One's journey from batch to real-time CDC, QuantumBlack's thoughts on the data engineering role, Pinterest's Flink infrastructure for detecting image similarity, Shopify's approach to building smarter search products, Microsoft's introduction to time series forecasting, Confluent's first glimpse of Kafka without ZooKeeper, Fathom's website analytics infrastructure, Financial Times' trending topic infrastructure, Picnic's data warehouse journey, Data Mechanics' coverage of the Apache Spark 3.1 release with Kubernetes support GA, and Groupon's customization of the Airflow UI.
This week, let's start with an insightful thread from PayPal on how it solved a customer churn problem from first principles. The thread is a blueprint of how to run your data analytics process.
Pinterest: Open sourcing Querybook, Pinterest’s collaborative big data hub
Ad-hoc analytics is the first step in building a data analytics product. The need for ad-hoc analytics has evolved from a simple SQL editor to an integrated workflow engine. Pinterest open-sourced Querybook, with enhanced visualization, collaboration, and scheduling features, as a hub for data analytics.
Capital One Tech: The Journey from Batch to Real-time with Change Data Capture
Change Data Capture (CDC) and event sourcing are vital components of data infrastructure. Capital One writes about introducing event sourcing and CDC, and offers an excellent comparison between Debezium and AWS Database Migration Service (DMS).
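As a rough illustration of what consuming a CDC stream involves, here is a minimal Python sketch that replays change events onto an in-memory copy of a table. The event shape (`op`, `before`, `after`) loosely follows the Debezium envelope but is simplified, and the `apply_change` helper, the `id` key column, and the sample rows are hypothetical rather than taken from the Capital One article.

```python
from typing import Any, Dict, Optional

Row = Dict[str, Any]


def apply_change(table: Dict[Any, Row], event: Dict[str, Any], key: str = "id") -> None:
    """Apply one simplified change event to an in-memory copy of a table."""
    op = event["op"]
    before: Optional[Row] = event.get("before")
    after: Optional[Row] = event.get("after")
    if op in ("c", "r", "u"):
        # create, snapshot read, update: upsert the new row image
        table[after[key]] = after
    elif op == "d":
        # delete: remove the row identified by the old row image
        table.pop(before[key], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


# Hypothetical events, in the order they were captured from the source table.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "paid"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]

replica: Dict[Any, Row] = {}
for event in events:
    apply_change(replica, event)

print(replica)
```

Replaying the events in order is what keeps the replica consistent with the source, which is why ordering guarantees matter when choosing a CDC tool. The sketch ignores primary key updates and schema changes.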
QuantumBlack: Data Engineering’s Role Is Scaling Beyond Scope — And That Should Be Celebrated
Today’s data engineers are responsible for unlocking data science and analytics in an organization and building well-curated, accessible data foundations. Responsibilities have increased, and expectations are higher than they were even five years ago.
The article is an exciting summary of the emerging importance of data engineering and the need for organizations to grow their data engineering skills.
Pinterest: Detecting Image Similarity in (Near) Real-time Using Apache Flink
Pinterest writes about its near real-time infrastructure to detect image similarity. The article narrates the design of the Flink stream-stream join, the LSH (Locality Sensitive Hashing) lookup, and the graph storage needed to map each identified cluster to its member list. Pinterest's approach to propagating debugging data through the Flink operators is an exciting read on the operability of complex pipelines, and one can adapt it to any stream processing pipeline.
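For readers new to LSH, here is a minimal Python sketch of the idea behind an LSH lookup: similar vectors hash to the same bucket, so candidates can be found without comparing against every image. It uses random hyperplane hashing, which is an assumption made for illustration and not necessarily the scheme Pinterest uses. The function names, image IDs, and vectors are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Signature = Tuple[int, ...]


def make_hyperplanes(dim: int, n_bits: int, seed: int = 0) -> List[List[float]]:
    """Draw n_bits random hyperplanes (as normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec: Sequence[float], planes: List[List[float]]) -> Signature:
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


planes = make_hyperplanes(dim=4, n_bits=8)

base = [0.9, 0.1, 0.0, 0.2]
embeddings = {
    "img_a": base,
    "img_a_scaled": [2 * x for x in base],  # same direction as img_a
    "img_b": [-x for x in base],            # opposite direction
}

# Bucket every image by its signature.
index: Dict[Signature, List[str]] = defaultdict(list)
for image_id, vec in embeddings.items():
    index[signature(vec, planes)].append(image_id)


def lookup(vec: Sequence[float]) -> List[str]:
    """Return the candidate images that share a bucket with vec."""
    return index.get(signature(vec, planes), [])


print(lookup(base))
```

Vectors pointing in the same direction land in the same bucket, while the opposite vector lands elsewhere. A production system would typically use several hash tables to trade recall against the number of candidates.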
Shopify: Building Smarter Search Products: 3 Steps for Evaluating Search Algorithms
Search is a core functionality of most business applications, and it is one of the vital applications of a data product. How do you continuously validate search algorithms? Shopify narrates a three-step approach, from collecting the data to evaluating online and offline metrics.
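To make "offline metrics" concrete, here is a small Python sketch of two metrics commonly used to score a ranked result list against labeled relevance data: precision at k and NDCG at k. The choice of these two metrics, the function names, and the sample documents are assumptions for illustration and are not taken from the Shopify article.

```python
import math
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are labeled relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain: gains lower in the list count for less."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked: List[str], gain: Dict[str, float], k: int) -> float:
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    actual = dcg([gain.get(doc, 0.0) for doc in ranked[:k]])
    ideal = dcg(sorted(gain.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


# Hypothetical relevance labels for one query (higher gain = more relevant).
gain = {"d1": 3.0, "d2": 2.0, "d3": 1.0}
ranked = ["d3", "d1", "d7", "d2"]  # what the search algorithm returned

print(precision_at_k(ranked, set(gain), k=3))
print(ndcg_at_k(ranked, gain, k=3))
```

Precision at k only asks whether relevant documents appear, while NDCG also rewards putting the most relevant ones first, so the two can disagree about which algorithm is better.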
Microsoft: Time series forecasting - Understanding the fundamentals (Part-1)
Time series forecasting operates in a well-defined problem space and spans many different domains. Producing high-quality forecasts is not an easy problem. Microsoft wrote an exciting blog on time series forecasting fundamentals and summarized a few popular Python forecasting packages to get started.
The article references Forecasting: Principles and Practice for time series forecasting principles.
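Before reaching for a forecasting package, it helps to have a baseline to beat. Below is a minimal Python sketch of a seasonal naive forecast (repeat the last observed season) scored with mean absolute error. This baseline is a common starting point in forecasting practice, not something taken from the Microsoft article, and the function names and numbers are made up for illustration.

```python
from typing import List, Sequence


def seasonal_naive(history: Sequence[float], season: int, horizon: int) -> List[float]:
    """Forecast by repeating the most recent full season of observations."""
    if len(history) < season:
        raise ValueError("need at least one full season of history")
    last_season = list(history[-season:])
    return [last_season[i % season] for i in range(horizon)]


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical series with a repeating pattern of length 4.
history = [10, 20, 30, 40, 12, 22, 32, 42]
forecast = seasonal_naive(history, season=4, horizon=4)

actual = [14, 24, 30, 44]  # what was observed afterwards (made up)
print(forecast)
print(mean_absolute_error(actual, forecast))
```

Any model from a forecasting library should achieve a lower error than this baseline on held-out data to justify its extra complexity.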
Confluent: Apache Kafka Made Simple - A First Glimpse of a Kafka Without ZooKeeper
The Apache Kafka community has started replacing ZooKeeper with a self-managed metadata quorum, and the community will potentially get early access in the upcoming 2.8 release. Confluent writes about how the quorum controller works if you opt for Kafka without ZooKeeper, and about scaling the Kafka cluster up and down.
KIP-500 is an informative read about Kafka's self-managed metadata quorum design.
Fathom: Building the world's fastest website analytics.
Fathom Engineering writes about its analytical database journey from MySQL to SingleStore (MemSQL). The article narrates the scalability challenges with MySQL as an analytical DB and the evaluation process of Elasticsearch, TimescaleDB, Rockset, and ClickHouse. The article is an excellent reminder of how important it is to have well-written, easy-to-understand documentation.
Financial Times: Predicting FT Trending Topics
Financial Times writes about its trending topic prediction infrastructure and how it helps journalists write more relevant stories. Integrating Slack into the prediction workflow to send signals to stakeholders is an exciting design and a good reminder to incorporate the business workflow into the prediction system.
Picnic: How we built our Lakeless Data Warehouse
Picnic's data team writes about the five-year journey of its data warehouse system. There are many exciting lessons: store time in UTC, follow up on stop-gap solutions, start with a low-risk tech stack and scale up as you grow, document the data catalog early, and minimize the number of tools.
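The "store time in UTC" lesson is easy to sketch in Python: normalize every timestamp to UTC at the boundary and refuse timestamps that carry no time zone. The `to_utc` helper and the fixed +02:00 offset below are hypothetical and only for illustration. Real code would more likely attach a named zone via the standard library's `zoneinfo` module.

```python
from datetime import datetime, timedelta, timezone


def to_utc(ts: datetime) -> datetime:
    """Convert an aware timestamp to UTC; reject naive ones instead of guessing."""
    if ts.tzinfo is None:
        raise ValueError("naive timestamp: attach the source time zone before storing")
    return ts.astimezone(timezone.utc)


# Hypothetical local time zone, modeled as a fixed +02:00 offset for illustration.
local_zone = timezone(timedelta(hours=2))
order_placed = datetime(2021, 3, 28, 9, 30, tzinfo=local_zone)

stored = to_utc(order_placed)
print(stored.isoformat())  # 2021-03-28T07:30:00+00:00
```

Converting back to local time then becomes a presentation concern, and comparisons across regions and daylight saving changes stay unambiguous in the warehouse.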
Data Mechanics: Apache Spark 3.1 Release: Spark on Kubernetes is now Generally Available
With the Apache Spark 3.1 release in March 2021, the Spark on Kubernetes project is officially declared production-ready and Generally Available. The blog narrates the journey of Apache Spark's Kubernetes support from version 2.4 to 3.1. It highlights some of the key enhancements over Spark 3.0, such as graceful executor decommissioning, support for the NFS volume option (now it's much simpler to integrate EFS), and stage-level scheduling.
Groupon: How to add custom KPIs to Airflow
The ability to customize the Airflow UI with additional task KPIs can significantly improve the data team's productivity. Groupon writes an exciting blog on how it added custom KPIs, with code examples.
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.