Welcome to the 25th edition of the data engineering newsletter. This week's edition features a new set of articles covering Kleiner Perkins's take on the future of computing and data infrastructure, LinkedIn's fast ingestion with Apache Gobblin, Intuit's data journey, AWS's PyDeequ, Alibaba Cloud's Flink infrastructure handling four billion records per second, Expedia's ML deployment patterns, Delta Lake vs. Hudi, handling late-arriving dimensions, entity resolution for big data, Airflow 2.0, and Debezium's 2020 year in review.
Looking ahead to the future of computing and data infrastructure
Kleiner Perkins writes an excellent blog about the future of computing and data infrastructure. The cloud data warehouse, serverless architecture, the workflow-as-(no)code movement, and the lack of an end-to-end solution to optimize the ML infrastructure value chain are some of the exciting trends to watch. The author's take on data security reinforces the importance of a metadata management system.
The data/security breaches showing up in the news on a near-weekly basis all seem to be rooted in the same problem: a lack of awareness of what data an organization has, where it is, and who has access to it.
You can read Data Engineering Weekly’s take on data infrastructure trends here.
FastIngest: Low-latency Gobblin with Apache Iceberg and ORC format
LinkedIn writes about the evolution of Apache Gobblin from a batch ingestion framework to a fast ingestion framework, reducing ingestion latency from 45 minutes to less than 5 minutes. The blog's narration of how Gobblin uses Apache Iceberg to guarantee read/write isolation, the tradeoffs in ORC format encoding, and continuous data publishing is an exciting read. The choice of YARN for resource management and scheduling is interesting, and I look forward to reading more about how the Gobblin replanner evolves from stop-the-world rebalancing to dynamic rebalancing.
The Intuit Data Journey
A clean migration is a sign of effective engineering, and Intuit writes about one such clean migration of its data infrastructure from on-premises to cloud-native. The blog emphasizes data infrastructure fundamentals, such as treating data as a product, and a focus on data quality, availability, performance, security, and cost-effectiveness. It's exciting to read about the challenges ahead of Intuit's data platform and the focus on the data mesh approach.
Testing data quality at scale with PyDeequ
AWS introduced Deequ, a data quality library, in early 2019. Deequ is used internally at Amazon to verify the quality of many large production datasets. Dataset producers can add and edit data quality constraints. The system computes data quality metrics regularly (with every new version of a dataset), verifies constraints defined by dataset producers, and publishes datasets to consumers in case of success.
As an evolution of Deequ, AWS open sourced PyDeequ, a Python wrapper on top of Deequ that integrates with PySpark to define and run the test cases.
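The declarative pattern Deequ popularized, where producers state constraints once and every new dataset version is verified against them before publication, can be illustrated without Spark. The sketch below is a minimal, library-free illustration of that idea in plain Python; the function names (`is_complete`, `is_unique`, `has_min`, `verify`) echo Deequ's constraint vocabulary but are not the actual PyDeequ API.

```python
# A minimal, library-free sketch of Deequ-style declarative data quality
# checks: producers declare constraints once, and each new version of a
# dataset is verified against them before being published to consumers.

def is_complete(column):
    """Constraint: no None values in the column."""
    def check(rows):
        return all(row.get(column) is not None for row in rows)
    return (f"isComplete({column})", check)

def is_unique(column):
    """Constraint: no duplicate values in the column."""
    def check(rows):
        values = [row.get(column) for row in rows]
        return len(values) == len(set(values))
    return (f"isUnique({column})", check)

def has_min(column, predicate):
    """Constraint: the column minimum satisfies a predicate."""
    def check(rows):
        return predicate(min(row[column] for row in rows))
    return (f"hasMin({column})", check)

def verify(rows, constraints):
    """Run all constraints and report per-constraint status."""
    return {name: ("Success" if check(rows) else "Failure")
            for name, check in constraints}

reviews = [
    {"id": 1, "rating": 4},
    {"id": 2, "rating": 5},
    {"id": 3, "rating": 1},
]

report = verify(reviews, [
    is_complete("id"),
    is_unique("id"),
    has_min("rating", lambda m: m >= 1),
])
print(report)
```

In real PyDeequ the constraints run as Spark jobs over large datasets; the value of the pattern is that the quality contract lives alongside the data rather than in ad hoc consumer-side checks.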
Four Billion Records per Second! What is Behind Alibaba Double 11 — Flink Stream-Batch Unification Practice during Double 11 for the Very First Time
Alibaba writes a great success story on Apache Flink's scalability and the effectiveness of stream-batch unification. During the last Double 11 Global Shopping Festival, the Apache Flink pipeline processed an impressive four billion records per second, and the data volume reached an incredible seven TB per second. The Flink-based stream-batch unification has successfully withstood strict tests of stability, performance, and efficiency in Alibaba's core data service scenarios. This article shares the practice experience and reviews the evolution of stream and batch unification within Alibaba's core data services.
Accelerate Machine Learning with the Optimal Deployment Pattern
Expedia writes about ML model deployment patterns, narrating some of the significant challenges of operating ML models in production and how they differ from traditional back-end systems. The blog is an exciting read on the various deployment patterns and the pros & cons of each.
The ACID table storage layer: thorough conceptual comparisons between Delta Lake and Apache Hudi
The support for ACID on top of object storage is a significant development in 2020. The blog narrates the drawbacks of the data lake approach and compares ACID support between Databricks Delta Lake and Apache Hudi.
Handling Late Arriving Dimensions Using a Reconciliation Pattern
Processing facts and dimensions is at the core of data engineering. In typical event sourcing, the producer publishes the facts and dimensions in different streams. The blog narrates some of the design challenges with late-arriving dimensions, especially with rapidly changing dimensions (RCD), and how a reconciliation pattern helps to solve them.
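The core of the pattern can be sketched in a few lines: a fact that references a not-yet-seen dimension key is enriched with a placeholder row, and a later reconciliation pass back-fills it once the dimension record arrives on its stream. This is a minimal illustration of the general idea; the names (`on_fact`, `on_dimension`, `reconcile`) are mine, not from the blog.

```python
# Minimal sketch of the reconciliation pattern for late-arriving
# dimensions: facts are never blocked waiting for their dimension;
# they are enriched with a placeholder and fixed up later.

UNKNOWN = {"name": "UNKNOWN"}

dimensions = {}        # dim_key -> dimension attributes
enriched_facts = []    # facts joined with (possibly placeholder) dims

def on_dimension(key, attrs):
    """Handle a record from the dimension stream."""
    dimensions[key] = attrs

def on_fact(fact):
    """Handle a record from the fact stream, joining whatever
    dimension state is currently available."""
    dim = dimensions.get(fact["dim_key"], UNKNOWN)
    enriched_facts.append({**fact, "dim": dim})

def reconcile():
    """Back-fill facts that were enriched with the placeholder."""
    for fact in enriched_facts:
        if fact["dim"] is UNKNOWN and fact["dim_key"] in dimensions:
            fact["dim"] = dimensions[fact["dim_key"]]

# The fact arrives before its dimension (a late-arriving dimension).
on_fact({"order_id": 101, "dim_key": "cust-9", "amount": 40})
on_dimension("cust-9", {"name": "Acme Corp"})
reconcile()
print(enriched_facts[0]["dim"])  # placeholder replaced by the real row
```

In a real pipeline the reconciliation pass typically runs as a scheduled batch job over the warehouse tables rather than in memory, but the placeholder-then-back-fill flow is the same.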
ACM Computing Surveys / The Morning Paper:
An overview of end-to-end entity resolution for big data
One of the most critical tasks for improving data quality and increasing the reliability of data analytics is Entity Resolution (ER), which aims to identify different descriptions that refer to the same real-world entity. The paper presents an end-to-end view of ER workflows for big data, critically reviews the pros and cons of existing methods, and concludes with the leading open research directions.
The Morning Paper provides an excellent summary of the paper.
Airflow 2.0 and Why We Are Excited at Databand
Airflow version 2.0 is a significant milestone release for the Airflow community. Databand shares the excitement and narrates two significant features released with Airflow 2.0: decorator flows (the TaskFlow API) and scheduler performance improvements.
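The appeal of decorator flows is that plain Python functions become tasks, and passing one task's return value into another implicitly declares the dependency. The toy sketch below illustrates that mechanic in pure Python; it is a simplified illustration of the idea, not Airflow's actual TaskFlow implementation, and the `Task` class and `task` decorator here are my own stand-ins.

```python
# Toy sketch of the idea behind Airflow 2.0's decorator-based flows:
# calling a decorated function builds a graph node instead of running
# the function, and Task-valued arguments become upstream dependencies.

class Task:
    def __init__(self, fn, args):
        self.fn = fn
        self.args = args

    def run(self):
        # Run upstream tasks first by resolving Task-valued arguments,
        # then run this task (a depth-first walk of the implicit DAG).
        resolved = [a.run() if isinstance(a, Task) else a
                    for a in self.args]
        return self.fn(*resolved)

def task(fn):
    """Decorator: defer execution and record the dependency graph."""
    def wrapper(*args):
        return Task(fn, list(args))
    return wrapper

@task
def extract():
    return [1, 2, 3]

@task
def transform(values):
    return [v * 10 for v in values]

@task
def load(values):
    return sum(values)

# extract -> transform -> load, wired purely by passing return values,
# with no explicit set_upstream/bitshift dependency declarations.
pipeline = load(transform(extract()))
print(pipeline.run())  # 60
```

In real Airflow the decorated functions additionally get XCom-based value passing, scheduling, and retries; the sketch only shows why the decorator style reads like ordinary function composition.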
Debezium in 2020 -- The Recap!
Debezium, the de facto open-source distributed platform for change data capture, publishes its year-in-review for 2020. The blog post contains a rich consolidation of exciting articles about the adoption of Debezium.
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.