Data Engineering Weekly #28
Weekly Data Engineering Newsletter
Welcome to the 28th edition of the data engineering newsletter. This week's release is a new set of articles that focus on Google’s ML for computer architecture, Microsoft’s PyTorch vs. TensorFlow, Capital One’s Time travel offline ML evaluation frameworks, Alibaba Cloud’s Data Lake introduction, PayPal’s Next-Gen data movement framework, Apache Pinot’s integration story with Presto, Gradient Flow’s growing importance of Metadata, Metadata Day 2020 overview, Monte Carlo Data’s data pipeline SLA, and TDD with Apache Airflow.
Machine Learning for Computer Architecture
Custom accelerators such as Google's TPU and Edge TPU have significantly advanced ML workloads. To sustain these advances, the hardware accelerator ecosystem must keep innovating in architecture design and adapt to rapidly evolving ML models and applications. Google AI writes about blending ML into the high-level system specification and architectural design stage, a pivotal contributing factor to the chip's overall performance.
A tale of two frameworks: PyTorch vs. TensorFlow
TensorFlow and PyTorch are the two most popular machine learning frameworks. Microsoft writes a comparison article that illustrates the differences between PyTorch and TensorFlow by creating and training two simple models, focusing on how to use dynamic subclassed models with the Module API from PyTorch 1.x and the Module API from TensorFlow 2.x.
Time Travel is Real - Building Offline Evaluation Frameworks
In a typical system design, multiple services act on an entity and change its state. For offline ML model evaluation, it is challenging to add a temporal view across all the systems' data and processing so that a time travel function can identify the state of customer interactions at any point in time. Capital One writes an exciting blog post discussing some of the challenges of building such a system, along with a high-level reference architecture.
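To make the time travel idea concrete, here is a minimal sketch, assuming every state change is kept as a timestamped event. It is not Capital One's architecture; the `Event` record, the `state_as_of` helper, and the sample customer data are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One state change recorded by some upstream service (hypothetical shape)."""
    entity_id: str
    at: datetime      # when the change took effect
    changes: dict     # fields set by this event


def state_as_of(events, entity_id, as_of):
    """Replay one entity's events in time order, up to and including `as_of`."""
    state = {}
    relevant = sorted(
        (e for e in events if e.entity_id == entity_id),
        key=lambda e: e.at,
    )
    for event in relevant:
        if event.at > as_of:
            break  # later events must not leak into the historical view
        state.update(event.changes)
    return state


# Illustrative data only.
EVENTS = [
    Event("cust-1", datetime(2020, 1, 1), {"tier": "basic"}),
    Event("cust-1", datetime(2020, 3, 1), {"tier": "gold", "limit": 5000}),
    Event("cust-1", datetime(2020, 6, 1), {"limit": 8000}),
]

if __name__ == "__main__":
    print(state_as_of(EVENTS, "cust-1", datetime(2020, 4, 1)))
```

Replaying events up to a cutoff keeps future information out of the features used for offline evaluation, which is the property a time travel function needs. A production system would likely rely on snapshots or indexed storage rather than a full replay.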
Data Lake: Concepts, Characteristics, Architecture, and Case Studies
Alibaba Cloud writes an excellent overview of data lakes. The blog summarizes what a data lake is, the characteristics of a data lake, the differences between the Lambda and Kappa architectural patterns, a comparison of the commercially available data lake solutions, and a case study of Huawei's data lake system design.
Next-Gen Data Movement Platform at PayPal
Data that moves is alive and valuable. At rest, data is dead. PayPal writes about its journey to build the next generation of its data movement platform. The design principles behind PayPal's Risk Analytical Dynamic Datasets (RAAD) pipeline, built on top of Apache Gobblin and Apache Airflow, make an exciting read about a self-service unified data platform.
Real-time Analytics with Presto and Apache Pinot
Apache Pinot writes a two-part post about Pinot's integration with Presto. The blog narrates various design choices and the trade-off between latency and flexibility, and discusses Pinot's aggregation pushdown implementation, which brings a significant performance improvement.
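As a rough intuition for why pushing aggregation down helps, here is a toy sketch in plain Python. It is not Pinot or Presto code; the `SEGMENTS` data and both functions are hypothetical, and the count of "shipped" items only stands in for data transferred between the storage layer and the query engine.

```python
from collections import Counter

# Hypothetical storage segments holding (country, clicks) rows.
SEGMENTS = [
    [("US", 3), ("IN", 5), ("US", 2)],
    [("IN", 1), ("DE", 4)],
]


def aggregate_without_pushdown(segments):
    """Ship every raw row to the query engine, then aggregate there."""
    rows = [row for segment in segments for row in segment]
    totals = Counter()
    for country, clicks in rows:
        totals[country] += clicks
    return dict(totals), len(rows)  # (result, rows shipped)


def aggregate_with_pushdown(segments):
    """Each segment pre-aggregates; only partial sums are shipped and merged."""
    totals = Counter()
    shipped = 0
    for segment in segments:
        partial = Counter()
        for country, clicks in segment:
            partial[country] += clicks
        shipped += len(partial)   # one entry per group, not per row
        totals.update(partial)    # Counter.update adds the partial sums
    return dict(totals), shipped


if __name__ == "__main__":
    print(aggregate_without_pushdown(SEGMENTS))
    print(aggregate_with_pushdown(SEGMENTS))
```

Both paths return the same totals, but the pushdown path moves one entry per group per segment instead of one per row. The gap widens as rows per group grow.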
The Growing Importance of Metadata Management Systems
Metadata management is a critical feature of data infrastructure. We've seen several technology companies develop internal metadata management systems and share the challenges that led them to focus on metadata, including Airbnb's Dataportal, Netflix's Metacat, Uber's Databook, LinkedIn's DataHub, Lyft's Amundsen, WeWork's Marquez, and Spotify's Lexikon. Gradient Flow writes an exciting blog about the importance of metadata, the current architectural patterns for metadata management, and the various vendors in the metadata landscape.
Review: Metadata Day 2020
LinkedIn organized Metadata Day 2020 last December as a general forum to discuss current trends in metadata management. Data Engineering Weekly wrote up a chronology of the metadata management systems built by various companies. Continuing to echo Metadata Day's impact, the author writes an exciting summary of Metadata Day 2020.
Data Engineering Weekly’s Metadata Day Special Edition:
Monte Carlo Data:
How to Make Your Data Pipelines More Reliable with SLAs
SLAs, SLOs, and SLIs are widely used to measure the reliability of services. Slack, for instance, provides customer credit if its uptime falls below the 99.99% SLA. The author narrates how a data pipeline can adopt similar measures to improve data reliability and minimize data downtime.
In case you missed it, I gave a talk a couple of years back about operating data pipelines on Airflow @Slack; it covers some best practices on data pipeline reliability.
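To illustrate how the SLI and SLO idea might carry over to a pipeline, here is a minimal sketch, assuming freshness is the indicator: the share of runs that finish within an agreed delay of their schedule. The `Run` record, the 60-minute threshold, and the 95% target are assumptions made up for illustration, not figures from the article.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Run:
    """One pipeline run (hypothetical record shape)."""
    scheduled_for: datetime
    finished_at: datetime


def freshness_sli(runs, max_delay):
    """SLI: fraction of runs that finished within `max_delay` of their schedule."""
    if not runs:
        raise ValueError("need at least one run to compute an SLI")
    on_time = sum(1 for r in runs if r.finished_at - r.scheduled_for <= max_delay)
    return on_time / len(runs)


def meets_slo(sli, target):
    """SLO check: is the measured SLI at or above the agreed target?"""
    return sli >= target


# Illustrative data: four daily 06:00 runs, delayed by 20, 35, 90, and 10 minutes.
RUNS = [
    Run(datetime(2021, 1, day, 6, 0),
        datetime(2021, 1, day, 6, 0) + timedelta(minutes=delay))
    for day, delay in enumerate([20, 35, 90, 10], start=1)
]

if __name__ == "__main__":
    sli = freshness_sli(RUNS, max_delay=timedelta(minutes=60))
    print(sli, meets_slo(sli, target=0.95))
```

Here three of four runs land within the hour, so the SLI is 0.75 and the assumed 95% objective is missed. An SLA would then attach a consequence to missing the objective.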
How to develop data pipeline in Airflow through TDD (test-driven development)
Continuous integration and testing are vital to improving productivity when developing a data pipeline. One of the key reasons for Apache Airflow's success is that data pipelines can be written and tested programmatically. The author writes an exciting blog walking through the steps to enable test-driven development of data pipelines using Apache Airflow.
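The test-first workflow can be sketched without Airflow at all: keep the task logic in a plain function, write the tests first, and wire the function into a DAG afterwards (for example as the `python_callable` of a `PythonOperator`). This is not the linked article's code; `dedupe_latest` and its record shape are hypothetical, and the Airflow wiring is deliberately left out.

```python
import unittest


def dedupe_latest(records):
    """Keep the most recent record per id (hypothetical pipeline step)."""
    latest = {}
    for record in records:
        current = latest.get(record["id"])
        if current is None or record["updated_at"] > current["updated_at"]:
            latest[record["id"]] = record
    return sorted(latest.values(), key=lambda r: r["id"])


class DedupeLatestTest(unittest.TestCase):
    """In TDD these tests are written before the function body exists."""

    def test_keeps_newest_record_per_id(self):
        records = [
            {"id": 1, "updated_at": 10, "value": "old"},
            {"id": 1, "updated_at": 20, "value": "new"},
            {"id": 2, "updated_at": 5, "value": "only"},
        ]
        result = dedupe_latest(records)
        self.assertEqual([r["value"] for r in result], ["new", "only"])

    def test_empty_input_returns_empty_list(self):
        self.assertEqual(dedupe_latest([]), [])


def run_tests():
    """Run the suite programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DedupeLatestTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```

Keeping the transformation free of scheduler imports means the tests run in milliseconds in CI. DAG-level checks, such as verifying that DAG files import cleanly, can be layered on top.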
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.