Data Engineering Weekly #28

Weekly Data Engineering Newsletter

Feb 07, 2021

Welcome to the 28th edition of the data engineering newsletter. This week's release is a new set of articles that focus on Google’s ML for computer architecture, Microsoft’s PyTorch vs. TensorFlow comparison, Capital One’s time-travel framework for offline ML evaluation, Alibaba Cloud’s introduction to data lakes, PayPal’s next-gen data movement platform, Apache Pinot’s integration story with Presto, Gradient Flow’s take on the growing importance of metadata, a Metadata Day 2020 overview, Monte Carlo Data’s data pipeline SLAs, and TDD with Apache Airflow.

Google: Machine Learning for Computer Architecture

Custom accelerators like Google TPU and Edge TPU have significantly advanced ML workloads. To sustain these advances, the hardware accelerator ecosystem must continue to innovate in architecture design and adapt to rapidly evolving ML models and applications. Google AI writes about blending ML into the high-level system specification and architectural design stage, a pivotal contributing factor to the chip's overall performance.

https://ai.googleblog.com/2021/02/machine-learning-for-computer.html

Microsoft: A tale of two frameworks: PyTorch vs. TensorFlow

TensorFlow and PyTorch are the two most popular machine learning frameworks. Microsoft writes a comparison article that illustrates the differences between PyTorch and TensorFlow by creating and training two simple models, focusing mainly on how to use dynamic subclassed models with the Module API from PyTorch 1.x and the Module API from TensorFlow 2.x.

https://medium.com/data-science-at-microsoft/a-tale-of-two-frameworks-pytorch-vs-tensorflow-f73a975e733d

Capital One: Time Travel is Real: Building Offline Evaluation Frameworks

In a typical system design, multiple services act on an entity and change its state. For offline ML model evaluation, it is challenging to build a temporal view of all the systems' data and processing that can identify the state of customer interactions at any point in time, a capability often called time travel. Capital One writes an exciting blog post discussing some of the challenges of building such a system, along with a high-level reference architecture.

https://medium.com/capital-one-tech/time-travel-is-real-building-offline-evaluation-frameworks-a78103613ef9
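
To make the time-travel idea concrete, here is a minimal sketch of a point-in-time lookup. It assumes each entity's state changes are stored as timestamped events. The `StateHistory` class, its method names, and the sample data are hypothetical illustrations and are not taken from the Capital One post.

```python
from bisect import bisect_right
from datetime import datetime


class StateHistory:
    """Append-only history of one entity's state (hypothetical helper)."""

    def __init__(self):
        self._times = []   # event timestamps, kept sorted
        self._states = []  # state recorded at the matching timestamp

    def record(self, at, state):
        # Insert while keeping both lists ordered by timestamp.
        idx = bisect_right(self._times, at)
        self._times.insert(idx, at)
        self._states.insert(idx, state)

    def as_of(self, at):
        # Return the latest state recorded at or before `at`, or None.
        idx = bisect_right(self._times, at)
        return self._states[idx - 1] if idx else None


history = StateHistory()
history.record(datetime(2021, 1, 1), {"card_status": "applied"})
history.record(datetime(2021, 1, 10), {"card_status": "approved"})
history.record(datetime(2021, 1, 20), {"card_status": "activated"})

# A model evaluated as of Jan 15 must only see state known by that date.
snapshot = history.as_of(datetime(2021, 1, 15))
```

Evaluating features through `as_of` rather than reading the current state is what keeps later events from leaking into an offline evaluation.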

Alibaba Cloud: Data Lake: Concepts, Characteristics, Architecture, and Case Studies

Alibaba Cloud writes an excellent overview of data lakes. The blog summarizes what a data lake is, the characteristics of a data lake, the differences between the Lambda and Kappa architectural patterns, a comparison of commercially available data lake solutions, and a case study of the Huawei data lake system design.

https://alibaba-cloud.medium.com/data-lake-concepts-characteristics-architecture-and-case-studies-28be1b265624

PayPal: Next-Gen Data Movement Platform at PayPal

Data that moves is alive and valuable. At rest, data is dead. PayPal writes about its journey to build its next-generation data movement platform. The design principles behind PayPal's Risk Analytical Dynamic Datasets (RAAD) pipeline, built on top of Apache Gobblin and Apache Airflow, make for an exciting read about a self-service unified data platform.

https://medium.com/paypal-engineering/next-gen-data-movement-platform-at-paypal-100f70a7a6b

Apache Pinot: Real-time Analytics with Presto and Apache Pinot

Apache Pinot writes a two-part post about Pinot's integration with Presto. The blog narrates various design choices and the trade-off between latency and flexibility, and discusses Pinot's aggregation pushdown implementation, which yields a significant performance improvement.

https://medium.com/apache-pinot-developer-blog/real-time-analytics-with-presto-and-apache-pinot-part-i-cc672caea307

https://medium.com/apache-pinot-developer-blog/real-time-analytics-with-presto-and-apache-pinot-part-ii-3d09ff937713

Gradient Flow: The Growing Importance of Metadata Management Systems

Metadata management is a critical feature of data infrastructure. We've seen several technology companies develop internal metadata management systems and share the challenges that led them to focus on metadata, including Airbnb's Dataportal, Netflix's Metacat, Uber's Databook, LinkedIn's DataHub, Lyft's Amundsen, WeWork's Marquez, and Spotify's Lexikon. Gradient Flow writes an exciting blog about the importance of metadata, the current architectural patterns for metadata management, and the various vendors in the metadata landscape.

https://gradientflow.com/the-growing-importance-of-metadata-management-systems/

Knowledge Technologies: Review: Metadata Day 2020

LinkedIn organized Metadata Day 2020 last December as a general forum to discuss current trends in metadata management. Data Engineering Weekly wrote up a chronology of the metadata management systems built by various companies. Continuing to echo Metadata Day's impact, the author writes an exciting summary of Metadata Day 2020.

https://medium.com/knowledge-technologies/review-metadata-day-2020-e38c28c4cf1a

Data Engineering Weekly’s Metadata Day Special Edition:

https://www.dataengineeringweekly.com/p/data-engineering-weekly-21-metadata

Monte Carlo Data: How to Make Your Data Pipelines More Reliable with SLAs

SLAs, SLOs, and SLIs are widely used to measure the reliability of services. Slack, for instance, provides customer credit if its uptime falls below the 99.99% SLA. The author narrates how data pipelines can adopt similar measures to improve data reliability and minimize data downtime.

https://towardsdatascience.com/how-to-make-your-data-pipelines-more-reliable-with-slas-b5eec928e906
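
As a rough illustration of how an SLI and SLO could apply to a pipeline, the sketch below measures data freshness as the fraction of runs that land within a time threshold. The `freshness_sli` function, the sample delays, and the 75% target are all assumptions made for this example and do not come from the article.

```python
from datetime import timedelta


def freshness_sli(delays, threshold):
    """Fraction of pipeline runs whose data landed within `threshold`."""
    if not delays:
        raise ValueError("no runs to measure")
    on_time = sum(1 for d in delays if d <= threshold)
    return on_time / len(delays)


# Hypothetical landing delays for five daily runs of one pipeline.
delays = [timedelta(minutes=m) for m in (12, 35, 20, 95, 18)]

SLO = 0.75  # assumed target: 75% of runs land within one hour
sli = freshness_sli(delays, threshold=timedelta(hours=1))
slo_met = sli >= SLO
```

Here four of the five runs land within the hour, so the measured SLI clears the assumed target. An SLA would then be the commitment made to data consumers about what happens when the SLO is missed.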

In case you missed it, I gave a talk a couple of years back about operating data pipelines on Airflow @Slack that covers some best practices for data pipeline reliability.

Marcos Marx: How to develop data pipeline in Airflow through TDD (test-driven development)

Continuous integration and testing are vital to improving productivity when developing a data pipeline. One of the critical reasons for Apache Airflow's success is that data pipelines can be written and tested programmatically. The author writes an exciting blog walking through the steps to enable test-driven development of data pipelines using Apache Airflow.

https://blog.magrathealabs.com/how-to-develop-data-pipeline-in-airflow-through-tdd-test-driven-development-c3333439f358
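
For a feel of the test-first loop, here is a minimal sketch that assumes the transformation logic lives in a plain Python function that an Airflow task would call. The `dedupe_orders` function and its sample rows are hypothetical, and the Airflow wiring is intentionally left out so the example runs without Airflow installed. It is not the approach from the linked post, only an illustration of the idea.

```python
import unittest


def dedupe_orders(rows):
    """Keep the latest row per order_id (hypothetical task callable)."""
    latest = {}
    for row in rows:
        current = latest.get(row["order_id"])
        if current is None or row["updated_at"] > current["updated_at"]:
            latest[row["order_id"]] = row
    return sorted(latest.values(), key=lambda r: r["order_id"])


class DedupeOrdersTest(unittest.TestCase):
    # In TDD these tests are written first and fail until the function exists.
    def test_keeps_latest_version_of_each_order(self):
        rows = [
            {"order_id": 1, "updated_at": 1, "status": "new"},
            {"order_id": 1, "updated_at": 2, "status": "paid"},
            {"order_id": 2, "updated_at": 1, "status": "new"},
        ]
        result = dedupe_orders(rows)
        self.assertEqual([r["status"] for r in result], ["paid", "new"])

    def test_empty_input_returns_empty_list(self):
        self.assertEqual(dedupe_orders([]), [])


# Run the tests in-process instead of calling unittest.main().
suite = unittest.defaultTestLoader.loadTestsFromTestCase(DedupeOrdersTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping the logic in a plain function means the same tests can run in continuous integration without a scheduler or database.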

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.
