Data Engineering Weekly #22

Weekly Data Engineering Newsletter

Dec 20, 2020

Welcome to the 22nd edition of the data engineering newsletter. This week's release is a new set of articles that focus on Datakin’s OpenLineage, LinkedIn’s Metadata Day, Microsoft’s metadata management, the death of the data catalog, Alibaba’s real-time data warehouse, Uber’s no-code workflow orchestrator, Line’s self-service batch ingestion, Slack’s React logging library, LinkedIn’s Coral, Netflix’s ML for content decision making, and Intuit’s ML platform.

Datakin: Introducing OpenLineage

2020 is the year we have seen the rise of metadata management. You can read about the chronological order of data management development here. Building on that momentum and unifying the data lineage effort, Datakin and other leading open-source data lineage and orchestration projects, Airflow, Amundsen, DataHub, dbt, Egeria, Great Expectations, Iceberg, Marquez, Pandas, Parquet, Prefect, Spark, and Superset, announced the OpenLineage initiative.

https://datakin.com/2020/12/18/introducing-openlineage/

Slides: https://www2.slideshare.net/julienledem/open-core-summit-observability-for-data-pipelines-with-openlineage

LinkedIn: Metadata Day 2020 - Metaspeak Meetup

LinkedIn organized Metadata Day 2020 on December 14th. The meetup video is now available on YouTube.

https://www.youtube.com/channel/UCDoVCT4j6QmKCnNmmNoWtBw

Microsoft: Partnering for metadata management

Metadata management is concerned with information that is not the data itself, but rather is about the data. Microsoft’s Azure data science team narrates its metadata management journey from an internal Azure Knowledge Graph to the adoption of Azure Purview.

https://medium.com/data-science-at-microsoft/partnering-for-metadata-management-277733911d03

Monte Carlo: Data Catalogs Are Dead; Long Live Data Discovery

Organizations in the past have relied on data catalogs to power data governance. But is that enough? Knowing where your data lives and who has access to it is fundamental to understanding its impact on your business. The article is an exciting read on where the data catalog falls short and why data discovery services are needed.

https://towardsdatascience.com/data-catalogs-are-dead-long-live-data-discovery-a0dc8d02bd34

Alibaba Cloud: Evolution of the Real-time Data Warehouses of the Alibaba Search and Recommendation Data Platform

The Alibaba Search and Recommendation Data Platform team writes about its real-time data warehouse architecture, which supports multiple e-commerce businesses such as Taobao (Alibaba Group), Taobao Special Edition (Taobao C2M), and Eleme. The blog is an exciting read about the journey of the real-time infrastructure, some of the shortfalls of Apache HBase, and the adoption of the homegrown Hologres.

https://alibaba-cloud.medium.com/evolution-of-the-real-time-data-warehouses-of-the-alibaba-search-and-recommendation-data-platform-fdb5292a01e2

Alibaba Hologres Paper: https://kai-zeng.github.io/papers/hologres.pdf

Uber: No Code Workflow Orchestrator for Building Batch & Streaming Pipelines at Scale

Apache Airflow reimagined data pipeline orchestration as something done programmatically. The commoditization of computing and storage has led organizations to adopt data at all levels of the business. It also brings the challenge of how to empower everyone in the organization to create data pipelines. Uber writes an exciting blog on how the team, inspired by no-code systems, built uWorc, a simple drag-and-drop interface that can manage the entire life cycle of a batch or streaming pipeline without writing a single line of code.

https://eng.uber.com/no-code-workflow-orchestrator/

Line: Introducing Frey: LINE’s new self-service batch ingestion system

Continuing the self-service data processing trend, Line writes about Frey, its self-service batch ingestion system. Frey integrates with Airflow and provides a UI that eliminates the learning curve for users. Once a job is created and deployed, users can get information such as execution status and logs, and perform operations such as backfill and rerun.

https://engineering.linecorp.com/en/blog/introducing-frey-lines-new-self-service-batch-ingestion-system/

Slack: Creating a React Analytics Logging Library

Instrumenting domain events is the most critical part of building data products. The instrumentation is often manual and hurts developer productivity. Slack writes an excellent blog on how it built a client-side React logging library and improved developer productivity.

https://slack.engineering/creating-a-react-analytics-logging-library-2/

LinkedIn: Coral: A SQL translation, analysis, and rewrite engine for modern data lakehouses

The Big Data computation infrastructure is continuously evolving. The industry has come a long way from MapReduce to Hive, Pig, Spark, and Presto. The evolution also brings interoperability issues among the computation frameworks. LinkedIn developed the Dali Catalog to abstract away the interoperability complexity and provide a unified view of the data. LinkedIn writes about Coral, its open-source SQL translation, analysis, and rewrite engine that integrates with Dali and enables Dali view portability across execution engines like Presto, Spark, and Pig.

https://engineering.linkedin.com/blog/2020/coral

Netflix: Supporting content decision makers with machine learning

Netflix is pioneering content creation at an unprecedented scale. The commissioning of a series or film is a creative decision. How can ML be used to predict outcomes and support the creative process? In this post, Netflix writes about how machine learning and statistical modeling can help creative decision-makers tackle these questions on a global scale.

https://netflixtechblog.com/supporting-content-decision-makers-with-machine-learning-995b7b76006f

P.S: The blog is an exciting read for me personally since it was one of the on-site interview questions for me at Slack :-)

Intuit: Accelerating AI @Intuit With Feature Pipelines and Store

Operating an ML pipeline in production and dealing with complex infrastructure like AWS and streaming technologies such as Kafka, Spark Streaming, and Flink is hard. An efficient abstraction of the ML lifecycle management system can accelerate business innovation. Intuit writes about the feature engineering and feature store part of its ML platform and narrates some of its core features.

https://www.linkedin.com/pulse/accelerating-ai-intuit-feature-pipelines-store-simarpal-khaira/

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.
