Data Engineering Weekly #21: Metadata Edition

Weekly Data Engineering Newsletter

Welcome to the 21st edition of the data engineering newsletter. This edition focuses on recent breakthroughs in metadata management. I believe the next big set of challenges in data engineering is all about efficient data management.

Data without metadata is like a dictionary without meanings.

On this note, LinkedIn is organizing Metadata Day 2020 to unify industry thinking around metadata management. If you have not registered yet, please follow the link and register.

As a small contribution to LinkedIn's efforts, I'm dedicating this week's newsletter to metadata, capturing the timeline of various metadata management systems from Netflix, Lyft, Uber, Airbnb, LinkedIn, Datakin, PayPal, Spotify, Shopify, and Facebook.

May 2015: Apache Atlas Joins Apache Incubator

The Apache Atlas project, originally from Hortonworks, joined the Apache Incubator. It focuses on providing open metadata management and governance capabilities so that organizations can build a catalog of their data assets, classify and govern those assets, and provide collaboration capabilities around them for data scientists, analysts, and the data governance team. Apache Atlas graduated as a top-level Apache project in June 2017. IBM wrote an excellent article on the role of Apache Atlas in the open ecosystem.

March 2016: Open Sourcing WhereHows: A Data Discovery and Lineage Portal - LinkedIn

LinkedIn wrote about WhereHows, a project of the LinkedIn Data team that creates a central repository and portal for the processes, people, and knowledge around the most crucial element of any big data system: the data itself. At the time of publication, WhereHows already carried an impressive 50 thousand datasets, 14 thousand comments, 35 million job executions, and related lineage information.

March 2017: Democratizing Data at Airbnb - Airbnb

Airbnb developed DataPortal to democratize data and empower Airbnb employees to be data-informed by aiding with data exploration, discovery, and trust. The article is an excellent read detailing the fragmented data landscape and data modeling techniques for the data discovery tooling.

June 2018: Metacat: Making Big Data Discoverable and Meaningful at Netflix - Netflix

Netflix wrote about Metacat, a system that acts as a federated metadata access layer for all data stores. It is a centralized service that various compute engines can use to access the different data sets. Metacat adopted an interesting architectural pattern: the respective metadata stores remain the source of truth for schema metadata, and Metacat does not materialize that metadata in its own storage.
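To make the federated pattern concrete, here is a minimal Python sketch. It is not Metacat's API; the names FederatedCatalog, InMemoryStore, and the "store/table" naming scheme are assumptions made for illustration. The point it shows is that the catalog only routes lookups to the owning store and never keeps its own copy of a schema.

```python
from typing import Dict, List, Protocol


class MetadataStore(Protocol):
    """Anything that can report the schema of one of its own tables."""

    def get_schema(self, table: str) -> List[str]: ...


class InMemoryStore:
    """Hypothetical stand-in for a real store such as Hive or an RDBMS."""

    def __init__(self, schemas: Dict[str, List[str]]) -> None:
        self._schemas = schemas

    def get_schema(self, table: str) -> List[str]:
        return self._schemas[table]


class FederatedCatalog:
    """Routes each lookup to the owning store; keeps no schema copy itself."""

    def __init__(self) -> None:
        self._stores: Dict[str, MetadataStore] = {}

    def register(self, name: str, store: MetadataStore) -> None:
        self._stores[name] = store

    def get_schema(self, qualified_name: str) -> List[str]:
        # Assumed naming scheme: "<store>/<table>".
        store_name, table = qualified_name.split("/", 1)
        return self._stores[store_name].get_schema(table)


hive_schemas = {"plays": ["user_id", "title_id"]}
catalog = FederatedCatalog()
catalog.register("hive", InMemoryStore(hive_schemas))
schema_before = catalog.get_schema("hive/plays")

# A schema change in the source store is visible on the next lookup,
# because the catalog never cached the old schema.
hive_schemas["plays"] = ["user_id", "title_id", "device"]
schema_after = catalog.get_schema("hive/plays")
```

The trade-off in this design is that every lookup depends on the source store being available, in exchange for never serving a stale schema.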

August 2018: Databook: Turning Big Data into Knowledge with Metadata at Uber - Uber

Uber wrote about Databook's journey from static HTML files uploaded regularly to a dynamic, easy-to-navigate UI. The blog narrates the choice between event-based and scheduled metadata collection, data modeling strategies, and search engine support.

November 2018: Marquez: A Metadata Service for Data Abstraction, Data Lineage, and Event-based Triggers - WeWork

Datakin talked about Marquez, an open-source metadata service developed and released by WeWork. Marquez follows a centralized data storage model with a REST API to ingest the data and a MetadataUI for discovering datasets, connecting multiple datasets, and exploring their dependency graph.

April 2019: Amundsen — Lyft’s data discovery & metadata engine - Lyft

Lyft wrote about Amundsen, a data discovery system built on top of its metadata services. The blog narrates the increasing complexity that comes with data growth and how it impacts productivity and compliance. The blog is an excellent read that focuses on the user experience perspective instead of the technology design.

October 2019: Open Sourcing Amundsen: A Data Discovery And Metadata Platform - Lyft

Lyft open-sourced Amundsen and wrote in detail about the architecture that powers the data discovery engine. The blog compares the pull and push models for ingesting metadata and explains the benefits of the pull model. Amundsen consists of a generic data ingestion framework called DataBuilder, a frontend service, a metadata service that handles requests from the frontend, and a search service backed by Elasticsearch.
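The difference between the two ingestion models can be sketched in a few lines of Python. This is an illustration only and does not use Amundsen's DataBuilder API; MetadataService, pull_ingest, push_ingest, and crawl_warehouse are hypothetical names. In the pull model the catalog side crawls the source on its own schedule, while in the push model the source system sends each change itself.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


class MetadataService:
    """Hypothetical central metadata store keyed by table name."""

    def __init__(self) -> None:
        self.tables: Dict[str, Record] = {}

    def upsert(self, record: Record) -> None:
        self.tables[record["name"]] = record


def pull_ingest(extract: Callable[[], List[Record]], service: MetadataService) -> int:
    """Pull model: the catalog side runs this on a schedule and crawls the source."""
    records = extract()
    for record in records:
        service.upsert(record)
    return len(records)


def push_ingest(record: Record, service: MetadataService) -> None:
    """Push model: the source system calls this whenever its metadata changes."""
    service.upsert(record)


def crawl_warehouse() -> List[Record]:
    # Stand-in for querying a warehouse's information schema.
    return [
        {"name": "rides", "owner": "marketplace"},
        {"name": "drivers", "owner": "supply"},
    ]


service = MetadataService()
pulled = pull_ingest(crawl_warehouse, service)
push_ingest({"name": "rides", "owner": "pricing"}, service)
```

In this sketch the pull path needs no cooperation from the source system, which is one commonly cited reason to prefer it; the push path delivers changes sooner but requires every source to integrate with the catalog.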

February 2020: Open sourcing DataHub: LinkedIn’s metadata search and discovery platform - LinkedIn

LinkedIn open-sourced DataHub, its metadata search and discovery platform, and wrote about the journey from WhereHows to DataHub. The blog narrates the difficulty of developing an open-source-first generic framework and how the DataHub team built tooling to support open-source contributions.

March 2020: How We Improved Data Discovery for Data Scientists at Spotify - Spotify

Spotify wrote about Lexicon, a data discovery service built to improve the data discovery experience for data scientists. The discovery experience focuses on personalization, such as finding popular datasets across the organization, finding relevant datasets for a team, and suggesting datasets that everyone should be aware of.

Note: The Spotify engineering blog was down at the time of writing this newsletter.

June 2020: Marquez Joins LF AI as New Incubation Project

Marquez joined LF AI Foundation as an incubation project.

July 2020: How We’re Solving Data Discovery Challenges at Shopify - Shopify

Shopify wrote about Artifact, its data discovery and data management tool, built to increase productivity, provide greater accessibility to data, and allow for a higher level of data governance. The blog narrates the challenges of building a data discovery service, from acquiring metadata to transforming, modeling, and presenting it so that it is easier to consume.

August 2020: Amundsen Joins LF AI as New Incubation Project

Almost a year after being open-sourced, Amundsen joined the LF AI Foundation.

October 2020: Nemo: Data discovery at Facebook - Facebook

Facebook wrote about Nemo, its data discovery engine. Nemo has two major components, indexing and serving, with a front end on top of the serving component. Indexing is in turn divided into bulk indexing, which happens daily, and instant indexing, which updates the index immediately. On the serving side, it is particularly interesting that Nemo adopts a spaCy-based NLP library for text parsing and an ML approach for post-processing.
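The split between bulk and instant indexing can be illustrated with a toy inverted index in Python. This is only a sketch of the general idea, not Nemo's implementation; the SearchIndex class, its method names, and the sample table descriptions are all assumptions. The bulk path rebuilds the whole index from a snapshot, and the instant path updates a single document in place between rebuilds.

```python
from collections import defaultdict
from typing import Dict, Optional, Set


class SearchIndex:
    """Toy inverted index mapping lowercase tokens to document ids."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)
        self._docs: Dict[str, str] = {}

    def _remove(self, doc_id: str) -> None:
        # Drop the old tokens of a document before re-indexing it.
        old: Optional[str] = self._docs.pop(doc_id, None)
        if old is None:
            return
        for token in old.lower().split():
            self._postings[token].discard(doc_id)

    def index_instantly(self, doc_id: str, text: str) -> None:
        """Instant path: update one document as soon as it changes."""
        self._remove(doc_id)
        self._docs[doc_id] = text
        for token in text.lower().split():
            self._postings[token].add(doc_id)

    def rebuild_in_bulk(self, docs: Dict[str, str]) -> None:
        """Bulk path: discard everything and rebuild from a full snapshot."""
        self._postings.clear()
        self._docs.clear()
        for doc_id, text in docs.items():
            self.index_instantly(doc_id, text)

    def search(self, term: str) -> Set[str]:
        # Return a copy so callers cannot mutate the index.
        return set(self._postings.get(term.lower(), set()))


index = SearchIndex()
index.rebuild_in_bulk({"t1": "daily ride events", "t2": "driver payouts"})
bulk_hits = index.search("ride")

# A new table becomes searchable without waiting for the next bulk run.
index.index_instantly("t3", "ride cancellations")
instant_hits = index.search("ride")
```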

November 2020: Turning Metadata Into Insights with Databook - Uber

Uber reflected on its experience running Databook and evolving it over time. The blog narrates the importance of well-structured, well-managed metadata, a centralized metadata system that focuses on the user experience, and an extensible data model.

December 2020: DataHub: Popular metadata architectures explained - LinkedIn

LinkedIn wrote about DataHub, its third-generation evolution built on the lessons from WhereHows. The blog narrates the rich learning gathered from the first-generation data discovery tooling through to the third-generation approach. The third-generation DataHub adopted a log-oriented metadata sourcing approach and strongly typed, domain-oriented metadata models. The adoption of the Pegasus schema language (PDL) by DataHub's Generalized Metadata Architecture is an exciting read. Uber's Databook adopted Dragon, a similar data schema modeling technique.
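A rough Python sketch of what log-oriented sourcing with strongly typed models can look like follows. It uses plain dataclasses rather than PDL, and it is not DataHub's API; OwnershipChange, SchemaChange, MetadataLog, and materialize are hypothetical names. Producers append typed change events to an append-only log, and any consumer can replay the log to build its own view.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class OwnershipChange:
    dataset: str
    owner: str


@dataclass(frozen=True)
class SchemaChange:
    dataset: str
    fields: Tuple[str, ...]


# Only these typed events may enter the log.
ChangeEvent = Union[OwnershipChange, SchemaChange]


class MetadataLog:
    """Append-only log; existing entries are never modified."""

    def __init__(self) -> None:
        self._events: List[ChangeEvent] = []

    def append(self, event: ChangeEvent) -> int:
        self._events.append(event)
        return len(self._events) - 1  # offset of the new event

    def read_from(self, offset: int) -> List[ChangeEvent]:
        return self._events[offset:]


def materialize(events: List[ChangeEvent]) -> Dict[str, Dict[str, object]]:
    """Replay events in order to build a per-dataset view; later events win."""
    view: Dict[str, Dict[str, object]] = {}
    for event in events:
        entry = view.setdefault(event.dataset, {})
        if isinstance(event, OwnershipChange):
            entry["owner"] = event.owner
        elif isinstance(event, SchemaChange):
            entry["fields"] = event.fields
    return view


log = MetadataLog()
log.append(OwnershipChange("rides", "marketplace"))
log.append(SchemaChange("rides", ("ride_id", "fare")))
last_offset = log.append(OwnershipChange("rides", "pricing"))
view = materialize(log.read_from(0))
```

Because the log is the source of changes, a search index, a graph store, or any other consumer could each replay it independently from whatever offset it last processed.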

December 2020: The journey of metadata at PayPal - PayPal

PayPal wrote about the evolution of its Universal Data Catalog (UDC), starting from its incubation in 2017. The blog narrates how UDC's growth helped PayPal deprecate several duplicate infrastructures and why PayPal adopted the pull model to source metadata.

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.