Welcome to the 21st edition of the data engineering newsletter. This edition focuses on the recent breakthroughs in metadata management. I believe the next big set of challenges in data engineering is all about efficient data management.
Data without metadata is like a dictionary without meanings.
On this note, LinkedIn is organizing Metadata Day 2020 (https://metadataday2020.splashthat.com/) to unify the industry's thinking around metadata management. If you have not registered yet, please follow the link and register.
As a small contribution to LinkedIn's efforts, I'm dedicating this week's newsletter to metadata, capturing the timeline of various metadata management systems from Netflix, Lyft, Uber, Airbnb, LinkedIn, Datakin, PayPal, Spotify, Shopify, and Facebook.
Apache Atlas Joins Apache Incubator
The Apache Atlas project, originally from Hortonworks, joined the Apache Incubator. It focuses on providing open metadata management and governance capabilities so that organizations can build a catalog of their data assets, classify and govern those assets, and provide collaboration capabilities around them for data scientists, analysts, and the data governance team. Apache Atlas graduated as a top-level Apache project in June 2017. IBM wrote an excellent article on the role of Apache Atlas in the open ecosystem.
Open Sourcing WhereHows: A Data Discovery and Lineage Portal -
LinkedIn wrote about WhereHows, a project of the LinkedIn Data team that creates a central repository and portal for the processes, people, and knowledge around the most crucial element of any big data system: the data itself. At the time the blog was published, WhereHows already carried an impressive 50 thousand datasets, 14 thousand comments, 35 million job executions, and related lineage information.
Democratizing Data at Airbnb -
Airbnb developed DataPortal to democratize data and empower Airbnb employees to be data-informed by aiding with data exploration, discovery, and trust. The article is an excellent read detailing the fragmented data landscape and data modeling techniques for the data discovery tooling.
Metacat: Making Big Data Discoverable and Meaningful at Netflix -
Netflix wrote about Metacat, a system that acts as a federated metadata access layer for all its data stores. It is a centralized service that various compute engines can use to access the different data sets. Metacat adopted an interesting architectural pattern where the respective metadata stores remain the source of truth for schema metadata, and Metacat does not materialize it in its own storage.
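To make the federated pattern concrete, here is a minimal Python sketch of my own. It is not Metacat's code or API; the class names (`MetadataStore`, `InMemoryStore`, `FederatedCatalog`) and the `store/table` naming scheme are assumptions for illustration. The point it tries to show is that the access layer only routes requests and holds no copy of the schema.

```python
from typing import Dict, Protocol


class MetadataStore(Protocol):
    """Hypothetical interface that each backing metadata store exposes."""

    def get_schema(self, table: str) -> Dict[str, str]: ...


class InMemoryStore:
    """Stand-in for a real source-of-truth metadata store."""

    def __init__(self, schemas: Dict[str, Dict[str, str]]) -> None:
        self._schemas = schemas

    def get_schema(self, table: str) -> Dict[str, str]:
        return self._schemas[table]


class FederatedCatalog:
    """Routes each lookup to the owning store; keeps no copy of the schema."""

    def __init__(self) -> None:
        self._stores: Dict[str, MetadataStore] = {}

    def register(self, name: str, store: MetadataStore) -> None:
        self._stores[name] = store

    def get_schema(self, qualified_name: str) -> Dict[str, str]:
        # Assumed naming scheme: "<store>/<table>".
        store_name, table = qualified_name.split("/", 1)
        return self._stores[store_name].get_schema(table)


hive = InMemoryStore({"events": {"user_id": "bigint", "ts": "timestamp"}})
catalog = FederatedCatalog()
catalog.register("hive", hive)
schema = catalog.get_schema("hive/events")
```

Because nothing is materialized in the catalog, a schema change in the backing store is visible on the very next lookup, at the cost of every read depending on that store being available.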
Databook: Turning Big Data into Knowledge with Metadata at Uber -
Uber wrote about Databook's journey from regularly uploaded static HTML files to a dynamic, easy-to-navigate UI. The blog narrates the choice between event-based and scheduled metadata collection, data modeling strategies, and search engine support.
Marquez: A Metadata Service for Data Abstraction, Data Lineage, and Event-based Triggers - WeWork
Datakin talked about Marquez, an open-source metadata service developed and released by WeWork. Marquez follows a centralized data storage model with a REST API to ingest the data, and a MetadataUI for discovering datasets, connecting multiple datasets, and exploring their dependency graph.
Amundsen — Lyft’s data discovery & metadata engine -
Lyft wrote about Amundsen, a data discovery system built on top of its metadata services. The blog narrates the increasing complexity that comes with data growth and how it impacts productivity and compliance. It is an excellent read that focuses on the user experience perspective instead of the technology design.
Open Sourcing Amundsen: A Data Discovery And Metadata Platform -
Lyft open-sourced Amundsen and wrote in detail about the architecture that powers the data discovery engine. The blog compares the pull and push models for ingesting metadata and discusses the benefits of the pull model. Amundsen consists of a generic data ingestion framework called DataBuilder, a frontend service, a metadata service that handles requests from the frontend, and a search service backed by Elasticsearch.
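For readers new to the pull vs. push distinction, here is a small Python sketch of my own. It does not use Amundsen's DataBuilder API; `Source`, `Catalog`, and every method name are hypothetical. In the pull model the catalog crawls the source on its own schedule, while in the push model producers publish changes that the catalog consumes.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class Source:
    """Hypothetical data source that can list its own table metadata."""

    def __init__(self, tables: Dict[str, dict]) -> None:
        self.tables = tables

    def list_tables(self) -> Dict[str, dict]:
        return dict(self.tables)


class Catalog:
    """Hypothetical metadata catalog supporting both ingestion styles."""

    def __init__(self) -> None:
        self.tables: Dict[str, dict] = {}
        self.inbox: Deque[Tuple[str, dict]] = deque()

    def pull(self, source: Source) -> None:
        # Pull: the catalog decides when to crawl and asks the source.
        self.tables.update(source.list_tables())

    def publish(self, name: str, meta: dict) -> None:
        # Push: a producer announces a change; the catalog only queues it.
        self.inbox.append((name, meta))

    def drain(self) -> int:
        # Apply queued changes in arrival order; return how many were applied.
        applied = 0
        while self.inbox:
            name, meta = self.inbox.popleft()
            self.tables[name] = meta
            applied += 1
        return applied


catalog = Catalog()
catalog.pull(Source({"rides": {"owner": "marketplace"}}))
catalog.publish("drivers", {"owner": "supply"})
applied = catalog.drain()
```

The trade-off the sketch hints at: pull needs no changes on the producer side but is only as fresh as the crawl schedule, whereas push is fresher but requires every producer to integrate with the catalog.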
Open sourcing DataHub: LinkedIn’s metadata search and discovery platform -
LinkedIn open-sourced DataHub, its metadata search and discovery platform, and wrote about the journey from WhereHows to DataHub. The blog narrates the difficulty of developing a generic, open-source-first framework and how the DataHub team built tooling to support open-source contributions.
How We Improved Data Discovery for Data Scientists at Spotify -
Spotify wrote about Lexicon, a data discovery service built to improve the data discovery experience for data scientists. The discovery experience focuses on personalization, such as finding popular datasets across the organization, finding datasets relevant to the team, and suggesting datasets that everyone should be aware of.
Note: The Spotify engineering blog was down at the time of writing this newsletter.
Marquez Joins LF AI as New Incubation Project
Marquez joined the LF AI Foundation as an incubation project.
How We’re Solving Data Discovery Challenges at Shopify -
Shopify wrote about Artifact, its data discovery and data management tool, built to increase productivity, provide greater accessibility to data, and allow for a higher level of data governance. The blog narrates the challenges of building a data discovery service, from acquiring metadata to transforming, modeling, and applying it to make consumption easier.
Amundsen Joins LF AI as New Incubation Project
Almost a year after being open-sourced, Amundsen joined the LF AI Foundation.
Nemo: Data discovery at Facebook -
Facebook wrote about Nemo, its data discovery engine. Nemo has two major components, indexing and serving, with a front end on top of the serving component. Indexing is in turn divided into bulk indexing, which happens daily, and instant indexing, which updates the index immediately. On the serving side, it is particularly interesting that Nemo adopts a spaCy-based NLP library for text parsing and an ML approach for post-processing.
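The split between bulk and instant indexing can be sketched with a toy inverted index. This is my own simplified Python illustration, not Nemo's implementation; the `SearchIndex` class and its methods are hypothetical, and the whitespace tokenizer stands in for the much richer parsing the blog describes.

```python
from collections import defaultdict
from typing import Dict, Set


class SearchIndex:
    """Toy inverted index with a bulk rebuild path and an instant update path."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    @staticmethod
    def _add(postings: Dict[str, Set[str]], doc_id: str, text: str) -> None:
        for token in text.lower().split():
            postings[token].add(doc_id)

    def bulk_index(self, docs: Dict[str, str]) -> None:
        # Bulk path: rebuild the whole index from a full snapshot (e.g. daily).
        fresh: Dict[str, Set[str]] = defaultdict(set)
        for doc_id, text in docs.items():
            self._add(fresh, doc_id, text)
        self._postings = fresh

    def instant_index(self, doc_id: str, text: str) -> None:
        # Instant path: patch the live index as soon as a change arrives.
        # Simplification: stale tokens of an updated document are not removed.
        self._add(self._postings, doc_id, text)

    def search(self, query: str) -> Set[str]:
        tokens = query.lower().split()
        if not tokens:
            return set()
        # A document matches only if it contains every query token.
        return set.intersection(*(self._postings.get(t, set()) for t in tokens))


index = SearchIndex()
index.bulk_index({"t1": "daily rides table", "t2": "user profile table"})
index.instant_index("t3", "new rides dashboard")
hits = index.search("rides")
```

In this sketch the next bulk rebuild would also clean up anything the instant path left stale, which is one plausible reason to run both paths side by side.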
Turning Metadata Into Insights with Databook -
Uber reflected on its experience running Databook and evolving it over time. The blog narrates the importance of well-structured, well-managed metadata, a centralized metadata system that focuses on the user experience, and an extensible data model.
DataHub: Popular metadata architectures explained -
LinkedIn wrote about DataHub, its third-generation metadata architecture, built on the lessons learned from WhereHows. The blog narrates the rich learning from the first-generation data discovery tooling through to the third-generation approach. The third-generation DataHub adopted a log-oriented metadata sourcing approach and strongly typed, domain-oriented metadata models. The adoption of the Pegasus schema language (PDL) by DataHub's Generalized Metadata Architecture is an exciting read. Uber's Databook adopted Dragon, a similar data schema modeling technique.
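As a rough illustration of what "log-oriented sourcing with strongly typed models" can mean, here is a short Python sketch of my own. It is not DataHub's PDL schema or its API; the event classes and the `materialize` function are hypothetical. Each metadata change is a typed event appended to a log, and the current view is derived by replaying that log in order.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class OwnershipChanged:
    """Hypothetical typed event: a dataset got a new owner."""
    dataset: str
    owner: str


@dataclass(frozen=True)
class SchemaChanged:
    """Hypothetical typed event: a dataset's field list changed."""
    dataset: str
    fields: Tuple[str, ...]


MetadataEvent = Union[OwnershipChanged, SchemaChanged]


def materialize(log: List[MetadataEvent]) -> Dict[str, dict]:
    """Replay the log in order; later events override earlier ones."""
    view: Dict[str, dict] = {}
    for event in log:
        entry = view.setdefault(event.dataset, {})
        if isinstance(event, OwnershipChanged):
            entry["owner"] = event.owner
        elif isinstance(event, SchemaChanged):
            entry["fields"] = list(event.fields)
        else:
            raise TypeError(f"unknown event type: {type(event).__name__}")
    return view


log: List[MetadataEvent] = [
    OwnershipChanged("tracking.page_views", "team-a"),
    SchemaChanged("tracking.page_views", ("member_id", "page_key")),
    OwnershipChanged("tracking.page_views", "team-b"),
]
view = materialize(log)
```

Since the log is the source of truth, any number of downstream views (search index, graph, audit trail) could be rebuilt from it independently.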
The journey of metadata at PayPal -
PayPal wrote about the evolution of its Universal Data Catalog (UDC), starting from its incubation in 2017. The blog narrates how UDC's growth helped PayPal deprecate several duplicate pieces of infrastructure and why PayPal adopted the pull model to source metadata.
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.