Welcome to the 45th edition of the data engineering newsletter. This week's edition focuses on Rohan Goel's ultimate repo of data discovery solutions, NYT's tracking of Covid-19 from hundreds of sources, Adevinta's building of a data mesh to support an ecosystem of data products, Nvidia's primer on synthetic data, Adobe's migration to Apache Iceberg, Confluent's take on consistency & completeness in Kafka Streams, Databricks' Photon public preview announcement, StarTree's introduction to geospatial queries in Pinot, LinkedIn's text analytics in Pinot, and Myntra's Apache Spark optimization using Sparklens.
Rohan Goel: The Ultimate Repo of Data Discovery Solutions
Data discovery systems are critical infrastructure in data engineering, and a growing number of startups are working to solve the discovery problem. The Ultimate Repo of Data Discovery Solutions is an excellent piece of work that catalogs the current data discovery solutions. Thanks, Rohan, for sharing it.
https://www.notion.so/The-Ultimate-Repo-of-Data-Discovery-Solutions-149b0ea2a2ed401d84f2b71681c5a369
In a previous edition, Data Engineering Weekly published a metadata special that captures the timeline of data discovery system development.
https://www.dataengineeringweekly.com/p/data-engineering-weekly-21-metadata
New York Times: Tracking Covid-19 From Hundreds of Sources, One Extracted Record at a Time
NYT shared its experience developing its Covid-19 tracking application, where the work grew from a single spreadsheet to more than 9.98 million programmatic requests for Covid-19 data from websites worldwide. It's fascinating to read about the scraper development, since county and city source websites changed frequently during the Covid crisis. The blog raises an important point: public data is not open data unless it is well maintained, documented, and exposed through queryable APIs.
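To illustrate the kind of extraction work the article describes, here is a minimal, hypothetical scraper sketch in Python. The URL, page structure, and field names are all assumptions for illustration, not the NYT's actual pipeline:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical source page; real county/city dashboards changed layout frequently.
SOURCE_URL = "https://example-county.gov/covid-dashboard"

def extract_case_count() -> dict:
    """Fetch a public dashboard page and extract one Covid-19 record."""
    response = requests.get(SOURCE_URL, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # The selector is an assumption; in practice, each source needs its own
    # parser, and parsers break whenever the page layout changes.
    cases = soup.select_one("span.total-cases")
    if cases is None:
        raise ValueError("page layout changed; scraper needs an update")
    return {"source": SOURCE_URL, "total_cases": int(cases.text.replace(",", ""))}
```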
Adevinta: Building a data mesh to support an ecosystem of data products at Adevinta
Adevinta writes an exciting blog about its journey towards a data mesh architecture and what did and didn't work. The learnings, focused on SQL access, Dataset as a Product, and domain data, are an excellent blueprint for implementing a data mesh architecture.
It is exciting to read about Adevinta's classification of datasets into "Core Datasets" and "Domain Datasets." While implementing decentralized data ownership, it is essential to recognize that data is inherently social: a standalone domain adds no value unless it can connect with other domains.
Nvidia: What Is Synthetic Data?
Synthetic data is annotated information that computer simulations or algorithms generate as an alternative to real-world data. It is gaining importance as AI pioneer Andrew Ng calls for a broad shift to a more data-centric approach to machine learning. The blog narrates the history of synthetic data and compares it with augmented and anonymized data.
https://blogs.nvidia.com/blog/2021/06/08/what-is-synthetic-data/
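As a toy illustration of the idea (not Nvidia's tooling), here is a minimal sketch that generates a labeled synthetic dataset with scikit-learn instead of collecting and annotating real-world data:

```python
from sklearn.datasets import make_classification

# Algorithmically generate 1,000 annotated samples: the labels come for free
# because the generator knows the ground truth it sampled from.
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_classes=2,
    random_state=42,
)
print(X.shape, y.shape)  # (1000, 20) (1000,)
```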
Adobe: Migrating to Apache Iceberg at Adobe Experience Platform
Adobe shared its experience migrating to Apache Iceberg for faster data access and a reduced dependency on catalogs. The blog narrates the pros & cons of an in-place upgrade vs. a shadow migration, and the decision matrix used to choose between the two strategies is a practice one can adopt in any migration project.
https://medium.com/adobetech/migrating-to-apache-iceberg-at-adobe-experience-platform-40fa80f8b8de
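For context, Iceberg ships Spark procedures for both strategies: snapshot creates a shadow Iceberg table and leaves the source untouched, while migrate converts a table in place. A minimal PySpark sketch, where the catalog and table names are assumptions and the session is assumed to be configured with Iceberg's SQL extensions:

```python
from pyspark.sql import SparkSession

# Assumes Iceberg's Spark runtime jar and SQL extensions are configured.
spark = SparkSession.builder.appName("iceberg-migration").getOrCreate()

# Shadow migration: create an Iceberg snapshot of an existing table, leaving
# the source untouched for validation and easy rollback.
spark.sql("CALL spark_catalog.system.snapshot('db.events', 'db.events_iceberg')")

# In-place upgrade: convert the source table to Iceberg directly.
spark.sql("CALL spark_catalog.system.migrate('db.events')")
```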
Confluent: Consistency and Completeness - Rethinking Distributed Stream Processing in Apache Kafka
A vital feature of a stream processing engine is that it can recover from failures to a consistent state, so the final results neither contain duplicates nor lose data (consistency), and that it does not emit incomplete, partial outputs as final results even when input records arrive out of order (completeness). Confluent writes an exciting blog narrating how Kafka Streams guarantees these stream processing semantics.
https://www.confluent.io/blog/rethinking-distributed-stream-processing-in-kafka/
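Kafka Streams itself is a Java library, but the consistency building block it relies on, Kafka transactions, can be sketched with the confluent-kafka Python client. Broker address, topic names, and ids below are assumptions:

```python
from confluent_kafka import Consumer, Producer

# A consume-transform-produce loop with transactional semantics: the output
# record and the input offset commit succeed or fail atomically.
producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "transactional.id": "word-count-processor",  # assumption: any stable id
})
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "word-count-group",
    "isolation.level": "read_committed",  # only read committed records
    "enable.auto.commit": False,
})
consumer.subscribe(["input-topic"])
producer.init_transactions()

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    producer.begin_transaction()
    producer.produce("output-topic", msg.value())
    # Commit the input offsets inside the same transaction, so a failure
    # before commit leaves no duplicates and loses no data on retry.
    producer.send_offsets_to_transaction(
        consumer.position(consumer.assignment()),
        consumer.consumer_group_metadata(),
    )
    producer.commit_transaction()
```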
Databricks: Announcing Photon Public Preview - The Next Generation Query Engine on the Databricks Lakehouse Platform
Databricks announced a brand-new query engine, Photon, to run SQL & Spark SQL queries on top of Delta Lake. Photon does not yet support all Spark features, so a single query can run partially in Photon and partially in Spark. It would be exciting to read how the optimizer decides when a task should run in Photon versus Spark SQL and, most importantly, how data serialization works across that boundary.
StarTree: Introduction to Geospatial Queries in Apache Pinot
One of the features I find most exciting about Pinot is its customizable per-dimension indexes, which enable interactive analytics in real time. StarTree shared how one can run geospatial analytics using Apache Pinot.
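To give a flavor of what such queries look like, here is a hypothetical geospatial query sent to Pinot's SQL endpoint from Python. The table, columns, and broker address are assumptions; ST_Point and ST_Distance are Pinot's geospatial functions:

```python
import requests

# Find orders delivered within ~5 km of a point; with the third ST_Point
# argument set to 1, the point is a geography and distances are in meters.
sql = """
SELECT order_id,
       ST_Distance(delivery_location, ST_Point(-122.41, 37.77, 1)) AS meters
FROM orders
WHERE ST_Distance(delivery_location, ST_Point(-122.41, 37.77, 1)) < 5000
LIMIT 10
"""
resp = requests.post("http://localhost:8099/query/sql", json={"sql": sql}, timeout=30)
print(resp.json())
```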
LinkedIn: Text analytics on LinkedIn Talent Insights using Apache Pinot
LinkedIn shared a similar use of Pinot, narrating how LinkedIn Talent Insights runs text analytics on top of it.
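The text side builds on Pinot's text index and its TEXT_MATCH predicate, which runs Lucene-style full-text search against an indexed column. A hypothetical example, with table and column names as assumptions:

```python
import requests

# Full-text search over a text-indexed column via the Pinot broker.
sql = """
SELECT candidate_id, headline
FROM talent_profiles
WHERE TEXT_MATCH(headline, '"machine learning" AND engineer')
LIMIT 10
"""
resp = requests.post("http://localhost:8099/query/sql", json={"sql": sql}, timeout=30)
print(resp.json())
```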
Myntra Engineering: Optimisation using Sparklens
Sparklens is a profiling and performance-prediction tool for Spark with a built-in Spark scheduler simulator. It helps identify the bottlenecks a Spark application faces and reports its critical path time. Myntra shared its experience using Sparklens to optimize its Apache Spark jobs.
https://medium.com/myntra-engineering/optimisation-using-sparklens-59477440bdd8
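Wiring Sparklens into a job is mostly configuration: attach its listener and it prints a report when the application finishes. A minimal PySpark sketch; the package version is an assumption, so check the Sparklens repo for current coordinates:

```python
from pyspark.sql import SparkSession

# Attach the Sparklens listener so it can profile the job and simulate the
# scheduler while the application runs.
spark = (
    SparkSession.builder
    .appName("sparklens-profiled-job")
    .config("spark.jars.packages", "qubole:sparklens:0.3.2-s_2.11")  # version is an assumption
    .config("spark.extraListeners", "com.qubole.sparklens.QuboleJobListener")
    .getOrCreate()
)

# ... the Spark job to be profiled ...
spark.range(1_000_000).selectExpr("sum(id)").show()
spark.stop()
```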
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of my current, former, or future employers.