Welcome to the 26th edition of the data engineering newsletter. This week's release is a new set of articles that focus on Unusual Ventures' data lineage forum, Uber's metrics standardization journey, Adobe's Apache Iceberg usage, Databricks' talk on Lakehouse, Pinterest's realtime search engine, Intuit's take on the data lake, Microsoft's take on cost management, and Grab's realtime workflow engine.
Unusual Ventures:
Unusual Roundtable Takeaways: Data Lineage and its Role in Data Unification
Unusual Ventures writes an excellent summary of data lineage and how essential lineage is to democratizing data. I had a great time talking with the folks at Unusual Ventures about data lineage and am excited about its growth.
Uber:
The Journey Towards Metric Standardization
How many of us have been in a meeting where multiple versions of "unique users" were cited as a business metric? Uber wrote an exciting blog discussing the consequences of data democratization and the importance of metrics standardization. uMetric, Uber's internal unified metric platform that powers the full lifecycle of a metric from definition, discovery, planning, computation, and quality through to consumption, is an excellent case study in balancing standards and productivity.
Adobe:
Taking Query Optimizations to the Next Level with Iceberg
Data inconsistency, scalability issues with metadata stores, and inefficient data access where partition pruning gives only minimal optimization are some of the significant pain points Apache Iceberg was designed to fix. Adobe writes an exciting blog about optimizations for Iceberg, discussing vectorized reading, nested schema pruning and predicate pushdown, manifest tooling, and snapshot expiration.
https://medium.com/adobetech/taking-query-optimizations-to-the-next-level-with-iceberg-6c968b83cd6f
Databricks@CIDR DB 2021 Conf:
Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics
Databricks presented the emerging lakehouse platforms at the CIDR conference. The talk is an excellent summary of lakehouse platforms like Databricks Delta, Apache Iceberg, Apache Hudi, and Hive ACID. Data Engineering Weekly’s one prediction is:
The Lakehouse platforms will define the next generation data architecture.
It is great to see Adobe’s case study and Databricks’ CIDR talk on the same theme.
Paper: http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf
Pinterest:
Manas Realtime — Enabling changes to be searchable in a blink of an eye
The flush interval is one of the most significant pain points when using Lucene-based search engines. Balancing latency and throughput for real-time indexing requires constant tuning and significant benchmarking. Pinterest writes about Manas, its internal search infrastructure, and how it balances batch indexing and real-time indexing. It's great to see new designs emerging in search infrastructure, which is often underserved compared to OLTP engine development.
Intuit:
Is Your Data Lake More like a Used Book Store or a Public Library
Data catalogs and data discovery are the differentiators between a productive, efficient data infrastructure and a confused, inefficient one. Intuit offers an excellent analogy of used book stores and public libraries to demonstrate an efficient data infrastructure strategy built on data catalog systems.
Data Engineering Weekly used a similar analogy to demonstrate the importance of metadata management systems. It's great to see some exciting, successful case studies emerging.
Data Engineering Weekly’s discussion about Data Mesh
Data Mesh Simplified: A Reflection Of My Thoughts On Data Mesh
Microsoft:
Your analytics platform has gone rogue, Part 1: Unforeseen costs
Modern data processing engines can process petabytes of data in parallel, and cloud platforms make it easy to scale. That ease also makes infrastructure cost management a core engineering skill. Microsoft writes about uncontrolled costs and the strategies to account for them while architecting and managing data at scale.
Grab:
Trident - Real-time event processing at scale
Grab writes about Trident, its in-house real-time IFTTT engine, which processes events and operates business mechanisms on a massive scale. The blog is an exciting read on workload distribution, datastore scaling, and query optimization strategy.
https://engineering.grab.com/trident-real-time-event-processing-at-scale
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.