Data Engineering Weekly #49

Weekly Data Engineering Newsletter

Jul 26, 2021

Welcome to the 49th edition of the data engineering newsletter. This week's release is a new set of articles that focus on Netflix's learnings from designing better ML systems, James Serra's take on centralized vs. decentralized ownership, Uber's containerization of Apache Hadoop, LinkedIn's journey from daily dashboards to enterprise-grade data pipelines, Alibaba Cloud's CDC analysis with Apache Flink & Apache Iceberg, RudderStack's take on why it's harder for engineers to support marketing, Uber's geospatial indexing adoption with Apache Pinot, Salesforce's data pipelines with Kotlin, Pinterest's near-real-time indexing for Apache HBase, and Grab's processing of ETL tasks with Ratchet.

Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.

Event Alert: Airbnb - The Journey Toward High-Quality Data

Airbnb hosts its first virtual tech talk focusing on data quality on Wednesday, July 28th, 12:00 PM-1:00 PM PST. Sign up here:

https://journeytowardhighqualitydata.splashthat.com/

Netflix: Designing Better ML Systems - Learnings from Netflix

Netflix shares its design principles for building its recommender ML infrastructure. The article unbundles the three core parts of the orchestration engine in Netflix's Metaflow:

The user DAG code defines what needs to execute
The job scheduler defines how the code will execute
The compute infrastructure defines where the code will execute
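
To make the what/how/where split concrete, here is a minimal Python sketch. It is not Metaflow's API; the step functions, USER_DAG, run_dag, and local_backend are all hypothetical names chosen for illustration, and it assumes Python 3.9+ for the standard-library graphlib module.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List, Tuple

# WHAT: the user DAG. Each step is plain code plus the names of its upstream steps.
def extract() -> List[int]:
    return [3, 1, 2]

def transform(rows: List[int]) -> List[int]:
    return sorted(rows)

def report(rows: List[int]) -> str:
    return ",".join(str(r) for r in rows)

# Hypothetical DAG description: step name -> (function, upstream step names).
USER_DAG: Dict[str, Tuple[Callable[..., Any], List[str]]] = {
    "extract": (extract, []),
    "transform": (transform, ["extract"]),
    "report": (report, ["transform"]),
}

# WHERE: a compute backend only knows how to run one callable somewhere.
# This one runs in the current process; a container or cluster backend
# could expose the same signature.
def local_backend(func: Callable[..., Any], args: List[Any]) -> Any:
    return func(*args)

# HOW: the scheduler decides execution order and hands each step to a backend.
def run_dag(dag, backend) -> Dict[str, Any]:
    dependencies = {name: deps for name, (_, deps) in dag.items()}
    order = TopologicalSorter(dependencies).static_order()
    results: Dict[str, Any] = {}
    for name in order:
        func, deps = dag[name]
        results[name] = backend(func, [results[d] for d in deps])
    return results

results = run_dag(USER_DAG, local_backend)
print(results["report"])
```

Because the step code never refers to the scheduler or the backend, either one can be swapped out without touching the user DAG, which is the point of the separation described above.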

https://www.infoq.com/presentations/designing-ml-systems-netflix/

James Serra: Data Mesh - Centralized ownership vs. decentralized ownership

Data as a product is a robust design idea introduced by the data mesh principles. Yet there is still some confusion around the feasibility of adopting those principles, mainly because of the lack of tooling.

The author raises some valid concerns & constraints on adopting the data mesh's decentralized ownership, and I tend to agree with a few of them. Are we collectively underestimating the complexity of data engineering, or is it an idea ahead of its time since the tooling is not ready? Nonetheless, it is great to see the Data Mesh principles pushing the boundaries of data engineering tooling.

https://www.jamesserra.com/archive/2021/07/data-mesh-centralized-ownership-vs-decentralized-ownership/

Uber: Containerizing Apache Hadoop Infrastructure at Uber

Uber writes about the instability it experienced running a mutable infrastructure and its move to an immutable, containerized Apache Hadoop infrastructure. The pre-fetching of Docker images to reduce bootstrap failures, the Kerberos integration, and the complexity analysis of adopting the internal service mesh vs. DNS solutions make for an informative read.

https://eng.uber.com/hadoop-container-blog/

LinkedIn: From daily dashboards to enterprise-grade data pipelines

The Daily Executive Dashboard (DED) contains critical growth, engagement, and success metrics that indicate the health of a company. LinkedIn writes an exciting blog that narrates its executive dashboard pipeline journey from incubation on MicroStrategy, to Teradata, to integration with LinkedIn's data infrastructure stack.

https://engineering.linkedin.com/blog/2021/from-daily-dashboards-to-enterprise-grade-data-pipelines

Alibaba Cloud: How to Analyze CDC Data in Iceberg Data Lake Using Flink

Real-time analytics on change data capture (CDC) events is critical for business operations. The blog narrates the historical approaches to analyzing CDC events with systems like HBase, Kudu, Hive incremental tables, and Spark Delta, and explains the reasoning for adopting the Apache Iceberg + Flink solution.

https://www.alibabacloud.com/blog/how-to-analyze-cdc-data-in-iceberg-data-lake-using-flink_597838

Uber: ‘Orders Near You’ and User-Facing Analytics on Real-Time Geospatial Data

Uber writes about the criticality of real-time geospatial analytics for its business and how Apache Pinot's geospatial indexing, based on Uber's H3 indexing system, helped solve some of the business cases for Uber Eats. The article narrates how Pinot's geospatial indexing support solved the scalability issue with the previous Cassandra-based solution, reducing 120 database calls to 1.

https://eng.uber.com/orders-near-you/

Salesforce: Building Data Pipelines Using Kotlin

Salesforce writes about its choice of Kotlin for building data pipelines. Null pointer safety, data classes that reduce boilerplate code, flexible branching expressions, and seamless integration with Java to utilize the Java library ecosystem are some of the exciting features of Kotlin as a data pipeline language.

https://engineering.salesforce.com/building-data-pipelines-using-kotlin-2d70edc0297c

Pinterest: Building scalable near-real time indexing on HBase

The lack of seamless secondary indexing support is one of the design constraints of adopting Apache HBase. Pinterest writes about Ixia, its internal generic search interface on top of HBase to provide near-real-time secondary indexing support.

https://medium.com/pinterest-engineering/building-scalable-near-real-time-indexing-on-hbase-7b5eeb411888

Grab: Processing ETL tasks with Ratchet

Grab writes about its lending platform's adoption of the Ratchet library for performing data pipeline & ETL tasks in Go. It's exciting to see a couple of articles sharing their experience building data pipelines in Kotlin & Go, diverging from the usual Python, Java, or Scala.

https://engineering.grab.com/processing-etl-tasks-with-ratchet

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.
