Data Engineering Weekly #34

Weekly Data Engineering Newsletter

Welcome to the 34th edition of the data engineering newsletter. This week's release is a new set of articles that focus on Google’s massively parallel graph computation, Uber’s data journey, Hyperight’s take on whether data mesh is right for your organization, Lyft’s ML feature infrastructure, Flyte joining LF AI & Data, PayPal’s secure data movement, Samsara’s data pipelines, the Gousto data team’s best of 2020, Cloudflare’s anomaly detection, Instacart’s take on large-scale labeling, the Dagster 0.11 release notes, and why Kafka is fast.


Google: Massively Parallel Graph Computation - From Theory to Practice

Graph computation is widely used for various data science purposes, from ranking web pages by popularity to mapping out social networks. Google AI discusses MapReduce's limitations in graph processing and introduces the Adaptive Massively Parallel Computation (AMPC) model, which uses a distributed hash table.

https://ai.googleblog.com/2021/03/massively-parallel-graph-computation.html

Papers:

Parallel graph algorithms in constant adaptive rounds: theory meets practice

Massively Parallel Computation via Remote Memory Access

Unconditional Lower Bounds for Adaptive Massively Parallel Computation


Uber: Uber’s Journey Toward Better Data Culture From First Principles

Uber writes an exciting blog on the challenges of operating a data platform at scale. Self-service analytics is a north-star goal for many businesses. However, it also brings multiple challenges such as data duplication, data discovery issues, disconnected tooling, logging inconsistency, lack of process, and lack of SLAs and ownership.

The blog narrates how Uber is solving the problem by adopting the following fundamental data platform principles:

  1. Data as Code

  2. Data is Owned

  3. Data Quality is Known for each dataset

  4. Accelerate data productivity with data tools optimized for collaboration.

  5. Organize the data with local data ownership

https://eng.uber.com/ubers-journey-toward-better-data-culture-from-first-principles/


Hyperight: Is Data Mesh right for your organization?

Does Data Mesh make sense for all types of organizations? The article captures collective thoughts on data mesh principles, when to apply them, and the future outlook of data mesh and DataOps.

https://read.hyperight.com/is-data-mesh-right-for-your-organisation/


Lyft: ML Feature Serving Infrastructure at Lyft

A vital requirement for ML models is that computed features be available both via batch queries for model training and via low-latency reads for online inference. Lyft writes about its feature service, which consists of feature definition, feature ingestion & processing, and retrieval.

https://eng.lyft.com/ml-feature-serving-infrastructure-at-lyft-d30bf2d3c32a


Lyft: Flyte Joins LF AI & Data

Continuing with Lyft’s ML infrastructure: Flyte, the core platform for orchestrating machine learning jobs, joins the LF AI & Data Foundation, part of the Linux Foundation.

https://eng.lyft.com/flyte-joins-lf-ai-data-48c9b4b60eec


PayPal: How PayPal moves secure and encrypted data across security zones

PayPal writes an exciting article on the challenges of secure data movement across data centers. The article narrates how it uses Apache Gobblin, Kerberos, and KMS to handle secure transfer, encryption at rest, and the prevention of unauthorized & unauthenticated access.

https://medium.com/paypal-tech/how-paypal-moves-secure-and-encrypted-data-across-security-zones-10010c1788ce


Samsara: Data Pipelines @ Samsara

Samsara writes about its data pipeline infrastructure, built with a data transformation DSL and AWS Step Functions. A complicated challenge arises when a pipeline's dependencies are defined on tasks rather than on the data model: resolving duplicated work then requires significant engineering effort. Samsara describes how it handles task dependencies and task deduplication, using DynamoDB to store the data transformation metadata.

https://medium.com/samsara-engineering/data-pipelines-samsara-64596dbc2137
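To make the deduplication idea in the Samsara item concrete, here is a minimal sketch in Python. It is not Samsara's implementation: the task format, the `task_key` fingerprint, and the plain dict standing in for a DynamoDB metadata table are all assumptions made for illustration. It assumes every input named by a task is itself a task in the pipeline.

```python
import hashlib
import json
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def task_key(name, sql, inputs):
    """Stable fingerprint of a transformation: same definition, same key."""
    payload = json.dumps(
        {"name": name, "sql": sql, "inputs": sorted(inputs)}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_pipeline(tasks, metadata_store):
    """Run tasks in dependency order, skipping those already recorded.

    tasks: dict of task name -> {"sql": str, "inputs": [upstream task names]}
    metadata_store: dict standing in for a key-value metadata table
                    (hypothetical; a real system might use DynamoDB)
    Returns the names of the tasks executed in this run, in order.
    """
    graph = {name: set(spec["inputs"]) for name, spec in tasks.items()}
    executed = []
    for name in TopologicalSorter(graph).static_order():
        spec = tasks[name]
        key = task_key(name, spec["sql"], spec["inputs"])
        if key in metadata_store:
            continue  # an identical transformation was already run
        # A real pipeline would launch the transformation here.
        metadata_store[key] = {"task": name, "status": "done"}
        executed.append(name)
    return executed
```

Running the same pipeline twice against the same store executes each task only once. Note that this sketch does not re-run downstream tasks when only an upstream definition changes; a fuller version would fold upstream keys into each fingerprint.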


Gousto: Gousto Data Team — Best of 2020

Gousto writes an excellent summary highlighting some of the data team’s 2020 projects, design choices, and decision factors. I wish every team published a yearly summary like this as a guide.

https://medium.com/gousto-engineering-techbrunch/gousto-data-team-best-of-2020-8a731837ace2


Cloudflare: Lessons Learned from Scaling Up Cloudflare’s Anomaly Detection Platform

Cloudflare writes about anomaly detection for bot management using Redis, Kafka, and ClickHouse. The blog narrates the overall architecture, the adoption of microservices, and the Redis performance tuning.

https://blog.cloudflare.com/lessons-learned-from-scaling-up-cloudflare-anomaly-detection-platform/


Instacart: 7 steps to get started with large-scale labeling

Datasets often require human labeling for annotation. Crowdsourcing has emerged as one of the possible ways to collect labels at scale. Instacart writes a “Pre-flight Checklist” of steps for implementing large-scale crowdsourcing tasks.

https://tech.instacart.com/7-steps-to-get-started-with-large-scale-labeling-1a1eb2bf8141


Dagster: Dagster 0.11.0 Lucky Star Version Release

Dagster released version 0.11.0, codenamed “Lucky Star,” with MySQL backend support, better backfill management, and experimental support for data lineage.

https://github.com/dagster-io/dagster/releases/tag/0.11.0


Emil Koutanov: Why Kafka Is so Fast

The author narrates some of Kafka's foundational design principles and explains why it has become the central nervous system of data processing and management.

https://medium.com/swlh/why-kafka-is-so-fast-bde0d987cd03


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.