Data Engineering Weekly #82

The Weekly Data Engineering Newsletter

Apr 11, 2022

Apoorva Pandhi and Chad Sanderson: Modern Data Stack - Looking into the Crystal Ball

What is the next stage of disruption in the data ecosystem? The authors predict self-serve AI/ML infrastructure, "Data Contracts," decoupled data infrastructure in which the distinction between the data warehouse and the data lake becomes increasingly blurred, mainstream adoption of streaming, and the standardization of data practices.

I'm pretty bullish on "Data Contracts," which will significantly impact the data landscape. It's an exciting time to be in data engineering.

https://www.linkedin.com/pulse/modern-data-stack-looking-crystal-ball-apoorva-pandhi/

Transform: Introducing MetricFlow - Your powerful, open-source metrics framework

The promise of the metrics layer, a semantic abstraction between storage and compute that lets you define metrics once and reuse them across the data ecosystem, is groundbreaking. Transform took its first step toward that mission by open-sourcing MetricFlow, its metrics framework.

https://blog.transform.co/product-news/introducing-metricflow-your-powerful-open-source-metrics-framework/

GitHub: https://github.com/transform-data/metricflow

Singularity Data: RisingWave - A Cloud-Native Streaming Database

The barrier to entry for building and operating real-time infrastructure with high reliability is still high. This week, Singularity Data open-sourced RisingWave, a cloud-native streaming database, continuing the recent shower of open-source announcements.

https://singularity-data.com/blog/risingwave-A-Cloud-Native-Streaming-Database/

GitHub: https://github.com/singularity-data/risingwave

Shopify: The Magic of Merlin - Shopify's New Machine Learning Platform

Shopify writes about Merlin, its machine learning platform built on top of open-source technologies such as Kubernetes, Ray, Airflow, Oozie, and Jupyter Notebook. It's exciting to see the increasing adoption of Ray, which is something I'm looking to try this month.

https://shopifyengineering.myshopify.com/blogs/engineering/merlin-shopify-machine-learning-platform

DoorDash: Using Gamma Distribution to Improve Long-Tail Event Predictions

Long-tail events are always challenging to compute and predict. DoorDash writes about how long-tail predictions impact its delivery estimates and how it uses the gamma distribution to improve model performance.

https://doordash.engineering/2022/04/06/using-gamma-distribution-to-improve-long-tail-event-predictions/
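
As a rough illustration of why a gamma distribution suits long-tail targets, here is a minimal sketch. It is not DoorDash's implementation; the function name gamma_nll and the shape and scale values are assumptions chosen for the example. Because the gamma density is right-skewed, its negative log-likelihood treats a late outcome as more plausible than an early one at the same distance from the mode:

```python
import math

def gamma_nll(y, shape, scale):
    """Negative log-likelihood of one observation y > 0 under Gamma(shape, scale)."""
    if y <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("y, shape, and scale must be positive")
    return (math.lgamma(shape) + shape * math.log(scale)
            - (shape - 1) * math.log(y) + y / scale)

# Hypothetical delivery-time model (assumed values): shape=4, scale=7.5 minutes.
shape, scale = 4.0, 7.5
mode = (shape - 1) * scale  # most likely value under this model: 22.5 minutes

late = gamma_nll(mode + 15, shape, scale)   # observation at 37.5 minutes
early = gamma_nll(mode - 15, shape, scale)  # observation at 7.5 minutes
print(late < early)
```

A symmetric loss such as squared error would score both observations identically, which is one reason a skewed likelihood can fit long-tailed delivery times better.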

LinkedIn: The journey to building an explainable AI-driven recommendation system to help scale sales efficiency across LinkedIn

I'm thrilled to read the LinkedIn post about an explainable AI-driven recommendation system. The goal of the analytics function is to give a context-rich narration that empowers decision-making, not just metrics and dashboards. I have written about it in the past.

Ananth Packkildurai@ananthdurai

The business insights, in a way, are LinkedIn news feeds but delivered in a finite interval. Is it sufficient enough to build a habit of consuming insights in decision-makers? The number of zombie dashboards is clear evidence that the BI Dashboard model is not working. 3/4

11:48 PM · Jan 10, 2022

Whenever I raise the question, the answer is always that it is the job of the data catalog as a documentation solution. I wouldn't be surprised if the next generation of startups focuses on explainable, news feed-driven business analytics.

https://engineering.linkedin.com/blog/2022/the-journey-to-build-an-explainable-ai-driven-recommendation-sys

Dagster: Introducing Software-Defined Assets

Last month we went through multiple discussions of bundling and unbundling, and soon after, we started to see a few M&A deals in the data ecosystem as a sign of consolidation. However, the core of the problem is an efficient and programmatic approach to managing data assets and their lifecycle. Dagster writes about its approach to software-defined assets and how it accelerates data management efficiency.

https://dagster.io/blog/software-defined-assets

Nikhil Sachdeva: The role of a technical program manager in AI projects

A typical AI project is a cross-functional effort from incubation to operation in production. The author discusses the skills required to be an AI Technical Program Manager (TPM).

https://medium.com/data-science-at-microsoft/the-role-of-a-technical-program-manager-in-ai-projects-8f1ff41905b0

Peter Gao: Lessons From Deploying Deep Learning To Production

Some great practical tips on deploying deep learning applications in production: start with an iterative development model, leverage domain-specific feedback loops, use a human-in-the-loop review process, etc.

https://medium.com/aquarium-learning/lessons-from-deploying-deep-learning-to-production-9b7a3576881d

Ben Rogojan: Why Building Data Reliability Systems Is Hard

There is always a trade-off between correctness and speed in a data pipeline. The author explains why data quality is more than SQL queries and the trade-off between operational reliability and data quality.

https://medium.com/coriers/why-building-data-reliability-systems-is-hard-e0bf25c5ee36

Uber: Securing Kafka® Infrastructure at Uber

Uber writes about the essential components needed to enable security features on a Kafka cluster and how it secures its own infrastructure. The incremental rollout of the security features for 500+ Kafka topics and the performance tuning of the Kafka clusters make for exciting system design reading.

https://eng.uber.com/securing-kafka-infrastructure-at-uber/

Halodoc: Key Learnings on Using Apache HUDI in building Lakehouse Architecture @ Halodoc

Hudi is a rich platform to build streaming data lakes with incremental data pipelines on a self-managing database layer while being optimized for lake engines and regular batch processing. Halodoc writes an exciting blog sharing its experience in adopting and optimizing the Apache Hudi lakehouse infrastructure.

https://blogs.halodoc.io/key-learnings-on-using-apache-hudi-in-building-lakehouse-architecture-halodoc/

Redpanda: Evaluating Graviton 2 for data-intensive applications - an Arm vs. Intel comparison

Cost optimization is always top of mind for most data teams. Redpanda benchmarks Graviton 2 using Redpanda as the target application, with results that show Arm at about a 20% price/performance advantage over Intel instances for a high-throughput message ingestion scenario.

https://redpanda.com/blog/aws-graviton-2-arm-vs-x86-comparison/

Peng Wang: What Skills Do You Need to Become a Data Engineer

An outstanding data-driven look at the skill sets data engineers need: the author extracted 550 United States data engineer jobs from indeed.com and did some quick analyses using job description, location, and salary range.

https://www.linkedin.com/pulse/what-skills-do-you-need-become-data-engineer-peng-wang/

All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
