Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Google AI: The Data Cards Playbook: A Toolkit for Transparency in Dataset Documentation
Google published Data Cards, a dataset documentation framework aimed at increasing transparency across dataset lifecycles. Data Cards include the following:
Upstream sources
Data collection and annotation methods
Training and evaluation methods
Intended use of the dataset
Decisions affecting model performance
The Data Cards approach is fascinating, especially as machine learning (ML) research moves toward large-scale models capable of numerous downstream tasks. A shared understanding of a dataset’s origin, development, intent, and evolution becomes increasingly essential for responsible and informed development. A rough sketch of what such a card might capture is below.
https://ai.googleblog.com/2022/11/the-data-cards-playbook-toolkit-for.html
The short YouTube video gives a nice overview of the Data Cards.
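To make the framework concrete, here is a minimal sketch of what a data card covering the fields above might look like as plain data. The structure and field names are my own illustration; Google's actual Data Cards are design templates, not a code format.

```python
# Hypothetical, minimal data card; field names are illustrative only.
# Google's Data Cards Playbook defines its own (much richer) template.
data_card = {
    "dataset": "product-reviews-v2",
    "upstream_sources": ["public review dumps", "internal feedback forms"],
    "collection_and_annotation": {
        "collection": "scraped 2020-2022, deduplicated by review ID",
        "annotation": "sentiment labeled by 3 raters, majority vote",
    },
    "training_and_evaluation": "80/10/10 split, stratified by category",
    "intended_use": "sentiment classification; not for user profiling",
    "performance_decisions": "reviews under 5 tokens dropped, skewing toward longer text",
}
```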
Daniel Buschek: What makes user interfaces intelligent?
I found this article a bit late, but it is an exciting read for this week. We often think of AI/ML as a complex data processing problem, but it provides no value until it is exposed to an end user or an application. So what makes a user interface intelligent? The author walks through what makes a UI intelligent and what it does for users.
https://uxdesign.cc/what-makes-user-interfaces-intelligent-9f63b27ca39
Luke Lin: Types of data products
Data as a product is a trending phrase that has begun to see mainstream adoption, integrating with the organization's product strategy. But what are the types of data products? The author classifies data products as:
Data Platform as a product
Data Insight as a product
Data Activation as a product
https://pmdata.substack.com/p/types-of-data-products
Mikkel Dengsøe: The important purple people outside the data team
The data team's prime mission is to educate and empower data-driven product & business operations across an org, yet much of the most critical data work happens outside the data team. The author narrates a few practical tips for creating success with people outside the data team.
https://mikkeldengsoe.substack.com/p/purple-people-outside-data
Sponsored: Build SQL Pipelines. Not Endless DAGs!
With Upsolver SQLake, you build a pipeline for data in motion simply by writing a SQL query defining your transformation.
Streaming and batch unified in a single platform
No Airflow - orchestration inferred from the data
$99 / TB of data ingested | transformations free
DataScience @ Microsoft: Industrial Metaverse: A software and data perspective
Gartner predicts that 25% of people will spend at least one hour per day in the Metaverse by 2026. I’ve no idea how it will play out yet, but as data engineers, what does the Metaverse mean to us? The author narrates what an industrial metaverse is, its key components, and some of its practical applications.
Atlassian: Learn How to Prepare For New European Data Privacy Requirements
Data privacy and regulatory requirements rarely feature in a developer blog, so I was pleasantly surprised by this article from Atlassian. Kudos to the author and the Atlassian team. The blog narrates the European Commission’s updated version of the European Standard Contractual Clauses (EU SCCs) and how to prepare for the privacy laws.
Sponsored: It’s Time for the Headless CDP
In this piece, RudderStack CEO Soumyadeb Mitra makes the case for a new approach to the customer data platform: the headless CDP. He defines the headless CDP as a tool with an open architecture, purpose-built for data and engineering teams, that makes it easy to collect customer data from every source, build your customer 360 in your own warehouse, and then make that data available to your entire stack.
https://www.rudderstack.com/blog/it-s-time-for-the-headless-cdp/
DoorDash: Balancing Velocity and Confidence in Experimentation
In a data pipeline, there is always a conflict between correctness (trust) and speed (velocity). The trade-off plays an outsized role in a critical system like experimentation. The author narrates how to balance velocity and confidence in online experimentation.
https://doordash.engineering/2022/11/15/balancing-velocity-and-confidence-in-experimentation/
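The trade-off is easy to see in a textbook power calculation: the smaller the effect you want to detect with confidence, the more samples (and therefore runtime) an experiment needs. Below is a minimal sketch using the standard two-proportion formula; this is my own illustration, not DoorDash's methodology.

```python
from scipy.stats import norm

def samples_per_arm(p_base: float, mde: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate sample size per arm for a two-proportion z-test."""
    p_treat = p_base + mde
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)            # desired statistical power
    variance = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    return int(variance * (z_alpha + z_beta) ** 2 / mde ** 2) + 1

# Halving the detectable effect roughly quadruples the required traffic:
print(samples_per_arm(0.10, 0.02))  # detect a 2pp lift: ~3,800 users/arm
print(samples_per_arm(0.10, 0.01))  # detect a 1pp lift: ~14,700 users/arm
```

Waiting for four times the traffic is the cost of confidence; shipping on the smaller sample is the cost of velocity.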
Netflix: For your eyes only: improving Netflix video quality with neural networks
I'm delighted to see more Netflix engineering blogs coming out in recent days talking about the impact of AI/ML in media production. The blog narrates one such application that improves video quality with neural networks. I don't know if Netflix will be a thing in the next ten years, but the impact it will make on media production, combined with the advancement of AI/ML, will be significant.
Sponsored: [New eBook] The Ultimate Data Observability Platform Evaluation Guide
Are you considering investing in a data quality solution? Before you add another tool to your data stack, check out our latest guide for 10 things to consider when evaluating data observability platforms, including scalability, time to value, and ease of setup.
Access Your Free Copy for Data Engineering Weekly Readers
Myntra: Janus - Data processing framework at Myntra
Myntra writes about its data processing framework, Janus. The blog narrates the requirements and motivation behind Janus's design and the critical pieces of the architecture, such as data catalogs, pipeline modeling, pipeline deployment, and pipeline execution. I'm curious to understand the design in more detail, since it treats the data catalog as an integral part of pipeline design.
https://medium.com/myntra-engineering/janus-data-processing-framework-at-myntra-980ba8cb15a5
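I have no visibility into Janus internals, but here is a rough sketch of what "catalog as part of pipeline design" could look like: pipeline steps reference datasets by catalog identifier rather than physical path, so the framework can resolve location, schema, and ownership at deploy time. All names below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PipelineStep:
    # Inputs and output are catalog identifiers, not physical paths;
    # the framework resolves location and schema at deploy time.
    name: str
    inputs: list[str]
    output: str
    transform_sql: str

step = PipelineStep(
    name="daily_order_rollup",
    inputs=["catalog://orders/raw_events"],
    output="catalog://orders/daily_rollup",
    transform_sql="SELECT order_date, COUNT(*) AS orders "
                  "FROM input GROUP BY order_date",
)
```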
99.co: BigQuery’s schema auto-detection does not work perfectly as we want it to, so we built our own
Structured data from source systems can significantly reduce the complexity of data management; however, it is not uncommon to encounter source systems without such capabilities. The author narrates why Google BigQuery's schema auto-detection fails them, which led them to build a custom schema detection tool.
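For intuition, schema inference over semi-structured records is conceptually simple but full of edge cases (nulls, mixed types, type widening), which is usually where generic auto-detection falls short. Here is a toy sketch of the idea; it is illustrative only, not 99.co's implementation.

```python
# Toy schema inference over JSON-like records; illustrative only.
def infer_type(values):
    types = {type(v) for v in values if v is not None}
    if types <= {int}:
        return "INT64"
    if types <= {int, float}:
        return "FLOAT64"    # widen mixed int/float columns to float
    if types <= {bool}:
        return "BOOL"
    return "STRING"         # fall back to string for mixed/unknown types

def infer_schema(records):
    fields = {}
    for record in records:
        for key, value in record.items():
            fields.setdefault(key, []).append(value)
    return {key: infer_type(values) for key, values in fields.items()}

rows = [{"id": 1, "price": 9.99, "active": True},
        {"id": 2, "price": 10, "active": None}]
print(infer_schema(rows))  # {'id': 'INT64', 'price': 'FLOAT64', 'active': 'BOOL'}
```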
GumGum: Replacing Apache Druid with Snowflake Snowpipe
Many real-time pipelines are micro-batch pipelines. I often tell my team that real-time and batch systems are data processing pipelines with different window functions. I found the blog exciting, as it is the first I've seen that uses Snowflake for a near-real-time pipeline replacing an OLAP system like Apache Druid.
https://medium.com/gumgum-tech/replacing-apache-druid-with-snowflake-snowpipe-74c8d7c9b9c3
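That framing is easy to demonstrate: in a tumbling-window aggregation, the window size is the only knob separating a near-real-time micro-batch from a daily batch. A minimal sketch of my own below, not GumGum's pipeline.

```python
from collections import Counter
from datetime import datetime, timezone

def window_counts(events, window_seconds):
    """Count events per tumbling window; window size is the only knob."""
    counts = Counter()
    for ts in events:
        bucket = int(ts.timestamp()) // window_seconds * window_seconds
        counts[datetime.fromtimestamp(bucket, tz=timezone.utc)] += 1
    return counts

events = [datetime(2022, 11, 20, 12, m, tzinfo=timezone.utc)
          for m in range(0, 50, 7)]
print(window_counts(events, 60))      # "near-real-time": 1-minute micro-batches
print(window_counts(events, 86_400))  # "batch": same pipeline, daily window
```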
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.