Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Google AI: The Data Cards Playbook: A Toolkit for Transparency in Dataset Documentation
Google published Data Cards, a dataset documentation framework aimed at increasing transparency across dataset lifecycles. Data Cards include the following:
Upstream sources
Data collection and annotation methods
Training and evaluation methods
Intended use of the dataset
Decisions affecting model performance
The Data Cards approach is fascinating, especially as machine learning (ML) research moves toward large-scale models capable of numerous downstream tasks. A shared understanding of a dataset’s origin, development, intent, and evolution becomes increasingly essential for responsible and informed development. A rough sketch of what such a card might capture is below.
https://ai.googleblog.com/2022/11/the-data-cards-playbook-toolkit-for.html
The short YouTube video gives a nice overview of the Data Cards.
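To make the framework concrete, here is a minimal sketch of what a data card covering the fields above might look like as plain data. The structure and field names are my own illustration; Google's actual Data Cards are design templates, not a code format.

```python
# Hypothetical, minimal data card; field names are illustrative only.
# Google's Data Cards Playbook defines its own (much richer) template.
data_card = {
    "dataset": "product-reviews-v2",
    "upstream_sources": ["public review dumps", "internal feedback forms"],
    "collection_and_annotation": {
        "collection": "scraped 2020-2022, deduplicated by review ID",
        "annotation": "sentiment labeled by 3 raters, majority vote",
    },
    "training_and_evaluation": "80/10/10 split, stratified by category",
    "intended_use": "sentiment classification; not for user profiling",
    "performance_decisions": "reviews under 5 tokens dropped, skewing toward longer text",
}
```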
Daniel Buschek: What makes user interfaces intelligent?
I found this article a bit late, but it is an exciting read for this week. We often think of AI/ML as a complex data processing problem, but it provides no value until it is exposed to an end user or an application. So what makes a user interface intelligent? The author walks through what makes a UI intelligent and what it does for users.
https://uxdesign.cc/what-makes-user-interfaces-intelligent-9f63b27ca39
Luke Lin: Types of data products
Data as a product is a trending phrase that has begun to see mainstream adoption, integrating with the organization's product strategy. But what are the types of data products? The author classifies data products as:
Data Platform as a product
Data Insight as a product
Data Activation as a product
https://pmdata.substack.com/p/types-of-data-products
Mikkel Dengsøe: The important purple people outside the data team
The data team's prime mission is to educate and empower data-driven product & business operations across an org, yet much of the most critical data work happens outside the data team. The author narrates a few practical tips for creating success with people outside the data team.
https://mikkeldengsoe.substack.com/p/purple-people-outside-data
Sponsored: Build SQL Pipelines. Not Endless DAGs!
With Upsolver SQLake, you build a pipeline for data in motion simply by writing a SQL query defining your transformation.
Streaming and batch unified in a single platform
No Airflow - orchestration inferred from the data
$99 / TB of data ingested | transformations free
DataScience @ Microsoft: Industrial Metaverse: A software and data perspective
Gartner predicts that 25% of people will spend at least one hour per day in the Metaverse by 2026. I’ve no idea how it will play out yet, but as data engineers, what does the Metaverse mean to us? The author narrates what an industrial metaverse is, its key components, and some of its practical applications.
Atlassian: Learn How to Prepare For New European Data Privacy Requirements
Data privacy and regulatory requirements rarely feature in a developer blog, so I was pleasantly surprised by this article from Atlassian. Kudos to the author and the Atlassian team. The blog narrates the European Commission’s updated version of the European Standard Contractual Clauses (EU SCCs) and how to prepare for the privacy laws.
Sponsored: It’s Time for the Headless CDP
In this piece, RudderStack CEO Soumyadeb Mitra makes the case for a new approach to the customer data platform: the headless CDP. He defines the headless CDP as a tool with an open architecture, purpose-built for data and engineering teams, that makes it easy to collect customer data from every source, build your customer 360 in your own warehouse, and then make that data available to your entire stack.
https://www.rudderstack.com/blog/it-s-time-for-the-headless-cdp/
DoorDash: Balancing Velocity and Confidence in Experimentation
In a data pipeline, there is always a conflict between correctness (trust) and speed (velocity). The trade-off plays an outsized role in a critical system like experimentation. The author narrates how to balance velocity and confidence in online experimentation.
https://doordash.engineering/2022/11/15/balancing-velocity-and-confidence-in-experimentation/
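The trade-off is easy to see in a textbook power calculation: the smaller the effect you want to detect with confidence, the more samples (and therefore runtime) an experiment needs. Below is a minimal sketch using the standard two-proportion formula; this is my own illustration, not DoorDash's methodology.

```python
from scipy.stats import norm

def samples_per_arm(p_base: float, mde: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate sample size per arm for a two-proportion z-test."""
    p_treat = p_base + mde
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)            # desired statistical power
    variance = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    return int(variance * (z_alpha + z_beta) ** 2 / mde ** 2) + 1

# Halving the detectable effect roughly quadruples the required traffic:
print(samples_per_arm(0.10, 0.02))  # detect a 2pp lift: ~3,800 users/arm
print(samples_per_arm(0.10, 0.01))  # detect a 1pp lift: ~14,700 users/arm
```

Waiting for four times the traffic is the cost of confidence; shipping on the smaller sample is the cost of velocity.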
Netflix: For your eyes only: improving Netflix video quality with neural networks
I'm delighted to see more Netflix engineering blogs coming out in recent days talking about the impact of AI/ML in media production. The blog narrates one such application that improves video quality with neural networks. I don't know if Netflix will be a thing in the next ten years, but the impact it will make on media production, combined with the advancement of AI/ML, will be significant.
Sponsored: [New eBook] The Ultimate Data Observability Platform Evaluation Guide
Are you considering investing in a data quality solution? Before you add another tool to your data stack, check out our latest guide for 10 things to consider when evaluating data observability platforms, including scalability, time to value, and ease of setup.
Access Your Free Copy for Data Engineering Weekly Readers
Myntra: Janus - Data processing framework at Myntra
Myntra writes about its data processing framework, Janus. The blog narrates the requirements and motivation behind Janus's design and the critical pieces of the architecture, such as data catalogs, pipeline modeling, pipeline deployment, and pipeline execution. I'm curious to understand the design in more detail, since it treats the data catalog as an integral part of pipeline design.
https://medium.com/myntra-engineering/janus-data-processing-framework-at-myntra-980ba8cb15a5
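I have no visibility into Janus internals, but here is a rough sketch of what "catalog as part of pipeline design" could look like: pipeline steps reference datasets by catalog identifier rather than physical path, so the framework can resolve location, schema, and ownership at deploy time. All names below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PipelineStep:
    # Inputs and output are catalog identifiers, not physical paths;
    # the framework resolves location and schema at deploy time.
    name: str
    inputs: list[str]
    output: str
    transform_sql: str

step = PipelineStep(
    name="daily_order_rollup",
    inputs=["catalog://orders/raw_events"],
    output="catalog://orders/daily_rollup",
    transform_sql="SELECT order_date, COUNT(*) AS orders "
                  "FROM input GROUP BY order_date",
)
```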
99.co: BigQuery’s schema auto-detection does not work perfectly as we want it to, so we built our own
Structured data from source systems can significantly reduce the complexity of data management; however, it is not uncommon to encounter source systems without such capabilities. The author narrates why Google BigQuery's schema auto-detection fails them, which led them to build a custom schema detection tool.
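For intuition, schema inference over semi-structured records is conceptually simple but full of edge cases (nulls, mixed types, type widening), which is usually where generic auto-detection falls short. Here is a toy sketch of the idea; it is illustrative only, not 99.co's implementation.

```python
# Toy schema inference over JSON-like records; illustrative only.
def infer_type(values):
    types = {type(v) for v in values if v is not None}
    if types <= {int}:
        return "INT64"
    if types <= {int, float}:
        return "FLOAT64"    # widen mixed int/float columns to float
    if types <= {bool}:
        return "BOOL"
    return "STRING"         # fall back to string for mixed/unknown types

def infer_schema(records):
    fields = {}
    for record in records:
        for key, value in record.items():
            fields.setdefault(key, []).append(value)
    return {key: infer_type(values) for key, values in fields.items()}

rows = [{"id": 1, "price": 9.99, "active": True},
        {"id": 2, "price": 10, "active": None}]
print(infer_schema(rows))  # {'id': 'INT64', 'price': 'FLOAT64', 'active': 'BOOL'}
```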
GumGum: Replacing Apache Druid with Snowflake Snowpipe
Many real-time pipelines are micro-batch pipelines. I often tell my team that real-time and batch systems are data processing pipelines with different window functions. I found the blog exciting, as it is the first I've seen that uses Snowflake for a near-real-time pipeline replacing an OLAP system like Apache Druid.
https://medium.com/gumgum-tech/replacing-apache-druid-with-snowflake-snowpipe-74c8d7c9b9c3
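That framing is easy to demonstrate: in a tumbling-window aggregation, the window size is the only knob separating a near-real-time micro-batch from a daily batch. A minimal sketch of my own below, not GumGum's pipeline.

```python
from collections import Counter
from datetime import datetime, timezone

def window_counts(events, window_seconds):
    """Count events per tumbling window; window size is the only knob."""
    counts = Counter()
    for ts in events:
        bucket = int(ts.timestamp()) // window_seconds * window_seconds
        counts[datetime.fromtimestamp(bucket, tz=timezone.utc)] += 1
    return counts

events = [datetime(2022, 11, 20, 12, m, tzinfo=timezone.utc)
          for m in range(0, 50, 7)]
print(window_counts(events, 60))      # "near-real-time": 1-minute micro-batches
print(window_counts(events, 86_400))  # "batch": same pipeline, daily window
```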
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.