Data Engineering Weekly

Data Engineering Weekly #108

The Weekly Data Engineering Newsletter

Ananth Packkildurai
Nov 21, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.


Google AI: The Data Cards Playbook: A Toolkit for Transparency in Dataset Documentation

Google published Data Cards, a dataset documentation framework aimed at increasing transparency across dataset lifecycles. Data Cards include the following:

  1. Upstream sources

  2. Data collection and annotation methods

  3. Training and evaluation methods

  4. Intended use of the dataset

  5. Decisions affecting model performance

The Data Cards approach is fascinating, especially as machine learning (ML) research moves toward large-scale models capable of numerous downstream tasks. A shared understanding of a dataset’s origin, development, intent, and evolution becomes increasingly essential for responsible and informed development. A rough sketch of what such a card might capture follows the links below.

https://ai.googleblog.com/2022/11/the-data-cards-playbook-toolkit-for.html

The short YouTube video gives a nice overview of the Data Cards.
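Purely as an illustration of the kind of information a Data Card records, the five areas listed above could be captured in a simple structured object. The field names and values below are hypothetical and do not follow Google’s official template.

    from dataclasses import dataclass, field
    from typing import List

    # Hypothetical, simplified representation of a Data Card; field names map to
    # the five areas listed above and are not Google's official schema.
    @dataclass
    class DataCard:
        dataset_name: str
        upstream_sources: List[str]
        collection_and_annotation: str
        training_and_evaluation: str
        intended_use: str
        performance_affecting_decisions: List[str] = field(default_factory=list)

    card = DataCard(
        dataset_name="example-qa-corpus",  # hypothetical dataset
        upstream_sources=["public web crawl", "internal support tickets"],
        collection_and_annotation="Sampled during 2021-2022; each item labeled by three trained raters.",
        training_and_evaluation="Used for fine-tuning; 90/10 train/eval split by question ID.",
        intended_use="Question-answering research; not intended for production ranking.",
        performance_affecting_decisions=[
            "Near-duplicate questions removed",
            "Items with rater disagreement dropped",
        ],
    )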


Daniel Buschek: What makes user interfaces intelligent?

I found this article a bit late, but it is an exciting read for this week. We often think of AI/ML as a complex data processing problem, but it is of little use until it is exposed to an end user or an application. So what makes a user interface intelligent? The author walks through what makes a UI intelligent and what it does for users.

https://uxdesign.cc/what-makes-user-interfaces-intelligent-9f63b27ca39


Luke Lin: Types of data products

Data as a product is a trending phrase, and it has begun to see mainstream adoption as part of organizations' product strategies. But what types of data products are there? The author classifies them as follows:

  1. Data Platform as a product

  2. Data Insight as a product

  3. Data Activation as a product

https://pmdata.substack.com/p/types-of-data-products


Mikkel Dengsøe: The important purple people outside the data team

The data team's prime mission is to educate and empower data-driven product and business operations across an organization, which means much of the most critical data work happens outside the data team. The author shares a few practical tips for creating success with the people outside the data team.

https://mikkeldengsoe.substack.com/p/purple-people-outside-data


Sponsored: Build SQL Pipelines. Not Endless DAGs!

With Upsolver SQLake, you build a pipeline for data in motion simply by writing a SQL query defining your transformation.

  • Streaming and batch unified in a single platform

  • No Airflow - orchestration inferred from the data

  • $99 / TB of data ingested | transformations free

Start Your 30 Day Trial


DataScience @ Microsoft: Industrial Metaverse: A software and data perspective

Gartner predicts that 25% of people will spend at least one hour per day in the Metaverse by 2026. I have no idea how that will play out yet, but as data engineers, what does the Metaverse mean to us? The author explains what an industrial metaverse is, its key components, and some of its practical applications.

https://medium.com/data-science-at-microsoft/industrial-metaverse-a-software-and-data-perspective-d09950a453f6


Atlassian: Learn How to Prepare For New European Data Privacy Requirements

Data privacy and regulatory requirements rarely feature in a developer blog, so I was pleasantly surprised by this article from Atlassian. Kudos to the author and the Atlassian team. The blog walks through the European Commission’s updated Standard Contractual Clauses (EU SCCs) and how to prepare for the new privacy requirements.

https://blog.developer.atlassian.com/learn-how-to-prepare-for-new-european-data-privacy-requirements/


Sponsored: It’s Time for the Headless CDP

In this piece, RudderStack CEO Soumyadeb Mitra makes the case for a new approach to the customer data platform: the headless CDP. He defines the headless CDP as a tool with an open architecture, purpose-built for data and engineering teams, that makes it easy to collect customer data from every source, build your customer 360 in your own warehouse, and then make that data available to your entire stack.

https://www.rudderstack.com/blog/it-s-time-for-the-headless-cdp/


DoorDash: Balancing Velocity and Confidence in Experimentation

In a data pipeline, there is always a tension between correctness (trust) and speed (velocity). The trade-off plays an especially important role in a system like experimentation. The author describes how to balance velocity and confidence in online experimentation.

https://doordash.engineering/2022/11/15/balancing-velocity-and-confidence-in-experimentation/
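To make the trade-off concrete, here is a back-of-the-envelope sample-size calculation using the standard two-sample formula. It is not DoorDash's methodology, just a generic illustration of how tighter confidence requirements translate into longer-running experiments.

    from statistics import NormalDist

    def samples_per_variant(mde: float, sigma: float, alpha: float = 0.05, power: float = 0.8) -> int:
        """Approximate sample size per variant for a two-sided test on a mean.

        mde   -- minimum detectable effect (absolute difference in means)
        sigma -- standard deviation of the metric
        alpha -- significance level (confidence = 1 - alpha)
        power -- probability of detecting an effect of size mde if it exists
        """
        z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
        z_beta = NormalDist().inv_cdf(power)
        return int(2 * ((z_alpha + z_beta) ** 2) * (sigma ** 2) / (mde ** 2)) + 1

    # Relaxing confidence (higher alpha) or accepting a larger detectable effect
    # shrinks the required sample size, i.e., buys velocity at the cost of trust.
    print(samples_per_variant(mde=0.5, sigma=5.0))               # alpha=0.05 baseline
    print(samples_per_variant(mde=0.5, sigma=5.0, alpha=0.10))   # looser confidence, fewer samples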


Netflix: For your eyes only: improving Netflix video quality with neural networks

I'm delighted to see more Netflix engineering blogs in recent days talking about the impact of AI/ML on media production. This one describes an application that improves video quality with neural networks. I don't know whether Netflix will still be a thing in ten years, but the impact it will make on media production, combined with advances in AI/ML, will be significant.

https://netflixtechblog.com/for-your-eyes-only-improving-netflix-video-quality-with-neural-networks-5b8d032da09c


Sponsored: [New eBook] The Ultimate Data Observability Platform Evaluation Guide

Are you considering investing in a data quality solution? Before you add another tool to your data stack, check out our latest guide for 10 things to consider when evaluating data observability platforms, including scalability, time to value, and ease of setup.

Access Your Free Copy for Data Engineering Weekly Readers


Myntra: Janus - Data processing framework at Myntra

Myntra writes about its data processing framework, Janus. The blog covers the requirements and motivation behind Janus and the critical pieces of its architecture, such as the data catalog, pipeline modeling, pipeline deployment, and pipeline execution. I'm curious to understand the design in more detail, since it makes the data catalog an integral part of pipeline design.

https://medium.com/myntra-engineering/janus-data-processing-framework-at-myntra-980ba8cb15a5


99.co: BigQuery’s schema auto-detection does not work perfectly as we want it to, so we build our own

Structured data from source systems can significantly reduce the complexity of data management; however, it is not uncommon to encounter source systems without such guarantees. The author explains why Google BigQuery's schema auto-detection fails them, which led them to build a custom schema detection tool.

https://medium.com/99dotco/bigquerys-schema-auto-detection-does-not-work-perfectly-like-we-want-it-to-so-we-build-our-own-93a5f1a1f67
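Rolling your own detection can start as simply as scanning a sample of records and mapping JSON value types onto BigQuery column types. The sketch below is a hypothetical, heavily simplified version of that idea and is not the 99.co implementation; it ignores nested RECORD fields, arrays, and type promotion across rows.

    from typing import Any, Dict, List

    # Hypothetical mapping from Python/JSON value types to BigQuery column types.
    _TYPE_MAP = {bool: "BOOL", int: "INT64", float: "FLOAT64", str: "STRING"}

    def infer_bigquery_schema(rows: List[Dict[str, Any]]) -> List[Dict[str, str]]:
        """Infer a flat BigQuery schema from a sample of JSON-like records.

        A field present (non-null) in every row is REQUIRED, otherwise NULLABLE;
        conflicting or unknown types fall back to STRING.
        """
        fields: Dict[str, Dict[str, Any]] = {}
        for row in rows:
            for name, value in row.items():
                entry = fields.setdefault(name, {"types": set(), "seen": 0})
                if value is not None:
                    entry["seen"] += 1
                    entry["types"].add(_TYPE_MAP.get(type(value), "STRING"))
        schema = []
        for name, entry in fields.items():
            bq_type = entry["types"].pop() if len(entry["types"]) == 1 else "STRING"
            mode = "REQUIRED" if entry["seen"] == len(rows) else "NULLABLE"
            schema.append({"name": name, "type": bq_type, "mode": mode})
        return schema

    sample = [
        {"listing_id": 1, "price": 2500.0, "district": "D09"},
        {"listing_id": 2, "price": 1800.0},
    ]
    print(infer_bigquery_schema(sample))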


GumGum: Replacing Apache Druid with Snowflake Snowpipe

Many real-time pipelines are, in practice, micro-batch pipelines. I often tell my team that real-time and batch systems are data processing pipelines with different window functions. I found this blog exciting because it is the first one I've seen that uses Snowflake for a near-real-time pipeline replacing an OLAP system like Apache Druid.

https://medium.com/gumgum-tech/replacing-apache-druid-with-snowflake-snowpipe-74c8d7c9b9c3
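As a toy illustration of the "different window functions" framing (not GumGum's Snowpipe design), the same aggregation can serve as a daily batch job or a near-real-time micro-batch job simply by shrinking the window:

    from collections import defaultdict
    from datetime import datetime, timedelta
    from typing import Dict, Iterable, Tuple

    Event = Tuple[datetime, int]  # (event_time, value) -- stand-in for a real record

    def tumbling_window_sums(events: Iterable[Event], window: timedelta) -> Dict[float, int]:
        """Sum event values into fixed (tumbling) windows keyed by window start (epoch seconds)."""
        sums: Dict[float, int] = defaultdict(int)
        win_s = window.total_seconds()
        for ts, value in events:
            window_start = (ts.timestamp() // win_s) * win_s
            sums[window_start] += value
        return dict(sums)

    events = [
        (datetime(2022, 11, 15, 0, 0, 30), 1),
        (datetime(2022, 11, 15, 0, 1, 10), 1),
        (datetime(2022, 11, 15, 6, 0, 0), 1),
    ]

    # "Batch" is one big window; "near-real-time" is the same logic over small windows.
    print(tumbling_window_sums(events, timedelta(days=1)))
    print(tumbling_window_sums(events, timedelta(minutes=1)))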


All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
