Data Engineering Weekly #76

Weekly Data Engineering Newsletter

Feb 28, 2022

Data Council - Austin 2022

Data Council has published the Austin 2022 schedule. Data Engineering Weekly readers can get a 20% discount with the promo code DataWeekly20.

https://www.datacouncil.ai/austin

Let’s start this week’s edition with this excellent thread on the challenges of collecting SaaS metrics and how to approach solving them.

Gwen (Chen) Shapira (@gwenshap), Feb 25, 2022:

SaaS Data - It Shouldn't be This Hard! @sriramsubram explains the challenges in collecting and using usage metrics from a SaaS product... and then he explains his idea for a data platform that will make it so easy.

Linked video (youtu.be): SaaS Data - It Shouldn’t be This Hard!

Bence Arató: Fundraising by data companies in 2021

2021 was an exciting year for data startups. The blog compiles the funding rounds that data startups raised during the year.

https://adat.blog/2022/02/fundraising-by-data-companies-in-2021/

John Cutler: The Data-Informed Product Cycle

What does a data-informed product loop look like? The author walks through its lifecycle:

Have a strategy
Translate that into models
Add minimally viable measurement
Identify leverage points
Explore options
Run experiments

https://cutlefish.substack.com/p/tbm-852-the-data-informed-product

Vortexa: Choosing an Analytics Tool. Metabase Vs. Superset Vs. Redash

Vortexa writes about how it selected an analytics tool and why it chose Metabase.

https://medium.com/vortechsa/choosing-an-analytics-tool-metabase-vs-superset-vs-redash-afd88e028ba9

Sarah Krasnik: Choosing a Data Quality Tool

The blog is an excellent overview of the data quality tool landscape. It is a good reference article if you're in the process of choosing a data quality tool.

https://sarahsnewsletter.substack.com/p/choosing-a-data-quality-tool

Vimeo: Monitoring data quality at scale using Monte Carlo

Staying on the data quality story, Vimeo writes about monitoring data quality with Monte Carlo. The CI/CD flow for the data quality checks is an exciting read; I'm curious to read more about the feedback loop for error correction.

https://medium.com/vimeo-engineering-blog/monitoring-data-quality-at-scale-using-monte-carlo-934577e45ab0

Adam Marcus: Data diffs: Algorithms for explaining what changed in a dataset

"What changed, and by how much?" is the first question we ask when looking at data. The blog walks through explanation algorithms for finding the differences between two datasets. The author implemented an open-source version of the DIFF paper and explains how it can help solve the problem.

Diff Paper: http://www.bailis.org/papers/diff-vldb2019.pdf

https://blog.marcua.net/2022/02/20/data-diffs-algorithms-for-explaining-what-changed-in-a-dataset.html
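
To make the idea of "explaining" a diff concrete, here is a minimal sketch in plain Python. It is not the DIFF operator from the paper or the author's implementation; it is a simplified stand-in that only looks at single column=value pairs and ranks them by how much more common they are in the newer dataset. The function name `explain_diff`, the `min_support` threshold, and the sample rows are all hypothetical.

```python
from collections import Counter


def explain_diff(before, after, columns, min_support=0.05):
    """Rank (column, value) pairs by how over-represented they are in `after`.

    Simplified illustration only: real explanation algorithms also consider
    combinations of columns and use more careful statistics.
    """
    if not before or not after:
        raise ValueError("both datasets must contain at least one row")

    def shares(rows):
        # Fraction of rows in which each (column, value) pair appears.
        counts = Counter((col, row[col]) for row in rows for col in columns)
        return {pair: n / len(rows) for pair, n in counts.items()}

    before_shares = shares(before)
    after_shares = shares(after)

    explanations = []
    for pair, after_share in after_shares.items():
        if after_share < min_support:
            continue  # too rare in `after` to be a useful explanation
        before_share = before_shares.get(pair, 0.0)
        # A pair unseen in `before` gets an infinite ratio.
        ratio = after_share / before_share if before_share else float("inf")
        explanations.append((pair, before_share, after_share, ratio))

    # Largest relative increase first.
    return sorted(explanations, key=lambda e: e[3], reverse=True)


# Hypothetical sample data: signups before and after a change.
before = [
    {"country": "US", "plan": "free"},
    {"country": "US", "plan": "pro"},
    {"country": "DE", "plan": "free"},
    {"country": "US", "plan": "free"},
]
after = [
    {"country": "DE", "plan": "free"},
    {"country": "DE", "plan": "free"},
    {"country": "DE", "plan": "pro"},
    {"country": "US", "plan": "free"},
]

top = explain_diff(before, after, columns=["country", "plan"])[0]
```

In this toy data the plan mix is unchanged while the share of rows from DE triples, so the country=DE pair ranks first.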

Netflix: Robust Foundation for Data Pipeline at Scale - Lessons learned from Netflix

The QCon talk on Netflix's data pipeline is now available. The workflow support for both event-driven and scheduled triggers is an exciting approach for an orchestration engine.

Stephen Bailey: Kicking the tires on dbt Metrics

The metrics layer is an exciting development, and I'm curious to see how it progresses. The author shares the experience of trying the dbt metrics layer, with one concern: there is a lot of YAML!

https://stkbailey.substack.com/p/kicking-the-tires-on-dbt-metrics

Dream11: Data Feast - A Highly Scalable Feature Store at Dream11

Dream11 writes about its feature store, Data Feast. The choice of HBase for the feature store is interesting. TIL about RonDB, and I'm looking forward to reading more about it.

https://blog.dream11engineering.com/data-feast-a-highly-scalable-feature-store-at-dream11-69b8ed5289fd

PyTorch: Introducing TorchRec, a library for modern production recommendation systems

PyTorch announced TorchRec, a PyTorch domain library for recommendation systems. The TorchRec library provides common sparsity and parallelism primitives, enabling researchers to build state-of-the-art personalization models and deploy them in production.

Github: https://github.com/pytorch/torchrec

https://pytorch.org/blog/introducing-torchrec/

Outerbounds: Notebooks In Production With Metaflow

Metaflow, an orchestration engine for ML pipelines, popularized deploying notebooks in production. Outerbounds introduces Notebook Cards, which allow data scientists to use notebooks to visualize and debug production workflows and help bridge the MLOps divide between prototype and production.

https://outerbounds.com/blog/notebooks-in-production-with-metaflow/

Expedia: Practical Schema Evolution with Avro

The article is an excellent compilation of practical advice on Avro schema evolution. It is a comprehensive guide that helps users manage schema changes more easily.

https://medium.com/expedia-group-tech/practical-schema-evolution-with-avro-c07af8ba1725
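
As a small illustration of the kind of rule schema evolution depends on, the Avro specification says that when a reader's record schema has a field the writer's schema lacks, the reader's field must declare a default, or resolution fails. The sketch below checks only that one rule in plain Python, without any Avro library. It ignores aliases, type promotion, and nested records, and the helper name `missing_defaults` and the `Booking` schemas are hypothetical, not taken from the article.

```python
def missing_defaults(reader_schema, writer_schema):
    """Return names of reader fields that the writer lacks and that have no default.

    An empty list means old data passes this one check; it does not prove
    full compatibility (aliases and type changes are not examined).
    """
    writer_fields = {field["name"] for field in writer_schema["fields"]}
    return [
        field["name"]
        for field in reader_schema["fields"]
        if field["name"] not in writer_fields and "default" not in field
    ]


# Hypothetical schemas, written as Python dicts in Avro's JSON shape.
booking_v1 = {
    "type": "record",
    "name": "Booking",
    "fields": [{"name": "id", "type": "string"}],
}

# Adds a field WITH a default: data written with v1 can still be read.
booking_v2_safe = {
    "type": "record",
    "name": "Booking",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "currency", "type": "string", "default": "USD"},
    ],
}

# Adds a field WITHOUT a default: reading v1 data with this schema fails.
booking_v2_unsafe = {
    "type": "record",
    "name": "Booking",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "currency", "type": "string"},
    ],
}

safe_problems = missing_defaults(booking_v2_safe, booking_v1)
unsafe_problems = missing_defaults(booking_v2_unsafe, booking_v1)
```

A check like this could run in CI before a new schema version is published, though a schema registry's compatibility modes typically cover far more cases.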
