Data Engineering Weekly

Data Engineering Weekly #76

Weekly Data Engineering Newsletter

Ananth Packkildurai
Feb 28, 2022

Data Council - Austin 2022

Data Council has published the Austin 2022 schedule. Data Engineering Weekly readers can get a 20% discount with the promo code DataWeekly20.

https://www.datacouncil.ai/austin


Let’s start this week’s edition with this excellent thread on the challenges of collecting SaaS metrics and how to approach solving them.

Gwen (Chen) Shapira (@gwenshap), 9:04 PM ∙ Feb 25, 2022:
SaaS Data - It Shouldn't be This Hard! @sriramsubram explains the challenges in collecting and using usage metrics from a SaaS product... and then he explains his idea for a data platform that will make it so easy.
youtu.be: SaaS Data - It Shouldn’t be This Hard! Measuring user adoption on Confluent Cloud - how hard can it possibly be? As Ram Subramanian and Confluent Cloud engineering discovered - there are challenges...

Bence Arató: Fundraising by data companies in 2021

2021 was an exciting year for data startups. The blog compiles the funding rounds that data startups raised during the year.

https://adat.blog/2022/02/fundraising-by-data-companies-in-2021/


John Cutler: The Data-Informed Product Cycle

What does a data-informed product loop look like? The author walks through its lifecycle:

  1. Have a strategy

  2. Translate that into models

  3. Add minimally viable measurement

  4. Identify leverage points

  5. Explore options

  6. Run experiments

https://cutlefish.substack.com/p/tbm-852-the-data-informed-product


Vortexa: Choosing an Analytics Tool. Metabase Vs. Superset Vs. Redash

Vortexa writes about how it selected an analytics tool and why it decided on Metabase.

https://medium.com/vortechsa/choosing-an-analytics-tool-metabase-vs-superset-vs-redash-afd88e028ba9


Sarah Krasnik: Choosing a Data Quality Tool

The blog is an excellent overview of the data quality tool landscape. It is a good reference article if you're in the process of choosing a data quality tool.

https://sarahsnewsletter.substack.com/p/choosing-a-data-quality-tool


Vimeo: Monitoring data quality at scale using Monte Carlo

Staying on the data quality story, Vimeo writes about monitoring data quality with Monte Carlo. The CI/CD flow for data quality checks is an exciting read; I'm curious to read more about the feedback loop for error correction.

https://medium.com/vimeo-engineering-blog/monitoring-data-quality-at-scale-using-monte-carlo-934577e45ab0


Adam Marcus: Data diffs: Algorithms for explaining what changed in a dataset

What changed, and by how much, are the first questions we ask when looking at data. The blog walks through explanation algorithms for finding the differences between two datasets. The author implemented an open-source version of the Diff paper and explains how it can help solve the problem.

Diff Paper: http://www.bailis.org/papers/diff-vldb2019.pdf

https://blog.marcua.net/2022/02/20/data-diffs-algorithms-for-explaining-what-changed-in-a-dataset.html
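
To make the "what changed and by how much" question concrete, here is a minimal Python sketch. It is not the Diff operator from the paper and not the author's implementation; it only ranks single-column segments by how much a metric total moved between two snapshots. The function name, column names, and sample rows are all hypothetical.

```python
from collections import defaultdict


def explain_change(before, after, dimensions, metric):
    """Rank (column, value) segments by the change in the metric total.

    A simplified illustration: it looks at one column at a time and does
    not search combinations of columns as explanation algorithms do.
    """
    # (column, value) -> [total in `before`, total in `after`]
    totals = defaultdict(lambda: [0.0, 0.0])
    for idx, rows in enumerate((before, after)):
        for row in rows:
            for col in dimensions:
                totals[(col, row[col])][idx] += row[metric]

    ranked = [
        (col, value, after_total - before_total)
        for (col, value), (before_total, after_total) in totals.items()
    ]
    # Largest absolute change first.
    ranked.sort(key=lambda item: abs(item[2]), reverse=True)
    return ranked


# Hypothetical snapshots of the same table on two different days.
before = [
    {"country": "US", "device": "web", "revenue": 100.0},
    {"country": "US", "device": "mobile", "revenue": 80.0},
    {"country": "DE", "device": "web", "revenue": 50.0},
]
after = [
    {"country": "US", "device": "web", "revenue": 110.0},
    {"country": "US", "device": "mobile", "revenue": 20.0},
    {"country": "DE", "device": "web", "revenue": 55.0},
]

ranked = explain_change(before, after, ["country", "device"], "revenue")
```

In this sample, the first entry of `ranked` is the mobile segment, which accounts for the largest swing in revenue. Real explanation algorithms go further by scoring combinations of columns and pruning redundant explanations.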


Sponsored: RudderStack - Data Modeling In the Warehouse For Data Engineers

Many companies still struggle to answer even basic questions with their data. These data modeling best practices from RudderStack will help you build a well-defined core data layer, enabling teams to answer harder questions while ensuring a better experience for every end-user.

https://www.rudderstack.com/blog/data-modeling-in-the-warehouse-for-data-engineers


Netflix: Robust Foundation for Data Pipeline at Scale - Lessons learned from Netflix

The QCon talk about the Netflix data pipeline is now available. The workflow support for both event-driven and scheduled triggers is an exciting approach for an orchestration engine.


Stephen Bailey: Kicking the tires on dbt Metrics

The metrics layer is an exciting development, and I'm curious to see how it progresses. The author shares the experience of trying the dbt metrics layer, with one concern: there is a lot of YAML!

https://stkbailey.substack.com/p/kicking-the-tires-on-dbt-metrics


Dream11: Data Feast — A Highly Scalable Feature Store at Dream11

Dream11 writes about its feature store, Data Feast. The choice of HBase for the feature store is interesting. TIL about RonDB, and I'm looking forward to reading more on it.

https://blog.dream11engineering.com/data-feast-a-highly-scalable-feature-store-at-dream11-69b8ed5289fd


PyTorch: Introducing TorchRec, a library for modern production recommendation systems

PyTorch announced TorchRec, a PyTorch domain library for recommendation systems. The TorchRec library provides common sparsity and parallelism primitives, enabling researchers to build state-of-the-art personalization models and deploy them in production.

GitHub: https://github.com/pytorch/torchrec

https://pytorch.org/blog/introducing-torchrec/


Outerbounds: Notebooks In Production With Metaflow

Metaflow, an orchestration engine for ML pipelines, popularized deploying notebooks in production. Outerbounds introduces Notebook Cards, which let data scientists use notebooks to visualize and debug production workflows and help bridge the MLOps divide between prototype and production.

https://outerbounds.com/blog/notebooks-in-production-with-metaflow/


Expedia: Practical Schema Evolution with Avro

The article is an excellent compilation of practical advice on Avro schema evolution. It is a comprehensive guide that teaches users how schema evolution works in Avro and how to simplify managing schema changes.

https://medium.com/expedia-group-tech/practical-schema-evolution-with-avro-c07af8ba1725
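
As a rough illustration of one schema evolution rule, the sketch below mimics in plain Python how a reader with a newer schema can still read a record written with an older one: fields missing from the writer's schema are filled from the reader's defaults, and fields the reader does not know are dropped. It does not use the Avro library, it ignores aliases and type promotion, and the schemas and the `resolve_record` helper are hypothetical.

```python
def resolve_record(record, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema.

    Simplified take on Avro schema resolution for flat records:
    - field in both schemas: keep the written value
    - field only in the reader schema: use its default, or fail if none
    - field only in the writer schema: ignore it
    """
    writer_fields = {field["name"] for field in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"reader field {name!r} has no default")
    return resolved


# Hypothetical schemas: v2 adds `country` with a default and drops `email`.
WRITER_V1 = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": "string"},
    ],
}
READER_V2 = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "country", "type": "string", "default": "unknown"},
    ],
}

old_record = {"id": 1, "email": "a@example.com"}
upgraded = resolve_record(old_record, WRITER_V1, READER_V2)
```

Adding a field without a default is the change that breaks this kind of read, which is why adding defaults to new fields is common advice when evolving Avro schemas.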


Sponsored: RudderStack - The Data Stack Show Live: Is Reverse ETL Just Another Data Pipeline?

You’ve heard about Reverse ETL. Here’s your chance to learn all about the tooling from the folks who are creating it. Join hosts Eric and Kostas for a live recording of The Data Stack Show on March 9th to get insights from experts at Census, Hightouch, and Workato.

https://datastackshow.com/livestream-registration-reverse-etl/


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
