Data Council - Austin 2022
Data Council published the Austin 2022 schedule. Data Engineering Weekly readers can get a 20% discount using promo code DataWeekly20.
https://www.datacouncil.ai/austin
Let’s start this week’s edition with this excellent thread on the challenges of collecting SaaS metrics and how to approach solving them.
Bence Arató: Fundraising by data companies in 2021
2021 was an exciting year for data startups. The blog compiles the funding rounds that data startups raised during the year.
https://adat.blog/2022/02/fundraising-by-data-companies-in-2021/
John Cutler: The Data-Informed Product Cycle
How does the data-informed product loop look? The author narrates its lifecycle.
Have a strategy
Translate that into models
Add minimally viable measurement
Identify leverage points
Explore options
Run experiments
https://cutlefish.substack.com/p/tbm-852-the-data-informed-product
Vortexa: Choosing an Analytics Tool. Metabase Vs. Superset Vs. Redash
Vortexa writes about how it selected an analytics tool and why it decided on Metabase.
https://medium.com/vortechsa/choosing-an-analytics-tool-metabase-vs-superset-vs-redash-afd88e028ba9
Sarah Krasnik: Choosing a Data Quality Tool
The blog is an excellent overview of the data quality tool landscape. It is a good reference article if you're in the process of choosing a data quality tool.
https://sarahsnewsletter.substack.com/p/choosing-a-data-quality-tool
Vimeo: Monitoring data quality at scale using Monte Carlo
Staying on the data quality story, Vimeo writes about monitoring data quality with Monte Carlo. The CI/CD flow for the data quality check is an exciting read; I'm curious to read more about the feedback loop for error correction.
Adam Marcus: Data diffs: Algorithms for explaining what changed in a dataset
What changed and how much it changed are the first questions we ask when looking at data. The blog walks through explanation algorithms for finding the difference between two datasets. The author implemented an open-source version of the DIFF paper and explains how it can help solve the problem.
Diff Paper: http://www.bailis.org/papers/diff-vldb2019.pdf
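To make the idea concrete, here is a minimal Python sketch of a single-attribute data diff. It is not the DIFF paper's algorithm or the author's implementation: the function name `explain_diff`, the thresholds, and the sample rows are all hypothetical, and the "ratio" is simply a value's share in the new dataset divided by its share in the old one. The paper searches over combinations of attributes; this sketch only scores one attribute at a time.

```python
from collections import Counter


def explain_diff(before, after, attributes, min_support=0.1, min_ratio=1.5):
    """Rank attribute=value pairs that are over-represented in `after`.

    Simplified illustration only: each explanation is a single
    attribute value, scored by (share in after) / (share in before).
    """
    explanations = []
    for attr in attributes:
        before_counts = Counter(row[attr] for row in before)
        after_counts = Counter(row[attr] for row in after)
        for value, after_n in after_counts.items():
            # Support: fraction of the new dataset covered by this value.
            support = after_n / len(after)
            before_n = before_counts.get(value, 0)
            if before_n == 0:
                # The value is brand new, so its share grew without bound.
                ratio = float("inf")
            else:
                # Integer products keep the share ratio free of rounding noise.
                ratio = (after_n * len(before)) / (before_n * len(after))
            if support >= min_support and ratio >= min_ratio:
                explanations.append((attr, value, support, ratio))
    # Largest shift first.
    return sorted(explanations, key=lambda e: e[3], reverse=True)


if __name__ == "__main__":
    # Hypothetical sample rows, e.g. error events from last week and this week.
    last_week = (
        [{"country": "US", "device": "web"}] * 6
        + [{"country": "DE", "device": "web"}] * 2
        + [{"country": "DE", "device": "ios"}] * 2
    )
    this_week = (
        [{"country": "US", "device": "web"}] * 2
        + [{"country": "DE", "device": "web"}] * 2
        + [{"country": "DE", "device": "ios"}] * 6
    )
    for attr, value, support, ratio in explain_diff(
        last_week, this_week, ["country", "device"]
    ):
        print(f"{attr}={value} support={support:.2f} ratio={ratio:.2f}")
```

The support threshold filters out values that are too rare to matter, and the ratio threshold filters out values whose share barely moved. Both cutoffs are arbitrary here and would need tuning on real data.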
Sponsored: RudderStack - Data Modeling In the Warehouse For Data Engineers
Many companies still struggle to answer even basic questions with their data. These data modeling best practices from RudderStack will help you build a well-defined core data layer, enabling teams to answer harder questions while ensuring a better experience for every end-user.
https://www.rudderstack.com/blog/data-modeling-in-the-warehouse-for-data-engineers
Netflix: Robust Foundation for Data Pipeline at Scale - Lessons learned from Netflix
The QCon talk about the Netflix data pipeline is now available. Workflow support for both event-driven and scheduled triggers is an exciting approach for an orchestration engine.
Stephen Bailey: Kicking the tires on dbt Metrics
The metrics layer is an exciting development, and I'm curious to see how it progresses. The author shares the experience of trying the dbt metrics layer, with one concern: there is a lot of YAML!
https://stkbailey.substack.com/p/kicking-the-tires-on-dbt-metrics
Dream 11: Data Feast — A Highly Scalable Feature Store at Dream11
Dream 11 writes about its feature store, Data Feast. The choice of HBase as a feature store is interesting. TIL about RonDB, and I'm looking forward to reading more on it.
PyTorch: Introducing TorchRec, a library for modern production recommendation systems
PyTorch announced TorchRec, a PyTorch domain library for recommendation systems. The TorchRec library provides common sparsity and parallelism primitives, enabling researchers to build state-of-the-art personalization models and deploy them in production.
GitHub: https://github.com/pytorch/torchrec
https://pytorch.org/blog/introducing-torchrec/
Outerbounds: Notebooks In Production With Metaflow
Metaflow, an orchestration engine for ML pipelines, popularized deploying notebooks in production. Outerbounds introduces Notebook Cards, which allow data scientists to use notebooks to visualize and debug production workflows and help bridge the MLOps divide between prototype and production.
https://outerbounds.com/blog/notebooks-in-production-with-metaflow/
Expedia: Practical Schema Evolution with Avro
The article is an excellent compilation of practical advice on Avro schema evolution. It is a comprehensive guide for anyone who needs to manage schema changes.
https://medium.com/expedia-group-tech/practical-schema-evolution-with-avro-c07af8ba1725
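As a rough illustration of one rule from Avro schema resolution, the Python sketch below checks whether a new reader schema can still read records written with an older schema. In Avro, a field that exists only in the reader schema needs a default value; otherwise reading old data fails. The sketch uses plain dictionaries in Avro's JSON schema shape and no Avro library. The `Booking` schemas and the helper names `missing_defaults` and `resolve` are hypothetical, and the sketch ignores aliases, type promotion, and nested records.

```python
def missing_defaults(reader_schema, writer_schema):
    """Return reader fields the writer lacks and that have no default.

    An empty list means this (partial) check found no backward
    compatibility problem for record fields.
    """
    writer_fields = {field["name"] for field in writer_schema["fields"]}
    return [
        field["name"]
        for field in reader_schema["fields"]
        if field["name"] not in writer_fields and "default" not in field
    ]


def resolve(record, reader_schema):
    """Project a record written with an older schema onto the reader schema."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            # Field added after the record was written: fall back to the default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    # Writer-only fields are dropped, since the reader schema does not know them.
    return resolved


# Hypothetical schemas: v1 is what old data was written with.
BOOKING_V1 = {
    "type": "record",
    "name": "Booking",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "price", "type": "double"},
    ],
}

# Safe evolution: the new field carries a default.
BOOKING_V2_SAFE = {
    "type": "record",
    "name": "Booking",
    "fields": BOOKING_V1["fields"]
    + [{"name": "currency", "type": "string", "default": "USD"}],
}

# Unsafe evolution: the new field has no default, so old records cannot be read.
BOOKING_V2_UNSAFE = {
    "type": "record",
    "name": "Booking",
    "fields": BOOKING_V1["fields"] + [{"name": "currency", "type": "string"}],
}

if __name__ == "__main__":
    old_record = {"id": "b1", "price": 10.0}
    print(missing_defaults(BOOKING_V2_SAFE, BOOKING_V1))
    print(missing_defaults(BOOKING_V2_UNSAFE, BOOKING_V1))
    print(resolve(old_record, BOOKING_V2_SAFE))
```

A check like this could run in CI before a schema change is merged, though a real setup would more likely rely on a schema registry's compatibility checks than on a hand-rolled function.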
Sponsored: RudderStack - The Data Stack Show Live: Is Reverse ETL Just Another Data Pipeline?
You’ve heard about Reverse ETL. Here’s your chance to learn all about the tooling from the folks who are creating it. Join hosts Eric and Kostas for a live recording of The Data Stack Show on March 9th to get insights from experts at Census, Hightouch, and Workato.
https://datastackshow.com/livestream-registration-reverse-etl/
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.