Data Engineering Weekly #80

Weekly Data Engineering Newsletter

Mar 28, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

TechCrunch: Astronomer ready for its next mission after Datakin acquisition, $213M Series C

It's been an eventful week at the Data Council conference in Austin. I plan to reflect on my experience later this week, so stay tuned. One of the big news items during the conference was that Astronomer acquired Datakin, the company behind the open-source Marquez project. I predicted consolidation among data lineage and orchestration engines, but I never thought it would happen this fast.

Ananth Packkildurai@ananthdurai

@sarahmk125 I guess MDP is still maturing, and many vendors are still in the early stages. So customers don't have a choice as of now. Hopefully, as the market grows, there will be consolidation and M&A, for example, the data orchestration & lineage tools will merge into one system

9:35 PM · Jan 6, 2022


https://techcrunch.com/2022/03/23/astronomer-ready-for-its-next-mission-after-datakin-acquisition-213m-series-c/

Jon Loyens: How Should We Be Thinking about Data Lineage?

Why is data lineage so crucial in data management? The author gives an overview of what comprehensive data lineage can bring to data management.

https://towardsdatascience.com/how-should-we-be-thinking-about-data-lineage-541ca5ab83d0

Ron Berman & Ayelet Israeli: The Value of Descriptive Analytics - Evidence from Online Retailers

Companies invest a lot in analytics, but are these investments valuable? The authors studied about 1,500 online retailers and found that using a descriptive dashboard increased the retailers' weekly revenues by 4%-10%.

Ron Berman@marketsensei

Companies invest a lot in analytics - but are these investments valuable? @IsraeliAyelet and I studied ~1,500 online retailers and found that using a descriptive dashboard increased their weekly revenues by 4%-10%. >> #MarTech #BigData #Analytics #ecommerce #DataScience

SynthDiD estimate of ATT of adopting analytics dashboard by ecommerce retailers

1:49 PM · Mar 26, 2022


https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3745748

Lorien Pratt: A Framework for How Data Informs Decisions

Staying on data and business decisions: data storage and computation have become less expensive and commoditized, but how does data connect to a business decision? Are we asking the right business questions? TIL about the Decision Intelligence framework for data-informed decisions.

https://www.lorienpratt.com/a-framework-for-how-data-informs-decisions/

I'm looking forward to learning more about this in the coming weeks. I found this short video thought-provoking.

Shopify: A Data Scientist’s Guide To Measuring Product Success

How do we measure the success of a data product? The engineering approach measures data freshness, pipeline speed, and model accuracy. Shopify writes an informative blog post explaining why measuring success against the business goal and from the customer's perspective is vital for a data product.

https://shopifyengineering.myshopify.com/blogs/engineering/a-data-scientist-s-guide-to-measuring-product-success

Future: Data50 - The World’s Top Data Startups

a16z's Future released Data50, a list of the top data startups with an analysis of their funding and location. I'm thrilled to see all the Data Engineering Weekly sponsors, RudderStack, Monte Carlo, and Firebolt, featured in the Data50 list.

https://future.a16z.com/data50/

Pardis Noorzad: Challenges in data sharing and transfer

Buy vs. build is an ongoing architectural decision in every organization. I've seen folks underestimate the "cost to integrate" off-the-shelf solutions. The author captures the challenges in validating and integrating MLOps and DataOps products. I have written about the emerging patterns of data sharing in Data Engineering Weekly [Omicron Paradigm: Architectural patterns for the Infinite Data Logistic]. It is an exciting data engineering challenge to solve.

https://djpardis.medium.com/data-sharing-and-transfer-challenges-2e87e18a1167

Zan Armstrong: Stop aggregating away the signal in your data

Aggregation is the standard best practice for analyzing time series data, but it can create problems by stripping away crucial context. The author narrates the consequences of uninformed data aggregation.

But every time you aggregate, you make a decision about which features of your data matter and which ones you are willing to drop. Informed aggregation simplifies and prioritizes. Uninformed aggregation means you’ll never know what insights you lost.
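To make the point concrete, here is a minimal sketch with made-up numbers (the hourly counts and the helper names `daily_total` and `peak_to_mean` are my own illustration, not from the article). Two days have the same daily total, so a daily rollup cannot tell them apart, while the hourly shape clearly differs.

```python
from statistics import mean

# Hypothetical hourly request counts for two days (24 values each).
day_1 = [100] * 24              # steady load all day
day_2 = [50] * 12 + [150] * 12  # quiet first half, busy second half

def daily_total(hours):
    """The aggregate a daily dashboard would typically show."""
    return sum(hours)

def peak_to_mean(hours):
    """A simple shape feature that the daily total throws away."""
    return max(hours) / mean(hours)

totals = (daily_total(day_1), daily_total(day_2))
ratios = (peak_to_mean(day_1), peak_to_mean(day_2))

print(totals)  # (2400, 2400)
print(ratios)  # (1.0, 1.5)
```

The daily totals match exactly, yet the second day peaks at 1.5 times its average load. Whether that matters depends on the question being asked, which is the article's argument for deciding deliberately what to aggregate away.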

https://stackoverflow.blog/2022/03/03/stop-aggregating-away-the-signal-in-your-data/

Spotify: Comparing quantiles at scale in online A/B-testing

Spotify writes about how it uses properties of the Poisson bootstrap algorithm and quantile estimators to reduce the computational complexity of bootstrap confidence intervals for quantiles.
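For readers new to the idea, below is a naive sketch of a Poisson bootstrap confidence interval for a quantile. It is not Spotify's optimized method: the post is about avoiding exactly this kind of brute-force resampling. The function names, parameters, and the synthetic data are all my own assumptions for illustration. Each observation gets an independent Poisson(1) weight in every resample, and the quantile is read off the weighted, sorted data.

```python
import math
import random

def poisson1(rng):
    """Draw from Poisson(1) using Knuth's multiplication method."""
    threshold = math.exp(-1.0)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def poisson_bootstrap_quantile_ci(values, q, n_boot=300, alpha=0.05, seed=0):
    """Percentile CI for the q-th quantile (0 < q <= 1) via Poisson(1) weights."""
    rng = random.Random(seed)
    ordered = sorted(values)  # sort once, reuse for every resample
    estimates = []
    while len(estimates) < n_boot:
        weights = [poisson1(rng) for _ in ordered]
        total = sum(weights)
        if total == 0:
            continue  # empty resample, draw again
        target = q * total
        cumulative = 0
        for value, weight in zip(ordered, weights):
            cumulative += weight
            if cumulative >= target:
                estimates.append(value)  # weighted quantile of this resample
                break
    estimates.sort()
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Synthetic, skewed data standing in for a per-user metric.
data_rng = random.Random(42)
data = [data_rng.expovariate(1.0) for _ in range(1000)]
sample_median = sorted(data)[len(data) // 2]
lo, hi = poisson_bootstrap_quantile_ci(data, q=0.5)
print(f"median={sample_median:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

The appeal of Poisson weights is that each observation's weight is drawn independently, so the resampling can be done per row without knowing the sample size up front, which suits distributed computation.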

https://engineering.atspotify.com/2022/03/comparing-quantiles-at-scale-in-online-a-b-testing/

Lyft: Orchestrating Data Pipelines at Lyft - comparing Flyte and Airflow

Last week we saw Spotify moving away from Luigi to Flyte. Lyft writes about its incubation of Flyte and how it differs from Airflow. However, I can't stop wondering why they built a new system instead of adding the features to Airflow. Nonetheless, it is excellent to see event-driven dependency management rather than the polling approach in Airflow.

https://eng.lyft.com/orchestrating-data-pipelines-at-lyft-comparing-flyte-and-airflow-72c40d143aad

Miro: Miro Data Engineering team’s journey to monitoring

I've not seen many engineering blogs talking about the developer workflow after an alert or incident in the data pipeline. DataOps is my favorite part of data engineering, and I'm glad to see Miro's DataOps developer workflow.

https://medium.com/miro-engineering/our-journey-to-data-engineering-monitoring-c14d6ff20351

Confluent: Why ZooKeeper Was Replaced with KRaft – The Log of All Logs

KIP-500 is probably one of the most widely read Kafka RFCs, and Confluent writes an excellent summary of replacing ZooKeeper with KRaft.

https://www.confluent.io/blog/why-replace-zookeeper-with-kafka-raft-the-log-of-all-logs/

All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
