Data Engineering Weekly

Data Engineering Weekly #80

Weekly Data Engineering Newsletter

Ananth Packkildurai
Mar 28, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.


TechCrunch: Astronomer ready for its next mission after Datakin acquisition, $213M Series C

It's been an eventful week at the Data Council conference in Austin. I plan to reflect on my experience later this week, so stay tuned. One of the big announcements during the conference was that Astronomer acquired Datakin, the company behind the open-source Marquez project. I predicted consolidation among data lineage and orchestration engines, but I never thought it would happen this fast.

Ananth Packkildurai @ananthdurai
@sarahmk125 I guess MDP is still maturing, and many vendors are still in the early stages. So customers don't have a choice as of now. Hopefully, as the market grows, there will be consolidation and M&A, for example, the data orchestration & lineage tools will merge into one system
9:35 PM ∙ Jan 6, 2022

https://techcrunch.com/2022/03/23/astronomer-ready-for-its-next-mission-after-datakin-acquisition-213m-series-c/


Jon Loyens: How Should We Be Thinking about Data Lineage?

Why is data lineage so crucial in data management? The author gives an overview of what a comprehensive data lineage can bring into data management.

https://towardsdatascience.com/how-should-we-be-thinking-about-data-lineage-541ca5ab83d0


Ron Berman & Ayelet Israeli: The Value of Descriptive Analytics - Evidence from Online Retailers

Companies invest a lot in analytics, but are these investments valuable? The study of ~1,500 online retailers found that adopting a descriptive dashboard increased their weekly revenues by 4%-10%.

Ron Berman @marketsensei
Companies invest a lot in analytics - but are these investments valuable? @IsraeliAyelet and I studied ~1,500 online retailers and found that using a descriptive dashboard increased their weekly revenues by 4%-10%. >> #MarTech #BigData #Analytics #ecommerce #DataScience
[Figure: SynthDiD estimate of ATT of adopting analytics dashboard by ecommerce retailers]
1:49 PM ∙ Mar 26, 2022

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3745748


Lorien Pratt: A Framework for How Data Informs Decisions

Staying on the theme of data and business decisions: data storage and computation have become less expensive and commoditized, but how does data connect to a business decision? Are we asking the right business question? TIL about the Decision Intelligence framework for data-informed decisions.

https://www.lorienpratt.com/a-framework-for-how-data-informs-decisions/

I'm looking forward to learning more about this in the coming weeks. I found this short video thought-provoking.


Shopify: A Data Scientist’s Guide To Measuring Product Success

How do we measure the success of a data product? The engineering approach measures data freshness, pipeline speed, and model accuracy. Shopify writes an informative blog post explaining why measuring success from the business-goal and customer perspective is vital for a data product.

https://shopifyengineering.myshopify.com/blogs/engineering/a-data-scientist-s-guide-to-measuring-product-success


Sponsored: Firebolt - The Big Data Game

Play The Big Data Game – Because even a simple query can send you on an unexpected journey...

https://www.firebolt.io/big-data-game


Future: Data50 - The World’s Top Data Startups

a16z's Future released Data50, a list of the top data startups with a funding and location analysis. I'm thrilled to see all the Data Engineering Weekly sponsors, RudderStack, Monte Carlo, and Firebolt, featured in the Data50 startup list.

https://future.a16z.com/data50/


Sponsored: RudderStack - Announcing RudderStack Reverse ETL

RudderStack's Warehouse Actions is now RudderStack Reverse ETL. The rebranded product launched with new features to make your data engineering workflows easier, including enhanced processing and scheduling, Visual Data Mapper, and Custom SQL Models. The product complements RudderStack's Event Stream offering, sharing all 150+ integrations and, in most cases, its Transformation and Data Governance capabilities.

https://www.rudderstack.com/blog/announcing-rudderstack-reverse-etl


Pardis Noorzad: Challenges in data sharing and transfer

Buy vs. build is an ongoing architectural decision in any organization. I've seen folks underestimate the "cost to integrate" off-the-shelf solutions. The author captures the challenges in validating and integrating MLOps and DataOps products. I have written about the emerging patterns of data sharing in Data Engineering Weekly [Omicron Paradigm: Architectural patterns for the Infinite Data Logistic]. It is an exciting data engineering challenge to solve.

https://djpardis.medium.com/data-sharing-and-transfer-challenges-2e87e18a1167


Zan Armstrong: Stop aggregating away the signal in your data

Aggregation is the standard best practice for analyzing time series data, but it can create problems by stripping away crucial context. The author walks through the consequences of uninformed data aggregation.

But every time you aggregate, you make a decision about which features of your data matter and which ones you are willing to drop. Informed aggregation simplifies and prioritizes. Uninformed aggregation means you’ll never know what insights you lost.
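
To make the point concrete, here is a minimal sketch of how a daily total can hide a shift that is obvious at hourly granularity. The hourly order counts and the helper names (`daily_total`, `largest_hourly_shift`) are made up for illustration and are not taken from the article.

```python
# Hypothetical hourly order counts for two days (illustrative numbers only).
# Day 1 has an evening peak; on day 2 the volume has moved to the early morning.
day1 = [5] * 8 + [10] * 10 + [30] * 6
day2 = [20] * 8 + [10] * 10 + [10] * 6


def daily_total(hours):
    """Aggregate 24 hourly counts into a single daily number."""
    return sum(hours)


def largest_hourly_shift(a, b):
    """Largest absolute change between the same hour on two days."""
    return max(abs(x - y) for x, y in zip(a, b))


# The aggregated view says nothing changed between the two days.
print(daily_total(day1), daily_total(day2))

# The hourly view shows that the shape of the day changed substantially.
print(largest_hourly_shift(day1, day2))
```

Both days sum to the same total, so a daily dashboard would show a flat line while the evening peak has disappeared entirely.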

https://stackoverflow.blog/2022/03/03/stop-aggregating-away-the-signal-in-your-data/


Spotify: Comparing quantiles at scale in online A/B-testing

Spotify writes about how it uses properties of the Poisson bootstrap algorithm and quantile estimators to reduce the computational complexity of bootstrap confidence intervals.
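
For readers new to the technique, below is a rough sketch of a plain Poisson bootstrap for a quantile: each observation gets an independent Poisson(1) weight instead of being resampled with replacement. This is the textbook baseline only, not Spotify's optimized method, and the function names (`poisson1`, `weighted_quantile`, `poisson_bootstrap_quantile_ci`) are my own.

```python
import math
import random


def poisson1(rng):
    """Draw one Poisson(1) variate using Knuth's multiplication method."""
    threshold = math.exp(-1.0)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def weighted_quantile(sorted_values, weights, q):
    """Smallest value whose cumulative weight reaches a fraction q of the total."""
    target = q * sum(weights)
    running = 0
    for value, weight in zip(sorted_values, weights):
        running += weight
        if weight > 0 and running >= target:
            return value
    return sorted_values[-1]  # only reached if every weight is zero


def poisson_bootstrap_quantile_ci(values, q, n_boot=1000, alpha=0.05, seed=0):
    """Percentile confidence interval for the q-th quantile."""
    rng = random.Random(seed)
    ordered = sorted(values)
    estimates = []
    for _ in range(n_boot):
        # A Poisson(1) weight per row approximates resampling with replacement.
        weights = [poisson1(rng) for _ in ordered]
        estimates.append(weighted_quantile(ordered, weights, q))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Example: interval for the median of the integers 1..200.
lower, upper = poisson_bootstrap_quantile_ci(list(range(1, 201)), q=0.5, n_boot=200)
print(lower, upper)
```

The appeal of Poisson weights is that each row's weight is independent of every other row, which makes the resampling step easy to distribute. The cost of this naive version is still one pass over the data per bootstrap replicate, which is the part the Spotify post sets out to reduce.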

https://engineering.atspotify.com/2022/03/comparing-quantiles-at-scale-in-online-a-b-testing/


Lyft: Orchestrating Data Pipelines at Lyft - comparing Flyte and Airflow

Last week we saw Spotify moving away from Luigi to Flyte. Lyft writes about its incubation of Flyte and how it differs from Airflow. However, I can't stop wondering why they built a new system instead of adding the features to Airflow! Nonetheless, it is excellent to see event-driven dependency management rather than the polling approach in Airflow.
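
As a toy illustration of the event-driven idea, the sketch below runs a task the moment its last upstream dependency reports completion, instead of repeatedly checking whether upstream work has finished. It is an in-process example with hypothetical names (`EventDrivenScheduler`, `add_task`) and does not use or mirror the Flyte or Airflow APIs.

```python
from collections import defaultdict


class EventDrivenScheduler:
    """Runs each task as soon as all of its upstream tasks have completed."""

    def __init__(self):
        self.waiting_on = {}                 # task -> upstream tasks not yet done
        self.downstream = defaultdict(list)  # task -> tasks that depend on it
        self.run_order = []

    def add_task(self, name, depends_on=()):
        deps = set(depends_on)
        self.waiting_on[name] = deps
        for dep in deps:
            self.downstream[dep].append(name)

    def start(self):
        # Kick off only the tasks that have no dependencies at all.
        roots = [task for task, deps in self.waiting_on.items() if not deps]
        for task in roots:
            self._run(task)

    def _run(self, name):
        self.run_order.append(name)  # stand-in for doing the real work
        self._on_complete(name)

    def _on_complete(self, name):
        # Completion event: notify only the tasks that depend on this one.
        for child in self.downstream[name]:
            self.waiting_on[child].discard(name)
            if not self.waiting_on[child]:
                self._run(child)


scheduler = EventDrivenScheduler()
scheduler.add_task("extract")
scheduler.add_task("transform", depends_on=["extract"])
scheduler.add_task("validate", depends_on=["extract"])
scheduler.add_task("load", depends_on=["transform", "validate"])
scheduler.start()
print(scheduler.run_order)
```

No task ever asks "is my upstream done yet?"; the upstream pushes the notification, so the work done per completion is proportional to the number of direct dependents rather than to the number of waiting tasks multiplied by the polling frequency.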

https://eng.lyft.com/orchestrating-data-pipelines-at-lyft-comparing-flyte-and-airflow-72c40d143aad


Miro: Miro Data Engineering team’s journey to monitoring

I've not seen many engineering blogs talking about the developer workflow after an alert or incident in the data pipeline. DataOps is my favorite part of data engineering, and I'm glad to see Miro's DataOps developer workflow.

https://medium.com/miro-engineering/our-journey-to-data-engineering-monitoring-c14d6ff20351


Confluent: Why ZooKeeper Was Replaced with KRaft – The Log of All Logs

KIP-500 is probably the most widely read Kafka RFC, and Confluent writes an excellent summary of replacing ZooKeeper with KRaft.

https://www.confluent.io/blog/why-replace-zookeeper-with-kafka-raft-the-log-of-all-logs/


All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.


© 2023 Ananth Packkildurai