Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
TechCrunch: Astronomer ready for its next mission after Datakin acquisition, $213M Series C
It's been an eventful week at the Data Council conference in Austin. I plan to reflect on my experience later this week, so stay tuned. One of the big news items on conference day was that Astronomer acquired Datakin, the company behind the open-source Marquez project. I predicted the consolidation of data lineage and orchestration engines, but I never thought it would happen this fast.
Jon Loyens: How Should We Be Thinking about Data Lineage?
Why is data lineage so crucial in data management? The author gives an overview of what comprehensive data lineage can bring to data management.
https://towardsdatascience.com/how-should-we-be-thinking-about-data-lineage-541ca5ab83d0
Ron Berman & Ayelet Israeli: The Value of Descriptive Analytics - Evidence from Online Retailers
Companies invest a lot in analytics, but are these investments valuable? The study found that retailers who used a descriptive dashboard increased their weekly revenues by 4%-10%.
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3745748
Lorien Pratt: A Framework for How Data Informs Decisions
Staying on data and business decisions: data storage and computation have become less expensive and commoditized, but how does data bridge to a business decision? Are we asking the right business questions? TIL about the Decision Intelligence framework for data-informed decisions.
https://www.lorienpratt.com/a-framework-for-how-data-informs-decisions/
I'm looking forward to learning more about this in the coming weeks. I found this short video a thought-provoking one.
Shopify: A Data Scientist’s Guide To Measuring Product Success
How do we measure the success of a data product? The engineering approach measures data freshness, pipeline speed, and the model's accuracy. Shopify writes an informative blog that explains why measuring success from the business-goal and customer perspective is vital for a data product.
Sponsored: Firebolt - The Big Data Game
Play The Big Data Game – Because even a simple query can send you on an unexpected journey...
https://www.firebolt.io/big-data-game
Future: Data50 - The World’s Top Data Startups
a16z's Future released Data50, a list of the top data startups with a funding and location analysis. I'm thrilled to see all the Data Engineering Weekly sponsors, RudderStack, Monte Carlo, & Firebolt, featured in the Data50 startup list.
https://future.a16z.com/data50/
Sponsored: RudderStack - Announcing RudderStack Reverse ETL
RudderStack's Warehouse Actions is now RudderStack Reverse ETL. The rebranded product launched with new features to make your data engineering workflows easier, including enhanced processing and scheduling, Visual Data Mapper, and Custom SQL Models. The product complements RudderStack's Event Stream offering, sharing all 150+ integrations and, in most cases, its Transformation and Data Governance capabilities.
https://www.rudderstack.com/blog/announcing-rudderstack-reverse-etl
Pardis Noorzad: Challenges in data sharing and transfer
Buy vs. build is an ongoing architectural decision in any organization. I've seen folks underestimate the "cost to integrate" off-the-shelf solutions. The author captures the challenges in validating and integrating MLOps and DataOps products. I have written about the emerging patterns of data sharing in Data Engineering Weekly [Omicron Paradigm: Architectural patterns for the Infinite Data Logistic]. It is an exciting data engineering challenge to solve.
https://djpardis.medium.com/data-sharing-and-transfer-challenges-2e87e18a1167
Zan Armstrong: Stop aggregating away the signal in your data
Aggregation is the standard best practice for analyzing time series data, but it can create problems by stripping away crucial context. The author narrates the consequences of uninformed data aggregation.
But every time you aggregate, you make a decision about which features of your data matter and which ones you are willing to drop. Informed aggregation simplifies and prioritizes. Uninformed aggregation means you’ll never know what insights you lost.
https://stackoverflow.blog/2022/03/03/stop-aggregating-away-the-signal-in-your-data/
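To make the point concrete, here is a tiny sketch of my own (not taken from the article). The latency numbers are made up for illustration: two hours of readings have the same hourly mean, and only a second aggregate reveals that one of them contains a severe spike.

```python
from statistics import mean

# Hypothetical data: request latency in ms, six readings per hour.
latency_ms = {
    0: [100, 102, 98, 101, 99, 100],  # hour 0: steady around 100 ms
    1: [40, 41, 39, 40, 400, 40],     # hour 1: mostly fast, one severe spike
}

# Aggregating to the hourly mean makes both hours look identical.
hourly_mean = {hour: mean(values) for hour, values in latency_ms.items()}

# Keeping one more feature (the hourly maximum) preserves the signal.
hourly_max = {hour: max(values) for hour, values in latency_ms.items()}

for hour in latency_ms:
    print(f"hour={hour} mean={hourly_mean[hour]} max={hourly_max[hour]}")
```

Which extra feature to keep (max, a high percentile, a count of outliers) depends on the question being asked; the maximum is just the simplest choice for this toy data.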
Spotify: Comparing quantiles at scale in online A/B-testing
Spotify writes about how it uses properties of the Poisson bootstrap algorithm and quantile estimators to reduce the computational complexity of bootstrap confidence intervals for quantiles.
https://engineering.atspotify.com/2022/03/comparing-quantiles-at-scale-in-online-a-b-testing/
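For readers new to the idea, below is a minimal sketch of a plain Poisson bootstrap for a quantile confidence interval. It is not Spotify's optimized method, which avoids much of this work; it only shows the baseline technique, where each observation gets an independent Poisson(1) weight instead of being resampled. All function names, the sample data, and the parameter defaults are my own assumptions.

```python
import math
import random


def poisson1(rng):
    """Draw from Poisson(1) using Knuth's multiplication method."""
    threshold = math.exp(-1.0)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def weighted_quantile(sorted_values, weights, q):
    """Smallest value whose cumulative weight reaches fraction q (0 < q <= 1) of the total."""
    target = q * sum(weights)
    cumulative = 0
    for value, weight in zip(sorted_values, weights):
        cumulative += weight
        if cumulative >= target:
            return value
    return sorted_values[-1]


def poisson_bootstrap_quantile_ci(values, q, n_boot=1000, alpha=0.05, seed=0):
    """Percentile confidence interval for the q-quantile via Poisson bootstrap."""
    rng = random.Random(seed)
    ordered = sorted(values)
    estimates = []
    for _ in range(n_boot):
        # One Poisson(1) weight per observation stands in for "how many times
        # this observation appears in the bootstrap resample".
        weights = [poisson1(rng) for _ in ordered]
        if sum(weights) == 0:
            continue  # empty resample; only plausible for very small samples
        estimates.append(weighted_quantile(ordered, weights, q))
    estimates.sort()
    lower = estimates[int((alpha / 2) * len(estimates))]
    upper = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lower, upper


# Hypothetical metric values: the integers 1..200.
sample = list(range(1, 201))
lower, upper = poisson_bootstrap_quantile_ci(sample, q=0.5, n_boot=300)
print(f"95% CI for the median: [{lower}, {upper}]")
```

Because the weights are independent per observation, they can be generated in a single pass over the data without knowing the sample size up front, which is the usual reason to reach for Poisson weights on large datasets.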
Lyft: Orchestrating Data Pipelines at Lyft - comparing Flyte and Airflow
Last week we saw Spotify moving away from Luigi to Flyte. Lyft writes about its incubation of Flyte and how it differs from Airflow. However, I can't stop wondering why they built a new system instead of adding the features to Airflow! Nonetheless, it is excellent to see event-driven dependency management rather than the polling approach in Airflow.
https://eng.lyft.com/orchestrating-data-pipelines-at-lyft-comparing-flyte-and-airflow-72c40d143aad
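As a toy illustration of the polling versus event-driven distinction, here is a small sketch. It does not reflect how Airflow or Flyte are actually implemented; the class names, the event name, and the tick-based simulation are all hypothetical.

```python
from collections import defaultdict


class PollingWaiter:
    """Downstream task that repeatedly asks whether its upstream output is ready."""

    def __init__(self, is_ready):
        self.is_ready = is_ready
        self.checks = 0

    def tick(self):
        """One scheduler tick: check readiness and report whether we can run."""
        self.checks += 1
        return self.is_ready()


class EventBus:
    """Minimal publish/subscribe hub: downstream tasks register a callback once."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event, callback):
        self._subscribers[event].append(callback)

    def publish(self, event):
        for callback in self._subscribers[event]:
            callback(event)


# Simulate an upstream task that finishes on tick 5 of 10.
finished = {"upstream": False}
triggered = []  # events received by the event-driven downstream task

poller = PollingWaiter(lambda: finished["upstream"])
bus = EventBus()
bus.subscribe("upstream.done", triggered.append)

polling_started_at = None
for tick in range(1, 11):
    if tick == 5:
        finished["upstream"] = True
        bus.publish("upstream.done")  # event-driven: one notification on completion
    if polling_started_at is None and poller.tick():
        polling_started_at = tick     # polling: found out only by checking every tick

print(f"polling checks: {poller.checks}, event callbacks: {len(triggered)}")
```

The poller does work on every tick until the upstream finishes, while the subscriber does nothing until it is notified. With a slower polling interval the checks get cheaper, but the downstream task learns about completion later.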
Miro: Miro Data Engineering team’s journey to monitoring
I've not seen many engineering blogs talk about the developer workflow after an alert or incident in the data pipeline. DataOps is my favorite part of data engineering, and I'm glad to see Miro share its DataOps developer workflow.
https://medium.com/miro-engineering/our-journey-to-data-engineering-monitoring-c14d6ff20351
Confluent: Why ZooKeeper Was Replaced with KRaft – The Log of All Logs
KIP-500 is probably the most widely read Kafka RFC, and Confluent writes an excellent summary of replacing ZooKeeper with KRaft.
https://www.confluent.io/blog/why-replace-zookeeper-with-kafka-raft-the-log-of-all-logs/
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.