Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up today to get 5M events/month for free through the end of 2022.
Sponsored: The Data Stack Show Live - What is the Modern Data Stack?
Join The Data Stack Show Live for a special panel with experts from Databricks, dbt, Fivetran, Essence VC, and Hinge. The panel will look at the modern stack from all angles and discuss the future of data tooling.
https://rudderstack.com/video-library/the-data-stack-show-live-what-is-the-modern-data-stack/
Data Engineering Weekly - A Year in Review of 2021
Yes, this week's edition is slightly different from the usual. As we near the end of 2021, I wanted to step back and review the milestones Data Engineering Weekly reached, along with some insights from an analysis of the articles we shared.
You can see the complete analysis & code here:
https://share.streamlit.io/ananthdurai/dataengweekly-analysis/main/dataengweekly_analysis/app.py
GitHub:
https://github.com/ananthdurai/dataengweekly-analysis
Growth of Data Engineering Weekly in 2021
Data Engineering Weekly came a long way this year. The number of subscribers grew by an impressive 293%. 🎉🥳🎈
Thank you all for your kind support, reading & sharing the newsletter.
Community contribution in 2021
In 2021, we created a GitHub repo for our readers to contribute articles and share their views on the latest developments in data engineering. If you have not contributed before, here is the repo for your articles.
https://github.com/ananthdurai/dataengineeringweekly
Sponsorship & Support
Data Engineering Weekly also gained a title sponsorship from RudderStack and a link sponsorship from MonteCarloData. Thank you both, RudderStack & MonteCarloData.
Data Engineering Trends Prediction
Towards the end of 2020, Data Engineering Weekly published "Back To The Future: Data Engineering Trends 2020 & Beyond", a look back at the latest developments in data engineering in 2020 and thoughts on 2021 and beyond.
I recently gave a talk on the emerging trends at Crunch Conf 2021. Here are the slides from the talk.
https://speakerdeck.com/vananth22/back-to-the-future-emerging-trends-in-data-engineering
What happened to the trends prediction?
Here is a summary of the top three trends we talked about in the data engineering newsletter.
Metadata management will become mainstream. The data lineage, quality, and discovery tools will merge into a unified data management platform.
Data Mesh principles will get adopted more and drive a unified data management platform.
Lakehouse systems like Hudi, Iceberg, and Delta Lake will significantly shape the data engineering architecture.
As we approach the end of 2021, let’s step back and see what happened to the predictions.
How can we measure the trends?
The idea is simple: take all the articles shared in Data Engineering Weekly and run a simple N-Gram analysis over them to see if we can discover any insights. The N-Gram analysis runs on three parts of the content.
Domain Name Analysis
The purpose is to find which company publishes the most data engineering articles.
N-Gram Analysis on URL
Articles often carry their keywords in the URL itself. The N-Gram analysis on URLs leads to some exciting trend insights.
N-Gram Analysis on Blog Contents
Lastly, we crawled all the blog content published to understand the keyword trends.
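The first two parts of the analysis above can be sketched with the Python standard library alone. This is a minimal illustration rather than the code in the linked repo: the sample links and the function names (domain_counts, url_tokens, ngrams, url_ngram_counts) are hypothetical, and the tokenization rule (split the URL path on anything that is not a letter or digit) is an assumption.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical sample links; the real analysis runs on the newsletter's own links.
SAMPLE_LINKS = [
    "https://blog.example.com/modern-data-stack-overview",
    "https://blog.example.com/data-mesh-in-practice",
    "https://eng.example.org/2021/modern-data-stack-costs",
]


def domain_counts(links):
    """Count how many links point at each domain (host name)."""
    return Counter(urlparse(link).netloc.lower() for link in links)


def url_tokens(link):
    """Split the URL path into lowercase alphanumeric tokens."""
    path = urlparse(link).path.lower()
    return [token for token in re.split(r"[^a-z0-9]+", path) if token]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def url_ngram_counts(links, n):
    """Count n-grams across the path tokens of all links."""
    counts = Counter()
    for link in links:
        counts.update(ngrams(url_tokens(link), n))
    return counts


if __name__ == "__main__":
    print(domain_counts(SAMPLE_LINKS).most_common(10))
    print(url_ngram_counts(SAMPLE_LINKS, 2).most_common(5))
```

Counting on the host name groups articles by publisher, while the 2-gram counts surface phrases such as "data stack" that repeat across URL slugs. The same ngrams helper could be applied to crawled blog text once it has been tokenized.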
How do we do it?
Thanks to open-source libraries, text analytics has become much simpler to run. The libraries I used are:
YAKE
YAKE! is a lightweight, unsupervised automatic keyword extraction method that relies on statistical text features extracted from a single document to select its most important keywords.
GitHub: https://github.com/LIAAD/yake
Streamlit.io
Streamlit.io is the fastest way to build and share data apps. Streamlit turns data scripts into shareable web apps in minutes, all in Python, and hosting & sharing the app is free!!! I hosted the data engineering analytics results on Streamlit!!! If you've not tried streamlit.io, I highly recommend trying it out.
aiostream
aiostream provides a collection of stream operators that can be combined to create asynchronous pipelines of operations. It tremendously reduces the boilerplate of writing async code in Python and speeds up the analysis, since scraping each link sequentially is time-consuming.
GitHub: https://github.com/vxgmichel/aiostream
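To show why running the link scraping concurrently matters, here is a minimal sketch using only the standard-library asyncio module. It does not use aiostream's operators and is not the code from the analysis repo; fetch_page is a hypothetical stand-in that sleeps instead of making an HTTP request, and the concurrency limit of 5 is an arbitrary assumption.

```python
import asyncio


async def fetch_page(url):
    """Hypothetical stand-in for an HTTP fetch; sleeps instead of using the network."""
    await asyncio.sleep(0.01)
    return f"content of {url}"


async def crawl(urls, limit=5, fetch=fetch_page):
    """Fetch all urls concurrently, with at most `limit` fetches in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        # The semaphore caps how many fetches run at the same time.
        async with semaphore:
            return url, await fetch(url)

    results = await asyncio.gather(*(bounded(url) for url in urls))
    return dict(results)


if __name__ == "__main__":
    links = [f"https://example.com/post-{i}" for i in range(20)]
    pages = asyncio.run(crawl(links))
    print(len(pages), "pages fetched")
```

With a real HTTP client plugged in as fetch, the total crawl time is driven by the slowest batches rather than the sum of all requests. A library such as aiostream expresses the same pattern as a pipeline of stream operators with less hand-written plumbing.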
What are the Findings of the Trend Analysis?
The 1-gram & 2-gram analysis mostly shows the technologies & tools that data engineering teams are using. Note: You can see the complete analysis here https://share.streamlit.io/ananthdurai/dataengweekly-analysis/main/dataengweekly_analysis/app.py
🏃🏽 Prediction #1: Metadata management will become mainstream
We've seen several new companies enter data discovery & metadata management. In 2020, eight editions of Data Engineering Weekly featured stories about companies adopting data discovery solutions, but in 2021 only three editions did. Data Engineering Weekly also published a metadata edition in 2020. My interpretation is that data discovery has reached the "Late Majority" stage of the adoption curve. Collibra raised $250M in funding. Doubling the valuation in one year is a classic sign of the industry's maturity.
🧑🏽🦯 Prediction #2: Data Mesh principles will get adopted more
We have indeed seen companies such as Adevinta & Intuit write about their adoption of Data Mesh principles. We've also seen the principles redefined and adopted from each individual implementor's perspective. The lack of tooling & standards makes Data Mesh more of a loosely held idealistic view than a breakthrough in data engineering. So I would still place it at the "Early Adopters" stage.
🐣 Prediction #3: LakeHouse Architecture plays a significant role in data architecture
One of the (not) surprising findings from the N-Gram analysis is the number of mentions of the modern data stack. The modern data stack is a collection of cloud-hosted data platforms that run on cloud databases like Snowflake, Redshift & BigQuery. Though we have seen companies like Adobe, Uber & Netflix talk about their adoption of Iceberg & Apache Hudi, the popularity of the modern data stack suggests that companies prefer commercial solutions. We've seen the commercialization of LakeHouse infrastructure in 2021 with companies like Tabular for Iceberg, the EMR offering of Apache Hudi, and the benchmark street fights between Snowflake & Databricks.
I would say LakeHouse is still in the "Innovators" phase and still has a long way to go.
Top 10 most featured companies:
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.