Data Engineering Weekly - A Year in Review of 2021

Trends, Predictions, Analysis & More!!!

Nov 28, 2021

Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up today to get 5M events/month for free through the end of 2022.


Yes, this week's edition is slightly different from the usual. As we near the end of 2021, I wanted to step back and review the milestones Data Engineering Weekly reached this year, along with some insights from an analysis of the articles we shared.

You can see the complete analysis & code here:
https://share.streamlit.io/ananthdurai/dataengweekly-analysis/main/dataengweekly_analysis/app.py
GitHub: https://github.com/ananthdurai/dataengweekly-analysis

Growth of Data Engineering Weekly in 2021

Data Engineering Weekly came a long way this year. The number of subscribers grew by an impressive 293%.🎉🥳🎈 Thank you all for your kind support and for reading & sharing the newsletter.

Community contribution in 2021

In 2021, we created a GitHub repo for our readers to contribute articles and share their views about the latest developments in data engineering. If you have not contributed before, here is the repo:

https://github.com/ananthdurai/dataengineeringweekly

Sponsorship & Support

Data Engineering Weekly also got title sponsorship from RudderStack and link sponsorship from MonteCarloData. Thank you both, RudderStack & MonteCarloData.

Data Engineering Trends Prediction

Towards the end of 2020, Data Engineering Weekly published Back To The Future: Data Engineering Trends 2020 & Beyond, a look back at the latest developments in data engineering in 2020 and thoughts on 2021 and beyond. If you’ve not read it, here is the post:

Back To The Future: Data Engineering Trends 2020 & Beyond

Welcome to the 23rd edition of Data Engineering Weekly. This week's edition is a year-end special edition where we will take a more in-depth look at the trends and emerging patterns in data engineering in 2020. I divided the trends into the following categories…

Ananth Packkildurai

I recently gave a talk on the emerging trends at Crunch Conf 2021. Here are the slides from the talk.

https://speakerdeck.com/vananth22/back-to-the-future-emerging-trends-in-data-engineering

What happened to the trends prediction?

Here is a summary of the top 3 trends that we talked about in the data engineering newsletter.

Metadata management will become mainstream. The data lineage, quality, and discovery tools will merge into a unified data management platform.
Data Mesh principles will get adopted more and drive a unified data management platform.
Lakehouse systems like Hudi, Iceberg, and Delta Lake will significantly shape the data engineering architecture.

As we approach the end of 2021, let’s step back and see what happened to the predictions.

How can we measure the trends?

The idea is simple: let’s take all the articles shared in Data Engineering Weekly and run a simple N-Gram analysis on them to see if we can discover any insights. The N-Gram analysis runs on three parts of the content.

Domain Name Analysis

The purpose is to find which companies publish the most data engineering articles.
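A minimal sketch of how such a domain count could look, using only the Python standard library. The helper names (`domain_of`, `top_domains`) and the sample links are hypothetical and are not taken from the analysis repo.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Return the host of a URL, lower-cased and without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def top_domains(urls, n=10):
    """Count how often each domain appears and return the n most common."""
    return Counter(domain_of(u) for u in urls).most_common(n)


# Hypothetical sample links, for illustration only.
sample_urls = [
    "https://www.example.com/blog/data-mesh-adoption",
    "https://example.com/blog/metadata-platform",
    "https://eng.example.org/2021/lakehouse-notes",
]
print(top_domains(sample_urls))
```

Stripping the `www.` prefix keeps the same publisher from being counted twice. Subdomains such as `eng.example.org` are left as they are here; whether to fold them into the parent domain is a judgment call.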

N-Gram Analysis on URL

Article URLs often contain the article's keywords. The N-Gram analysis on the URLs leads to some exciting trend insights.
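As a rough illustration, an N-Gram count over URL slugs might be sketched like this. The tokenizing rule (keep alphabetic words from the URL path) and the sample links are my assumptions for the example, not necessarily what the published analysis does.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def url_tokens(url):
    """Split the URL path into lower-case alphabetic words."""
    path = urlparse(url).path.lower()
    return re.findall(r"[a-z]+", path)


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def url_ngram_counts(urls, n=2):
    """Count the n-grams found across all URL paths."""
    counts = Counter()
    for url in urls:
        counts.update(ngrams(url_tokens(url), n))
    return counts


# Hypothetical sample links, for illustration only.
sample_urls = [
    "https://example.com/blog/modern-data-stack-overview",
    "https://example.org/posts/why-modern-data-stack",
]
bigram_counts = url_ngram_counts(sample_urls, n=2)
print(bigram_counts.most_common(2))
```

N-Grams are built per URL, so a phrase never spans two different links. In practice you would probably also drop generic path words such as "blog" or "posts" before counting.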

N-Gram Analysis on Blog Contents

Lastly, we crawled all the blog content published to understand the keyword trends.

How do we do it?

Thanks to open-source libraries, text analytics has become much simpler to run. The libraries I used are:

Yake

YAKE! is a lightweight, unsupervised automatic keyword extraction method that uses statistical text features extracted from single documents to select the most important keywords.

GitHub: https://github.com/LIAAD/yake

Streamlit.io

Streamlit.io is the fastest way to build and share data apps. It turns data scripts into shareable web apps in minutes, all in Python, and hosts & shares them for free!!! I hosted the Data Engineering Weekly analysis results on Streamlit!!! If you've not tried streamlit.io, I highly recommend trying it out.

aiostream

aiostream provides a collection of stream operators that can be combined to create asynchronous pipelines of operations. It tremendously reduces the boilerplate of writing async code in Python and speeds up the analysis, since scraping each link sequentially is time-consuming.

GitHub: https://github.com/vxgmichel/aiostream
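To show why concurrency helps here, below is a small sketch of bounded concurrent fetching. Note that it uses plain `asyncio` from the standard library as a stand-in, not aiostream's operators, and that `fetch` is a hypothetical placeholder that sleeps instead of making a real HTTP request.

```python
import asyncio


async def fetch(url):
    """Placeholder for a real HTTP request: waits briefly, returns fake text."""
    await asyncio.sleep(0.01)
    return f"content of {url}"


async def crawl(urls, limit=5):
    """Fetch all URLs concurrently with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded_fetch(url):
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


# Hypothetical links, for illustration only.
urls = [f"https://example.com/post-{i}" for i in range(20)]
pages = asyncio.run(crawl(urls))
print(len(pages))
```

With 20 links and a limit of 5, the waits overlap in batches instead of running one after another. The semaphore is the part that keeps the crawler polite towards the sites it visits.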

What are the Findings of the Trend Analysis?

The 1-gram & 2-gram analysis mostly shows the technologies & tools that data engineering teams are using. Note: You can see the complete analysis here https://share.streamlit.io/ananthdurai/dataengweekly-analysis/main/dataengweekly_analysis/app.py


🏃🏽 Prediction #1: Metadata management will become mainstream

We've seen several new companies enter data discovery & metadata management. Eight editions of Data Engineering Weekly in 2020 featured stories about companies adopting data discovery solutions, but in 2021 only three did. Data Engineering Weekly also published a metadata edition in 2020. My interpretation is that data discovery has reached the "Late Majority" stage of the adoption curve. Collibra raised $250M in funding. Doubling the valuation in one year is a classic sign of the industry's maturity.

🧑🏽‍🦯 Prediction #2: Data Mesh principles will get adopted more

We have indeed seen companies such as Adventa & Intuit write about their adoption of Data Mesh principles. We've also seen the principles redefined and adopted from each individual implementor's perspective. The lack of tooling & standards makes the Data Mesh principles more of a loosely held, idealistic view than a breakthrough in data engineering. So I would still place Data Mesh at the "Early Adopters" stage.

🐣 Prediction #3: LakeHouse Architecture plays a significant role in data architecture

One of the (not) surprising findings from the N-Gram analysis is the number of mentions of the modern data stack. The modern data stack is a collection of cloud-hosted data platforms that run on cloud databases like Snowflake, Redshift & BigQuery. Though we have seen companies like Adobe, Uber & Netflix talk about their adoption of Iceberg & Apache Hudi, the adoption of the modern data stack suggests that companies prefer commercial solutions. We've also seen the commercialization of LakeHouse infrastructure in 2021, with companies like Tabular for Iceberg, the EMR offering of Apache Hudi, and the benchmark street fights between Snowflake & Databricks.

I would say LakeHouse is still in the "Innovators" stage and has a long way to go.

Top 10 most featured companies:

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
