Back To The Future: Data Engineering Trends 2020 & Beyond
A look back at the major developments in data engineering in 2020, with thoughts on 2021 and beyond
Welcome to the 23rd edition of Data Engineering Weekly. This week's edition is a year-end special, where we take a more in-depth look at the trends and emerging patterns in data engineering in 2020. I divided the trends into categories, each closing with a "2021 & Beyond" outlook.
I hope you enjoy the data engineering trends 2020, and please share your thoughts in the comments.
2021 Top 3 Predictions
I know this is a long article with many links and references. TL;DR: if I had to make a top 3 prediction for 2021 and beyond, here they are.
Metadata management will become mainstream. The data lineage, data quality, and data discovery tools will merge into a unified data management platform.
Data Mesh principles will get adopted more and drive a unified data management platform.
Lakehouse systems like Hudi, Iceberg, and Delta Lake will play a significant role in shaping data engineering architecture.
Now, let’s go deep into each category and see the trends and predictions. Happy reading.
Managed Data Infrastructure & Serverless Computing
In 2020, we saw the cloud platforms continue adopting open-source data infrastructure solutions, with the adoption growing from AWS EMR, Google Cloud Dataproc, and Azure HDInsight to, most recently, AWS-managed Airflow.
Introducing Amazon Managed Workflows for Apache Airflow (MWAA)
Though opinions differ on cloud platforms packaging open source, cloud-managed infrastructure certainly carries many advantages for consumers, who can quickly adopt complex infrastructure and focus on their business problems.
2021 & Beyond
The rise of serverless architecture is a particularly interesting trend in data engineering. This article summarizes the serverless DataOps trend:
Dawn of DataOps: Can We Build a 100% Serverless ETL Following CI/CD Principles?
Google Cloud launched Google Cloud Workflows, a serverless orchestration engine.
Get to know Workflows, Google Cloud’s serverless orchestration engine
It will be exciting to watch how managed data infrastructure and the rise of serverless computing merge in 2021.
Cloud Data Warehouse Systems
At the beginning of the 2010s, tightly coupling compute and storage was the strategy for running large-scale data processing engines. 2019 is when the industry finally declared that the old way of thinking about data processing no longer works and acknowledged that the cloud data warehouse system is the way to go.
Hadoop is Dead. Long live Hadoop.
In 2020, Snowflake's successful IPO reaffirmed that cloud data warehouse systems are the future. The S3 strong read-after-write consistency guarantee is another significant step toward adopting object storage for cloud data warehouse systems.
Amazon S3 Update – Strong Read-After-Write Consistency
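To make the consistency guarantee concrete, here is a toy Python sketch (all class and variable names are hypothetical, and this is a simplified model, not how S3 is implemented) contrasting an eventually consistent store, where a read after a write may return stale data, with a strongly consistent one, where a read always reflects the latest write — the guarantee S3 now provides:

```python
class EventuallyConsistentStore:
    """Toy model: reads may return a stale value until replication settles."""
    def __init__(self):
        self.committed = {}   # fully replicated values
        self.pending = {}     # recent writes not yet visible everywhere

    def put(self, key, value):
        self.pending[key] = value

    def replicate(self):
        # In a real system this happens asynchronously, after some delay.
        self.committed.update(self.pending)
        self.pending.clear()

    def get(self, key):
        # A stale replica may not see the pending write yet.
        return self.committed.get(key)


class StronglyConsistentStore(EventuallyConsistentStore):
    """Strong read-after-write: a GET always reflects the latest PUT."""
    def get(self, key):
        return self.pending.get(key, self.committed.get(key))


eventual = EventuallyConsistentStore()
eventual.put("part-0001.parquet", b"v2")
stale_read = eventual.get("part-0001.parquet")   # may miss the recent write

strong = StronglyConsistentStore()
strong.put("part-0001.parquet", b"v2")
fresh_read = strong.get("part-0001.parquet")     # always the latest write
```

Before the change, query engines on S3 needed workarounds (like consistent listing layers) for exactly the stale-read case the first class models.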
2021 & Beyond
The cloud data warehouse systems will continue to dominate and increase in adoption in 2021 and beyond. It will be interesting to watch how tightly the cloud data warehouse systems integrate with data management systems.
Cost Optimization
Cloud data warehouse systems and managed data infrastructure add pressure to optimize the cost of operating the data warehouse. Netflix writes about cost optimization strategies for its data warehouse system.
Byte Down: Making Netflix’s Data Infrastructure Cost-Effective
At the same time, GPU-accelerated workloads can provide a strategic business advantage. Pinterest and NVIDIA shared how Pinterest uses GPU acceleration for visual search.
Pinterest Trains Visual Search Faster with Optimized Architecture on NVIDIA GPUs
2021 & Beyond
I added cost optimization as a separate section since cost optimization is often an afterthought. The unpredictability of object storage egress and storage costs, handling cold vs. hot data, and the need for specialized hardware for specific workloads will be the norm in 2021 and beyond.
Alluxio is one solution I am aware of that provides tiered data processing capabilities, though it is not tuned for cost optimization.
Accelerate Spark and Hive Jobs on AWS S3 by 10x with Alluxio as a Tiered Storage Solution
It will be interesting to see data processing frameworks like Spark and Flink adopt cost optimization as a first-class optimization model, caching frequently used datasets and becoming aware of specialized workloads.
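As a rough illustration of the tiered-storage idea behind tools like Alluxio, here is a toy Python sketch (not any real API; all names are made up) that keeps recently read datasets in a small, fast hot tier backed by a cheap cold tier:

```python
from collections import OrderedDict

class TieredStore:
    """Toy hot/cold tiering: frequently read datasets stay on fast (expensive)
    storage; everything else lives on cheap object storage."""
    def __init__(self, hot_capacity):
        self.hot = OrderedDict()   # LRU cache standing in for SSD/memory
        self.cold = {}             # standing in for S3/GCS
        self.hot_capacity = hot_capacity

    def write(self, name, data):
        self.cold[name] = data     # the cold tier is the source of truth

    def read(self, name):
        if name in self.hot:                  # hot hit: cheap and fast
            self.hot.move_to_end(name)
            return self.hot[name]
        data = self.cold[name]                # cold read: egress/latency cost
        self.hot[name] = data                 # promote on access
        if len(self.hot) > self.hot_capacity:
            self.hot.popitem(last=False)      # evict least-recently-used
        return data


store = TieredStore(hot_capacity=2)
for name in ("a", "b", "c"):
    store.write(name, f"dataset-{name}")
store.read("a"); store.read("b"); store.read("a"); store.read("c")
# "b" was evicted when "c" was promoted; "a" stayed hot because it was reused
```

A cost-aware framework would extend this with egress pricing and access-frequency statistics when deciding what to promote or evict.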
Lakehouse Systems
The separation of compute and storage and scalable object storage like S3 drove the adoption of data lake principles in early 2019. One of the remaining challenges in adopting object storage is the lack of transaction guarantees. Support for ACID transactions, data versioning, auditing, indexing, caching, and query optimization are vital characteristics for building large-scale data systems.
In 2020, we saw the emergence of lakehouse frameworks like Databricks Delta Lake, Apache Hudi, and Apache Iceberg.
Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics
Adobe shared its Iceberg adoption story
Iceberg at Adobe
Uber writes about its journey with Apache Hudi, and Amazon EMR now offers Hudi as part of the platform.
Building a Large-scale Transactional Data Lake at Uber Using Apache Hudi
Apply record level changes from relational databases to Amazon S3 data lake using Apache Hudi on Amazon EMR and AWS Database Migration Service
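To illustrate the record-level upsert and snapshot-versioning semantics these frameworks provide, here is a deliberately simplified, in-memory Python sketch — a toy model, not the actual Hudi, Iceberg, or Delta Lake API:

```python
import copy

class ToyLakehouseTable:
    """Toy sketch of two lakehouse features: record-level upserts keyed by a
    primary key, and snapshot versioning for time travel."""
    def __init__(self):
        self.snapshots = [{}]              # list of immutable table versions

    def upsert(self, records, key="id"):
        new = copy.deepcopy(self.snapshots[-1])
        for r in records:
            new[r[key]] = r                # insert new keys, overwrite existing
        self.snapshots.append(new)         # commit atomically as a new snapshot

    def read(self, version=None):
        version = len(self.snapshots) - 1 if version is None else version
        return sorted(self.snapshots[version].values(), key=lambda r: r["id"])


t = ToyLakehouseTable()
t.upsert([{"id": 1, "city": "NYC"}, {"id": 2, "city": "SF"}])
t.upsert([{"id": 2, "city": "Oakland"}, {"id": 3, "city": "LA"}])
latest = t.read()                # id 2 updated in place, id 3 inserted
previous = t.read(version=1)     # time travel back to the first commit
```

Real lakehouse tables implement the same contract against immutable object-store files, using transaction logs or manifest metadata instead of in-memory copies.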
2021 & Beyond
Iceberg’s version 2 support for row-level upserts is another interesting development to watch in 2021.
The lakehouse systems continue to mature and will play a major role in shaping data engineering architecture. It will be interesting to watch how lakehouse systems complement or compete with the likes of Snowflake and Redshift.
Lambda vs. Kappa vs. Lambda-less
Managing real-time and batch computation while providing one integrated view of a dataset remains the main challenge in data processing.
Pinterest writes about some of the complications of the Lambda architecture and its migration journey to the Kappa architecture:
Pinterest Visual Signals Infrastructure: Evolution from Lambda to Kappa Architecture
LinkedIn took an interesting approach with a Lambda-less model:
From Lambda to Lambda-less: Lessons learned
2021 & Beyond
In principle, there is no real-time vs. batch; it is all about the window we process. In practice, that is easier said than done. In 2021 and beyond, I hope we will see more innovative solutions in this space.
Apache Beam is an excellent attempt to bring the two models closer. The development of Spark Streaming and Apache Flink’s recent batch computing support are some of the trends to watch.
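The "batch is just a very large window" idea can be sketched in a few lines of Python (a toy model; the function and event names are hypothetical):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Count events per key per tumbling window. With a window as large as the
    whole dataset, the result collapses to a plain batch aggregation."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# (timestamp_seconds, event_type) pairs
events = [(1, "click"), (3, "click"), (7, "view"), (12, "click")]

streaming_view = tumbling_window_counts(events, window_seconds=5)
# → {(0, 'click'): 2, (5, 'view'): 1, (10, 'click'): 1}

batch_view = tumbling_window_counts(events, window_seconds=10**9)
# one giant window is classic batch: {(0, 'click'): 3, (0, 'view'): 1}
```

This is the unification Beam's model makes explicit: the same aggregation logic, parameterized only by windowing (and, in real systems, by triggers and late-data handling, which this sketch omits).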
Streaming SQL Engines & OLAP Engines
Real-time computing and insights are critical for many businesses. Event sourcing is a well-established design pattern, and it raises the question of the decade: do we join streams and compute business metrics directly, or feed everything into OLAP databases and query there?
Confluent writes about the KSQL materialization process.
How Real-Time Materialized Views Work with ksqlDB, Animated
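The core idea of a materialized view in ksqlDB — updating a stored aggregate on every event so reads never rescan the log — can be sketched in plain Python (a toy model, not the ksqlDB API; names are made up):

```python
class MaterializedCount:
    """Toy incrementally maintained materialized view: each event updates the
    stored aggregate, so point lookups never rescan the event log."""
    def __init__(self):
        self.log = []        # append-only event log (the stream)
        self.view = {}       # materialized per-key counts

    def ingest(self, key):
        self.log.append(key)                           # the stream of record
        self.view[key] = self.view.get(key, 0) + 1     # incremental update

    def lookup(self, key):
        return self.view.get(key, 0)   # O(1), answered from the view


mv = MaterializedCount()
for page in ["home", "pricing", "home", "home"]:
    mv.ingest(page)
hits = mv.lookup("home")   # answered from the view, not by scanning the log
```

The trade-off against an OLAP engine is visible even here: the view answers one pre-defined question instantly, but any new question requires replaying the log.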
Materialize writes about joins in detail.
In the OLAP engine space, Druid, ClickHouse, and Pinot are adding features and improving operational efficiency. Apache Pinot is an impressive OLAP engine that gained momentum in 2020. Uber shared its experience operating Apache Pinot at scale.
Operating Apache Pinot @ Uber Scale
2021 & Beyond
Though streaming SQL engines and OLAP engines solve similar problems, I think there is a fundamental difference. Streaming SQL engines are good for pre-defined analytics: write a query once and run it continuously. OLAP engines are good for interactive analytics, where the analytical queries are unknown while building the datasets.
In 2021 and beyond, I expect tighter integration between streaming SQL engines like ksqlDB and OLAP engines like Pinot.
Data Quality & Metadata Management
Poor data quality costs an estimated $3.1 trillion per year in the USA alone, equating to 16.5% of GDP. Data quality is critical for developing a data pipeline, and your ML model is only as good as the quality of its data.
Why data quality is key to successful ML Ops
We’ve seen both Microsoft and Airbnb write about how their data quality efforts improved organizational decision-making.
Data Quality at Airbnb - Part 1 — Rebuilding at Scale
Data Quality at Airbnb - Part 2 — A New Gold Standard
We have seen multiple tools and systems emerge for data quality, and this is a pretty good summary of the data quality ecosystem.
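At their core, most of these tools run declarative checks over datasets. A minimal Python sketch (hypothetical function, column, and dataset names) of two common checks — nulls in required columns and uniqueness of a key:

```python
def run_quality_checks(rows, required, unique_key):
    """Minimal data quality gate: null checks on required columns and a
    uniqueness check on the primary key. Returns a list of failures."""
    failures = []
    seen = set()
    for i, row in enumerate(rows):
        for col in required:
            if row.get(col) in (None, ""):
                failures.append(f"row {i}: null '{col}'")
        key = row.get(unique_key)
        if key in seen:
            failures.append(f"row {i}: duplicate {unique_key}={key}")
        seen.add(key)
    return failures


rows = [
    {"user_id": 1, "email": "a@example.com"},
    {"user_id": 2, "email": None},            # null email
    {"user_id": 1, "email": "c@example.com"}, # duplicate key
]
failures = run_quality_checks(rows, required=["user_id", "email"],
                              unique_key="user_id")
```

A pipeline would run such a gate after each stage and fail the run (or quarantine rows) when `failures` is non-empty, instead of letting bad data flow downstream.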
One of the most remarkable trends of 2020 in data engineering is the emerging tooling and infrastructure to manage metadata at scale. I shared some of my thoughts about the importance of metadata in the past.
In 2020, we saw many great articles from companies across the industry sharing their data discovery and metadata management journeys. Data Engineering Weekly dedicated a week’s edition to metadata management.
Data Engineering Weekly #21: Metadata Edition
Metadata Day 2020 (the Metaspeak meetup) was an attempt to unify people working on metadata management. Datakin announced the Open Lineage initiative.
2021 & Beyond
I’ve included data quality and metadata management in the same section for a reason. In 2020, we saw isolated solutions for data lineage, data quality, and data discovery. A data pipeline is a complex, interdependent process of creating one dataset from another. Data lineage and data quality are two tightly coupled metadata systems that power the data discovery system.
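A lineage system, at its simplest, is a graph from each dataset to its parents; walking it upstream is how a quality failure is traced to its source. A minimal Python sketch (all dataset names are hypothetical):

```python
def upstream(lineage, dataset):
    """Walk the lineage graph to find every dataset a given dataset depends on,
    directly or transitively, so a quality failure can be traced upstream."""
    seen, stack = set(), [dataset]
    while stack:
        for parent in lineage.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# child -> direct parents (hypothetical dataset names)
lineage = {
    "revenue_report": ["orders_clean"],
    "orders_clean": ["orders_raw", "currency_rates"],
}
deps = upstream(lineage, "revenue_report")
# → {'orders_clean', 'orders_raw', 'currency_rates'}
```

Invert the edges and the same traversal answers the discovery-side question: "which downstream reports break if this table's quality check fails?"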
In 2021 and beyond, I expect all three problem spaces to merge and emerge as one unified data management platform that can provide data quality, lineage, and discovery service out of the box.
Data Mesh
In 2020, Data Mesh emerged as the de facto set of principles for scaling data management as an organization grows. Thoughtworks writes about the data mesh principles:
How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
Data Mesh Principles and Logical Architecture
We saw a number of companies start to adopt the data mesh principles and write about it.
Data Mesh in Practice - How Europe’s Leading Online Platform For Fashion Goes Beyond Data Lake
Does the Data Mesh principle apply to all levels of an organization? I wrote a simplified explanation of data mesh.
Data Mesh Simplified: A Reflection Of My Thoughts On Data Mesh
2021 & Beyond
The data domain ownership narrated in the Data Mesh principles is a scalable approach to data management.
In 2021 we will see accelerated adoption of the data mesh principles, and it will further push the vision of one integrated data management system.
DBT & Workflow Orchestration
I added DBT as a trend on its own. Still, the fundamental pattern behind DBT's success is that the industry has come to appreciate and embrace SQL as the best data abstraction for most data engineering workloads. The success of DBT is also primarily driven by the success of the cloud data warehouse systems and the emerging data lake 3.0 systems.
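DBT's core mechanic — SQL models referencing each other via ref(), from which the tool derives both physical table names and a build order — can be sketched with a toy Python resolver (not DBT's actual implementation; the model and schema names are made up):

```python
import re

def resolve_refs(models):
    """Toy version of dbt's core idea: models are SQL SELECTs referring to
    each other via ref('name'); resolving refs yields rendered SQL plus a
    dependency-respecting build order."""
    deps = {name: re.findall(r"ref\('(\w+)'\)", sql)
            for name, sql in models.items()}

    order, seen = [], set()
    def visit(name):                     # depth-first topological sort
        for d in deps[name]:
            if d not in seen:
                visit(d)
        if name not in seen:
            seen.add(name)
            order.append(name)
    for name in models:
        visit(name)

    # replace ref('x') with a physical table name like analytics.x
    rendered = {name: re.sub(r"ref\('(\w+)'\)", r"analytics.\1", sql)
                for name, sql in models.items()}
    return order, rendered


models = {
    "daily_revenue": "select day, sum(amount) from ref('orders_clean') group by day",
    "orders_clean": "select * from raw.orders where amount is not null",
}
order, rendered = resolve_refs(models)
# order == ['orders_clean', 'daily_revenue']
```

Because the dependency graph falls out of the SQL itself, DBT can also attach tests, documentation, and lineage to each node — which is why it pairs so naturally with the orchestration engines discussed below.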
I tweeted some time back about the significant advantage of DBT as a data processing orchestrator.
Along the same lines, here are some articles sharing their experiences with DBT.
Making your dbt models more useful with Census
Understanding the scopes of dbt tags
How to Build a Production Grade Workflow with SQL Modelling
2021 & Beyond
In 2021, I expect the trend to continue, and we will see the likes of Databricks and AWS launch their own version of DBT or adopt it. General-purpose data orchestration engines like Airflow, Dagster, and Prefect already integrate well with DBT.
Dagster and dbt: Better Together
It will be interesting to see if the general-purpose orchestration engines come up with their own DBT versions.
You may not need Airflow…. yet
shows how to build a data pipeline without Airflow, and
Building a Scalable Analytics Architecture with Airflow and dbt
makes me wonder whether it is worth going through all the hacks to make it work.
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.