Back To The Future: Data Engineering Trends 2020 & Beyond

A look back at the latest developments in data engineering in 2020, and thoughts on 2021 and beyond

Dec 27, 2020

Welcome to the 23rd edition of Data Engineering Weekly. This week's edition is a year-end special in which we take a more in-depth look at the trends and emerging patterns in data engineering in 2020. I have divided the trends into the following categories.

Data Infrastructure
Data Architecture
Data Management

I hope you enjoy the data engineering trends 2020, and please share your thoughts in the comments.

2021 Top 3 Predictions

I know this is a long article with many links and references. TL;DR: if I had to make three top predictions for 2021 and beyond, here they are.

Metadata management will become mainstream. The data lineage, data quality, and data discovery tools will merge into a unified data management platform.
Data Mesh principles will see wider adoption and drive a unified data management platform.
Lakehouse systems like Hudi, Iceberg, and Delta Lake will play a significant role in shaping the data engineering architecture.

Now, let’s go deep into each category and see the trends and predictions. Happy reading.

Data Infrastructure

`Managed Data Infrastructure & Serverless Computing`

In 2020, we saw the cloud platforms continue to adopt open-source data infrastructure solutions, with adoption growing from AWS's EMR, Google Cloud Dataproc, and Azure HDInsight to the recent AWS managed Airflow.

Introducing Amazon Managed Workflows for Apache Airflow (MWAA)

Though opinions differ on cloud platforms packaging open-source software, cloud-managed infrastructure certainly carries many advantages: consumers can quickly adopt complex infrastructure and focus on their business problems.

2021 & Beyond

The rise of serverless architecture is a particularly interesting trend in data engineering. The following article summarizes the serverless DataOps trend.

Dawn of DataOps: Can We Build a 100% Serverless ETL Following CI/CD Principles?

Google Cloud launched Workflows, a serverless orchestration engine.

Get to know Workflows, Google Cloud’s serverless orchestration engine

In 2021, it will be exciting to watch how managed data infrastructure and the rise of serverless computing merge.

`Cloud Data Warehouse`

In the early 2010s, tightly coupling compute and storage was the strategy for running large-scale data processing engines. 2019 was when the industry finally declared that the old way of thinking about data processing no longer works and acknowledged that the cloud data warehouse is the way to go.

Hadoop is Dead. Long live Hadoop.

In 2020, Snowflake's successful IPO reaffirmed that cloud data warehouse systems are the future. The S3 strong read-after-write consistency guarantee is a significant step toward adopting object storage for cloud data warehouse systems, where it has not been adopted already.

Amazon S3 Update – Strong Read-After-Write Consistency

2021 & Beyond

Cloud data warehouse systems will continue to dominate, and their adoption will increase in 2021 and beyond. It will be interesting to watch how tightly cloud data warehouse systems integrate with data management systems.

`Cost Optimization`

Cloud data warehouse systems and managed data infrastructure add pressure to optimize the cost of operating the data warehouse. Netflix writes about cost optimization strategies for its data warehouse system.

Byte Down: Making Netflix’s Data Infrastructure Cost-Effective

At the same time, GPU-accelerated workloads can provide a strategic business advantage. Pinterest and NVIDIA shared how Pinterest uses GPU acceleration for visual search.

Pinterest Trains Visual Search Faster with Optimized Architecture on NVIDIA GPUs

2021 & Beyond

I added cost optimization as a separate section since cost optimization is often an afterthought. Dealing with the unpredictability of object storage egress and storage costs, handling cold vs. hot data, and the need for specialized hardware for specific workloads will be the norm in 2021 and beyond.

Alluxio is one solution I am aware of that provides tiered data processing capabilities, though it is not tuned for cost optimization.

Accelerate Spark and Hive Jobs on AWS S3 by 10x with Alluxio as a Tiered Storage Solution

It will be interesting to see how data processing frameworks like Spark and Flink adopt cost as a first-class optimization target, cache frequently used datasets, and become aware of specialized workloads.

Data Architecture

`Lakehouse`

The separation of compute and storage, along with scalable object storage like S3, increased the adoption of data lake principles in early 2019. One remaining challenge in adopting object storage is the lack of transaction guarantees. Support for ACID transactions, data versioning, auditing, indexing, caching, and query optimization is vital for building large-scale data systems.
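To make the transaction and versioning requirements concrete, here is a minimal in-memory sketch of the commit-log idea: every commit is a numbered log entry listing data files added and removed, and a commit succeeds only if no other writer has committed since the log was read. The names `TableLog`, `commit`, and `snapshot` are hypothetical and are not the API of Delta Lake, Hudi, or Iceberg. A real system would persist the log on object storage rather than in a Python list.

```python
from dataclasses import dataclass, field


class CommitConflict(Exception):
    """Raised when another writer committed first."""


@dataclass
class TableLog:
    # Each entry is (files_added, files_removed); entry i produces version i + 1.
    entries: list = field(default_factory=list)

    @property
    def version(self) -> int:
        return len(self.entries)

    def commit(self, read_version: int, added=(), removed=()) -> int:
        # Optimistic concurrency: reject if the log moved since we read it.
        if read_version != self.version:
            raise CommitConflict(
                f"read v{read_version}, but log is at v{self.version}"
            )
        self.entries.append((frozenset(added), frozenset(removed)))
        return self.version

    def snapshot(self, version=None) -> set:
        # Replay the log up to `version` to list live files (data versioning).
        version = self.version if version is None else version
        files = set()
        for added, removed in self.entries[:version]:
            files -= removed
            files |= added
        return files


log = TableLog()
v1 = log.commit(0, added={"part-0.parquet"})
v2 = log.commit(v1, added={"part-1.parquet"}, removed={"part-0.parquet"})
```

Reading `log.snapshot(v1)` returns the table as of the first commit, while `log.snapshot()` returns the latest state. A writer that still holds `v1` and tries to commit gets a `CommitConflict` and has to retry against the new version.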

In 2020, we noticed emerging lakehouse frameworks like Databricks Delta Lake, Apache Hudi, and Apache Iceberg.

Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics

Adobe shared its Iceberg adoption story.

Iceberg at Adobe

Uber writes about its journey with Apache Hudi, and Amazon EMR now offers Hudi as part of the platform.

Building a Large-scale Transactional Data Lake at Uber Using Apache Hudi

Apply record level changes from relational databases to Amazon S3 data lake using Apache Hudi on Amazon EMR and AWS Database Migration Service

2021 & Beyond

Iceberg’s version 2, which will support row-level upserts, is another interesting development to watch in 2021.

Lakehouse systems will continue to mature and will play a major role in shaping the data engineering architecture. It will be interesting to watch how lakehouses complement or compete with the likes of Snowflake and Redshift.

`Lambda vs. Kappa vs. Lambda-less`

Managing the real-time and batch computing and providing one integrated dataset view remains the main challenge in data processing.

Pinterest writes about some of the complications of the Lambda architecture and its migration journey to the Kappa architecture.

Pinterest Visual Signals Infrastructure: Evolution from Lambda to Kappa Architecture

LinkedIn took an interesting approach with its Lambda-less model.

From Lambda to Lambda-less: Lessons learned

2021 & Beyond

There is no real-time vs. batch; it is all about the window that we process. That is easier said than done, though. In 2021 and beyond, I hope we will see more innovative solutions in this space.
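As a small illustration of the "it is all about the window" idea, the sketch below applies the same aggregation function to an iterator with one-minute windows and to a full list with a single one-day window. The function name `tumbling_sums` and the sample events are made up for this example. It assumes events arrive in event-time order, which is exactly the assumption that breaks down in practice (late and out-of-order data), and is part of why this is easier said than done.

```python
from collections import defaultdict
from typing import Iterable, Iterator, Tuple

Event = Tuple[int, str, int]  # (event_time_seconds, key, value)


def tumbling_sums(
    events: Iterable[Event], window_seconds: int
) -> Iterator[Tuple[int, dict]]:
    """Yield (window_start, {key: sum}) each time a window closes.

    Assumes events arrive in event-time order.
    """
    current_start = None
    sums = defaultdict(int)
    for ts, key, value in events:
        start = ts - ts % window_seconds
        if current_start is not None and start != current_start:
            yield current_start, dict(sums)
            sums = defaultdict(int)
        current_start = start
        sums[key] += value
    if current_start is not None:
        yield current_start, dict(sums)  # flush the last open window


events = [(0, "a", 1), (30, "b", 2), (61, "a", 5), (119, "a", 1), (120, "b", 7)]

# "Streaming": small windows over an iterator, emitted as they close.
per_minute = list(tumbling_sums(iter(events), 60))

# "Batch": the same function with one large window over the whole dataset.
per_day = list(tumbling_sums(events, 86_400))
```

Here `per_minute` holds three windows and `per_day` holds one, and both come from the same code path. Only the window size changes.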

Apache Beam is an excellent attempt to bring the two models closer. The development of Spark Streaming and Apache Flink’s recent batch computing support are some of the trends to watch.

`Streaming SQL Engines & OLAP Engines`

Real-time computing and insights are critical for many businesses. Event sourcing is a well-established design pattern, and that brings up the question of the decade: should we join streams and compute business metrics, or feed everything into OLAP databases and query it?

Confluent writes about the ksqlDB materialization process.

How Real-Time Materialized Views Work with ksqlDB, Animated

Materialize writes about joins in detail.

Joins in Materialize

In the OLAP engine space, Druid, ClickHouse, and Pinot keep adding OLAP features and improving operational efficiency. Apache Pinot is an impressive OLAP engine that gained momentum in 2020. Uber shared its experience operating Apache Pinot at scale.

Operating Apache Pinot @ Uber Scale

2021 & Beyond

Though streaming SQL engines and OLAP engines solve similar problems, I think there is a fundamental difference. Streaming SQL engines are good for pre-defined analytics: queries that are written once and run continuously. OLAP engines are good for interactive analytics, where the analytical queries are unknown at the time the datasets are built.
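The difference can be sketched in a few lines of Python. The first class fixes the query up front and updates its result incrementally per event, which is the materialized-view style of a streaming SQL engine. The second keeps the raw rows and answers whatever grouping or filter is asked later, which is the OLAP style. Both classes and the sample events are hypothetical and only illustrate the trade-off; they say nothing about how ksqlDB, Materialize, or Pinot are implemented.

```python
from collections import defaultdict


class RevenueByCountryView:
    """Streaming-SQL style: the query is fixed up front and its result
    is updated incrementally as each event arrives."""

    def __init__(self):
        self.totals = defaultdict(float)

    def on_event(self, event: dict) -> None:
        self.totals[event["country"]] += event["amount"]


class EventStore:
    """OLAP style: keep the raw events and answer queries chosen later."""

    def __init__(self):
        self.rows = []

    def ingest(self, event: dict) -> None:
        self.rows.append(event)

    def sum_by(self, dimension: str, where=lambda row: True) -> dict:
        out = defaultdict(float)
        for row in self.rows:
            if where(row):
                out[row[dimension]] += row["amount"]
        return dict(out)


events = [
    {"country": "US", "device": "ios", "amount": 10.0},
    {"country": "IN", "device": "android", "amount": 4.0},
    {"country": "US", "device": "android", "amount": 6.0},
]

view, store = RevenueByCountryView(), EventStore()
for event in events:
    view.on_event(event)
    store.ingest(event)

# The view can only answer the question it was built for.
by_country = dict(view.totals)

# The store can answer a question nobody planned for, at scan cost.
us_by_device = store.sum_by("device", where=lambda r: r["country"] == "US")
```

The view answers its one question in constant time per lookup but cannot group by `device`, while the store can answer any such question by scanning rows at query time.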

In 2021 and beyond, I expect tighter integration between streaming SQL engines like KSQL and OLAP engines like Pinot.

Data Management

`Data Quality & Metadata Management`

Poor data quality costs an estimated $3.1 trillion per year in the USA alone, equating to 16.5% of GDP! Data quality is critical for developing a data pipeline, and your ML model is only as good as the quality of its data.

Why data quality is key to successful ML Ops

We’ve seen both Microsoft and Airbnb write about how their data quality efforts improved their organizations' decision-making processes.

Partnering for data quality

Data Quality at Airbnb - Part 1 — Rebuilding at Scale

Data Quality at Airbnb - Part 2 — A New Gold Standard

We have seen multiple tools and systems emerge for data quality, and this is a pretty good summary of the data quality ecosystem.

Data Quality — A Primer

One of the most remarkable trends of 2020 in data engineering is the emerging tooling and infrastructure to manage metadata at scale. I shared some of my thoughts about the importance of metadata in the past.

Ananth Packkildurai @ananthdurai

Data engineering is all about metadata, the more mature your metadata, the more mature your data infrastructure. #dataengineering101

Ananth Packkildurai @ananthdurai

Don’t trust data, ask for it’s lineage. Data lies, lineage exposes the lies #dataengineering101

In 2020, we saw many great articles from companies across the industry sharing their approaches to data discovery and metadata management. Data Engineering Weekly dedicated a week’s edition to metadata management.

Data Engineering Weekly #21: Metadata Edition

LinkedIn organized Metadata Day 2020 - Metaspeak Meetup as an attempt to bring together people working on metadata management. Datakin announced the OpenLineage initiative to standardize data lineage and discovery efforts.

2021 & Beyond

I’ve included data quality and metadata management in the same section for a reason. In 2020 we saw isolated solutions for data lineage, data quality, and data discovery. A data pipeline is a complex, interdependent process that creates one dataset from another. Data lineage and data quality are two tightly coupled metadata systems that power the data discovery system.
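One way to see why lineage and quality are coupled: when a quality check fails on one dataset, the lineage graph tells you which derived datasets are now suspect. The sketch below is a toy model of that idea. The `Lineage` class, the `null_rate` check, the 10% threshold, and the dataset names are all hypothetical and are not taken from any of the tools mentioned above.

```python
from collections import defaultdict, deque


class Lineage:
    """Directed graph of dataset -> datasets derived from it."""

    def __init__(self):
        self.downstream = defaultdict(set)

    def add_edge(self, source: str, derived: str) -> None:
        self.downstream[source].add(derived)

    def impacted_by(self, dataset: str) -> set:
        """All datasets derived, directly or transitively, from `dataset`."""
        seen, queue = set(), deque([dataset])
        while queue:
            node = queue.popleft()
            for child in self.downstream.get(node, ()):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return seen


def null_rate(rows: list, column: str) -> float:
    """Fraction of rows where `column` is missing or None."""
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


lineage = Lineage()
lineage.add_edge("raw.orders", "staging.orders")
lineage.add_edge("staging.orders", "marts.revenue")
lineage.add_edge("staging.orders", "marts.churn")
lineage.add_edge("raw.users", "marts.churn")

rows = [{"order_id": 1, "amount": 9.5}, {"order_id": 2, "amount": None}]

# A failed check on one dataset marks everything downstream as suspect.
check_failed = null_rate(rows, "amount") > 0.1  # assumed threshold
suspect = lineage.impacted_by("raw.orders") if check_failed else set()
```

A discovery tool built on top of both could then flag every dataset in `suspect` in its catalog, which is the kind of combined view a unified platform would provide.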

In 2021 and beyond, I expect all three problem spaces to merge and emerge as one unified data management platform that can provide data quality, lineage, and discovery service out of the box.

`Data Mesh`

In 2020, Data Mesh emerged as the de facto set of principles for scaling data management as an organization grows. Thoughtworks has written about the data mesh principles.

How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh

Data Mesh Principles and Logical Architecture

We saw a number of companies start to adopt the data mesh principles and write about it.

Data Mesh in Practice - How Europe’s Leading Online Platform For Fashion Goes Beyond Data Lake

Data Mesh @ Yelp - 2019

Do the Data Mesh principles apply to all levels of an organization? I wrote a simplified explanation of the data mesh.

Data Mesh Simplified: A Reflection Of My Thoughts On Data Mesh

2021 & Beyond

The data domain ownership described in the Data Mesh principles is the scalable approach to data management at scale.

In 2021 we will see accelerated adoption of the data mesh principles, and it will further push the vision of one integrated data management system.

`DBT & Workflow Orchestration`

I added DBT as a trend on its own. Still, the fundamental pattern behind the success of DBT is that the industry has come to appreciate and embrace SQL as the best data abstraction for most data engineering workloads. The success of DBT is also primarily driven by the success of cloud data warehouse systems and the emerging data lake 3.0 systems.

I tweeted some time back about the significant advantage of DBT as a data processing orchestrator.

Ananth Packkildurai @ananthdurai

Here are the Top 5 reasons why I consider @getdbt groundbreaking. 1. The DBT data model creates a logical separation of tables or the metadata engines like Hive meta store from the data pipeline. @ApacheAirflow tried with the tasks & operators, but not wholly successful.
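The logical separation described in the tweet can be sketched as follows: each model is just a SQL string that names its upstream models through dbt's `{{ ref('...') }}` syntax, and the run order falls out of those references instead of being hand-wired as tasks. The model names and SQL below are made up, and extracting references with a regular expression is a simplification for illustration, not how dbt itself compiles a project.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical models: name -> SQL. Dependencies are declared only via ref().
models = {
    "stg_orders": "select * from raw.orders",
    "stg_users": "select * from raw.users",
    "orders_by_user": (
        "select u.id, count(*) as n "
        "from {{ ref('stg_orders') }} o "
        "join {{ ref('stg_users') }} u on o.user_id = u.id "
        "group by u.id"
    ),
    "top_users": "select * from {{ ref('orders_by_user') }} where n > 10",
}

REF = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")


def run_order(models: dict) -> list:
    """Return model names ordered so every model comes after its refs."""
    deps = {name: set(REF.findall(sql)) for name, sql in models.items()}
    return list(TopologicalSorter(deps).static_order())


order = run_order(models)
```

Adding a new model only requires writing its SQL. Nothing in an orchestration layer has to change for it to be scheduled after its upstream models.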

Along the same lines, here are some articles that share experiences with DBT.

Why is dbt so important?

Making your dbt models more useful with Census

Understanding the scopes of dbt tags

How to Build a Production Grade Workflow with SQL Modelling

2021 & Beyond

In 2021, I expect the trend to continue, and we will see the likes of Databricks and AWS launch their own version of DBT or adopt it. General-purpose data orchestration engines like Airflow, Dagster, and Prefect already integrate well with DBT.

Dagster and dbt: Better Together

It will be interesting to see whether the general-purpose orchestration engines come up with their own version of DBT. "You may not need Airflow… yet" shows how to build a data pipeline without Airflow, and "Building a Scalable Analytics Architecture with Airflow and dbt" makes me wonder whether it is worth going through all the hacks to make it work.

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.
