Welcome to the 20th edition of the data engineering newsletter. This week's edition covers S3's strong read-after-write consistency, the Apache Pinot 0.6.0 release, ThoughtWorks' thoughts on Data Mesh principles, Adobe's experience with Iceberg, LinkedIn's journey from Lambda to Lambda-less architecture, the Financial Times' data platform journey, Shopify's SQL workflow modeling, the data producer-consumer problem, Picnic's data engineering, Teads' Spark 3.0 migration, and the rise of serverless orchestration engines.
A big announcement this week at AWS re:Invent: S3 now delivers strong read-after-write consistency for GET, PUT, and LIST operations, as well as for operations that change object tags, ACLs, or metadata. Netflix has written in the past about how the lack of strong read-after-write consistency caused significant business issues. S3Mper, EMRFS, and S3Guard are some of the frameworks that tried to work around the S3 consistency problem. I hope the nightmare is over now, and we can safely delete all those hacks!
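To make the change concrete, here is a toy sketch of the kind of "poll until the object shows up" workaround that eventual consistency used to force on pipelines. The client class and method names are hypothetical stand-ins (not boto3) so the example is self-contained; with strong read-after-write consistency, the first read after a successful PUT already sees the object, so this loop can simply be deleted.

```python
import time

class FakeS3:
    """Stand-in object store with a boto3-like shape (hypothetical, for illustration)."""
    def __init__(self):
        self._objects = {}

    def put_object(self, bucket, key, body):
        self._objects[(bucket, key)] = body

    def object_exists(self, bucket, key):
        return (bucket, key) in self._objects

def wait_until_visible(client, bucket, key, timeout_s=2.0, poll_s=0.05):
    """Poll until a freshly written key becomes readable.

    This is the workaround pattern for the old eventually consistent S3:
    a PUT could succeed while an immediate GET/LIST still missed the key.
    Under strong read-after-write consistency this retry loop is obsolete.
    """
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if client.object_exists(bucket, key):
            return True
        time.sleep(poll_s)
    return False
```

With the new consistency model, a plain read right after the write is enough; the polling helper above is exactly the kind of hack that can now be retired.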
Apache Pinot released version 0.6 this week. Support for upsert operations, an optimized Apache Spark connector, and tiered storage support are some of the exciting features to watch.
ThoughtWorks writes a follow-up article on Data Mesh Principles and the Logical Architecture. The previous article, How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh, is a highly recommended read if you missed it. I believe domain ownership and data as a product are the future of democratizing an organization's data.
Data Mesh is a great idea, but one should adopt it with caution. Here is an excellent Twitter thread on the same topic.
Martin Fowler (@martinfowler): Last year @zhamakd introduced the idea of replacing a centralized data lake with a distributed data mesh. After a year's experience she's preparing a couple of articles to expand on this. This first one focuses on four foundational principles. https://t.co/7I9dpwijFP
Adobe’s Experience Platform data lake currently processes ~1 million batches per day, which equates roughly to 13 TB of data and 32 billion events. Data management at this scale brings unique challenges in data reliability, read reliability, and scalability. Adobe writes an excellent post giving an overview of the data lake and its effective use of Apache Iceberg to manage it.
The Lambda architecture became a popular architectural style that promises both speed and accuracy in data processing by combining batch processing and stream processing. But it also has drawbacks, such as complexity and additional development and operational overhead. LinkedIn writes an excellent post on the lessons learned operating a system in the Lambda architecture and the decisions made in transitioning to a Lambda-less design.
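The dual code path at the heart of the Lambda architecture can be sketched in a few lines. This is a toy serving layer with made-up names, not LinkedIn's actual system: the batch layer periodically recomputes exact counts, the speed layer accumulates a real-time delta for events that arrived after the last batch run, and every query merges the two.

```python
from collections import Counter

# Toy Lambda-architecture serving layer (illustrative only):
# the batch layer recomputes exact per-page counts on a schedule,
# while the speed layer tracks events since the last batch run.
batch_view = Counter({"page_a": 100, "page_b": 40})  # recomputed by batch jobs
speed_view = Counter({"page_a": 3, "page_c": 1})     # updated by the stream job

def page_views(page):
    # Query-time merge of the batch and real-time views -- the dual
    # code path whose maintenance cost motivates Lambda-less designs.
    return batch_view[page] + speed_view[page]
```

Keeping `batch_view` and `speed_view` logic consistent means implementing every change twice, once per processing engine, which is exactly the overhead the Lambda-less transition tries to eliminate.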
The Financial Times, one of the world’s leading business news organizations, has been around for more than 130 years. It writes an excellent article narrating its data platform journey since 2008. It’s an exciting read to see the evolution from an external provider and SQL Server to real-time analytics on EKS infrastructure.
Shopify narrates how it built a production-grade workflow with SQL modeling on top of dbt. SQL endures as the best abstraction for ad-hoc exploration and analytical thinking. Embracing that principle, Shopify built a testing and documentation framework for its SQL workflows.
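The core pattern, a SQL transformation treated as a versioned "model" plus an automated test on its output, can be sketched without dbt itself. Below is a minimal illustration using Python's built-in sqlite3; the table and column names are invented, and in a real dbt project the model would live in a `.sql` file with tests declared in YAML.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10.0), (2, 15.5), (3, 4.5);
""")

# The "model": a SQL transformation materialized as a table,
# analogous to a dbt model file.
conn.execute("""
    CREATE TABLE order_summary AS
    SELECT COUNT(*) AS n_orders, SUM(amount) AS revenue
    FROM orders
""")

# The "test": an automated assertion on the model's output, in the
# spirit of dbt's schema tests (not_null, accepted values, etc.).
n_orders, revenue = conn.execute(
    "SELECT n_orders, revenue FROM order_summary").fetchone()
assert n_orders == 3 and revenue == 30.0
```

Running such assertions on every deploy is what turns a pile of ad-hoc queries into a production-grade, testable SQL workflow.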
The article nails some of the unsolved problems in modern data management platforms: the gap between dataset producers and consumers in data discovery, trust, and data governance. A great read that narrates the importance of data discovery tools such as Amundsen.
Data engineering is often considered the “backend of the backend” in some organizations. Picnic writes about the business challenges its data engineering team solves together with other teams. It’s exciting to see how Picnic efficiently integrated data engineering practice into core business processes such as supply chain management, delivery, and the online store.
Serverless workflow adoption is increasing, as operating a system such as Apache Airflow requires specialized skills. Both AWS and Google offer Airflow as a service, as well as alternative serverless workflow engines such as AWS Step Functions and Google Workflows. Google Cloud writes about Workflows, its serverless orchestration engine, comparing it with Airflow.
Spark 3.0 introduced some exciting optimizations, such as dynamically coalescing shuffle partitions, dynamically switching join strategies, and dynamically optimizing skew joins. Teads writes about its journey upgrading to Spark 3.0 and shares observations on how Spark 3.0 improves performance compared to Spark 2.x.
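These optimizations are part of Spark 3.0's Adaptive Query Execution (AQE). As a configuration sketch (not Teads' actual setup), the flags below enable them in a PySpark session; flag names are as of Spark 3.0, so verify them against your version's documentation.

```python
from pyspark.sql import SparkSession

# Configuration fragment: enabling Spark 3.0's Adaptive Query Execution
# and the two opt-in features it gates.
spark = (
    SparkSession.builder
    .appName("aqe-demo")
    # Master switch for AQE (also covers runtime join-strategy switching).
    .config("spark.sql.adaptive.enabled", "true")
    # Dynamically coalesce small post-shuffle partitions.
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Split skewed partitions during sort-merge joins.
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)
```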
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of any current, former, or future employer.