Welcome to the 38th edition of the data engineering newsletter. This week's release is a new set of articles that focus on orphaned analytics, macro trends in the tech industry, a comprehensive guide to data quality, SQL engine benchmarking, a lake file format comparison, Iceberg’s ACID transaction support, cloud cost management, forecasting algorithms, Apache Pinot’s theta sketches, and layering a dbt project.
Bill Schmarzo: Orphaned Analytics: The Great Destroyers of Economic Value
Is the number of developed ML models an indicator of a company's analytics prowess and maturity? What is the cost of orphaned analytics? The author describes that cost, which represents a significant operational and regulatory risk, and walks through the role of the Hypothesis Development Canvas in preventing orphaned analytics.
ThoughtWorks: Macro trends in the technology industry
ThoughtWorks writes about the macro trends in the technology industry. The fall and rise of SQL and mainstream machine learning trends are an exciting read.
https://www.thoughtworks.com/insights/blog/macro-trends-technology-industry-april-2021
Chau Vinh Loi: A Comprehensive Framework for Data Quality Management
Data quality shows the extent to which data meets users’ standards of excellence or expectations. In a comprehensive guide for data quality, the author narrates data quality dimensions, including Accuracy, Completeness, Timeliness, Consistency, and Uniqueness, and the formulas for metrics calculation.
https://towardsdatascience.com/a-comprehensive-framework-for-data-quality-management-b110a0465e83
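To make the idea of per-dimension metrics concrete, here is a minimal sketch of two of the dimensions above, Completeness and Uniqueness, as simple ratios. These are common ratio-style definitions and are an assumption on my part; they are not necessarily the exact formulas used in the article, and the function names and sample data are hypothetical.

```python
from typing import Any, Optional, Sequence


def completeness(values: Sequence[Optional[Any]]) -> float:
    """Share of values that are present (not None). Assumed definition."""
    if not values:
        return 1.0  # an empty column has nothing missing
    return sum(v is not None for v in values) / len(values)


def uniqueness(values: Sequence[Optional[Any]]) -> float:
    """Share of present values that are distinct. Assumed definition."""
    present = [v for v in values if v is not None]
    if not present:
        return 1.0
    return len(set(present)) / len(present)


# Hypothetical column: one missing value and one duplicate.
emails = ["a@example.com", None, "b@example.com", "a@example.com"]
completeness_score = completeness(emails)  # 3 of 4 values present
uniqueness_score = uniqueness(emails)      # 2 distinct out of 3 present
```

Scores like these are usually tracked per column over time, so a drop in either ratio can flag a pipeline problem before users notice it.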
Explorium: Benchmarking SQL engines for Data Serving - PrestoDb, Trino, and Redshift
Explorium writes an informative benchmark comparing Redshift, Trino, and Presto. An interesting finding is that Redshift Spectrum outperforms Trino and Presto once the results are adjusted for cost.
LakeFS: Hudi, Iceberg & Delta Lake - Data Lake Table Format Compared
LakeFS writes an exciting blog comparing the lake formats Hudi, Iceberg, and Delta Lake on platform compatibility, performance and throughput, and concurrency. The recommendations: if you are already a Databricks customer, Delta Engine brings significant improvements; if your primary pain point is managing huge tables on an object store (more than 10k partitions), Iceberg works well; if you use various query engines and require flexibility for managing mutating datasets, Hudi does the job.
https://lakefs.io/hudi-iceberg-and-delta-lake-data-lake-table-formats-compared/
Adobe: Iceberg Series - ACID Transactions at Scale on the Data Lake in Adobe Experience Platform
Write amplification increases when concurrent processes try to upsert at the same time. Adobe writes about Tombstone, its internal implementation of a row-level upsert operation on Iceberg, built to handle the reprocessing of more than 10B rows every day.
Airbnb: Achieving Insights and Savings with Cost Data
Cloud cost optimization is one of the vital aspects of platform engineering, and Airbnb's cost data foundation shares its learnings from building a pipeline, defining metrics, and designing visualizations.
https://medium.com/airbnb-engineering/achieving-insights-and-savings-with-cost-data-ec9a49fd74bc
Microsoft: Time series forecasting - Selecting algorithms
Microsoft writes the second part of the time series forecasting series, focusing on selecting the algorithms. The blog narrates a univariate forecasting engine and evaluation metrics to measure the predictions.
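As a rough illustration of what evaluation metrics for a univariate forecast look like, here is a minimal sketch of three widely used ones: MAE, RMSE, and MAPE. Which metrics the Microsoft post actually uses is not stated here, so treat this selection, the function names, and the sample numbers as assumptions.

```python
import math
from typing import Sequence


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error; penalizes large misses more than MAE."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, as a fraction (0.1 means 10%)."""
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical observations and forecasts for three periods.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
scores = {
    "mae": mae(actual, predicted),
    "rmse": rmse(actual, predicted),
    "mape": mape(actual, predicted),
}
```

Comparing candidate algorithms on the same held-out window with the same metric is what makes the selection step meaningful.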
LinkedIn: Solving for the cardinality of set intersection at scale with Pinot and Theta Sketches
LinkedIn writes about using Apache Pinot's Theta Sketches to estimate set intersection cardinality and solve the audience-reach estimation problem in production. The new solution alleviated the existing problem of data staleness by reducing data size (by approximately 80%) and bringing data size growth down from superlinear to sub-linear.
https://engineering.linkedin.com/blog/2021/pinot-and-theta-sketches
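For readers new to the idea, the following is a simplified, from-scratch sketch of how a theta sketch can estimate the size of a set intersection. It is not Pinot's API or the Apache DataSketches implementation that production systems rely on; the class, its methods, and the default `k` are hypothetical and only illustrate the principle: hash every item into [0, 1), keep the `k` smallest hashes, and scale the retained count by the sampling threshold `theta`.

```python
import hashlib
from typing import Iterable, Optional


def _hash01(item: str) -> float:
    """Map an item to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class ThetaSketch:
    """Illustrative theta sketch: retains hashes strictly below theta."""

    def __init__(self, k: int = 1024, theta: float = 1.0,
                 hashes: Optional[Iterable[float]] = None) -> None:
        self.k = k
        self.theta = theta
        self.hashes = set(hashes or ())

    def update(self, item: str) -> None:
        h = _hash01(item)
        if h >= self.theta:
            return  # outside the sampled region
        self.hashes.add(h)
        if len(self.hashes) > self.k:
            # Evict the largest hash; it becomes the new threshold.
            self.theta = max(self.hashes)
            self.hashes.discard(self.theta)

    def estimate(self) -> float:
        """Estimated distinct count: retained hashes scaled by 1/theta."""
        return len(self.hashes) / self.theta

    def intersect(self, other: "ThetaSketch") -> "ThetaSketch":
        """Sketch of the intersection: shared hashes below the smaller theta."""
        theta = min(self.theta, other.theta)
        common = {h for h in self.hashes & other.hashes if h < theta}
        return ThetaSketch(self.k, theta, common)


def build(items: Iterable[str], k: int = 1024) -> ThetaSketch:
    sketch = ThetaSketch(k)
    for item in items:
        sketch.update(item)
    return sketch


# Hypothetical audiences that overlap on 10,000 members.
audience_a = build(f"member-{i}" for i in range(20_000))
audience_b = build(f"member-{i}" for i in range(10_000, 30_000))
reach_estimate = audience_a.intersect(audience_b).estimate()
```

Each sketch stores at most `k` hashes regardless of audience size, which is where the storage savings of this family of techniques come from. The estimate is approximate, and its error grows as the intersection becomes small relative to the input sets.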
Mitchell Silverman: Layering Your Data Warehouse
How do you layer a dbt project? The author explains why layering is vital in data infrastructure and gives a complete description of the root layer, logic layer, dimension & activity layer, and reporting layer.
https://mitchellsilv-79772.medium.com/layering-your-data-warehouse-f3da41a337e5
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.