Data Engineering Weekly #38

Weekly Data Engineering Newsletter

Welcome to the 38th edition of the data engineering newsletter. This week's release is a new set of articles that focus on orphaned analytics, macro trends in tech, a comprehensive guide to data quality, SQL engine benchmarking, lake file format comparison, Iceberg's ACID transaction support, cloud cost management, forecasting algorithms, Apache Pinot's theta sketches, and layering a dbt project.


Bill Schmarzo: Orphaned Analytics: The Great Destroyers of Economic Value

Is the number of developed ML models an indicator of a company's analytics prowess and maturity? What is the cost of orphaned analytics? The author narrates the cost of orphaned analytics, which represent a significant operational and regulatory risk, and walks through the role of the Hypothesis Development Canvas in preventing them.

https://www.datasciencecentral.com/profiles/blogs/orphaned-analytics-the-great-destroyers-of-economic-value


ThoughtWorks: Macro trends in the technology industry

ThoughtWorks writes about the macro trends in the technology industry. The fall and rise of SQL and mainstream machine learning trends are an exciting read.

https://www.thoughtworks.com/insights/blog/macro-trends-technology-industry-april-2021


Chau Vinh Loi: A Comprehensive Framework for Data Quality Management

Data quality measures the extent to which data meets users' standards of excellence or expectations. In this comprehensive guide to data quality, the author narrates the data quality dimensions, including Accuracy, Completeness, Timeliness, Consistency, and Uniqueness, and the formulas for calculating their metrics.

https://towardsdatascience.com/a-comprehensive-framework-for-data-quality-management-b110a0465e83
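As a hint of what such metric formulas look like, here is a minimal Python sketch of two of the dimensions the article covers, completeness and uniqueness; the field names and data are hypothetical, and the article's exact formulas may differ.

```python
# Minimal sketch of two data quality metrics: completeness (share of
# non-null values) and uniqueness (share of distinct values).
# The sample data below is hypothetical.

def completeness(values):
    """Fraction of values that are neither None nor empty."""
    if not values:
        return 0.0
    filled = sum(1 for v in values if v not in (None, ""))
    return filled / len(values)

def uniqueness(values):
    """Fraction of values that are distinct."""
    if not values:
        return 0.0
    return len(set(values)) / len(values)

emails = ["a@x.com", "b@x.com", None, "a@x.com"]
print(completeness(emails))  # 0.75 -> 3 of 4 values are filled
print(uniqueness(emails))    # 0.75 -> 3 distinct values out of 4
```

In practice these ratios would be computed per column and tracked over time, with thresholds defining when a dataset fails a quality check.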


Explorium: Benchmarking SQL engines for Data Serving - PrestoDb, Trino, and Redshift

Explorium writes an informative benchmark comparing Redshift, Trino, and Presto. Redshift Spectrum outperforming Trino and Presto when adjusted for cost is an interesting finding.

https://medium.com/explorium-ai/benchmarking-sql-engines-for-data-serving-prestodb-trino-and-redshift-1c5f16d6e5da


LakeFS: Hudi, Iceberg & Delta Lake - Data Lake Table Formats Compared

LakeFS writes an exciting blog comparing the lake formats Hudi, Iceberg, and Delta Lake on platform compatibility, performance & throughput, and concurrency. The recommendations: if you are already a Databricks customer, Delta Engine brings significant improvements. If your primary pain point is managing huge tables on an object store (more than 10k partitions), Iceberg works excellently. If you use various query engines and require flexibility for managing mutating datasets, Hudi does the job.

https://lakefs.io/hudi-iceberg-and-delta-lake-data-lake-table-formats-compared/


Adobe: Iceberg Series - ACID Transactions at Scale on the Data Lake in Adobe Experience Platform

Write amplification increases when concurrent processes try to upsert at the same time. Adobe writes about Tombstone, its internal implementation of row-level upsert operations on Iceberg, which handles reprocessing of more than 10 billion rows every day.

https://medium.com/adobetech/iceberg-series-acid-transactions-at-scale-on-the-data-lake-in-adobe-experience-platform-f3e8fe0cef01
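Tombstone is Adobe's internal implementation, so the details are theirs; as a rough illustration of the general tombstoning idea (append new row versions and mark old ones dead instead of rewriting files, then reconcile on read), a toy sketch might look like this. All class and method names here are hypothetical.

```python
# Toy sketch of tombstone-style row-level upserts: an upsert appends the
# new row version and records the old version's key in a tombstone set;
# reads filter out tombstoned versions ("merge on read"). This avoids
# rewriting data files on every upsert. Concept illustration only --
# not Adobe's actual implementation.

class TombstoneTable:
    def __init__(self):
        self.rows = []           # append-only (version, key, value) records
        self.tombstones = set()  # (version, key) pairs marked dead
        self.version = 0

    def upsert(self, key, value):
        # Mark every live prior version of this key as deleted...
        for v, k, _ in self.rows:
            if k == key and (v, k) not in self.tombstones:
                self.tombstones.add((v, k))
        # ...then append the new version instead of rewriting files.
        self.version += 1
        self.rows.append((self.version, key, value))

    def scan(self):
        # Merge-on-read: skip tombstoned row versions.
        return {k: val for v, k, val in self.rows
                if (v, k) not in self.tombstones}

t = TombstoneTable()
t.upsert("user1", {"plan": "free"})
t.upsert("user2", {"plan": "pro"})
t.upsert("user1", {"plan": "pro"})  # row-level upsert, no file rewrite
print(t.scan())  # user1 now maps to {'plan': 'pro'}
```

The trade-off the blog discusses follows from this design: writes stay cheap and append-only, but reads must reconcile rows against tombstones, so compaction is needed to keep scans fast.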


Airbnb: Achieving Insights and Savings with Cost Data

Cloud cost optimization is one of the vital aspects of platform engineering. Airbnb shares its learnings from building its cost data foundation: building a pipeline, defining metrics, and designing visualizations.

https://medium.com/airbnb-engineering/achieving-insights-and-savings-with-cost-data-ec9a49fd74bc


Microsoft: Time series forecasting - Selecting algorithms

Microsoft writes the second part of its time series forecasting series, focusing on selecting algorithms. The blog narrates a univariate forecasting engine and the evaluation metrics used to measure prediction quality.

https://medium.com/data-science-at-microsoft/time-series-forecasting-part-2-of-3-selecting-algorithms-11b6635f61bb
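As a taste of forecast evaluation, here is a minimal sketch of two widely used metrics, MAPE and RMSE, on hypothetical data; the blog's exact metric choices may differ.

```python
import math

# Two common forecast-evaluation metrics used to compare candidate
# algorithms: MAPE (scale-independent, in percent) and RMSE (in the
# units of the series, penalizing large errors). Data is hypothetical.

def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    return 100 * sum(abs((a - p) / a)
                     for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error."""
    return math.sqrt(sum((a - p) ** 2
                         for a, p in zip(actual, predicted)) / len(actual))

actual = [100, 120, 130]
naive = [100, 100, 100]  # naive "repeat last value" baseline
print(round(mape(actual, naive), 2))  # 13.25
print(round(rmse(actual, naive), 2))  # 20.82
```

Comparing each candidate algorithm against such a naive baseline is a common first filter when selecting among forecasting methods.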


LinkedIn: Solving for the cardinality of set intersection at scale with Pinot and Theta Sketches

LinkedIn writes about using Apache Pinot's Theta Sketch-based set-intersection cardinality estimation to solve the audience-reach estimation problem in production. The new solution alleviated the existing problem of data staleness by reducing data size by approximately 80% and capping data size growth from superlinear to sub-linear.

https://engineering.linkedin.com/blog/2021/pinot-and-theta-sketches
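For intuition on how theta sketches trade accuracy for size, here is a toy KMV ("k minimum values") style estimator: keep only the k smallest normalized hash values of each set, estimate cardinality from the k-th smallest, and estimate intersections from the retained samples. This is a simplification for illustration, not the Apache DataSketches or Pinot implementation.

```python
import hashlib

# Toy KMV-style sketch illustrating the theta-sketch idea: a set of any
# size is summarized by its K smallest normalized hash values, so sketch
# size (and hence storage) stays bounded while cardinalities and
# intersections remain estimable. Simplified for illustration only.

K = 256

def normalized_hash(item):
    """Hash an item to a uniform value in [0, 1)."""
    digest = hashlib.sha256(str(item).encode()).hexdigest()
    return int(digest, 16) / 2 ** 256

def sketch(items):
    """Keep only the K smallest distinct hash values."""
    return sorted({normalized_hash(x) for x in items})[:K]

def estimate_cardinality(sk):
    if len(sk) < K:
        return len(sk)  # small set: the sketch is exact
    # If n hashes fall uniformly in [0, 1), the K-th smallest is ~K/n.
    return int((K - 1) / sk[-1])

def estimate_intersection(sk_a, sk_b):
    # Jaccard estimate over the K smallest hashes of the union,
    # scaled by the estimated union cardinality.
    union = sorted(set(sk_a) | set(sk_b))[:K]
    shared = len(set(union) & set(sk_a) & set(sk_b))
    return int(shared / len(union) * estimate_cardinality(union))

a = sketch(range(0, 100_000))         # 100k distinct "members"
b = sketch(range(50_000, 150_000))    # overlaps a in 50k members
print(estimate_cardinality(a))        # roughly 100,000
print(estimate_intersection(a, b))    # roughly 50,000
```

The roughly 80% size reduction the blog reports comes from this bounding: each audience segment is stored as a fixed-size sketch rather than as its full membership list, at the cost of a small, tunable estimation error.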


Mitchell Silverman: Layering Your Data Warehouse

How do you layer a dbt project? The author narrates why layering is vital in data infrastructure and gives a complete description of the root layer, logic layer, dimension & activity layer, and reporting layer.

https://mitchellsilv-79772.medium.com/layering-your-data-warehouse-f3da41a337e5


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.