Data Engineering Weekly #103

The Weekly Data Engineering Newsletter

Oct 17, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

Schema Ops - Let’s (re)-imagine Data Management

We missed the last couple of weeks, but hey, folks, we are back. Data contracts have been a hot topic over the last couple of weeks. Chad published an engineering guide to data contracts, Jake published the contract-powered platform, David published a three-part series on data contracts, and Yali Sassoon published why the data contract is a good idea.

I guess this tweet summarizes it all, and it is my general reaction to the case against the data contract.

Charity Majors @mipsytipsy

The idea is that each dataset should have a data contract, consisting of a schema plus any SLAs, semantics, policies etc and a version id. This may strike you as a self-evidently good idea, but the article says it seems to be hotly debated amongst data engineers.
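
To make the tweet's description concrete, here is a minimal sketch of a contract as "a schema plus an SLA and a version id," together with a check for breaking changes. Every name in it (DataContract, freshness_sla_minutes, breaking_changes) is a hypothetical chosen for illustration, not taken from any of the linked articles or from a real library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract: a schema plus an SLA and a version id."""
    dataset: str
    version: int
    schema: dict  # column name -> type name, e.g. {"order_id": "string"}
    freshness_sla_minutes: int = 60  # illustrative SLA field (assumption)


def breaking_changes(old, new):
    """Return reasons why `new` would break consumers that rely on `old`."""
    problems = []
    for column, col_type in old.schema.items():
        if column not in new.schema:
            problems.append(f"column removed: {column}")
        elif new.schema[column] != col_type:
            problems.append(
                f"type changed: {column} {col_type} -> {new.schema[column]}"
            )
    if new.version <= old.version:
        problems.append("version id was not incremented")
    return problems


v1 = DataContract("orders", 1, {"order_id": "string", "amount": "double"})
# Adding a column is treated as non-breaking in this sketch.
v2 = DataContract(
    "orders", 2,
    {"order_id": "string", "amount": "double", "currency": "string"},
)
# Changing a type and dropping a column are both treated as breaking.
v3 = DataContract("orders", 3, {"order_id": "string", "amount": "string"})

print(breaking_changes(v1, v2))
print(breaking_changes(v2, v3))
```

In this sketch, adding a column passes while removing a column or changing a type is flagged. Real contracts would also need to cover semantics and policies, which are left out here.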

I thought about it a lot and realized how poor naming could significantly impact the adoption curve in mainstream companies. "Data Contract" implies something slow and bureaucratic. I understand why, since simply adding a column in a traditional data warehouse can take months to roll out due to the org structure.

I think it is essential to name it to reflect its purpose. Hence I call it "Schema Ops," inspired by the success of reliability engineering.

Ananth Packkildurai @ananthdurai

"Schema Ops" - is a collective operation to define the structure of the data, enforce constraints, and find & share among different domains. Business continuity and non-breaking schema management is the top priority for Schema Ops. 1/3

DoorDash: Five Common Data Quality Gotchas in Machine Learning and How to Detect Them Quickly

Data collection and preparation occupy the vast majority of the Machine Learning workload. DoorDash writes about five typical data quality gotchas in the ML pipeline, including missing and invalid values, outliers, defaulting, and sampling errors. The open-source Data Quality Report is an exciting project to keep an eye on.

https://doordash.engineering/2022/09/27/five-common-data-quality-gotchas-in-machine-learning-and-how-to-detect-them-quickly/
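
As a rough illustration of the kinds of checks the article discusses (missing values, invalid values, and defaulting), here is a minimal sketch in plain Python. It is not DoorDash's Data Quality Report and does not use its API; the function name, the thresholds, and the sample data are all assumptions made for this example.

```python
import math
from collections import Counter


def quality_report(values, valid_min, valid_max, default_threshold=0.5):
    """Summarize missing, invalid, and suspected-default values in a column."""
    total = len(values)
    # Treat both None and NaN as missing.
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    missing = total - len(present)
    invalid = sum(1 for v in present if not (valid_min <= v <= valid_max))

    # A single value dominating the column often signals silent defaulting.
    suspected_default = None
    if present:
        value, count = Counter(present).most_common(1)[0]
        if count / total >= default_threshold:
            suspected_default = value

    return {
        "missing_ratio": missing / total if total else 0.0,
        "invalid_ratio": invalid / total if total else 0.0,
        "suspected_default": suspected_default,
    }


# Hypothetical "customer age" feature: 0 looks like a silent default value.
ages = [0, 0, 0, 12, None, -5, 30, 0]
report = quality_report(ages, valid_min=0, valid_max=120)
print(report)
```

A check like this is cheap enough to run on every training set before it reaches the model. Outlier and sampling-error checks would need distribution-aware logic that this sketch does not attempt.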

Intuit: Numaproj — Driving Dev Velocity with Real-time Analytics, AIOps on Kubernetes

Developer productivity is vital to winning in digital business, and Intuit is applying that principle to AI/ML development. Intuit writes about Numaproj, an open-source collection of Kubernetes-native developer tools for real-time data analysis and AIOps.

https://medium.com/intuit-engineering/numaproj-driving-dev-velocity-with-real-time-analytics-aiops-on-kubernetes-fba62e00eecf

Airbnb: Upgrading Data Warehouse Infrastructure at Airbnb

Airbnb writes about its migration to Spark 3 and Iceberg. The highlight is how far Airbnb's data infrastructure has evolved, from HDFS to S3 and now to Iceberg, a reminder that data infrastructure requires continuous improvement and investment. I've not tried Spark 3 AQE (Adaptive Query Execution), but it is exciting to see the performance benefits in the article.

https://medium.com/airbnb-engineering/upgrading-data-warehouse-infrastructure-at-airbnb-a4e18f09b6d5

DBS Tech: Accelerating Big Data processing with Spark optimisation

What if we can’t utilize Spark 3 and AQE? Regardless of the Spark version, the article gives an excellent overview of the basic structure to keep in mind while building data pipelines with Apache Spark.

https://medium.com/dbs-tech-blog/accelerating-big-data-processing-with-spark-optimisation-1f2f5dad03ea

Uber: Reducing Logging Cost by Two Orders of Magnitude using CLP

Big data processing generates logs that are too large to process and index. Uber writes about how it efficiently compresses and indexes Spark logs using CLP integrated with the Log4j appender. CLP (Compressed Log Processor) is a tool capable of losslessly compressing text logs and searching the compressed logs without decompression.

CLP Paper: https://www.usenix.org/system/files/osdi21-rodrigues.pdf

CLP Github: https://github.com/y-scope/clp

https://www.uber.com/blog/reducing-logging-cost-by-two-orders-of-magnitude-using-clp/

Snap: Speed Up Feature Engineering for Recommendation Systems

Developer velocity in feature engineering is a focus for many companies that want to iterate fast and build ML applications. Along the lines of Airbnb's Zipline and Uber's Michelangelo Palette, Snap writes about Robusta, its internal feature automation framework.

https://eng.snap.com/speed-up-feature-engineering

Netflix: RecSysOps: Best Practices for Operating a Large-Scale Recommender System

There is much literature on building recommendation engines, and there are Kaggle competitions. Operating a recommendation engine at scale is a different challenge. Netflix writes an exciting blog narrating best practices for operating recommendation engines in production.

https://netflixtechblog.medium.com/recsysops-best-practices-for-operating-a-large-scale-recommender-system-95bbe195a841

AWS: Amazon File Cache – A High Performance Cache On AWS For Your On-Premises File Systems

You may wonder why I included a product announcement from AWS. I found Amazon File Cache a fascinating development: a native file cache on top of S3. It will be interesting to see how emerging in-process query engines like DuckDB and a native file system cache on top of S3 can change the query layer.

https://aws.amazon.com/blogs/aws/amazon-file-cache-a-high-performance-cache-on-aws-for-your-on-premises-file-systems/

Lyft: Evolution of Streaming Pipelines in Lyft’s Marketplace

Lyft writes about the evolution of its streaming pipeline architecture on top of Apache Beam. The blog narrates how the initial version started with cron jobs and how continuous improvement simplified pipeline creation.

https://eng.lyft.com/evolution-of-streaming-pipelines-in-lyfts-marketplace-74295eaf1eba

Blibli.com: Data Lineage - State-of-the-art and Implementation Challenges

The blog narrates the current state of the standalone data lineage tools in the market. There is much momentum behind data catalogs. The small number of options in data lineage makes me wonder whether data lineage is still a separate category or a subset of the scheduler or catalog.

https://medium.com/bliblidotcom-techblog/data-lineage-state-of-the-art-and-implementation-challenges-1ea8dccde9de

Adrian Bednarz: The hidden risk of using CTEs

CTEs, dbt, Snowflake, and their performance impact seem to be a recurring theme lately. The author demonstrates another example of how an additional layer of CTEs can cause a significant performance impact.

https://techwithadrian.medium.com/the-hidden-risk-of-using-ctes-53b241e256b2

All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
