Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Schema Ops - Let’s (re)-imagine Data Management
We missed the last couple of weeks, but hey, folks, we are back. Data contracts have been a hot topic in the last couple of weeks. Chad published an engineering guide to data contracts, Jake published the contract-powered platform, David published a three-part series on data contracts, and Yali Sassoon published why data contracts are a good idea.
I guess this tweet summarizes it all, and it is my general reaction to the case against data contracts.
I thought about it a lot and realized how much poor naming can hurt the adoption curve in mainstream companies. The name "Data Contract" implies something slow and bureaucratic. I understand why: a simple addition of columns in a traditional data warehouse can take months to roll out because of the org structure.
I think it is essential that the name reflects its purpose. Hence I call it "Schema Ops," inspired by the success of reliability engineering.
DoorDash: Five Common Data Quality Gotchas in Machine Learning and How to Detect Them Quickly
Data collection and preparation occupy the vast majority of the Machine Learning workload. DoorDash writes about five typical data quality gotchas in the ML pipeline, including missing and invalid values, outliers, defaulting, and sampling errors. The open-source Data Quality Report is an exciting project to keep an eye on.
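To make a few of these gotchas concrete, here is a minimal sketch of a per-column check for missing values, suspicious defaulting, and outliers. It is not DoorDash's Data Quality Report or its API; the function name `profile_column`, the sample values, and the IQR-based outlier rule are assumptions chosen for illustration, and sampling errors are out of scope here.

```python
from statistics import quantiles


def profile_column(values, default=None, iqr_factor=1.5):
    """Summarize missing, defaulted, and outlier values for one numeric column.

    Hypothetical helper for illustration only.
    """
    total = len(values)
    present = [v for v in values if v is not None]

    # Missing values: share of rows with no value at all.
    missing_rate = (total - len(present)) / total if total else 0.0

    # Defaulting: share of rows equal to a known fallback value (e.g. 0),
    # which often hides an upstream failure rather than a real measurement.
    defaulted = [v for v in present if default is not None and v == default]
    default_rate = len(defaulted) / total if total else 0.0

    # Outliers: values outside the IQR fences, computed on non-default values
    # so that a spike of defaults does not distort the quartiles.
    real = [v for v in present if default is None or v != default]
    outliers = []
    if len(real) >= 2:
        q1, _, q3 = quantiles(real, n=4)
        spread = q3 - q1
        low, high = q1 - iqr_factor * spread, q3 + iqr_factor * spread
        outliers = [v for v in real if v < low or v > high]

    return {
        "missing_rate": missing_rate,
        "default_rate": default_rate,
        "outliers": outliers,
    }


# Made-up feature values: one missing row, two defaulted rows, one extreme value.
sample = [10, 12, 11, 13, 12, 0, 0, None, 500, 11]
report = profile_column(sample, default=0)
print(report)
```

Running a check like this on every feature before training is cheap, and a sudden jump in `default_rate` is usually an earlier warning than a drop in model metrics.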
Intuit: Numaproj — Driving Dev Velocity with Real-time Analytics, AIOps on Kubernetes
Developer productivity is vital to winning in digital business, and Intuit is applying that principle to AI/ML development. Intuit writes about Numaproj, an open-source collection of Kubernetes-native developer tools for real-time data analysis and AIOps.
Airbnb: Upgrading Data Warehouse Infrastructure at Airbnb
Airbnb writes about its migration to Spark 3 and Iceberg. The highlight is how far Airbnb's data infrastructure has evolved, from HDFS to S3 and now to Iceberg, a reminder that data infrastructure requires continuous improvement and investment. I've not tried Spark 3 AQE (Adaptive Query Execution), but it is exciting to see the performance benefits reported in the article.
https://medium.com/airbnb-engineering/upgrading-data-warehouse-infrastructure-at-airbnb-a4e18f09b6d5
Sponsored - Monte Carlo: Join us on October 25-26, 2022, for IMPACT The Data Observability Summit
Hear from some of the biggest names in data and analytics about the ideas and technologies pioneering our industry. Featuring keynote speakers Nate Silver, founder of FiveThirtyEight, Daniel Kahneman, Nobel Prize-winning economist, Ali Ghodsi, CEO of Databricks, and founders and data leaders from dbt Labs, Fivetran, The New York Times, GitLab, Fox Corporation, and other companies pioneering the way forward for reliable data.
Data Engineering Weekly readers, Get Your Free Ticket!!!
DBS Tech: Accelerating Big Data processing with Spark optimisation
What if we can’t utilize Spark 3 and AQE? Regardless of the Spark version, the article gives an excellent overview of the basics to keep in mind while building data pipelines with Apache Spark.
Uber: Reducing Logging Cost by Two Orders of Magnitude using CLP
Big data processing generates logs that are too large to process and index. Uber writes about how it efficiently compresses and indexes Spark logs using CLP integrated with the Log4j appender. CLP (Compressed Log Processor) is a tool capable of losslessly compressing text logs and searching the compressed logs without decompression.
CLP Paper: https://www.usenix.org/system/files/osdi21-rodrigues.pdf
CLP Github: https://github.com/y-scope/clp
https://www.uber.com/blog/reducing-logging-cost-by-two-orders-of-magnitude-using-clp/
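For intuition on why logs compress so well and can still be searched, here is a toy sketch of the general idea of separating repetitive static text from variable values. It is not CLP's actual format or API: the `encode`/`decode` helpers, the placeholder byte, and the rule that any token containing a digit is a variable are all assumptions for illustration.

```python
import re

# Assumption: any whitespace-delimited token containing a digit is a variable.
_VARIABLE = re.compile(r"\S*\d\S*")
# Assumption: this control character never appears in real log messages.
PLACEHOLDER = "\x11"


def encode(message, templates):
    """Split a message into a shared template ID and its variable values."""
    variables = _VARIABLE.findall(message)
    template = _VARIABLE.sub(PLACEHOLDER, message)
    # Each distinct template is stored once in a dictionary and reused.
    template_id = templates.setdefault(template, len(templates))
    return template_id, variables


def decode(template_id, variables, templates):
    """Rebuild the original message exactly from template ID and variables."""
    template = next(t for t, i in templates.items() if i == template_id)
    parts = template.split(PLACEHOLDER)
    rebuilt = [parts[0]]
    for value, static_text in zip(variables, parts[1:]):
        rebuilt.append(value)
        rebuilt.append(static_text)
    return "".join(rebuilt)


templates = {}
line_a = "Task 12 finished in 340ms on host-7"
line_b = "Task 13 finished in 298ms on host-2"
encoded_a = encode(line_a, templates)
encoded_b = encode(line_b, templates)

# Both lines share one template, so only the variables differ per line.
print(len(templates), encoded_a, encoded_b)
print(decode(*encoded_a, templates) == line_a)
```

Because every line that shares a template carries the same ID, a search for a static phrase can first match against the small template dictionary and only then touch the lines that reference it.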
Sponsored: Soda - 🗣 Podcast: How To Build A Common Understanding Of Your Data Reliability Rules
Regardless of how data is being used, it is critical that the information is trusted. The practice of data reliability engineering has gained momentum recently to address that question. Soda Checks Language helps support the efforts of data teams, with the corresponding Soda Core utility that acts on this new DSL. In this episode of the Data Engineering Podcast by Tobias Macey, Tom Baeyens explains their reasons for creating a new syntax for expressing and validating checks for data assets and processes, as well as how to incorporate it into your own projects.
Snap: Speed Up Feature Engineering for Recommendation Systems
Improving developer velocity in feature engineering is a focus for many companies that want to iterate fast and build ML applications. Along the lines of Airbnb's Zipline and Uber's Michelangelo Palette, Snap writes about Robusta, its internal feature automation framework.
https://eng.snap.com/speed-up-feature-engineering
Netflix: RecSysOps: Best Practices for Operating a Large-Scale Recommender System
There is plenty of literature on building recommendation engines, and there are Kaggle competitions. Operating a recommendation engine at scale is another challenge. Netflix writes an exciting blog narrating best practices for operating recommendation engines in production.
Sponsored - RudderStack: Exploring Options for GA4 Cloud Measurement
Join this webinar with RudderStack and BlastX to learn all about the differences between Universal Analytics and GA4 and get a detailed rundown of the different GA4 implementation options. The session will cover server-side vs. client-side tracking and offer a practical framework to help you determine the best deployment method for your business.
https://www.rudderstack.com/events/exploring-options-for-ga4-cloud-measurement/
AWS: Amazon File Cache – A High Performance Cache On AWS For Your On-Premises File Systems
You may wonder why I included a product announcement from AWS. I found Amazon File Cache a fascinating development: a native file cache on top of S3. It will be interesting to see how emerging in-process query engines like DuckDB, combined with a native file system cache on top of S3, can change the query layer.
Lyft: Evolution of Streaming Pipelines in Lyft’s Marketplace
Lyft writes about the evolution of its streaming pipeline architecture on top of Apache Beam. The blog narrates how the initial version started with cron jobs and how continuous improvement simplified pipeline creation.
https://eng.lyft.com/evolution-of-streaming-pipelines-in-lyfts-marketplace-74295eaf1eba
Blibli.com: Data Lineage - State-of-the-art and Implementation Challenges
The blog narrates the current state of the standalone data lineage tools in the market. There is much momentum behind data catalogs. The small number of standalone data lineage options makes me wonder whether data lineage is still a separate category or a subset of the scheduler or catalog.
Adrian Bednarz: The hidden risk of using CTEs
CTEs, dbt, Snowflake, and performance impact seem to be a recurring theme in recent times. The author demonstrates another example of how an additional layer of CTE can cause a significant performance impact.
https://techwithadrian.medium.com/the-hidden-risk-of-using-ctes-53b241e256b2
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.