Data Engineering Weekly #40
Data Engineering Weekly
Welcome to the 40th edition of the data engineering newsletter. This week's edition features a new set of articles covering Airbnb’s metric consistency at scale, Google’s Logica, Shopify’s guide to exploratory data analysis, Intuit’s approach to safeguarding the data lake, Uber’s automated real-time merchant monitoring, SoundCloud’s journey of Corpus, Jupyter notebooks in the terminal, and Apache Spark 3.1 features.
Event Highlight: The LinkedIn Big Data Summit
LinkedIn published the agenda for the LinkedIn Big Data Summit, a half-day workshop-style event that focuses on the intersection of AI, Cloud, and Big Data. The conference is open for everyone to attend.
Airbnb: How Airbnb Achieved Metric Consistency at Scale
Airbnb writes about its analytical journey, sharing a few growing pains and introducing Minerva, Airbnb's metrics infrastructure. It's exciting to read about Minerva's simplified denormalization process, flexible backfill, comprehensive data management policy support, and integration with the data discovery system.
Google: Logica - organizing your data queries, making them universally reusable and fun
One shortcoming of SQL is that it is not flexible enough for developing and testing reusable components. Google's open-source Logica extends classical logic programming syntax to address this, expressing queries in the syntax of mathematical propositional logic rather than natural English language.
Shopify: A Five-Step Guide for Conducting Exploratory Data Analysis
Exploratory data analysis (EDA) is a critical tool in every data scientist’s kit, and the results are invaluable for answering critical business questions. Shopify shares essential tips for an effective EDA, highlighting the importance of understanding missing values, categorizing the data, the shape of its distribution, correlations between variables, and outliers.
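To make the five checks concrete, here is a minimal sketch using only the Python standard library. It is not taken from Shopify's article: the `orders` records, the field names, and the helper functions are all hypothetical, and the 1.5 x IQR rule is just one common way to flag outliers. A real analysis would more likely use a dataframe library.

```python
from collections import Counter
from statistics import mean, median, quantiles

# Hypothetical order records for illustration; None marks a missing value.
orders = [
    {"channel": "web", "items": 1, "total": 20.0},
    {"channel": "web", "items": 2, "total": 41.0},
    {"channel": "pos", "items": 3, "total": 59.0},
    {"channel": "web", "items": 4, "total": 80.0},
    {"channel": None, "items": 5, "total": 102.0},
    {"channel": "pos", "items": 2, "total": None},
    {"channel": "pos", "items": 1, "total": 900.0},
]


def missing_counts(rows):
    """Step 1: count missing (None) values per field."""
    counts = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None:
                counts[field] += 1
    return dict(counts)


def category_counts(rows, field):
    """Step 2: frequency of each category, ignoring missing values."""
    return Counter(row[field] for row in rows if row[field] is not None)


def iqr_outliers(values):
    """Step 5: flag values outside 1.5 * IQR of the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]


def pearson(xs, ys):
    """Step 4: Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Rows where both numeric fields are present.
complete = [r for r in orders if r["items"] is not None and r["total"] is not None]
totals = [r["total"] for r in complete]
outliers = iqr_outliers(totals)
trimmed = [r for r in complete if r["total"] not in outliers]

if __name__ == "__main__":
    print("missing:", missing_counts(orders))
    print("channels:", category_counts(orders, "channel"))
    # Step 3: a mean far above the median hints at a skewed distribution.
    print("mean vs. median total:", mean(totals), median(totals))
    print("outliers:", outliers)
    print("correlation, all rows:",
          pearson([r["items"] for r in complete], totals))
    print("correlation, outliers removed:",
          pearson([r["items"] for r in trimmed], [r["total"] for r in trimmed]))
```

In this toy data, a single extreme order is enough to flip the sign of the items-to-total correlation, which is why the outlier check is worth running before trusting any correlation figure.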
Intuit: Safeguarding Data in the Data Lake - Intuit’s Holistic Approach
Intuit writes about its holistic approach to securing the data lake. The key highlights are the journey from manual to automated data discovery and classification, encryption by default, and a focus on dataset ownership.
Uber: Automating Merchant Live Monitoring with Real-Time Analytics - Charon
Uber writes about Charon, its internal framework for controlling demand at the merchant level by enforcing real-time rules. The high-level architecture is an exciting read, with Presto and Pinot at the core of the rule engine, integrated with Hive and Kafka.
SoundCloud: The Journey of Corpus
SoundCloud writes about its journey migrating from Redshift to BigQuery with project Corpus, which aims to create a single centralized source of truth for SoundCloud's most relevant data. It's an exciting read on the mission-driven approach, which focuses on quality, compliance, timeliness, usability, efficiency, and maintainability, and on how the team adheres to those principles.
Jupyter: nbterm - Jupyter Notebooks in the terminal
Jupyter notebooks in the terminal! The blog walks through how to install nbterm, with examples.
Databricks: What’s New in Apache Spark™ 3.1 Release for Structured Streaming
Databricks highlights the Structured Streaming changes in the Spark 3.1 release: the new streaming table API, support for additional stream-stream join types (full outer and left semi), and Structured Streaming UI improvements.
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.