Data Engineering Weekly #40

Data Engineering Weekly

May 02, 2021

Welcome to the 40th edition of the data engineering newsletter. This week's release is a new set of articles that focus on Airbnb's metric consistency at scale, Google's Logica, Shopify's guide to exploratory data analysis, Intuit's approach to safeguarding the data lake, Uber's automated merchant live monitoring in real time, SoundCloud's journey of Corpus, Jupyter notebooks in the terminal, and Apache Spark 3.1 features.

Event Highlight: The LinkedIn Big Data Summit

LinkedIn published the agenda for the LinkedIn Big Data Summit, a half-day workshop-style event that focuses on the intersection of AI, Cloud, and Big Data. The conference is open for everyone to attend.
https://thelinkedinbigdatasummit.splashthat.com/

Airbnb: How Airbnb Achieved Metric Consistency at Scale

Airbnb writes about its analytical journey, sharing a few growing pains and introducing Minerva, Airbnb's metrics infrastructure. It's exciting to read about Minerva's simplified denormalization process, flexible backfill, comprehensive data management policy support, and integration with the data discovery system.

https://medium.com/airbnb-engineering/how-airbnb-achieved-metric-consistency-at-scale-f23cc53dea70

Google: Logica - organizing your data queries, making them universally reusable and fun

One of the shortcomings of SQL is that it is not flexible enough to test and develop reusable components. Google's open-source Logica extends classical logic programming syntax to address these problems, using the syntax of mathematical propositional logic rather than natural English language.

https://opensource.googleblog.com/2021/04/logica-organizing-your-data-queries.html

Shopify: A Five-Step Guide for Conducting Exploratory Data Analysis

Exploratory data analysis (EDA) is a critical tool in every data scientist's kit, and the results are invaluable for answering critical business questions. Shopify shares some essential tips for an effective EDA, highlighting the importance of understanding missing values, categorizing the data, examining the shape of its distribution, checking correlations, and spotting outliers.

https://shopifyengineering.myshopify.com/blogs/engineering/conducting-exploratory-data-analysis
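
To make the five checks above concrete, here is a minimal sketch using only the Python standard library. It is not taken from Shopify's article: the `orders` sample, the column names, and the helper functions are all hypothetical, and the 1.5 x IQR outlier rule is one common convention among several.

```python
import math
import statistics

# Hypothetical sample data; None marks a missing value.
orders = [
    {"order_value": 20.0, "items": 2, "channel": "web"},
    {"order_value": 25.0, "items": 3, "channel": "web"},
    {"order_value": None, "items": 1, "channel": "mobile"},
    {"order_value": 30.0, "items": 4, "channel": None},
    {"order_value": 22.0, "items": 2, "channel": "mobile"},
    {"order_value": 500.0, "items": 5, "channel": "web"},
    {"order_value": 28.0, "items": 3, "channel": "mobile"},
]


def missing_counts(rows):
    """Step 1: count missing (None) values per column."""
    return {col: sum(r[col] is None for r in rows) for col in rows[0]}


def present(rows, col):
    """Return the non-missing values of one column."""
    return [r[col] for r in rows if r[col] is not None]


def column_kind(rows, col):
    """Step 2: label a column as numeric or categorical."""
    values = present(rows, col)
    is_numeric = all(isinstance(v, (int, float)) for v in values)
    return "numeric" if is_numeric else "categorical"


def pearson(xs, ys):
    """Step 4: Pearson correlation of two equal-length numeric lists."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def iqr_outliers(values, k=1.5):
    """Step 5: values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


values = present(orders, "order_value")
print("missing:", missing_counts(orders))
print("kinds:", {col: column_kind(orders, col) for col in orders[0]})
# Step 3: a mean far above the median hints at a right-skewed distribution.
print("mean:", statistics.mean(values), "median:", statistics.median(values))
pairs = [(r["order_value"], r["items"]) for r in orders
         if r["order_value"] is not None]
print("correlation:", pearson([p[0] for p in pairs], [p[1] for p in pairs]))
print("outliers:", iqr_outliers(values))
```

On a real dataset, a dataframe library would likely replace most of these helpers; the point here is only the order of the checks.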

Intuit: Safeguarding Data in the Data Lake - Intuit’s Holistic Approach

Intuit writes about its holistic approach to securing the data lake. The journey from manual to automated data discovery and classification, encryption by default, and a focus on dataset ownership are the key highlights.

https://medium.com/intuit-engineering/safeguarding-data-in-the-data-lake-intuits-holistic-approach-1109bbbae2cb

Uber: Automating Merchant Live Monitoring with Real-Time Analytics - Charon

Uber writes about Charon, its internal framework for controlling the demand at the merchant level through the enforcement of real-time rules. The high-level architecture is an exciting read with Presto & Pinot at the core of the rule engine integrated with Hive & Kafka.

https://eng.uber.com/charon/

SoundCloud: The Journey of Corpus

SoundCloud writes about its journey migrating from Redshift to BigQuery with the Corpus project, which aims to create a single centralized source of truth for SoundCloud's most relevant data. It's an exciting read on the mission-driven approach focusing on quality, compliance, timeliness, usability, efficiency & maintainability, and the approaches taken to adhere to those principles.

https://developers.soundcloud.com/blog/the-journey-of-corpus

Jupyter: nbterm - Jupyter Notebooks in the terminal

Jupyter notebooks in the terminal! The blog walks through how to install nbterm, with examples.

https://blog.jupyter.org/nbterm-jupyter-notebooks-in-the-terminal-6a2b55d08b70

Databricks: What’s New in Apache Spark™ 3.1 Release for Structured Streaming

Databricks writes about the highlights of the Spark 3.1 release for Structured Streaming: the new streaming table API, expanded support for stream-stream joins, and Structured Streaming UI improvements.

https://databricks.com/blog/2021/04/27/whats-new-in-apache-spark-3-1-release-for-structured-streaming.html

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.
