Welcome to the 14th edition of the data engineering newsletter. This week's release is a new set of articles covering data quality at Microsoft, operating Pinot at Uber, data science at Hulu and Trivago, the data lake at Grofers, notebooks at Yelp, a Spark shuffle optimizer from LinkedIn, SQL testing at SoundCloud, running Airflow on Kubernetes, and the Flink SQL platform from Ververica.
Poor data quality costs an estimated $3.1 trillion per year in the USA alone, equating to 16.5% of GDP. Reliably generating high-quality datasets to calculate business metrics is of the utmost importance in data engineering. Microsoft writes an exciting blog post on this topic, describing DataCop, its high-scale data quality auditor.
https://medium.com/data-science-at-microsoft/partnering-for-data-quality-dc9123557f8b
Whale! 🐳 The stupidly simple data discovery tool is an interesting take on data discovery tooling. The author shares experiences from the Airbnb data portal, and the startup, Dataframe, looks exciting to watch.
https://medium.com/df-foundation/meet-whale-the-stupidly-simple-data-discovery-tool-9f847c004b47
LinkedIn writes about Magnet, a scalable and performant shuffle architecture for Apache Spark. The blog narrates the reliability, efficiency, and scalability challenges of the current shuffle. Push-based shuffle is an exciting approach to optimizing Apache Spark.
https://engineering.linkedin.com/blog/2020/introducing-magnet
Apache Pinot is a low-latency, high-throughput analytical engine. Uber writes about its experience running Apache Pinot at Uber’s scale.
https://eng.uber.com/operating-apache-pinot/
Jupyter notebooks allow us to do ad hoc development interactively and analyze data with visualization support. As the business criticality of Jupyter notebooks increases, it becomes important to be able to reproduce a notebook's output. Yelp writes about Folium, which enables reproducible notebooks.
What is the function of a data science team at Hulu? How do they approach a data science problem? What are the interesting challenges ahead? The blog narrates the landscape of Hulu’s data science approaches.
https://medium.com/hulu-tech-blog/data-science-at-hulu-an-overview-bbc8c9b52a24
Grofers is one of the biggest online grocery delivery companies in India. Grofers writes about the evolution of its data lake. It's an exciting read, particularly for its honest take on the mistakes, the lessons learned, and the adoption of Apache Hudi and CDC approaches.
https://lambda.grofers.com/origins-of-data-lake-at-grofers-6c011f94b86c
Continuing with Apache Hudi: as change data capture becomes a mainstream sourcing mechanism for the data lake, the need for row-level upserts gains momentum. AWS writes a blog post on how to apply row-level changes with Apache Hudi tables.
Untested code is legacy code, yet data pipelines built on SQL often go untested. Great Expectations and dbt are steps in that direction for overall data quality, but ingesting test data is still a challenge. SoundCloud writes about its attempt to fix the SQL testing challenge.
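To make the idea of a row-level upsert concrete, here is a minimal sketch in plain Python. It is not Hudi code and does not reflect the approach in the AWS post; the operation names ("insert", "update", "delete"), the `apply_changes` helper, and the sample rows are all assumptions made up for illustration. It only shows the semantics: a change event overwrites the row with the same key if one exists and adds it otherwise.

```python
from typing import Dict, Iterable, Tuple

Row = Dict[str, object]
# A change event is (operation, row). The operation names are hypothetical.
Change = Tuple[str, Row]


def apply_changes(
    table: Dict[object, Row], changes: Iterable[Change], key: str = "id"
) -> Dict[object, Row]:
    """Apply change events to a keyed table and return a new table."""
    result = dict(table)  # shallow copy so the input table is left untouched
    for op, row in changes:
        if op in ("insert", "update"):
            # Upsert: overwrite the row if the key exists, add it otherwise.
            result[row[key]] = row
        elif op == "delete":
            result.pop(row[key], None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return result


table = {1: {"id": 1, "city": "Pune"}, 2: {"id": 2, "city": "Delhi"}}
changes = [
    ("update", {"id": 2, "city": "Mumbai"}),
    ("insert", {"id": 3, "city": "Chennai"}),
    ("delete", {"id": 1}),
]
merged = apply_changes(table, changes)
```

A real table format has to do this against files on object storage rather than an in-memory dictionary, which is where most of the engineering effort goes.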
https://developers.soundcloud.com/blog/testing-sql-for-bigquery
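As a rough illustration of what testing a SQL query against fixture data can look like, here is a minimal sketch. It is not SoundCloud's approach: their post targets BigQuery, while this sketch assumes an in-memory SQLite database as a stand-in, and the `plays` table, the query, and the `run_query_on_fixture` helper are hypothetical.

```python
import sqlite3

# Hypothetical query under test: number of plays per user.
QUERY = """
SELECT user_id, COUNT(*) AS plays
FROM plays
GROUP BY user_id
ORDER BY user_id
"""


def run_query_on_fixture(rows):
    """Load fixture rows into a throwaway in-memory table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE plays (user_id INTEGER, track_id INTEGER)")
        conn.executemany("INSERT INTO plays VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


# Small, hand-written fixture: user 1 played two tracks, user 2 played one.
fixture = [(1, 10), (1, 11), (2, 10)]
result = run_query_on_fixture(fixture)
assert result == [(1, 2), (2, 1)]
```

The hard part in practice is that warehouse SQL dialects differ from SQLite, so a stand-in database only works for queries that stay within the common subset.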
Airflow on Kubernetes is becoming the standard for running Airflow infrastructure. The blog post narrates three ways to run Airflow on Kubernetes: using the KubernetesPodOperator, using the KubernetesExecutor, and using KEDA with Airflow.
https://fullstaq.com/blog/three-ways-to-run-airflow-on-kubernetes/
The rise of SQL in streaming workloads is inevitable. Ververica announces the general availability of Flink SQL in the Ververica Platform.
https://www.ververica.com/blog/ververica-platform-2.3-an-end-to-end-platform-for-flink-sql
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.