Data Engineering Weekly #1

A weekly newsletter on data engineering

I'm passionate about data engineering and decided to start a newsletter to share what I'm learning every week. I hope the newsletter will act as a medium for distributing knowledge and create healthy conversations. This week's release is an exciting set of articles that focus on data privacy, data discoverability, and data applications.

Privacy is often an afterthought in the data world. There are numerous ways one might betray someone's privacy, but they are evident in most everyday situations. The New York Times shared its thoughts on data privacy. The post offers a good overview of privacy, useful links, and the steps NYT is taking in marketing and advertising to protect its users' privacy.

The popularity of microservices makes it harder to enforce data privacy policies over time. Data often flows through an organization and is duplicated multiple times without any accountability. Tracing the data flow and implementing security policies is a challenge. Facebook writes about how a scalable data classification system helps enforce data policies.

Poor data quality leads to unusable data. How much you can trust your data is a question on the mind of every data consumer. Thoughtworks wrote an interesting article on this topic, with an introduction to deequ, an open-source library from AWS Labs.
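To make the idea of "trusting your data" concrete, here is a minimal sketch of two common quality checks, completeness and uniqueness, in plain Python. It does not use deequ or reproduce its API; the function names, the sample `orders` rows, and the 50% threshold are all made up for illustration.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def completeness(rows: List[Row], column: str) -> float:
    """Fraction of rows where `column` is present and not None."""
    if not rows:
        return 1.0
    non_null = sum(1 for row in rows if row.get(column) is not None)
    return non_null / len(rows)


def is_unique(rows: List[Row], column: str) -> bool:
    """True when no non-null value of `column` appears more than once."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return len(values) == len(set(values))


def run_checks(rows: List[Row]) -> Dict[str, bool]:
    """Evaluate a few hypothetical constraints and report pass/fail per check."""
    return {
        "order_id is complete": completeness(rows, "order_id") == 1.0,
        "order_id is unique": is_unique(rows, "order_id"),
        "email is at least 50% complete": completeness(rows, "email") >= 0.5,
    }


# Hypothetical sample data: one duplicate order_id and one missing email.
orders = [
    {"order_id": 1, "email": "a@example.com"},
    {"order_id": 2, "email": None},
    {"order_id": 2, "email": "c@example.com"},
]

results = run_checks(orders)
for name, passed in results.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

On the sample rows, the uniqueness check fails because `order_id` 2 appears twice, while the other two checks pass. A library like deequ runs checks of this kind at scale on Spark; the sketch only shows the shape of the idea.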

The Spark + AI Summit 2020 ended in the last week of June 2020. In case you missed it, all the slides and talks are available on the summit page.

The Klarna data team wrote an excellent summary of the summit.

Data discoverability is an essential aspect of data infrastructure. The value proposition of a data warehouse system decreases sharply with a weak data discovery system. The Shopify data team writes about their data discovery system, offering an excellent, comprehensive overview of a data discovery design.

Catalog services are an essential metadata engine for data discovery and schema management. Hive Metastore and the AWS Glue Data Catalog are some of the catalog services used in data infrastructure. Apache Flink 1.9 added catalog integration, and this blog post describes how to integrate Apache Flink with Hive and Postgres-based catalog services.

The University of Florida and NVIDIA on Tuesday unveiled a plan to build the world's fastest AI supercomputer in academia, delivering 700 petaflops of AI performance.

Time zones are a complicated yet crucial part of data infrastructure. Databricks writes an excellent overview of time zones, dates, and timestamps in Spark 3.0.
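A small illustration of why this matters: a timestamp is a single instant, but the calendar date you see depends on the zone you render it in. The sketch below uses only Python's standard library rather than Spark, and it assumes fixed UTC offsets (labeled PDT and IST here) instead of named zones with daylight-saving rules, so treat it as a simplification.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets used for illustration only; real time zones also carry
# daylight-saving rules that change the offset over the year.
PDT = timezone(timedelta(hours=-7), "PDT")
IST = timezone(timedelta(hours=5, minutes=30), "IST")

# One instant, stored in UTC.
instant = datetime(2020, 7, 1, 2, 0, tzinfo=timezone.utc)

# The same instant rendered in two different zones.
in_pdt = instant.astimezone(PDT)
in_ist = instant.astimezone(IST)

print(in_pdt.isoformat())  # 2020-06-30T19:00:00-07:00
print(in_ist.isoformat())  # 2020-07-01T07:30:00+05:30

# Aware datetimes compare by instant, not by wall-clock fields.
same_instant = in_pdt == in_ist

# The calendar date differs, which is how daily aggregates drift apart
# when jobs run with different session time zones.
different_dates = in_pdt.date() != in_ist.date()
print(same_instant, different_dates)  # True True
```

Grouping events by date before agreeing on a zone can put the same event into two different days, which is one reason to store instants in UTC and convert only at the edges.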

Pinterest writes about its shopping intent ML model, which drives shopping upsells in Pinterest search. The evolution of the model from the upsell click rate model to the "long click" model is an exciting read.

Walmart wrote about stream processing with Spring Cloud. Spring Cloud provides stream processing on top of the familiar Spring Framework. The post gives an introduction to Spring Cloud, a sample application, and how to unit test it.

Apache Airflow Summit videos are now available on YouTube.

Noria is a new streaming data-flow system designed to act as a fast storage backend for read-heavy web applications, based on this paper from OSDI '18. Jon Gjengset's thesis presentation on his work on Noria is an educational watch.

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.