Welcome to the seventh edition of the data engineering newsletter. This week we feature articles on natural language query processing, data ethics, MLOps, a data orchestration comparison, and A/B testing platforms from NVIDIA, Criteo, Netflix, Zendesk, Salesforce, Airbnb, and Facebook.
Differential privacy is a mathematically rigorous framework for quantifying the anonymization of sensitive data. Facebook open-sourced Opacus, a high-speed library for training PyTorch models with differential privacy. It's exciting to see more tooling emerging for privacy-preserving and ethical machine learning.
https://github.com/pytorch/opacus
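To make the differential-privacy idea concrete, here is a minimal sketch of the classic Laplace mechanism for a counting query. This is my own illustration of the underlying concept, not Opacus's API (Opacus implements DP-SGD, which clips per-sample gradients and adds Gaussian noise during training):

```python
import random

def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Noise scale b = sensitivity / epsilon gives epsilon-differential privacy."""
    return sensitivity / epsilon

def laplace_noise(scale: float) -> float:
    """Laplace(0, b) noise, sampled as the difference of two Exp(1/b) draws."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(values, predicate, epsilon: float) -> float:
    """A counting query has sensitivity 1: adding or removing one person
    changes the true count by at most 1, so we add Laplace(1/epsilon) noise."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(laplace_scale(1.0, epsilon))
```

Smaller epsilon means stronger privacy but noisier answers; with a large epsilon the noisy count stays close to the true count.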
In a similar spirit, Airbnb writes about Project Lighthouse, an initiative to measure and combat discrimination when booking or hosting on Airbnb. The blog post focuses on using data that satisfies p-sensitive k-anonymity to calculate acceptance rates by the perceived race of guests.
https://medium.com/airbnb-engineering/project-lighthouse-part-1-p-sensitive-k-anonymity-c6ee7d79c4f9
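As a rough illustration of the property itself (my own sketch, not Airbnb's implementation): a table is k-anonymous if every group of rows sharing the same quasi-identifiers has at least k members, and p-sensitive if each such group contains at least p distinct values of the sensitive attribute:

```python
from collections import defaultdict

def is_p_sensitive_k_anonymous(rows, quasi_ids, sensitive, k, p):
    """rows: list of dicts. Group rows by their quasi-identifier tuple,
    then require group size >= k and >= p distinct sensitive values."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_ids)
        groups[key].append(row[sensitive])
    return all(len(vals) >= k and len(set(vals)) >= p
               for vals in groups.values())

# Hypothetical generalized rows: three people share the same quasi-identifiers.
rows = [
    {"zip": "94*", "age": "20-30", "race": "A"},
    {"zip": "94*", "age": "20-30", "race": "B"},
    {"zip": "94*", "age": "20-30", "race": "A"},
]
```

Here the single group of three rows satisfies k=3 (three members) and p=2 (two distinct sensitive values), but would fail k=4 or p=3.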
MLOps is a set of best practices for businesses to run AI successfully. It is a relatively new field because the commercial use of AI is itself reasonably new. In this blog post, NVIDIA explains MLOps and shares a few success stories of its adoption.
https://blogs.nvidia.com/blog/2020/09/03/what-is-mlops/
What qualifies as a large-scale machine learning platform? Is it massive hardware, petabytes of data, or complexity and time constraints? In this blog post, Criteo walks through the trade-offs of a large-scale machine learning system.
https://medium.com/criteo-labs/the-trade-offs-of-large-scale-machine-learning-71ad0cf7469f
Data orchestration is the central nervous system of data infrastructure. Redpoint Ventures writes an exciting overview of data orchestration and the landscape of available tools.
https://medium.com/memory-leak/data-orchestration-a-primer-56f3ddbb1700
Zendesk shares its data catalog journey for its enterprise data team and why it chose Apache Atlas.
https://medium.com/zendesk-engineering/data-catalogs-the-luxury-of-choice-c161cd91e713
Apache Flink 1.11 comes with significant changes to the memory model of Flink's JobManager and new configuration options for Flink clusters. Developers can now configure off-heap memory, JVM metaspace overhead, total process memory, and total Flink memory.
https://flink.apache.org/2020/09/01/flink-1.11-memory-management-improvements.html
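For reference, the new JobManager memory options land in flink-conf.yaml and look roughly like this (the values below are illustrative placeholders, not recommendations):

```yaml
# Total memory of the JobManager JVM process, including JVM overhead.
jobmanager.memory.process.size: 1600m
# Alternatively, size only the Flink-managed portion instead:
# jobmanager.memory.flink.size: 1280m
# Off-heap memory reserved for the JobManager.
jobmanager.memory.off-heap.size: 128m
# JVM metaspace for the JobManager process.
jobmanager.memory.jvm-metaspace.size: 256m
```

Typically you set either the total process size or the total Flink size, and let Flink derive the remaining components.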
It’s exciting to see more of an ecosystem building on top of dbt. Census is a data automation platform that syncs your data warehouse with the apps you use. [Unfortunately, it is not open source.] Census writes about how its data sync tool works with dbt models.
https://blog.getcensus.com/making-your-dbt-models-more-useful-with-census/
RazorPay writes about its Druid adoption to expose high-level, business-critical data within the organization through various dashboards powered by Looker. The journey they took from batch processing to Apache Kylin to Druid is an exciting read.
Salesforce Einstein introduced Photon, a natural language interface for querying databases. I tried a few queries, and the joins seem to work amazingly well.
https://blog.einstein.ai/talk-to-your-data-one-model-any-database/

Unlike a true experiment, a quasi-experiment does not randomly assign participants to the treatment or control group. Netflix runs quasi-experiments at scale and writes about Quasimodo, a tool that automates parts of the scientists' workflow and lets them run more quasi-experiments in parallel.
https://netflixtechblog.com/key-challenges-with-quasi-experiments-at-netflix-89b4f234b852
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.