Welcome to the 18th edition of the data engineering newsletter. This week's articles cover Microsoft's ML model governance, Google's MinDiff, Slack's Airflow migration, Doordash's scaling of its ML feature store, data science methodology at Atlassian, Pinterest's journey from Lambda to Kappa architecture, the next step for data management, the data mesh, the data pipeline and data ingestion framework landscapes, and one engineer's experience with Oracle Cloud for data engineering.
Predictive modeling is critical for today's businesses. As a result, governing machine learning models, including controlling access, testing, validating, logging changes and access, and tracing model results, is equally critical. Microsoft writes an excellent read on machine learning model governance at scale, covering the development lifecycle, roles, and responsibilities.
https://medium.com/data-science-at-microsoft/machine-learning-model-governance-at-scale-26c9a2dc15b5
Google announced the release of MinDiff, a new regularization technique available in the TF Model Remediation library for effectively and efficiently mitigating unfair biases when training ML models. It's exciting to see more tooling emerge for overcoming unfair bias in ML.
https://ai.googleblog.com/2020/11/mitigating-unfair-bias-in-ml-models.html
The efficiency of an infrastructure team is defined by the number of clean migrations it can pull off. Slack's data infrastructure team has done it twice with Airflow and writes about how it performed the Python 3 migration. It is an exciting read on excellence in software engineering practice.
https://slack.engineering/migrating-slack-airflow-to-python-3-without-disruption/
As ML development becomes mainstream in a company, feature data can grow to billions of records, with millions actively retrieved during model inference under low-latency constraints. Doordash narrates the challenges of operating a large-scale feature store and how it optimized Redis for scalability.
https://doordash.engineering/2020/11/19/building-a-gigascale-ml-feature-store-with-redis/
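A core pattern behind Redis-backed feature stores like the one Doordash describes is storing each entity's features in a hash and fetching many entities in one batched round trip. A minimal sketch of that access pattern (a plain dict stands in for a Redis client here; the `features:<entity_id>` key scheme and feature names are hypothetical, with redis-py you would use `hset` and a pipeline of `hmget` calls instead):

```python
# In-memory stand-in for Redis hashes: key -> {field: value}.
store = {}

def put_features(entity_id, features):
    """Write one entity's features into its hash (like HSET with a mapping)."""
    store.setdefault(f"features:{entity_id}", {}).update(features)

def get_features_batch(entity_ids, fields):
    """Fetch selected fields for many entities, as a pipelined HMGET would."""
    return {
        eid: {f: store.get(f"features:{eid}", {}).get(f) for f in fields}
        for eid in entity_ids
    }

put_features("store_42", {"avg_prep_time": 11.5, "is_busy": 1})
put_features("store_77", {"avg_prep_time": 7.0, "is_busy": 0})

batch = get_features_batch(["store_42", "store_77"], ["avg_prep_time"])
```

The hash-per-entity layout keeps all of an entity's features co-located, so a model inference that needs a handful of fields for millions of entities turns into batched hash reads rather than millions of individual key lookups.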
How do you build a data science methodology that works for your team? Atlassian narrates its experience running the data science process and the various methodologies it tried while building its data science team.
https://medium.com/atlassiandata/build-a-data-science-methodology-7633935dc644
Pinterest writes about its evolution from Lambda to Kappa architecture for its visual signal infrastructure. The Lambda architecture inherently suffers from high latency and is challenging to debug because of its two separate business-logic execution paths. The blog narrates how granular retries are a significant advantage when adopting the Kappa architecture.
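The debugging pain called out above comes from Lambda maintaining the same business logic twice, once in the batch layer and once in the streaming layer. The Kappa idea is a single code path that treats a backfill as just a replay of the event log. A toy sketch of that single-path principle (the event shape and scoring logic are invented for illustration):

```python
def enrich(event):
    """The one and only business-logic path, shared by replay and live traffic."""
    return {**event, "ctr": event["clicks"] / max(event["views"], 1)}

def process(stream):
    # In a Kappa architecture, historical replays and live events flow through
    # the same processing function, so there is one implementation to debug.
    return [enrich(e) for e in stream]

replayed = process([{"views": 10, "clicks": 5}])  # backfill = replay the log
live = process([{"views": 4, "clicks": 1}])       # live events, same code path
```

With one code path, a bad deploy is fixed by patching `enrich` and replaying the affected slice of the log, which is also what makes the granular retries the post praises practical.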
Beyond the Database, and Beyond the Stream Processor: What's the Next Step for Data Management? The article is an excellent summary of the stream-versus-table duality. It narrates the fundamental difference between a stream and a database in terms of access patterns, and the need to rethink what a database means in the coming years.
https://www.infoq.com/articles/whats-the-next-step-for-data-management/
Data mesh is a paradigm shift in developing data products. The concept is still evolving and is being adopted as data complexity increases. ThoughtWorks writes an excellent article on the data mesh, detailing that it's not about tech, but about ownership and communication.
The popularity of cloud databases like Snowflake is increasing the adoption of cloud data integration services like Fivetran. The blog post compares the current landscape of open-source data integration tools: Singer, Airbyte, Pipelinewise, and Meltano. It is exciting to watch Airbyte gaining momentum in the landscape.
https://towardsdatascience.com/the-state-of-open-source-data-integration-and-etl-d2f2e8733e2a
Data pipeline frameworks are the central nervous system of data applications. The author surveys the current state of data pipelines by comparing Luigi, Airflow, and Kubeflow. Dagster and Prefect are two other exciting frameworks emerging in the data pipeline landscape.
https://dav009.medium.com/current-state-of-data-pipelines-frameworks-november-2020-4cceefb1cb14
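Underneath their differences, all of the frameworks above model a pipeline as a DAG of tasks and run each task only after its dependencies finish. A toy pure-Python sketch of that shared core, using the standard library's `graphlib` (the task names are made up; real frameworks layer scheduling, retries, and state tracking on top):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Each task declares its upstream dependencies, much like an Airflow DAG
# or a Luigi `requires()` method.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

def run(task):
    """Placeholder for the real work a task would do."""
    return f"done:{task}"

# Resolve a dependency-respecting execution order, as a scheduler would,
# then run the tasks in that order.
order = list(TopologicalSorter(dag).static_order())
results = [run(t) for t in order]
```

The interesting differences between Luigi, Airflow, Kubeflow, Dagster, and Prefect are in everything around this core: how tasks are triggered, retried, parameterized, and observed.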
What is the state of data engineering on Oracle Cloud? The author narrates an experience with Oracle Cloud in comparison with GCP. The verdict: ignoring research time, the whole process, from pre-setup through deployment to post-setup, takes about two hours. For comparison, deploying a Dataproc cluster (GCP's equivalent) takes on average two minutes.
https://medium.com/@danielapetruzalek/one-week-using-oracle-cloud-for-data-engineering-7933e35e4af2
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.