Welcome to the 32nd edition of the data engineering newsletter. This week's articles cover Picnic's Data Vault modeling, Mihail Eric's case for more data engineers, Microsoft's onboarding checklist for data scientists, Netflix's data movement with Google services, Redpoint Ventures' data feedback loops in SaaS applications, DoorDash's declarative real-time feature engineering, Uber's application of ML to internal auditing, Pinterest's ML techniques for fighting misinformation, Monte Carlo's new rules of data quality, and Anna Anisienia's take on Airflow's TaskFlow API design.
Let’s start this week with some fun but also the sad reality of the data engineering journey.
Data vault - new weaponry in your data science toolkit
The emerging cloud data warehouse and the structured data approach have brought back the importance of data modeling techniques like Data Vault and the Kimball methodology. Picnic writes an exciting read on how it uses these data modeling techniques on top of Snowflake to enable historical data access, time travel through past states, and integration with its real-time pipeline.
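Data Vault models typically revolve around hubs (business keys), links (relationships), and satellites (timestamped descriptive attributes), with deterministic hash keys derived from business keys tying them together. As a minimal sketch of that mechanic in plain Python (the column and source names are hypothetical, not Picnic's actual schema):

```python
import hashlib
from datetime import datetime, timezone

def hash_key(*business_keys: str) -> str:
    """Derive a deterministic hash key from one or more business keys.

    Data Vault commonly hashes normalized, concatenated business keys
    so the same entity always maps to the same hub key.
    """
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Hub row: one entry per distinct business key.
customer_hub = {
    "customer_hk": hash_key("C-1001"),
    "customer_id": "C-1001",
    "load_ts": datetime.now(timezone.utc),
    "record_source": "orders_api",  # hypothetical source system name
}

# Satellite row: descriptive attributes versioned by load timestamp,
# which is what makes time travel through historical states possible.
customer_sat = {
    "customer_hk": customer_hub["customer_hk"],
    "name": "Ada Lovelace",
    "city": "Amsterdam",
    "load_ts": customer_hub["load_ts"],
}
```

Because satellites are append-only and keyed by hash key plus load timestamp, querying "the state as of date X" reduces to picking the latest satellite row at or before X.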
It is an exciting area of study. I'm wondering how traditional data modeling techniques go hand-in-hand with the modern data engineering principles of immutable, idempotent data pipelines and data versioning techniques. If you have thoughts, let's connect and discuss.
We Don't Need Data Scientists, We Need Data Engineers
How data practitioner roles (data & ML engineering) are distributed across companies is interesting for understanding the data domain's emerging patterns. Though the author analyzed only a small set of YC startups, the underlying observation is worth noting: modern ML frameworks like TensorFlow and PyTorch have industrialized machine learning, but data collection, cleaning, and labeling remain unindustrialized and largely manual.
Data movement for Google services at Netflix
Business operations teams use multiple SaaS tools to run a business unit effectively. This brings many challenges, like data access control, lineage tracking, and integration with other business operations. Netflix writes an exciting blog post highlighting how it tackles these challenges using a proxy service for Google Workspace app integrations.
The Feedback Loops in Data that Will Change SaaS Architecture
As we noticed in Netflix's Google Workspace integration journey, it's an increasingly common pattern for an enterprise to contribute to and leverage data from SaaS applications to meet business goals. The author captures the feedback loop of data flowing across SaaS applications. It is an exciting space to watch.
Building Riviera: A Declarative Real-Time Feature Engineering Framework
ML models play a significant role in improving the user experience. As a result, an efficient feature engineering framework is a critical part of the ML infrastructure. DoorDash writes an exciting blog that narrates the importance of a near-real-time feature store for enriching the customer experience and how its Flink-as-a-service platform helps fulfill that mission.
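The core idea behind a declarative framework like Riviera is that engineers declare what a feature is (an aggregation over an event stream) rather than hand-writing a streaming job. As a toy illustration of that idea in plain Python, with a hypothetical feature spec format that stands in for what would compile to a Flink job in a real system:

```python
from collections import defaultdict

# Hypothetical declarative feature specs: name, grouping key, aggregation.
# In a real framework these would be compiled into streaming (e.g. Flink) jobs.
FEATURES = [
    {"name": "store_order_count", "key": "store_id", "agg": "count"},
    {"name": "store_order_total", "key": "store_id", "agg": "sum", "field": "amount"},
]

def compute_features(events):
    """Evaluate each declared feature over a batch of events."""
    results = defaultdict(dict)
    for spec in FEATURES:
        for event in events:
            entity = event[spec["key"]]
            current = results[spec["name"]].get(entity, 0)
            if spec["agg"] == "count":
                results[spec["name"]][entity] = current + 1
            elif spec["agg"] == "sum":
                results[spec["name"]][entity] = current + event[spec["field"]]
    return dict(results)

events = [
    {"store_id": "s1", "amount": 20.0},
    {"store_id": "s1", "amount": 5.0},
    {"store_id": "s2", "amount": 12.5},
]
features = compute_features(events)
```

The payoff of the declarative approach is that adding a feature means adding a spec, not writing and operating a new pipeline.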
Applying Machine Learning in Internal Audit with Sparsely Labeled Data
As machine learning continues to evolve, it transforms the industries it touches. Uber narrates one such transformation: how ML helps its internal auditing system answer questions such as how many agents operate per country, the number of transactions, the total cash paid, and how these figures evolved over the past three years. It's no surprise that data availability and data labeling, rather than ML model development, are cited as the most significant challenges.
How Pinterest fights misinformation, hate speech, and self-harm content with machine learning
Providing a safe and secure experience, from health misinformation to hate speech, self-harm, and graphic violence, is a significant challenge for social platforms. Pinterest narrates the ML-driven architecture that enables the system to automatically detect unsafe content before it's reported.
The New Rules of Data Quality
Historically, data quality checks focused on siloed, data-producer-driven testing, which is essentially equivalent to unit testing. But is a unit test enough for data testing? The blog lays out principles for engineering data quality and emphasizes that data quality is a collective responsibility.
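One concrete way to move beyond producer-side unit tests is to encode data quality rules as executable checks that run at pipeline boundaries, where consumers can see and extend them. A minimal sketch with hypothetical rule names and thresholds (this is an illustration of the principle, not Monte Carlo's product):

```python
def null_rate(rows, column):
    """Fraction of rows where `column` is missing."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)

def check_quality(rows, rules):
    """Evaluate each rule against the data; return the names of failing rules."""
    failures = []
    for rule in rules:
        if null_rate(rows, rule["column"]) > rule["max_null_rate"]:
            failures.append(rule["name"])
    return failures

rows = [
    {"order_id": 1, "email": "a@example.com"},
    {"order_id": 2, "email": None},
    {"order_id": 3, "email": "c@example.com"},
]
rules = [
    {"name": "email_mostly_present", "column": "email", "max_null_rate": 0.1},
    {"name": "order_id_present", "column": "order_id", "max_null_rate": 0.0},
]
failures = check_quality(rows, rules)  # the email rule fails: 1/3 nulls > 0.1
```

Because the rules live alongside the pipeline rather than inside one producer's test suite, both producers and consumers can add rules, which is the "collective responsibility" the post argues for.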
TaskFlow API in Apache Airflow 2.0 — Should You Use It?
A data pipeline is more than a unit of execution, and it often needs to share state with downstream jobs to form a composable pipeline. The blog narrates the pros and cons of the TaskFlow API design and its practical implications, and raises some interesting points on data transformation vs. orchestration.
Onboarding to a data science team
Microsoft writes an exciting blog on a typical checklist for onboarding to a data science team. Though the blog focuses on data science roles, the checklist applies equally to data engineers, who often need to communicate with multiple stakeholders.
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.