Welcome to the 35th edition of the data engineering newsletter. This week's release is a new set of articles that focus on DeepLearning.AI's shift from model-centric to data-centric AI, FreeCodeCamp's MLOps explained, Salesforce's guide to building a successful AI platform, Shopify's CDC journey, Strava's scaling data culture, the New York Times's SQL interview process, Tiqets taming dependency hell with dbt, executing a distributed shuffle without MapReduce in Ray, Adaltas's comparison of storage size across popular file formats, working with nested data structures in PySpark, and DataCamp's open-source dependency-automation project Viewflow.
Deep Learning AI: A Chat with Andrew on MLOps: From Model-centric to Data-centric AI
80% of the ML workload is data preparation and management, yet 99% of published papers focus on AI research and only 1% on data. The talk narrates the importance of the ML lifecycle and why shifting from model-centric to data-centric AI makes data quality a systematic & reliable process.
FreeCodeCamp: What is MLOps? Machine Learning Operations Explained
Enterprises are increasingly embedding ML-enabled decision automation across business verticals. As the reliability of ML applications becomes a mainstream concern, MLOps takes center stage. The article walks through the different stages of MLOps and the skills required to develop ML products.
https://www.freecodecamp.org/news/what-is-mlops-machine-learning-operations-explained/
Paper: The article quotes an interesting paper worth reading: Hidden Technical Debt in Machine Learning Systems
Salesforce: Building a Successful Enterprise AI Platform
How do you plan for building an AI platform? The blog narrates the building blocks of a successful platform. It emphasizes the importance of end-to-end user experience, having the right mix of domain and technical expertise, effective communication channels, a faster research-to-production cycle, uniformity, and privacy & trust.
https://engineering.salesforce.com/building-a-successful-enterprise-ai-platform-197a3c4d8b60
Shopify: Capturing Every Change From Shopify’s Sharded Monolith
Shopify writes an exciting blog about its change data capture journey, from periodic batch query polling to continuous change data capture using Debezium & Kafka Connect. The blog narrates the technical challenges with pull-based change data capture and the lessons learned from running the CDC platform, such as handling schema changes and large records.
https://shopify.engineering/capturing-every-change-shopify-sharded-monolith
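To make the log-based pattern concrete, here is a minimal sketch of consuming Debezium change events from Kafka in Python. This is not Shopify's code: the topic name is hypothetical, and the envelope fields assume Debezium's default JSON layout.

```python
# A minimal sketch (not Shopify's code) of consuming Debezium change events.
# The topic name is hypothetical; the envelope follows Debezium's default
# JSON layout: {"payload": {"op": ..., "before": ..., "after": ...}}.
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "shop.inventory.products",  # hypothetical CDC topic
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw) if raw else None,
)

for message in consumer:
    if message.value is None:   # tombstone records carry no payload
        continue
    envelope = message.value.get("payload", message.value)
    op = envelope["op"]         # "c"=create, "u"=update, "d"=delete, "r"=snapshot
    if op in ("c", "r"):
        print("upsert:", envelope["after"])
    elif op == "u":
        print("update:", envelope["before"], "->", envelope["after"])
    elif op == "d":
        print("delete:", envelope["before"])
```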
Fivetran / Strava: Scaling Data Culture Is a Marathon, Not a Sprint
One of the data engineering team's vital responsibilities is to drive data culture across the organization. The blog narrates the importance of focusing on the data pipeline's bottlenecks to accelerate the data journey and minimize people-scaling problems.
https://fivetran.com/blog/scaling-data-culture-is-a-marathon-not-a-sprint
New York Times: An Update to Our SQL Interviews
NYT writes an exciting blog about the pros and cons of whiteboarding vs. online coding vs. take-home interview formats for data analyst roles. NYT's adoption of a hybrid interview process is an interesting approach to read about.
https://open.nytimes.com/an-update-to-our-sql-interviews-cf39dafeddcf
Tiqets Engineering: Taming the Dependency Hell with dbt
Model-based dependency management is powerful for data pipeline workloads, and dbt makes it the default mode. The blog narrates the challenges of maintaining views without a tool like dbt, how dbt simplified the problem, and some of the pain points of running dbt in production.
https://medium.com/tiqets-tech/taming-the-dependency-hell-with-dbt-2491771a11be
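As a loose illustration of what model-based dependency management buys you: dbt infers a DAG from ref() calls and runs models in topological order. The sketch below, in plain Python with hypothetical model names, shows that resolution step; it is not dbt's implementation.

```python
# A rough illustration (not dbt's code) of resolving model dependencies
# into an execution order, the way dbt orders models referenced via ref().
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical models mapped to the models they ref().
models = {
    "stg_orders": set(),
    "stg_customers": set(),
    "fct_orders": {"stg_orders", "stg_customers"},
    "agg_daily_revenue": {"fct_orders"},
}

run_order = list(TopologicalSorter(models).static_order())
print(run_order)  # staging models first, aggregates last
```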
Distributed Computing with Ray: Executing a distributed shuffle without a MapReduce system
Ray provides simple primitives for building and running distributed applications. Distributed data computation algorithms rely on efficient data shuffling. The blog narrates how Ray simplifies data shuffling without the need for MapReduce frameworks.
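Here is a minimal sketch of the map-side-partition / reduce-side-gather pattern the post describes, expressed as Ray tasks. It is a toy word-count-style sum under assumed block sizes and reducer counts, not the post's actual code.

```python
# A minimal sketch of a distributed shuffle with Ray tasks: each mapper
# hash-partitions its block, each reducer gathers its partition from all mappers.
import zlib
from collections import defaultdict

import ray

NUM_REDUCERS = 2

@ray.remote(num_returns=NUM_REDUCERS)
def map_block(block):
    # Hash-partition this block's (key, value) pairs, one output per reducer.
    # zlib.crc32 is used because Python's built-in hash() is salted per process.
    parts = [[] for _ in range(NUM_REDUCERS)]
    for key, value in block:
        parts[zlib.crc32(key.encode()) % NUM_REDUCERS].append((key, value))
    return tuple(parts)

@ray.remote
def reduce_parts(*parts):
    # Merge the matching partition from every mapper and sum values per key.
    totals = defaultdict(int)
    for part in parts:
        for key, value in part:
            totals[key] += value
    return dict(totals)

ray.init()
blocks = [[("a", 1), ("b", 2)], [("a", 3), ("c", 5)], [("b", 4)]]
map_out = [map_block.remote(block) for block in blocks]
results = ray.get([
    reduce_parts.remote(*[refs[i] for refs in map_out])
    for i in range(NUM_REDUCERS)
])
print(results)
```

Because mapper outputs live in Ray's object store, reducers pull only the partition they own, which is the shuffle behavior a MapReduce framework would otherwise provide.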
Adaltas: Storage size and generation time in popular file formats
Object storage has become the default persistence layer for the data lake, so choosing an efficient file format is equally important. The blog does an excellent job comparing various file formats and concludes that ORC provides the most effective storage optimization.
https://medium.com/adaltas/storage-size-and-generation-time-in-popular-file-formats-48a23190c1da
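A quick way to reproduce this kind of comparison yourself is sketched below with PySpark. The dataset, format list, and output paths are assumptions; the post's exact benchmark setup differs.

```python
# A rough sketch for comparing on-disk size of the same data across formats.
import os
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("format-size-check").getOrCreate()
df = spark.range(1_000_000).withColumn(
    "label", F.concat(F.lit("row_"), F.col("id").cast("string"))
)

def dir_size(path):
    # Sum the size of every file Spark wrote under the output directory.
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, names in os.walk(path)
        for name in names
    )

for fmt in ["csv", "json", "parquet", "orc"]:
    out = f"/tmp/format_bench/{fmt}"  # hypothetical local output path
    df.write.mode("overwrite").format(fmt).save(out)
    print(fmt, dir_size(out) // 1024, "KiB")
```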
Anindya Saha: Nested Attributes & Functions Operating on Nested Types in PySpark
Nested data structures are the norm in data analytics and help minimize the need for complex normalization. Arrays and maps are the most common nested structures. The author walks through how to handle nested data types in PySpark.
https://anindyacs.medium.com/working-with-nested-data-types-7d1228c09903
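For a taste of the techniques covered, here is a small self-contained PySpark sketch over a hypothetical schema, exercising an array column and a map column:

```python
# A small sketch of handling array and map columns in PySpark (hypothetical data).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("nested-types").getOrCreate()
df = spark.createDataFrame(
    [("alice", ["spark", "python"], {"clicks": 3}),
     ("bob", ["sql"], {"clicks": 7})],
    ["user", "tags", "metrics"],
)

# Flatten the array: one output row per tag.
df.select("user", F.explode("tags").alias("tag")).show()

# Pull a value out of the map and apply a higher-order function to the array.
df.select(
    "user",
    F.col("metrics")["clicks"].alias("clicks"),
    F.expr("transform(tags, t -> upper(t))").alias("tags_upper"),
).show()
```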
DataCamp Engineering Blog: Data Scientists, don't worry about data engineering - Viewflow has your back.
In a complex data pipeline, finding all the upstream dependencies is a tedious job that often results in hacky code searches. Data lineage can ease dependency discovery, yet it requires multiple navigations. Viewflow takes an exciting approach: it automatically generates a task's internal and external dependencies, acting as code-gen for Airflow.
https://medium.com/datacamp-engineering/viewflow-fe07353fa068
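Viewflow's core trick is inferring a task's dependencies from the view's own code. As a loose, hypothetical illustration of that idea (not Viewflow's actual parser), one could extract a SQL view's upstream tables with a naive regex:

```python
# A toy illustration (not Viewflow's implementation) of inferring a view's
# upstream dependencies from the tables its SQL references.
import re

sql = """
SELECT u.user_id, COUNT(*) AS order_count
FROM analytics.users u
JOIN analytics.orders o ON o.user_id = u.user_id
GROUP BY u.user_id
"""

# Naive pattern: capture identifiers following FROM/JOIN keywords.
deps = set(re.findall(r"\b(?:FROM|JOIN)\s+([\w.]+)", sql, flags=re.IGNORECASE))
print(deps)  # {'analytics.users', 'analytics.orders'}
```

A production tool would use a real SQL parser rather than a regex, but the dependency-inference idea is the same.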
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.