Data Engineering Weekly #5

Weekly data engineering newsletter

Welcome to the fifth edition of the data engineering newsletter. This week's edition brings a new set of articles on DBT, testing, production deployment of ML infrastructure, and AI economics, along with ML applications from Reddit, Farfetch, Pinterest, Google Cloud, and Lyft.


DBT is capturing the hearts of data engineers and becoming an essential tool for building data pipelines. I recently tweeted my thoughts on DBT and why it's groundbreaking.

The article echoes a similar sentiment. The author describes some of the drawbacks of SQL, chiefly its looseness, and how DBT addresses these concerns to build complex, testable SQL pipelines.

https://highgrowthengineering.substack.com/p/why-is-dbt-so-important-


Data workloads are quite different from, and more complicated than, traditional request/response applications. The data world missed out on some DevOps practices because it's expensive to maintain a parallel data pipeline. DataOps is an attempt to bridge that gap. The article gives a good overview of how DevOps and DataOps practices can complement each other.

https://www.infoq.com/articles/dataops-devops-scale-speed/


a16z writes about the challenges of building successful AI companies. The cost of compute, the need for a human in the loop, and scaling problems are the standout concerns. These challenges also create opportunities to innovate, and the second article focuses on some of the best practices for developing successful AI applications.

https://a16z.com/2020/02/16/the-new-business-of-ai-and-how-its-different-from-traditional-software/

https://a16z.com/2020/08/12/taming-the-tail-adventures-in-improving-ai-economics/


It is always tempting to think of every data application as an ML-driven application. If you have a finite set of inputs and a known output for each input, you don't have an ML problem. Capital One writes an interesting article on this topic, comparing the pros and cons of rule engines and ML models.

https://medium.com/capital-one-tech/a-modern-dilemma-when-to-use-rules-vs-machine-learning-61cc908769b0
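
As a minimal sketch of the "finite input, known output" case, a plain lookup is all that is needed. The shipping tiers and fees here are hypothetical and are not taken from the Capital One article.

```python
# Hypothetical example: the set of tiers is finite and every output is known,
# so a lookup table is enough and there is nothing for a model to learn.
SHIPPING_FEE_BY_TIER = {"standard": 5.0, "express": 15.0, "overnight": 30.0}


def shipping_fee(tier: str) -> float:
    """Return the fee for a known tier and reject anything outside the finite input set."""
    try:
        return SHIPPING_FEE_BY_TIER[tier]
    except KeyError:
        raise ValueError(f"unknown tier: {tier!r}") from None
```

A rule like this is easy to audit and test. ML starts to earn its cost only when the inputs are too varied to enumerate.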


Reddit writes about the design of its reporting service for monitoring the effectiveness of ad campaigns. The pragmatic approach of using AWS services, with Kubernetes cron jobs as the scheduler, is a good reminder of how to use cloud services effectively.

https://redditblog.com/2020/07/21/building-scheduled-reports-for-ad-campaigns/


GPS data is often noisy and does not match the real world. Lyft writes about its map-matching algorithm in detail and explains how it overcomes some of the limitations of Hidden Markov Models.

https://eng.lyft.com/a-new-real-time-map-matching-algorithm-at-lyft-da593ab7b006


FarfetchTech writes about its email recommendation engine. The exciting part of the article is not how to generate recommendations, but when to generate them. The ideal time to generate a recommendation for an email is the moment the email is opened. Any earlier and the recommendation starts getting stale; any later and the email is missing content.

https://www.farfetchtechblog.com/en/blog/post/recommendations-in-emails-pretty-close-to-rocket-surgery/
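
The staleness argument can be illustrated with a small sketch. It assumes, purely for illustration, that stock levels are what change between send time and open time. The item names and the `in_stock_recommendations` helper are hypothetical and do not come from the Farfetch article.

```python
def in_stock_recommendations(ranked_items, stock, limit=2):
    """Return the top-ranked items that are in stock at the moment of the call."""
    return [item for item in ranked_items if stock.get(item, 0) > 0][:limit]


ranked = ["coat", "boots", "scarf"]
stock_at_send = {"coat": 3, "boots": 1, "scarf": 5}
# Assumption: the coat sells out between sending the email and opening it.
stock_at_open = {"coat": 0, "boots": 1, "scarf": 5}

baked_in_at_send = in_stock_recommendations(ranked, stock_at_send)
generated_at_open = in_stock_recommendations(ranked, stock_at_open)
```

The list baked in at send time still contains the sold-out coat, while the list generated at open time reflects what the reader can actually buy.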


Pinterest writes about how its visual skin tone model powers search and recommendations. It's exciting to see how Pinterest focuses on eliminating biases and diversifying datasets to ensure the system is inclusive.

https://medium.com/pinterest-engineering/powering-inclusive-search-recommendations-with-our-new-visual-skin-tone-model-1d3ba6eeffc7


Change data capture (CDC) is a set of techniques for identifying and exposing changes made to a database. Bolt writes about its CDC pipeline built on Debezium, Kafka, and Kafka Connect.

https://www.confluent.io/blog/how-bolt-adopted-cdc-with-confluent-for-real-time-data-and-analytics/
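
To make the idea of "identifying and exposing changes" concrete, here is a minimal sketch of snapshot diffing, one of the simplest CDC techniques. It is not how Bolt's pipeline works: Debezium reads the database's transaction log rather than comparing snapshots. The event field names below are illustrative assumptions.

```python
from typing import Any

Row = dict[str, Any]


def diff_snapshots(before: dict[int, Row], after: dict[int, Row]) -> list[dict[str, Any]]:
    """Compare two table snapshots keyed by primary key and return change events."""
    events = []
    for key in sorted(before.keys() | after.keys()):
        old, new = before.get(key), after.get(key)
        if old is None:
            # Key exists only in the newer snapshot.
            events.append({"op": "insert", "key": key, "before": None, "after": new})
        elif new is None:
            # Key exists only in the older snapshot.
            events.append({"op": "delete", "key": key, "before": old, "after": None})
        elif old != new:
            # Key exists in both, but the row contents differ.
            events.append({"op": "update", "key": key, "before": old, "after": new})
    return events
```

Snapshot diffing misses intermediate changes between two snapshots and gets expensive on large tables, which is one reason log-based tools are popular.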


Traditionally, data pipelines have centered on JVM languages and data science workloads on Python. Google Cloud writes about Dataflow Runner v2, a more efficient and portable worker architecture rewritten in C++ and based on Apache Beam’s new portability framework. The worker architecture provides a standard feature set across all language-specific SDKs and shares bug fixes and performance improvements among them.

https://cloud.google.com/blog/products/data-analytics/multi-language-sdks-for-building-cloud-pipelines


The official Flink Docker image has crossed 50 million downloads. Flink on Docker is often the preferred deployment model. The article describes the current state of Flink's Docker support and how to get started.

https://flink.apache.org/news/2020/08/20/flink-docker.html


The biggest challenge of an ML application is not the algorithm but putting ML systems into production. MLflow from Databricks, Google's TFX, Uber's Michelangelo, Facebook's FBLearner Flow, Microsoft's AI Lab, Amazon's Amazon ML, and Airbnb's BigHead are some of the systems that attempt to streamline running ML applications in production. The article walks through the stages of ML deployment, from portability, CI/CD, and deployment strategy to monitoring the production system.

https://medium.com/swlh/productionizing-machine-learning-models-bb7f018f8122


Machine learning systems are trickier to test since we're not explicitly writing the logic of the system. However, automated testing is still an essential tool for developing high-quality software systems. The article explains the difference between model evaluation and model testing and shows how to write model tests.

https://www.jeremyjordan.me/testing-ml/amp/
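
The distinction can be sketched in a few lines. Evaluation produces one aggregate metric over labeled data, while a model test asserts a specific expected behavior. The word-counting "model" below is a toy stand-in, and the function names are hypothetical rather than taken from the article.

```python
POSITIVE = {"great", "good", "love"}
NEGATIVE = {"bad", "awful", "hate"}


def predict_sentiment(text: str) -> str:
    """Toy stand-in for a trained model: counts positive and negative words."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score >= 0 else "negative"


def accuracy(examples) -> float:
    """Model evaluation: a single aggregate metric over a labeled validation set."""
    correct = sum(predict_sentiment(text) == label for text, label in examples)
    return correct / len(examples)


def test_invariance_to_names() -> None:
    """Model test: swapping a person's name should not change the prediction."""
    template = "{} had a great flight"
    predictions = {predict_sentiment(template.format(name)) for name in ("Mark", "Priya", "Chen")}
    assert len(predictions) == 1
```

A model can score well on accuracy and still fail a behavioral test like this one, which is why the two are worth keeping separate.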


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.