Data Engineering Weekly #9

Weekly data engineering newsletter

Welcome to the 9th edition of the data engineering newsletter. This week's edition features a new set of articles on the COVID-19 effect on ML models, combating disinformation, analytics at Netflix, data quality libraries, and Apache Pinot, from Microsoft, Netflix, DoorDash, Capital One & Databricks.

Disinformation is widespread, and as data professionals, it is our ethical responsibility to think about and help prevent it. Deepfakes are one AI technique that can spread disinformation. Microsoft announces Video Authenticator to combat deepfake videos.

COVID-19 significantly changed habits and patterns across many business applications. That volatility brings a new set of challenges for machine learning models. DoorDash witnessed an extreme surge in demand and writes a blog post narrating how it retrained its ML models to accommodate the changing business dynamics.

Netflix writes an excellent post on analytics at Netflix and the purpose of the analytics role there. The article provides an excellent narrative on the difference between the data analyst and the data engineer.

People in data science and engineering are highly connected to the business, solve end-to-end problems, and are directly responsible for improving business outcomes. But what makes this group shine are their differences. They come from many backgrounds, which yields different perspectives on how to approach problems. Netflix shares its data science & engineering members' stories and their career paths.

Apache Pinot is a real-time distributed datastore, built to deliver scalable real-time analytics with low latency. The blog post narrates how to run real-time climate analysis on Apache Pinot using the National Centers for Environmental Information (NCEI) dataset.

A typical analytics development lifecycle goes Ingest -> Build a model -> Deploy -> Monitor. The focus often falls on building the model rather than on deployment, even though deployment is what produces the business value. The article narrates various strategies for deploying analytical workloads.

Data infrastructure relies on a complex chain of data pipelines. As data flows through the pipeline, the familiar adage applies: "Garbage In, Garbage Out." Hence, data quality is an integral part of the data pipeline. The blog post compares the top data quality frameworks available: TensorFlow Data Validation, Great Expectations, and Deequ.
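All three frameworks share the same declarative idea: you state expectations about columns, then validate each batch of data against them. Here is a minimal, framework-free sketch of that pattern in plain Python (the column names and data are invented for illustration):

```python
# Declarative data quality checks, in the style of Great Expectations / Deequ:
# each expectation pairs a human-readable name with a row-level predicate,
# and validation reports how many rows failed each expectation.

def expect_not_null(column):
    return (f"{column} is not null",
            lambda row: row.get(column) is not None)

def expect_in_range(column, low, high):
    return (f"{column} in [{low}, {high}]",
            lambda row: row.get(column) is not None and low <= row[column] <= high)

def validate(rows, expectations):
    """Return a dict of expectation name -> number of failing rows."""
    failures = {}
    for name, predicate in expectations:
        bad = sum(1 for row in rows if not predicate(row))
        if bad:
            failures[name] = bad
    return failures

# Hypothetical sensor readings with two quality problems.
rows = [
    {"station": "NYC", "temp_c": 21.5},
    {"station": "LAX", "temp_c": None},   # missing reading
    {"station": "ORD", "temp_c": 999.0},  # sensor glitch
]

checks = [expect_not_null("temp_c"), expect_in_range("temp_c", -60, 60)]
print(validate(rows, checks))
# -> {'temp_c is not null': 1, 'temp_c in [-60, 60]': 2}
```

The real libraries add what this sketch omits: rich built-in expectation catalogs, profiling to suggest checks, and reporting/alerting hooks that plug into the pipeline itself.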

Metadata search and discovery of datasets is a critical part of the data infrastructure for democratizing data. The blog post surveys some of the open-source metadata hub tools and deep dives into LinkedIn's DataHub and Lyft's Amundsen.

Testing the correctness of an async data pipeline is always challenging. The Disney Streaming team writes about weaver-test, an open-source Scala test framework, for testing Kafka & Kinesis streams.

Dwelo writes about its data infrastructure, and it is exciting to see containerization, dbt, and cloud storage becoming the standard tooling for building data infrastructure.

Data reveals hidden socioeconomic truths. In 2015, the United Nations adopted 17 Sustainable Development Goals (SDGs), a universal call to action to end poverty, protect the planet, and ensure that all people enjoy peace and prosperity by 2030. The blog post narrates how enriching the SDG indicator with OpenStreetMap geolocation data brings more insight into the average share of built-up area that is open space for public use.

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.