Welcome to the 16th edition of the data engineering newsletter. This week's edition features articles on data quality at Airbnb, data version control tools, Spotify's experimentation framework, Uber's proposal for a remote shuffle service for Spark, data pipeline health with Great Expectations, the Dagster-dbt integration, and why documentation matters for ML.
Airbnb writes an excellent post about data quality and how to structure an organization to maximize data engineering efficiency. The post highlights a common startup mistake, neglecting data quality while scaling, and the impact that has on the organization.
https://medium.com/airbnb-engineering/data-quality-at-airbnb-e582465f3ef7
The functional programming model is widely adopted in data engineering. A key characteristic of a pure function is that it always returns the same output for the same input. Data versioning is a common technique for bringing that pure-function property to data pipelines: a task that reads a pinned version of a dataset produces reproducible results. The article compares the currently available data version control tools.
https://dagshub.com/blog/data-version-control-tools/
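To illustrate the pure-function idea without tying it to any specific tool from the comparison, here is a minimal sketch: the transformation reads an immutable, versioned input path, so re-running it with the same version always yields the same output. The directory layout and the transformation itself are illustrative assumptions.

```python
from pathlib import Path
import pandas as pd

# Hypothetical layout: every dataset version is written once and never mutated.
DATA_ROOT = Path("/data/events")

def load_events(version: str) -> pd.DataFrame:
    """Read a pinned, immutable version of the dataset.

    Because the input for a given version never changes, this read is
    deterministic -- the same version always produces the same frame.
    """
    return pd.read_parquet(DATA_ROOT / f"version={version}" / "events.parquet")

def daily_counts(version: str) -> pd.DataFrame:
    """A 'pure' transformation: same dataset version in, same counts out."""
    events = load_events(version)
    return events.groupby("event_date").size().reset_index(name="count")

# Re-running with the same version is reproducible by construction.
# counts = daily_counts("2020-11-01")
```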
Spotify writes the second part of its experimentation platform series. The article describes how users are assigned to experiments, how results are analyzed, and how test integrity is ensured.
https://engineering.atspotify.com/2020/11/02/spotifys-new-experimentation-platform-part-2/
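The assignment step is worth making concrete. As a generic sketch (not Spotify's actual implementation), deterministic hash-based bucketing is a common way to assign users to variants without storing any assignment state; the salt and bucket count below are illustrative assumptions.

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str,
                   variants: list, buckets: int = 1000) -> str:
    """Deterministically map a user to a variant.

    Hashing (salt, user_id) means the same user always lands in the same
    bucket for a given experiment, and different experiments (different
    salts) produce independent assignments.
    """
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % buckets
    # Split the bucket space evenly across variants.
    return variants[bucket * len(variants) // buckets]

# Example: a 50/50 control/treatment split.
# assign_variant("user-42", "homepage-redesign-v2", ["control", "treatment"])
```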
Data shuffling is the performance bottleneck for most Spark jobs. With network I/O throughput improving over time, Uber writes about its proposal to build an external (remote) shuffle service. The service accepts shuffle data from the shuffle writer, persists it, and serves it to the shuffle reader. I wonder if Kafka can emerge as a shuffle service provider.
https://github.com/uber/RemoteShuffleService/blob/master/docs/server-high-level-design.md
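To make the writer/reader split concrete, here is a toy, in-memory sketch of the idea, nothing like Uber's actual server design: map tasks push partitioned records to a remote service keyed by (shuffle id, partition id), and reduce tasks fetch whole partitions from that service instead of pulling blocks from every executor's local disk. All names here are hypothetical.

```python
from collections import defaultdict

class RemoteShuffleStore:
    """Toy stand-in for a remote shuffle service: accepts shuffle data from
    writers, persists it (here just in memory), and serves it to readers."""

    def __init__(self):
        # (shuffle_id, partition_id) -> list of records
        self._partitions = defaultdict(list)

    def write(self, shuffle_id: int, partition_id: int, records: list) -> None:
        # Map side: the shuffle writer streams records for one partition.
        self._partitions[(shuffle_id, partition_id)].extend(records)

    def read(self, shuffle_id: int, partition_id: int) -> list:
        # Reduce side: the shuffle reader fetches the whole partition in one call.
        return self._partitions[(shuffle_id, partition_id)]

# store = RemoteShuffleStore()
# store.write(shuffle_id=1, partition_id=0, records=[("user-1", 3), ("user-2", 5)])
# store.read(shuffle_id=1, partition_id=0)
```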
GitHub writes about how to integrate Great Expectations with GitHub Actions. Although the cost of running tests against production data and the obvious security implications are still a concern, the concept sounds very exciting.
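For flavor, a minimal Great Expectations check of the kind such a CI job might run, using the legacy Pandas dataset API; the file path, column names, and thresholds are illustrative assumptions.

```python
import great_expectations as ge

# Load a sample of the data as a Great Expectations dataset (legacy Pandas API).
orders = ge.read_csv("data/orders_sample.csv")

# Declare a couple of expectations about the data.
orders.expect_column_values_to_not_be_null("order_id")
orders.expect_column_values_to_be_between("amount", min_value=0, max_value=100000)

# Validate and fail the CI job if any expectation is not met.
results = orders.validate()
if not results.success:
    raise SystemExit("Great Expectations validation failed")
```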
Dagster is growing rapidly, with some impressive new features in the data orchestration layer. The dbt and Dagster integration is an exciting read.
https://dagster.io/blog/dagster-dbt
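The post covers the dedicated dagster_dbt integration; as a simpler, hedged sketch of the general idea, a Dagster solid (the solid/pipeline API of that era) can shell out to the dbt CLI so that a dbt run becomes one step of an orchestrated pipeline. The project path is an assumption.

```python
import subprocess
from dagster import solid, pipeline, execute_pipeline

@solid
def run_dbt_models(context):
    """Invoke `dbt run` as a pipeline step (a simplified stand-in for the
    dedicated dagster_dbt integration described in the post)."""
    result = subprocess.run(
        ["dbt", "run", "--project-dir", "analytics/"],  # path is illustrative
        capture_output=True, text=True,
    )
    context.log.info(result.stdout)
    if result.returncode != 0:
        raise Exception(f"dbt run failed:\n{result.stderr}")

@pipeline
def dbt_pipeline():
    run_dbt_models()

# if __name__ == "__main__":
#     execute_pipeline(dbt_pipeline)
```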
dbt tags are an impressive addition to the pipeline. Tags can be added at the model (table) level or the column level, and test suites can be executed for the associated tags. The article walks through how to use dbt tags and their scopes.
https://yu-ishikawa.medium.com/understanding-the-scopes-of-dbt-tags-691d0286f3aa
Allegro Tech writes about BigFlow, its Python framework for building data processing pipelines on GCP. BigFlow is a toolkit developed by Allegro Tech for data processing on Google Cloud.
https://allegro.tech/2020/11/bigflow-a-python-framework-for-data-processing-on-gcp.html
Who created the data? What is in the data? When was the data created? How can it be accessed? Data discoverability is the aspect of data engineering that answers all of these questions. Documentation and metadata attached to datasets are essential for reliable data infrastructure. The article is an excellent narration of why documentation matters for machine learning.
https://medium.com/df-foundation/why-data-documentation-matters-for-machine-learning-d2265b76fe
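A tiny illustration of the point (my own sketch, not from the article): even a minimal, structured metadata record attached to a dataset answers the who/what/when/how questions above.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DatasetDoc:
    """Minimal dataset documentation: enough to answer who created the data,
    what is in it, when it was created, and how to access it."""
    name: str
    owner: str                     # who created / maintains the data
    description: str               # what is in the data
    created: date                  # when the data was created
    location: str                  # how to access the data
    columns: dict = field(default_factory=dict)  # column name -> description

# Hypothetical example record.
user_events_doc = DatasetDoc(
    name="user_events",
    owner="data-platform@example.com",
    description="Clickstream events collected from the web application.",
    created=date(2020, 11, 1),
    location="bigquery://analytics.user_events",
    columns={"user_id": "Anonymized user identifier",
             "event_ts": "Event timestamp (UTC)"},
)
```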
Capacity planning is a critical part of scaling a system. Cloudflare writes about how it monitors disk size and bytes consumed to improve capacity planning for its ClickHouse clusters.
https://blog.cloudflare.com/clickhouse-capacity-estimation-framework/
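This is not Cloudflare's framework, but for orientation: ClickHouse exposes per-part disk usage in the system.parts table, so a small script like the sketch below (using the clickhouse-driver client, with an illustrative hostname) can collect bytes-on-disk per table as an input to capacity tracking.

```python
from clickhouse_driver import Client

# Hypothetical connection details; substitute your own cluster.
client = Client(host="clickhouse.internal")

# system.parts tracks every active data part, including its size on disk.
rows = client.execute(
    """
    SELECT database, table, sum(bytes_on_disk) AS bytes
    FROM system.parts
    WHERE active
    GROUP BY database, table
    ORDER BY bytes DESC
    """
)

for database, table, bytes_on_disk in rows:
    print(f"{database}.{table}: {bytes_on_disk / 1024**3:.1f} GiB")
```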
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.