Welcome to the 17th edition of the data engineering newsletter. This week's release is a new set of articles that focus on Databook at Uber, a Flink Forward recap, LinkedIn's new JVM ML library, the PyTorch and MLflow integration, a data engineer's take on Postgres, ClickHouse vs. Redshift, and Picnic's data warehouse design.
Uber continues to build on this strategy and writes an excellent blog post about Databook, Uber’s in-house platform that surfaces and manages metadata for data entities such as datasets, internal dashboards, business metrics, and more.
https://eng.uber.com/metadata-insights-databook/
Python is a popular language of choice for writing ML applications, whereas data pipelines are mostly written in JVM languages. LinkedIn writes about Dagli, its open-source ML library for Java, which aims to build an integrated data pipeline and reduce technical debt.
https://engineering.linkedin.com/blog/2020/open-sourcing-dagli
Ververica writes a recap of the recent Flink Forward Global 2020 virtual conference. The evolution of Flink SQL as a standard for unified stream and batch processing is an exciting trend to watch.
https://www.ververica.com/blog/flink-forward-global-2020-recap
PyTorch writes about its integration with MLflow as a step toward enabling an end-to-end exploration of the production platform. The blog narrates some of the essential requirements for MLOps and what is next for PyTorch to enable MLOps.
https://medium.com/pytorch/mlflow-and-pytorch-where-cutting-edge-ai-meets-mlops-1985cf8aa789
The ClickHouse OLAP engine has largely flown under the radar compared to popular OLAP alternatives such as Apache Druid and Apache Pinot. The blog post is an excellent narration of ClickHouse's capabilities, especially the window functions, and includes a performance comparison with Redshift.
http://brandonharris.io/redshift-clickhouse-time-series/
Event sourcing is a critical yet not widely discussed area in data engineering. Sourcing and standardizing events across different devices without impacting client performance, developer velocity, or security is still a challenge. Walmart writes an excellent article about its development process for sourcing and preparing event-driven data for analysis.
https://medium.com/walmartglobaltech/preparing-event-driven-data-for-analysis-3010da7416d7
"As a Data Engineer, I wish Postgres could offer these features" is an excellent, refreshing read about a data engineer's take on a database. Though the blog focuses on Postgres, features like Temporal View, Incremental View Maintenance, and In-memory tables would be fantastic to have in any data warehouse storage.
Picnic writes a detailed blog about its data warehouse system. The simplified infrastructure, with its focus on data quality, trust in the data, and data democratization, makes for a great read. The blog narrates an exciting take on dimensional modeling techniques, blending Data Vault with the Kimball methodology.
https://blog.picnic.nl/picnics-lakeless-data-warehouse-8ec02801d50b
What happens when data on S3 is removed by accident? The deletion can happen accidentally or manually, and regenerating the dataset often requires expensive pipeline orchestration. Fandom writes an excellent blog on disaster recovery for S3 datasets.
Fresenius Medical Care writes about its data lake infrastructure. Jupyter Notebook, Presto, and S3 with Parquet remain the standard, widespread choices for data democratization.
https://drdirk.medium.com/data-lake-architecture-at-fresenius-medical-care-f826536f09fe
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.