Welcome to the 27th edition of the data engineering newsletter. This week's release is a new set of articles that focus on decentralized content moderation, Kafka as a database, Snowflake's external tables, Dagster 0.10.0, Uber's real-time data intelligence platform, Dropbox's Superset adoption, Cloudflare's data center operations using Airflow, Apache Hudi's clustering, and Trainline's data lake.
Martin Kleppmann:
Decentralized content moderation
January 2021 has been an eventful month, bringing a lot of debate over censorship and content moderation by social media. People have gossiped and spread misinformation for centuries, but the impact was limited to a local context. Twitter and Facebook created a Cerebro for misinformation. The author summarizes the need to rethink content moderation, moving from centralized, subjective moderation to democratic, decentralized moderation. It is an exciting space to watch how data infrastructure can evolve to improve content moderation.
https://martin.kleppmann.com/2021/01/13/decentralised-content-moderation.html
Facebook’s Fighting Abuse @Scale 2019 conference contains some exciting talks on the same topic.
https://engineering.fb.com/2019/12/13/security/fighting-abuse-scale-2019/
David Xiang:
Kafka As A Database? Yes Or No
Apache Kafka is a vital component of modern data infrastructure. Is Kafka a database? It is a hot debate that shapes the future of streaming technology. The author summarizes the merits and demerits of treating Kafka as a database. One of the conventional arguments for Kafka is that it supports a read/write separation of concerns with a write-once, multi-model read pattern. At the same time, maintaining data integrity and multi-model materialization is not cheap and can further complicate the system design. Nonetheless, it is exciting to watch the evolution of streaming databases.
https://davidxiang.com/2021/01/10/kafka-as-a-database/
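The write-once, multi-model read pattern can be sketched in a few lines of plain Python. This is a toy illustration only, not Kafka's API: `EventLog`, `latest_by_key`, and `count_by_key` are hypothetical names, and an in-memory list stands in for a topic partition.

```python
from collections import defaultdict


class EventLog:
    """Append-only list standing in for one topic partition (assumption)."""

    def __init__(self):
        self._records = []

    def append(self, key, value):
        # Records are written once and never updated in place.
        self._records.append((key, value))
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset=0):
        # Any number of readers can replay the log from any offset.
        return self._records[offset:]


def latest_by_key(log):
    """Read model 1: key-value view holding the last value per key."""
    view = {}
    for key, value in log.read_from():
        view[key] = value
    return view


def count_by_key(log):
    """Read model 2: number of events seen per key."""
    counts = defaultdict(int)
    for key, _ in log.read_from():
        counts[key] += 1
    return dict(counts)


log = EventLog()
log.append("user-1", {"plan": "free"})
log.append("user-2", {"plan": "pro"})
log.append("user-1", {"plan": "pro"})

print(latest_by_key(log))
print(count_by_key(log))
```

Both views are derived from the same log, which is the appeal of the pattern. The cost shows up in keeping every derived view consistent and up to date, which is the integrity and materialization overhead mentioned above.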
Snowflake:
External Tables Are Now Generally Available On Snowflake
Cloud storage services like AWS S3, Azure Data Lake Storage, and Google Cloud Storage are popular choices for data lake systems. Snowflake, the well-known cloud data warehouse, introduced external tables that enable Snowflake to query data in cloud storage. Snowflake also supports streaming ingestion for external datasets, similar to Apache Hudi & Delta Lake. Presto has played the role of the federated query engine that unifies querying across data lake and cloud data warehouse systems, so it is a significant development for Snowflake to provide a native implementation.
https://www.snowflake.com/blog/external-tables-are-now-generally-available-on-snowflake/
Dagster:
Dagster 0.10.0: The Edge of Glory
Dagster released version 0.10.0, codenamed "The Edge of Glory." It's exciting to see Dagster's focus on a native scheduler instead of relying on cron or Kubernetes, support for sensors, tight integration with Kubernetes, and an I/O manager abstraction that simplifies the development and testing phases of pipeline development.
https://dagster.io/blog/dagster-0-10-0-the-edge-of-glory
Uber:
Uber’s Real-time Data Intelligence Platform At Scale: Improving Gairos Scalability/Reliability
Uber writes about Gairos, its real-time data processing, storage, and querying platform to facilitate streamlined and efficient data exploration at scale. The total size of queryable data served by Gairos is 1,500+ TB, and the number of production pipelines is over 30. The total number of records is more than 4.5 trillion, and the total number of clusters is over 20. Over 1 million events flow into Gairos every second. The Gairos Optimization Engine is an exciting implementation that self-tunes Elasticsearch and the ingestion pipeline.
https://eng.uber.com/gairos-scalability/
Dropbox:
Why we chose Apache Superset as our data exploration platform
Apache Superset is now a top-level Apache project. Dropbox writes about why it chose Apache Superset over competing visualization tools like Redash, Mode, and Periscope.
https://dropbox.tech/application/why-we-chose-apache-superset-as-our-data-exploration-platform
Cloudflare:
Automating data center expansions with Airflow
Infrastructure operations and maintenance tasks are often scheduled as cron jobs. However, cron has its limitations, and orchestration engines like Airflow provide a much more efficient scheduler for non-time-sensitive tasks. Cloudflare writes an excellent blog on using Apache Airflow for data center operations.
https://blog.cloudflare.com/automating-data-center-expansions-with-airflow/
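One thing an orchestrator adds over cron is an explicit notion of task dependencies. The sketch below uses only Python's standard library `graphlib` (Python 3.9+) to show the idea; it is not Airflow code and not Cloudflare's workflow, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks that must finish before it starts.
# All task names here are made up for illustration.
dependencies = {
    "provision_servers": set(),
    "configure_network": {"provision_servers"},
    "run_health_checks": {"configure_network"},
    "enable_traffic": {"run_health_checks"},
}

# static_order() yields the tasks so that every dependency comes first.
order = list(TopologicalSorter(dependencies).static_order())

for step, task in enumerate(order, start=1):
    print(step, task)
```

With cron, the same ordering has to be approximated by spacing jobs out in time and hoping each one finishes before the next starts. A dependency graph lets the scheduler start a task as soon as its upstream tasks succeed.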
Apache Hudi:
Optimize Data Lake layout using Clustering in Apache Hudi
Small files are a classic problem in data infrastructure, with an inherent impact on query performance. Apache Hudi introduced a pluggable clustering architecture to handle small files and colocate related data to improve query efficiency.
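To make the small-file idea concrete, here is a minimal sketch of grouping small files into larger rewrite units. It is a simple greedy packing written for illustration, not Hudi's actual clustering strategy; `plan_clustering`, the thresholds, and the file names are all assumptions.

```python
def plan_clustering(file_sizes_mb, small_file_limit_mb=100, target_group_mb=256):
    """Greedily pack files below the small-file limit into groups
    whose combined size stays at or under the target group size."""
    small_files = sorted(
        (item for item in file_sizes_mb.items() if item[1] < small_file_limit_mb),
        key=lambda item: item[1],
        reverse=True,  # place the largest small files first
    )
    groups = []
    for name, size in small_files:
        for group in groups:
            if group["size_mb"] + size <= target_group_mb:
                group["files"].append(name)
                group["size_mb"] += size
                break
        else:
            # No existing group has room, so start a new one.
            groups.append({"files": [name], "size_mb": size})
    return groups


# Hypothetical file listing: name -> size in MB.
files = {
    "a.parquet": 90,
    "b.parquet": 80,
    "c.parquet": 60,
    "d.parquet": 40,
    "e.parquet": 512,  # already large, left untouched
}

groups = plan_clustering(files)
for group in groups:
    print(group["size_mb"], group["files"])
```

Each group would then be rewritten as one larger file. A real clustering plan would typically also decide how to order records inside each rewritten file so that related data ends up together.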
Trainline:
Building a data lake: from batch to real-time using Kafka
Trainline writes about its data pipeline evolution. It's exciting to see a similar data ingestion maturity model, progressing from API integration to batch processing to real-time data ingestion systems.
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.