Data Engineering Weekly #3

Weekly data engineering newsletter

Welcome to the third edition of the data engineering newsletter. This week brings a new set of articles focusing on practical learning, performance tuning, version control, and next-gen GPU stream processing.


I'm excited to read about the GPU-accelerated streaming platform this week. NVIDIA writes about cuStreamz, the first GPU-accelerated streaming data processing library. Written in Python, it is built on top of RAPIDS, NVIDIA's suite of GPU-accelerated data science libraries.
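cuStreamz builds on the streamz library's push-based pipeline model: a source emits events, and each downstream node processes them as they arrive (cuStreamz swaps in cuDF batches on the GPU). A toy sketch of that model in pure Python — all names here are illustrative, not the cuStreamz API:

```python
class Stream:
    """Toy push-based stream node, illustrating the streamz-style model."""

    def __init__(self, func=None, upstream=None):
        self.func = func
        self.downstreams = []
        if upstream is not None:
            upstream.downstreams.append(self)

    def map(self, func):
        # New downstream node that transforms each event.
        return Stream(func=func, upstream=self)

    def sink(self, func):
        # Terminal node: apply func for its side effect (e.g., collect).
        return Stream(func=func, upstream=self)

    def emit(self, x):
        # Push an event through this node and everything downstream.
        value = self.func(x) if self.func else x
        for child in self.downstreams:
            child.emit(value)


results = []
source = Stream()
source.map(lambda x: x * 2).sink(results.append)
for event in [1, 2, 3]:
    source.emit(event)
print(results)  # [2, 4, 6]
```

The real library adds batching, back-pressure, and Kafka sources on top of this core idea.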

https://medium.com/rapids-ai/gpu-accelerated-stream-processing-with-rapids-f2b725696a61


Continuing with GPU-accelerated stream processing, Apache Flink 1.11 introduces a new External Resource Framework, which allows you to request external resources from the underlying resource management systems (e.g., Kubernetes) and accelerate your workload with those resources. The blog post explains how to integrate the GPU plugin to build an end-to-end real-time AI workflow.
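The framework is driven by configuration. A sketch of the flink-conf.yaml settings for requesting one GPU per TaskManager — key names are as I recall them from the Flink 1.11 docs, so verify against the post:

```yaml
# Enable the "gpu" external resource and request one GPU per TaskManager.
external-resources: gpu
external-resource.gpu.amount: 1
# Driver that discovers GPU indices and exposes them to operators.
external-resource.gpu.driver-factory.class: org.apache.flink.externalresource.gpu.GPUDriverFactory
# On Kubernetes, map the resource to the device-plugin resource name.
external-resource.gpu.kubernetes.config-key: nvidia.com/gpu
```

Operators then read the discovered GPU information from the runtime context to schedule work onto the device.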

https://flink.apache.org/news/2020/08/06/external-resource.html


LinkedIn writes about its learnings from a recent Hadoop incident. All modern workflow schedulers support retries, but unstable infrastructure hides its resource cost behind this built-in fault tolerance. Though the article focuses on HDFS data loss, the lesson applies to all parts of the data pipeline.

https://engineering.linkedin.com/blog/2020/learnings-from-a-recent-hadoop-incident


Tencent wrote a guest post about its Apache Kafka infrastructure, which handles 10 trillion+ messages per day. Federated Kafka clusters with logical topic mapping are emerging as a design pattern for operating Kafka at scale. The proxy approach for consumers contrasts with the usual Kafka consumer SDK approach.
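To make the federation idea concrete, here is a hedged, pure-Python sketch (not Tencent's implementation) of logical topic mapping: clients address a logical topic, and a proxy layer maps it to a physical topic on one of several clusters. All names are hypothetical.

```python
import zlib

# Hypothetical mapping: logical topic -> (cluster, physical topic) shards.
LOGICAL_TOPIC_MAP = {
    "orders": [
        ("kafka-cluster-1", "orders-0"),
        ("kafka-cluster-2", "orders-1"),
    ],
}


def route(logical_topic, key):
    """Deterministically pick the (cluster, physical topic) for a message key."""
    shards = LOGICAL_TOPIC_MAP[logical_topic]
    # crc32 rather than hash() so routing is stable across processes.
    return shards[zlib.crc32(key.encode()) % len(shards)]


print(route("orders", "user-42"))
```

Because the mapping lives in the proxy, clusters can be added or drained without changing client code — which is the operational appeal of the pattern.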

https://www.confluent.io/blog/tencent-kafka-process-10-trillion-messages-per-day/


eBay writes about its Terapeak Research 2.0 platform, built on Apache Kafka and Elasticsearch. The article narrates eBay's approach to a fault-tolerant pipeline. The primary/secondary consumer pattern is something new to me.
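My reading of the pattern, sketched below as a heavily hedged illustration (names and structure are mine, not eBay's): the primary consumer handles events in real time, and events it fails on are deferred to a secondary consumer for reprocessing.

```python
# Illustrative primary/secondary consumer sketch: the primary path stays
# fast, and failures are retried out-of-band by a secondary consumer.


def primary_consume(events, process, retry_queue):
    for event in events:
        try:
            process(event)
        except Exception:
            retry_queue.append(event)  # defer to the secondary consumer


def secondary_consume(retry_queue, process):
    still_failing = []
    while retry_queue:
        event = retry_queue.pop(0)
        try:
            process(event)
        except Exception:
            still_failing.append(event)  # surface for manual inspection
    return still_failing
```

In a real deployment the "retry queue" would itself be a Kafka topic, so the secondary consumer gets the same durability guarantees as the primary.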

https://tech.ebayinc.com/engineering/terapeak-research-2-0-making-the-data-processing-pipeline-robust/


Patterns of Distributed Systems is a refreshing read on system design. Data infrastructure engineers deal with multiple distributed systems, and the article is an exciting read for approaching their design abstractly.

https://martinfowler.com/articles/patterns-of-distributed-systems/


The COVID-19 outbreak has changed the landscape of many businesses and personal lives. The Expedia data visualization group writes a fantastic article about how it monitors local restrictions to predict when customers will want to travel again, and how it tracks employees' well-being.

https://medium.com/expedia-group-tech/how-expedia-group-is-monitoring-market-recovery-during-covid-19-1ce79e4cf60d


Source-to-destination validation is an essential step in an ETL pipeline. Direct Energy rewrote over 350 SQL Server stored procedures in PySpark as part of its on-premises data warehouse migration to AWS. The article narrates Pythagoras, a data reconciliation engine built with Amazon EMR and Amazon Athena.
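The core of any reconciliation engine is comparing aggregates between source and destination. A minimal, pure-Python sketch of the idea (Pythagoras does this at scale on EMR and Athena; this in-memory version only illustrates the check):

```python
# Compare row counts and a column aggregate between a source table and
# its migrated copy; a mismatch flags the table for investigation.


def reconcile(source_rows, dest_rows, sum_column):
    report = {
        "source_count": len(source_rows),
        "dest_count": len(dest_rows),
        "source_sum": sum(row[sum_column] for row in source_rows),
        "dest_sum": sum(row[sum_column] for row in dest_rows),
    }
    report["match"] = (
        report["source_count"] == report["dest_count"]
        and report["source_sum"] == report["dest_sum"]
    )
    return report


source = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.5}]
dest = [{"id": 1, "amount": 10.0}]  # one row lost in migration
print(reconcile(source, dest, "amount")["match"])  # False
```

Real engines typically add per-column checksums and partition-level comparisons so mismatches can be localized, not just detected.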

https://aws.amazon.com/blogs/big-data/build-a-distributed-big-data-reconciliation-engine-using-amazon-emr-and-amazon-athena/


Apache Flink writes about Pandas UDF support in PyFlink. The current version supports only scalar Pandas UDFs.
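A scalar Pandas UDF receives each batch of rows as pandas Series and returns a Series of the same length, so the computation is vectorized rather than evaluated row by row. A minimal sketch of such a function body, shown standalone with plain pandas (in PyFlink it would be wrapped with the `udf(..., udf_type="pandas")` decorator described in the post):

```python
import pandas as pd

# Body of a scalar Pandas UDF: inputs arrive as pandas Series and the
# result must be a Series of the same length. In PyFlink 1.11 this
# function would be registered roughly as (per the linked post):
#   @udf(input_types=[DataTypes.BIGINT(), DataTypes.BIGINT()],
#        result_type=DataTypes.BIGINT(), udf_type="pandas")
def add(i: pd.Series, j: pd.Series) -> pd.Series:
    return i + j


print(add(pd.Series([1, 2]), pd.Series([10, 20])).tolist())  # [11, 22]
```

The batching is what makes this faster than a regular Python UDF: the serialization cost is paid per batch, not per row.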

https://flink.apache.org/2020/08/04/pyflink-pandas-udf-support-flink.html


Cost optimization is becoming mainstream engineering work in cloud infrastructure. Expedia's blog series is an exciting read on optimizing the cost of running batch workloads on Apache Spark.

https://medium.com/expedia-group-tech/part-1-cloud-spending-efficiency-guide-for-apache-spark-on-ec2-instances-79ee8814de4e

https://medium.com/expedia-group-tech/part-2-real-world-apache-spark-cost-tuning-examples-42390ee69194


Continuing with cost optimization: Amazon EC2 Spot Instances, which let you use spare EC2 capacity in the AWS Cloud, offer up to 90% savings over On-Demand Instances. When a Spot Instance is reclaimed, its intermediate shuffle data may need to move to other EC2 instances for processing to continue. In this article, Qubole writes about using Amazon FSx for Lustre, a high-performance parallel file system, to offload shuffle data to a shared file system, reducing costs and improving performance.

https://aws.amazon.com/blogs/storage/how-qubole-optimizes-cost-and-performance-by-managing-shuffle-data/


Many innovative approaches are emerging for data lifecycle management. The presentation walks through the major trends in each part of the data lifecycle, such as data pipelines, compute engines, data modeling, data products, and data quality. It misses the data discovery/data accessibility trends, though.

https://tomtunguz.com/five-data-trends-one-mega-trend-data-lifecycle/


dbt is gaining considerable momentum as the leading analytics engineering workflow tool. The article walks through five reasons why BigQuery users should adopt dbt. It focuses on BigQuery, but the reasoning applies to any SQL database.

https://medium.com/@yuu.ishikawa/5-reasons-why-bigquery-users-should-use-dbt-144f326c458a


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.