Data Engineering Weekly #60

Weekly Data Engineering Newsletter

Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.


Event: Join Impact 2021 on November 3, 2021: The First-Ever Data Observability Summit. Join Today's Leading Data Pioneers

Hear from data leaders pioneering the technologies & processes shaping data engineering. Featuring the first Chief Data Scientist of the U.S., the founder of Data Mesh, and many more!

Click To Get Your Free Ticket For All Data Engineering Weekly Readers


Google A.I.: An ML-Based Framework for COVID-19 Epidemiology

The COVID-19 pandemic has had a profound impact on daily life. Google A.I. discusses the recent paper "A prospective evaluation of AI-augmented epidemiology to forecast COVID-19 in the USA and Japan." Though the learned transition with the available data is novel, the author acknowledges that the lack of reliable, high-quality public data is a significant limitation.

https://ai.googleblog.com/2021/10/an-ml-based-framework-for-covid-19.html


Udemy: Designing the New Event Tracking System at Udemy

Udemy writes about its journey to build an event tracking system. The discussions around buy vs. build, Protobuf vs. Avro, and Avro schema annotations are exciting reads.

https://medium.com/udemy-engineering/designing-the-new-event-tracking-system-at-udemy-a45e502216fd


Pinterest: Efficient Resource Management at Pinterest’s Batch Processing Platform

Pinterest writes about efficient YARN resource management for its batch processing platform. The blog is an exciting case study of data-driven system design compared to the auto-scaling of computing instances.

https://medium.com/pinterest-engineering/efficient-resource-management-at-pinterests-batch-processing-platform-61512ad98a95


Open Metadata: Announcing OpenMetadata

OpenMetadata is an open-source project building a schema-first and API-first metadata standard: a single place to discover, collaborate, and get your data right.

Reviewer: Sriharsha Chintalapani

https://blog.open-metadata.org/announcing-openmetadata-20399b816e60

You can now submit your reviews here: https://github.com/ananthdurai/dataengineeringweekly


Salesforce: How to ETL at Petabyte-Scale with Trino

Salesforce writes about its usage of Trino as an ETL engine. Although Trino certainly has some shortcomings for ETL, such as the lack of mid-query fault tolerance and limited expressive power, it also has some highly underrated advantages. The author narrates techniques to overcome some of these shortcomings.

https://engineering.salesforce.com/how-to-etl-at-petabyte-scale-with-trino-5fe8ac134e36


Sponsored: RudderStack - Churn Prediction With BigQueryML to Increase Mobile Game Revenue

Here’s an interesting case study on how machine learning can directly impact the bottom line. RudderStack outlines how the app developer Torpedo Labs uses BigQuery ML to identify high-value mobile game players who are dangerously close to churning.

https://rudderstack.com/blog/churn-prediction-with-bigqueryml


Stitch Fix: Functions & DAGs: introducing Hamilton, a microframework for dataframe generation

Stitch Fix writes about Hamilton, a microframework for dataframe generation. Hamilton tames the complexity of long chains of dataframe transformations by treating each column as its own function. Instead of having Data Scientists write code that they subsequently execute in a massive procedural tangle, Hamilton uses how each function is defined to create a DAG and execute it for Data Scientists.
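
To make the "function definition becomes the DAG" idea concrete, here is a minimal sketch in plain Python. It is not Hamilton's implementation or API: the `resolve` helper, the column functions, and the sample inputs are all hypothetical, and lists stand in for dataframe columns. The only assumption carried over from the idea above is that a function's name is the column it produces and its parameter names are the columns it depends on.

```python
import inspect

# Hypothetical column functions. The function name is the output column;
# the parameter names are the columns it depends on.
def spend_mean(spend):
    return sum(spend) / len(spend)

def spend_zero_mean(spend, spend_mean):
    return [s - spend_mean for s in spend]

def spend_per_signup(spend, signups):
    return [s / n for s, n in zip(spend, signups)]

def resolve(name, functions, inputs, cache=None):
    """Compute column `name` by recursively resolving its parameters.

    This is a depth-first walk of the dependency DAG implied by the
    function signatures. It has no cycle detection (sketch only).
    """
    cache = {} if cache is None else cache
    if name in cache:
        return cache[name]
    if name in inputs:
        value = inputs[name]
    else:
        fn = functions[name]
        params = inspect.signature(fn).parameters
        kwargs = {p: resolve(p, functions, inputs, cache) for p in params}
        value = fn(**kwargs)
    cache[name] = value
    return value

# Illustrative inputs (made up for this sketch).
functions = {f.__name__: f for f in (spend_mean, spend_zero_mean, spend_per_signup)}
inputs = {"spend": [10.0, 20.0, 30.0], "signups": [1, 2, 5]}

zero_mean = resolve("spend_zero_mean", functions, inputs)
per_signup = resolve("spend_per_signup", functions, inputs)
```

Asking for `spend_zero_mean` pulls in `spend_mean` automatically because it appears as a parameter name; nobody writes the execution order by hand.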

https://multithreaded.stitchfix.com/blog/2021/10/14/functions-dags-hamilton/


AWS: Implement a slowly changing dimension in Amazon Redshift

Slowly changing dimensions and incremental data processing make up roughly 90% of data pipeline workload patterns. AWS writes about how to handle slowly changing dimensions (SCD) in Redshift, with best practices and anti-patterns.
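
For readers new to the pattern, here is a small sketch of one common variant, a Type 2 SCD, where a changed attribute closes the current row and opens a new one. It is plain Python rather than Redshift SQL, and it is not taken from the AWS post: the `DimRow` shape, the `apply_scd2` helper, and the sample customers are hypothetical.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class DimRow:
    customer_id: int
    city: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version

def apply_scd2(dim, incoming, as_of):
    """Apply `incoming` ({customer_id: city}) to `dim` as a Type 2 SCD.

    Changed attributes close the current row and append a new version;
    unchanged rows are kept as-is; unseen keys get a first version.
    """
    result = []
    current_keys = set()
    for row in dim:
        is_current = row.valid_to is None
        if is_current:
            current_keys.add(row.customer_id)
        new_city = incoming.get(row.customer_id)
        if is_current and new_city is not None and new_city != row.city:
            result.append(replace(row, valid_to=as_of))              # close old version
            result.append(DimRow(row.customer_id, new_city, as_of))  # open new version
        else:
            result.append(row)
    for customer_id, city in incoming.items():
        if customer_id not in current_keys:
            result.append(DimRow(customer_id, city, as_of))
    return result

# Illustrative data (made up for this sketch).
dim = [
    DimRow(1, "Austin", date(2021, 1, 1)),
    DimRow(2, "Boston", date(2021, 1, 1)),
]
updated = apply_scd2(dim, {1: "Denver", 2: "Boston", 3: "Chicago"}, date(2021, 10, 1))
current = {r.customer_id: r.city for r in updated if r.valid_to is None}
```

Customer 1 ends up with two rows (the closed "Austin" version and the current "Denver" one), customer 2 is untouched, and customer 3 is inserted as a first version.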

https://aws.amazon.com/blogs/big-data/implement-a-slowly-changing-dimension-in-amazon-redshift/


Databricks: Native Support of Session Window in Spark Structured Streaming

Excited to see that the upcoming Apache Spark 3.2 adds “session windows” as a new supported window type, which works for both streaming and batch queries. The blog walks through how to add a session window on event time.
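
As a rough illustration of what a session window on event time computes, here is a plain-Python sketch. It does not call Spark; the `sessionize` helper and the sample events are hypothetical, and the boundary choice (a new session starts only when the gap is strictly exceeded) is an assumption of this sketch rather than a statement about Spark's behavior.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def sessionize(events, gap):
    """Group (user, event_time) pairs into per-user sessions.

    A session keeps extending while the next event arrives within `gap`
    of the previous one. Returns
    {user: [(session_start, session_end, event_count), ...]},
    where session_end is the last event time plus `gap`.
    """
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions = {}
    for user, stamps in by_user.items():
        stamps.sort()
        user_sessions = []
        start = last = stamps[0]
        count = 1
        for ts in stamps[1:]:
            if ts - last > gap:
                # Gap exceeded: close the running session, start a new one.
                user_sessions.append((start, last + gap, count))
                start, count = ts, 0
            last = ts
            count += 1
        user_sessions.append((start, last + gap, count))
        sessions[user] = user_sessions
    return sessions

# Illustrative events (made up for this sketch).
t0 = datetime(2021, 10, 12, 9, 0)
minute = timedelta(minutes=1)
events = [
    ("a", t0),
    ("a", t0 + 2 * minute),
    ("a", t0 + 20 * minute),
    ("b", t0 + 1 * minute),
]
sessions = sessionize(events, gap=5 * minute)
```

Unlike tumbling or sliding windows, the window length here is not fixed: user "a" gets one session covering the first two events and a second session for the late event.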

https://databricks.com/blog/2021/10/12/native-support-of-session-window-in-spark-structured-streaming.html


HomeToGo: DBT at HomeToGo

HomeToGo writes about its adoption of dbt into its data infrastructure and its integration of dbt with Apache Airflow. The layered approach to metric computation on top of dbt models and the testing of dbt models with Great Expectations are exciting to read.

https://engineering.hometogo.com/dbt-at-hometogo-ece067987267


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.