Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers
RudderStack
Provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.
Event: Join Impact 2021 on November 3, 2021: The First-Ever Data Observability Summit. Join Today's Leading Data Pioneers
Hear from data leaders pioneering the technologies & processes shaping data engineering. Featuring the first Chief Data Scientist of the U.S., the founder of Data Mesh, and many more!
Click To Get Your Free Ticket For All Data Engineering Weekly Readers
Google A.I.: An ML-Based Framework for COVID-19 Epidemiology
The COVID-19 pandemic has had a profound impact on daily life. Google A.I. discusses the recent paper "A prospective evaluation of AI-augmented epidemiology to forecast COVID-19 in the USA and Japan." Though the approach of learning the transitions from the available data is novel, the authors acknowledge that the lack of reliable, high-quality public data is a significant limitation.
https://ai.googleblog.com/2021/10/an-ml-based-framework-for-covid-19.html
Udemy: Designing the New Event Tracking System at Udemy
Udemy writes about its journey to build an event tracking system. The discussions around buy vs. build, protobuf vs. Avro, and Avro schema annotations are exciting reads.
https://medium.com/udemy-engineering/designing-the-new-event-tracking-system-at-udemy-a45e502216fd
Pinterest: Efficient Resource Management at Pinterest’s Batch Processing Platform
Pinterest writes about efficient YARN resource management for its batch processing platform. The blog is an exciting case study of data-driven system design, in contrast to simply auto-scaling compute instances.
Open Metadata: Announcing OpenMetadata
OpenMetadata is an open-source project building a schema-first and API-first metadata standard: a single place to discover, collaborate, and get your data right.
Reviewer: Sriharsha Chintalapani
https://blog.open-metadata.org/announcing-openmetadata-20399b816e60
You can now submit your reviews here: https://github.com/ananthdurai/dataengineeringweekly
Salesforce: How to ETL at Petabyte-Scale with Trino
Salesforce writes about its usage of Trino as an ETL engine. Although Trino certainly has some shortcomings for ETL, such as a lack of mid-query fault tolerance and limited expressive power, there are also some highly underrated advantages to using it. The author describes techniques to overcome some of these shortcomings.
https://engineering.salesforce.com/how-to-etl-at-petabyte-scale-with-trino-5fe8ac134e36
Sponsored: RudderStack - Churn Prediction With BigQueryML to Increase Mobile Game Revenue
Here’s an interesting case study on how machine learning can directly impact the bottom line. RudderStack outlines how app developer Torpedo Labs uses BigQuery ML to identify high-value mobile game players who are dangerously close to churning.
https://rudderstack.com/blog/churn-prediction-with-bigqueryml
Stitch Fix: Functions & DAGs: introducing Hamilton, a microframework for dataframe generation
Stitch Fix writes about Hamilton, a microframework for dataframe generation. Hamilton tackles the complexity of chained dataframe transformations, where each column is derived from other columns. Instead of having Data Scientists write code that they subsequently execute in a massive procedural tangle, Hamilton uses the way each function is defined to build a DAG and execute it for them.
https://multithreaded.stitchfix.com/blog/2021/10/14/functions-dags-hamilton/
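The idea of deriving a DAG from function definitions can be sketched in a few lines of plain Python. This is only an illustration of the concept and not Hamilton's actual API: the `execute` helper and the column functions below are hypothetical, and plain lists stand in for dataframe columns. The assumption is that a function's name is the column it produces and its parameter names are the columns it depends on.

```python
import inspect

# Hypothetical column functions. The function name is the output column,
# and each parameter name is an upstream column it depends on.
def avg_3wk_spend(spend):
    """Trailing average over up to 3 periods of spend."""
    out = []
    for i in range(len(spend)):
        window = spend[max(0, i - 2): i + 1]
        out.append(sum(window) / len(window))
    return out

def spend_per_signup(spend, signups):
    """Spend divided by signups, element by element."""
    return [s / n for s, n in zip(spend, signups)]

def acquisition_cost(avg_3wk_spend, signups):
    """Depends on a derived column, so it sits deeper in the DAG."""
    return [s / n for s, n in zip(avg_3wk_spend, signups)]

def execute(functions, inputs, targets):
    """Resolve each target by walking parameter names back to the inputs.

    Illustrative only: there is no cycle detection in this sketch.
    """
    registry = {f.__name__: f for f in functions}
    cache = dict(inputs)  # computed columns are memoized here

    def resolve(name):
        if name in cache:
            return cache[name]
        if name not in registry:
            raise KeyError(f"no input or function provides {name!r}")
        func = registry[name]
        deps = inspect.signature(func).parameters
        cache[name] = func(**{dep: resolve(dep) for dep in deps})
        return cache[name]

    return {target: resolve(target) for target in targets}
```

Calling `execute([avg_3wk_spend, spend_per_signup, acquisition_cost], {"spend": [...], "signups": [...]}, ["acquisition_cost"])` computes only the columns that the requested target needs, in dependency order.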
AWS: Implement a slowly changing dimension in Amazon Redshift
Slowly changing dimensions and incremental data processing make up 90% of data pipeline workload patterns. AWS writes about how to handle slowly changing dimensions (SCD) in Redshift, with best practices and anti-patterns.
https://aws.amazon.com/blogs/big-data/implement-a-slowly-changing-dimension-in-amazon-redshift/
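For readers new to the pattern, here is a minimal in-memory sketch of a Type 2 slowly changing dimension, one common SCD variant. It is not the SQL from the AWS post: the `apply_scd2` function and the `valid_from`, `valid_to`, and `is_current` column names are assumptions made for illustration. A changed record closes the current row and appends a new one, so history is preserved.

```python
from datetime import date

def apply_scd2(dimension, incoming, key, tracked, as_of):
    """Apply incoming records to a dimension as SCD Type 2.

    dimension: list of dict rows with valid_from, valid_to, is_current
    incoming:  list of dict rows holding the latest source values
    key:       name of the business key column
    tracked:   columns whose changes should create a new version
    as_of:     date on which the change takes effect
    Returns a new list and leaves the input rows untouched.
    """
    result = [dict(row) for row in dimension]
    current = {row[key]: row for row in result if row["is_current"]}

    for record in incoming:
        existing = current.get(record[key])
        if existing is not None and all(existing[c] == record[c] for c in tracked):
            continue  # nothing changed, keep the current version

        if existing is not None:
            # Close the old version instead of overwriting it.
            existing["is_current"] = False
            existing["valid_to"] = as_of

        new_row = {key: record[key]}
        new_row.update({c: record[c] for c in tracked})
        new_row.update({"valid_from": as_of, "valid_to": None, "is_current": True})
        result.append(new_row)
        current[record[key]] = new_row

    return result
```

Because unchanged records are skipped, re-running the same batch does not create duplicate versions, which is the property an incremental load usually needs.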
Databricks: Native Support of Session Window in Spark Structured Streaming
Exciting to see that the upcoming Apache Spark 3.2 adds “session windows” as a new supported window type, which works for both streaming and batch queries. The blog walks through how to add a session window on event time.
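To make the session window idea concrete, the sketch below groups events per key into sessions that close after a gap of inactivity. It is a plain Python illustration of the semantics and not Spark's API: the `sessionize` function is hypothetical, and treating an event that arrives exactly one gap after the previous one as the start of a new session is an assumption of this sketch.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def sessionize(events, gap):
    """Group (key, event_time) pairs into per-key sessions.

    A session keeps growing while each new event arrives less than `gap`
    after the previous one. Each session is returned as
    (start, end, event_count), where end = last event time + gap.
    """
    by_key = defaultdict(list)
    for key, event_time in events:
        by_key[key].append(event_time)

    sessions = {}
    for key, times in by_key.items():
        times.sort()  # event time order, regardless of arrival order
        windows = []
        start = last = times[0]
        count = 1
        for event_time in times[1:]:
            if event_time - last < gap:
                last = event_time  # still inside the session: extend it
                count += 1
            else:
                windows.append((start, last + gap, count))
                start = last = event_time
                count = 1
        windows.append((start, last + gap, count))
        sessions[key] = windows
    return sessions
```

Unlike tumbling or sliding windows, the window length here is not fixed: it depends on how long the activity for a given key continues.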
HomeToGo: DBT at HomeToGo
HomeToGo writes about its adoption of dbt into its data infrastructure and the integration of dbt with Apache Airflow. The layered approach of metrics computation on top of dbt models and the testing of dbt models with Great Expectations are exciting to read.
https://engineering.hometogo.com/dbt-at-hometogo-ece067987267
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.