Data Engineering Weekly #62

Weekly Data Engineering Newsletter

Nov 01, 2021

Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.

Event: Join Impact 2021 on November 3, 2021: The First-Ever Data Observability Summit. Join Today's Leading Data Pioneers

Hear from data leaders pioneering the technologies and processes shaping data engineering, featuring the first Chief Data Scientist of the U.S., the founder of Data Mesh, and many more!

Click To Get Your Free Ticket For All Data Engineering Weekly Readers

Netflix: Open-Sourcing a Monitoring GUI for Metaflow, Netflix’s ML Platform

The success of any developer framework depends on how efficiently the tool integrates with the developer workflow. Netflix writes about open-sourcing the Metaflow GUI for monitoring and operating its full-stack framework for data science.

https://netflixtechblog.com/open-sourcing-a-monitoring-gui-for-metaflow-75ff465f0d60

Ahmad Houri: How Netflix Metaflow Helped Us Build Real-World Machine Learning Services

The article gives a good overview of Netflix's Metaflow, demonstrating Metaflow's scaling and cloud integration support with AWS Step Functions.

https://towardsdatascience.com/how-netflix-metaflow-helped-us-build-real-world-machine-learning-services-9ab9a97cdf33

Presto: Scaling with Presto on Spark

Presto is known for interactive queries against data warehouses, but it has evolved into a unified SQL engine for open data lake analytics, covering both interactive and batch workloads. Running Presto on the Apache Spark execution engine is an exciting development that brings one SQL to batch and interactive workloads.

https://prestodb.io/blog/2021/10/26/Scaling-with-Presto-on-Spark.html

Shopify: Shopify’s Path to a Faster Trino Query Execution: Custom Verification, Benchmarking, and Profiling Tooling

Reliable data infrastructure is critical for a faster “time-to-insight” for analytical queries. Shopify writes about its approach to benchmarking Trino infrastructure. The key lessons section highlights:

A solid statistics foundation is crucial.
Many nuances of an environment can unintentionally influence results.
Ensure you gather all the relevant data.

These principles are essential for operating any data-driven infrastructure.

https://shopifyengineering.myshopify.com/blogs/engineering/faster-trino-query-execution-verification-benchmarking-profiling
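
To make the "solid statistics foundation" lesson concrete, here is a minimal sketch of comparing two sets of benchmark runs with a bootstrap confidence interval instead of a single before/after number. It is not Shopify's tooling: the function name `bootstrap_median_diff` and the latency values are hypothetical, chosen only for illustration.

```python
import random
import statistics


def bootstrap_median_diff(baseline, candidate, n_resamples=2000, seed=0):
    """Return a 95% bootstrap interval for median(candidate) - median(baseline)."""
    rng = random.Random(seed)  # fixed seed so the benchmark report is reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each set of runs with replacement and compare the medians.
        b = rng.choices(baseline, k=len(baseline))
        c = rng.choices(candidate, k=len(candidate))
        diffs.append(statistics.median(c) - statistics.median(b))
    diffs.sort()
    low = diffs[int(0.025 * n_resamples)]
    high = diffs[int(0.975 * n_resamples) - 1]
    return low, high


# Hypothetical query latencies in seconds for the same query on two cluster configs.
baseline = [12.1, 11.8, 12.4, 12.0, 12.3, 11.9, 12.2, 12.5]
candidate = [10.9, 11.1, 10.7, 11.0, 11.2, 10.8, 11.3, 10.6]

low, high = bootstrap_median_diff(baseline, candidate)
# The candidate is treated as faster only if the whole interval is below zero.
candidate_is_faster = high < 0
```

Reporting an interval rather than one pair of timings also speaks to the second lesson: if environmental noise dominates, the interval straddles zero and the result is inconclusive.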

InfoQ: AWS Announces the Public Preview of AWS Data Exchange for Amazon Redshift

Access to third-party data to correlate with business metrics is vital to understanding external influences on the business. "Data Sharing" from cloud data warehouses is increasingly popular, as is ETL and Reverse-ETL tooling. I wrote about the data exchange pattern in the past.

Omicron Paradigm: Architectural patterns for the Infinite Data Logistic

Following the Snowflake data exchange, AWS announces AWS Data Exchange for Redshift. It is an exciting phase to watch marketplaces build on top of it.

https://www.infoq.com/news/2021/10/aws-dax-amazon-redshift-preview/

PayPal: Machine Learning Model CI/CD and Shadow Platform

PayPal writes about its machine learning model CI/CD pipeline and shadow platform, which help meet regulatory requirements by testing ML/DL models in a shadow pipeline before deploying them to production. The end-to-end workflow of the CI/CD and shadow platform, including the handling of temporally aware features, is an exciting read.

https://medium.com/paypal-tech/machine-learning-model-ci-cd-and-shadow-platform-8c4f44998c78
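
For readers new to the shadow pattern, here is a minimal sketch of the general idea: the candidate model sees the same requests as the production model, its predictions are logged for comparison, and only the production result is returned to the caller. This is not PayPal's implementation; `ShadowRouter` and the toy scoring functions are hypothetical names used for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ShadowRouter:
    """Serve the production model and score the shadow model on the side."""

    production: Callable[[dict], float]
    shadow: Callable[[dict], float]
    log: List[Tuple[dict, float, float]] = field(default_factory=list)

    def predict(self, features: dict) -> float:
        prod_score = self.production(features)
        try:
            shadow_score = self.shadow(features)
            # Keep both scores so the models can be compared offline.
            self.log.append((features, prod_score, shadow_score))
        except Exception:
            # A failing shadow model must never affect the production response.
            pass
        return prod_score


# Toy stand-ins for real models (assumptions, not real scoring logic).
def production_model(features: dict) -> float:
    return features["amount"] / 100


def shadow_model(features: dict) -> float:
    return features["amount"] / 50


router = ShadowRouter(production=production_model, shadow=shadow_model)
request = {"amount": 10.0}
score = router.predict(request)
```

After the call, `router.log` holds one entry pairing the production and shadow scores for the request, which is the raw material for deciding whether the shadow model is ready to be promoted.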

Groupon: Pinion — The Load Framework Part-2

Groupon writes the second part of its blog series about Pinion, its loader framework for ingesting events into Delta Lake. The blog narrates how the loader framework performs data validation, compaction, and auditing to support data governance, along with its multi-stage ingestion strategy.

https://medium.com/groupon-eng/pinion-the-load-framework-part-2-e6a47586e7be

Microsoft: Measuring the Impact of Data Science

Measurable impact is critical to iterating on and improving the efficiency of a platform. Microsoft's data science team writes an exciting blog on measuring the impact of data science with PUGET (Product/problem definition, Users and customer segments, Goals and metrics, Efficient and measurable strategy, Trade-offs).

https://medium.com/data-science-at-microsoft/measuring-impact-in-data-science-part-1-6ef9712bcbea

Nextdoor: Running ML Inference Services in Shared Hosting Environments

Data workloads are increasingly adopting shared execution environments, and the post from Nextdoor highlights the impact of load balancing and resource sharing on an inference service's performance.

https://engblog.nextdoor.com/running-ml-inference-services-in-shared-hosting-environments-6176b39bc9b7

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.
