Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.
Event: Join Impact 2021 on November 3, 2021: The First-Ever Data Observability Summit. Join Today's Leading Data Pioneers
Hear from data leaders pioneering the technologies and processes shaping data engineering. Featuring the first Chief Data Scientist of the U.S., the founder of the Data Mesh, and many more!
Click To Get Your Free Ticket For All Data Engineering Weekly Readers
Netflix: Open-Sourcing a Monitoring GUI for Metaflow, Netflix’s ML Platform
The success of any developer framework depends on how efficiently the tool integrates with the developer workflow. Netflix writes about the open-source Metaflow GUI for monitoring and operating its full-stack framework for data science.
https://netflixtechblog.com/open-sourcing-a-monitoring-gui-for-metaflow-75ff465f0d60
Ahmad Houri: How Netflix Metaflow Helped Us Build Real-World Machine Learning Services
The article gives a good overview of Netflix's Metaflow, demonstrating its scaling and cloud integration support with AWS Step Functions.
Presto: Scaling with Presto on Spark
Presto is known for interactive queries against data warehouses, but it has evolved into a unified SQL engine for open data lake analytics, covering both interactive and batch workloads. Running Presto on the Apache Spark execution engine is an exciting development that brings one SQL to batch and interactive workloads.
https://prestodb.io/blog/2021/10/26/Scaling-with-Presto-on-Spark.html
Shopify: Shopify’s Path to a Faster Trino Query Execution: Custom Verification, Benchmarking, and Profiling Tooling
Reliable data infrastructure is critical for a faster “time-to-insight” for analytical queries. Shopify writes about its approach to benchmarking Trino infrastructure. The key lessons section highlights:
A solid statistics foundation is crucial.
Many nuances of an environment can unintentionally influence results.
Ensure you gather all the relevant data.
These principles are essential for operating any data-driven infrastructure.
InfoQ: AWS Announces the Public Preview of AWS Data Exchange for Amazon Redshift
Access to third-party data that can be correlated with business metrics is vital to understanding external influences on the business. "Data sharing" from cloud data warehouses is increasingly popular, as is ETL and reverse-ETL tooling. I wrote about the data exchange pattern in the past.
Following Snowflake's data exchange, Redshift announces AWS Data Exchange for Redshift. It is an exciting phase to watch marketplaces build on top of it.
https://www.infoq.com/news/2021/10/aws-dax-amazon-redshift-preview/
Sponsored: Live Tech Session - The Modern Data Stack Is Warehouse-First
Join leaders from Snowflake, Mammoth Growth, RudderStack, and Mixpanel to learn why the most sophisticated teams architect their data stacks around the data warehouse.
https://rudderstack.com/video-library/the-modern-data-stack-is-warehouse-first
PayPal: Machine Learning Model CI/CD and Shadow Platform
PayPal writes about its machine learning model CI/CD pipeline and shadow platform, built to meet the regulatory requirement that ML/DL models be tested in a shadow pipeline before deployment to production. The end-to-end workflow of the CI/CD and shadow platform, including its handling of temporally aware features, is an exciting read.
https://medium.com/paypal-tech/machine-learning-model-ci-cd-and-shadow-platform-8c4f44998c78
Groupon: Pinion — The Load Framework Part-2
Groupon writes the second part of its blog series about Pinion, its loader framework for ingesting events into Delta Lake. The blog narrates how the loader framework handles data validation, compaction, auditing to support data governance, and a multi-stage ingestion strategy.
https://medium.com/groupon-eng/pinion-the-load-framework-part-2-e6a47586e7be
Microsoft: Measuring the Impact of Data Science
Measurable impact is critical to iterating on and improving the efficiency of a platform. Microsoft's data science team writes an exciting blog on measuring the impact of data science with PUGET (Product/problem definition, Users and customer segments, Goals and metrics, Efficient and measurable strategy, Trade-offs).
https://medium.com/data-science-at-microsoft/measuring-impact-in-data-science-part-1-6ef9712bcbea
Nextdoor: Running ML Inference Services in Shared Hosting Environments
Data workloads are increasingly adopting shared execution environments, and the talk from Nextdoor highlights the impact of load balancing and resource sharing on an inference service's performance.
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.