Data Engineering Weekly #62

Weekly Data Engineering Newsletter

Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.


Event: Join Impact 2021 on November 3, 2021: The First-Ever Data Observability Summit. Join Today's Leading Data Pioneers

Hear from data leaders pioneering the technologies & processes shaping data engineering. Featuring the first Chief Data Scientist of the U.S., the founder of Data Mesh, and many more!

Click To Get Your Free Ticket For All Data Engineering Weekly Readers


Netflix: Open-Sourcing a Monitoring GUI for Metaflow, Netflix’s ML Platform

The success of any developer framework depends on how efficiently the tool integrates with the developer workflow. Netflix writes about open-sourcing a Metaflow GUI for monitoring and operating its full-stack framework for data science.

https://netflixtechblog.com/open-sourcing-a-monitoring-gui-for-metaflow-75ff465f0d60


Ahmad Houri: How Netflix Metaflow Helped Us Build Real-World Machine Learning Services

The article gives a good overview of Netflix's Metaflow, demonstrating its scaling and cloud integration support with AWS Step Functions.

https://towardsdatascience.com/how-netflix-metaflow-helped-us-build-real-world-machine-learning-services-9ab9a97cdf33


Presto: Scaling with Presto on Spark

Presto is known for interactive queries against data warehouses, but it has evolved into a unified SQL engine for open data lake analytics across interactive and batch workloads. Running Presto on the Apache Spark execution engine is an exciting development that brings one SQL to both batch and interactive workloads.

https://prestodb.io/blog/2021/10/26/Scaling-with-Presto-on-Spark.html


Shopify: Shopify’s Path to a Faster Trino Query Execution: Custom Verification, Benchmarking, and Profiling Tooling

Reliable data infrastructure is critical for a faster “time-to-insight” for analytical queries. Shopify writes about its approach to benchmarking Trino infrastructure. The key lessons section highlights:

  1. A solid statistics foundation is crucial.

  2. Many nuances of an environment can unintentionally influence results.

  3. Ensure you gather all the relevant data.

These principles are essential for operating any data-driven infrastructure.
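To make the first lesson concrete, here is a minimal sketch of comparing two sets of benchmark timings with basic statistics instead of a single run. It is not Shopify's tooling: the runtimes are made-up numbers, and `summarize` and `welch_t` are hypothetical helper names chosen for illustration.

```python
import math
import statistics


def summarize(samples):
    """Return (mean, sample standard deviation, count) for one set of timings."""
    return statistics.mean(samples), statistics.stdev(samples), len(samples)


def welch_t(baseline, candidate):
    """Welch's t-statistic for the difference in mean runtime (candidate - baseline)."""
    m1, s1, n1 = summarize(baseline)
    m2, s2, n2 = summarize(candidate)
    standard_error = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (m2 - m1) / standard_error


# Hypothetical runtimes in seconds for the same query on two cluster configurations.
baseline_runs = [12.1, 11.8, 12.4, 12.0, 12.3]
candidate_runs = [10.9, 11.2, 10.7, 11.0, 11.1]

for name, runs in (("baseline", baseline_runs), ("candidate", candidate_runs)):
    mean, stdev, count = summarize(runs)
    print(f"{name}: mean={mean:.2f}s stdev={stdev:.2f}s n={count}")

print(f"Welch t-statistic: {welch_t(baseline_runs, candidate_runs):.2f}")
```

A t-statistic far from zero suggests the difference is unlikely to be run-to-run noise. The sketch stops short of a p-value, and it does nothing about the second and third lessons: environment effects such as caching or noisy neighbors still have to be controlled and recorded separately.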

https://shopifyengineering.myshopify.com/blogs/engineering/faster-trino-query-execution-verification-benchmarking-profiling


InfoQ: AWS Announces the Public Preview of AWS Data Exchange for Amazon Redshift

Access to third-party data that can be correlated with business metrics is vital to understanding external influences on the business. "Data Sharing" from cloud data warehouses is increasingly popular, as is ETL & Reverse-ETL tooling. I wrote about the data exchange pattern in the past.

Data Engineering Weekly
Omicron Paradigm: Architectural patterns for the Infinite Data Logistic
An Introduction To Omicron Paradigm: Many enterprises are increasingly relying on vertical & horizontal SaaS applications to operate their business. The modern enterprise depends on SaaS applications for all business operation touchpoints from customer relationship management, marketing & demand generations, human resource management, finance and accounti…

Following Snowflake's data exchange, AWS announces AWS Data Exchange for Amazon Redshift. It will be exciting to watch the marketplaces built on top of it.

https://www.infoq.com/news/2021/10/aws-dax-amazon-redshift-preview/


Sponsored: Live Tech Session - The Modern Data Stack Is Warehouse-First

Join leaders from Snowflake, Mammoth Growth, RudderStack, and Mixpanel to learn why the most sophisticated teams architect their data stacks around the data warehouse.

https://rudderstack.com/video-library/the-modern-data-stack-is-warehouse-first


PayPal: Machine Learning Model CI/CD and Shadow Platform

PayPal writes about its machine learning model CI/CD pipeline and shadow platform, built to meet the regulatory requirement that ML/DL models be tested in a shadow pipeline before being deployed to production. The end-to-end workflow of the CI/CD and shadow platform, including its handling of temporally aware features, is an exciting read.
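For readers new to the shadow pattern, the general idea can be sketched as follows: the production model keeps serving traffic while a candidate model scores the same requests on the side, and only the comparison is logged. This is a generic illustration, not PayPal's platform; `ShadowRunner`, the toy models, and the `amount` feature are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Features = Dict[str, float]
Model = Callable[[Features], float]


@dataclass
class ShadowRunner:
    """Serve the production model and score the candidate on the same input."""
    production: Model
    candidate: Model
    log: List[dict] = field(default_factory=list)

    def predict(self, features: Features) -> float:
        served = self.production(features)
        try:
            shadow: Optional[float] = self.candidate(features)
        except Exception:
            # A failing candidate must never affect the served response.
            shadow = None
        self.log.append({"features": features, "served": served, "shadow": shadow})
        return served  # callers only ever see the production score

    def mean_abs_diff(self) -> Optional[float]:
        """Average absolute gap between served and shadow scores, if any exist."""
        diffs = [abs(r["served"] - r["shadow"]) for r in self.log if r["shadow"] is not None]
        return sum(diffs) / len(diffs) if diffs else None


# Hypothetical toy models: both score a transaction from its amount.
runner = ShadowRunner(
    production=lambda f: f["amount"] * 0.1,
    candidate=lambda f: f["amount"] * 0.2,
)
score = runner.predict({"amount": 10.0})
print("served score:", score)
print("mean abs diff vs. shadow:", runner.mean_abs_diff())
```

The design choice worth noting is that the candidate's output is recorded but never returned, so a new model can be evaluated against live inputs without any effect on production decisions.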

https://medium.com/paypal-tech/machine-learning-model-ci-cd-and-shadow-platform-8c4f44998c78


Groupon: Pinion — The Load Framework Part-2

Groupon publishes the second part of its blog series on Pinion, its loader framework for ingesting events into Delta Lake. The blog narrates how the loader framework handles data validation, compaction, auditing to support data governance, and its multi-stage ingestion strategy.

https://medium.com/groupon-eng/pinion-the-load-framework-part-2-e6a47586e7be


Microsoft: Measuring the Impact of Data Science

Measurable impact is critical to iterating on and improving the efficiency of a platform. Microsoft's data science team writes an exciting blog on measuring the impact of data science with PUGET (Product/problem definition, Users and customer segments, Goals and metrics, Efficient and measurable strategy, Trade-offs).

https://medium.com/data-science-at-microsoft/measuring-impact-in-data-science-part-1-6ef9712bcbea


Nextdoor: Running ML Inference Services in Shared Hosting Environments

Data workloads are increasingly adopting shared execution environments, and the post from Nextdoor highlights the impact of load balancing & resource sharing on an inference service's performance.

https://engblog.nextdoor.com/running-ml-inference-services-in-shared-hosting-environments-6176b39bc9b7


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.