Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Pedram Navid: We need to talk about dbt
It was a busy week in dbt land. Pedram Navid's blog post details the lack of clarity and transparency around the dbt roadmap. It triggered a lively conversation on data Twitter, which led to a reply titled "The response you deserve!" It's a healthy sign of robust, community-driven system design in progress.
https://pedram.substack.com/p/we-need-to-talk-about-dbt
Anaconda: Welcome to the World PyScript
I know I'm probably late to talk about it, but "Python in the browser" is an exciting development. "My First Impression Trying Python on Browser" is an excellent follow-up read on trying PyScript.
https://engineering.anaconda.com/2022/04/welcome-pyscript.html
Sarah Krasnik: Choosing a Data Catalog
A data catalog is essential for a collaborative analytical solution, but how do you choose one? Sarah writes about the spectrum of available data catalog solutions and their pros & cons.
https://sarahsnewsletter.substack.com/p/choosing-a-data-catalog?s=r
Twitter: Understanding Twitter conversations - A Wordle case study
Twitter shared an exciting blog post on understanding Twitter conversations, using Wordle as a case study. It's a good reference article on telling stories with data analytics.
Sponsored: Firebolt - How Vimeo Keeps Data Intact with 85 Billion Events Per Month
Lior Solomon, VP of Data Engineering at Vimeo, shares his own experience on The Data Engineering Show: What made him recently build a new data ops team? How do you operate a data stack that supports 85 billion events per month and 2 PB of data? What does Fatal Attraction have to do with all of this?
https://www.firebolt.io/blog/how-vimeo-keeps-data-intact-with-85b-events-per-month
DoorDash: How We Applied Client-Side Caching to Improve Feature Store Performance by 70%
Latency requirements bring unique challenges to adopting prediction services. DoorDash writes about how it applied client-side caching to improve feature store performance by 70%.
https://doordash.engineering/2022/05/03/how-we-applied-client-side-caching/
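For readers new to the idea, here is a minimal sketch of a client-side cache sitting in front of a feature store lookup. It is not DoorDash's implementation: the class name `CachedFeatureClient`, the TTL-based expiry, and the `fetch_fn` callback standing in for the remote call are all assumptions made for illustration.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CachedFeatureClient:
    """Hypothetical client-side TTL cache in front of a remote feature store."""

    def __init__(
        self,
        fetch_fn: Callable[[str], Any],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_fn = fetch_fn  # stands in for the remote feature store call
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._cache: Dict[str, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._cache.get(key)
        # Serve from the local cache while the entry is still fresh.
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        # Otherwise fall through to the remote store and remember the result.
        self.misses += 1
        value = self._fetch_fn(key)
        self._cache[key] = (now, value)
        return value
```

The trade-off in a sketch like this is freshness versus latency: a longer TTL saves more remote calls but serves staler feature values.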
Snowflake: Data Vault Techniques on Snowflake - Immutable Store, Virtual End Dates
Data modeling is often an undervalued process. It is challenging because much of it depends on individual experience and opinion, but following standard techniques like Data Vault can bridge the gap. Snowflake writes an exciting blog post on Data Vault techniques in Snowflake.
https://www.snowflake.com/blog/data-vault-technique-immutable-storage/
Jesse Paquette: What Is Well-Modeled Data for Analysis?
Staying with data modeling: why should one care about it? The author walks through the various aspects of a well-defined data model.
https://towardsdatascience.com/what-is-well-modeled-data-for-analysis-28f73146bf96
Sponsored: RudderStack - A Practical Guide to The Modern Data Stack: The Data Maturity Journey
Data maturity is rapidly becoming a matter of survival, but the modern data stack can be overwhelming. Here, RudderStack provides a helpful framework that places the tools of the modern stack in the context of a 4-stage journey to help you build the right stack at every stage.
Intuit: Data X-ray - Automated Data Quality Analysis Tool Streamlines Feature Selection Process for Machine Learning
Intuit writes about the challenge of having too many features in the feature selection process and how its Data X-ray data quality tool helps. The blog describes Data X-ray's automated data quality analysis in three areas: feature attribute analysis, feature selection analysis, and feature pruning analysis.
Whatnot: Tuning Whatnot’s Data Platform for Speed and Scale
Whatnot writes about tuning its data platform for speed and scale, focusing on three founding principles:
Build Modules, Not Monoliths
Domains Own Their Data
Automate Platform Processes
Sponsored: Monte Carlo Data - The Modern Data Leader’s Playbook
Learn how today’s best data engineering and analytics leaders are staying ahead of the competition in our complete guide.
Download the modern data leader’s playbook
Pinterest: Manas HNSW Streaming Filters
Pinterest writes about HNSW (Hierarchical Navigable Small World graphs) streaming filters on top of Manas, its in-house search engine. Streaming filtering abstracts away the implementation details of how filtering is executed and relieves the client of the burden of over-fetch tuning.
https://medium.com/pinterest-engineering/manas-hnsw-streaming-filters-351adf9ac1c4
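One way to read "streaming filtering" is sketched below. This is not Pinterest's code: a brute-force, distance-ordered generator stands in for the HNSW graph traversal, and the names `stream_neighbors` and `filtered_search` are hypothetical. The point is only that the caller asks for `k` filtered results and never has to guess an over-fetch factor.

```python
import heapq
from itertools import islice
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def stream_neighbors(query: Vector, index: Dict[str, Vector]) -> Iterator[Tuple[str, float]]:
    """Lazily yield (doc_id, distance), nearest first.

    Assumption: a brute-force heap replaces the real HNSW traversal here.
    """
    heap = [(squared_distance(query, vec), doc_id) for doc_id, vec in index.items()]
    heapq.heapify(heap)
    while heap:
        dist, doc_id = heapq.heappop(heap)
        yield doc_id, dist


def filtered_search(
    query: Vector,
    index: Dict[str, Vector],
    predicate: Callable[[str], bool],
    k: int,
) -> List[str]:
    """Pull candidates from the stream until k of them pass the filter."""
    matches = (doc_id for doc_id, _ in stream_neighbors(query, index) if predicate(doc_id))
    return list(islice(matches, k))
```

With a post-filter design, the client would instead fetch some multiple of `k` candidates and hope enough survive the filter. Here the stream simply keeps going until `k` matches are found or the candidates run out.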
Back Market Tech: Data for Product Managers
How does data analytics translate into the day-to-day work of a product manager? The article is an excellent read on data for product managers.
https://engineering.backmarket.com/data-for-product-managers-part-1-2-fd2967333c00
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.