Sponsored: Joybird’s Warehouse-First Customer Data Stack with Iterable, Snowflake, & RudderStack
Live on April 20th, the Director of Analytics from Joybird, a La-Z-Boy company, joins RudderStack, Snowflake, and Iterable to detail how his team retooled its data stack and reduced the time it spent building integrations and managing data pipelines by 93%.
Let’s start this week with an informative Twitter thread from Wes Kao.
Apoorva Pandhi and Chad Sanderson: Modern Data Stack - Looking into the Crystal Ball
What is the next stage of disruption in the data ecosystem? The authors predict self-serve AI/ML infrastructure, "Data Contracts," decoupled data infrastructure in which the distinction between the data warehouse and the data lake becomes increasingly blurred, streamlining reaching mainstream adoption, and the standardization of data practices.
I'm pretty bullish on "Data Contracts," which will significantly impact the data landscape. It's an exciting time to be in data engineering.
https://www.linkedin.com/pulse/modern-data-stack-looking-crystal-ball-apoorva-pandhi/
Transform: Introducing MetricFlow - Your powerful, open-source metrics framework
The promise of the metrics layer, a semantic abstraction between storage and compute that lets you define metrics once and reuse them across the data ecosystem, is groundbreaking. Transform took its first step in that mission by open-sourcing MetricFlow, its metrics framework.
Github: https://github.com/transform-data/metricflow
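To make "define the metrics once and reuse them" concrete, here is a minimal sketch in plain Python. It is not MetricFlow's API: the `Metric` class, the `compile_sql` helper, and the `orders` table and its columns are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Metric:
    """A metric defined once: a name, a SQL aggregate, and a source table."""
    name: str
    expression: str  # SQL aggregate expression, e.g. "SUM(amount)"
    table: str       # hypothetical source table


def compile_sql(metric: Metric, group_by: Sequence[str]) -> str:
    """Render a query for the metric, sliced by the requested dimensions."""
    select_cols = list(group_by) + [f"{metric.expression} AS {metric.name}"]
    sql = f"SELECT {', '.join(select_cols)} FROM {metric.table}"
    if group_by:
        sql += f" GROUP BY {', '.join(group_by)}"
    return sql


# Defined once (hypothetical names) ...
revenue = Metric(name="revenue", expression="SUM(amount)", table="orders")

# ... and reused by different consumers, each with its own slice.
dashboard_query = compile_sql(revenue, ["order_date"])
finance_query = compile_sql(revenue, ["region", "order_month"])
total_query = compile_sql(revenue, [])

print(dashboard_query)
```

Because every consumer goes through the same definition, a change to the aggregate only has to be made in one place.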
Singularity Data: RisingWave - A Cloud-Native Streaming Database
The barrier to entry for building and operating real-time infrastructure with high reliability is still high. This week, Singularity Data open-sourced RisingWave, a cloud-native streaming database, continuing the run of open-source announcements.
https://singularity-data.com/blog/risingwave-A-Cloud-Native-Streaming-Database/
Github: https://github.com/singularity-data/risingwave
Shopify: The Magic of Merlin - Shopify's New Machine Learning Platform
Shopify writes about Merlin, its machine learning platform built on top of open-source technologies such as Kubernetes, Ray, Airflow, Oozie, and Jupyter Notebook. It's exciting to see the increasing adoption of Ray, which is something I'm looking to try this month.
https://shopifyengineering.myshopify.com/blogs/engineering/merlin-shopify-machine-learning-platform
Sponsored: Firebolt - Building Data Products For Data Engineers
What does a tech stack that always needs to be at the forefront of technology look like? Roy Miara from Explorium talks about building data products for the audience that can’t be fooled – data engineers.
https://www.firebolt.io/blog/building-data-products-for-data-engineers
DoorDash: Using Gamma Distribution to Improve Long-Tail Event Predictions
Long-tail events are always challenging to compute and predict. DoorDash writes about how long-tail events affect its delivery estimates and how it uses a gamma distribution to improve model performance.
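As a rough illustration of why a gamma distribution suits long-tail data, the sketch below fits a normal and a gamma distribution to synthetic, right-skewed "delivery times" and compares how well each recovers the 99th percentile. This is not DoorDash's model: the data is randomly generated, the shape and scale values are made up, and the method-of-moments fit is just a simple stand-in technique.

```python
import random
import statistics


def percentile(values, q):
    """Nearest-rank percentile of values, for q in (0, 1)."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


rng = random.Random(7)

# Synthetic, right-skewed "delivery times" in minutes (made-up parameters).
observed = [rng.gammavariate(2.0, 10.0) for _ in range(20_000)]

mean = statistics.fmean(observed)
variance = statistics.pvariance(observed, mu=mean)

# Normal fit: symmetric, so it has no notion of a heavier right tail.
normal_p99 = statistics.NormalDist(mean, variance ** 0.5).inv_cdf(0.99)

# Gamma fit by method of moments: shape = mean^2 / var, scale = var / mean.
shape = mean ** 2 / variance
scale = variance / mean

# The standard library has no gamma inverse CDF, so estimate the fitted
# distribution's 99th percentile by sampling from it.
simulated = [rng.gammavariate(shape, scale) for _ in range(100_000)]
gamma_p99 = percentile(simulated, 0.99)

empirical_p99 = percentile(observed, 0.99)
normal_error = abs(normal_p99 - empirical_p99)
gamma_error = abs(gamma_p99 - empirical_p99)

print(f"empirical={empirical_p99:.1f} normal={normal_p99:.1f} gamma={gamma_p99:.1f}")
```

On skewed data like this, the normal fit underestimates the tail, which is the kind of error that turns into late deliveries against an optimistic estimate.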
LinkedIn: The journey to building an explainable AI-driven recommendation system to help scale sales efficiency across LinkedIn
I'm thrilled to read the LinkedIn post about an explainable AI-driven recommendation system. The goal of the analytics function is to give a context-rich narration that empowers decision making, not just metrics and dashboards. I have written about it in the past.
Whenever I raise the question, the answer is always that it is the job of the data catalog as a documentation solution. I wouldn't be surprised if the next generation of startups focuses on explainable, news feed-driven business analytics.
Dagster: Introducing Software-Defined Assets
Last month we went through multiple discussions of bundling and unbundling, and soon after we started to see a few M&A deals in the data ecosystem as a sign of consolidation. However, the core of the problem is an efficient, programmatic approach to managing data assets and the asset lifecycle. Dagster writes about its approach to software-defined assets and how it improves data management efficiency.
https://dagster.io/blog/software-defined-assets
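The idea of declaring data assets in code, with dependencies derived from the declarations, can be sketched in a few lines of standard-library Python. This is a toy, not Dagster's implementation or API: the `asset` decorator, the `materialize` function, and the three example assets are hypothetical, and the sketch omits cycle detection, storage, and scheduling.

```python
import inspect

ASSETS = {}  # asset name -> function that computes it


def asset(fn):
    """Register fn as an asset; its parameter names are its upstream assets."""
    ASSETS[fn.__name__] = fn
    return fn


def materialize(name, cache=None):
    """Compute an asset after recursively computing everything it depends on."""
    cache = {} if cache is None else cache
    if name not in cache:
        fn = ASSETS[name]
        upstream = [materialize(dep, cache) for dep in inspect.signature(fn).parameters]
        cache[name] = fn(*upstream)
    return cache[name]


@asset
def raw_orders():
    # Hypothetical source data.
    return [{"id": 1, "amount": 30}, {"id": 2, "amount": 12}, {"id": 3, "amount": 0}]


@asset
def paid_orders(raw_orders):
    return [order for order in raw_orders if order["amount"] > 0]


@asset
def revenue(paid_orders):
    return sum(order["amount"] for order in paid_orders)


total = materialize("revenue")
print(total)
```

The point of the declarative style is that the lineage (`raw_orders` to `paid_orders` to `revenue`) comes from the definitions themselves rather than from a separately maintained task graph.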
Nikhil Sachdeva: The role of a technical program manager in AI projects
A typical AI project is a cross-functional effort from incubation to operation in production. The author discusses the skills required to be an AI Technical Program Manager (TPM).
Sponsored Event - IMPACT TOUR 2022 - The data leaders event series to learn key strategies to make an IMPACT with your data.
Join 3 virtual keynotes and 3 city stops, to learn how data leaders are tackling the biggest challenges in data, from building more reliable stacks to hiring top talent for your team.
Peter Gao: Lessons From Deploying Deep Learning To Production
Some great practical tips on deploying deep learning applications in production: start with an iterative development model, leverage domain-specific feedback loops, use a human-in-the-loop review process, etc.
https://medium.com/aquarium-learning/lessons-from-deploying-deep-learning-to-production-9b7a3576881d
Ben Rogojan: Why Building Data Reliability Systems Is Hard
There is always a trade-off between correctness and speed in a data pipeline. The author explains why data quality is more than SQL queries and discusses the trade-off between operational reliability and data quality.
https://medium.com/coriers/why-building-data-reliability-systems-is-hard-e0bf25c5ee36
Sponsored: Rudderstack - Analytics Engineering vs. Data Engineering
Data engineering is changing as technologies advance, and new roles emerge. Alex Dovenmuehle breaks down the differences between analytics engineers and data engineers, and he outlines the benefits of the two roles working together.
https://www.rudderstack.com/blog/analytics-engineering-vs-data-engineering
Uber: Securing Kafka® Infrastructure at Uber
Uber writes about the essential components needed to enable security features on a Kafka cluster and how it secures its own clusters. The incremental rollout of the security features for 500+ Kafka topics and the performance tuning of the Kafka clusters make for exciting system design reading.
https://eng.uber.com/securing-kafka-infrastructure-at-uber/
Halodoc: Key Learnings on Using Apache HUDI in building Lakehouse Architecture @ Halodoc
Hudi is a rich platform to build streaming data lakes with incremental data pipelines on a self-managing database layer while being optimized for lake engines and regular batch processing. Halodoc writes an exciting blog sharing its experience in adopting and optimizing the Apache Hudi lakehouse infrastructure.
Redpanda: Evaluating Graviton 2 for data-intensive applications - an Arm vs. Intel comparison
Cost optimization is always top of mind for most data teams. Redpanda publishes a performance evaluation of Graviton 2 using Redpanda as the target application, with results that show Arm at about a 20% price/performance advantage over Intel instances for a high-throughput message ingestion scenario.
https://redpanda.com/blog/aws-graviton-2-arm-vs-x86-comparison/
Peng Wang: What Skills Do You Need to Become a Data Engineer
An outstanding data-driven look at which skill sets data engineers need: the author extracted 550 United States data engineer job postings from indeed.com and ran some quick analyses on job description, location, and salary range.
https://www.linkedin.com/pulse/what-skills-do-you-need-become-data-engineer-peng-wang/
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.