Experience Enterprise-Grade Apache Airflow
Astro augments Airflow with enterprise-grade features to enhance productivity, meet scalability and availability demands across your data pipelines, and more.
Cube Research: Crystallizing Snowflake Summit 2024
We should officially call the first week of June data engineering week, since two major data companies are running their developer conferences. There are many announcements, from Snowflake open-sourcing Polaris Catalog to Databricks buying Tabular. I will write a separate blog on these announcements after the Databricks conference; in the meantime, I found this Cube Research blog to be a balanced article about Snowflake Summit.
https://thecuberesearch.com/234-breaking-analysis-crystallizing-snowflake-summit-2024/
Piethein Strengholt: Integrating Azure Databricks and Microsoft Fabric
Databricks buying Tabular certainly triggers interesting patterns in data infrastructure. Databricks and Snowflake offer a data warehouse on top of cloud providers like AWS, Google Cloud, and Azure. Both Snowflake and Databricks now acknowledge support for Iceberg and are moving the battle to the data governance layer. However, all these cloud providers offer competing products. The architecture pattern in this article establishes the baseline of how the cloud providers will eventually eat Snowflake's and Databricks' lunch. Will they co-exist or fight with each other? Only time will tell.
https://piethein.medium.com/integrating-azure-databricks-and-microsoft-fabric-0030d3cf5156
OpenAI: Model Spec
LLMs are slowly emerging as an intelligent data storage layer. Just as data modeling techniques emerged during the boom of relational databases, we are starting to see similar strategies for fine-tuning and prompt templates. Along similar lines, OpenAI released the first draft of its Model Spec, a set of guidelines for researchers and data labelers creating data for reinforcement learning from human feedback (RLHF).
https://cdn.openai.com/spec/model-spec-2024-05-08.html
Sponsored: Data Pipelines with Apache Airflow
If you're looking to build, maintain, and more efficiently manage your data pipelines, check out our comprehensive guide, Data Pipelines with Apache Airflow. This ebook covers practical use cases and provides an overview of key concepts and best practices, including how to set up Airflow in production environments, along with best practices for building, testing, and deploying Airflow DAGs.
Sync Computing: 5 Lessons learned from testing Databricks SQL Serverless + DBT
The Databricks serverless offering aims to reduce the friction of maintaining infrastructure and accelerate building value out of data. The author shares lessons learned from testing Databricks SQL Serverless, noting that the runtime improvement saturated beyond the medium-sized warehouse.
Torsten Walbaum: What 10 Years at Uber, Meta, and Startups Taught Me About Data Analytics
The blog is an excellent recap for anyone starting their career as a data analyst or data scientist. Data is meaningless unless we find a connection that builds a story. We often start the other way around, constructing a story and then trying to find data that fits our narrative. The author points out why a data scientist should be an objective truth seeker who is open to accepting new insights.
Sponsored: DoubleCloud - Production-ready managed Apache Kafka in just about 10 minutes
Apache Kafka is the #1 open-source event streaming service, but the challenge of setting up and maintaining clusters can be a dealbreaker for many companies. And this is where DoubleCloud comes in: with our fully managed service for Apache Kafka, you can deploy production-ready clusters in just about 10 minutes. Combined with DoubleCloud's fully managed ClickHouse and Apache Airflow, you can get everything you need to build a real-time analytics infrastructure in one place.
Start a free trial and see just how easy setting up real-time analytics can be!
https://double.cloud/services/managed-kafka/
Netflix: Round 2 - A Survey of Causal Inference Applications at Netflix
Netflix recaps the key talks from its internal Causal Inference and Experimentation Summit 2024:
Metrics Projection for Growth A/B Tests
A Systematic Framework for Evaluating Game Events
Double Machine Learning for Weighing Metrics Tradeoffs
Survey AB Tests with Heterogeneous Non-Response Bias
Design: The Intersection of Humans and Technology
Part 1: https://netflixtechblog.com/a-survey-of-causal-inference-applications-at-netflix-b62d25175e6f
LinkedIn: How data is powering skills-based hiring on LinkedIn
LinkedIn writes about integrating Graph Neural Network (GNN) technology into its talent solutions, an application of AI aimed at improving job matching accuracy and equity in hiring. By mapping professional relationships and interactions through a dynamic graph structure, LinkedIn tailors job recommendations more precisely and levels the playing field for underrepresented groups. The approach also shows the potential of GNNs to transform industry practices by leveraging complex relationship data.
AWS: How WarpStream enables cost-effective low-latency streaming with Amazon S3 Express One Zone
I’m highly optimistic about Amazon S3 Express One Zone. As I have shared before, its impact on data engineering is exciting. WarpStream writes about using Amazon S3 Express One Zone to power a cost-effective, low-latency streaming solution.
I like all the details in the blog, but I can’t stop laughing at this comparison chart, a classic example of how to cheat people with charts. Dear author, please add the latency numbers on the y-axis.
Growth Acceleration Partners: A Brief Introduction to Optimized Batched Inference with vLLM
As the development of LLM-based applications increases, so does the need for efficient inference and serving of LLMs. TIL about the vLLM library. The author publishes an introduction to vLLM and a comparison study with other related libraries.
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent the opinions of my current, former, or future employers.