The Week of Data Conference Extravaganza: Databricks, Snowflake, LLM and the Future of Data Engineering
Thoughts on product announcements from Databricks and Snowflake
I’m at the Databricks summit as I write down my thoughts on the Snowflake and Databricks announcements. It is my first visit back to San Francisco after having spent almost a decade here. My flight was delayed by 6 hours (yep), and I had to sleep overnight in the Denver airport to catch the connecting flight. But hey, I met my friends after a long time and got my copy of “Fundamentals of Data Engineering” signed by Joe Reis & Matt Housley. If you’re starting out in data engineering, I highly recommend reading it.
There were many exciting announcements from both data conferences, and I want to highlight a few themes here.
The Race to Deprecate the Data Analyst
Databricks announced support for natural language queries as part of LakehouseIQ. In an episode of MLOps Weekly, Josh Wills talks about the “Tapping the Shoulder problem,” explaining how analysts get interrupted frequently with ad-hoc requests. The quest to simplify data access has been around forever, but with the advances in LLMs, I think it will become a reality. Databricks and Snowflake are well placed to index the data and its metadata to enable natural language query capabilities.
The obvious question follows: how do we know the foundation model gives the correct answer? Well, how do we know that a human who writes the SQL for an ad-hoc request, which often goes through zero review, wrote the correct query? An ad-hoc request is often looking for guidance or verifying a hypothesis. It aims for a directionally correct answer in a short amount of time. It is still early days for natural language query, but I’m excited about the progress in this space.
Snowflake is a Data Lake Platform
Snowflake is moving beyond being a SQL data warehouse. Snowflake’s Snowpark already supports running Java & Python code on its platform. Snowflake has now expanded its capabilities to include a fully managed container service, MLOps, a feature store, and model serving. Snowflake adopted Iceberg as its Lakehouse format and announced many performance improvements for querying Iceberg external tables. I believe that a year from now, Databricks and Snowflake features will look alike, if they don't already.
Raising the bar for Data Catalogs
Databricks’ announcement around Unity Catalog is truly exciting. Unity Catalog now supports the search and exploration of catalogs, as we know it from current modern data catalog tools. On top of that, it supports access control for queries and maintains the permission model. Unity Catalog is expanding its territory beyond access control to LLM-driven natural language query for your data and, most importantly, data observability for your data on the Databricks platform.
The direction Unity Catalog is moving in is certainly raising the bar for data catalogs. A few data catalog companies have already announced support for LLM-driven search, auto-generated documentation, etc. The question remains how far data catalog tools can go with just the metadata.
One Platform vs. App Ecosystem
Though the product features of Snowflake and Databricks are converging architecturally, the two companies are pursuing different execution strategies. Snowflake is going after an app ecosystem where modern data stack vendors can deploy their applications, whereas Databricks is going after a fully integrated, secure experience.
The modern data stack has some truly novel ideas: Snowflake as a Unix-like platform, with all the modern data stack tools piped together to deliver a best-of-all-worlds experience. However, the problem is that every modern data stack company has started to claim that it is the control center, aka the UNIX terminal, of the modern data stack. That significantly increases the friction in integrating these systems, which soon becomes a nightmare. It leads to the infamous Snowflake Tax, which drives users away from the platform or limits their Snowflake usage. To reduce that impact, Snowflake this week announced the Snowflake Performance Index to optimize performance and cost. It will be interesting to see how far Snowflake will go with the app ecosystem model vs. the fully integrated data management platform.
AWS & Azure are the real winners
All these announcements, from Snowflake’s container support to Databricks’ LakehouseIQ, require enormous computing capacity, which is possible only with these cloud providers. I exclude Google Cloud since I rarely see Google Cloud users using either Snowflake or Databricks.
Snowflake and Databricks are racing to build the data intelligence layer on top of the cloud infrastructure layer. The last time we saw a similar competition was between Hortonworks and Cloudera. AWS EMR replicated the exact Hadoop layer and burned both companies (combined). Today I was walking around the block, and I saw 👇🏼
I wonder what Snowflake and Databricks will learn from Cloudera and Hortonworks.
Microsoft recently announced a literal Databricks clone in Microsoft Fabric. AWS has some form of these features in AWS Lake Formation. Google Cloud has always been a pioneer in the data intelligence layer.
Between the race to LLMs, the quest to simplify AI & machine learning, and competing products with access to money and people, I think we are in a golden age of data innovation. I’m excited to see how the data engineering domain changes in the next 24 months!
If you followed the Databricks & Snowflake conferences and announcements, please comment with your favorite feature announcement from either company.