Data Talk Club: Data Engineering Bootcamp Videos
Kudos to Data Talk Club for running the data engineering Zoom camp. All the videos are published on YouTube.
You can find the code for the Zoom camp here.
Snowflake: Expanding the Data Cloud with Apache Iceberg
We saw the LakeHouse vs. DataWarehouse (aren’t they the same?) benchmark studies from Snowflake, and again from Databricks, a couple of months back. It is interesting to see Snowflake announce support for Apache Iceberg as an external table format.
I’ve seen a pattern where raw product data often sit on S3. The data move closer to systems like Snowflake as they get aggregated. Presto traditionally played the role of the federated query engine, so it is interesting to see Snowflake stepping into that role. Coincidentally, AWS EMR announced support for Apache Iceberg on the same day!
Apache Hudi: Change Data Capture with Debezium and Apache Hudi
Staying on the LakeHouse architecture, Apache Hudi writes about change data capture with Debezium and Apache Hudi. The support for the “incremental view” (Merge on Read) makes Hudi a perfect system for Change Data Capture use cases.
This is an excellent summary of what happened in Apache Hudi in 2021.
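To make the change data capture idea concrete, here is a minimal sketch of how a stream of change events can be folded into a current-state table. It assumes events shaped like the Debezium envelope (`op`, `before`, `after`, `ts_ms`). The function name `apply_change_events`, the `id` key column, and the in-memory dict standing in for the table are all hypothetical; this is not how Hudi implements Merge on Read, only an illustration of the upsert/delete semantics a CDC sink has to provide.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_change_events(
    table: Dict[Any, Row],
    events: Iterable[Dict[str, Any]],
    key: str = "id",  # hypothetical primary-key column
) -> Dict[Any, Row]:
    """Fold Debezium-style change events into a keyed snapshot (illustrative only)."""
    # sorted() is stable, so events with equal timestamps keep their arrival order.
    for event in sorted(events, key=lambda e: e["ts_ms"]):
        op = event["op"]
        if op in ("c", "r", "u"):
            # create, snapshot read, and update all carry the new row in "after"
            row = event["after"]
            table[row[key]] = row
        elif op == "d":
            # delete carries the old row in "before"
            table.pop(event["before"][key], None)
        else:
            raise ValueError(f"unsupported op: {op!r}")
    return table


events = [
    {"op": "c", "before": None, "after": {"id": 1, "plan": "free"}, "ts_ms": 100},
    {"op": "c", "before": None, "after": {"id": 2, "plan": "free"}, "ts_ms": 110},
    {"op": "u", "before": {"id": 1, "plan": "free"}, "after": {"id": 1, "plan": "pro"}, "ts_ms": 120},
    {"op": "d", "before": {"id": 2, "plan": "free"}, "after": None, "ts_ms": 130},
]

snapshot = apply_change_events({}, events)
print(snapshot)
```

A table format that handles upserts and deletes by key lets the sink do this merge on storage instead of in memory, which is the part that makes a lakehouse table usable as a CDC target.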
Amplify Partners: Sales Metrics 101: Self Serve, Sales-Assisted, and PQL Funnels
Understanding the funnel behind a business process is vital for any business. Measuring things is hard, but data helps enrich our understanding of what is going on. Amplify Partners writes an excellent blog on sales metrics for self-service business models, sales-assisted business models, and product-qualified leads for potential upselling.
Benn Stancil & Mark Grover: Good Data Citizenship Doesn’t Work
Data is a critical differentiator for a company among its competitors. As a result, we see increased adoption or talk about democratizing the data across the organization. The current answer to the quest is more documentation & cataloging. But is this enough? Is there anything we can learn from consumer media about information sharing? The authors compare news sites, Wikipedia, Yelp & Google.
We need to look at how the social news feed industry pushes information to end users rather than making them poll for it.
Ananth Packkildurai (@ananthdurai): A case for near real-time data warehouse: There is a vast difference between you can see insights every day 10 AM vs. you can see the up to date insights every time you access it. The batch nature of the insight generation causes the zombie dashboards.
Spotify: Product Lessons from ML Home - Spotify’s One-Stop Shop for Machine Learning
Spotify writes about ML Home, the internal user interface for Spotify’s Machine Learning Platform. The blog focuses on product lessons learned along the way in the quest to entrench Spotify’s ML ecosystem.
StarTree: Native Text Indices and Like Operator Support in Apache Pinot
One of the significant features of Apache Pinot is the ability to define an indexing strategy for each column. The talk gives excellent insights on how text search indexing works in Apache Pinot.
Twitter: Investing in privacy-enhancing tech to advance transparency in ML
Twitter writes a quick note on its ongoing effort to invest in privacy-enhancing tech and the partnership with
Square: Secure Apache Airflow Using Customer Security Manager
Square writes about implementing DAG-level ACL support for Apache Airflow. The blog discusses the various auth options available in Apache Airflow and the implementation of REMOTE_USER mode.
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.