Data Engineering Weekly #71

Weekly Data Engineering Newsletter

Ananth Packkildurai

Jan 24, 2022

DataTalks.Club: Data Engineering Bootcamp Videos

Kudos to DataTalks.Club for running the Data Engineering Zoomcamp. All the videos are published on YouTube.

https://www.youtube.com/playlist?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb

You can find the code for the Zoomcamp here:

https://github.com/DataTalksClub/data-engineering-zoomcamp

Snowflake: Expanding the Data Cloud with Apache Iceberg

A couple of months back, we saw the LakeHouse vs. DataWarehouse (aren’t they the same?) benchmark studies from Databricks, Snowflake, and again Databricks. It is interesting to see Snowflake announce support for Apache Iceberg as an external table format.

I’ve seen a pattern where raw product data often sits on S3. The data moves closer to systems like Snowflake as it gets aggregated. Presto traditionally played the role of the federated query engine, and it is interesting to see Snowflake stepping into that role. Coincidentally, AWS EMR announced support for Apache Iceberg on the same day!

Snowflake Announcement:

https://www.snowflake.com/blog/expanding-the-data-cloud-with-apache-iceberg/

AWS Announcement:

https://aws.amazon.com/about-aws/whats-new/2022/01/amazon-emr-supports-apache-iceberg/

Apache Hudi: Change Data Capture with Debezium and Apache Hudi

Staying with the LakeHouse architecture, the Apache Hudi blog covers change data capture with Debezium and Apache Hudi. The support for the “incremental view” (Merge on Read) makes Hudi a perfect system for Change Data Capture use cases.

https://hudi.apache.org/blog/2022/01/14/change-data-capture-with-debezium-and-apache-hudi/
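
To make the change data capture idea concrete, here is a minimal Python sketch. It is not Hudi's or Debezium's code. It assumes change events shaped like Debezium's envelope (an `op` code of `c`, `u`, `d`, or `r`, plus `before` and `after` row images) and a hypothetical primary-key column named `id`; the `apply_changes` helper is made up for illustration. It replays events onto an in-memory snapshot, which is roughly the upsert-and-delete behavior a table format has to provide for CDC.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_changes(
    snapshot: Dict[Any, Row],
    events: Iterable[Dict[str, Any]],
    key: str = "id",  # hypothetical primary-key column
) -> Dict[Any, Row]:
    """Replay Debezium-style change events onto a keyed snapshot (illustrative only)."""
    table = dict(snapshot)  # copy so the input snapshot is left untouched
    for event in events:
        op = event["op"]
        if op in ("c", "u", "r"):
            # create, update, or snapshot read: upsert the new row image
            after = event["after"]
            table[after[key]] = after
        elif op == "d":
            # delete: drop the row identified by the old row image
            before = event["before"]
            table.pop(before[key], None)
        else:
            raise ValueError(f"unknown op code: {op!r}")
    return table


# Made-up sample events, applied in order.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "paid"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]

result = apply_changes({}, events)
print(result)
```

Events must be applied in order per key; a real pipeline also has to handle late or duplicate events, which this sketch ignores.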

This is an excellent summary of what happened in Apache Hudi in 2021.

https://hudi.apache.org/blog/2022/01/06/apache-hudi-2021-a-year-in-review

Amplify Partners: Sales Metrics 101: Self Serve, Sales-Assisted, and PQL Funnels

Understanding the funnel of the business process flow is vital for a business. Measuring things is hard, but data helps enrich our understanding of what is going on. Amplify Partners writes an excellent blog on Sales metrics for self-service business models, sales-assisted business models, and product-qualified leads for potential upselling.

https://amplifypartners.com/company-building/sales-metrics-101/

Benn Stancil & Mark Grover: Good Data Citizenship Doesn’t Work

Data is a critical differentiator for a company among its competitors. As a result, we see increased adoption or talk about democratizing the data across the organization. The current answer to the quest is more documentation & cataloging. But is this enough? Is there anything we can learn from consumer media about information sharing? The authors compare news sites, Wikipedia, Yelp & Google.

https://towardsdatascience.com/good-data-citizenship-doesnt-work-265f13a37fa5

We need to look at the social news feed industry, where information is pushed to end users rather than polled by them.

Ananth Packkildurai @ananthdurai

Expanding more on this, I agree that 99% of the business use cases might be sufficient for an hour delay to refresh the dashboard. I'm more thinking about instrumenting a habit of data-driven decision-making. 🧵1/4

Ananth Packkildurai @ananthdurai

A case for near real-time data warehouse: There is a vast difference between you can see insights every day 10 AM vs. you can see the up to date insights every time you access it. The batch nature of the insight generation causes the zombie dashboards.

Spotify: Product Lessons from ML Home - Spotify’s One-Stop Shop for Machine Learning

Spotify writes about ML Home, the internal user interface for Spotify’s Machine Learning Platform. The blog focuses on product lessons learned along the way in the quest to entrench Spotify’s ML ecosystem.

https://engineering.atspotify.com/2022/01/19/product-lessons-from-ml-home-spotifys-one-stop-shop-for-machine-learning/

StarTree: Native Text Indices and Like Operator Support in Apache Pinot

One of the significant features of Apache Pinot is the ability to define an indexing strategy for each column. The talk gives excellent insights on how text search indexing works in Apache Pinot.
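
As a rough illustration of what a text index buys you, here is a toy inverted index in Python. It is not how Pinot implements its text indices; the whitespace tokenizer and the `build_index` and `search_all` helpers are simplified assumptions made for this sketch. The idea is that each term maps to the set of row ids containing it, so a multi-term search becomes a set intersection instead of a scan over every row.

```python
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(docs: List[str]) -> Dict[str, Set[int]]:
    """Map each term to the set of document ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search_all(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return ids of documents containing every term in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    postings = [index.get(token, set()) for token in tokens]
    return set.intersection(*postings)


# Made-up sample rows for a single text column.
docs = [
    "Java stack trace error",
    "disk error on node",
    "java heap space",
]
index = build_index(docs)
print(search_all(index, "java error"))
```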

Twitter: Investing in privacy-enhancing tech to advance transparency in ML

Twitter writes a quick note on its ongoing effort to invest in privacy-enhancing tech and the partnership with openmined.org.

https://blog.twitter.com/engineering/en_us/topics/insights/2022/investing-in-privacy-enhancing-tech-to-advance-transparency-in-ML

Square: Secure Apache Airflow Using Customer Security Manager

Square writes about implementing DAG-level ACL support for Apache Airflow. The blog discusses the various auth options available in Apache Airflow and the implementation of REMOTE_USER mode.

https://developer.squareup.com/blog/secure-apache-airflow-using-customer-security-manager/
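
For readers new to the idea, here is a plain-Python sketch of a DAG-level access check. It is not Airflow's or Square's code. Airflow expresses DAG-level permissions as actions such as `can_read` and `can_edit` on a per-DAG resource; the role names, the `ROLE_PERMISSIONS` and `USER_ROLES` tables, and the helper functions below are all hypothetical. The sketch also assumes the user identity arrives in a `REMOTE_USER` value set by an authenticating proxy in front of the web server.

```python
from typing import Dict, Mapping, Optional, Set, Tuple

Permission = Tuple[str, str]  # (action, resource)

# Hypothetical role and user tables; a real deployment would load these from its auth backend.
ROLE_PERMISSIONS: Dict[str, Set[Permission]] = {
    "payments_team": {
        ("can_read", "DAG:payments_daily"),
        ("can_edit", "DAG:payments_daily"),
    },
    "analyst": {("can_read", "DAG:payments_daily")},
}
USER_ROLES: Dict[str, Set[str]] = {
    "alice": {"payments_team"},
    "bob": {"analyst"},
}


def current_user(environ: Mapping[str, str]) -> Optional[str]:
    """Read the identity set by the upstream proxy; None if absent or empty."""
    return environ.get("REMOTE_USER") or None


def is_allowed(environ: Mapping[str, str], action: str, dag_id: str) -> bool:
    """Check whether any of the user's roles grants the action on this DAG."""
    user = current_user(environ)
    if user is None:
        return False  # unauthenticated requests are denied
    wanted = (action, f"DAG:{dag_id}")
    return any(
        wanted in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )


print(is_allowed({"REMOTE_USER": "alice"}, "can_edit", "payments_daily"))
print(is_allowed({"REMOTE_USER": "bob"}, "can_edit", "payments_daily"))
```

Trusting `REMOTE_USER` is only safe when the web server is reachable solely through the proxy that sets it.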

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.

Alex Merced

Feb 3, 2022

Been really loving all the adoption of Apache Iceberg lately. Not sure if you are aware, but there will be several talks and breakout sessions on Apache Iceberg at the upcoming Subsurface conference: https://www.dremio.com/subsurface/live/

Will I see you there?
