Data Engineering Weekly #64

Weekly Data Engineering Newsletter

Nov 15, 2021

Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.

Poll Result: What tools/SaaS products are you using for data access & security, such as column-level access control for multi-database (DW) environments?

Recent work on building multi-cloud data identity & access management prompted me to revisit this space. The opinion poll shows Apache Ranger is the most widely adopted solution, with the cloud providers' own solutions coming second.

Patrick Chase: Data warehouse is the new backend

SaaS applications are evolving from business process solutions into full-suite data workflow engines, offering lower cost & faster distribution to run a business effectively. The article raises an interesting question: is the role of the data warehouse changing to that of a backend for data? The following tweet echoes a similar thought on the role of SaaS applications in modern data engineering. It will be interesting to watch this trend and how it shapes data warehouse systems as we know them today.

Gwen (Chen) Shapira @gwenshap

And with the move to cloud services, the data integration world moved from DB tools like CDC and exports to APIs. Many teams don't notice that SalesForce and Marketo, and Zendesk are all actually databases.

https://pchase.substack.com/p/thenewbackend

Confluent: Scaling Apache Druid for Real-Time Cloud Analytics at Confluent

Confluent writes about its adoption story of Apache Druid for its Cloud Metrics API services. The scalability challenges, hardware choices, and compaction strategies are an exciting read.

https://www.confluent.io/blog/scaling-apache-druid-for-real-time-cloud-analytics-at-confluent/

Expedia: Apache Cassandra for Real-Time User Analytics at Expedia Group

Expedia shares a high-level overview of its real-time user analytics infrastructure. The blog is a good refresher on Apache Cassandra, with some trivia quizzes along the way!

https://medium.com/expedia-group-tech/apache-cassandra-for-real-time-user-analytics-at-expedia-group-4b612bac05a7

Samhita Alla: Bring ML Close to Data Using Feast and Flyte

Feature engineering is one of the most significant challenges in applied machine learning. Flyte makes it easy to create concurrent, scalable, and maintainable workflows for machine learning and data processing. Feast provides the feature registry and an online feature serving system, while Flyte can engineer the features. The blog narrates how the two systems complement each other and how they interoperate.

https://betterprogramming.pub/bring-ml-close-to-data-using-feast-and-flyte-bd0cb5608678

Coinbase: How we scaled data streaming at Coinbase using AWS MSK

Coinbase writes about its adoption story of AWS MSK and the benefits it gets from its Kafka security service (KSS), tooling & Kafka Connect service. Coinbase reduced the end-to-end streaming pipeline latency by 95% when switching from Kinesis (~200 msec) to Kafka (< 10 msec).

https://blog.coinbase.com/how-we-scaled-data-streaming-at-coinbase-using-aws-msk-4595f171266c
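
As a quick sanity check on the 95% figure quoted above, here is a minimal Python sketch of the arithmetic. The function name is hypothetical, and the inputs are just the rounded numbers from the summary (about 200 msec for Kinesis, and 10 msec taken as the upper bound for Kafka), not measurements from the Coinbase post.

```python
def latency_reduction_pct(before_ms: float, after_ms: float) -> float:
    """Return the percentage reduction when latency drops from before_ms to after_ms."""
    if before_ms <= 0:
        raise ValueError("before_ms must be positive")
    return (before_ms - after_ms) * 100.0 / before_ms


# Assumed inputs: the rounded figures quoted in the summary above.
kinesis_ms = 200.0  # "~200 msec"
kafka_ms = 10.0     # "< 10 msec", so 10 is the worst case

reduction = latency_reduction_pct(kinesis_ms, kafka_ms)
print(f"{reduction:.0f}% reduction")
```

Since 10 msec is an upper bound, the actual reduction would be at least this value under these assumptions.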

PolicyGenius: Building a Data Warehouse on Google Cloud Platform That Scales With the Business

PolicyGenius writes about its data warehouse system built on Google Cloud & Airflow. It is exciting to see that Google Sheets is an important data source. Classifying data by lifecycle stage into Source data, Foundational view, Unified view & Reporting view is a refreshing take on pipeline classification.

https://medium.com/policygenius-stories/building-a-data-warehouse-on-google-cloud-platform-that-scales-with-the-business-2b07f7c7292e
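
One way to picture the four-stage classification is as an ordered enum. The sketch below is only an illustration: the stage names come from the summary above, but the ordering, the `Stage` type, and the `may_read_from` rule (a view reads only from earlier stages) are assumptions of mine, not something the PolicyGenius post is known to prescribe.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Lifecycle stages named in the summary; the numeric order is an assumption."""
    SOURCE = 1
    FOUNDATIONAL = 2
    UNIFIED = 3
    REPORTING = 4


def may_read_from(consumer: Stage, producer: Stage) -> bool:
    """Hypothetical rule: a dataset may only read from a strictly earlier stage."""
    return producer < consumer


# A reporting view built on a unified view follows the assumed flow.
print(may_read_from(Stage.REPORTING, Stage.UNIFIED))
# Source data depending on a reporting view would run against it.
print(may_read_from(Stage.SOURCE, Stage.REPORTING))
```

A rule like this could be checked in a pipeline's dependency graph to flag tables that skip backwards through the stages.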

Scentbird: Scentbird Analytics 2.0. Migrate from Redshift to Snowflake

Scentbird writes about some limitations of its AWS Redshift & Glue-based data warehouse solution and its migration journey to Snowflake. The narration around Glue limitations is exciting, and I presume these limitations apply to most no-code, UI-based ETL engines.

https://medium.com/@Not4j/scentbird-analytics-2-0-migrate-from-redshift-to-snowflake-redesign-etl-process-e79611723a90

Finally, I found this exciting Git repo full of dbt tips!

https://github.com/erika-e/dbt-tips

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
