Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.
Sponsored: Announcing O'Reilly Data Quality Fundamentals Book. Now Available: Exclusive access to the first two chapters!
Thrilled to announce the release of O'Reilly's first-ever book on data quality, Data Quality Fundamentals: A Practitioner's Guide to Building More Trustworthy Data Pipelines! In this book, the Data Observability category creators make the business case for data trust and explain how data leaders can tackle data quality at scale by leveraging best practices and technologies used by some of the world’s most innovative companies.
Poll Result: What tools/SaaS products are you using for data access & security, such as column-level access control for multi-database (DW) environments?
My recent work on building multi-cloud data identity & access management gave me a reason to revisit this space. The opinion poll shows Apache Ranger is the most widely adopted solution, with the cloud providers' own solutions coming second.
Patrick Chase: Data warehouse is the new backend
SaaS applications are evolving from business process solutions into full-suite data workflow engines, offering lower cost & faster distribution to run a business effectively. The article raises an interesting question: is the role of the data warehouse changing to that of a backend for data? The following tweet also echoes a similar thought on the role of SaaS applications in modern data engineering. It will be interesting to watch this trend and how it shapes data warehouse systems as we know them today.
https://pchase.substack.com/p/thenewbackend
Confluent: Scaling Apache Druid for Real-Time Cloud Analytics at Confluent
Confluent writes about adopting Apache Druid for its Cloud Metrics API services. The scalability challenges, hardware choices, and compaction strategies make for an exciting read.
https://www.confluent.io/blog/scaling-apache-druid-for-real-time-cloud-analytics-at-confluent/
Expedia: Apache Cassandra for Real-Time User Analytics at Expedia Group
Expedia shares a high-level overview of its real-time user analytics infrastructure. The blog is a good refresher on Apache Cassandra, complete with some trivia quizzes!
Samhita Alla: Bring ML Close to Data Using Feast and Flyte
Feature engineering is one of the most significant challenges in applied machine learning. Flyte makes it easy to create concurrent, scalable, and maintainable workflows for machine learning and data processing. Feast provides the feature registry and an online feature serving system, while Flyte can engineer the features. The blog narrates how the two systems complement each other and how they interoperate.
https://betterprogramming.pub/bring-ml-close-to-data-using-feast-and-flyte-bd0cb5608678
Sponsored: RudderStack: RudderStack's Data Stack - Deep Dive
The team at RudderStack provides a detailed breakdown of their data stack. The write-up includes details on how they eat their own dog food using bi-directional RudderStack pipelines to connect the entire stack, so they can extract full value from every component.
https://rudderstack.com/blog/rudderstacks-data-stack-deep-dive
Coinbase: How we scaled data streaming at Coinbase using AWS MSK
Coinbase writes about adopting AWS MSK and the benefits it gets from the Kafka Security Service (KSS), tooling, and Kafka Connect service. Coinbase reduced the end-to-end streaming pipeline latency by 95% when switching from Kinesis (~200 msec) to Kafka (< 10 msec).
https://blog.coinbase.com/how-we-scaled-data-streaming-at-coinbase-using-aws-msk-4595f171266c
PolicyGenius: Building a Data Warehouse on Google Cloud Platform That Scales With the Business
PolicyGenius writes about its data warehouse system built on Google Cloud & Airflow. It is exciting to see Google Sheets as an important data source. The classification of data lifecycle stages into Source data, Foundational view, Unified view & Reporting view is a refreshing take on pipeline classification.
Scentbird: Scentbird Analytics 2.0. Migrate from Redshift to Snowflake
Scentbird writes about some limitations of its AWS Redshift & Glue-based data warehouse solution and its migration journey to Snowflake. The narration around the Glue limitations is exciting, and I presume these limitations apply to most no-code, UI-based ETL engines.
Finally, I found this exciting Git repo full of dbt tips!
https://github.com/erika-e/dbt-tips
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.