Data Engineering Weekly 65

Weekly Data Engineering Newsletter

Nov 22, 2021

Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.

How to Analyze Data with Python, Pandas & Numpy - 10 Hour Course

I have always been a big fan of what FreeCodeCamp.org is doing to create free CS educational content. If you're starting to learn data analytics with Python, this is a fantastic course to start with.

https://www.freecodecamp.org/news/how-to-analyze-data-with-python-pandas/

Meta: Data Observability Learning Summit 2021

Meta (Facebook) published videos of its recent data observability summit. I've not watched all the videos yet, and I'm looking forward to watching "Data and ML observability in the public cloud" & "'Catch me if you can': Keeping up with ML in production."

https://m.facebook.com/watch/9445547199/490224945331402

Netflix: Building confidence in a decision

Netflix published the fifth post in a multi-part series on how Netflix uses A/B tests to inform decisions and continuously innovate its products. The fifth part focuses on how Netflix uses the test results to support decision-making in a complex business environment.

https://netflixtechblog.com/building-confidence-in-a-decision-8705834e6fd8

Spotify: The Rise (and Lessons Learned) of ML Models to Personalize Content on Home

Spotify shared a two-part post on its ML adoption story & the lessons learned while building personalized content on its Home page. The blog is an exciting narration of the thinking behind converting a rule-based application into an ML-driven one.

Part 1: https://engineering.atspotify.com/2021/11/15/the-rise-and-lessons-learned-of-ml-models-to-personalize-content-on-home-part-i/

Part 2: https://engineering.atspotify.com/2021/11/18/the-rise-and-lessons-learned-of-ml-models-to-personalize-content-on-home-part-ii/

Vimeo: Uncovering bias in search and recommendations

The code we write is fundamentally a reflection of the human thought process, and human bias in the system is a harmful by-product. Depending on existing data tends to privilege the systems that are already in place. Vimeo writes an exciting blog along those lines on how its search & recommendation team approaches uncovering bias in its ML models.

https://medium.com/vimeo-engineering-blog/uncovering-bias-in-search-and-recommendations-751b01d1c874

Pinterest: MemQ - An efficient, scalable cloud-native PubSub system

Pinterest writes about its internal pub-sub system called MemQ, born out of the lessons learned from operating Kafka. The pluggable replicated storage layer is the highlight of the system design. The key takeaways on operating Kafka are a must-read:

Not every dataset needs a sub-second latency service. Latency and cost should be inversely proportional (lower latency should cost more)
A PubSub system's storage and serving components must be separated to enable independent scalability based on resources.
Ordering on read instead of on write provides the required flexibility for specific consumer use cases (different applications can have different ordering requirements for the same dataset).
Strict partition ordering is not necessary at Pinterest in most cases and often leads to scalability challenges.
Rebalancing in Kafka is expensive, often results in performance degradation, and harms customers on a saturated cluster.
Running custom replication in a cloud environment is expensive.

https://medium.com/pinterest-engineering/memq-an-efficient-scalable-cloud-native-pubsub-system-4402695dd4e7
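The "ordering on read" takeaway is easier to see with a small example. This is a minimal sketch of the idea only, not MemQ's API: the `Event` type, its fields, and the `read_ordered` helper are hypothetical names made up for illustration. The batch is stored in arrival order, and each consumer sorts it with its own key when it reads.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape, used only for this illustration.
    user_id: str
    event_time: int
    payload: str


def read_ordered(batch: Iterable[Event], key: Callable[[Event], object]) -> List[Event]:
    """Order a batch at read time using the consumer's own sort key."""
    return sorted(batch, key=key)


# The write path stores events in arrival order and enforces no ordering.
stored_batch = [
    Event("u2", 3, "click"),
    Event("u1", 1, "view"),
    Event("u1", 2, "click"),
    Event("u2", 0, "view"),
]

# Two consumers read the same dataset with different ordering requirements.
by_time = read_ordered(stored_batch, key=lambda e: e.event_time)
by_user_then_time = read_ordered(stored_batch, key=lambda e: (e.user_id, e.event_time))
```

The write path stays cheap because it never coordinates ordering, and only the consumers that care about order pay for the sort.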

PayPal: Scaling Kafka Consumer for Billions of Events

PayPal writes about its performance benchmark for improving the throughput of its Kafka consumers. The performance gain from switching the Java GC from CMS to G1GC is an interesting takeaway.

https://medium.com/paypal-tech/kafka-consumer-benchmarking-c726fbe4000

Confluent: How to Efficiently Subscribe to a SQL Query for Changes

Subscribing to a real-time CDC pipeline to get updates in a scalable way is powerful. Confluent writes about how ksqlDB supports efficiently subscribing to real-time SQL queries. However, the lack of support for GROUP BY, PARTITION BY & window expressions is a disappointment.

https://www.confluent.io/blog/push-queries-v2-with-ksqldb-scalable-sql-query-subscriptions/

Servian: Modelling Type 1 + 2 Slowly Changing Dimensions with dbt

Finally, Servian's blog walks through a practical implementation of Type 1 & Type 2 slowly changing dimensions with dbt.

https://servian.dev/modelling-type-1-2-slowly-changing-dimensions-with-dbt-1b80078f290a
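For readers new to the terms, here is a minimal sketch of the difference between Type 1 and Type 2 changes in plain Python. It is not the dbt implementation from the Servian post. The `DimRow` shape, the choice of `email` as the Type 1 attribute and `city` as the Type 2 attribute, and the rule that a Type 1 overwrite applies to every historical version are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class DimRow:
    # Hypothetical customer dimension, used only for this illustration.
    customer_id: int
    email: str                       # Type 1 attribute: overwritten, no history
    city: str                        # Type 2 attribute: history is kept
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_change(rows: List[DimRow], customer_id: int, email: str,
                 city: str, as_of: date) -> List[DimRow]:
    """Return a new row list with one source record applied."""
    current = next(
        (r for r in rows if r.customer_id == customer_id and r.valid_to is None),
        None,
    )
    if current is None:
        # First time we see this customer: open the first version.
        return rows + [DimRow(customer_id, email, city, as_of)]
    if current.city != city:
        # Type 2: close the current version and open a new one.
        closed = replace(current, valid_to=as_of)
        rows = [closed if r is current else r for r in rows]
        rows = rows + [DimRow(customer_id, email, city, as_of)]
    # Type 1: overwrite the attribute on every version of this customer.
    return [replace(r, email=email) if r.customer_id == customer_id else r
            for r in rows]


dim: List[DimRow] = []
dim = apply_change(dim, 1, "a@example.com", "Sydney", date(2021, 1, 1))
dim = apply_change(dim, 1, "b@example.com", "Sydney", date(2021, 6, 1))     # Type 1 only
dim = apply_change(dim, 1, "b@example.com", "Melbourne", date(2021, 11, 1))  # Type 2
```

After these three changes the dimension holds two rows for the customer: a closed Sydney version and a current Melbourne version, both carrying the latest email.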

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
