Data Engineering Weekly 65

Weekly Data Engineering Newsletter

Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.


Sponsored: You're Invited: Exclusive Data Engineer Event at AWS re:Invent 2021!

Attending AWS re:Invent this year? If so, join data leaders at AWS re:Invent for an exclusive networking event and happy hour at TAO in the Venetian Hotel & Casino on 12/1/2021. Enjoy conversations, hors d'oeuvres, drinks, and music at this exclusive data event hosted by Monte Carlo and Trifacta.

Join us at AWS re:Invent


How to Analyze Data with Python, Pandas & Numpy - 10 Hour Course

I have always been a big fan of what FreeCodeCamp.org is doing to create free CS educational content. If you're starting to learn Data Analytics with Python, this is a fantastic course to start with.

https://www.freecodecamp.org/news/how-to-analyze-data-with-python-pandas/


Meta: Data Observability Learning Summit 2021

Meta (Facebook) published videos of its recent data observability summit. I have not watched all the videos yet, but I'm looking forward to watching data and ML observability in the public cloud & "Catch me if you can": Keeping up with ML in production.

https://m.facebook.com/watch/9445547199/490224945331402


Netflix: Building confidence in a decision

Netflix published the fifth post in a multi-part series on how Netflix uses A/B tests to inform decisions and continuously innovate its products. The fifth part focuses on how Netflix uses the test results to support decision-making in a complex business environment.

https://netflixtechblog.com/building-confidence-in-a-decision-8705834e6fd8


Spotify: The Rise (and Lessons Learned) of ML Models to Personalize Content on Home

Spotify shared a two-part post on its ML adoption story & the lessons learned while building personalized content on its Homepage. The blog is an exciting narration of the thinking behind converting a rule-based application into an ML-driven one.

Part 1: https://engineering.atspotify.com/2021/11/15/the-rise-and-lessons-learned-of-ml-models-to-personalize-content-on-home-part-i/

Part 2: https://engineering.atspotify.com/2021/11/18/the-rise-and-lessons-learned-of-ml-models-to-personalize-content-on-home-part-ii/


Vimeo: Uncovering bias in search and recommendations

The code we write is fundamentally a reflection of the human thought process, and human bias in the system is a harmful by-product. Being dependent on existing data tends to privilege the systems that are already in place. Vimeo writes an exciting blog along those lines on how its search & recommendation team approaches uncovering bias in its ML models.

https://medium.com/vimeo-engineering-blog/uncovering-bias-in-search-and-recommendations-751b01d1c874


Sponsored: The Data Stack Show Live - What is the Modern Data Stack?

Join The Data Stack Show Live for a special panel with experts from Databricks, dbt, Fivetran, Essence VC, and Hinge. The panel will look at the modern stack from all angles and discuss the future of data tooling.

https://rudderstack.com/video-library/the-data-stack-show-live-what-is-the-modern-data-stack/


Pinterest: MemQ - An efficient, scalable cloud-native PubSub system

Pinterest writes about its internal pub-sub system called MemQ, born out of lessons learned from operating Kafka. The pluggable replicated storage layer is the highlight of the system design. The key takeaways on operating Kafka are a must-read.

  1. Not every dataset needs a sub-second latency service. Latency and cost should be inversely proportional (lower latency should cost more)

  2. A PubSub system's storage and serving components must be separated to enable independent scalability based on resources.

  3. Ordering on read instead of on write provides the required flexibility for specific consumer use cases (different applications can have different orderings for the same dataset)

  4. Strict partition ordering is not necessary at Pinterest in most cases and often leads to scalability challenges.

  5. Rebalancing in Kafka is expensive, often results in performance degradation, and harms customers on a saturated cluster.

  6. Running custom replication in a cloud environment is expensive.
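
To make takeaway 3 concrete, here is a minimal Python sketch of ordering on read. It is not MemQ's API; the Event type, the stored_batch data, and the read_ordered helper are all hypothetical names made up for illustration. The idea it tries to show: the storage layer keeps records in arrival order, and each consumer applies the ordering it needs when it reads.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class Event:
    user_id: str
    event_time: int  # hypothetical producer-side timestamp
    payload: str


# Hypothetical batch as it landed in storage: arrival order only,
# with no ordering guarantee enforced at write time.
stored_batch: List[Event] = [
    Event("u2", 30, "click"),
    Event("u1", 10, "view"),
    Event("u1", 20, "click"),
    Event("u2", 5, "view"),
]


def read_ordered(batch: List[Event], key: Callable[[Event], Any]) -> List[Event]:
    """Order a batch at read time using the consumer's own sort key."""
    return sorted(batch, key=key)


# Consumer A wants a global time order.
by_time = read_ordered(stored_batch, key=lambda e: e.event_time)

# Consumer B wants a per-user order over the very same stored data.
by_user_then_time = read_ordered(
    stored_batch, key=lambda e: (e.user_id, e.event_time)
)
```

Both consumers read the same stored batch, so the write path never has to pick a single ordering on their behalf.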

https://medium.com/pinterest-engineering/memq-an-efficient-scalable-cloud-native-pubsub-system-4402695dd4e7


PayPal: Scaling Kafka Consumer for Billions of Events

PayPal writes about its performance benchmark on improving the throughput of its Kafka cluster. The performance gain from switching the Java GC from CMS to G1GC is an interesting takeaway.

https://medium.com/paypal-tech/kafka-consumer-benchmarking-c726fbe4000


Confluent: How to Efficiently Subscribe to a SQL Query for Changes

Subscribing to a real-time CDC pipeline to receive updates in a scalable way is powerful. Confluent writes about how ksqlDB supports efficient subscriptions to real-time SQL queries. However, the lack of support for GROUP BY, PARTITION BY, and window expressions is a disappointment.

https://www.confluent.io/blog/push-queries-v2-with-ksqldb-scalable-sql-query-subscriptions/


Servian: Modelling Type 1 + 2 Slowly Changing Dimensions with dbt

Finally, Servian's blog walks through a practical implementation of Type 1 & Type 2 slowly changing dimensions with dbt.
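
For readers new to the terminology, here is a rough plain-Python sketch of the difference between the two types. It is not dbt code and not the approach from the Servian post; the function names, the row layout (key, attrs, valid_from, valid_to), and the sample data are all assumptions for illustration. Type 1 overwrites the attribute in place, while Type 2 closes the current row and appends a new version.

```python
from datetime import date
from typing import Any, Dict, List


def apply_type1(dim: Dict[str, Dict[str, Any]], key: str, attrs: Dict[str, Any]) -> None:
    """Type 1: overwrite the current attributes; previous values are lost."""
    dim[key] = dict(attrs)


def apply_type2(history: List[Dict[str, Any]], key: str,
                attrs: Dict[str, Any], as_of: date) -> None:
    """Type 2: close the current version and append a new one."""
    for row in history:
        if row["key"] == key and row["valid_to"] is None:
            if row["attrs"] == attrs:
                return  # nothing changed, keep the current version open
            row["valid_to"] = as_of  # close the old version
    history.append(
        {"key": key, "attrs": dict(attrs), "valid_from": as_of, "valid_to": None}
    )


# Hypothetical customer "c1" moving from Sydney to Melbourne.
type1_dim: Dict[str, Dict[str, Any]] = {}
apply_type1(type1_dim, "c1", {"city": "Sydney"})
apply_type1(type1_dim, "c1", {"city": "Melbourne"})

type2_dim: List[Dict[str, Any]] = []
apply_type2(type2_dim, "c1", {"city": "Sydney"}, date(2021, 1, 1))
apply_type2(type2_dim, "c1", {"city": "Melbourne"}, date(2021, 6, 1))
apply_type2(type2_dim, "c1", {"city": "Melbourne"}, date(2021, 7, 1))  # no change
```

After these calls the Type 1 dimension only knows about Melbourne, while the Type 2 dimension holds two rows for c1: a closed Sydney version and an open Melbourne version.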

https://servian.dev/modelling-type-1-2-slowly-changing-dimensions-with-dbt-1b80078f290a


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.