Data Engineering Weekly #55

Weekly Data Engineering Newsletter

Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.


Event: [On-Demand Webinar] Data Reliability for ELT: How Clearcover Drives Data Trust

Hear from data leaders pioneering the technologies and processes shaping data engineering, featuring the first Chief Data Scientist of the U.S., the founder of Data Mesh, and many more!

Register Today


AWS: A PartiQL deep dive - Understand the language and bring SQL queries to AWS non-relational database services

AWS writes a deep dive on PartiQL, a SQL-92-compatible query language that runs queries against structured, semi-structured, and nested data. PartiQL's idea of a logical type system for data modeling on top of formats like JSON and Parquet, together with its support for dynamic typing, makes it an exciting space to watch. Though AWS data services are rapidly adopting PartiQL, it remains to be seen how much momentum it can gain in the open-source community against the likes of Apache Calcite.

https://aws.amazon.com/blogs/database/a-partiql-deep-dive-understanding-the-language-bringing-sql-queries-to-aws-non-relational-database-services/


Paige Berry: Share Your Data Insights to Engage Your Colleagues

People don't make decisions based on data; they make decisions based on the story!

Data storytelling is a vital aspect of data analytics that improves collaboration and leads to better-informed decisions. A culture and workflow of collaborative story building on top of data are critical ingredients for efficient business operations. The author writes an exciting blog post narrating the workflow for sharing data insights at Netlify.

https://locallyoptimistic.com/post/share-your-data-insights-to-engage-your-colleagues/


Pinterest: Pinterest’s Analytics as a Platform on Druid

Pinterest shared a three-part blog series on its journey with Apache Druid. The series narrates the shortcomings of its Apache HBase infrastructure, instance optimization based on tiered request patterns, secondary key pruning, and a bloom filter index on real-time segments.

https://medium.com/pinterest-engineering/pinterests-analytics-as-a-platform-on-druid-part-1-of-3-9043776b7b76

https://medium.com/pinterest-engineering/pinterests-analytics-as-a-platform-on-druid-part-2-of-3-e63d5280a1a9

https://medium.com/pinterest-engineering/pinterests-analytics-as-a-platform-on-druid-part-3-of-3-579406ffa374
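
For readers new to the bloom filter idea mentioned in the summary above, here is a minimal sketch in Python. It is not Pinterest's or Druid's implementation; the class, the segment names, and the values are all hypothetical. It only illustrates the general technique: a compact per-segment filter that can say "definitely not here" and so lets a query skip segments that cannot contain a value.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch readable; real filters pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, value: str):
        # Derive several positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value: str) -> bool:
        # False means the value was never added; True means "possibly added".
        return all(self.bits[pos] for pos in self._positions(value))


# Hypothetical segments, each holding the values of one dimension.
segments = {"segment_a": ["pin_1", "pin_2"], "segment_b": ["pin_3"]}

# Build one filter per segment.
filters = {}
for name, values in segments.items():
    bloom = BloomFilter()
    for value in values:
        bloom.add(value)
    filters[name] = bloom


def candidate_segments(value: str) -> list:
    """Return the segments that may contain the value; the rest can be skipped."""
    return [name for name, bloom in filters.items() if bloom.might_contain(value)]
```

Calling candidate_segments("pin_3") always includes "segment_b". Other segments may occasionally appear as false positives, which costs only an unnecessary scan, never a wrong result.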


Confluent: Protecting Data Integrity in Confluent Cloud: Over 8 Trillion Messages Per Day

Confluent writes about its end-to-end data durability monitoring infrastructure for Apache Kafka. Focusing the integrity check on system state-change operations, rather than on data scrubbing, is an elegant approach to detecting integrity violations.

https://www.confluent.io/blog/how-confluent-cloud-protects-kafka-data-integrity-for-eight-trillion-messages-per-day/


Databricks: Implementing More Effective FAIR Scientific Data Management With a Lakehouse

The FAIR framework for good data management and stewardship of scientific data was initially introduced in a 2016 article in Nature, with “long-term care of valuable digital assets” at its core. Databricks writes an exciting blog on how the lakehouse architecture empowers the FAIR framework. The blog introduced me to the FAIR principles, and it is well worth a read.

https://databricks.com/blog/2021/09/07/implementing-more-effective-fair-scientific-data-management-with-a-lakehouse.html

Nature article on the FAIR principles


Sponsored: RudderStack - Churn Prediction With BigQueryML to Increase Mobile Game Revenue

Here’s an interesting case study on how machine learning can directly impact the bottom line. RudderStack outlines how the app developer Torpedo Labs uses BigQuery ML to identify high-value mobile game players who are dangerously close to churning.

https://rudderstack.com/blog/churn-prediction-with-bigqueryml


StarTree.AI: Launching at LinkedIn: The Story of Apache Pinot

It is always refreshing to read the backstory of a successful open-source system and how it started from a simple beginning and grew over time. StarTree shares one such success story: how Apache Pinot started at LinkedIn and grew with its adoption at Uber.

https://www.startree.ai/blogs/launching-at-linkedin-the-story-of-apache-pinot/


Uber: Streaming Real-Time Analytics with Redis, AWS Fargate, and Dash Framework

Uber writes about its real-time analytics system built with Redis, AWS Fargate, and the Dash framework, and its evolution from long-polling ingestion to an event-driven model. It is the first story I have read about Uber's usage of AWS, and it sounds like an interesting development. Earlier, Dropbox shared its analytical stack migration to AWS, and Twitter shared its ads analytical stack migration to Google Cloud.

https://eng.uber.com/streaming-real-time-analytics/


Snowflake: Migrating Airflow from Amazon EC2 to Kubernetes

Snowflake shared its Apache Airflow migration from EC2 instances to KubernetesPodExecutors to scale with DAG growth. The blog also covers best practices for Airflow health monitoring and alerting. It is sad to see that Airflow's operational challenges remain the same even after all these years!

https://www.snowflake.com/blog/migrating-airflow-from-amazon-ec2-to-kubernetes/


Capital One Tech: Automate Application Monitoring with Slack

Slack-like platforms play a significant role in data ops and application monitoring, bridging the workflow between humans and machines. Capital One writes an exciting blog post that narrates using Apache Airflow and a Slack bot to monitor Elasticsearch.

https://medium.com/capital-one-tech/automate-application-monitoring-with-slack-9e4e498652a3


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.