Data Engineering Weekly #31

Weekly Data Engineering Newsletter

Welcome to the 31st edition of the data engineering newsletter. This week's release is a new set of articles that focus on Redpoint Ventures’ Reverse ETL primer, JP Morgan’s data mesh implementation, DBT’s modern data stack, Validio’s ML & data trends for 2021, Airbnb’s visualizing data timeliness, Pinterest’s lessons learned from running Kafka at scale, Confluent’s 42 things to stop doing once ZooKeeper is gone, LinkedIn’s solving the data integration problem with Apache Gobblin, Facebook’s mitigating the effects of silent data corruption, Reddit’s scaling of its reporting system, and LinkedIn’s GraphQL implementation of DataHub.


Redpoint Ventures: Reverse ETL — A Primer

Over the last decade, cloud and SaaS products have changed the way businesses operate. In a modern business, customer data is spread across many SaaS vendors. A single source of truth is a myth in the modern data infrastructure. I often call this "Truth In Motion." I shared a similar thought a year back.

Along the same lines, the blog describes "Reverse ETL," where data flows from the internal data warehouse to SaaS providers like Salesforce, Zendesk, and Intercom. It is an exciting space to watch, as success depends on how well the SaaS vendors simplify ingress and deliver a cost-effective time to value for the customer's data.
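
To make the direction of the flow concrete, here is a minimal sketch of the warehouse-to-SaaS step. It is my own illustration, not something from the Redpoint article: an in-memory SQLite database stands in for the warehouse, and the table name `customer_metrics`, its columns, and the payload shape (`external_id`, `fields`) are all hypothetical. A real sync would send each batch to the vendor's API; the sketch stops at building the batches.

```python
import sqlite3
from typing import Dict, Iterable, Iterator, List


def read_customer_rows(conn: sqlite3.Connection) -> Iterator[Dict[str, object]]:
    """Read modeled customer rows from the (stand-in) warehouse."""
    # Hypothetical table and columns, used only for illustration.
    cur = conn.execute(
        "SELECT email, plan, lifetime_value FROM customer_metrics ORDER BY email"
    )
    columns = [col[0] for col in cur.description]
    for row in cur:
        yield dict(zip(columns, row))


def to_crm_payload(row: Dict[str, object]) -> Dict[str, object]:
    """Map a warehouse row to a hypothetical SaaS record shape."""
    return {
        "external_id": row["email"],
        "fields": {"plan": row["plan"], "ltv": row["lifetime_value"]},
    }


def batches(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group payloads into fixed-size batches; the last one may be smaller."""
    batch: List[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def plan_sync(conn: sqlite3.Connection, batch_size: int) -> List[List[dict]]:
    """Return the batches a sync job would push to the SaaS vendor."""
    payloads = (to_crm_payload(row) for row in read_customer_rows(conn))
    return list(batches(payloads, batch_size))
```

Batching is there because SaaS APIs are typically rate limited, which is one reason the ingress side matters so much for time to value.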

https://medium.com/memory-leak/reverse-etl-a-primer-4e6694dcc7fb


JPMorgan Chase: Implementing a Data Mesh Architecture at JPMC

JPMC talks about the team's thinking and implementation strategy for adopting the data mesh principles. The talk is a good narrative of structuring the data mesh principles, publishing taxonomy, and the need for a pragmatic compromise while adopting them.

https://www.dremio.com/subsurface/implementing-a-data-mesh-architecture-at-jpmc


DBT: The Modern Data Stack: Past, Present, and Future

What does the data engineering world look like beyond the Hadoop ecosystem? The blog from DBT gives a comprehensive overview of the modern data stack, starting from the introduction of Redshift and its impact on the data warehouse. The blog lays out the challenges ahead and reminds us that this space is wide open for innovation over the coming decades.

https://blog-getdbt-com.cdn.ampproject.org/c/s/blog.getdbt.com/future-of-the-modern-data-stack/amp/


Validio: ML & Data Trends: Wrapping up 2020 and looking into 2021 & beyond

The blog narrates how the underlying data infrastructure influences ML development, in line with the recent trends around "Reverse ETL" and the modern cloud-native data stack. The blog also reiterates that we are still in the early stages of MLOps, data quality tooling, and unified data architecture on the path to industrializing ML development.

https://medium.com/validio/ml-data-trends-wrapping-up-2020-and-looking-into-2021-beyond-b3ff1eadc211


Airbnb: Visualizing Data Timeliness at Airbnb

Commitment, Consistency & Clarity in the data pipeline are the core principles for building trust in data and empowering a data-driven culture. Airbnb writes an exciting blog about its SLA Tracker and how it took a data-driven approach to debugging the data pipeline and improving efficiency.

https://medium.com/airbnb-engineering/visualizing-data-timeliness-at-airbnb-ee638fdf4710


Confluent/Pinterest: Lessons Learned from Running Apache Kafka at Scale at Pinterest

Pinterest writes about its lessons learned from running Apache Kafka at scale. Broker replacement, partition rebalancing, and cost control are the common challenges of running Kafka at scale, and the blog narrates how automation can help with these tasks. Pinterest's Orion is an exciting project to watch.

https://www.confluent.io/blog/running-kafka-at-scale-at-pinterest/


Confluent: 42 Things You Can Stop Doing Once ZooKeeper Is Gone from Apache Kafka

Confluent writes about how removing the ZooKeeper dependency can improve the Kafka infrastructure in terms of performance, capacity planning, operations, and monitoring. The KIP-500 RFC on replacing ZooKeeper with a self-managed quorum is an exciting read.

https://www.confluent.io/blog/42-ways-zookeeper-removal-improves-kafka/


LinkedIn: Solving the data integration variety problem at scale, with Gobblin

The growing number of niche SaaS applications adds complexity to data ingestion into the data warehouse. LinkedIn writes about Apache Gobblin's unique approach to building data integration at scale. Instead of relying upon per-source connectors, the multi-stage protocol & message format architecture seems an elegant solution for a complex problem.

https://engineering.linkedin.com/blog/2021/data-integration-library


Facebook: Mitigating the effects of silent data corruption at scale

In a large-scale infrastructure, files are usually compressed when they are not being read and decompressed when a request comes in to read them. What happens when the decompression fails? How often does it fail? Facebook writes an exciting blog about its paper on silent data corruption at scale.
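
For a feel of what detecting such a failure at the application level can look like, here is a minimal sketch. It is my own illustration and not the method Facebook describes: it records a SHA-256 digest of the original bytes at write time and verifies it after decompression on read. The function names `store` and `load` and the `CorruptionError` type are hypothetical.

```python
import hashlib
import zlib
from typing import Tuple


class CorruptionError(Exception):
    """Raised when stored bytes fail to decompress or fail the digest check."""


def store(data: bytes) -> Tuple[bytes, str]:
    """Compress data and record a SHA-256 digest of the original bytes."""
    return zlib.compress(data), hashlib.sha256(data).hexdigest()


def load(blob: bytes, expected_digest: str) -> bytes:
    """Decompress blob and verify it against the digest taken at write time."""
    try:
        data = zlib.decompress(blob)
    except zlib.error as exc:
        # The stream itself is damaged (for example truncated).
        raise CorruptionError("decompression failed") from exc
    if hashlib.sha256(data).hexdigest() != expected_digest:
        # Decompression "succeeded" but produced different bytes.
        raise CorruptionError("digest mismatch after decompression")
    return data
```

A check like this only notices corruption at read time and assumes the digest itself was computed correctly, so it says nothing about where in the stack the fault happened.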

https://engineering.fb.com/2021/02/23/data-infrastructure/silent-data-corruption/


Reddit: Scaling Reporting at Reddit

Reddit writes about its journey of scaling the reporting platform from Redis to Apache Druid. The blog discusses the broader limitations of adopting key-value storage for serving analytics, the overhead on application development, and operational issues with unknown bugs.

https://redditblog.com/2021/02/26/scaling-reporting-at-reddit/


LinkedIn: DataHub Project Updates (February 2021 Edition)

One of the challenges of adopting a modern data stack is that it is confined to dashboarding and reporting use cases. It is refreshing to read that the recent LinkedIn DataHub release focuses on adopting GraphQL to ease the integration with broader infrastructure components.

https://medium.com/datahub-project/linkedin-datahub-project-updates-february-2021-edition-338d2c6021f0


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.