Welcome to the 30th edition of the data engineering newsletter. This week's release is a new set of articles that focus on Uber’s schema-agnostic log analytics platform, Google’s open-source model search system, Intuit’s Data Mesh strategy, Salesforce’s secure data intelligence platform, Netflix’s composable data pipeline, BrightWind’s wind analytical data hub, Apache Pinot’s star-tree indexing, Squarespace’s A/B testing platform, a Snowflake vs. Redshift comparison, and an overview of the modern analytical stack.
Uber:
Fast and Reliable Schema-Agnostic Log Analytics Platform
Elasticsearch provides dynamic schema inference to improve the performance of log indexing. However, dynamic type inference often leads to type conflict errors, which cause the offending logs to be dropped. The indexing and operational overhead also drives up the cloud cost of maintaining the log search engine. Uber writes about its schema-agnostic data model for log search, built on top of ClickHouse. The type-specific schema design is an elegant answer to the complicated needs of a log search engine.
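To make the idea of a type-specific, schema-agnostic layout concrete, here is a minimal sketch. It assumes the general approach of flattening each log into field names and values grouped by value type, so the same field name can carry different types across logs without a conflict. The function names and the column layout are hypothetical illustrations, not Uber's implementation or ClickHouse's API.

```python
import json
from typing import Any, Dict, Iterator, List, Tuple


def flatten(record: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dotted.path, value) pairs for every leaf in a nested log record."""
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def type_bucket(value: Any) -> str:
    """Pick the per-type column group a value belongs to."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, (int, float)):
        return "number"
    return "string"


def to_typed_columns(record: Dict[str, Any]) -> Dict[str, Dict[str, List[Any]]]:
    """Group field names and values by type (hypothetical layout).

    Each type gets a pair of parallel arrays: names[i] is the field path
    and values[i] is its value.
    """
    columns: Dict[str, Dict[str, List[Any]]] = {}
    for path, value in flatten(record):
        bucket = type_bucket(value)
        if bucket == "string" and not isinstance(value, str):
            value = json.dumps(value)  # lists, None, etc. are kept as JSON text
        group = columns.setdefault(bucket, {"names": [], "values": []})
        group["names"].append(path)
        group["values"].append(value)
    return columns
```

With this layout, a log carrying `"status": 200` and another carrying `"status": "ok"` are both stored: the first lands in the number group and the second in the string group, so neither has to be rejected.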
Google:
Introducing Model Search: An Open Source Platform for Finding Optimal ML Models
In recent years, AutoML algorithms have emerged to help researchers find the right neural network automatically without the need for manual experimentation. To extend access to AutoML solutions, Google open-sourced Model Search, a platform built on top of TensorFlow that helps researchers develop the best ML models efficiently and automatically.
https://ai.googleblog.com/2021/02/introducing-model-search-open-source.html
Intuit:
Intuit’s Data Mesh Strategy
The Data Mesh principles, which build on Domain-Driven Design and emphasize ownership and accountability through a singularly focused data-as-a-product approach, resonate well with many large-scale data teams. Intuit writes an exciting blog about its learnings, vision, and strategy for adopting data mesh principles.
https://medium.com/intuit-engineering/intuits-data-mesh-strategy-778e3edaa017
LinkedIn:
LinkedIn Sales Insights: Quality data foundations for smarter sales planning
Data quality defines the success or failure of a data platform.
LinkedIn writes about its sales analytics platform, focusing on why data quality is critical to its success. The blog narrates the importance of treating data quality as a key SLI and of measuring data consistency, completeness, and freshness.
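As a rough illustration of what measuring completeness, freshness, and consistency as SLIs can look like, here is a minimal sketch. The metric definitions, function names, and field names are assumptions made for this example and are not taken from LinkedIn's platform.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List, Sequence


def completeness(rows: List[Dict[str, Any]], required: Sequence[str]) -> float:
    """Fraction of rows in which every required field is present and non-null."""
    if not rows:
        return 0.0
    complete = sum(
        1 for row in rows if all(row.get(field) is not None for field in required)
    )
    return complete / len(rows)


def freshness(rows: List[Dict[str, Any]], ts_field: str, now: datetime) -> timedelta:
    """Age of the most recently updated row, relative to `now`."""
    latest = max(row[ts_field] for row in rows)
    return now - latest


def consistency(source_count: int, target_count: int) -> float:
    """Ratio of row counts between a source and its downstream copy (1.0 = match)."""
    if source_count == 0 and target_count == 0:
        return 1.0
    return min(source_count, target_count) / max(source_count, target_count)
```

Each function returns a single number (or duration) that can be tracked over time and compared against a target, which is what makes it usable as an SLI.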
Salesforce:
Building a Secured Data Intelligence Platform
Data privacy and security are vital aspects of the data infrastructure. A security-driven design is essential for infrastructure that handles the business's most sensitive information. Salesforce writes an exciting blog on data platform security design throughout the data lifecycle, covering encryption keys, in-transit encryption, authentication & access control, multi-tenancy, and third-party access.
https://engineering.salesforce.com/building-a-secured-data-intelligence-platform-ba85411a0c1b
Netflix:
Netflix Data Mesh - Composable Data Processing
Netflix talked about its composable data processing pipeline connecting Netflix’s various content production studios. The talk covers the challenges of self-serve data processing and the complications of integrating an index-based schema evolution system (Iceberg) with a name-based schema evolution system (Avro), and it is an exciting one to watch.
Please note that Data Mesh is an overloaded term here and should not be confused with the Data Mesh principles :-)
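The difference between the two schema evolution styles mentioned above shows up most clearly on a column rename. The toy sketch below uses plain dictionaries only; it does not call the Avro or Iceberg libraries, and the function and field names are hypothetical. It assumes name-based resolution without aliases (Avro itself supports aliases to handle renames).

```python
from typing import Any, Dict, List


def read_by_name(record: Dict[str, Any], reader_fields: List[str]) -> Dict[str, Any]:
    """Name-based resolution: the reader looks up each field by its current name."""
    return {name: record.get(name) for name in reader_fields}


def read_by_id(values_by_id: Dict[int, Any], reader_schema: Dict[int, str]) -> Dict[str, Any]:
    """ID-based resolution: the reader maps stable field ids to current names."""
    return {name: values_by_id.get(field_id) for field_id, name in reader_schema.items()}


# Data written while the column was still called "title" (field id 1).
written_by_name = {"title": "Stranger Things"}
written_by_id = {1: "Stranger Things"}

# The reader's schema after renaming "title" to "show_title".
renamed_by_name = read_by_name(written_by_name, ["show_title"])
renamed_by_id = read_by_id(written_by_id, {1: "show_title"})
```

Here `renamed_by_name` comes back with `show_title` set to `None`, because the old data only knows the old name, while `renamed_by_id` still finds the value through the unchanged field id. Bridging systems that follow different conventions means reconciling exactly this kind of mismatch.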
BrightHub:
BrightWind’s wind resource data hub
Modern internet companies have the luxury of running in a "controlled data production environment" to a large extent, yet they still struggle with data quality and accessibility. A data platform's challenges multiply when the data is produced either manually or by non-connected or semi-connected devices. BrightWind, the wind & solar analytical system, writes about its experience building data infrastructure.
Part 1:
BrightWind’s wind resource data hub
Part 2:
Handling difficult wind resource data
Part 3:
Ingesting daily data files
Part 4:
Using S3 for spiky time series data ingestion
Apache Pinot:
Star-Tree Index: Space-Time Trade-Off in OLAP
Multi-model and multi-index databases are an exciting space to watch. Apache Pinot's talk in CMU's vaccination series is an exciting look at the internals of how Pinot effectively uses star-tree indexing to provide predictable response times for latency-sensitive applications.
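The space-time trade-off in the title can be illustrated with a simplified sketch of pre-aggregation. The code below materializes a sum for every combination of dimension values, where any dimension may be collapsed into a "star" (all values) entry. This is a deliberate simplification and not Pinot's implementation: a real star-tree organizes these aggregates in a tree and bounds how much it materializes, while this toy version stores everything. All names are hypothetical.

```python
from collections import defaultdict
from itertools import product
from typing import Any, Dict, List, Sequence, Tuple

STAR = "*"  # stands for "all values of this dimension"


def build_preaggregates(
    rows: List[Dict[str, Any]], dimensions: Sequence[str], metric: str
) -> Dict[Tuple[Any, ...], int]:
    """Pre-compute SUM(metric) for every mix of concrete and STAR dimension values."""
    totals: Dict[Tuple[Any, ...], int] = defaultdict(int)
    for row in rows:
        values = [row[d] for d in dimensions]
        # Each dimension either keeps its value or collapses into STAR.
        for mask in product((False, True), repeat=len(dimensions)):
            key = tuple(STAR if collapsed else v for v, collapsed in zip(values, mask))
            totals[key] += row[metric]
    return dict(totals)


def query(
    totals: Dict[Tuple[Any, ...], int], dimensions: Sequence[str], filters: Dict[str, Any]
) -> int:
    """Answer an equality-filtered SUM with a single lookup, no scan over raw rows."""
    key = tuple(filters.get(d, STAR) for d in dimensions)
    return totals.get(key, 0)


rows = [
    {"country": "US", "browser": "chrome", "clicks": 3},
    {"country": "US", "browser": "safari", "clicks": 5},
    {"country": "IE", "browser": "chrome", "clicks": 2},
]
dims = ["country", "browser"]
totals = build_preaggregates(rows, dims, "clicks")
us_clicks = query(totals, dims, {"country": "US"})  # 8
all_clicks = query(totals, dims, {})                # 10
```

The extra storage spent on pre-aggregated entries is what buys the predictable lookup cost at query time, regardless of how many raw rows match the filter.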
Squarespace:
How We Reimagined A/B Testing at Squarespace
Tools drive the process, and process drives the engineering culture.
Squarespace writes an exciting read on how a unified experimentation framework fosters a culture of experimentation. The blog narrates how standardizing experiment assignment, analytics, and the statistical approach drove the platform unification.
https://engineering.squarespace.com/blog/2021/how-we-reimagined-ab-testing-at-squarespace
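One common way to standardize experiment assignment is deterministic, hash-based bucketing, sketched below. This is a generic technique shown under assumed names; it is not a description of Squarespace's system.

```python
import hashlib
from typing import Sequence


def assign_variant(
    experiment: str, user_id: str, variants: Sequence[str] = ("control", "treatment")
) -> str:
    """Deterministically map a user to a variant of one experiment.

    Hashing the experiment name together with the user id keeps a user's
    assignment stable across requests while keeping assignments in
    different experiments independent of each other.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]
```

Because the assignment depends only on the experiment name and the user id, any service can compute it without a shared lookup table, and the analytics side can recompute it to verify exposure logs.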
GitConnected:
Snowflake vs Redshift RA3 — The need for (more than just) speed
The separation of storage and compute has become the de facto architectural pattern for MPP engines. Presto and Snowflake adopted the pattern long ago, and Redshift joined the club with RA3 clusters and a competitive pricing model. The blog is an excellent comparison of Snowflake vs. Redshift in terms of performance, usage cost, idle cost, scaling, workload isolation, and automated data masking.
Technically.dev:
What Your Data Team Is Using: The Analytics Stack
The art and science of data infrastructure is all about stitching diverse yet related tools together so that they work as one :-). A few seemingly small differences can cause big headaches when it comes to interoperability. The blog is an excellent summary of various analytical tools, categorized by where data comes from, where data goes, how data moves around, how data gets ready, and how data gets used.
https://technically.dev/posts/what-your-data-team-is-using
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.