Vladimir Prus wrote an interesting article on why their team moved its Apache Spark workloads to Kubernetes. One particular quote in the article made me think about the EMR pricing model.
The first goal was cost reduction. With AWS EMR you pay for the EC2 instances and for the EMR itself. You can use spot instances to reduce EC2 costs, but then the EMR surcharge can add 50% to the total bill.
Why is there an EMR surcharge of 50%?
Before that, a brief overview of EMR from my perspective. Amazon EMR is a cloud big data platform for running data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto. AWS EMR is perhaps the most successful version of the now-retired Apache Ambari project.
AWS continuously bundles open-source solutions such as Apache Spark, Apache HBase, Hive, and TensorFlow, and more recently Hudi and Apache Iceberg. Here is the list of open-source tools available as part of EMR.
The 50% EMR surcharge exists simply because AWS charges an EMR license cost per instance!!! A 50% surcharge for a service that bundles open-source frameworks was a surprise. I started researching which EMR features might justify this 50% license cost.
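One way to read the 50% figure from the quote is that the EMR fee stays fixed per instance-hour while spot pricing shrinks the EC2 part of the bill. The sketch below illustrates that reading. The prices and the spot discount are illustrative assumptions, not figures quoted from AWS, and `surcharge_ratio` is a hypothetical helper name.

```python
def surcharge_ratio(emr_fee_per_hour: float, ec2_price_per_hour: float) -> float:
    """Return the EMR fee as a fraction of the EC2 price actually paid."""
    if ec2_price_per_hour <= 0:
        raise ValueError("EC2 price must be positive")
    return emr_fee_per_hour / ec2_price_per_hour


# Assumed, illustrative prices in USD per instance-hour.
EC2_ON_DEMAND = 0.192   # assumed on-demand EC2 price
EMR_FEE = 0.048         # assumed EMR fee, taken to be the same for on-demand and spot
SPOT_DISCOUNT = 0.50    # assumed spot discount relative to on-demand

spot_price = EC2_ON_DEMAND * (1 - SPOT_DISCOUNT)

on_demand_ratio = surcharge_ratio(EMR_FEE, EC2_ON_DEMAND)
spot_ratio = surcharge_ratio(EMR_FEE, spot_price)

print(f"on-demand: EMR fee is {on_demand_ratio:.0%} of the EC2 cost")
print(f"spot:      EMR fee is {spot_ratio:.0%} of the EC2 cost")
```

Under these assumptions the same fee is about a quarter of the on-demand EC2 cost but about half of the spot EC2 cost, so the surcharge percentage depends heavily on how the instances are purchased.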
Maybe EMRFS?
The EMR File System (EMRFS) is an implementation of HDFS that all Amazon EMR clusters use for reading and writing regular files from Amazon EMR directly to Amazon S3. Its consistent view feature was a workaround to support a consistent view of data on top of S3. However, S3 has since announced strong read-after-write consistency, which eliminates the need for EMRFS consistent view.
Reference:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-plan-consistent-view.html
Maybe EMR auto-scaling?
Perhaps, but for auto-scaling to work, one needs to pay for AWS CloudWatch. Further, if auto-scaling is the reason, is it fair to expect that disabling auto-scaling waives the 50% EMR surcharge?
How is the pricing in comparison with other AWS services?
Similar to EMR, AWS offers managed services on top of Kubernetes [EKS] and Kafka [MSK]. Let's compare how the pricing looks for these services.
AWS EKS [Flat fee for a cluster]
You pay $0.10 per hour for each Amazon EKS cluster that you create. You can use a single EKS cluster to run multiple applications by taking advantage of Kubernetes namespaces and IAM security policies.
Hmm, it is not a per-instance model but rather a flat fee per cluster.
Reference:
https://aws.amazon.com/eks/pricing/
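To make the difference between the two fee models concrete, here is a minimal sketch comparing a flat per-cluster fee with a per-instance fee as the cluster grows. The $0.10 per hour comes from the EKS quote above. The per-instance fee, the 720-hour month, and the function names are assumptions chosen for illustration only.

```python
EKS_CLUSTER_FEE_PER_HOUR = 0.10  # USD, from the EKS pricing quote above


def flat_cluster_fee(hours: float, fee_per_hour: float = EKS_CLUSTER_FEE_PER_HOUR) -> float:
    """Service fee for a flat per-cluster model: independent of node count."""
    return fee_per_hour * hours


def per_instance_fee(instances: int, hours: float, fee_per_instance_hour: float) -> float:
    """Service fee for a per-instance model: grows linearly with node count."""
    return instances * fee_per_instance_hour * hours


HOURS_PER_MONTH = 720          # assumed 30-day month
ASSUMED_INSTANCE_FEE = 0.048   # hypothetical per-instance fee in USD per hour

for nodes in (5, 50, 500):
    flat = flat_cluster_fee(HOURS_PER_MONTH)
    scaled = per_instance_fee(nodes, HOURS_PER_MONTH, ASSUMED_INSTANCE_FEE)
    print(f"{nodes:>3} nodes: flat fee ${flat:,.2f} vs per-instance fee ${scaled:,.2f}")
```

The flat fee is the same for every cluster size, while the per-instance fee scales with the number of nodes, which is why the two models diverge quickly for large clusters.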
AWS MSK [Special instance pricing: No separate license cost]
AWS quotes MSK instances as “Kafka instances.” Kafka instance pricing is significantly higher than the respective EC2 instance pricing. The significant thing to note here is:
You do not pay for Apache ZooKeeper nodes that Amazon MSK provisions for you or for data transfer that occurs between brokers or between Apache ZooKeeper nodes and brokers within your clusters.
Essentially you’re getting the replication traffic for free in Kafka.
Reference:
https://aws.amazon.com/msk/pricing/
The AWS EMR pricing model made me curious about how competing cloud providers price their EMR-like offerings.
Google Cloud Dataproc
Google Dataproc charges a flat rate of $0.010 per vCPU per hour. The pricing model looks similar to AWS EKS in that it is a flat rate rather than a license cost that varies by instance type.
The Dataproc pricing formula is: $0.010 * # of vCPUs * hourly duration.
Reference: https://cloud.google.com/dataproc/pricing
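The Dataproc formula above can be sketched in a few lines. The rate is the one stated in the formula. The cluster shape (6 nodes with 4 vCPUs each, running for 2 hours) and the function name `dataproc_fee` are hypothetical, and the result covers only the Dataproc fee, not the underlying compute cost.

```python
DATAPROC_RATE_PER_VCPU_HOUR = 0.010  # USD, from the pricing formula above


def dataproc_fee(vcpus: int, hours: float,
                 rate: float = DATAPROC_RATE_PER_VCPU_HOUR) -> float:
    """Dataproc fee = rate * number of vCPUs * duration in hours."""
    if vcpus < 0 or hours < 0:
        raise ValueError("vcpus and hours must be non-negative")
    return rate * vcpus * hours


# Hypothetical cluster: 6 nodes with 4 vCPUs each, running for 2 hours.
nodes, vcpus_per_node, hours = 6, 4, 2
fee = dataproc_fee(nodes * vcpus_per_node, hours)
print(f"Dataproc fee: ${fee:.2f}")
```

Because the fee depends only on the vCPU count and the duration, two clusters with the same total vCPUs pay the same Dataproc fee regardless of the machine types behind them.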
Azure HDInsight
No surprise!!! Azure HDInsight follows the same pricing strategy as EMR. Is it possible Azure simply mimics the AWS EMR pricing model?
Reference: https://azure.microsoft.com/en-us/pricing/details/hdinsight/
Conclusion
I looked over the EMR and HDInsight features again. I’m still trying to figure out which additional features AWS offers that justify a per-instance EMR license cost.
I guess that is the question I leave to the readers. Please comment or reply if you understand more about the reasoning behind the AWS EMR pricing model.
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
I don't get where this 50% comes from. When I take the full price list (https://aws.amazon.com/emr/pricing/), I get an EMR cost as low as 1.10% for p3.16xlarge, and the highest is 30.77% for g2.2xlarge.
There's a bunch of instance types that don't make sense in an EMR cluster. In my experience the EMR markup is more like 10-15% on average, which seems rather OK for the service it offers.