Spark on Yarn Resources: Understanding the Configuration
As I delved into the world of Spark on Yarn, I ran into a puzzling issue: the resources I requested (memory and number of cores) did not match the usage displayed on the Yarn UI page. This article explains why, and gives a walkthrough of Spark on Yarn resource configuration.
Physical Resources on Each Node
Our Yarn cluster consists of six nodes, each equipped with 16GB of physical memory and four physical cores. To leave room for other applications and system processes, we gave Yarn only 12GB of physical memory and 8 virtual cores on each node.
A key concept to grasp is the notion of virtual cores. By default, Yarn can use all physical cores available on each node. In our case, we configured each node to expose 8 virtual cores, which means each physical core backs two virtual cores. If we configured 12 virtual cores instead, each physical core would back three virtual cores.
Configuration Files
To achieve the desired resource allocation, we modified the following configuration files:
capacity-scheduler.xml: This file contains the resource-calculator property, which is set to org.apache.hadoop.yarn.util.resource.DominantResourceCalculator. This calculator accounts for both cores and memory, unlike the default DefaultResourceCalculator, which only considers memory.
yarn-site.xml: This file contains the properties that define the resources available to each NodeManager:
- yarn.nodemanager.resource.memory-mb: sets the physical memory available to each NodeManager to 12GB.
- yarn.nodemanager.resource.cpu-vcores: sets the virtual cores available to each NodeManager to 8.
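In the standard Hadoop XML configuration format, the settings above look roughly like this (values taken from our setup; the full property name of the calculator setting in capacity-scheduler.xml is yarn.scheduler.capacity.resource-calculator):

```xml
<!-- capacity-scheduler.xml: use a calculator that considers both memory and vcores -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>

<!-- yarn-site.xml: resources each NodeManager offers to Yarn -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>12288</value> <!-- 12GB, in MB -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>
</property>
```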
Container-Related Configuration
In addition to the above configurations, we also modified the following properties in yarn-site.xml to define the resources available to each container:
- yarn.scheduler.minimum-allocation-mb: sets the minimum memory allocation for each container to 1GB.
- yarn.scheduler.maximum-allocation-mb: sets the maximum memory allocation for each container to 12GB.
- yarn.scheduler.minimum-allocation-vcores: sets the minimum virtual core allocation for each container to 1.
- yarn.scheduler.maximum-allocation-vcores: sets the maximum virtual core allocation for each container to 8.
- yarn.scheduler.increment-allocation-mb: sets the memory allocation increment (the rounding step) for each container to 1GB.
- yarn.nodemanager.vmem-pmem-ratio: sets the ratio of virtual memory to physical memory for each NodeManager to 2.1.
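As an XML fragment, the container-related entries in yarn-site.xml would look like this (values from our setup, memory expressed in MB):

```xml
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value> <!-- smallest container: 1GB -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>12288</value> <!-- largest container: 12GB -->
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>8</value>
</property>
<property>
  <name>yarn.scheduler.increment-allocation-mb</name>
  <value>1024</value> <!-- requests are rounded up to a multiple of 1GB -->
</property>
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>
```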
Spark Submit Command
To test the configuration, we submitted a Spark task using the following command:
$SPARK_HOME/bin/spark-submit \
--class com.bonc.rdpe.spark.test.yarn.WordCount \
--master yarn \
--deploy-mode client \
--conf spark.yarn.am.memory=512m \
--conf spark.yarn.am.cores=2 \
--conf spark.executor.memory=1g \
--conf spark.executor.cores=4 \
--conf spark.executor.instances=6 \
./spark-test-1.0.jar
Resource Allocation
After submitting the task, we observed the following resource allocation:
- 7 containers were launched: 1 for the ApplicationMaster and the remaining 6 for executors.
- The total vcore usage was 26: 2 vcores for the ApplicationMaster plus 4 vcores for each of the 6 executors.
- The total memory usage was 13GB: 1GB for the ApplicationMaster plus 12GB for the 6 executors.
Analyzing Memory Allocation
To understand why the actual memory allocation exceeded what the application requested, we analyzed how the container size is computed. For each container, Spark requests the configured heap memory plus an off-heap memory overhead, which by default is the larger of 384MB and 10% of the heap. For the ApplicationMaster this gives 512MB + 384MB = 896MB, which Yarn rounds up to 1GB because of the yarn.scheduler.increment-allocation-mb property.
Similarly, each executor is allocated 1024MB + 384MB = 1408MB, which is rounded up to 2GB for the same reason.
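This rounding can be reproduced with a short calculation. The sketch below assumes the default overhead of max(384MB, 10% of the heap) and our 1GB allocation increment; the function names are mine, not Spark's or Yarn's:

```python
import math

def container_size_mb(heap_mb, increment_mb=1024, min_overhead_mb=384):
    """Heap plus default overhead (the larger of 384MB and 10% of the heap),
    rounded up to the scheduler's allocation increment."""
    overhead = max(min_overhead_mb, heap_mb // 10)
    requested = heap_mb + overhead
    return math.ceil(requested / increment_mb) * increment_mb

print(container_size_mb(512))   # ApplicationMaster: 512 + 384 = 896 -> 1024 (1GB)
print(container_size_mb(1024))  # Executor: 1024 + 384 = 1408 -> 2048 (2GB)
```

With 1 ApplicationMaster container and 6 executor containers, this reproduces the 13GB total reported on the Yarn UI.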
Configuring Memory Overhead
We can configure the memory overhead for the ApplicationMaster and executors using the following properties:
- spark.yarn.am.memoryOverhead (for client mode)
- spark.driver.memoryOverhead (for cluster mode)
- spark.executor.memoryOverhead
By configuring these properties, we can customize the memory overhead for the ApplicationMaster and executors.
Configuring vCore
We can also configure the vCore for the ApplicationMaster and executors using the following properties:
- spark.yarn.am.cores (for client mode)
- spark.driver.cores (for cluster mode)
- spark.executor.cores
By configuring these properties, we can customize the vCore for the ApplicationMaster and executors.
Forecasting Resource Usage
Based on this analysis, we can forecast the resource usage of a submission before running it. Suppose, for example, we request an ApplicationMaster with 4 vcores and memory that rounds up to 2GB, plus 4 executors with 2 vcores each and memory that rounds up to 3GB per container. We would then expect:
- 5 containers, with 1 container dedicated to the ApplicationMaster and the remaining 4 dedicated to executors.
- A total vcore usage of 4 + 4 * 2 = 12.
- A total memory usage of 2GB + 4 * 3GB = 14GB.
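The forecast arithmetic can be checked with a quick sketch (the function and parameter names are mine; the container counts and per-container sizes come from the figures above):

```python
def totals(am_vcores, am_mem_gb, executors, exec_vcores, exec_mem_gb):
    """Total containers, vcores, and memory (GB) for one Spark-on-Yarn app:
    one ApplicationMaster container plus the requested executors."""
    containers = 1 + executors
    vcores = am_vcores + executors * exec_vcores
    mem_gb = am_mem_gb + executors * exec_mem_gb
    return containers, vcores, mem_gb

# Forecast: AM with 4 vcores/2GB, plus 4 executors with 2 vcores/3GB each
print(totals(4, 2, 4, 2, 3))  # -> (5, 12, 14)
```

Plugging in the earlier submission (AM with 2 vcores/1GB, 6 executors with 4 vcores/2GB each) gives (7, 26, 13), matching the 7 containers, 26 vcores, and 13GB we observed.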
If the Yarn UI reports the same figures after submission, we know the configuration is behaving as intended.
By understanding the Spark on Yarn resources configuration, we can customize the resource allocation for our applications and executors. This knowledge will enable us to optimize our resource usage and improve the performance of our Spark applications.