HBase and Spark Performance Tuning

This section describes how to tune HBase and Spark, based on the available machines and their hardware specifications, to obtain maximum performance.

Spark

Apache Spark is very resource intensive, so YARN and Spark resources must be allocated carefully to achieve the expected performance. Each Spark application consists of a single driver process and a set of executor processes distributed across the nodes of the cluster. Each executor in an application has a fixed number of cores and a fixed heap size. To obtain high throughput, you have to explicitly set the number of executor instances, and the cores and heap size used by each executor.
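For example, these resources can be fixed when the Spark session is created. The following is a minimal PySpark sketch; the application name is a placeholder, and the property values are the ones derived later in this section.

from pyspark.sql import SparkSession

# Executor resources are fixed for the lifetime of the application,
# so they must be chosen before the job starts.
spark = (SparkSession.builder
         .appName("tuning-example")                 # hypothetical app name
         .config("spark.executor.instances", "10")  # number of executors
         .config("spark.executor.cores", "2")       # cores per executor
         .config("spark.executor.memory", "4g")     # heap per executor
         .getOrCreate())

Note that spark.driver.memory generally has to be supplied before the driver JVM starts (through spark-defaults.conf or spark-submit), so it is omitted from this sketch.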

The recommended tuning configuration is illustrated based on the following hardware specification.

RAM                16 GB
Hard disk          500 GB
Number of cores    8
Operating system   Windows Server 2012

Number of nodes used in the Hadoop cluster: two name nodes and three data nodes.

Performance tuning and optimization

At the outset we used the configuration settings listed in the following table.

Configuration file: yarn-site.xml

Property name                          Property value
yarn.nodemanager.resource.memory-mb    14336
yarn.scheduler.minimum-allocation-mb   1024
yarn.scheduler.maximum-allocation-mb   6144
  • yarn.nodemanager.resource.memory-mb: Maximum amount of RAM, in MB, that can be allocated for YARN use on each node of the cluster.
  • yarn.scheduler.minimum-allocation-mb: Specifies the minimum memory allocation, in MB, for a container on each node.
  • yarn.scheduler.maximum-allocation-mb: Specifies the maximum memory allocation, in MB, for a container on each node.

With the YARN configuration above, a maximum of 14 GB of RAM is allocated for MapReduce job utilization, and the remaining 2 GB is left for other applications and the OS.
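For reference, these settings take the standard Hadoop property form in yarn-site.xml; the following is a sketch of the relevant fragment, with the values from the table above:

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>14336</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>6144</value>
</property>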

This section explains how to calculate the Spark configuration for the hardware specification above. The following three properties allow a Spark job to run efficiently by properly utilizing the cluster nodes; they are derived as follows (a script version of the calculation appears after this list).

  • To run multiple tasks in each executor JVM while letting the Hadoop daemons run smoothly, assign roughly (8 cores / 3) cores per executor; therefore spark.executor.cores = 2 (for good HDFS throughput).
  • Available cores per node: leave 1 core for the Hadoop/YARN daemons out of the actual core count on each data node, that is 8 cores - 1 core = 7 cores.
  • Total available cores in the cluster: (available cores per node * number of data nodes), that is (7 cores * 3) = 21 cores.
  • Number of available executors: (total available cores / cores per executor), that is (21 cores / 2) = 10 (rounded down), which gives spark.executor.instances = 10.
  • Executors per node: (available executors / number of data nodes), that is (10 / 3) ≈ 3.
  • Memory per executor: (RAM size of each node / executors per node), that is (16 GB / 3) ≈ 5 GB.
  • Executor memory: subtract the off-heap overhead, taken as 7% of the memory per executor, that is (7% of 5 GB) = 0.35 GB; therefore spark.executor.memory = 5 - 0.35 = 4.65 GB (set as 4g below).
  • Driver memory is used by Spark to track RDDs and hold results collected back to the driver; it defaults to 1g and is typically raised up to 60% of the executor memory (here, spark.driver.memory = 3g).
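The sizing arithmetic above can be captured in a short script. This is an illustrative Python sketch of the rules of thumb used here, not an official formula:

# Hardware described earlier in this section.
cores_per_node = 8
data_nodes = 3
ram_per_node_gb = 16

available_cores = cores_per_node - 1                   # 1 core kept for Hadoop/YARN daemons
total_cores = available_cores * data_nodes             # 7 * 3 = 21
executor_cores = 2                                     # small executors for HDFS throughput
executor_instances = total_cores // executor_cores     # 21 // 2 = 10
executors_per_node = executor_instances // data_nodes  # 10 // 3 = 3
memory_per_executor = ram_per_node_gb // executors_per_node  # 16 // 3 = 5 GB
overhead = 0.07 * memory_per_executor                  # off-heap overhead, ~7%
executor_memory = memory_per_executor - overhead       # 5 - 0.35 = 4.65 GB

print("spark.executor.cores     =", executor_cores)
print("spark.executor.instances =", executor_instances)
print("spark.executor.memory    = %.2f GB (set as 4g)" % executor_memory)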

NOTE

These settings may affect the performance of other types of Hadoop jobs, such as MR, Pig, and Hive jobs, while Spark jobs are actively running.

Configuration file: spark-defaults.conf

spark.driver.memory       3g
spark.executor.memory     4g
spark.executor.instances  10
spark.executor.cores      2

In addition, remove the spark.cleaner.ttl property from the configuration to avoid clearing cached data.
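Once a job is running, the values that actually took effect can be double-checked from the driver. The following is a small PySpark sketch, assuming an active SparkSession named spark:

# List the driver and executor properties Spark actually applied,
# e.g. to confirm that spark-defaults.conf was picked up.
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    if key.startswith(("spark.executor", "spark.driver")):
        print(key, "=", value)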

The recommended YARN and Spark configuration above can be applied through the cluster manager by following the steps below:

YARN configuration

Step 1: Click the Configuration option under the Administration menu and select the Hadoop service.

[Figure: SparkConfig dialog]

Step 2: In the yarn-site.xml file, add the following two new properties to set the minimum and maximum sizes for requested containers.

Property name                          Property value
yarn.scheduler.minimum-allocation-mb   1024
yarn.scheduler.maximum-allocation-mb   6144

Spark configuration

Step 1: Select Spark from the services and open the spark-defaults.conf file.

Step 2: Edit the spark.executor.memory property's value to 4g, as calculated above.

Step 3: In the same spark-defaults.conf file, add the following new properties and remove the spark.cleaner.ttl property.

Property name             Property value
spark.driver.memory       3g
spark.executor.memory     4g
spark.executor.instances  10
spark.executor.cores      2

[Figure: SparkConfig dialog]

Step 4: Save the changes and click Restart services to apply the changes across the cluster.

HBase

A common problem when running an HBase cluster under a high-velocity load (such as handling massive amounts of time-series data) is that HRegionServers often shut down due to GC "stop-the-world" pauses.
These pauses occur when the JVM heap fills up under heavy incoming-data traffic.

To prevent a full GC pause from occurring, we tuned the JVM to perform garbage collection more efficiently using the following settings:

  • -XX:+UseConcMarkSweepGC
  • -XX:+UseParNewGC
  • -XX:CMSInitiatingOccupancyFraction=60

The key idea is to ask the garbage collector to start collecting well before the heap is completely full, so that long stop-the-world pauses are avoided.
For more details, refer to the Oracle documentation on JVM memory management.
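To confirm that the pauses actually shorten, standard HotSpot GC-logging flags (for the Java 7/8-era CMS collector) can be appended to the same HBASE_OPTS; the log file name below is only an example:

HBASE_OPTS=... "-verbose:gc" "-XX:+PrintGCDetails" "-XX:+PrintGCDateStamps" "-Xloggc:gc-hbase.log"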

Follow the steps below in the Cluster Manager application to set these configurations.

Step 1: Click the Configuration option under the Administration menu.

[Figure: HBaseconfig dialog]

Step 2: Select HBase from the services.

[Figure: HBaseconfig dialog]

Step 3: In the hbase-env.cmd file, set the JVM flags in HBASE_OPTS as follows:

HBASE_OPTS="-XX:+UseConcMarkSweepGC" "-XX:+UseParNewGC" "-XX:CMSInitiatingOccupancyFraction=60" -Xmx8192m

[Figure: HBaseconfig dialog]

NOTE

Here we have set 8 GB (-Xmx8192m) as the heap memory, where the machine's primary memory is assumed to be 16 GB.

Step 4: Save the changes and click Restart services to apply the changes across the cluster.

[Figure: HBaseconfig dialog]