IV. Big Data Benchmark Tools - HiBench - Run SparkBench

1. Setup

  • Python 2.x (>= 2.6) is required.
  • bc is required to generate the HiBench report.
  • Supported Hadoop versions: Apache Hadoop 2.x, 3.0.x, 3.1.x, 3.2.x, CDH5.x, HDP
  • Supported Spark versions: 2.4.x, 3.0.x
  • Build HiBench according to the HiBench build guide.
  • Start HDFS, YARN, and Spark in the cluster.

Note: Starting from HiBench 8.0, support for Spark 2.3.x and earlier is deprecated; please either use an earlier HiBench release or upgrade your Spark.
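
Before continuing, you can sanity-check the prerequisites on the client machine. An illustrative check (not part of HiBench itself):

python -V                # expect Python 2.6+
which bc                 # needed to generate the HiBench report
hadoop version           # Hadoop client available on the PATH
spark-submit --version   # Spark client available on the PATH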

2. Configure hadoop.conf

Hadoop is used to generate the input data for the workloads. Create and edit conf/hadoop.conf:

cp conf/hadoop.conf.template conf/hadoop.conf

| Property | Meaning |
| --- | --- |
| hibench.hadoop.home | The Hadoop installation location |
| hibench.hadoop.executable | The path of the hadoop executable. For Apache Hadoop, it is /YOUR/HADOOP/HOME/bin/hadoop |
| hibench.hadoop.configure.dir | The Hadoop configuration directory. For Apache Hadoop, it is /YOUR/HADOOP/HOME/etc/hadoop |
| hibench.hdfs.master | The root HDFS path for storing HiBench data, e.g. hdfs://localhost:8020/user/username |
| hibench.hadoop.release | The Hadoop release provider. Supported values: apache, cdh5, hdp |

Note: CDH and HDP users should update hibench.hadoop.executable, hibench.hadoop.configure.dir, and hibench.hadoop.release accordingly; the default values are for the Apache release.
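
As a minimal sketch, assuming Apache Hadoop is installed under a hypothetical /opt/hadoop-3.2.2 with the NameNode on localhost (adjust every path and port to your cluster), conf/hadoop.conf might look like:

hibench.hadoop.home           /opt/hadoop-3.2.2
hibench.hadoop.executable     ${hibench.hadoop.home}/bin/hadoop
hibench.hadoop.configure.dir  ${hibench.hadoop.home}/etc/hadoop
hibench.hdfs.master           hdfs://localhost:8020
hibench.hadoop.release        apache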

3. Configure spark.conf

Create and edit conf/spark.conf:

cp conf/spark.conf.template conf/spark.conf

Set the following properties:

| Property | Meaning |
| --- | --- |
| hibench.spark.home | The Spark installation location |
| hibench.spark.master | The Spark master, e.g. spark://xxx:7077, yarn-client |
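
A minimal sketch, assuming Spark under a hypothetical /opt/spark-3.0.3 and submission to YARN in client mode:

hibench.spark.home    /opt/spark-3.0.3
hibench.spark.master  yarn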

4. Run a workload

To run a single workload, e.g. wordcount:

bin/workloads/micro/wordcount/prepare/prepare.sh
bin/workloads/micro/wordcount/spark/run.sh

The prepare.sh script launches a Hadoop job to generate the input data on HDFS, and the run.sh script submits the Spark job to the cluster. bin/run_all.sh can be used to run all workloads listed in conf/benchmarks.lst.
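
For reference, conf/benchmarks.lst lists one workload per line in the form category.name; a hypothetical selection for bin/run_all.sh might be:

micro.wordcount
micro.sort
micro.terasort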

5. View the report

The report/hibench.report file (relative to the HiBench root) is a summarized workload report, including the workload name, execution duration, data size, throughput per cluster, and throughput per node.

The report directory also includes further information for debugging and tuning:

  • <workload>/spark/bench.log: Raw logs on the client side.
  • <workload>/spark/monitor.html: System utilization monitor results.
  • <workload>/spark/conf/<workload>.conf: Generated environment-variable configuration for this workload.
  • <workload>/spark/conf/sparkbench/<workload>/sparkbench.conf: Generated configuration for this workload, used for mapping to environment variables.
  • <workload>/spark/conf/sparkbench/<workload>/spark.conf: Generated Spark configuration.
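
For example, after a wordcount run you can inspect the summary and the raw client log (paths follow the layout above):

cat report/hibench.report
less report/wordcount/spark/bench.log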

6. Input data size

To change the input data size, set hibench.scale.profile in conf/hibench.conf. Available values are tiny, small, large, huge, gigantic, and bigdata. The definitions of these profiles can be found in the workload's conf file, e.g. conf/workloads/micro/wordcount.conf.
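
For example, to switch to the large profile, set in conf/hibench.conf:

hibench.scale.profile  large

The workload conf file then maps each profile to a concrete input size via per-workload keys such as hibench.wordcount.large.datasize; check the conf file itself for the exact values rather than relying on any numbers quoted here.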

7. Tuning

Change the properties below in conf/hibench.conf to control parallelism:

| Property | Meaning |
| --- | --- |
| hibench.default.map.parallelism | Partition number in Spark |
| hibench.default.shuffle.parallelism | Shuffle partition number in Spark |

Change the properties below to control the Spark executor number, executor cores, executor memory, and driver memory:

| Property | Meaning |
| --- | --- |
| hibench.yarn.executor.num | Spark executor number in YARN mode |
| hibench.yarn.executor.cores | Spark executor cores in YARN mode |
| spark.executor.memory | Spark executor memory |
| spark.driver.memory | Spark driver memory |
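
A hedged example for conf/hibench.conf on a small YARN cluster (the numbers are illustrative starting points, not recommendations):

hibench.default.map.parallelism      160
hibench.default.shuffle.parallelism  160
hibench.yarn.executor.num    4
hibench.yarn.executor.cores  4
spark.executor.memory        4g
spark.driver.memory          2g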