Performance Optimization Strategies for Fully Utilizing Apache Spark

KIPS Transactions on Computer and Communication Systems, Vol. 7, No.1, pp.9-18, January 2018
DOI: 10.3745/KTCCS.2018.7.1.009

Abstract

Improving the performance of big data analytics in distributed environments has become an important issue because most big data applications, such as machine learning techniques and streaming services, rely on distributed computing frameworks. Optimizing the performance of such applications on Spark has therefore been actively researched. Optimizing applications in a distributed environment is challenging because it requires not only optimizing the applications themselves but also tuning the configuration parameters of the distributed system. Although prior studies have made considerable efforts to improve execution performance, most of them focused on only one of three performance optimization aspects: application design, system tuning, and hardware utilization. As a result, they could not handle the orchestration of those aspects. In this paper, we analyze in depth and model the application processing procedure of Spark. Based on the analysis, we propose performance optimization schemes for each step of the procedure: the inner stage and the outer stage. We also propose an appropriate partitioning mechanism by analyzing the relationship between partitioning parallelism and application performance. We applied these three optimization schemes to WordCount, PageRank, and K-means, which are basic big data analytics workloads, and observed a performance improvement of nearly 50% when all of the schemes were applied together.
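
To make the link between partitioning parallelism and performance concrete, the following is a minimal sketch and not the model proposed in the paper. It assumes that Spark runs one task per partition, that each task occupies one core (the default `spark.task.cpus=1`), and that all tasks take equal time. The function names `task_waves` and `slot_utilization` are hypothetical helpers introduced only for illustration.

```python
import math


def task_waves(num_partitions: int, num_executors: int, cores_per_executor: int) -> int:
    """Scheduling waves needed to run one task per partition.

    Assumption: each task occupies exactly one core slot.
    """
    total_slots = num_executors * cores_per_executor
    if num_partitions <= 0 or total_slots <= 0:
        raise ValueError("partitions and core slots must be positive")
    return math.ceil(num_partitions / total_slots)


def slot_utilization(num_partitions: int, num_executors: int, cores_per_executor: int) -> float:
    """Fraction of available core slots doing work across all waves.

    Assumption: all tasks take the same time, so every wave lasts equally long.
    """
    total_slots = num_executors * cores_per_executor
    waves = task_waves(num_partitions, num_executors, cores_per_executor)
    return num_partitions / (waves * total_slots)


if __name__ == "__main__":
    # Hypothetical cluster: 4 executors with 8 cores each (32 slots).
    for partitions in (32, 33, 100):
        print(partitions, task_waves(partitions, 4, 8), slot_utilization(partitions, 4, 8))
```

Under these assumptions, a partition count slightly above a multiple of the available slots adds a nearly empty extra wave, which is one simple reason the choice of partition count can affect run time.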


Cite this paper

[KIPS Transactions Style]
R. Myung, H. Yu, and S. Choi, "Performance Optimization Strategies for Fully Utilizing Apache Spark," KIPS Transactions on Computer and Communication Systems, Vol.7, No.1, pp.9-18, 2018, DOI: 10.3745/KTCCS.2018.7.1.009.

[IEEE Style]
Rohyoung Myung, Heonchang Yu, and Sukyong Choi, "Performance Optimization Strategies for Fully Utilizing Apache Spark," KIPS Transactions on Computer and Communication Systems, vol. 7, no. 1, pp. 9-18, 2018. DOI: 10.3745/KTCCS.2018.7.1.009.

[ACM Style]
Myung, R., Yu, H., and Choi, S. 2018. Performance Optimization Strategies for Fully Utilizing Apache Spark. KIPS Transactions on Computer and Communication Systems, 7, 1, (2018), 9-18. DOI: 10.3745/KTCCS.2018.7.1.009.