[1]英昌甜,于炯,卞琛,等.并行计算框架Spark的自动检查点策略[J].东南大学学报(自然科学版),2017,47(2):231-235.[doi:10.3969/j.issn.1001-0505.2017.02.006]
 Ying Changtian,Yu Jiong,Bian Chen,et al.Automatic checkpoint strategy for parallel computing frame with Spark[J].Journal of Southeast University (Natural Science Edition),2017,47(2):231-235.[doi:10.3969/j.issn.1001-0505.2017.02.006]
点击复制

并行计算框架Spark的自动检查点策略()
分享到:

《东南大学学报(自然科学版)》[ISSN:1001-0505/CN:32-1178/N]

卷:
47
期数:
2017年第2期
页码:
231-235
栏目:
计算机科学与工程
出版日期:
2017-03-20

文章信息/Info

Title:
Automatic checkpoint strategy for parallel computing frame with Spark
作者:
英昌甜1于炯12卞琛1鲁亮1钱育蓉2
1新疆大学信息科学与工程学院, 乌鲁木齐 830046; 2新疆大学软件学院, 乌鲁木齐 830008
Author(s):
Ying Changtian1 Yu Jiong12 Bian Chen1 Lu Liang1 Qian Yurong2
1School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
2School of Software, Xinjiang University, Urumqi 830008, China
关键词:
自动检查点 RDD权重 Spark 恢复时间
Keywords:
automatic checkpoint resilient distribution dataset(RDD)weight Spark recovery time
分类号:
TP311
DOI:
10.3969/j.issn.1001-0505.2017.02.006
摘要:
针对现有的Spark检查点机制需要编程人员根据经验选择检查点,具有一定的风险和随机性,可能导致恢复开销较大的问题,通过对RDD属性的分析,提出了自动检查点策略,包括权重生成(WG)算法和检查点自动选择(CAS)算法.首先,WG算法分析作业的DAG结构,获取RDD的血统长度和操作复杂度等属性,计算RDD权重;然后,CAS算法选择权重大的RDD作为检查点进行异步备份,来实现数据的快速恢复.结果表明:在使用CAS算法时,不同数据集执行时间和检查点容量大小都有所增加,其中Wiki-Talk由于其计算量较大,增幅明显;使用CAS算法设置检查点后,在单点失效恢复的情况下,数据集的恢复时间较短.因此,自动检查点策略在略微增加执行时间开销的基础上,能够有效地降低作业的恢复开销.
Abstract:
The existing Spark checkpoint mechanism required the programmer to choose the checkpoint according to the experience, thus it had a certain risk and randomness, resulting in large recovery overhead. To address this problem, the resilient distribution datasets(RDD)characteristics were analyzed, and the weight generated(WG)algorithm and checkpoint automatic selection(CAS)algorithm were put forward.First, in the WG algorithm, the directed acyclic graph(DAG)of the job was analyzed, and the lineage length and the operation complexity of RDD were obtained to compute the RDD weight. Secondly, in the CAS algorithm, the RDD with the maximum weight was selected for setting checkpoints asynchronously to fast recovery. The experimental results show that comparing with the original Spark, the execution time and the checkpoint size of different datasets are increased by the CAS algorithm, while the increasing extent of Wiki-Talk is more obvious. For the single node failure recovery, the datasets have smaller recovery overhead after setting checkpoint by using the CAS algorithm. Therefore, the strategy can efficiently decrease the recovery overhead of jobs with sacrificing the slight extra overhead.

参考文献/References:

[1] 孟小峰, 慈祥. 大数据管理:概念、技术与挑战[J]. 计算机研究与发展, 2013, 50(1):146-169.DOI:10.7544/issn1000-1239.2013.20121130.
Meng Xiaofeng,Ci Xiang. Big data management: Concepts,techniques and challenges [J]. Journal of Computer Research and Development, 2013, 50(1): 146-169.DOI:10.7544/issn1000-1239.2013.20121130. (in Chinese)
[2] 王元卓, 靳小龙, 程学旗. 网络大数据:现状与展望[J]. 计算机学报, 2013, 36(6):1125-1138.DOI: 10.3724/SP.J.1016.2013.01125.
Wang Yuanzhuo, Jin Xiaolong, Cheng Xueqi. Network big data: Present and future [J]. Chinese Journal of Computers, 2013, 36(6):1125-1138.DOI:10.3724/SP.J.1016.2013.01125. (in Chinese)
[3] Zaharia M, Chowdhury M, Franklin M J, et al. Spark: Cluster computing with working sets[C]//Usenix Conference on Hot Topics in Cloud Computing.Berkeley, CA, USA: USenix Association, 2010:1765-1773.
[4] Zaharia M, Chowdhury M, Das T, et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing[C]//Usenix Conference on Networked Systems Design and Implementation.Berkeley,CA, USA:USenix Association, 2012:141-146.
[5] 易会战, 王锋, 左克, 等. 基于内存缓存的异步检查点容错技术[J]. 计算机研究与发展, 2014, 51(6):1229-1239. DOI: 10.7544/issn1000-1239.2014.20121125.
Yi Huizhan, Wang Feng, Zuo Ke, et al. Asynchronous checkpoint/restart based on memory buffer[J]. Journal of Computer Research and Development, 2014, 51(6): 1229-1239. DOI:10.7544/issn1000-1239.2014.20121125. (in Chinese)
[6] 慈轶为,张展,左德承,等. 可扩展的多周期检查点设置[J]. 软件学报, 2010, 21(2): 218-230. DOI: 10.3724/SP.J.1001.2010.03787.
Ci Yiwei, Zhang Zhan, Zuo Decheng, et al. Scalable time-based multi-cycle checkpointing[J]. Journal of Software, 2010, 21(2): 218-230. DOI:10.3724/SP.J.1001.2010.03787. (in Chinese)
[7] 吴俊.基于双优先级队列的异构分布式控制系统容错调度算法[J].东南大学学报(自然科学版),2008,38(3):407-412. DOI:10.3321/j.issn:1001-0505.2008.03.009.
Wu Jun.Fault-tolerant scheduling algorithm for heterogeneous distributed control systems based on dual priorities queues[J].Journal of Southeast University(Natural Science Edition),2008,38(3):407-412. DOI:10.3321/j.issn:1001-0505.2008.03.009. (in Chinese)
[8] Neumeyer L, Robbins B, Nair A, et al. S4: Distributed stream computing platform[C]//IEEE International Conference on Data Mining Workshops. Piscataway, New Jersey, USA: IEEE, 2010: 170-177. DOI: 10.1109/ICDMW.2010.172.
[9] Ongaro D, Rumble S M, Stutsman R, et al. Fast crash recovery in RAMCloud[C]//ACM Symposium on Operating Systems Principles. New York, US:ACM, 2011:29-41. DOI: 10.1145/2043556.2043560.
[10] Li H Y,Ghodsi A, Zaharia M, et al.Tachyon: Reliable, memory speed storage for cluster computing frameworks [C]//IEEE Conference on SYSTEM-ON-CHIP. Piscataway, New Jersey, USA: IEEE, 2014: 1-15. DOI: 10.1145/2670979.2670985.

备注/Memo

备注/Memo:
收稿日期: 2016-11-12.
作者简介: 英昌甜(1989—), 女, 博士生;于炯(联系人), 男, 博士, 教授, 博士生导师, yujiong@xju.edu.cn.
基金项目: 国家自然科学基金资助项目(61462079,61262088,61562086,61363083,61562078)、新疆维吾尔自治区高校科研计划资助项目(XJEDU2016S106).
引用本文: 英昌甜,于炯,卞琛,等.并行计算框架Spark的自动检查点策略[J].东南大学学报(自然科学版),2017,47(2):231-235. DOI:10.3969/j.issn.1001-0505.2017.02.006.
更新日期/Last Update: 2017-03-20