MapReduce Highly Random Fuzzy Forest Algorithm for Noisy Large Data

Mei WANG; Fen LUO; Bao-hua ZHANG

doi:10.13718/j.cnki.xsxb.2019.11.017

2019 Volume 44 Issue 11

Mei WANG, Fen LUO, Bao-hua ZHANG. MapReduce Highly Random Fuzzy Forest Algorithm for Noisy Large Data[J]. Journal of Southwest China Normal University (Natural Science Edition), 2019, 44(11): 110-117. doi: 10.13718/j.cnki.xsxb.2019.11.017

1. Experimental Training and Teaching Department, Changzhou Vocational Institute of Engineering, Changzhou Jiangsu 213164, China
2. School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo Henan 454000, China
3. School of Intelligent Equipment and Information, Changzhou Vocational Institute of Engineering, Changzhou Jiangsu 213164, China


Received Date: 25/07/2018
Available Online: 20/11/2019
MSC: TP311

Abstract

To address the growing problem of classifying noisy big data, a highly random fuzzy forest algorithm is proposed. It generates fuzzy partitions of continuous attributes during decision tree learning, and a distributed implementation in the MapReduce framework is given. The algorithm learns an ensemble of fuzzy decision trees from large data sets contaminated by attribute noise. The distributed implementation adopts an effective strategy for allocating the computation, which yields good scalability and enables the fuzzy random forest to learn from and classify big data sets. The highly random fuzzy forest algorithm achieves high-precision classification of noisy big data, laying a foundation for future big data analysis. Experimental results show that the proposed method has a higher classification accuracy than existing algorithms. In particular, under attribute noise its accuracy is higher than that of the random forest algorithm, which demonstrates the feasibility and effectiveness of the proposed algorithm.
Keywords:
- random forest
- fuzzy decision tree
- highly random fuzzy forest
- noisy big data

References

[1] FERNÁNDEZ A, DEL RÍO S, LÓPEZ V, et al. Big Data with Cloud Computing: An Insight on the Computing Environment, MapReduce, and Programming Frameworks[J]. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2014, 4(5): 380-409. doi: 10.1002/widm.1134
[2] 梁吉业, 冯晨娇, 宋鹏. A Survey on Correlation Analysis of Big Data (in Chinese)[J]. Chinese Journal of Computers (计算机学报), 2016, 39(1): 1-18.
[3] WU X, ZHU X, WU G Q, et al. Data Mining with Big Data[J]. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(1): 97-107. doi: 10.1109/TKDE.2013.109
[4] GANDOMI A, HAIDER M. Beyond the Hype: Big Data Concepts, Methods, and Analytics[J]. International Journal of Information Management, 2015, 35(2): 137-144. doi: 10.1016/j.ijinfomgt.2014.10.007
[5] RUTKOWSKI L, JAWORSKI M, PIETRUCZUK L, et al. Decision Trees for Mining Data Streams Based on the Gaussian Approximation[J]. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(1): 108-119. doi: 10.1109/TKDE.2013.34
[6] SEGATORI A, MARCELLONI F, PEDRYCZ W. On Distributed Fuzzy Decision Trees for Big Data[J]. IEEE Transactions on Fuzzy Systems, 2018, 26(1): 174-192. doi: 10.1109/TFUZZ.2016.2646746
[7] GENUER R, POGGI J M, TULEAU-MALOT C. VSURF: An R Package for Variable Selection Using Random Forests[J]. The R Journal, 2015, 7(2): 19-33. doi: 10.32614/RJ-2015-018
[8] SCORNET E, BIAU G, VERT J P. Consistency of Random Forests[J]. The Annals of Statistics, 2015, 43(4): 1716-1741. doi: 10.1214/15-AOS1321
[9] RISTIN M, GUILLAUMIN M, GALL J, et al. Incremental Learning of Random Forests for Large-Scale Image Classification[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 38(3): 490-503. doi: 10.1109/TPAMI.2015.2459678
[10] GADOMER L, SOSNOWSKI Z A. Fuzzy Random Forest with C-Fuzzy Decision Trees[C]//IFIP International Conference on Computer Information Systems and Industrial Management. Cham: Springer International Publishing, 2016.
[11] HAIXIANG G, YIJING L, YANAN L, et al. BPSO-Adaboost-KNN Ensemble Learning Algorithm for Multi-Class Imbalanced Data Classification[J]. Engineering Applications of Artificial Intelligence, 2016, 49: 176-193. doi: 10.1016/j.engappai.2015.09.011
[12] 叶学义, 宋倩倩, 高真, et al. Underwater Acoustic Data Classification Algorithm Based on Histogram Conditional Entropy (in Chinese)[J]. Computer Engineering (计算机工程), 2016, 42(11): 244-248, 254. doi: 10.3969/j.issn.1000-3428.2016.11.040
[13] 唐校辉, 廖欣, 陈雷霆, et al. Research on a Health Big Data Classification Model Based on an Improved Tri-Training Algorithm (in Chinese)[J]. Modern Computer (Professional Edition) (现代计算机(专业版)), 2017(20): 21-25.
[14] TRIGUERO I, PERALTA D, BACARDIT J, et al. MRPR: A MapReduce Solution for Prototype Reduction in Big Data Classification[J]. Neurocomputing, 2015, 150: 331-345. doi: 10.1016/j.neucom.2014.04.078

Figures(4) / Tables(8)

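With the development of the Internet, data of all kinds is growing exponentially. The volume of data on the Web is measured in exabytes (10¹⁸) and zettabytes (10²¹), and it is predicted that by 2025 the volume of data on the Internet will exceed the brain capacity of the world's entire population. This rapid growth is driven by advances in digital sensors, communication, computing, and storage, which have produced enormous data sets^[1-2]. Big data is characterized by the 4Vs: volume, velocity, variety, and veracity. Extracting useful knowledge and information is becoming increasingly difficult, so classification analysis of big data is increasingly important^[3-5].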

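Decision trees are among the most widely used classification methods in data mining and can reach good accuracy. A particular advantage of decision trees relates to induction, which usually requires operating on only a limited number of parameters^[5]. Fuzzy decision trees are introduced to handle uncertain data: each node in a fuzzy decision tree is characterized by a fuzzy set, so an instance can activate different branches and reach multiple leaves^[6]. However, decision trees are very sensitive to even slight noise, so they perform poorly on noisy big data.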
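As an illustration of how one instance can activate several branches of a fuzzy node, the following minimal sketch uses a strong triangular partition over a single numeric attribute. The function names, the peak positions, and the leaf class weights are hypothetical and are not taken from the paper; the sketch only shows the general mechanism of weighting leaf outputs by membership degree.

```python
def memberships(x, peaks):
    """Degrees of x in a strong triangular partition whose fuzzy sets peak at `peaks`."""
    degrees = [0.0] * len(peaks)
    if x <= peaks[0]:
        degrees[0] = 1.0
    elif x >= peaks[-1]:
        degrees[-1] = 1.0
    else:
        for i in range(len(peaks) - 1):
            lo, hi = peaks[i], peaks[i + 1]
            if lo <= x <= hi:
                t = (x - lo) / (hi - lo)
                degrees[i] = 1.0 - t      # degree in the left fuzzy set
                degrees[i + 1] = t        # degree in the right fuzzy set
                break
    return degrees


def classify(x, peaks, leaves):
    """Sum each leaf's class weights, scaled by how strongly x activates its branch."""
    scores = {}
    for degree, leaf in zip(memberships(x, peaks), leaves):
        if degree > 0.0:
            for label, weight in leaf.items():
                scores[label] = scores.get(label, 0.0) + degree * weight
    return scores


# Hypothetical node with three branches (low, medium, high) and one leaf per branch.
peaks = [0.0, 0.5, 1.0]
leaves = [{"A": 0.9, "B": 0.1}, {"A": 0.3, "B": 0.7}, {"A": 0.0, "B": 1.0}]
print(memberships(0.4, peaks))   # two branches receive a non-zero degree
print(classify(0.4, peaks, leaves))
```

A crisp tree would send the value 0.4 down exactly one branch; here it contributes to both the low and the medium leaf, which is what lets small perturbations of the attribute value change the output gradually rather than abruptly.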

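Random forests are currently regarded as one of the most effective and popular classification tools^[7-8]. The basic random forest algorithm contains two random elements: bagging (each tree is learned from a different data set drawn with replacement from the whole data set) and random attribute selection. In each tree of the forest, a subset of the available attributes is selected and the relatively best split is determined^[9]. Fuzzy random forests have higher prediction accuracy and lower parameter variance, are more accurate than fuzzy rule-based systems, and tolerate noise better than crisp random forests^[10].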
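The two random elements can be sketched in a few lines of Python. This is a generic illustration of bagging and random attribute selection, not the authors' implementation; the helper names, the toy data set, and the subset size are assumptions chosen for the example.

```python
import random


def bootstrap_sample(dataset, rng):
    """Bagging step: draw len(dataset) instances with replacement."""
    return [rng.choice(dataset) for _ in dataset]


def random_attribute_subset(n_attributes, subset_size, rng):
    """Pick the candidate attributes among which the best split is searched."""
    return sorted(rng.sample(range(n_attributes), subset_size))


rng = random.Random(0)                       # fixed seed so the sketch is repeatable
dataset = [(i, i % 2) for i in range(10)]    # toy (value, label) pairs
sample = bootstrap_sample(dataset, rng)
candidates = random_attribute_subset(n_attributes=8, subset_size=3, rng=rng)
print(sample)
print(candidates)
```

Because sampling is with replacement, some instances typically appear more than once in `sample` and others not at all, so each tree sees a slightly different view of the data.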

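Existing general classification methods include convolutional neural networks, support vector machines, K-nearest neighbors, histogram conditional entropy, stacked autoencoders, and particle swarm optimization^[11-12]. These methods do not generalize well to big data, and handling such challenging data sets requires redesigning the algorithms for distributed computing. The MapReduce framework has proven to be a very good choice in this setting, but distributed MapReduce implementations of other classification algorithms do not cope well with attribute noise in big data. Reference [13] proposes a health big data classification model based on an improved Tri-Training algorithm. By changing how samples are chosen to extend the training set and discarding samples likely to increase classification error, it solves the problem that randomly selecting the extended training set for the base classifiers introduces noise. However, this method only avoids introducing noise during training and does not perform well on big data that already contains noise. Reference [14] proposes a new prototype reduction method for distributed nearest-neighbor classification, which represents the original training set with a reduced number of instances in order to speed up classification and lower the storage requirements and noise sensitivity of the nearest-neighbor rule. However, its classification accuracy on noisy big data is not high.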

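Existing data classification methods either cannot scale to big data or achieve low accuracy on noisy big data. To address these problems, this paper proposes a highly random fuzzy forest algorithm and gives its distributed implementation in MapReduce. Generating a fuzzy random forest requires a preliminary bagging step that applies sampling to the whole learning data set to obtain as many individual learning data sets as there are fuzzy trees in the forest. Fuzzy trees are built on C4.5 decision trees with partitions of the continuous attributes, and the fuzzy partitions are generated when nodes are created. The algorithm distributes the computational workload better and reduces computational cost without harming the accuracy of the overall ensemble. Experimental results show that the proposed algorithm is more effective than existing methods when the training data contain attribute noise.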
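The preliminary bagging step, which yields one learning set per tree, can be pictured as a map, shuffle, and reduce pipeline. The sketch below simulates that pipeline in plain Python on two toy data blocks. It is only an assumed layout: the per-block resampling scheme, the function names, and the placeholder reducer are hypothetical, and this section does not specify how the paper actually distributes the data.

```python
import random
from collections import defaultdict


def map_block(block_id, block, n_trees, seed):
    """Map phase: for each tree, resample this block with replacement
    and emit (tree_id, record) pairs."""
    rng = random.Random(seed * 1_000_003 + block_id)
    for tree_id in range(n_trees):
        for _ in block:
            yield tree_id, rng.choice(block)


def shuffle(pairs):
    """Group emitted records by tree_id, as a framework's shuffle step would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_tree(tree_id, records):
    """Reduce phase placeholder: a real reducer would learn one fuzzy tree here."""
    return {"tree_id": tree_id, "n_records": len(records)}


# Two toy data blocks of (attribute value, class label) records.
blocks = [[(0.1, "A"), (0.4, "B")], [(0.7, "B"), (0.9, "A"), (0.3, "A")]]
pairs = [p for b_id, b in enumerate(blocks)
         for p in map_block(b_id, b, n_trees=3, seed=42)]
forest = [reduce_tree(t, recs) for t, recs in sorted(shuffle(pairs).items())]
print(forest)
```

Keying the emitted records by tree identifier means each reducer receives exactly the learning set of one tree, so trees can be learned independently and in parallel.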

4. Conclusion

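This paper proposed the highly random fuzzy forest (HRFF) algorithm for big data classification and gave a distributed implementation of HRFF in the MapReduce framework. The basic idea is to randomize the fuzzy partitions when splitting numerical attributes, which distributes the computational workload better and brings the randomness of the fuzzy partitions to the input variables of the fuzzy decision trees. Experimental results show that, in the presence of attribute noise, HRFF classifies more accurately than the fuzzy random forest (FRF) algorithm. By changing the usual data distribution process to improve the parallelization of the algorithm, HRFF also adapts well to the number of parallel processors. Future work will study methods for improving classification accuracy under the influence of noise.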
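One way to picture randomizing the fuzzy partition of a numerical attribute is to draw the positions of the fuzzy sets at random inside the attribute's observed range instead of searching for optimal cut points. The sketch below does this for a partition described by its peak positions. The function name, the uniform sampling rule, and the number of fuzzy sets are assumptions made for illustration and may differ from the procedure used in HRFF.

```python
import random


def random_fuzzy_partition(values, n_sets, rng):
    """Return sorted peak positions of n_sets fuzzy sets over the attribute range.
    The outer peaks sit at the observed minimum and maximum; the inner peaks
    are drawn uniformly at random (an assumed rule, for illustration only)."""
    lo, hi = min(values), max(values)
    inner = sorted(rng.uniform(lo, hi) for _ in range(n_sets - 2))
    return [lo] + inner + [hi]


values = [2.0, 3.5, 5.1, 7.8, 9.0]           # toy values of one numerical attribute
partition_a = random_fuzzy_partition(values, n_sets=3, rng=random.Random(1))
partition_b = random_fuzzy_partition(values, n_sets=3, rng=random.Random(2))
print(partition_a)
print(partition_b)
```

Since no split-quality search is performed over the cut points, generating such a partition costs little, and different trees end up with different partitions of the same attribute, which adds diversity to the ensemble.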
