Parallel Big Data Classification Method Based on Adaptive Exponential Bat and Stacked Autoencoder

QIAN Zhenkun; ZHOU Siji

doi:10.13718/j.cnki.xsxb.2022.06.002

A parallel big data classification method based on an adaptive exponential bat (AEB) algorithm and stacked autoencoders (SAE) is proposed to address the low efficiency of deep learning when classifying big data. Within the parallel computing framework, the AEB algorithm, which is derived from the exponentially weighted moving average (EWMA) and an adaptive weight strategy, selects features in the Map stage. The selected features are then passed to the Reduce stage, where a deep stacked autoencoder trained by the AEB algorithm performs the classification, which further improves classification accuracy. Experimental results show that the proposed method outperforms the compared existing methods in accuracy and true positive rate (TPR) for different percentages of training data.


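In a society where information technology is developing rapidly, data is growing at an unprecedented rate^[1-2]. As a new strategic resource, big data is driving innovation and changing research in many fields as well as the way people live and think^[3-4]. Distributed computing is one strategy for handling big data. One commonly used big data framework is Hadoop, which implements MapReduce parallel computing on the Hadoop Distributed File System (HDFS)^[5-6]. Apache Spark is another commonly used parallel computing framework. Spark implements distributed computing based on the MapReduce algorithm; unlike Hadoop, it can keep a job's intermediate output and results in memory and therefore does not need to read from or write to HDFS for them, but its core is still MapReduce. Spark is friendlier and more adaptable for iterative algorithms such as those used in data mining and machine learning^[7-8]. MapReduce consists of two stages, Map and Reduce: the Map stage processes the input data splits and generates key-value pairs, and the Reduce stage aggregates, by key, the results obtained in the Map stage^[9].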
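
The two-stage flow described above can be illustrated with a minimal, single-process sketch. It is not Hadoop or Spark code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the toy label-counting job are assumptions chosen only to show how key-value pairs emitted by Map are grouped and then aggregated by key in Reduce.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split):
    """Map: emit one (key, value) pair per record in an input split."""
    return [(label, 1) for label in split]


def shuffle(pairs):
    """Group values by key, as a framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(splits):
    """Run Map on every split, then shuffle and Reduce (hypothetical driver)."""
    mapped = chain.from_iterable(map_phase(s) for s in splits)
    return reduce_phase(shuffle(mapped))


# Three input splits of a toy dataset (illustrative data only).
splits = [["a", "b", "a"], ["b", "c"], ["a"]]
counts = run_job(splits)
```

In a real cluster each call to `map_phase` would run on a different worker and the shuffle would move data across the network; the sketch keeps everything in one process so the data flow is easy to follow.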

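Big data classification has been applied across industries such as finance, healthcare, and manufacturing. To address noise in big data classification, reference [10] proposes two big data preprocessing methods that remove noisy samples, a homogeneous ensemble filter and a heterogeneous ensemble filter, which yield high-quality, clean data. Reference [11] proposes a network big data classification method using a K-nearest neighbor (KNN) classifier under the Spark framework: the Map stage performs the K-nearest-neighbor operation within each partition, and the Reduce stage determines the final K nearest neighbors and aggregates their label sets to produce the classification result; however, its classification accuracy is low. Reference [12] proposes a random forest classification method for Internet of Things big data and classifies e-health data using features selected by dragonfly optimization, but the method considers only the data of the target and current feature variables. Reference [13] designs a linear support vector machine method for big data classification that has clear advantages over the traditional support vector machine in training speed and classification accuracy, but its performance degrades on larger datasets. Reference [14] proposes a joint ant colony optimization and artificial neural network algorithm, which uses a deep artificial neural network tuned by ant colony optimization to improve classification accuracy. Reference [15] uses the bat algorithm to optimize an artificial neural network and improves classification accuracy.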
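
The Map/Reduce split described for the KNN approach (local K nearest neighbors per partition, then a global merge and label aggregation) can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the cited Spark implementation: the function names, the squared Euclidean distance, the majority vote, and the toy data are all hypothetical choices.

```python
import heapq
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_local_knn(partition, query, k):
    """Map: return the k nearest (distance, label) pairs within one partition."""
    candidates = [(squared_distance(point, query), label)
                  for point, label in partition]
    return heapq.nsmallest(k, candidates)


def reduce_global_knn(local_results, k):
    """Reduce: merge local candidates, keep the global k nearest, vote on labels."""
    merged = [pair for result in local_results for pair in result]
    nearest = heapq.nsmallest(k, merged)
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Two partitions of labeled points (illustrative data only).
partitions = [
    [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((5.0, 5.0), "B")],
    [((0.2, 0.1), "A"), ((4.8, 5.1), "B"), ((5.2, 4.9), "B")],
]
query = (0.0, 0.1)
k = 3

local = [map_local_knn(p, query, k) for p in partitions]
prediction = reduce_global_knn(local, k)
```

Because every one of the global K nearest neighbors must also be among the K nearest of its own partition, merging the local candidate lists loses no information, which is what makes this decomposition exact rather than approximate.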

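Building on a study of big data processing frameworks and existing big data classification methods, this paper proposes a parallel big data classification method based on the adaptive exponential bat (AEB) algorithm and the stacked autoencoder (SAE). The method designs the AEB algorithm around a MapReduce parallel implementation of big data classification: features are selected in the Map stage, and in the Reduce stage a deep stacked autoencoder trained by AEB performs the classification and produces the result. Experimental results show that the method achieves high-accuracy big data classification.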
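
The abstract states that AEB is obtained from the exponentially weighted moving average (EWMA) and an adaptive weight strategy, but this excerpt does not give the update equations. As background only, the sketch below computes a generic EWMA, s_t = alpha * x_t + (1 - alpha) * s_{t-1}. The function name `ewma`, the smoothing factor `alpha`, and the seed choice s_0 = x_0 are assumptions for illustration; this is not the authors' AEB update rule.

```python
def ewma(values, alpha):
    """Return the EWMA series s_t = alpha * x_t + (1 - alpha) * s_{t-1}.

    The first smoothed value is seeded with the first observation
    (an assumption; other seeding conventions exist).
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))
        else:
            smoothed.append(alpha * x + (1.0 - alpha) * smoothed[-1])
    return smoothed


# With alpha = 0.5, each new value and the running average are weighted equally.
series = ewma([1.0, 2.0, 3.0, 4.0], alpha=0.5)
```

A larger `alpha` makes the average follow recent values more closely, while a smaller `alpha` gives more weight to history.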

4. Conclusion

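To address the low performance of big data classification, this paper proposes a parallel big data classification method based on the adaptive exponential bat (AEB) algorithm and SAE. In the Map stage, the method uses the proposed AEB algorithm for feature selection; in the Reduce stage, it uses a deep stacked autoencoder trained by the AEB algorithm for classification, which further improves classification performance. The method was evaluated on different experimental datasets. The classification results for different percentages of training data show that it classifies big data with high accuracy and outperforms the compared existing methods in both accuracy and TPR, which demonstrates its effectiveness and superiority. Future work will extend the proposed method to handle security constraints.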

Figure (8) Reference (15)

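[1]	SUN Qian, CHEN Hao, LI Chao. Big Data Clustering Algorithm Based on Improved Artificial Bee Colony Algorithm and MapReduce [J]. Application Research of Computers, 2020, 37(6): 1707-1710, 1764. (in Chinese)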
[2]	VARATHARAJAN R, MANOGARAN G, PRIYAN M K. A Big Data Classification Approach Using LDA with an Enhanced SVM Method for ECG Signals in Cloud Computing [J]. Multimedia Tools and Applications, 2018, 77(8): 10195-10215. doi: 10.1007/s11042-017-5318-1
[3]	XIE A, YIN F, XU Y, et al. Distributed Gaussian Processes Hyperparameter Optimization for Big Data Using Proximal ADMM [J]. IEEE Signal Processing Letters, 2019, 26(8): 1197-1201. doi: 10.1109/LSP.2019.2925532
[4]	QURESHI N M F, SIDDIQUI I F, UNAR M A, et al. An Aggregate MapReduce Data Block Placement Strategy for Wireless IoT Edge Nodes in Smart Grid [J]. Wireless Personal Communications, 2019, 106(4): 2225-2236. doi: 10.1007/s11277-018-5936-6
[5]	THIND J S, SIMON R. Implementation of Big Data in Cloud Computing with Optimized Apache Hadoop [C]//2019 3rd International Conference on Electronics, Communication and Aerospace Technology (ICECA). Coimbatore: IEEE, 2019.
[6]	SAXENA A, CHAURASIA A, KAUSHIK N, et al. Handling Big Data Using MapReduce over Hybrid Cloud [C]. Ostrava: International Conference on Innovative Computing and Communications, 2019.
[7]	LUNGA D, GERRAND J, YANG L X, et al. Apache Spark Accelerated Deep Learning Inference for Large Scale Satellite Image Analytics [J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2020, 13: 271-283. doi: 10.1109/JSTARS.2019.2959707
[8]	SHI L Z, MENG X D, TSENG E, et al. SpaRC: Scalable Sequence Clustering Using Apache Spark [J]. Bioinformatics, 2018, 35(5): 760-768.
[9]	VENKATESH G, ARUNESH K. Map Reduce for Big Data Processing Based on Traffic Aware Partition and Aggregation [J]. Cluster Computing, 2019, 22(5): 12909-12915.
[10]	GARCÍA-GIL D, LUENGO J, GARCÍA S, et al. Enabling Smart Data: Noise Filtering in Big Data Classification [J]. Information Sciences, 2019, 479: 135-152. doi: 10.1016/j.ins.2018.12.002
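[11]	ZHANG Longxiang, CAO Yunpeng, WANG Haifeng. GPU Collaborative Computing Model for Complex Big Data Applications [J]. Application Research of Computers, 2020, 37(7): 2049-2053. (in Chinese)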
[12]	LAKSHMANAPRABU S K, SHANKAR K, ILAYARAJA M, et al. Random Forest for Big Data Classification in the Internet of Things Using Optimal Features [J]. International Journal of Machine Learning and Cybernetics, 2019, 10(10): 2609-2618. doi: 10.1007/s13042-018-00916-z
[13]	ZOU H S, JIN Z Y. Comparative Study of Big Data Classification Algorithm Based on SVM [C]//2018 Cross Strait Quad-Regional Radio Science and Wireless Technology Conference (CSQRWC). Xuzhou: IEEE, 2018.
[14]	JOSEPH MANOJ R, ANTO PRAVEENA M D, VIJAYAKUMAR K. An ACO–ANN Based Feature Selection Algorithm for Big Data [J]. Cluster Computing, 2019, 22(2): 3953-3960.
[15]	ISCAN H, KAMAL L L, KODAZ H. A New Hybrid Classifier Based on Bat Algorithm and Artificial Neural Networks [J]. Journal of Industrial Engineering Research, 2018, 4(4): 27-33.


