Parallel Big Data Classification Method Based on Adaptive Exponential Bat and Stacked Autoencoder

QIAN Zhenkun¹, ZHOU Siji²

doi:10.13718/j.cnki.xsxb.2022.06.002

1. Logistics Service, Sichuan University of Arts and Science, Dazhou Sichuan 635000, China
2. Informatization Construction and Service Center, Sichuan University of Arts and Science, Dazhou Sichuan 635000, China

Funding: Sichuan Provincial University Logistics Association project for 2022-2023 (20220602)

About the author: QIAN Zhenkun, master's degree, experimentalist; main research interests are computer applications and software engineering.

CLC number: TP393

Abstract: To address the low efficiency of deep learning in big data classification, this paper proposes a parallel big data classification method based on the adaptive exponential bat algorithm and the stacked autoencoder (SAE). Within the parallel computing framework, the Map stage uses the adaptive exponential bat algorithm for feature selection; this adaptive exponentially weighted moving average bat algorithm (AEB) is obtained by combining the exponentially weighted moving average (EWMA) with an adaptive weight strategy. The selected features are then passed to Reduce as input for big data classification. In the Reduce stage, a deep stacked autoencoder trained with the AEB algorithm performs the classification, which further improves classification accuracy. The experimental results show that, for different percentages of training data, the proposed method outperforms other existing methods in accuracy and true positive rate (TPR).

Keywords: big data; MapReduce; adaptive exponential bat algorithm; deep stacked autoencoder
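Figure 1 MapReduce framework

Figure 2 Big data classification with the adaptive deep stacked autoencoder

Figure 3 Network architecture of the deep stacked autoencoder

Figure 4 Accuracy comparison on the Cleveland dataset

Figure 5 TPR comparison on the Cleveland dataset

Figure 6 Accuracy comparison on the Pima India dataset

Figure 7 TPR comparison on the Pima India dataset

Figure 8 Accuracy comparison on the Higgs big dataset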

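[1]	SUN Qian, CHEN Hao, LI Chao. Big Data Clustering Algorithm Based on Improved Artificial Bee Colony Algorithm and MapReduce [J]. Application Research of Computers, 2020, 37(6): 1707-1710, 1764. https://www.cnki.com.cn/Article/CJFDTOTAL-JSYJ202006021.htm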
[2]	VARATHARAJAN R, MANOGARAN G, PRIYAN M K. A Big Data Classification Approach Using LDA with an Enhanced SVM Method for ECG Signals in Cloud Computing [J]. Multimedia Tools and Applications, 2018, 77(8): 10195-10215. doi: 10.1007/s11042-017-5318-1
[3]	XIE A, YIN F, XU Y, et al. Distributed Gaussian Processes Hyperparameter Optimization for Big Data Using Proximal ADMM [J]. IEEE Signal Processing Letters, 2019, 26(8): 1197-1201. doi: 10.1109/LSP.2019.2925532
[4]	QURESHI N M F, SIDDIQUI I F, UNAR M A, et al. An Aggregate MapReduce Data Block Placement Strategy for Wireless IoT Edge Nodes in Smart Grid [J]. Wireless Personal Communications, 2019, 106(4): 2225-2236. doi: 10.1007/s11277-018-5936-6
[5]	THIND J S, SIMON R. Implementation of Big Data in Cloud Computing with Optimized Apache Hadoop [C]//2019 3rd International conference on Electronics, Communication and Aerospace Technology (ICECA). Coimbatore: IEEE, 2019.
[6]	SAXENA A, CHAURASIA A, KAUSHIK N, et al. Handling Big Data Using MapReduce over Hybrid Cloud [C]. Ostrava: International Conference on Innovative Computing and Communications, 2019.
[7]	LUNGA D, GERRAND J, YANG L X, et al. Apache Spark Accelerated Deep Learning Inference for Large Scale Satellite Image Analytics [J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2020, 13: 271-283. doi: 10.1109/JSTARS.2019.2959707
[8]	SHI L Z, MENG X D, TSENG E, et al. SpaRC: Scalable Sequence Clustering Using Apache Spark [J]. Bioinformatics, 2018, 35(5): 760-768.
[9]	VENKATESH G, ARUNESH K. Map Reduce for Big Data Processing Based on Traffic Aware Partition and Aggregation [J]. Cluster Computing, 2019, 22(5): 12909-12915.
[10]	GARCÍA-GIL D, LUENGO J, GARCÍA S, et al. Enabling Smart Data: Noise Filtering in Big Data Classification [J]. Information Sciences, 2019, 479: 135-152. doi: 10.1016/j.ins.2018.12.002
[11]	ZHANG Longxiang, CAO Yunpeng, WANG Haifeng. GPU Collaborative Computing Model for Complex Big Data Applications [J]. Application Research of Computers, 2020, 37(7): 2049-2053. https://www.cnki.com.cn/Article/CJFDTOTAL-JSYJ202007026.htm
[12]	LAKSHMANAPRABU S K, SHANKAR K, ILAYARAJA M, et al. Random Forest for Big Data Classification in the Internet of Things Using Optimal Features [J]. International Journal of Machine Learning and Cybernetics, 2019, 10(10): 2609-2618. doi: 10.1007/s13042-018-00916-z
[13]	ZOU H S, JIN Z Y. Comparative Study of Big Data Classification Algorithm Based on SVM [C]//2018 Cross Strait Quad-Regional Radio Science and Wireless Technology Conference (CSQRWC). Xuzhou: IEEE, 2018.
[14]	JOSEPH MANOJ R, ANTO PRAVEENA M D, VIJAYAKUMAR K. An ACO–ANN Based Feature Selection Algorithm for Big Data [J]. Cluster Computing, 2019, 22(2): 3953-3960.
[15]	ISCAN H, KAMAL L L, KODAZ H. A New Hybrid Classifier based on Bat Algorithm and Artificial Neural Networks [J]. Journal of Industrial Engineering Research, 2018, 4(4): 27-33.


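In a society where information technology is developing rapidly, data are growing at an unprecedented rate^[1-2]. As a new strategic resource, big data is driving innovation and changing research in many fields as well as the way people live and think^[3-4]. Distributed computing is one strategy for handling big data. One of the commonly used big data frameworks is Hadoop, which implements MapReduce parallel computing on top of the Hadoop Distributed File System (HDFS)^[5-6]. Apache Spark is another commonly used parallel computing framework. Spark implements distributed computing based on the MapReduce algorithm; it differs from Hadoop in that the intermediate output and results of a job can be kept in memory, so they need not be written to and read back from HDFS, but the core of Spark is still MapReduce. Spark is friendlier and more adaptable to iterative algorithms such as those used in data mining and machine learning^[7-8]. MapReduce consists of two stages, Map and Reduce: the Map stage processes the splits of the input data and generates key-value pairs, and the Reduce stage aggregates, by key, the results obtained in the Map stage^[9].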
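The two-stage flow described above can be illustrated with a minimal single-process sketch. It is not Hadoop or Spark code: the function names (`map_stage`, `shuffle`, `reduce_stage`, `run_job`) and the toy records are assumptions made for illustration, and the job simply counts records per class label.

```python
from collections import defaultdict

def map_stage(split):
    """Emit one (label, 1) key-value pair for every record in an input split."""
    return [(label, 1) for _features, label in split]

def shuffle(pairs):
    """Group emitted values by key, as a framework does between the two stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_stage(grouped):
    """Aggregate the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

def run_job(splits):
    pairs = []
    for split in splits:  # on a cluster, each split would be mapped in parallel
        pairs.extend(map_stage(split))
    return reduce_stage(shuffle(pairs))

# Two hypothetical input splits of (features, label) records.
splits = [
    [([0.1, 0.2], "pos"), ([0.4, 0.1], "neg")],
    [([0.3, 0.9], "pos"), ([0.7, 0.5], "pos")],
]
class_counts = run_job(splits)
```

The explicit `shuffle` step is shown only to make the key-based grouping visible; in a real framework it is performed by the runtime rather than by user code.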

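Big data classification has been applied in many industries, such as finance, healthcare, and manufacturing. Reference [10] addresses noise in big data classification and proposes two preprocessing methods that remove noisy samples, a homogeneous ensemble filter and a heterogeneous ensemble filter, which yield high-quality, clean data. Reference [11] proposes a method for classifying network big data with a K-nearest neighbor (KNN) classifier under the Spark framework: the Map stage performs the K-nearest-neighbor search within each partition, and the Reduce stage determines the final K nearest neighbors and aggregates their label sets to obtain the classification result; however, the classification accuracy of this method is low. Reference [12] proposes a random forest classification method for Internet of Things big data and selects features with dragonfly optimization to classify e-health data, but the method considers only the data of the target and the current feature variables. Reference [13] designs a linear support vector machine method for big data classification; compared with the traditional support vector machine it has clear advantages in training speed and classification accuracy, but its performance degrades on larger datasets. Reference [14] proposes a combined ant colony optimization and artificial neural network algorithm, which uses a deep artificial neural network with ant colony optimization and improves classification accuracy. Reference [15] uses the bat algorithm to optimize an artificial neural network and improves classification accuracy.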

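Building on a study of big data processing frameworks and existing big data classification methods, this paper proposes a parallel big data classification method based on the adaptive exponential bat algorithm and the stacked autoencoder (Stacked AutoEncoder, SAE). The method is implemented in parallel with MapReduce, and an adaptive exponential bat (AEB) algorithm is designed for it. Feature selection is performed in the Map stage, and in the Reduce stage a deep stacked autoencoder trained with AEB performs the classification and produces the result. The experimental results show that the method classifies big data with high accuracy.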
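The exact AEB update rule is not given in this part of the text, so the following is only a rough sketch of the two ingredients it is said to combine. `ewma` is the standard exponentially weighted moving average, and `bat_step` is a common form of the bat-algorithm frequency, velocity, and position update, written here so that a bat is pulled toward the best position. The function `smoothed_bat_step`, which blends the two, and all parameter values (`alpha`, `f_min`, `f_max`) are hypothetical and should not be read as the authors' formulation.

```python
import random

def ewma(previous, current, alpha):
    """Standard EWMA: alpha weights the newest value, (1 - alpha) the history."""
    return alpha * current + (1.0 - alpha) * previous

def bat_step(position, velocity, best, f_min=0.0, f_max=2.0, rng=random):
    """One common bat-algorithm move: draw a frequency, update velocity, then position."""
    beta = rng.random()                       # uniform draw in [0, 1)
    frequency = f_min + (f_max - f_min) * beta
    new_velocity = [v + (b - x) * frequency
                    for v, x, b in zip(velocity, position, best)]
    new_position = [x + v for x, v in zip(position, new_velocity)]
    return new_position, new_velocity

def smoothed_bat_step(position, velocity, best, alpha=0.5, rng=random):
    """Hypothetical combination: blend the raw bat move with the old position via EWMA."""
    raw_position, new_velocity = bat_step(position, velocity, best, rng=rng)
    blended = [ewma(old, new, alpha) for old, new in zip(position, raw_position)]
    return blended, new_velocity
```

For feature selection, a continuous position of this kind would still have to be mapped to a subset of features (for example by thresholding each coordinate); that mapping is likewise an assumption and is left out of the sketch.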

4. Conclusion

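To address the low performance of big data classification, this paper proposes a parallel big data classification method based on the adaptive exponential bat algorithm and SAE. In the Map stage the method uses the designed adaptive exponential bat (AEB) algorithm for feature selection; in the Reduce stage it classifies with a deep stacked autoencoder trained by the AEB algorithm, which further improves classification performance. The method was evaluated on several experimental datasets. The classification results for different percentages of training data show that it classifies big data with high accuracy and outperforms the other existing methods in both accuracy and TPR, which demonstrates its effectiveness and superiority. Future work will extend the proposed method to handle security constraints.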
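For reference, the two reported metrics can be computed from a confusion matrix as accuracy = (TP + TN) / (TP + TN + FP + FN) and TPR = TP / (TP + FN). The sketch below uses these standard definitions for binary labels; the function names and the sample labels are illustrative assumptions and are not data from the experiments.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def accuracy(y_true, y_pred, positive=1):
    """Fraction of all samples that were classified correctly."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0

def true_positive_rate(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    tp, _fp, _tn, fn = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Made-up labels, used only to exercise the functions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
acc = accuracy(y_true, y_pred)
tpr = true_positive_rate(y_true, y_pred)
```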
