Fuzzy Classification of Unbalanced Big Data Based on Boundary Condition GAN

YANG Lin; XU Hui-ying; MA Wen-long

doi:10.13718/j.cnki.xsxb.2021.07.013

2021 Volume 46 Issue 7


YANG Lin, XU Hui-ying, MA Wen-long. Fuzzy Classification of Unbalanced Big Data Based on Boundary Condition GAN[J]. Journal of Southwest China Normal University(Natural Science Edition), 2021, 46(7): 97-102. doi: 10.13718/j.cnki.xsxb.2021.07.013


1. School of Information Engineering, Quzhou College of Technology, Quzhou Zhejiang 324000, China
2. College of Mathematics and Computer Science, Zhejiang Normal University, Jinhua Zhejiang 321004, China


Received Date: 04/07/2020
Available Online: 20/07/2021
MSC: TP393

Abstract

To address class imbalance in big data classification, a fuzzy classification algorithm for unbalanced big data based on a boundary condition generative adversarial network (BCGAN) is proposed. The BCGAN oversampling method introduces a boundary minority class so that oversampling is performed near the decision boundary between majority-class and minority-class data, generating more appropriate minority-class data to improve classification performance. The balanced data are then transformed into a probability index table, in which data and attributes are presented as rows and columns, respectively. The membership degree of each unique symbol in every data attribute is calculated, and the data category is obtained with the correlative fuzzy naive Bayes (CFNB) classifier. A parallel implementation of big data fuzzy classification in the MapReduce framework is also given. Experimental results show that the accuracy of the proposed method is better than that of other existing methods, indicating its feasibility and effectiveness.
- big data,
- imbalance,
- boundary condition generative adversarial network,
- correlative fuzzy naive Bayes

References

[1]	GHANI N A, HAMID S, TARGIO HASHEM I A, et al. Social Media Big Data Analytics: a Survey[J]. Computers in Human Behavior, 2019, 101: 417-428. doi: 10.1016/j.chb.2018.08.039
[2]	JIANG Li-li, LI Ye-fei, DOU Long-long, et al. Probabilistic Algorithm for Graph Pattern Mining for Big Data[J]. Application Research of Computers, 2020, 37(12): 3545-3551. (in Chinese)
[3]	GARCÍA-GIL D, LUENGO J, GARCÍA S, et al. Enabling Smart Data: Noise Filtering in Big Data Classification[J]. Information Sciences, 2019, 479: 135-152. doi: 10.1016/j.ins.2018.12.002
[4]	WANG Y C, KUNG L, BYRD T A. Big Data Analytics: Understanding Its Capabilities and Potential Benefits for Healthcare Organizations[J]. Technological Forecasting and Social Change, 2018, 126: 3-13. doi: 10.1016/j.techfore.2015.12.019
[5]	CHENG Y, CHEN K, SUN H M, et al. Data and Knowledge Mining with Big Data towards Smart Production[J]. Journal of Industrial Information Integration, 2018, 9: 1-13. doi: 10.1016/j.jii.2017.08.001
[6]	LUECHTEFELD T, MARSH D, ROWLANDS C, et al. Machine Learning of Toxicological Big Data Enables Read-across Structure Activity Relationships (RASAR) Outperforming Animal Test Reproducibility[J]. Toxicological Sciences, 2018, 165(1): 198-212. doi: 10.1093/toxsci/kfy152
[7]	VARATHARAJAN R, MANOGARAN G, PRIYAN M K. A Big Data Classification Approach Using LDA with an Enhanced SVM Method for ECG Signals in Cloud Computing[J]. Multimedia Tools and Applications, 2018, 77(8): 10195-10215. doi: 10.1007/s11042-017-5318-1
[8]	LAKSHMANAPRABU S K, SHANKAR K, ILAYARAJA M, et al. Random Forest for Big Data Classification in the Internet of Things Using Optimal Features[J]. International Journal of Machine Learning and Cybernetics, 2019, 10(10): 2609-2618. doi: 10.1007/s13042-018-00916-z
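[9]	ZHANG Long-xiang, CAO Yun-peng, WANG Hai-feng. GPU Collaborative Computing Model for Complex Big Data Applications[J]. Application Research of Computers, 2020, 37(7): 2049-2053. (in Chinese)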
[10]	CARVALHO A M D, PRATI R C. Improving kNN Classification under Unbalanced Data. A New Geometric Oversampling Approach[C]//2018 International Joint Conference on Neural Networks (IJCNN). Rio de Janeiro: IEEE, 2018.
[11]	HASANIN T, KHOSHGOFTAAR T M, LEEVY J, et al. Investigating Random Undersampling and Feature Selection on Bioinformatics Big Data[C]//2019 IEEE Fifth International Conference on Big Data Computing Service and Applications (BigDataService). Newark: IEEE, 2019.
[12]	POLAT K. A Hybrid Approach to Parkinson Disease Classification Using Speech Signal: The Combination of SMOTE and Random Forests[C]//2019 Scientific Meeting on Electrical-Electronics & Biomedical Engineering and Computer Science (EBBT). Istanbul: IEEE, 2019.
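[13]	ZHENG Jian-hua, LIU Shuang-yin, HE Chao-bo, et al. Improved Random Forest Imbalanced Data Classification Algorithm Based on Hybrid Sampling Strategy[J]. Journal of Chongqing University of Technology (Natural Science), 2019, 33(7): 113-123. (in Chinese)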
[14]	HASSIB E M, EL-DESOUKY A I, LABIB L M, et al. WOA+BRNN: an Imbalanced Big Data Classification Framework Using Whale Optimization and Deep Neural Network[J]. Soft Computing, 2020, 24(8): 5573-5592. doi: 10.1007/s00500-019-03901-y
[15]	UTOMO O K, SURANTHA N, ISA S M, et al. Automatic Sleep Stage Classification Using Weighted ELM and PSO on Imbalanced Data from Single Lead ECG[J]. Procedia Computer Science, 2019, 157: 321-328. doi: 10.1016/j.procs.2019.08.173


Figures(3) / Tables(1)


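In the era of big data, data have become a new strategic resource and an important driver of innovation, and they are changing how research is conducted in every field as well as how people live and think^[1]. Many countries have successively released big data technology programs, which have strongly promoted big data research and applications^[2-3]. Current research is devoted to identifying and analyzing the huge volumes of data in each domain; application areas of big data include medical services, banking, and marketing^[4]. Cheng et al.^[5] studied the application of big data mining techniques in smart production.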

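Big data classification is the process of identifying the class to which input big data belong. One of the most popular approaches is to construct a classification model by training a machine learning algorithm on a given dataset^[6]. Varatharajan et al.^[7] proposed a support vector machine (SVM) method for classifying electrocardiogram (ECG) big data, combining the SVM model with a weighted kernel function to classify more features from the input ECG signals. Lakshmanaprabu et al.^[8] used a random forest classifier to develop big data analytics for an Internet of Things (IoT) based healthcare system, applying an improved dragonfly algorithm (IDA) to select the best attributes from the database and obtain better classification. Zhang et al.^[9] proposed a multivariate decision tree for big data classification over distributed data streams, designing a multivariate decision tree based on geometric contour similarity. However, none of the above methods considers how to handle imbalanced data. If the dataset is imbalanced, machine learning classifiers cannot learn the minority class properly and are biased toward the majority class, which accounts for a large share of the dataset; this may lead to biased classification results and wrong decisions^[10].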

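To mitigate the class imbalance problem, data sampling techniques are commonly used to rebalance the data by adjusting the number of samples in one of the classes. Depending on which class is adjusted, they are divided into undersampling and oversampling techniques^[11]. Undersampling deletes data from the majority class until its size equals that of the minority class; because data are deleted to reach balance, undersampling suffers from information loss. Oversampling generates data for the minority class until it balances the majority class. Common oversampling methods include the synthetic minority over-sampling technique (SMOTE), adaptive synthetic sampling (ADASYN), and borderline-SMOTE^[12]. However, oversampling can cause the classification model to overfit the training data. In addition, SMOTE-based synthesis sometimes produces samples that resemble the majority class rather than the minority class.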
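The two sampling families can be illustrated with a minimal, standard-library sketch. It is not the paper's method: `random_undersample` and `smote_like_oversample` are hypothetical helper names, samples are assumed to be tuples of floats, the majority class is assumed to be the larger one, and the interpolation partner is drawn from the whole minority class rather than from the k nearest neighbours that real SMOTE uses.

```python
import random


def random_undersample(majority, minority, seed=0):
    """Drop majority samples at random until both classes have equal size.

    The discarded samples are the source of the information loss.
    """
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)


def smote_like_oversample(majority, minority, seed=0):
    """Add synthetic minority points until both classes have equal size.

    Each synthetic point lies on the segment between two minority samples.
    Simplification (assumption): the partner is any other minority sample,
    whereas SMOTE restricts it to the k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    while len(minority) + len(synthetic) < len(majority):
        a, b = rng.sample(minority, 2)   # two distinct minority samples
        t = rng.random()                 # interpolation weight in [0, 1)
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return list(majority), list(minority) + synthetic
```

Because every synthetic point is a convex combination of existing minority samples, a segment that crosses a region occupied by the majority class yields exactly the kind of misplaced sample mentioned above.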

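Building on a study of existing big data classification and sampling methods, this paper proposes an imbalanced big data classification method based on a boundary condition generative adversarial network (GAN). The method uses the class information of a conditional GAN to generate minority-class feature data, and introduces a boundary minority class so that oversampling takes place near the decision boundary, generating suitable minority-class data to improve classification performance. Based on a correlation factor and fuzzy theory, a correlative fuzzy naive Bayes classification method is designed to classify the balanced big data, and a parallel implementation of big data classification under the MapReduce framework is given.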

1. Boundary Condition GAN

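BCGAN uses the class information of a conditional GAN at the decision boundary to generate suitable minority-class data and thereby improve classification performance. A GAN consists of a generator and a discriminator. The structure and basic learning procedure of a conditional GAN are similar to those of a GAN; the difference is that the generator and discriminator of a conditional GAN take a given condition into account. To improve classification accuracy, this paper searches for minority-class data located near the decision boundary and separates them from the other minority-class data. The boundary minority-class data are found with the borderline sample selection method of borderline-SMOTE, as follows. For each data instance in the minority class, the k-nearest neighbor (k-NN) algorithm is used to compute its k nearest data instances, which form a subset. For each subset, if the number of majority-class samples is greater than or equal to the size of the subset, the minority-class data in that subset are regarded as boundary minority-class data. Otherwise, the data lie far from the decision boundary and are kept in the minority class. In the end, the boundary minority class contains the data of the original minority class that lie close to the decision boundary.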
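The selection step can be sketched as follows, using only the Python standard library. This is an illustration under stated assumptions, not the authors' implementation: `borderline_minority` is a hypothetical name, Euclidean distance is assumed, and the threshold is left as a parameter. The text states the threshold as the subset size, whereas the original borderline-SMOTE rule uses at least half of the k neighbours, which is the default here.

```python
import math


def borderline_minority(minority, majority, k=5, threshold=None):
    """Split the minority class into (borderline, regular) samples.

    A minority sample is borderline when at least `threshold` of its k
    nearest neighbours belong to the majority class.
    Assumption: threshold defaults to k / 2 (borderline-SMOTE's rule);
    pass threshold=k to follow the wording of the text literally.
    Note: borderline-SMOTE also treats "all k neighbours are majority"
    as noise; that extra rule is not applied here.
    """
    if threshold is None:
        threshold = k / 2
    # Label 0 = minority, 1 = majority, so summing labels counts majority points.
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    borderline, regular = [], []
    for i, p in enumerate(minority):
        others = labelled[:i] + labelled[i + 1:]          # exclude the point itself
        neighbours = sorted(others, key=lambda item: math.dist(p, item[0]))[:k]
        n_majority = sum(label for _, label in neighbours)
        (borderline if n_majority >= threshold else regular).append(p)
    return borderline, regular
```

The brute-force neighbour search costs O(n) distance computations per minority sample, which is acceptable for a sketch; at big data scale a spatial index or a distributed k-NN would be needed.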

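The goal of BCGAN is to generate minority-class data along the decision boundary between the majority and minority classes, so BCGAN must first be trained. ① The boundary minority class is computed for the given majority and minority classes. ② The class information and noise drawn at random from a Gaussian distribution are fed into the generator, which produces fake data from the given input. ③ The discriminator tries to distinguish real data from generated data using the class information. The generator and discriminator update their parameters according to the conditional GAN loss function, and the loss is minimized by repeating this process. The trained BCGAN generator can then produce data that reflect the characteristics of the boundary minority class.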
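The conditional GAN loss is referred to above but not written out. For reference, the standard conditional GAN objective from the literature is reproduced below. That BCGAN uses exactly this form is an assumption, and the symbols are not taken from the original: $x$ is a real sample, $z$ is Gaussian noise, $y$ is the class label (here including the boundary minority class), $G$ is the generator, and $D$ is the discriminator.

```latex
\min_{G}\,\max_{D}\; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\bigl[\log D(x \mid y)\bigr]
  + \mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z \mid y) \mid y)\bigr)\bigr]
```

Under this form, step ③ corresponds to the discriminator increasing $V$, and step ② to the generator decreasing the second term.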

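After BCGAN has been trained, the generator can use noise and the boundary minority-class data to produce minority-class data similar to the real boundary minority-class data, until the majority and minority classes contain the same number of samples. At that point the two classes have the same size, that is, the data are balanced. The generated data are combined with the existing training data and then used to train the classifier.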
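The balancing step amounts to simple bookkeeping, sketched below. The trained generator is replaced by `toy_generator`, a purely hypothetical stand-in that shifts Gaussian noise toward an assumed boundary-minority centre; `balance_with_generator`, the noise dimension, and the label encoding (0 for majority, 1 for minority) are likewise assumptions made for illustration.

```python
import random


def toy_generator(noise, centre=(1.0, 1.0), scale=0.05):
    """Hypothetical stand-in for a trained BCGAN generator.

    Maps a noise vector to a point close to an assumed boundary-minority centre.
    """
    return tuple(c + scale * z for c, z in zip(centre, noise))


def balance_with_generator(majority, minority, generate, noise_dim=2, seed=0):
    """Generate minority samples until both classes have the same size,
    then merge real and generated data into one labelled training set."""
    rng = random.Random(seed)
    n_needed = max(0, len(majority) - len(minority))
    synthetic = []
    for _ in range(n_needed):
        noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]  # Gaussian noise input
        synthetic.append(generate(noise))
    data = list(majority) + list(minority) + synthetic
    labels = [0] * len(majority) + [1] * (len(minority) + len(synthetic))
    return data, labels
```

The returned `data` and `labels` are what a downstream classifier would be trained on; with a real generator, only the `generate` argument changes.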

4. Conclusion

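This paper proposes a fuzzy classification algorithm for imbalanced big data based on BCGAN. The algorithm uses BCGAN to introduce a boundary minority class for oversampling near the decision boundary between majority-class and minority-class data, generating more suitable minority-class data to improve classification performance; this processing turns the imbalanced big data into balanced big data that are easier to classify. A CFNB classifier based on a correlation factor and fuzzy theory is then designed. The balanced data are converted into a probability index table, and introducing the correlation factor and the membership degree further improves big data classification performance. Finally, a parallel implementation under the MapReduce framework is given, which reduces classification time. Experimental results show that, compared with other existing methods on datasets with different imbalance ratios, the proposed algorithm achieves the best classification accuracy and the lowest classification time, indicating that the method is feasible and effective.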
