Multilingual Text Classification Based on a Statistical Dictionary and Feature Enhancement
Abstract: Existing approaches to multiple language text classification (MLTC) treat the task as several independent single-language classification problems. Building on a statistical bilingual dictionary, this paper proposes a feature-enhanced multilingual text classification method. When classifying texts, the method also takes the training texts of the other languages into account, so that training texts are available across the text collections of all languages, which relaxes the requirements of MLTC. Feature enhancement is a cross-checking process: after the chi-square statistics of all features in the two languages have been obtained, the discriminative power of each feature in one language is re-evaluated using the discriminative power of its related features in the other language, which improves the reliability of classification. Chinese or English is chosen as the target language in the experiments. The experimental results show that the proposed method achieves higher classification accuracy and is less sensitive to the size of the training set.
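The cross-checking idea can be illustrated with a minimal sketch. The chi-square formula below is the standard term/class statistic for a 2x2 contingency table. The `reinforce` function, its `alpha` blending weight, and the dictionary format (term mapped to weighted translations) are hypothetical assumptions made for illustration only; the abstract does not give the paper's actual re-evaluation formula.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a term/class 2x2 contingency table.

    a: in-class documents containing the term
    b: out-of-class documents containing the term
    c: in-class documents lacking the term
    d: out-of-class documents lacking the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term carries no class signal
    return n * (a * d - c * b) ** 2 / denom


def reinforce(scores_src, scores_other, dictionary, alpha=0.5):
    """Hypothetical re-evaluation step (an assumption, not the paper's formula).

    Blends each source-language score with the dictionary-weighted average
    score of its related terms in the other language.
    dictionary: {source_term: {other_language_term: weight}}
    """
    reinforced = {}
    for term, score in scores_src.items():
        related = dictionary.get(term, {})
        total_weight = sum(related.values())
        if total_weight > 0:
            support = sum(
                w * scores_other.get(t, 0.0) for t, w in related.items()
            ) / total_weight
            reinforced[term] = (1 - alpha) * score + alpha * support
        else:
            # No dictionary entry: keep the monolingual score unchanged.
            reinforced[term] = score
    return reinforced
```

In this sketch, a term whose translations are also strongly class-discriminative in the other language has its score pulled up, while a term with weak cross-language support is pulled down; terms missing from the dictionary are left as they are.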