Research on News Text Summarizations Generation Based on MMR and WordNet

ZHANG Qi; FAN Yongsheng; JIN Duliang

doi:10.13718/j.cnki.xsxb.2023.05.011

2023 Volume 48 Issue 5

ZHANG Qi, FAN Yongsheng, JIN Duliang. Research on News Text Summarizations Generation Based on MMR and WordNet[J]. Journal of Southwest China Normal University(Natural Science Edition), 2023, 48(5): 77-86. doi: 10.13718/j.cnki.xsxb.2023.05.011

College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China

Received Date: 24/06/2022
Available Online: 20/05/2023
MSC: TP391.1

Abstract

Traditional extractive algorithms for news text summarization have several problems: the summary covers the text content incompletely, the summary content is redundant, and synonymous words with different surface forms are not considered during keyword extraction. To address these problems, an algorithm named WMMR, based on Maximal Marginal Relevance (MMR) and WordNet, is proposed for generating news text summaries. To optimize the sentence score in the MMR algorithm, WMMR jointly considers the influence of text similarity, keywords, sentence position, clue words, and other features on sentence weight, and it introduces WordNet to merge synonyms when calculating keyword scores. The effectiveness of the proposed algorithm is verified on the NLPCC2017 public dataset: the ROUGE value of WMMR is 4 percentage points higher than that of TextRank and 7 percentage points higher than that of MMR. The generality of the algorithm is verified on the Shence Cup 2018 and SogouCS public datasets, where WMMR also achieves higher ROUGE values than the traditional TextRank and MMR algorithms, indicating that WMMR effectively improves the quality of the generated summaries.
Keywords: news text summarization; extraction algorithm; maximal marginal relevance algorithm; WordNet; synonyms of different words

References

[1] HIMA BINDU SRI S, DUTTA S R. A Survey on Automatic Text Summarization Techniques[J]. Journal of Physics: Conference Series, 2021, 2040(1): 012044. doi: 10.1088/1742-6596/2040/1/012044
[2] 李金鹏, 张闯, 陈小军, 等. 自动文本摘要研究综述[J]. 计算机研究与发展, 2021, 58(1): 1-21.
[3] LUHN H P. The Automatic Creation of Literature Abstracts[J]. IBM Journal of Research and Development, 1958, 2(2): 159-165. doi: 10.1147/rd.22.0159
[4] 汪旭祥, 韩斌, 高瑞, 等. 基于改进TextRank的文本摘要自动提取[J]. 计算机应用与软件, 2021, 38(6): 155-160. doi: 10.3969/j.issn.1000-386x.2021.06.025
[5] 祝超群. 基于改进TextRank的中文文本摘要方法研究[D]. 武汉: 武汉邮电科学研究院, 2021.
[6] 程琨, 李传艺, 贾欣欣, 等. 基于改进的MMR算法的新闻文本抽取式摘要方法[J]. 应用科学学报, 2021, 39(3): 443-455. doi: 10.3969/j.issn.0255-8297.2021.03.010
[7] 余传明, 郭亚静, 朱星宇, 等. 基于最大边界相关度的抽取式文本摘要模型研究[J]. 情报科学, 2021, 39(2): 34-43.
[8] ELBAROUGY R, BEHERY G, EL KHATIB A. Extractive Arabic Text Summarization Using Modified PageRank Algorithm[J]. Egyptian Informatics Journal, 2020, 21(2): 73-81. doi: 10.1016/j.eij.2019.11.001
[9] ABDULATEEF S, KHAN N A, CHEN B L, et al. Multidocument Arabic Text Summarization Based on Clustering and Word2Vec to Reduce Redundancy[J]. Information, 2020, 11(2): 59. doi: 10.3390/info11020059
[10] MILLER G A, BECKWITH R, FELLBAUM C, et al. Introduction to WordNet: an On-Line Lexical Database[J]. International Journal of Lexicography, 1990, 3(4): 235-244. doi: 10.1093/ijl/3.4.235
[11] 刘晓影, 王淮, 乌吉斯古愣. 基于GAN和中文词汇网的文本摘要技术[J]. 计算机科学, 2022, 49(12): 301-304.
[12] BARUAH N, SARMA S K, BORKOTOKEY S. A Single Document Assamese Text Summarization Using a Combination of Statistical Features and Assamese WordNet[C]//Progress in Advanced Computing and Intelligent Engineering. Singapore: Springer, 2021: 125-136.
[13] XIE N T, LI S J, REN H L, et al. Abstractive Summarization Improved by WordNet-Based Extractive Sentences[C]//CCF International Conference on Natural Language Processing and Chinese Computing. Berlin: Springer, 2018: 404-415.
[14] MIHALCEA R, TARAU P. Textrank: Bringing order into texts[C]//Proceedings of the 2004 conference on empirical methods in natural language processing. Pennsylvania: Association for Computational Linguistics, 2004: 404-411.
[15] DOM B, EIRON I, COZZI A, et al. Graph-Based Ranking Algorithms for E-Mail Expertise Analysis[C]//Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. New York: ACM, 2003: 42-48.
[16] BRIN S, PAGE L. The Anatomy of a Large-Scale Hypertextual Web Search Engine[J]. Computer Networks and ISDN Systems, 1998, 30(1-7): 107-117. doi: 10.1016/S0169-7552(98)00110-X
[17] SALTON G, BUCKLEY C. Term-Weighting Approaches in Automatic Text Retrieval[J]. Information Processing & Management, 1988, 24(5): 513-523.
[18] HERNÁNDEZ-CASTAÑEDA Á, GARCÍA-HERNÁNDEZ R A, LEDENEVA Y, et al. Extractive Automatic Text Summarization Based on Lexical-Semantic Keywords[J]. IEEE Access, 2020, 8: 49896-49907. doi: 10.1109/ACCESS.2020.2980226
[19] CARBONELL J, GOLDSTEIN J. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries[C]//Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. New York: ACM, 1998: 335-336.
[20] 侯圣峦, 张书涵, 费超群. 文本摘要常用数据集和方法研究综述[J]. 中文信息学报, 2019, 33(5): 1-16.
[21] LIN C Y. Rouge: A Package for Automatic Evaluation of Summaries[C]//Proceedings of the Workshop on Text Summarization Branches Out. Pennsylvania: Association for Computational Linguistics, 2004: 74-81.
[22] 张琪, 范永胜. 基于改进T5 PEGASUS模型的新闻文本摘要生成研究[J/OL]. 电子科技: 1-7[2022-05-01]. DOI: 10.16180/j.cnki.issn1007-7820.2023.12.010.
[23] 曾昭霖, 严馨, 余兵兵, 等. 基于分层最大边缘相关的柬语多文档抽取式摘要方法[J]. 河北科技大学学报, 2020, 41(6): 508-517.
[24] 杭州网. 四川一失控奔驰半夜撞穿门卫室门卫大爷吓一跳[EB/OL]. (2015-07-21)[2022-02-01]. https://news.hangzhou.com.cn/shxw/content/2015-07/21/content_5854531.htm.
[25] 宁建飞, 刘降珍. 融合Word2Vec与TextRank的关键词抽取研究[J]. 现代图书情报技术, 2016(6): 20-27.
[26] 陶兴, 张向先, 郭顺利, 等. 学术问答社区用户生成内容的W₂V-MMR自动摘要方法研究[J]. 数据分析与知识发现, 2020, 4(4): 109-118.

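In recent years, with the rise of the mobile Internet, the volume of text data such as news articles and scientific papers has grown explosively^[1]. How to let users quickly and accurately obtain representative content from massive amounts of online information has become an urgent problem. Against this background, a variety of text summarization algorithms have emerged.

Text summarization is the process of obtaining the most important parts of the original text and presenting them to the user. Its purpose is to reduce the amount of text, extract the most relevant information to simplify the content, and save the user's time. In terms of how the summary is obtained, methods fall into two categories, extractive and abstractive^[2]: the former computes a weight for each sentence in the original text, ranks the sentences, and selects an appropriate number of top-ranked sentences to form the summary; the latter has a model generate new sentences from the gist of the article, so the summary may contain words or sentences that do not appear in the original text.

Reference [3] first proposed the concept of "automatic abstracting" and pioneered the field of text summarization. It holds that word frequency in the article and the relative position of words within a sentence are effective indicators of word importance, that the most important sentences are those containing important words, and that the summary is formed by concatenating the most important sentences. Since then, researchers in China and abroad have further studied the extractive text summarization task^[4-9].

WordNet is an English lexical database^[10] that describes the types of semantic relations between words and concepts on the basis of synonymy and antonymy. Reference [11] generates text summaries with a GAN and introduces WordNet to strengthen the discriminator. Reference [12] proposes a single-document Assamese text summarization method that combines statistical features with an Assamese WordNet. Reference [13] uses a WordNet-based Lesk algorithm to analyze word semantics, improves the sentence ranking algorithm, and performs joint training with a Seq2Seq dual-attention model.

Among traditional extractive algorithms, TextRank considers only text similarity and ignores the semantic features of the text, which leads to overly redundant summaries. MMR introduces a penalty mechanism to address redundancy, but the summaries it extracts cover the original text insufficiently, and it does not take into account the many other factors that affect summary content. On this basis, this paper proposes a news text summarization algorithm based on MMR and WordNet that addresses three problems: incomplete coverage of the text content, redundant summary content, and synonymous words with different surface forms in keyword extraction. The method improves how well the summary covers the text while reducing its redundancy, thereby improving the quality of the generated summaries.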

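1. Algorithm Overview

1.1. TextRank Algorithm

Reference [14] drew on Google's PageRank algorithm^[15] and proposed the TextRank algorithm. Its basic idea is to split the news text into words or sentences to build a graph model, iterate the weight of each node until convergence, and rank the importance of the words or sentences through a voting mechanism. The TextRank formula is given in Eq. (1):

$$W_S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{W_{ji}}{\sum_{V_k \in Out(V_j)} W_{jk}} W_S(V_j) \tag{1}$$

where $W_S(V_i)$ is the weight of node $V_i$, $W_{ji}$ is the similarity between nodes $V_i$ and $V_j$, $W_S(V_j)$ is the weight of the preceding node $V_j$, $In(V_i)$ is the set of nodes pointing to $V_i$, $Out(V_j)$ is the set of nodes that $V_j$ points to, and the summation gives the total weight of node $V_i$ in the news text^[16]. $d$ is the damping factor, used for smoothing; it represents the probability that a node points to any other node and is usually set to 0.85.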
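As a rough illustration of the iterative update in Eq. (1), the following Python sketch repeats the weight update on a small similarity matrix until the scores stop changing. The function name `textrank`, the matrix layout `weights[j][i]`, the initial score of 1.0, and the stopping tolerance are assumptions made for this example and are not taken from the paper.

```python
def textrank(weights, d=0.85, tol=1e-6, max_iter=200):
    """Iterate the TextRank weight update until convergence.

    weights[j][i] holds the similarity W_ji of the edge from node j to node i
    (hypothetical layout chosen for this sketch); 0 means no edge.
    """
    n = len(weights)
    # Total outgoing weight of each node j, ignoring self-loops.
    out_sum = [sum(w for k, w in enumerate(row) if k != j)
               for j, row in enumerate(weights)]
    scores = [1.0] * n  # assumed initial weight for every node
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if j != i and out_sum[j] > 0:
                    # Node j "votes" for node i in proportion to W_ji.
                    rank += weights[j][i] / out_sum[j] * scores[j]
            new_scores.append((1 - d) + d * rank)
        delta = max(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores
```

On a star-shaped graph such as `[[0, 1, 1], [1, 0, 0], [1, 0, 0]]`, the central node receives votes from both neighbours and ends up with the largest weight.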

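1.2. TF-IDF Algorithm

The TF-IDF algorithm consists of two parts: TF, the term frequency, and IDF, the inverse document frequency^[17]. It reflects the importance of a word both within a text and across the dataset^[18]. The TF-IDF formula is given in Eq. (2), and the detailed calculations are given in Eqs. (3) and (4):

$$\mathrm{TFIDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i \tag{2}$$

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \tag{3}$$

$$\mathrm{IDF}_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|} \tag{4}$$

where $n_{i,j}$ is the number of occurrences of word $i$ in text $j$, $\sum_k n_{k,j}$ is the total number of occurrences of all words in text $j$, $|D|$ is the number of texts in the dataset, and $|\{j : t_i \in d_j\}|$ is the number of texts that contain word $i$.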
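The calculation in Eqs. (2) to (4) can be sketched in a few lines of Python. This is a minimal example that assumes the texts are already tokenized, uses the natural logarithm, and applies no smoothing to the IDF term; the function name `tf_idf` and the toy data are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: TF-IDF score} dict per tokenized text in `docs`."""
    n_docs = len(docs)
    # Number of texts that contain each word (the denominator of the IDF term).
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)  # occurrences n_{i,j} of each word in this text
        total = len(doc)       # total occurrences of all words in this text
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Toy corpus of two tokenized texts (illustrative data only).
docs = [["crash", "car", "car"], ["car", "news"]]
scores = tf_idf(docs)
```

In this toy corpus, "car" appears in every text, so its IDF and therefore its TF-IDF score is zero, while "crash" appears in only one text and receives a positive score.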

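1.3. MMR Algorithm

The basic idea of maximal marginal relevance (MMR)^[19] is to keep the selected sentences similar to the news text while making the summary more comprehensive and diverse. The MMR formula is as follows:

$$\mathrm{MMR}(D, S) = \arg\max_{V_i \in D \setminus S} \left[ \lambda \, \mathrm{sim}_1(V_i, D) - (1 - \lambda) \max_{V_j \in S} \mathrm{sim}_2(V_i, V_j) \right]$$

where $D$ is the whole news text, $S$ is the candidate summary sentence set (the sentences selected so far), $V_i$ is the sentence currently being considered for extraction, $V_j$ is a summary sentence that has already been extracted, and $\lambda$ is a hyperparameter that controls the diversity of the MMR summary. The first similarity, $\mathrm{sim}_1(V_i, D)$, is the similarity between sentence $V_i$ and the whole news text, and the second, $\mathrm{sim}_2(V_i, V_j)$, is the similarity between sentences $V_i$ and $V_j$.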
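The greedy selection implied by the MMR criterion described above can be sketched as follows. The sketch assumes the two similarities have already been computed and are passed in as `doc_sim` (sentence to document) and `pair_sim` (sentence to sentence); these names, the function name `mmr_select`, and the default `lam=0.7` are illustrative assumptions, not values from the paper.

```python
def mmr_select(doc_sim, pair_sim, k, lam=0.7):
    """Greedily pick up to k sentence indices using the MMR criterion.

    doc_sim[i]     -- similarity of sentence i to the whole text (sim_1)
    pair_sim[i][j] -- similarity between sentences i and j (sim_2)
    lam            -- trade-off between relevance and diversity (lambda)
    """
    selected = []
    candidates = set(range(len(doc_sim)))

    def score(i):
        # Redundancy is the highest similarity to any already chosen sentence;
        # it is 0 while the summary is still empty.
        redundancy = max((pair_sim[i][j] for j in selected), default=0.0)
        return lam * doc_sim[i] - (1 - lam) * redundancy

    while candidates and len(selected) < k:
        # Sorting makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

A lower `lam` penalises near-duplicate sentences more strongly, while `lam=1.0` reduces the selection to a plain ranking by similarity to the document.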

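4. Conclusion

This paper proposes WMMR, a news text summarization algorithm based on MMR and WordNet. The algorithm jointly considers the influence of text similarity, keywords, sentence position, clue words, and other features on sentence weight in order to optimize the sentence score in the MMR algorithm, and it introduces WordNet to merge synonyms when calculating keyword scores.

The effectiveness of the algorithm is verified on three public datasets. The experimental results show that WMMR achieves the highest ROUGE values on all of them and clearly outperforms the other traditional algorithms overall, effectively improving the quality of the generated summaries. However, this paper optimizes only extractive text summarization; future work will explore optimizing abstractive text summarization.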
