On Design of Vertical Search Engine toward Agricultural Scientific Research Office

LI Yun (李昀); DENG Ying (邓颖); WU Hua-rui (吴华瑞)

doi:10.13718/j.cnki.xsxb.2020.09.008

1. Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China

2. National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China

3. Beijing Research Center for Information Technology in Agriculture, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China

4. Key Laboratory of Agri-informatics, Ministry of Agriculture, Beijing 100097, China

Funding: 2020 Construction Project of the Key Laboratory of Agri-informatics, Ministry of Agriculture and Rural Affairs (PT2020-03)

About the author: LI Yun (1969-), master's degree, senior engineer, works mainly on applied research in information management.

Corresponding author: WU Hua-rui, PhD, research fellow

CLC number: S126

Abstract: In agricultural research office work, researchers search for information frequently and need precise results. General-purpose search engines handle strongly domain-specific agricultural information poorly, such as practical techniques, policies and regulations, and thematic data: they typically return a huge volume of results on broadly scoped topics, so the content is imprecise, the search coverage is too wide, and the filtered results lack professional depth. Current mainstream vertical search engines for agriculture, in turn, rely mainly on traditional text retrieval; with limited in-domain data their recall is low, and their results have no meaningful ranking basis (most simply sort by publication time). This paper studies a search engine for agricultural Internet information. Relevant modules are located on data sources such as the websites of agricultural administrations at all levels, agricultural research institutes, agricultural news outlets, and agricultural businesses, and crawlers detect updates and fetch data on a schedule, which reduces irrelevant information at the source. Based on information extracted from the agriculture-related modules of several hundred Internet data sources, word2vec and the text-feature-based doc2vec proposed in this paper are used to build an agricultural word vector space and a document vector space, respectively. These handle search scenarios in which the query is an unordered set of words or an ordered sentence, so that the vertical search is intelligent and its results accurate. Experiments show that the proposed doc2vec+tf-idf search algorithm achieves relatively high accuracy in ordered search; combined with word2vec for unordered search, targeted semantic search can further improve the precision of the search engine and meet the growing demand for efficient, high-quality agricultural information search.

Keywords: agricultural information search engine; semantic similarity; word2vec; doc2vec; tf-idf; intelligent text search
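Figure 1. Framework of the vertical search engine for agricultural research Internet information

Figure 2. The DM model

Figure 3. The DBOW model

Figure 4. Similarity threshold versus precision for the three models with ordered search input

Figure 5. Similarity threshold versus precision for the three models with unordered search input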
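Figures 4 and 5 relate the similarity threshold to precision. As a rough illustration of how a single point on such a curve could be computed, the sketch below assumes that each returned result carries a similarity score and a manual relevance judgment. The function name `precision_at_threshold` and the sample scores are hypothetical and are not data from the paper.

```python
def precision_at_threshold(scored_results, threshold):
    """Precision = relevant returned / total returned, counting only results
    whose similarity score is at or above the threshold.
    Returns None when nothing passes the threshold (precision is undefined)."""
    returned = [relevant for score, relevant in scored_results if score >= threshold]
    if not returned:
        return None
    return sum(returned) / len(returned)


# Hypothetical (similarity, is_relevant) pairs for one query; not data from the paper.
results = [(0.92, True), (0.85, True), (0.71, False), (0.64, True), (0.40, False)]

for threshold in (0.3, 0.6, 0.8):
    print(threshold, precision_at_threshold(results, threshold))
```

Sweeping the threshold over a grid and repeating this for each model would give one curve per model, which is the kind of comparison the two figures describe.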

[1]	李广丽, 刘觉夫. 垂直搜索引擎系统的研究与实现 [Research and implementation of a vertical search engine system][J]. 情报杂志, 2009, 28(10): 144-147, 169. URL: http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=qbzz200910034
[2]	肖冬梅. 垂直搜索引擎研究 [Research on vertical search engines][J]. 图书馆学研究, 2003(2): 87-89. URL: http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=tsgxyj200302030
[3]	许翰林, 王瑞, 王佳丽, 等. 基于Lucene的新闻垂直搜索引擎设计与实现 [Design and implementation of a Lucene-based vertical search engine for news][J]. 电脑编程技巧与维护, 2018(2): 50-52. URL: http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=dnbcjqywh201802014
[4]	彭玉容, 杨捧, 高媛. 农业搜索引擎的发展现状及关键技术研究 [Development status and key technologies of agricultural search engines][J]. 安徽农业科学, 2010, 38(20): 10971-10972, 10977. URL: http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=ahnykx201020181
[5]	王晓琴, 李书琴, 景旭, 等. 基于Nutch的农业垂直搜索引擎研究 [Research on a Nutch-based agricultural vertical search engine][J]. 计算机工程与设计, 2014, 35(6): 2239-2243. URL: http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=jsjgcysj201406069
[6]	武婷婷. 一种基于WebMagic和Mahout的信息搜集与推荐系统 [An information collection and recommendation system based on WebMagic and Mahout][J]. 软件导刊, 2016, 15(10): 1-3. URL: http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=rjdk201610001
[7]	吕太之, 毕家钦. 基于Hadoop平台的岗位分析和推荐系统的构建 [Building a job analysis and recommendation system on the Hadoop platform][J]. 河北软件职业技术学院学报, 2017, 19(4): 1-4. URL: http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=hbgcjszyxyxb201704002
[8]	张婷婷, 刘凯, 王伟军. 科研人员Web数据自动抓取模式及其开源解决方案 [Automatic Web data crawling patterns for researchers and their open-source solutions][J]. 信息资源管理学报, 2015, 5(2): 21-27. URL: http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=xxzyglxb201502003
[9]	李佳欣, 潘伟. PhantomJS在Web自动化测试中的应用 [Application of PhantomJS in automated Web testing][J]. 计算机光盘软件与应用, 2013(18): 76-77, 80. URL: http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=jsjgprjyyy201318058
[10]	胡越, 张源伟, 雷军. 自定规则的AJAX网页信息采集功能的设计 [Design of an AJAX web page information collection function with user-defined rules][J]. 物联网技术, 2016, 6(9): 86-87. URL: http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=wlwjs201609034
[11]	李浩. 基于评论的博客搜索引擎的设计与实现 [Design and implementation of a comment-based blog search engine][D]. Chongqing: Chongqing University, 2016. URL: http://cdmd.cnki.com.cn/Article/CDMD-10611-1016908413.htm
[12]	ZHU J, HU B, SHAO H. Research of Lightweight Vector Geographic Data Management Based on Main Memory Database Redis [J]. Journal of Geo-Information Science, 2014, 16(2): 165-172. URL: http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=dqxxkx201402003
[13]	GAO X B, FANG X M. High-Performance Distributed Cache Architecture Based on Redis[M]//Lecture Notes in Electrical Engineering. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013: 105-111.
[14]	ROEHM D, PAVEL R S, BARROS K, et al. Distributed Database Kriging for Adaptive Sampling (D2KAS) [J]. Computer Physics Communications, 2015, 192: 138-147. doi: 10.1016/j.cpc.2015.03.006
[15]	BALIS B, BUBAK M, HAREZLAK D, et al. Towards an Operational Database for Real-time Environmental Monitoring and Early Warning Systems [J]. Procedia Computer Science, 2017, 108: 2250-2259. doi: 10.1016/j.procs.2017.05.193
[16]	RIVEST R. The MD5 Message-Digest Algorithm[R]. RFC Editor, 1992. URL: https://tools.ietf.org/html/rfc1321
[17]	SZYDLO M, YIN Y L. Collision-Resistant Usage of MD5 and SHA-1 via Message Preprocessing [J]. Topics in Cryptology - CT-RSA 2006, 2006: 99-114. doi: 10.1007/11605805_7
[18]	HAVELIWALA T H. Topic-sensitive Pagerank: a Context-sensitive Ranking Algorithm for Web Search [J]. IEEE Transactions on Knowledge and Data Engineering, 2003, 15(4): 784-796. doi: 10.1109/TKDE.2003.1208999
[19]	LANGVILLE A N, MEYER C D. Google's PageRank and Beyond [J]. Mathematical Intelligencer, 2011, 30(1): 68-69. doi: 10.1007/BF02985759
[20]	LORIGO L, KLEINBERG J, EATON R, et al. A Graph-Based Approach towards Discerning Inherent Structures in a Digital Library of Formal Mathematics [J]. Mathematical Knowledge Management, 2004: 220-235. doi: 10.1007/978-3-540-27818-4_16
[21]	NOMURA S, OYAMA S, HAYAMIZU T, et al. Analysis and Improvement of HITS Algorithm for Detecting Web Communities [J]. Systems and Computers in Japan, 2004, 35(13): 32-42. doi: 10.1002/scj.10425
[22]	CHAKRABARTI S, DOM B E, GIBSON D, et al. Topic Distillation and Spectral Filtering [J]. Artificial Intelligence Review, 1999, 13(5-6): 409-435.
[23]	ARASU A, CHO J, GARCIA-MOLINA H, et al. Searching the Web [J]. ACM Transactions on Internet Technology (TOIT), 2001, 1(1): 2-43. doi: 10.1145/383034.383035
[24]	吴莉霞. 浅谈搜索引擎优化策略 [A brief discussion of search engine optimization strategies][J]. 电脑知识与技术, 2014, 10(15): 3662-3664. URL: http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=dnzsyjs-itrzyksb201415072
[25]	赵谦, 荆琪, 李爱萍, 等. 一种基于语义与句法结构的短文本相似度计算方法 [A short-text similarity calculation method based on semantics and syntactic structure][J]. 计算机工程与科学, 2018, 40(7): 1287-1294. URL: http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=jsjgcykx201807021
[26]	冯高磊, 高嵩峰. 基于向量空间模型结合语义的文本相似度算法 [A text similarity algorithm combining the vector space model with semantics][J]. 现代电子技术, 2018, 41(11): 157-161. URL: http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=xddzjs201811035
[27]	MIKOLOV T, CHEN K, CORRADO G, et al. Efficient Estimation of Word Representations in Vector Space [EB/OL]. arXiv: 1301.3781 [cs.CL], 2013. https://arxiv.org/abs/1301.3781
[28]	黄承慧, 印鉴, 侯昉. 一种结合词项语义信息和TF-IDF方法的文本相似度量方法 [A text similarity measure combining term semantic information with TF-IDF][J]. 计算机学报, 2011, 34(5): 856-864. URL: http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=jsjxb201105009
[29]	朱命冬, 徐立新, 申德荣, 等. 面向不确定文本数据的余弦相似性查询方法 [Cosine similarity query methods for uncertain text data][J]. 计算机科学与探索, 2018, 12(1): 49-64. URL: http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=jsjkxyts201801006
[30]	HINTON G E. Learning Distributed Representations of Concepts[C]//Proceedings of the Eighth Annual Conference of the Cognitive Science Society. Amherst, MA: Lawrence Erlbaum Associates, 1986: 1-12.
[31]	LE Q V, MIKOLOV T. Distributed Representations of Sentences and Documents [EB/OL]. arXiv: 1405.4053 [cs.CL], 2014. https://arxiv.org/abs/1405.4053
[32]	覃光华, 丁晶, 陈彬兵. 预防过拟合现象的人工神经网络训练策略及其应用 [A training strategy for artificial neural networks to prevent overfitting and its application][J]. 长江科学院院报, 2002, 19(3): 59-61. URL: http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=cjkxyyb200203017
[33]	KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet Classification with Deep Convolutional Neural Networks[C]//Advances in Neural Information Processing Systems 25 (NIPS 2012). 2012: 1097-1105.
[34]	KARDARAS D K, KAPERONIS S, BARBOUNAKI S, et al. An Approach to Modelling User Interests Using TF-IDF and Fuzzy Sets Qualitative Comparative Analysis [J]. Artificial Intelligence Applications and Innovations, 2018: 606-615. doi: 10.1007/978-3-319-92007-8_51
[35]	DHAR A, DASH N S, ROY K. Application of TF-IDF Feature for Categorizing Documents of Online Bangla Web Text Corpus [J]. Intelligent Engineering Informatics, 2018: 51-59. doi: 10.1007/978-981-10-7566-7_6
[36]	凤元杰, 刘正春, 王坚毅. 搜索引擎主要性能评价指标体系研究 [Research on an index system for evaluating the main performance of search engines][J]. 情报学报, 2004(1): 63-68. URL: http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=qbxb200401012


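With the rapid development of agricultural informatization, users of collaborative office platforms for agricultural research demand ever more research information at ever higher accuracy, and this demand is growing at an increasing rate. Faced with the enormous volume of information on the Web, however, a search often returns many pages unrelated to the target information^[1]. Compared with general-purpose search engines such as Baidu and Google, a vertical search engine focused on agricultural information^[2-3] can give agricultural researchers more specialized results. Agricultural vertical search engines abroad have already achieved some success^[4], for example Agriscape Search and WEBAgriSearch. In China they appeared relatively late: since the first agricultural search engine went online in 2007, the main domestic ones have been 农搜网 and 搜农网. These are still developing and have several shortcomings, and none of them focuses on agricultural research.

First, search results still contain a large amount of useless information^[5], so search accuracy and user satisfaction are low. Second, results are overly templated: they are always displayed under predefined category modules, regardless of whether the search keywords relate to those categories. Third, agricultural domain information is scarce: the current mainstream agricultural search engines mostly focus on market prices of agricultural products, while information on research hotspots, major achievements, practical techniques, policies and regulations, and domain hotspots is very sparse.

Building an intelligent office platform for agricultural research is an important means of advancing the modernization and informatization of agricultural research. Building on traditional agricultural vertical search engines, this paper ensures the accuracy of the data sources and adds a query mechanism based on semantic association analysis to provide accurate, timely retrieval of agricultural information, giving strong technical support for intelligent, information-based agricultural research office work. In such a platform, a small share of the data is generated during the office work of research institutions or entered manually, while most comes from external Internet data that is connected or crawled. Leaving aside data sharing through partner integrations, how to obtain information from outside the platform efficiently is a pressing problem, and a vertical search engine is the tool for solving it.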
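The abstract mentions that crawlers detect updates on the located data-source modules before fetching on a schedule. The text here does not say how a change is detected, so the following is only a minimal sketch under one assumption: a page counts as updated when a hash of its whitespace-normalized content differs from the hash stored at the previous visit. MD5 is used because the reference list cites it [16]; the class name `UpdateDetector`, the in-memory dictionary, and the example URL are hypothetical.

```python
import hashlib


def content_fingerprint(html_text):
    """Return the MD5 hex digest of the page text after collapsing whitespace."""
    normalized = " ".join(html_text.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


class UpdateDetector:
    """Remember the last fingerprint seen for each source URL.
    An in-memory dict is used here; a key-value store could play the
    same role in a real deployment (assumption, not from the paper)."""

    def __init__(self):
        self._seen = {}  # url -> fingerprint of the last fetched content

    def has_changed(self, url, html_text):
        """Return True when the page is new or differs from the last visit."""
        fingerprint = content_fingerprint(html_text)
        changed = self._seen.get(url) != fingerprint
        self._seen[url] = fingerprint
        return changed


detector = UpdateDetector()
url = "https://example.org/agri/policy"  # hypothetical data-source module
print(detector.has_changed(url, "<p>Notice A</p>"))   # first visit
print(detector.has_changed(url, "<p>Notice  A</p>"))  # whitespace-only difference
print(detector.has_changed(url, "<p>Notice B</p>"))   # new content
```

A scheduled job could call `has_changed` for each located module and pass only changed pages on to extraction, so unchanged sources cost one hash comparison per visit.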

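3. Conclusion

This paper locates data sources precisely by hand, uses a crawler system to automatically collect large volumes of agricultural information from the Internet, and performs semantic-similarity search with a neural-network-based algorithm that combines doc2vec with tf-idf. The experimental results show that this method is relatively accurate for ordered text search, whereas word2vec achieves higher precision for searches made up of unordered, discrete word combinations. Choosing the search method according to the text search scenario will therefore further improve the performance of the information search engine in the collaborative office platform for agricultural research. Future work will classify the user's search input as ordered or unordered in order to select the appropriate search method and reach higher accuracy. It will also pre-classify the texts and determine the category of the user's search input so that the search runs within a single category, narrowing the search scope to reduce computation time and improve search efficiency.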
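The idea of routing each query to the method suited to it can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the two embedding functions are toy stand-ins for trained word2vec and doc2vec models, the three-dimensional word vectors and documents are invented, and the `is_ordered` flag is supplied by the caller because the text leaves the ordered/unordered classification to future work. Only the routing and the cosine-similarity ranking with a threshold are the point of the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical 3-d word vectors standing in for a trained word2vec model.
WORD_VECTORS = {
    "wheat": [1.0, 0.0, 0.0],
    "disease": [0.0, 1.0, 0.0],
    "price": [0.0, 0.0, 1.0],
}
ZERO = [0.0, 0.0, 0.0]


def embed_unordered(tokens):
    """Order-insensitive stand-in: plain mean of the known word vectors."""
    known = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not known:
        return list(ZERO)
    return [sum(col) / len(known) for col in zip(*known)]


def embed_ordered(tokens):
    """Order-sensitive stand-in for a doc2vec-style model: the i-th known token
    gets weight 1 / (i + 1), so swapping tokens changes the vector."""
    known = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not known:
        return list(ZERO)
    weights = [1.0 / (i + 1) for i in range(len(known))]
    total = sum(weights)
    return [sum(w * x for w, x in zip(weights, col)) / total for col in zip(*known)]


def search(query_tokens, is_ordered, documents, threshold=0.5):
    """Pick the embedding by query type, score every document by cosine
    similarity, keep scores >= threshold, and return best matches first."""
    embed = embed_ordered if is_ordered else embed_unordered
    query_vec = embed(query_tokens)
    scored = [(cosine(query_vec, embed(tokens)), doc_id)
              for doc_id, tokens in documents.items()]
    return sorted((hit for hit in scored if hit[0] >= threshold), reverse=True)


documents = {  # hypothetical tokenized documents
    "doc_a": ["wheat", "disease"],
    "doc_b": ["disease", "wheat"],
    "doc_c": ["price"],
}

print(search(["disease", "wheat"], is_ordered=False, documents=documents))
print(search(["disease", "wheat"], is_ordered=True, documents=documents))
```

With the unordered stand-in, `doc_a` and `doc_b` score the same because word order is ignored; with the ordered stand-in, `doc_b` ranks above `doc_a` because its word order matches the query. In a real system the stand-ins would be replaced by vectors from the trained models.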



