On Design of Vertical Search Engine toward Agricultural Scientific Research Office

Yun LI; Ying DENG; Hua-rui WU

doi:10.13718/j.cnki.xsxb.2020.09.008

2020 Volume 45 Issue 9


Yun LI, Ying DENG, Hua-rui WU. On Design of Vertical Search Engine toward Agricultural Scientific Research Office[J]. Journal of Southwest China Normal University(Natural Science Edition), 2020, 45(9): 43-50. doi: 10.13718/j.cnki.xsxb.2020.09.008


1. Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China
2. National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
3. Beijing Research Center for Information Technology in Agriculture, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China
4. Key Laboratory of Agri-informatics, Ministry of Agriculture, Beijing 100097, China


Corresponding author: Hua-rui WU
Received Date: 07/08/2020
Available Online: 20/09/2020
MSC: S126

Abstract

Traditional general-purpose search engines are a poor fit for the agricultural domain: because their search coverage is unrestricted and their semantic association algorithms are ill-suited to the field, they return too many results that are not accurate enough to meet the needs of the agricultural scientific research office. This article presents an agricultural web information gathering system that monitors specific modules of a set of agricultural websites and collects and accumulates their updated information. The sources include official websites of national and local agricultural management departments, official websites of agricultural colleges and research institutes, agricultural magazine websites, and agricultural commercial websites. Specifying the data sources reduces unrelated data and effectively limits the search range. The search engine uses word2vec and text-feature-based doc2vec models, with data from agriculture-oriented websites as the text corpus, to build a word vector space and a document vector space. These handle searches over unordered sets of words and searches over ordered sentences or paragraphs, respectively, so that search results are both accurate and intelligent. Experimental results show that the system with the doc2vec+tf-idf search algorithm has higher accuracy in sequential search for agricultural information. Given the high performance of the word2vec algorithm in nonsequential search, dynamically choosing the corresponding algorithm for sequential or nonsequential search could further improve the accuracy of the search engine and satisfy the requirement of agricultural information for high-quality data resources.
Keywords:
- agricultural vertical search engine
- semantic similarity
- word2vec
- doc2vec
- tf-idf
- context-based search
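
The ranking idea in the abstract can be made concrete with a minimal sketch of tf-idf weighting and cosine similarity over a toy tokenized corpus. This is an illustration under stated assumptions, not the authors' implementation: the function names (`tfidf_vectors`, `cosine`, `search`), the smoothed idf formula, and the sample documents are all hypothetical, and sparse tf-idf vectors stand in here for trained doc2vec document vectors, which would require the agricultural corpus itself.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse tf-idf vector (term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf (an assumption of this sketch) keeps every weight positive.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (cnt / len(doc)) * idf[t] for t, cnt in tf.items()})
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_tokens, docs, top_k=3):
    """Return up to top_k (doc_index, score) pairs, best match first."""
    vectors, idf = tfidf_vectors(docs)
    # Terms that never occur in the corpus carry no information and are dropped.
    tf = Counter(t for t in query_tokens if t in idf)
    total = sum(tf.values())
    if total == 0:
        return []
    query = {t: (cnt / total) * idf[t] for t, cnt in tf.items()}
    scored = [(cosine(query, vec), i) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, score) for score, i in scored[:top_k] if score > 0.0]


# Hypothetical toy corpus; a real system would index crawled agricultural pages.
DOCS = [
    ["wheat", "rust", "disease", "control"],
    ["maize", "market", "price", "report"],
    ["wheat", "yield", "forecast", "model"],
]
RESULTS = search(["wheat", "disease"], DOCS)
```

In this toy run the document that shares both query terms ranks ahead of the one that shares only `wheat`, and the unrelated market report is filtered out because its score is zero.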

References

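[1]	LI G L, LIU J F. Research and Implementation of a Vertical Search Engine System [J]. Journal of Intelligence, 2009, 28(10): 144-147, 169. (in Chinese)
[2]	XIAO D M. Research on Vertical Search Engines [J]. Research on Library Science, 2003(2): 87-89. (in Chinese)
[3]	XU H L, WANG R, WANG J L, et al. Design and Implementation of a News Vertical Search Engine Based on Lucene [J]. Computer Programming Skills & Maintenance, 2018(2): 50-52. (in Chinese)
[4]	PENG Y R, YANG P, GAO Y. Development Status and Key Technologies of Agricultural Search Engines [J]. Journal of Anhui Agricultural Sciences, 2010, 38(20): 10971-10972, 10977. (in Chinese)
[5]	WANG X Q, LI S Q, JING X, et al. Research on an Agricultural Vertical Search Engine Based on Nutch [J]. Computer Engineering and Design, 2014, 35(6): 2239-2243. (in Chinese)
[6]	WU T T. An Information Gathering and Recommendation System Based on WebMagic and Mahout [J]. Software Guide, 2016, 15(10): 1-3. (in Chinese)
[7]	LÜ T Z, BI J Q. Construction of a Job Analysis and Recommendation System Based on the Hadoop Platform [J]. Journal of Hebei Software Institute, 2017, 19(4): 1-4. (in Chinese)
[8]	ZHANG T T, LIU K, WANG W J. Automatic Web Data Crawling Patterns for Researchers and Their Open-Source Solutions [J]. Journal of Information Resources Management, 2015, 5(2): 21-27. (in Chinese)
[9]	LI J X, PAN W. Application of PhantomJS in Web Automated Testing [J]. Computer CD Software and Applications, 2013(18): 76-77, 80. (in Chinese)
[10]	HU Y, ZHANG Y W, LEI J. Design of an AJAX Web Page Information Collection Function with Custom Rules [J]. Internet of Things Technologies, 2016, 6(9): 86-87. (in Chinese)
[11]	LI H. Design and Implementation of a Comment-Based Blog Search Engine [D]. Chongqing: Chongqing University, 2016. http://cdmd.cnki.com.cn/Article/CDMD-10611-1016908413.htm (in Chinese)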
[12]	ZHU J, HU B, SHAO H. Research of Lightweight Vector Geographic Data Management Based on Main Memory Database Redis [J]. Journal of Geo-Information Science, 2014, 16(2): 165-172.
[13]	GAO X B, FANG X M. High-Performance Distributed Cache Architecture Based on Redis[M]//Lecture Notes in Electrical Engineering. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013: 105-111.
[14]	ROEHM D, PAVEL R S, BARROS K, et al. Distributed Database Kriging for Adaptive Sampling (D2KAS) [J]. Computer Physics Communications, 2015, 192: 138-147. doi: 10.1016/j.cpc.2015.03.006
[15]	BALIS B, BUBAK M, HAREZLAK D, et al. Towards an Operational Database for Real-time Environmental Monitoring and Early Warning Systems [J]. Procedia Computer Science, 2017, 108: 2250-2259. doi: 10.1016/j.procs.2017.05.193
[16]	RIVEST R. The MD5 Message-Digest Algorithm[R]. RFC Editor, 1992.
[17]	SZYDLO M, YIN Y L. Collision-Resistant Usage of MD5 and SHA-1 via Message Preprocessing [J]. Topics in Cryptology - CT-RSA 2006, 2006: 99-114. doi: 10.1007/11605805_7
[18]	HAVELIWALA T H. Topic-sensitive Pagerank: a Context-sensitive Ranking Algorithm for Web Search [J]. IEEE Transactions on Knowledge and Data Engineering, 2003, 15(4): 784-796. doi: 10.1109/TKDE.2003.1208999
[19]	LANGVILLE A N, MEYER C D. Google's PageRank and Beyond [J]. Mathematical Intelligencer, 2011, 30(1): 68-69.
[20]	LORIGO L, KLEINBERG J, EATON R, et al. A Graph-Based Approach towards Discerning Inherent Structures in a Digital Library of Formal Mathematics [J]. Mathematical Knowledge Management, 2004: 220-235. doi: 10.1007/978-3-540-27818-4_16
[21]	NOMURA S, OYAMA S, HAYAMIZU T, et al. Analysis and Improvement of HITS Algorithm for Detecting Web Communities [J]. Systems and Computers in Japan, 2004, 35(13): 32-42. doi: 10.1002/scj.10425
[22]	CHAKRABARTI S, DOM B E, GIBSON D, et al. Topic Distillation and Spectral Filtering [J]. Artificial Intelligence Review, 1999, 13(5-6): 409-435.
[23]	ARASU A, CHO J, GARCIA-MOLINA H, et al. Searching the Web [J]. ACM Transactions on Internet Technology (TOIT), 2001, 1(1): 2-43. doi: 10.1145/383034.383035
[24]	WU L X. A Brief Discussion of Search Engine Optimization Strategies [J]. Computer Knowledge and Technology, 2014, 10(15): 3662-3664. (in Chinese)
[25]	ZHAO Q, JING Q, LI A P, et al. A Short-Text Similarity Calculation Method Based on Semantics and Syntactic Structure [J]. Computer Engineering & Science, 2018, 40(7): 1287-1294. (in Chinese)
[26]	FENG G L, GAO S F. A Text Similarity Algorithm Based on the Vector Space Model Combined with Semantics [J]. Modern Electronics Technique, 2018, 41(11): 157-161. (in Chinese)
[27]	MIKOLOV T, CHEN K, CORRADO G, et al. Efficient Estimation of Word Representations in Vector Space [EB/OL]. 2013: arXiv: 1301.3781 [cs.CL]. https://arxiv.org/abs/1301.3781.
[28]	HUANG C H, YIN J, HOU F. A Text Similarity Measure Combining Term Semantic Information with the TF-IDF Method [J]. Chinese Journal of Computers, 2011, 34(5): 856-864. (in Chinese)
[29]	ZHU M D, XU L X, SHEN D R, et al. A Cosine Similarity Query Method for Uncertain Text Data [J]. Journal of Frontiers of Computer Science and Technology, 2018, 12(1): 49-64. (in Chinese)
[30]	HINTON G E. Learning Distributed Representations of Concepts[C]//In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, 1986, Amherst MA: Lawrence Erlbaum Associates, c1986: 1-12.
[31]	LE Q V, MIKOLOV T. Distributed Representations of Sentences and Documents [EB/OL]. 2014: arXiv: 1405.4053 [cs.CL]. https://arxiv.org/abs/1405.4053.
[32]	QIN G H, DING J, CHEN B B. A Training Strategy for Artificial Neural Networks to Prevent Overfitting and Its Application [J]. Journal of Yangtze River Scientific Research Institute, 2002, 19(3): 59-61. (in Chinese)
[33]	KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet Classification with Deep Convolutional Neural Networks[C]// 19th International Conference on Neural Information Processing Systems, November 12-15, 2012, Doha, Qatar: Springer, c2012: 1097-1105.
[34]	KARDARAS D K, KAPERONIS S, BARBOUNAKI S, et al. An Approach to Modelling User Interests Using TF-IDF and Fuzzy Sets Qualitative Comparative Analysis [J]. Artificial Intelligence Applications and Innovations, 2018: 606-615. doi: 10.1007/978-3-319-92007-8_51
[35]	DHAR A, DASH N S, ROY K. Application of TF-IDF Feature for Categorizing Documents of Online Bangla Web Text Corpus [J]. Intelligent Engineering Informatics, 2018: 51-59. doi: 10.1007/978-981-10-7566-7_6
[36]	FENG Y J, LIU Z C, WANG J Y. Research on the Main Performance Evaluation Index System of Search Engines [J]. Journal of the China Society for Scientific and Technical Information, 2004(1): 63-68. (in Chinese)



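With the rapid development of agricultural informatization, users of collaborative office platforms for agricultural scientific research demand ever more research information and ever higher accuracy, and these demands are growing at an increasing rate. Faced with the vast volume of information on the web, however, a search returns many pages unrelated to the target information^[1]. Compared with general-purpose search engines such as Baidu and Google, vertical search engines focused on agricultural information^[2-3] can give agricultural researchers more specialized results. Agricultural vertical search engines abroad have already achieved some results^[4], for example Agriscape Search and WEBAgriSearch. Agricultural vertical search engines appeared relatively late in China. Since the first agricultural search engine went online in 2007, the main domestic ones have been Nongsou (农搜网) and Sounong (搜农网). They are still developing, have a number of shortcomings, and none focuses on agricultural scientific research. First, results still contain a large amount of invalid information^[5], so search precision and user satisfaction are low. Second, results are overly formulaic: they are displayed according to fixed category modules, regardless of whether the search keywords relate to the preset categories. Third, agricultural domain information is lacking: the current mainstream agricultural search engines mostly focus on market prices of agricultural products, while information on research hotspots, major achievements, practical techniques, policies and regulations, and domain hotspots is very scarce.

Building an intelligent office platform for agricultural scientific research is an important means of advancing the modernization and informatization of agricultural research. Building on traditional agricultural vertical search engines, this paper ensures the precision of the data sources and combines it with a semantic association analysis query mechanism to provide precise and timely retrieval of agricultural information, offering strong technical support for intelligent, information-based agricultural research offices. In such a platform, a small portion of the data is generated during the research institution's office work or entered manually, while most of it comes from external Internet data that is accessed and crawled. Setting aside data sharing with partners, how to acquire information from outside the platform efficiently becomes an urgent problem, and a vertical search engine is a tool for solving it.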

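3. Conclusion

This paper locates data sources precisely by hand, uses a crawler system to automatically gather large volumes of agricultural information from the Internet, and performs semantic similarity matching with a neural-network-based approach that combines doc2vec with tf-idf. The experimental results show that this method is highly accurate when searching ordered text, whereas word2vec achieves higher precision when searching unordered, discrete combinations of words. Choosing the search method to suit the text search scenario will therefore further improve the performance of the information search engine in the collaborative office platform for agricultural scientific research. In future work, we will classify a user's search content as ordered or unordered in order to select the appropriate search method and reach higher accuracy. We will also pre-classify documents, determine the category of the user's search content, and search only within the matching category, narrowing the search range to reduce computation time and improve search efficiency.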
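
The dynamic choice between search methods proposed as future work could be sketched roughly as follows. Everything here is hypothetical: the paper does not specify how ordered and unordered queries are told apart, so `is_sequential` uses an assumed heuristic (sentence punctuation, separators, token count), `route` is an invented helper, and the two search backends are passed in as plain callables standing in for a doc2vec+tf-idf search and a word2vec search. The regular-expression tokenizer assumes whitespace-delimited text; Chinese queries would first need a word segmenter.

```python
import re
from typing import Callable, List

# A search backend takes query tokens and returns any result object.
Searcher = Callable[[List[str]], object]

SENTENCE_END = ".?!。？！"
LIST_SEPARATORS = ",;，；"


def is_sequential(query: str, min_tokens: int = 4) -> bool:
    """Assumed heuristic: treat a query as ordered text if it ends like a
    sentence, or if it has at least `min_tokens` whitespace-separated tokens
    and no list separators such as commas or semicolons."""
    text = query.strip()
    if not text:
        return False
    if text[-1] in SENTENCE_END:
        return True
    if any(sep in text for sep in LIST_SEPARATORS):
        return False
    return len(text.split()) >= min_tokens


def route(query: str, sequential_search: Searcher, bag_search: Searcher):
    """Send ordered queries to one backend and unordered word sets to the other."""
    tokens = re.findall(r"\w+", query.lower())
    backend = sequential_search if is_sequential(query) else bag_search
    return backend(tokens)


# Stub backends for illustration only; real ones would query the vector spaces.
def doc_vector_search(tokens):
    return ("sequential", tokens)


def word_vector_search(tokens):
    return ("bag", tokens)


ORDERED = route("how to control stripe rust in winter wheat",
                doc_vector_search, word_vector_search)
UNORDERED = route("wheat, rust, fungicide",
                  doc_vector_search, word_vector_search)
```

Keeping the classifier separate from the backends means the heuristic can later be replaced by a trained classifier without touching either search implementation.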
