Natural Language Explanations in Visual Question Answering Systems Generated Using Artificial Intelligence Neural Network Architectures

YUAN Lei; WANG Kejun

doi:10.13718/j.cnki.xdzk.2024.10.018

2024 Volume 46 Issue 10

YUAN Lei, WANG Kejun. Natural Language Explanations in Visual Question Answering Systems Generated Using Artificial Intelligence Neural Network Architectures[J]. Journal of Southwest University Natural Science Edition, 2024, 46(10): 212-221. doi: 10.13718/j.cnki.xdzk.2024.10.018

YUAN Lei¹, WANG Kejun²,³

1. School of Information Engineering, Zhengzhou Technology and Business University, Zhengzhou 451400, China
2. School of Information, Beijing Institute of Technology (Zhuhai Campus), Zhuhai, Guangdong 519088, China
3. College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin 150001, China

Received Date: 19/12/2023
Available Online: 20/10/2024
MSC: TP393

Abstract

Model interpretability has long been a prominent challenge in artificial intelligence. Visual Question Answering (VQA) systems in particular need collaborative reasoning between the visual (image) and linguistic (question) components in order to generate answers that are both interpretable and reliable. However, existing methods often handle visual and linguistic features separately, so they fail to capture the interplay that VQA requires and do not explain the answer generation process. To address these issues, this study introduces an approach called Interpretable Transformer-Based Path Visual Question Answering. The method first uses Transformer encoder layers to encode the visual and linguistic features extracted by a pre-trained Convolutional Neural Network (CNN) and a domain-specific language model (LM), respectively. Decoder layers are then embedded to upsample the encoded features for the final VQA predictions. Extensive experiments on the challenging VQA-X and e-SNLI-VE datasets validate the effectiveness of this approach. The experimental results indicate that the proposed method outperforms other state-of-the-art methods in both qualitative and quantitative evaluations. This research helps explain single-image results in VQA models and also provides insight into the behavior of VQA models.
- interpretable,
- vision systems,
- artificial intelligence,
- neural networks,
- Transformer

References

[1]	高楠, 彭鼎原, 傅俊英, 等. 基于专利IPC分类与文本信息的前沿技术演进分析——以人工智能领域为例[J]. 情报理论与实践, 2020, 43(4): 123-129.
[2]	朱翌, 李秀. 医学图像描述综述: 编码、解码及最新进展[J]. 中国图象图形学报, 2023, 28(7): 1990-2010.
[3]	叶仕俊, 张鹏程, 吉顺慧, 等. 人工智能软件系统的非功能属性及其质量保障方法综述[J]. 软件学报, 2023, 34(1): 103-129.
[4]	GUO Z H, HAN D Z. Sparse Co-Attention Visual Question Answering Networks Based on Thresholds[J]. Applied Intelligence, 2023, 53(1): 586-600. doi: 10.1007/s10489-022-03559-4
[5]	CONG F Z, XU S B, GUO L, et al. Anomaly Matters: an Anomaly-Oriented Model for Medical Visual Question Answering[J]. IEEE Transactions on Medical Imaging, 2022, 41(11): 3385-3397. doi: 10.1109/TMI.2022.3185113
[6]	秦志金, 赵菼菼, 李凡, 等. 多模态语义通信研究综述[J]. 通信学报, 2023, 44(5): 28-41.
[7]	朱明婷, 徐崇利. 人工智能伦理的国际软法之治: 现状、挑战与对策[J]. 中国科学院院刊, 2023, 38(7): 1037-1049.
[8]	TRUONG L X, PHAM V Q, VAN NGUYEN K. Transformer-Based Approaches for Multilingual Visual Question Answering[J]. International Journal of Asian Language Processing, 2022, 32(4): 1-18.
[9]	王虞, 孙海春. 视觉问答技术研究综述[J]. 计算机科学与探索, 2023, 17(7): 1487-1505.
[10]	高鸿斌, 毛金莹, 王会勇. K-VQA: 一种知识图谱辅助下的视觉问答方法[J]. 河北科技大学学报, 2020, 41(4): 315-326.
[11]	LI H Y, HAN D Z. Multimodal Encoders and Decoders with Gate Attention for Visual Question Answering[J]. Computer Science and Information Systems, 2021, 18(3): 1023-1040. doi: 10.2298/CSIS201120032L
[12]	SHARMA H, SRIVASTAVA S. Visual Question Answering Model Based on the Fusion of Multimodal Features by a Two-Way Co-Attention Mechanism[J]. The Imaging Science Journal, 2021, 69(1-4): 177-189. doi: 10.1080/13682199.2022.2153489
[13]	GUO Z H, HAN D Z. Sparse Co-Attention Visual Question Answering Networks Based on Thresholds[J]. Applied Intelligence, 2023, 53(1): 586-600. doi: 10.1007/s10489-022-03559-4
[14]	ZHU H, TOGO R, OGAWA T, et al. Multimodal Natural Language Explanation Generation for Visual Question Answering Based on Multiple Reference Data[J]. Electronics, 2023, 12(10): 1-19.
[15]	BAZI Y, RAHHAL M M A, BASHMAL L, et al. Vision-Language Model for Visual Question Answering in Medical Imagery[J]. Bioengineering, 2023, 10(3): 1-17.
[16]	XU X, WANG T, YANG Y, et al. Radial Graph Convolutional Network for Visual Question Generation[J]. IEEE Transactions on Neural Networks and Learning Systems, 2021, 32(4): 1654-1667.
[17]	CAO F Q, LUO S W, NUNEZ F, et al. SceneGATE: Scene-Graph Based Co-Attention Networks for Text Visual Question Answering[J]. Robotics, 2023, 12(4): 1-18.
[18]	刘传. 基于门控图卷积网络和协同注意力的视觉问答[J]. 计算机与数字工程, 2023, 51(4): 860-865.

Figures(6) / Tables(3)

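Over the past few years, artificial intelligence (AI) has made remarkable progress, particularly in natural language processing (NLP) and computer vision (CV)^[1]. Rapid advances in these fields have given rise to a new generation of intelligent systems, including automated question answering systems, whose emergence and development have attracted wide attention and research in the community.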

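Visual question answering (VQA) is an emerging topic in artificial intelligence that combines knowledge from computer vision and natural language processing. The task takes visual information (an image) and a natural language question about that image as input, and produces a natural language answer as output. The field has attracted wide attention because of its broad application prospects, for example in virtual assistants, medical diagnosis, autonomous driving, and intelligent customer service^[2-5]. However, although VQA systems have made significant progress in answering questions, they still struggle to explain their decisions and the answer generation process. This shortcoming arises because VQA systems are usually treated as black boxes, so users find it hard to understand why the system gives a particular answer^[6]. The lack of interpretability not only limits the use of VQA systems in critical tasks but also reduces users' trust in and acceptance of them. Improving the interpretability of VQA systems has therefore become one of the current research hotspots.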

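In the VQA field, interpretability means that a system can clearly explain the basis and process by which it generates an answer, so that users can understand its decision logic. This covers not only the system's understanding of the question and the image but also an explanation of the answer generation path. For example, when answering a question about an object in an image, a VQA system with good interpretability should be able to explain why it chose a particular object as the answer and provide the reasoning behind that choice. The importance of interpretable VQA systems goes beyond user interaction and decision support and extends to broader social issues such as ethics and law. In some applications the decisions of a VQA system may directly affect people's lives, so those decisions must be clearly explainable and traceable. In healthcare, for example, where AI systems assist doctors in diagnosing disease, an interpretable AI system can provide detailed information about the basis of a diagnosis, allowing doctors and patients to understand the system's decision logic and make informed treatment decisions. Interpretability is also important in autonomous driving, where vehicles must make complex decisions such as avoiding collisions, overtaking, and parking. If these decisions cannot be explained, responsibility and safety will be hard to determine. In finance, interpretable AI systems can help analysts and investors better understand market trends and trading recommendations, supporting sounder investment strategies and lower financial risk. Some countries and regions have also issued regulations requiring AI systems to be interpretable, to ensure that their decisions are fair and free from bias^[7].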

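To improve the interpretability and performance of VQA systems, this paper proposes a unified method that uses Transformer encoder and decoder layers, taking full advantage of the complete Transformer architecture to fuse visual and language features and provide explanations for the retrieved answers. Experimental results show that, compared with several state-of-the-art methods, the proposed method generates answers more accurately and produces explanations that are more reasonable and closer to the facts. The main purpose and significance of this paper is to improve the interpretability and performance of VQA systems by introducing natural language explanations. It aims to develop an innovative neural network architecture (Transformer) that can not only answer questions but also explain the answer generation process in natural language, making the system's decision process easier for users to understand. Such natural language explanations not only help users understand the system's decisions but also provide detailed information about the answer reasoning process, enabling users to trace the path by which an answer was generated.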
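
As a rough illustration of how a Transformer layer can fuse language and visual features, the sketch below implements plain scaled dot-product cross-attention in standard-library Python. It is not the authors' implementation: the function names, the toy two-dimensional vectors, and the choice to let question tokens attend over image regions are all assumptions made for illustration, and the learned query/key/value projections, multiple heads, and feed-forward sublayers of a real Transformer are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys.

    queries: question-token vectors (hypothetical language features)
    keys, values: image-region vectors (hypothetical visual features)
    Returns (fused vectors, attention weights), one row per query.
    """
    dim = len(keys[0])
    out_dim = len(values[0])
    fused, weights = [], []
    for q in queries:
        # Similarity of this question token to every image region.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        w = softmax(scores)
        # Weighted sum of region features: the visually grounded token.
        fused.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(out_dim)]
        )
        weights.append(w)
    return fused, weights


# Toy inputs, made up for illustration only.
question_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_regions = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]

fused, weights = cross_attention(question_tokens, image_regions, image_regions)
```

Each row of `weights` shows which image regions a question token relied on. Attention maps of this kind are one common, if partial, way to inspect how a model connects a question to an image.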

4. Conclusion

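With the rapid development of artificial intelligence, visual question answering, a hot topic at the intersection of natural language processing and computer vision, has attracted wide attention. The interpretability of VQA systems becomes critical, especially when they are used for critical tasks or interact with human users. However, previous studies considered only the input image, which leads to insufficient information and in turn to wrong answers and unconvincing explanations. To address this, this paper proposes a novel VQA method with an interpretable generation path. The method uses Transformer encoder and decoder layers to embed the visual and language features of the VQA task, then embeds low-level image features into domain-specific contextual information, and finally uses this information to answer the question. The model uses the strengths of a CNN to extract low-level image features, uses a domain-specific language model to extract domain-specific contextual information, and captures high-level global dependencies with the Transformer. The model demonstrates advanced performance on two popular benchmark datasets (VQA-X and e-SNLI-VE), and extensive experiments confirm its effectiveness and interpretability. This work not only helps users understand system decisions but also provides detailed information about the answer reasoning process, enabling users to trace the answer generation path. In addition, it offers the deep learning and natural language processing communities an innovative method that combines attention mechanisms, LSTM, and CNN to process cross-modal information. The method can be applied to problems in other fields and provides new ideas for research on multimodal data processing and explainable AI. Nevertheless, although the method explains the behavior of VQA models, it still has limitations for complex questions and answers. In particular, for abstract questions or questions requiring long-horizon reasoning, the explanations may become more complex and call for deeper explanation mechanisms. Future research will focus on explaining more complex problems, involving more reasoning mechanisms, conversational VQA, or questions that need multi-step reasoning. Researchers will also explore applications of the method in different domains, such as healthcare, law, and education, to evaluate the model's domain adaptability.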
