A Systematic Comparison of Text and Image Encoders for Visual Question Answering: From RNN to LLM-Based Representations

Dik, Sümeyye Zülal; Hoşavcı, Reyhan; Akçelik, Zeliha Kaya; Kuş, Zeki; Aydın, Musa

doi:10.1109/ELECO69582.2025.11329297

dc.contributor.author	Dik, Sümeyye Zülal
dc.contributor.author	Hoşavcı, Reyhan
dc.contributor.author	Akçelik, Zeliha Kaya
dc.contributor.author	Kuş, Zeki
dc.contributor.author	Aydın, Musa
dc.date.accessioned	2026-04-24T09:14:29Z
dc.date.issued	2025
dc.department	FSM Vakıf Üniversitesi, Mühendislik Fakültesi, Bilgisayar Mühendisliği Bölümü
dc.description.abstract	This study presents a systematic comparative analysis of text and image encoder combinations for Visual Question Answering (VQA) using the EasyVQA dataset. We evaluate six text encoders (ELMo, BERT, RoBERTa, T5, SBERT, LLM2Vec) paired with three image encoders (ResNet-50, DenseNet-121, ViT) under both frozen and fine-tuned training scenarios. Our two-branch architecture processes images and questions separately before concatenating embeddings for classification. Results demonstrate significant performance variations between training strategies, with fine-tuning improving average accuracy from 93% to 96%. LLM2Vec achieved perfect performance (100% accuracy) with DenseNet-121 in frozen mode, while BERT and RoBERTa showed remarkable improvements through fine-tuning, reaching perfect scores with multiple image encoders. DenseNet-121 proved most stable across configurations. These findings reveal that modern LLM-based encoders excel with minimal adaptation, while traditional Transformer models benefit substantially from task-specific fine-tuning, providing crucial guidance for multimodal system design.
dc.identifier.citation	DİK, Sümeyye Zülal, Reyhan HOŞAVCI, Zeliha Kaya AKÇELİK, Zeki KUŞ & Musa AYDIN. "A Systematic Comparison of Text and Image Encoders for Visual Question Answering: From RNN to LLM-Based Representations". 2025 16th International Conference on Electrical and Electronics Engineering, (2025): 1-5.
dc.identifier.doi	10.1109/ELECO69582.2025.11329297
dc.identifier.endpage	5
dc.identifier.orcid	0009-0002-5629-6413
dc.identifier.orcid	0000-0003-3384-6670
dc.identifier.orcid	0009-0003-4897-0081
dc.identifier.orcid	0000-0001-8762-7233
dc.identifier.orcid	0000-0002-5825-2230
dc.identifier.scopus	2-s2.0-105034868226
dc.identifier.scopusquality	N/A
dc.identifier.startpage	1
dc.identifier.uri	https://hdl.handle.net/11352/6085
dc.indekslendigikaynak	Scopus
dc.language.iso	en
dc.publisher	ELECO
dc.relation.ispartof	2025 16th International Conference on Electrical and Electronics Engineering
dc.relation.publicationcategory	Conference Item - International - Institutional Faculty Member
dc.rights	info:eu-repo/semantics/embargoedAccess
dc.title	A Systematic Comparison of Text and Image Encoders for Visual Question Answering: From RNN to LLM-Based Representations
dc.type	Conference Object
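
The abstract describes a two-branch design in which image and question embeddings are computed separately, concatenated, and passed to a classifier, trained in either a frozen or a fine-tuned setting. The following is a minimal standard-library sketch of that fusion step only, not the authors' implementation. The names `concat_fusion`, `LinearHead`, `predict`, and `trainable_groups` are hypothetical, the embeddings are toy vectors standing in for real encoder outputs, and the reading of "frozen" as "only the classifier head is trained" is an assumption that the record does not confirm.

```python
import math
import random


def concat_fusion(image_emb, text_emb):
    """Late fusion: join the two branch embeddings into one feature vector."""
    return list(image_emb) + list(text_emb)


class LinearHead:
    """Hypothetical linear classifier over the fused embedding."""

    def __init__(self, in_dim, num_classes, seed=0):
        rng = random.Random(seed)
        # Small random weights; a real head would be learned from data.
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(num_classes)
        ]
        self.bias = [0.0] * num_classes

    def logits(self, features):
        return [
            sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(self.weights, self.bias)
        ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(image_emb, text_emb, head):
    """Return (predicted class index, class probabilities)."""
    probs = softmax(head.logits(concat_fusion(image_emb, text_emb)))
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs


def trainable_groups(fine_tune):
    """Assumed split: frozen mode trains only the head,
    fine-tuned mode also updates both encoders."""
    groups = ["classifier_head"]
    if fine_tune:
        groups += ["image_encoder", "text_encoder"]
    return groups


if __name__ == "__main__":
    image_emb = [0.2, -0.1, 0.5, 0.0]  # toy stand-in for an image encoder output
    text_emb = [0.3, 0.7, -0.4]        # toy stand-in for a text encoder output
    head = LinearHead(in_dim=len(image_emb) + len(text_emb), num_classes=3)
    label, probs = predict(image_emb, text_emb, head)
    print(label, [round(p, 3) for p in probs])
    print(trainable_groups(fine_tune=False))
```

In a real system the two toy vectors would come from one of the listed image encoders and one of the listed text encoders, and the head's input size would be the sum of their embedding dimensions.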

Collections

Bilgisayar Mühendisliği Bölümü
Biyomedikal Mühendisliği Bölümü
Scopus İndeksli Yayınlar Koleksiyonu
Yapay Zeka ve Veri Mühendisliği Bölümü