A Systematic Comparison of Text and Image Encoders for Visual Question Answering: From RNN to LLM-Based Representations

dc.contributor.author: Dik, Sümeyye Zülal
dc.contributor.author: Hoşavcı, Reyhan
dc.contributor.author: Akçelik, Zeliha Kaya
dc.contributor.author: Kuş, Zeki
dc.contributor.author: Aydın, Musa
dc.date.accessioned: 2026-04-24T09:14:29Z
dc.date.issued: 2025
dc.department: FSM Vakıf University, Faculty of Engineering, Department of Computer Engineering
dc.description.abstract: This study presents a systematic comparative analysis of text and image encoder combinations for Visual Question Answering (VQA) using the EasyVQA dataset. We evaluate six text encoders (ELMo, BERT, RoBERTa, T5, SBERT, LLM2Vec) paired with three image encoders (ResNet-50, DenseNet-121, ViT) under both frozen and fine-tuned training scenarios. Our two-branch architecture processes images and questions separately, then concatenates the two embeddings for classification. Results demonstrate significant performance variations between training strategies, with fine-tuning improving average accuracy from 93% to 96%. LLM2Vec achieved perfect performance (100% accuracy) with DenseNet-121 in frozen mode, while BERT and RoBERTa showed remarkable improvements through fine-tuning, reaching perfect scores with multiple image encoders. DenseNet-121 proved the most stable across configurations. These findings reveal that modern LLM-based encoders excel with minimal adaptation, while traditional Transformer models benefit substantially from task-specific fine-tuning, providing crucial guidance for multimodal system design.
dc.identifier.citation: DİK, Sümeyye Zülal, Reyhan HOŞAVCI, Zeliha Kaya AKÇELİK, Zeki KUŞ & Musa AYDIN. "A Systematic Comparison of Text and Image Encoders for Visual Question Answering: From RNN to LLM-Based Representations". 2025 16th International Conference on Electrical and Electronics Engineering, (2025): 1-5.
dc.identifier.doi: 10.1109/ELECO69582.2025.11329297
dc.identifier.endpage: 5
dc.identifier.orcid: 0009-0002-5629-6413
dc.identifier.orcid: 0000-0003-3384-6670
dc.identifier.orcid: 0009-0003-4897-0081
dc.identifier.orcid: 0000-0001-8762-7233
dc.identifier.orcid: 0000-0002-5825-2230
dc.identifier.scopus: 2-s2.0-105034868226
dc.identifier.scopusquality: N/A
dc.identifier.startpage: 1
dc.identifier.uri: https://hdl.handle.net/11352/6085
dc.indekslendigikaynak: Scopus
dc.language.iso: en
dc.publisher: ELECO
dc.relation.ispartof: 2025 16th International Conference on Electrical and Electronics Engineering
dc.relation.publicationcategory: Conference Item - International - Institutional Faculty Member
dc.rights: info:eu-repo/semantics/embargoedAccess
dc.title: A Systematic Comparison of Text and Image Encoders for Visual Question Answering: From RNN to LLM-Based Representations
dc.type: Conference Object
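
The two-branch design summarized in the abstract (separate image and question encoders, concatenated embeddings, one classifier, trained with encoders either frozen or fine-tuned) might be sketched roughly as below. This is a minimal illustration and not the authors' implementation: the small linear layers only stand in for pretrained encoders such as ResNet-50 or BERT, and the names `TwoBranchVQA`, `LinearLayer`, `freeze_encoders`, and all dimensions are hypothetical.

```python
import random


class LinearLayer:
    """Dense layer y = W x + b on plain Python lists (stand-in for a real encoder)."""

    def __init__(self, in_dim, out_dim, rng, trainable=True):
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                        for _ in range(out_dim)]
        self.bias = [0.0] * out_dim
        self.trainable = trainable  # False means the layer is frozen

    def __call__(self, x):
        return [sum(w * v for w, v in zip(row, x)) + b
                for row, b in zip(self.weights, self.bias)]

    def parameter_count(self):
        return sum(len(row) for row in self.weights) + len(self.bias)


class TwoBranchVQA:
    """Hypothetical two-branch model: encode, concatenate, classify."""

    def __init__(self, image_dim, text_dim, embed_dim, num_answers,
                 freeze_encoders, seed=0):
        rng = random.Random(seed)
        # Assumption: each branch is reduced to a single projection layer.
        self.image_encoder = LinearLayer(image_dim, embed_dim, rng,
                                         trainable=not freeze_encoders)
        self.text_encoder = LinearLayer(text_dim, embed_dim, rng,
                                        trainable=not freeze_encoders)
        # The classifier head is always trained, in both scenarios.
        self.classifier = LinearLayer(2 * embed_dim, num_answers, rng)

    def forward(self, image_features, question_features):
        image_embedding = self.image_encoder(image_features)
        text_embedding = self.text_encoder(question_features)
        fused = image_embedding + text_embedding  # list concatenation
        return self.classifier(fused)  # one logit per candidate answer

    def trainable_parameter_count(self):
        layers = (self.image_encoder, self.text_encoder, self.classifier)
        return sum(l.parameter_count() for l in layers if l.trainable)


frozen = TwoBranchVQA(image_dim=8, text_dim=6, embed_dim=4,
                      num_answers=3, freeze_encoders=True)
fine_tuned = TwoBranchVQA(image_dim=8, text_dim=6, embed_dim=4,
                          num_answers=3, freeze_encoders=False)
logits = frozen.forward([1.0] * 8, [0.5] * 6)
```

In this sketch the frozen scenario leaves only the classifier head trainable, while the fine-tuned scenario also updates both encoder branches; in a real framework the same switch would typically be made by disabling gradient updates on the encoder parameters.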

Files

Original bundle

Name: Dik
Size: 1.09 MB
Format: Adobe Portable Document Format
