A Systematic Comparison of Text and Image Encoders for Visual Question Answering: From RNN to LLM-Based Representations
| dc.contributor.author | Dik, Sümeyye Zülal | |
| dc.contributor.author | Hoşavcı, Reyhan | |
| dc.contributor.author | Akçelik, Zeliha Kaya | |
| dc.contributor.author | Kuş, Zeki | |
| dc.contributor.author | Aydın, Musa | |
| dc.date.accessioned | 2026-04-24T09:14:29Z | |
| dc.date.issued | 2025 | |
| dc.department | FSM Vakıf Üniversitesi, Mühendislik Fakültesi, Bilgisayar Mühendisliği Bölümü | |
| dc.description.abstract | This study presents a systematic comparative analysis of text and image encoder combinations for Visual Question Answering (VQA) using the EasyVQA dataset. We evaluate six text encoders (ELMo, BERT, RoBERTa, T5, SBERT, LLM2Vec) paired with three image encoders (ResNet-50, DenseNet-121, ViT) under both frozen and fine-tuned training scenarios. Our two-branch architecture processes images and questions separately before concatenating embeddings for classification. Results demonstrate significant performance variations between training strategies, with fine-tuning improving average accuracy from 93% to 96%. LLM2Vec achieved perfect performance (100% accuracy) with DenseNet-121 in frozen mode, while BERT and RoBERTa showed remarkable improvements through fine-tuning, reaching perfect scores with multiple image encoders. DenseNet-121 proved the most stable image encoder across configurations. These findings reveal that modern LLM-based encoders excel with minimal adaptation, while traditional Transformer models benefit substantially from task-specific fine-tuning, providing crucial guidance for multimodal system design. | |
| dc.identifier.citation | DİK, Sümeyye Zülal, Reyhan HOŞAVCI, Zeliha Kaya AKÇELİK, Zeki KUŞ & Musa AYDIN. "A Systematic Comparison of Text and Image Encoders for Visual Question Answering: From RNN to LLM-Based Representations". 2025 16th International Conference on Electrical and Electronics Engineering, (2025): 1-5. | |
| dc.identifier.doi | 10.1109/ELECO69582.2025.11329297 | |
| dc.identifier.endpage | 5 | |
| dc.identifier.orcid | 0009-0002-5629-6413 | |
| dc.identifier.orcid | 0000-0003-3384-6670 | |
| dc.identifier.orcid | 0009-0003-4897-0081 | |
| dc.identifier.orcid | 0000-0001-8762-7233 | |
| dc.identifier.orcid | 0000-0002-5825-2230 | |
| dc.identifier.scopus | 2-s2.0-105034868226 | |
| dc.identifier.scopusquality | N/A | |
| dc.identifier.startpage | 1 | |
| dc.identifier.uri | https://hdl.handle.net/11352/6085 | |
| dc.indekslendigikaynak | Scopus | |
| dc.language.iso | en | |
| dc.publisher | ELECO | |
| dc.relation.ispartof | 2025 16th International Conference on Electrical and Electronics Engineering | |
| dc.relation.publicationcategory | Conference Item - International - Institutional Faculty Member | |
| dc.rights | info:eu-repo/semantics/embargoedAccess | |
| dc.title | A Systematic Comparison of Text and Image Encoders for Visual Question Answering: From RNN to LLM-Based Representations | |
| dc.type | Conference Object |
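
The abstract describes a two-branch architecture in which image and question embeddings are produced separately and then concatenated for classification. The following is a minimal, dependency-free sketch of that fusion step only, not the authors' implementation. The embedding values, their dimensions, the number of answer classes, the single linear layer, and the helper names `concat_fusion`, `linear_softmax`, and `make_classifier` are all illustrative assumptions; in the study the embeddings would come from encoders such as DenseNet-121 and LLM2Vec.

```python
import math
import random


def concat_fusion(image_emb, text_emb):
    """Late fusion: join the two branch outputs into one feature vector."""
    return list(image_emb) + list(text_emb)


def linear_softmax(features, weights, biases):
    """Apply one linear layer, then softmax over the answer classes."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    peak = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_classifier(in_dim, num_classes, seed=0):
    """Hypothetical untrained classifier head with small random weights."""
    rng = random.Random(seed)
    weights = [
        [rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
        for _ in range(num_classes)
    ]
    biases = [0.0] * num_classes
    return weights, biases


# Stand-ins for encoder outputs (assumed values and sizes, not real embeddings).
image_emb = [0.2, -0.1, 0.4, 0.0]  # would come from the image branch
text_emb = [0.5, 0.3, -0.2]        # would come from the question branch

fused = concat_fusion(image_emb, text_emb)
weights, biases = make_classifier(len(fused), num_classes=3)  # 3 is arbitrary
probs = linear_softmax(fused, weights, biases)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

In the frozen scenario only a head like this would be trained on top of fixed encoder outputs, whereas in the fine-tuned scenario the encoder weights would also be updated; this sketch covers neither training loop.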










