A Systematic Comparison of Text and Image Encoders for Visual Question Answering: From RNN to LLM-Based Representations
Abstract
This study presents a systematic comparative analysis of text and image encoder combinations for Visual Question Answering (VQA) on the EasyVQA dataset. We evaluate six text encoders (ELMo, BERT, RoBERTa, T5, SBERT, LLM2Vec) paired with three image encoders (ResNet-50, DenseNet-121, ViT) under both frozen and fine-tuned training scenarios. Our two-branch architecture processes images and questions separately, then concatenates their embeddings for classification. Results show significant performance differences between training strategies, with fine-tuning improving average accuracy from 93% to 96%. LLM2Vec achieved perfect accuracy (100%) with DenseNet-121 in the frozen setting, while BERT and RoBERTa improved markedly with fine-tuning, reaching perfect scores with multiple image encoders. DenseNet-121 proved the most stable image encoder across configurations. These findings indicate that modern LLM-based encoders excel with minimal adaptation, whereas traditional Transformer models benefit substantially from task-specific fine-tuning, offering practical guidance for multimodal system design.
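As a rough illustration of the two-branch architecture summarized above, the PyTorch sketch below takes precomputed image and question embeddings, projects each branch, and classifies over their concatenation. All names and dimensions here (TwoBranchVQA, the 512-dimensional projections, the ReLU head, the 2048/768 feature sizes, and the 13-answer vocabulary) are illustrative assumptions, not the authors' exact configuration.

# Minimal PyTorch sketch of the two-branch VQA design (assumptions noted above).
import torch
import torch.nn as nn

class TwoBranchVQA(nn.Module):
    def __init__(self, img_dim, txt_dim, num_answers, hidden=512):
        super().__init__()
        # Image branch: pooled features from e.g. ResNet-50 / DenseNet-121 / ViT.
        self.img_proj = nn.Linear(img_dim, hidden)
        # Text branch: pooled question embedding from e.g. BERT / RoBERTa / LLM2Vec.
        self.txt_proj = nn.Linear(txt_dim, hidden)
        # Classifier head over the concatenated branch embeddings.
        self.classifier = nn.Sequential(nn.ReLU(), nn.Linear(2 * hidden, num_answers))

    def forward(self, img_feat, txt_feat):
        # img_feat: (B, img_dim); txt_feat: (B, txt_dim). In the frozen scenario,
        # only these projections and the classifier would be trained.
        fused = torch.cat([self.img_proj(img_feat), self.txt_proj(txt_feat)], dim=-1)
        return self.classifier(fused)

# Example: hypothetical ResNet-50 (2048-d) + BERT (768-d) features, 13 answer classes.
model = TwoBranchVQA(img_dim=2048, txt_dim=768, num_answers=13)
logits = model(torch.randn(4, 2048), torch.randn(4, 768))  # shape (4, 13)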










