A Systematic Comparison of Text and Image Encoders for Visual Question Answering: From RNN to LLM-Based Representations


Publisher

ELECO

Access Rights

info:eu-repo/semantics/embargoedAccess

Abstract

This study presents a systematic comparative analysis of text and image encoder combinations for Visual Question Answering (VQA) using the EasyVQA dataset. We evaluate six text encoders (ELMo, BERT, RoBERTa, T5, SBERT, LLM2Vec) paired with three image encoders (ResNet-50, DenseNet-121, ViT) under both frozen and fine-tuned training scenarios. Our two-branch architecture processes images and questions separately before concatenating their embeddings for classification. Results demonstrate significant performance variations between training strategies, with fine-tuning improving average accuracy from 93% to 96%. LLM2Vec achieved perfect performance (100% accuracy) with DenseNet-121 in frozen mode, while BERT and RoBERTa showed remarkable improvements through fine-tuning, reaching perfect scores with multiple image encoders. DenseNet-121 proved the most stable across configurations. These findings reveal that modern LLM-based encoders excel with minimal adaptation, while traditional Transformer models benefit substantially from task-specific fine-tuning, providing crucial guidance for multimodal system design.
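The two-branch fusion the abstract describes (encode image and question separately, concatenate the embeddings, classify) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the embedding dimensions, class count, and random stand-ins for the frozen encoder outputs are all assumptions.

```python
import random

random.seed(0)

# Toy dimensions (assumptions, not from the paper; in practice the branches
# might be e.g. a 2048-d pooled ResNet-50 vector and a 768-d BERT vector).
IMG_DIM, TXT_DIM, NUM_CLASSES = 8, 6, 3

def fuse_and_classify(img_emb, txt_emb, weights, bias):
    """Concatenate the two branch embeddings, score each answer class with
    a linear layer (one dot product per class), and return the argmax."""
    fused = img_emb + txt_emb  # list concatenation -> length IMG_DIM + TXT_DIM
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return max(range(NUM_CLASSES), key=lambda c: logits[c])

# Random stand-ins for frozen-encoder outputs and classifier parameters.
img_emb = [random.gauss(0, 1) for _ in range(IMG_DIM)]
txt_emb = [random.gauss(0, 1) for _ in range(TXT_DIM)]
weights = [[random.gauss(0, 1) for _ in range(IMG_DIM + TXT_DIM)]
           for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

pred = fuse_and_classify(img_emb, txt_emb, weights, bias)
```

In the frozen scenario only the classifier head is trained on top of fixed encoder outputs like these; in the fine-tuned scenario the encoder weights are updated as well.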

Source

2025 16th International Conference on Electrical and Electronics Engineering

Citation

DİK, Sümeyye Zülal, Reyhan HOŞAVCI, Zeliha Kaya AKÇELİK, Zeki KUŞ & Musa AYDIN. "A Systematic Comparison of Text and Image Encoders for Visual Question Answering: From RNN to LLM-Based Representations". 2025 16th International Conference on Electrical and Electronics Engineering, (2025): 1-5.
