EverydayMMQA

EverydayMMQA and OASIS resource page for culturally grounded spoken visual QA.

EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA

EverydayMMQA is a framework for building culturally grounded spoken visual QA resources. The paper introduces OASIS, a large-scale multimodal benchmark and training resource spanning English and Arabic varieties across 18 Arab countries.

Availability: Framework and dataset links will be added here once the release is public.

Background

EverydayMMQA targets a gap in multimodal evaluation: many current models perform well on standard visual question answering but still miss culturally grounded, everyday knowledge, especially in underrepresented languages. The framework organizes the full data creation pipeline, from culturally grounded topic and query generation through country-localized image retrieval, filtering, QA generation, speech generation, translation, and quality checking. Using this pipeline, the paper develops OASIS as a benchmark and training resource for spoken visual QA.

OASIS at a glance

  • 0.92M images in the final resource
  • 14.8M QA pairs across language varieties
  • 3.7M spoken questions
  • 18 Arab countries covered
  • 4 language varieties
  • 4 input settings
  • 9 top-level categories
  • 31 subcategories

The paper also reports roughly 20K hours of generated speech for full coverage and 141 hours of human recordings for benchmark subsets.

Framework pipeline

The framework is designed as an end-to-end pipeline for culturally grounded spoken visual QA. It combines query generation, locale-aware image retrieval, filtering, multilingual QA generation, speech generation and recording, translation, and quality control into one reusable process; a minimal code sketch of this composition follows the list below.

  • Culturally grounded topic and query generation with model-assisted filtering.
  • Country-localized image retrieval using locale settings and license constraints.
  • Image deduplication, filtering, categorization, and metadata generation.
  • Open-ended, multiple-choice, and true-false QA generation per image.
  • Speech generation, human recording, translation, and final quality checks.
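
As a rough illustration, the stages above might compose as in the Python sketch below. Every function name and stub body is a hypothetical stand-in for the model-assisted components described in the paper, not the released framework's API.

# Hypothetical sketch of the pipeline composition. All functions are
# illustrative stubs; the real stages are model-assisted and locale-aware.

def generate_queries(topics, country):
    # Stage 1: culturally grounded topic and query generation.
    return [f"{topic} in {country}" for topic in topics]

def retrieve_images(queries, locale):
    # Stage 2: country-localized image retrieval (placeholder URLs).
    return [{"query": q, "url": f"https://img.example/{i}.jpg", "locale": locale}
            for i, q in enumerate(queries)]

def filter_images(images):
    # Stage 3: deduplicate by URL; the real pipeline also filters,
    # categorizes, and generates metadata per image.
    seen, kept = set(), []
    for img in images:
        if img["url"] not in seen:
            seen.add(img["url"])
            kept.append(img)
    return kept

def generate_qa(image):
    # Stage 4: one open-ended, one multiple-choice, and two true/false
    # questions per image, matching the OASIS layout.
    return [{"type": "open_ended"}, {"type": "multiple_choice"},
            {"type": "true_false"}, {"type": "true_false"}]

def run_pipeline(topics, country, locale="ar"):
    images = filter_images(retrieve_images(generate_queries(topics, country), locale))
    for img in images:
        img["qa"] = generate_qa(img)
        # Stage 5 (speech generation, translation, quality checks) would
        # attach audio and translated text to each record here.
    return images

print(run_pipeline(["breakfast dishes", "wedding customs"], "Qatar"))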

Dataset analysis

OASIS integrates speech, images, and text to support culturally grounded evaluation beyond simple object recognition. The resource spans English, Modern Standard Arabic, Egyptian Arabic, and Levantine Arabic across a balanced set of country-specific contexts.

Each image is paired with four QA instances: one open-ended question, one multiple-choice question, and two true-false questions. The benchmark supports four main input settings for evaluation.

Input settings: Text, Speech, Text + Image, Speech + Image
Question types: Open-ended, Multiple-choice, True / False
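
Concretely, one record in such a resource might look like the illustrative layout below; the field names are assumptions for exposition, not the released OASIS schema.

# Illustrative per-image record; field names are assumed, not the released
# schema. Each image carries four QA instances, and each question can be
# posed in any of the four input settings.
record = {
    "image": "img_000001.jpg",
    "country": "Egypt",
    "language": "Egyptian Arabic",  # one of the four language varieties
    "qa": [
        {"type": "open_ended",      "question": "...", "answer": "..."},
        {"type": "multiple_choice", "question": "...", "choices": ["..."], "answer": 0},
        {"type": "true_false",      "question": "...", "answer": True},
        {"type": "true_false",      "question": "...", "answer": False},
    ],
    # For the speech settings, each question is paired with a spoken version.
    "audio": {"open_ended": "q_000001_0.wav"},
}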

Benchmarking

The paper evaluates a mix of closed and open multimodal models, including GPT-4.1, GPT-4o-audio, GPT-5, Gemini 2.5 Pro, Qwen2.5-Omni variants, Phi-4, and a fine-tuned Qwen2.5-Omni-3B model. The reported findings are consistent: visual grounding matters most, and smaller models improve substantially when the training signal aligns speech, text, and images.
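
For the closed-form question types, a per-setting accuracy tally could look like the sketch below; the model.answer interface and record layout are assumptions for illustration, not the paper's evaluation code.

# Hypothetical evaluation loop: accuracy per input setting on the
# closed-form question types. The model interface is assumed.
from collections import defaultdict

SETTINGS = ("text", "speech", "text+image", "speech+image")

def evaluate(model, records):
    correct, total = defaultdict(int), defaultdict(int)
    for rec in records:
        for qa in rec["qa"]:
            if qa["type"] not in ("multiple_choice", "true_false"):
                continue  # open-ended answers need a separate judge or metric
            for setting in SETTINGS:
                pred = model.answer(qa["question"], setting=setting,
                                    image=rec.get("image"),
                                    audio=rec.get("audio"))
                correct[setting] += int(pred == qa["answer"])
                total[setting] += 1
    return {s: correct[s] / max(total[s], 1) for s in SETTINGS}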

Images shift the bottleneck

Adding the image produces large gains across models and moves the remaining challenge from recognition toward faithful answer generation.

Grounding narrows gaps

Visual grounding reduces cross-lingual and dialect gaps, especially for Arabic varieties that are harder in text-only settings.

Speech benefits most

Images act as a modality equalizer by recovering much of the performance lost to speech and transcript noise.

Fine-tuning helps compact models

Light multimodal fine-tuning makes smaller systems materially more stable and competitive, particularly on audio-linked inputs.

Citation

@article{alam2025everydaymmqa,
  title={EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA},
  author={Alam, Firoj and Shahroor, Ali Ezzat and Hasan, Md. Arid and Ali, Zien Sheikh and Bhatti, Hunzalah Hassan and Kmainasi, Mohamed Bayan and Chowdhury, Shammur Absar and Mousi, Basel and Dalvi, Fahim and Durrani, Nadir and Milic-Frayling, Natasa},
  journal={arXiv preprint arXiv:2510.06371},
  year={2025}
}

Copyright © Qatar Computing Research Institute.