Abhirama Subramanyam

Email / LinkedIn / Google Scholar / GitHub / Twitter / CV

I'm a PMRF Ph.D. scholar in the Vision, Language and Learning Group (VL2G) led by Dr. Anand Mishra at IIT Jodhpur. My research focuses on developing open-source retrieval-augmented generation (RAG)-enabled large multimodal models (LMMs), specifically for knowledge-intensive question-answering tasks over multimodal data. These tasks include variants of visual question answering (VQA) and audio question answering (AQA) that require reasoning over external knowledge.


News


Research

When Big Models Train Small Ones: Label-Free Model Parity Alignment for Efficient Visual Question Answering using Small VLMs
Abhirama Subramanyam Penamakuri*, Navlika Singh*, Piyush Arora*, Anand Mishra (*: equal contribution)
EMNLP (Long Main), 2025  
project page (coming soon) / arXiv (coming soon) / code

We introduced the Model Parity Aligner (MPA), a label-free framework that improves Small Vision-Language Models (S-VLMs) by aligning them with larger models. Unlike traditional distillation, MPA relies only on unlabeled images and parity-based supervision to target knowledge gaps. Across four VQA benchmarks (TextVQA, ST-VQA, ChartQA, OKVQA), MPA consistently boosts S-VLM performance while retaining efficiency, and even leverages closed-source models like GPT-4o to help compact models rival or surpass much larger ones.

Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs
Somraj Gautam*, Abhirama Subramanyam Penamakuri*, Abhishek Bhandari, Gaurav Harit (*: equal contribution)
arXiv, 2025  
arXiv / dataset

We introduce MMCricBench-3K, a VQA benchmark that evaluates structure-aware, cross-lingual, multi-image, and mathematical reasoning of large vision-language models on cricket scorecards. The dataset comprises 3,000 QA pairs over 1,463 English and Hindi scorecards, spanning tasks from simple retrieval to complex numerical analysis.

Audiopedia: Audio QA with Knowledge
Abhirama Subramanyam Penamakuri*, Kiran Chhatre*, Akshat Jain. (*: equal contribution)
ICASSP, 2025   (Oral Presentation)
project page / arXiv / data / short talk

We introduce Audiopedia, a novel audio question answering task with three subtasks (s-AQA, m-AQA, and r-AQA) that requires both audio comprehension and external knowledge reasoning. To enhance large audio language models on such knowledge-intensive tasks, we further propose a framework that combines Audio Entity Linking (AEL) with a Knowledge-Augmented Audio Multimodal Model (KA2LM).

Visual Text Matters: Improving Text-KVQA with Visual Text Entity Knowledge-aware Large Multimodal Assistant
Abhirama Subramanyam Penamakuri, Anand Mishra
EMNLP (Long Main), 2024
project page / arXiv / code / poster / slides / short talk

We revisit Text-KVQA in light of recent advancements in large multimodal models, introducing VisTEL, a method for visual text entity linking that leverages both visual and textual cues. We further propose KaLMA, a knowledge-aware assistant that augments LMMs with knowledge about the visual text entity in the image for improved accuracy.

Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering
Abhirama Subramanyam Penamakuri, Manish Gupta, Mithun Das Gupta, Anand Mishra
IJCAI (Main Track), 2023   (Oral Presentation)
project page / arXiv / code / data / slides / short talk

We introduce RetVQA, a more challenging extension of traditional VQA in which a model must retrieve relevant images from a pool in order to answer a question. Our proposed MI-BART model, together with the new RetVQA dataset, achieves significant improvements in both accuracy and fluency over existing methods.

COFAR: Commonsense and Factual Reasoning in Image Search
Prajwal Gatti, Abhirama Subramanyam Penamakuri, Revant Teotia, Anand Mishra, Shubhashis Sengupta, Roshni Ramnani
AACL-IJCNLP, 2022
project page / arXiv / code / data / slides / short talk

We introduce the COFAR dataset to evaluate image search that requires commonsense and factual reasoning. To address this task, we propose KRAMT, which integrates visual entities, encyclopedic knowledge, and natural language queries for more accurate image retrieval.

Contrastive Multi-View Textual-Visual Encoding: Towards One Hundred Thousand-Scale One-Shot Logo Identification
Nakul Sharma, Abhirama Subramanyam Penamakuri, Anand Mishra
ICVGIP, 2022
project page / arXiv / paper / code / data

We address business logo identification in natural scenes with an open-set one-shot framework based on multi-view textual-visual encoding, outperforming state-of-the-art techniques. We also introduce the Wikidata Reference Logo Dataset (WiRLD) of 100,000 brand logos to study one-shot identification at scale.

System and method for intelligent recruitment management
Subramanian Viswanathan, Janakiraman Pradeep, Inbasekaran Bharath Kumar, Roy Subhadeep, Ragavan Shankarri, S Madhuvani, Abhirama Subramanyam Penamakuri, Sirisha Kona
US Patent (Granted), 2021

The invention presents an intelligent recruitment management system that automates hiring through a recruitment intelligence platform, with modules for requisition parsing, resume analysis, candidate submissions, and job matching. The platform lets recruiters efficiently track every step of the recruitment process.


Template Source: Jon Barron