Vidushee Vats
Multimodality and Generative Modelling
Contact: {X @ Y}, X=vatsvidushee, Y=gmail.com
Biography
Hi, I am Vidushee. I am a Research Intern at Halmstad University, working on any-to-any multimodal generation and diffusion alignment, advised by Shengzhi Li (Meta Llama) and Dr. Prayag Tiwari. Previously, I worked as an AI Engineer at TUM & SAP, building AI agents for automated UI navigation and testing. I was a Computer Vision Research Affiliate at Georgia Institute of Technology, advised by Prof. Vijay Madisetti, where I worked on visual grounding in vision-language models. I graduated with a B.Tech in Computer Science (AI) from Bennett University. [...]

Experience

Halmstad University

Research Intern

Sep 2024 - Present

Working on any-to-any multimodal generation with diffusion alignment, advised by Shengzhi Li and Dr. Prayag Tiwari. Added VLM support to the transformers library for training pipelines. Modified the NExT-GPT architecture to use FLUX instead of SDXL as the image decoder.


Technical University of Munich & SAP

AI Engineer

Mar 2025 - May 2025

Built AI agents for automated UI navigation and testing. Developed a framework to convert traditional test scripts into robust AI-based UI navigation. Constructed a pipeline that uses AI to heal failed test scripts.


Georgia Institute of Technology

Computer Vision Research Affiliate

Aug 2024 - Dec 2024

Designed a plug-and-play visual grounding framework. Outperformed SOTA models and GPT-4V on RefCOCOg by mIoU margins of 0.11 and 0.48, respectively. Extended the grounding framework to accept multiple modalities (image, video).


Zocket AI

Computer Vision Intern

Apr 2024 - Aug 2024

Built a pipeline to retrieve semantically relevant videos from a corpus of 100K+ videos with sub-0.5s latency. Designed an AI module for cross-platform ad resizing. Fine-tuned inpainting models for text-based background generation.

Research
[ICLR 2025]
Karun Sharma, Vidushee Vats
Think to Ground: Improving Spatial Reasoning in LLMs for Better Visual Grounding. This work enhances the spatial reasoning capabilities of large language models to improve their performance on visual grounding tasks.
@inproceedings{sharma2025think,
  title={Think to Ground: Improving Spatial Reasoning in LLMs for better Visual Grounding},
  author={Sharma, Karun and Vats, Vidushee},
  booktitle={ICLR 2025 Workshop on Reasoning and Planning for Large Language Models},
  year={2025}
}
[IJCNN 2024]
Karun Sharma, Vidushee Vats, Abhinendra Singh, Rahul Sahani, Deepak Rai, Ashok Sharma
@INPROCEEDINGS{10651096,
  author={Sharma, Karun and Vats, Vidushee and Singh, Abhinendra and Sahani, Rahul and Rai, Deepak and Sharma, Ashok},
  booktitle={2024 International Joint Conference on Neural Networks (IJCNN)},
  title={LLaVA-PlantDiag: Integrating Large-scale Vision-Language Abilities for Conversational Plant Pathology Diagnosis},
  year={2024},
  pages={1-7},
  doi={10.1109/IJCNN60899.2024.10651096}
}
[ICETCI 2023]
Ujjwal Gupta, Roshan Golash, Vidushee Vats, Karun Sharma
@INPROCEEDINGS{10330945,
  author={Gupta, Ujjwal and Golash, Roshan and Vats, Vidushee and Sharma, Karun},
  booktitle={2023 International Conference on Emerging Techniques in Computational Intelligence (ICETCI)},
  title={An Improved Hybrid Model for Target Detection},
  year={2023},
  pages={265-270},
  doi={10.1109/ICETCI58599.2023.10330945}
}