Vidushee Vats
Multimodality and Generative Modelling
Contact: {X @ Y}, X=vatsvidushee, Y=gmail.com
Biography
Hi, I am Vidushee. I am a Research Intern at Halmstad University, working on any-to-any multimodal generation and diffusion alignment, advised by Shengzhi Li (Meta Llama) and Dr. Prayag Tiwari. Previously, I worked as an AI Engineer at TUM & SAP, building AI agents for automated UI navigation and testing. I was a Computer Vision Research Affiliate at Georgia Institute of Technology, advised by Prof. Vijay Madisetti, where I worked on visual grounding in vision-language models. I graduated with a B.Tech in Computer Science (AI) from Bennett University. [...]

Experience

Halmstad University

Research Intern

Sep 2024 - Present

Working on any-to-any multimodal generation with diffusion alignment, advised by Shengzhi Li and Dr. Prayag Tiwari. Added VLM support to the transformers library for training pipelines. Modified the NExT-GPT architecture to use FLUX instead of SDXL as the image decoder.


Technical University of Munich & SAP

AI Engineer

Mar 2025 - May 2025

Built AI agents for automated UI navigation and testing. Developed a framework to convert traditional test scripts into robust AI-based UI navigation. Constructed a pipeline that uses AI to heal failed test scripts.


Georgia Institute of Technology

Computer Vision Research Affiliate

Aug 2024 - Dec 2024

Designed a plug-and-play visual grounding framework. Outperformed SOTA models and GPT-4V on RefCOCOg by mIoU margins of 0.11 and 0.48, respectively. Extended the grounding framework to accept multiple modalities (image, video).


Zocket AI

Computer Vision Intern

Apr 2024 - Aug 2024

Built a pipeline to retrieve semantically relevant videos from a corpus of 100K+ videos with sub-0.5s latency. Designed an AI module for cross-platform ad resizing. Fine-tuned inpainting models for text-based background generation.

Research
[ICLR 2025]
Karun Sharma, Vidushee Vats
Think to Ground: Improving Spatial Reasoning in LLMs for Better Visual Grounding. This work enhances the spatial reasoning capabilities of large language models to improve their performance on visual grounding tasks.
@inproceedings{sharma2025think,
  title={Think to Ground: Improving Spatial Reasoning in LLMs for better Visual Grounding},
  author={Sharma, Karun and Vats, Vidushee},
  booktitle={ICLR 2025 Workshop on Reasoning and Planning for Large Language Models},
  year={2025}
}
[IJCNN 2024]
Karun Sharma, Vidushee Vats, Abhinendra Singh, Rahul Sahani, Deepak Rai, Ashok Sharma
@INPROCEEDINGS{10651096,
  author={Sharma, Karun and Vats, Vidushee and Singh, Abhinendra and Sahani, Rahul and Rai, Deepak and Sharma, Ashok},
  booktitle={2024 International Joint Conference on Neural Networks (IJCNN)},
  title={LLaVA-PlantDiag: Integrating Large-scale Vision-Language Abilities for Conversational Plant Pathology Diagnosis},
  year={2024},
  pages={1-7},
  doi={10.1109/IJCNN60899.2024.10651096}
}
[ICETCI 2023]
Ujjwal Gupta, Roshan Golash, Vidushee Vats, Karun Sharma
@INPROCEEDINGS{10330945,
  author={Gupta, Ujjwal and Golash, Roshan and Vats, Vidushee and Sharma, Karun},
  booktitle={2023 International Conference on Emerging Techniques in Computational Intelligence (ICETCI)},
  title={An Improved Hybrid Model for Target Detection},
  year={2023},
  pages={265-270},
  doi={10.1109/ICETCI58599.2023.10330945}
}