A comprehensive multimodal AI application that leverages Qwen2.5-VL models for both image and video understanding tasks. This application provides an intuitive web interface for analyzing visual content using state-of-the-art vision-language models.
Note: Demo usage is currently available at https://huggingface.co/spaces/prithivMLmods/Qwen2.5-VL; this link may not remain available in the future.
Demo video: Video.Understanding.mp4
- Dual Model Support: Choose between Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-3B-Instruct models
- Image Analysis: Upload and analyze images with natural language queries
- Video Understanding: Process videos with intelligent frame sampling and analysis
- Real-time Streaming: Get responses as they are generated with streaming output
- Advanced Configuration: Fine-tune generation parameters for optimal results
- Interactive Examples: Pre-loaded examples for both image and video inference
Qwen2.5-VL-7B-Instruct: A powerful multimodal AI model developed by Alibaba Cloud that excels at understanding both text and images. This vision-language model (VLM) is designed to handle a wide range of visual understanding tasks, including image understanding, video analysis, and multilingual support.

Qwen2.5-VL-3B-Instruct: A lighter instruction-tuned vision-language model from Alibaba Cloud, built upon the Qwen2-VL series. It excels at understanding and generating text grounded in both visual and textual inputs, handling tasks such as image captioning, visual question answering, object localization, long-video understanding, and structured data extraction.
- Clone the repository:
git clone https://github.com/PRITHIVSAKTHIUR/Qwen2.5-VL-Video-Understanding.git
cd Qwen2.5-VL-Video-Understanding
- Install the required dependencies:
pip install -r requirements.txt
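The pinned dependencies live in requirements.txt. A representative set for this Gradio + transformers + OpenCV stack (illustrative, not the verbatim file) looks like:

```text
gradio
transformers
accelerate
torch
torchvision
opencv-python
pillow
```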
- Python 3.10+
- CUDA-compatible GPU (recommended for optimal performance)
- At least 41GB RAM
- Internet connection for model downloads
- Start the application:
python app.py
- Open your web browser and navigate to the provided local URL (typically http://localhost:7860)
- Choose between the Image Inference and Video Inference tabs
Image Inference:
- Enter your query in the text box
- Upload an image file
- Select your preferred model
- Adjust advanced parameters if needed
- Click Submit to run the analysis
Video Inference:
- Enter your query describing what you want to analyze
- Upload a video file
- The system will automatically extract 10 evenly spaced frames
- Select your preferred model
- Click Submit for comprehensive video understanding
- Max New Tokens: Control the length of generated responses (1-2048)
- Temperature: Adjust creativity vs consistency (0.1-4.0)
- Top-p: Nucleus sampling parameter (0.05-1.0)
- Top-k: Top-k sampling parameter (1-1000)
- Repetition Penalty: Reduce repetitive outputs (1.0-2.0)
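These knobs correspond one-to-one to Hugging Face transformers generation arguments. A minimal sketch of how streaming generation with these settings could be wired up (the function and its defaults are illustrative, not copied from app.py):

```python
from threading import Thread
from transformers import TextIteratorStreamer

def stream_generate(model, processor, inputs, max_new_tokens=1024, temperature=0.6,
                    top_p=0.9, top_k=50, repetition_penalty=1.2):
    """Yield partial responses as tokens are generated (real-time streaming)."""
    streamer = TextIteratorStreamer(
        processor.tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    generation_kwargs = dict(
        **inputs, streamer=streamer, do_sample=True, max_new_tokens=max_new_tokens,
        temperature=temperature, top_p=top_p, top_k=top_k,
        repetition_penalty=repetition_penalty,
    )
    Thread(target=model.generate, kwargs=generation_kwargs).start()  # generate off-thread
    buffer = ""
    for new_text in streamer:  # chunks arrive as soon as they are decoded
        buffer += new_text
        yield buffer
```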
The application uses intelligent video downsampling to extract 10 representative frames from uploaded videos. Each frame is processed with its timestamp to provide temporal context for analysis.
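A sketch of how this kind of evenly spaced sampling can be implemented with OpenCV (the function name and exact index math are illustrative):

```python
import cv2
from PIL import Image

def downsample_video(video_path, num_frames=10):
    """Return `num_frames` evenly spaced frames, each paired with its timestamp."""
    vidcap = cv2.VideoCapture(video_path)
    total = int(vidcap.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = vidcap.get(cv2.CAP_PROP_FPS) or 30.0  # guard against a zero FPS report
    indices = [int(i * (total - 1) / (num_frames - 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        vidcap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the target frame
        ok, frame = vidcap.read()
        if not ok:
            continue
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV decodes as BGR
        frames.append((Image.fromarray(frame), round(idx / fps, 2)))  # (image, seconds)
    vidcap.release()
    return frames
```

Each (frame, timestamp) pair is then folded into the prompt so the model can reason about when things happen in the clip.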
The application is optimized for GPU acceleration using CUDA when available. It automatically falls back to CPU processing if no GPU is detected.
Models are loaded with float16 precision to optimize memory usage while maintaining performance quality.
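A minimal sketch of loading one of the models under these constraints (the repository IDs are the public Hugging Face ones; everything else is illustrative):

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration  # needs a recent transformers release

MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"  # or "Qwen/Qwen2.5-VL-7B-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"  # automatic CPU fallback

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # half-precision weights roughly halve memory use
).to(device).eval()
```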
├── app.py # Main application file
├── requirements.txt # Python dependencies
├── README.md # Project documentation
├── images/ # Example images directory
│ ├── 1.jpg
│ └── 2.jpg
└── videos/ # Example videos directory
├── 1.mp4
├── 2.mp4
└── 3.mp4
- MAX_INPUT_TOKEN_LENGTH: Maximum input token length (default: 4096)
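The variable can be overridden from the shell before launching; a typical read pattern inside the app (illustrative) is:

```python
import os

# Fall back to 4096 tokens when the variable is not set.
MAX_INPUT_TOKEN_LENGTH = int(os.environ.get("MAX_INPUT_TOKEN_LENGTH", "4096"))
```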
- JPEG, PNG, BMP, TIFF
- Recommended resolution: Up to 2048x2048 pixels
- MP4, AVI, MOV, MKV
- Recommended duration: Up to 60 seconds for optimal processing
- Automatic frame extraction at 10 evenly spaced intervals
- Document analysis and data extraction
- Chart and graph interpretation
- Object detection and recognition
- Scene understanding
- OCR and text extraction
- Activity recognition
- Scene transition analysis
- Object tracking
- Content summarization
- Advertisement analysis
- 7B model provides more detailed and accurate responses but requires more computational resources
- 3B model offers faster processing with good quality results
- GPU acceleration significantly improves response times
- Video processing time scales with video duration and complexity
- CUDA Out of Memory: Reduce max_new_tokens or use the 3B model
- Slow Processing: Ensure GPU acceleration is available
- Model Loading Errors: Check internet connection for initial model downloads
- Video Format Issues: Convert videos to MP4 format if experiencing problems
- Minimum 28GB GPU memory for 7B model
- Minimum 16GB GPU memory for 3B model
- 48GB system RAM recommended for optimal performance
Contributions are welcome! Please feel free to submit pull requests or open issues for bugs and feature requests.
This project is licensed under the MIT License. See the LICENSE file for details.
- Alibaba Cloud for developing the Qwen2.5-VL models
- Hugging Face for the transformers library
- Gradio for the web interface framework
For questions and support, please open an issue on the GitHub repository or refer to the official Qwen documentation.
If you use this application in your research, please cite the original Qwen2.5-VL papers and this repository.