Vision
Learn how to use vision capabilities to understand images and videos with the REST API
Several Animus models have vision capabilities, meaning the models can take images or videos as input and answer questions about them. This page explains how to use these capabilities in your applications.
Quickstart
Images are made available to the model in two main ways: by passing a link to the image or by passing the Base64-encoded image directly in the request. In both cases, the image is included in a user message.
Our models are best at answering general questions about what is present in an image. You can ask the model what color objects are, what’s in your fridge, or to describe a scene in detail.
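Below is a minimal sketch of a request that passes an image by URL. The base URL, authentication header, and the nested shape of the image_url object are assumptions for illustration; the model name and message fields follow the API reference later on this page.

```python
import requests

# Hypothetical base URL and auth scheme; substitute your real values.
API_BASE = "https://api.example.com/v1"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

payload = {
    "model": "animuslabs/Qwen2-VL-NSFW-Vision-1.2",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                # The nested image_url object is an assumed shape; see the
                # content-object reference below.
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
}

response = requests.post(f"{API_BASE}/media/completions", headers=HEADERS, json=payload)
print(response.json())
```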
Uploading Base64 encoded images
If you have an image or set of images locally, you can pass them to the model in Base64 encoded format:
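A minimal sketch, assuming the API accepts a Base64 data URL in the image_url field; the base URL and auth header are placeholders.

```python
import base64
import requests

# Hypothetical base URL and auth scheme; substitute your real values.
API_BASE = "https://api.example.com/v1"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Read a local image and Base64-encode it.
with open("fridge.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "animuslabs/Qwen2-VL-NSFW-Vision-1.2",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in my fridge?"},
                # Passing the image as a data URL is an assumption; confirm the
                # accepted format against the API reference.
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encoded}"}},
            ],
        }
    ],
}

response = requests.post(f"{API_BASE}/media/completions", headers=HEADERS, json=payload)
print(response.json())
```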
Video Processing
Our vision models can also process videos. When a video is submitted, the API will extract frames at specific intervals and generate captions for each frame. This allows for detailed understanding of the video content.
How Video Frame Extraction Works
When processing videos, our system follows these rules:
- The system analyzes the total length of the video
- It determines the timestamps at which to capture frames, up to a maximum of 60 frames per video
- Each frame caption requires approximately 200-300 tokens
- The number of frames captured is optimized to stay within the maximum token limit of the language model
This approach allows for comprehensive video analysis while maintaining efficient token usage.
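The sketch below only illustrates the budgeting rules listed above; it is not the API's actual implementation, and the token limit and prompt reserve are invented values for the example.

```python
MAX_FRAMES = 60                 # documented cap on frames per video
TOKENS_PER_FRAME_CAPTION = 250  # midpoint of the documented 200-300 range
MODEL_TOKEN_LIMIT = 8192        # assumed model context size, for illustration
PROMPT_RESERVE = 1000           # assumed headroom for the prompt and the answer

def frame_timestamps(duration_seconds: float) -> list[float]:
    """Spread the captioned frames evenly across the video's length."""
    token_budget = MODEL_TOKEN_LIMIT - PROMPT_RESERVE
    frames_by_tokens = token_budget // TOKENS_PER_FRAME_CAPTION
    n = max(1, min(MAX_FRAMES, frames_by_tokens))
    step = duration_seconds / n
    return [round(i * step, 2) for i in range(n)]

timestamps = frame_timestamps(90.0)  # a 90-second clip
print(len(timestamps), timestamps[:5])
```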
Processing Long Videos with Webhooks
For long videos or batch processing tasks, we recommend using webhooks to receive notifications when processing is complete. This asynchronous approach is more efficient for:
- Videos exceeding several minutes in length
- Batch processing of multiple videos
- Processing high-resolution video content
- Workflows where you don’t need immediate results
Webhooks notify your application when video processing is complete, rather than requiring it to wait for a response. This is particularly useful for longer videos that may take significant time to process.
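A minimal sketch of a webhook receiver, assuming notifications arrive as an HTTP POST with a JSON body; the field names below are placeholders, so check the Webhooks guide for the actual payload schema.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/webhooks/video-processed", methods=["POST"])
def video_processed():
    payload = request.get_json(force=True)
    # Field names are assumptions for illustration; consult the Webhooks
    # guide for the real payload schema.
    job_id = payload.get("job_id")
    status = payload.get("status")
    print(f"Video job {job_id} finished with status: {status}")
    return jsonify({"received": True}), 200

if __name__ == "__main__":
    app.run(port=8000)
```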
To learn more about setting up and using webhooks for video processing, see our Webhooks guide.
Categories, Tags, and Metadata
In addition to basic descriptions, our vision models can return structured metadata about images and videos, as in the sketch below:
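This hedged sketch of a media/categories request uses a placeholder base URL and auth header; the parameters follow the API reference later on this page.

```python
import requests

# Hypothetical base URL and auth scheme; substitute your real values.
API_BASE = "https://api.example.com/v1"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

payload = {
    "media_url": "https://example.com/clip.mp4",
    "metadata": ["categories", "participants", "actions", "scene", "tags"],
    "use_scenes": True,  # one frame per detected scene (videos only)
}

response = requests.post(f"{API_BASE}/media/categories", headers=HEADERS, json=payload)
print(response.json())
```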
The metadata types you can request include:
- categories: General content categories
- participants: People or objects identified in the media
- actions: Activities or motions detected
- scene: Description of the setting or environment
- tags: Relevant keywords associated with the content
API Reference
Endpoints
For image and simple video analysis, use the media/completions endpoint.
For detailed metadata extraction, use the media/categories endpoint.
A separate endpoint is available for checking video processing status.
Request Parameters for media/completions
Parameter | Type | Description |
---|---|---|
model | string | The vision model to use (e.g., “animuslabs/Qwen2-VL-NSFW-Vision-1.2”) |
messages | array | Array of message objects that include text and image content |
Each message in the messages array has the following structure:
Field | Type | Description |
---|---|---|
role | string | The role of the message’s author. Can be “system”, “user”, or “assistant” |
content | array | Array of content objects, which can be of type “text” or “image_url” |
Content objects can be:
- Text objects:
- Image URL objects:
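As a sketch, the two content-object shapes might look like the following; the exact nesting of the image_url object is an assumption based on common chat-vision schemas, so confirm it against the full API reference.

```python
# Text content object
text_content = {"type": "text", "text": "What color is the car in this image?"}

# Image URL content object (the nested "image_url" object is an assumed shape)
image_content = {
    "type": "image_url",
    "image_url": {"url": "https://example.com/car.jpg"},
}
```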
Request Parameters for media/categories
Parameter | Type | Description |
---|---|---|
media_url | string | URL to the image or video to analyze |
metadata | array | Types of metadata to extract (categories, participants, actions, scene, tags) |
use_scenes | boolean | When true, enables scene detection for videos to select one frame per scene |
Best practices for vision requests
To get the most accurate and useful responses when using our vision capabilities:
- Ask clear questions: Be specific about what you want to know about the image
- Use high-quality images: Clearer images lead to better analysis
- Optimize image size: Resize large images to reduce token usage while maintaining sufficient detail (see the sketch after this list)
- Combine with text: Provide context along with the image for more relevant responses
- Experiment with prompting: Try different question formats to get the information you need
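As a sketch of the image-size tip above: downscale a large image before Base64-encoding it for upload. The 1024 px cap and JPEG quality are arbitrary example values, not documented limits.

```python
import base64
from io import BytesIO

from PIL import Image

def resize_and_encode(path: str, max_side: int = 1024) -> str:
    """Downscale a large image and return it Base64-encoded for upload."""
    img = Image.open(path).convert("RGB")
    img.thumbnail((max_side, max_side))  # preserves aspect ratio
    buffer = BytesIO()
    img.save(buffer, format="JPEG", quality=85)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

encoded = resize_and_encode("large_photo.jpg")
print(f"data:image/jpeg;base64,{encoded[:60]}...")
```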
Use cases
Our vision capabilities can be used for a variety of applications:
- Content description: Generate detailed descriptions of images
- Visual question answering: Answer specific questions about image contents
- Product identification: Identify products or objects in images
- Scene understanding: Analyze and describe complex scenes
- Educational tools: Create interactive learning experiences with visual content
Limitations
While our vision models are powerful, they have some limitations:
- They may struggle with small text within images
- Complex spatial relationships can be challenging
- Very domain-specific visual tasks may require specialized models
- The models don’t have perfect visual recognition for all objects