Vision
Learn how to use vision capabilities to understand images and videos with the REST API
Several Animus models have vision capabilities, meaning the models can take images or videos as input and answer questions about them. This page explains how to use these capabilities in your applications.
Quickstart
Images are made available to the model in two main ways: by passing a link to the image or by passing the Base64-encoded image directly in the request. In both cases, the image is included in a user message.
Our models are best at answering general questions about what is present in an image. You can ask the model what color objects are, what’s in your fridge, or to describe a scene in detail.
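Below is a minimal sketch of a request that passes an image by URL. The base URL, authentication header, and the nested shape of the image_url object are assumptions for illustration; the model name and message fields follow the API reference later on this page.

```python
import requests

# Hypothetical base URL and auth scheme; substitute your real values.
API_BASE = "https://api.example.com/v1"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

payload = {
    "model": "animuslabs/Qwen2-VL-NSFW-Vision-1.2",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                # The nested image_url object is an assumed shape; see the
                # content-object reference below.
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
}

response = requests.post(f"{API_BASE}/media/completions", headers=HEADERS, json=payload)
print(response.json())
```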
Uploading Base64 encoded images
If you have an image or set of images locally, you can pass them to the model in Base64 encoded format:
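A minimal sketch, assuming the API accepts a Base64 data URL in the image_url field; the base URL and auth header are placeholders.

```python
import base64
import requests

# Hypothetical base URL and auth scheme; substitute your real values.
API_BASE = "https://api.example.com/v1"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Read a local image and Base64-encode it.
with open("fridge.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "animuslabs/Qwen2-VL-NSFW-Vision-1.2",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in my fridge?"},
                # Passing the image as a data URL is an assumption; confirm the
                # accepted format against the API reference.
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encoded}"}},
            ],
        }
    ],
}

response = requests.post(f"{API_BASE}/media/completions", headers=HEADERS, json=payload)
print(response.json())
```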
Video Processing
Our vision models can also process videos. When a video is submitted, the API will extract frames at specific intervals and generate captions for each frame. This allows for detailed understanding of the video content.
How Video Frame Extraction Works
When processing videos, our system follows these rules:
- The system analyzes the total length of the video
- It determines the timestamps at which to capture frames, up to a maximum of 60 frames per video
- Each frame caption requires approximately 200-300 tokens
- The number of frames captured is optimized to stay within the maximum token limit of the language model
This approach allows for comprehensive video analysis while maintaining efficient token usage.
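The sketch below only illustrates the budgeting rules listed above; it is not the API's actual implementation, and the token limit and prompt reserve are invented values for the example.

```python
MAX_FRAMES = 60                 # documented cap on frames per video
TOKENS_PER_FRAME_CAPTION = 250  # midpoint of the documented 200-300 range
MODEL_TOKEN_LIMIT = 8192        # assumed model context size, for illustration
PROMPT_RESERVE = 1000           # assumed headroom for the prompt and the answer

def frame_timestamps(duration_seconds: float) -> list[float]:
    """Spread the captioned frames evenly across the video's length."""
    token_budget = MODEL_TOKEN_LIMIT - PROMPT_RESERVE
    frames_by_tokens = token_budget // TOKENS_PER_FRAME_CAPTION
    n = max(1, min(MAX_FRAMES, frames_by_tokens))
    step = duration_seconds / n
    return [round(i * step, 2) for i in range(n)]

timestamps = frame_timestamps(90.0)  # a 90-second clip
print(len(timestamps), timestamps[:5])
```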
Processing Long Videos with Webhooks
For long videos or batch processing tasks, we recommend using webhooks to receive notifications when processing is complete. This asynchronous approach is more efficient for:
- Videos exceeding several minutes in length
- Batch processing of multiple videos
- Processing high-resolution video content
- Workflows where you don’t need immediate results
Webhooks notify your application when video processing is complete, rather than requiring it to wait for a response. This is particularly useful for longer videos that may take significant time to process.
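A minimal sketch of a webhook receiver, assuming notifications arrive as an HTTP POST with a JSON body; the field names below are placeholders, so check the Webhooks guide for the actual payload schema.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/webhooks/video-processed", methods=["POST"])
def video_processed():
    payload = request.get_json(force=True)
    # Field names are assumptions for illustration; consult the Webhooks
    # guide for the real payload schema.
    job_id = payload.get("job_id")
    status = payload.get("status")
    print(f"Video job {job_id} finished with status: {status}")
    return jsonify({"received": True}), 200

if __name__ == "__main__":
    app.run(port=8000)
```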
To learn more about setting up and using webhooks for video processing, see our Webhooks guide.
Categories, Tags, and Metadata
In addition to basic descriptions, our vision models can return structured metadata about images and videos, as in the sketch below:
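This hedged sketch of a media/categories request uses a placeholder base URL and auth header; the parameters follow the API reference later on this page.

```python
import requests

# Hypothetical base URL and auth scheme; substitute your real values.
API_BASE = "https://api.example.com/v1"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

payload = {
    "media_url": "https://example.com/clip.mp4",
    "metadata": ["categories", "participants", "actions", "scene", "tags"],
    "use_scenes": True,  # one frame per detected scene (videos only)
}

response = requests.post(f"{API_BASE}/media/categories", headers=HEADERS, json=payload)
print(response.json())
```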
The metadata types you can request include:
- categories: General content categories
- participants: People or objects identified in the media
- actions: Activities or motions detected
- scene: Description of the setting or environment
- tags: Relevant keywords associated with the content
API Reference
Endpoints
For image and simple video analysis, use the media/completions endpoint.
For detailed metadata extraction, use the media/categories endpoint.
A separate endpoint is available for checking video processing status.
Request Parameters for media/completions
Parameter | Type | Description |
---|---|---|
model | string | The vision model to use (e.g., “animuslabs/Qwen2-VL-NSFW-Vision-1.2”) |
messages | array | Array of message objects that include text and image content |
Each message in the messages array has the following structure:
Field | Type | Description |
---|---|---|
role | string | The role of the message’s author. Can be “system”, “user”, or “assistant” |
content | array | Array of content objects, which can be of type “text” or “image_url” |
Content objects can be:
- Text objects:
- Image URL objects:
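As a sketch, the two content-object shapes might look like the following; the exact nesting of the image_url object is an assumption based on common chat-vision schemas, so confirm it against the full API reference.

```python
# Text content object
text_content = {"type": "text", "text": "What color is the car in this image?"}

# Image URL content object (the nested "image_url" object is an assumed shape)
image_content = {
    "type": "image_url",
    "image_url": {"url": "https://example.com/car.jpg"},
}
```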
Request Parameters for media/categories
Parameter | Type | Description |
---|---|---|
media_url | string | URL to the image or video to analyze |
metadata | array | Types of metadata to extract (categories, participants, actions, scene, tags) |
use_scenes | boolean | When true, enables scene detection for videos to select one frame per scene |
Best practices for vision requests
To get the most accurate and useful responses when using our vision capabilities:
- Ask clear questions: Be specific about what you want to know about the image
- Use high-quality images: Clearer images lead to better analysis
- Optimize image size: Resize large images to reduce token usage while maintaining sufficient detail (see the sketch after this list)
- Combine with text: Provide context along with the image for more relevant responses
- Experiment with prompting: Try different question formats to get the information you need
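As a sketch of the image-size tip above: downscale a large image before Base64-encoding it for upload. The 1024 px cap and JPEG quality are arbitrary example values, not documented limits.

```python
import base64
from io import BytesIO

from PIL import Image

def resize_and_encode(path: str, max_side: int = 1024) -> str:
    """Downscale a large image and return it Base64-encoded for upload."""
    img = Image.open(path).convert("RGB")
    img.thumbnail((max_side, max_side))  # preserves aspect ratio
    buffer = BytesIO()
    img.save(buffer, format="JPEG", quality=85)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

encoded = resize_and_encode("large_photo.jpg")
print(f"data:image/jpeg;base64,{encoded[:60]}...")
```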
Use cases
Our vision capabilities can be used for a variety of applications:
- Content description: Generate detailed descriptions of images
- Visual question answering: Answer specific questions about image contents
- Product identification: Identify products or objects in images
- Scene understanding: Analyze and describe complex scenes
- Educational tools: Create interactive learning experiences with visual content
Limitations
While our vision models are powerful, they have some limitations:
- They may struggle with small text within images
- Complex spatial relationships can be challenging
- Very domain-specific visual tasks may require specialized models
- The models don’t have perfect visual recognition for all objects