Several Animus models have vision capabilities, meaning the models can take images or videos as input and answer questions about them. This page explains how to use these capabilities in your applications.

Quickstart

Images can be made available to the model in two main ways: by passing a link to the image or by passing the Base64-encoded image directly in the request. In both cases, the image is included in a user message.

// Using fetch to make a direct API call
async function analyzeImage() {
  const response = await fetch('https://api.animusai.co/v2/media/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${process.env.ANIMUS_API_KEY}`
    },
    body: JSON.stringify({
      model: "animuslabs/Qwen2-VL-NSFW-Vision-1.2",
      messages: [
        {
          role: "user",
          content: [
            { type: "text", text: "What's in this image?" },
            {
              type: "image_url",
              image_url: {
                url: "https://example.com/image.jpg"
              }
            }
          ]
        }
      ]
    })
  });
  
  const data = await response.json();
  console.log(data.choices[0].message.content);
}

analyzeImage();

Our models are best at answering general questions about what is present in an image. You can ask what color objects are, what's in your fridge, or for a detailed description of a scene.

Uploading Base64 encoded images

If you have an image or set of images locally, you can pass them to the model in Base64 encoded format:

// Encode a local image file to Base64 (Node.js)
const fs = require('fs/promises');

async function getBase64(imagePath) {
  return fs.readFile(imagePath, { encoding: 'base64' });
}

async function analyzeBase64Image() {
  // Get base64 encoded image
  const base64Image = await getBase64('path/to/your/image.jpg');
  
  const response = await fetch('https://api.animusai.co/v2/media/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${process.env.ANIMUS_API_KEY}`
    },
    body: JSON.stringify({
      model: "animuslabs/Qwen2-VL-NSFW-Vision-1.2",
      messages: [
        {
          role: "user",
          content: [
            { type: "text", text: "What is in this image?" },
            {
              type: "image_url",
              image_url: {
                url: `data:image/jpeg;base64,${base64Image}`
              }
            }
          ]
        }
      ]
    })
  });
  
  const data = await response.json();
  console.log(data.choices[0].message.content);
}

analyzeBase64Image();

Video Processing

Our vision models can also process videos. When a video is submitted, the API will extract frames at specific intervals and generate captions for each frame. This allows for detailed understanding of the video content.

How Video Frame Extraction Works

When processing videos, our system follows these rules:

  1. The system analyzes the total length of the video
  2. It determines the timestamps at which to capture frames, up to a maximum of 60 frames per video
  3. Each frame caption requires approximately 200-300 tokens
  4. The number of frames captured is optimized to stay within the language model's maximum token limit

This approach allows for comprehensive video analysis while maintaining efficient token usage.
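
To make the trade-off concrete, here is a rough, illustrative calculation of how many frames fit within a caption token budget. The token budget and per-frame figure below are assumptions based on the numbers above, not the service's actual configuration:

// Illustrative only: estimate how many frames fit within a caption token budget.
// TOKENS_PER_FRAME and TOKEN_BUDGET are assumed figures for this sketch,
// not the service's actual configuration.
const MAX_FRAMES = 60;          // documented maximum frames per video
const TOKENS_PER_FRAME = 250;   // roughly 200-300 tokens per frame caption
const TOKEN_BUDGET = 8000;      // assumed token budget available for captions

function estimateFramePlan(videoDurationSeconds) {
  const framesByTokens = Math.floor(TOKEN_BUDGET / TOKENS_PER_FRAME);
  const frames = Math.min(MAX_FRAMES, framesByTokens);
  // Spread the frames evenly across the video's duration.
  const intervalSeconds = videoDurationSeconds / frames;
  return { frames, intervalSeconds };
}

console.log(estimateFramePlan(300)); // a 5-minute video -> { frames: 32, intervalSeconds: 9.375 }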

Processing Long Videos with Webhooks

For long videos or batch processing tasks, we recommend using webhooks to receive notifications when processing is complete. This asynchronous approach is more efficient for:

  • Videos exceeding several minutes in length
  • Batch processing of multiple videos
  • Processing high-resolution video content
  • Workflows where you don’t need immediate results

Webhooks allow your application to receive a notification when video processing is complete, rather than having your application wait for a response. This is particularly useful for longer videos that may take significant time to process.

To learn more about setting up and using webhooks for video processing, see our Webhooks guide.
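
As a sketch of what the receiving side can look like, the handler below uses Express to accept a JSON notification. The framework choice, route path, and port are assumptions made for this illustration; the actual notification payload is described in the Webhooks guide.

// Minimal webhook receiver sketch using Express.
// The route path and payload handling are illustrative assumptions;
// see the Webhooks guide for the actual notification format.
const express = require('express');

const app = express();
app.use(express.json());

app.post('/animus/video-webhook', (req, res) => {
  // Acknowledge receipt quickly, then process the notification.
  console.log('Video processing notification received:', req.body);
  res.sendStatus(200);
});

app.listen(3000, () => console.log('Webhook receiver listening on port 3000'));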

Categories, Tags, and Metadata

In addition to basic descriptions, our vision models can also return structured metadata about images and videos:

// Example request for analyzing media with categories and tags
async function analyzeMediaWithMetadata() {
  const response = await fetch('https://api.animusai.co/v2/media/categories', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${process.env.ANIMUS_API_KEY}`
    },
    body: JSON.stringify({
      media_url: "https://example.com/video.mp4",
      metadata: ["categories", "participants", "actions", "scene", "tags"],
      use_scenes: true // Enable scene detection for videos
    })
  });
  
  const data = await response.json();
  console.log(data);
  
  // For videos, you can check the processing status with the job_id
  if (data.job_id) {
    checkVideoProcessingStatus(data.job_id);
  }
}

// Check the status of a video processing job
async function checkVideoProcessingStatus(jobId) {
  const response = await fetch(`https://api.animusai.co/v2/media/categories/${jobId}`, {
    method: 'GET',
    headers: {
      'Authorization': `Bearer ${process.env.ANIMUS_API_KEY}`
    }
  });
  
  const statusData = await response.json();
  console.log(`Status: ${statusData.status}, Completion: ${statusData.percent_complete}%`);
  
  if (statusData.status === "COMPLETED") {
    console.log("Video processing results:", statusData.results);
    // Results contain timestamp-based categories and other metadata
  }
}

The metadata types you can request include:

  • categories: General content categories
  • participants: People or objects identified in the media
  • actions: Activities or motions detected
  • scene: Description of the setting or environment
  • tags: Relevant keywords associated with the content
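
Because video jobs complete asynchronously, a common pattern is to poll the status endpoint until the job reports COMPLETED. The sketch below builds on the status check shown above; the 5-second interval is an arbitrary choice for illustration:

// Poll the status endpoint until the job completes.
// The 5-second interval is an arbitrary choice for this sketch.
async function waitForVideoResults(jobId, intervalMs = 5000) {
  while (true) {
    const response = await fetch(`https://api.animusai.co/v2/media/categories/${jobId}`, {
      headers: { 'Authorization': `Bearer ${process.env.ANIMUS_API_KEY}` }
    });
    const statusData = await response.json();
    console.log(`Status: ${statusData.status}, Completion: ${statusData.percent_complete}%`);

    if (statusData.status === "COMPLETED") {
      return statusData.results; // timestamp-based categories and other metadata
    }

    // Note: you may also want to stop on error statuses; check the API
    // reference for the full set of status values.
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
}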

API Reference

Endpoints

For image and simple video analysis:

POST https://api.animusai.co/v2/media/completions

For detailed metadata extraction:

POST https://api.animusai.co/v2/media/categories

For checking video processing status:

GET https://api.animusai.co/v2/media/categories/{job_id}

Request Parameters for media/completions

Parameter | Type | Description
model | string | The vision model to use (e.g., "animuslabs/Qwen2-VL-NSFW-Vision-1.2")
messages | array | Array of message objects that include text and image content

Each message in the messages array has the following structure:

Field | Type | Description
role | string | The role of the message's author. Can be "system", "user", or "assistant"
content | array | Array of content objects, which can be of type "text" or "image_url"

Content objects can be:

  1. Text objects:
{
  "type": "text",
  "text": "Your text here"
}
  2. Image URL objects:
{
  "type": "image_url",
  "image_url": {
    "url": "https://example.com/image.jpg" // or data:image/jpeg;base64,BASE64_ENCODED_IMAGE
  }
}
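
Putting these pieces together, a complete messages array can combine a system prompt with a user message that contains both text and an image. The system prompt text below is only an example:

[
  {
    "role": "system",
    "content": [
      { "type": "text", "text": "You are a helpful assistant that describes images." }
    ]
  },
  {
    "role": "user",
    "content": [
      { "type": "text", "text": "What's in this image?" },
      {
        "type": "image_url",
        "image_url": { "url": "https://example.com/image.jpg" }
      }
    ]
  }
]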

Request Parameters for media/categories

Parameter | Type | Description
media_url | string | URL to the image or video to analyze
metadata | array | Types of metadata to extract (categories, participants, actions, scene, tags)
use_scenes | boolean | When true, enables scene detection for videos to select one frame per scene
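
For reference, a minimal request body using these parameters might look like the following (the media_url is a placeholder):

{
  "media_url": "https://example.com/image.jpg",
  "metadata": ["categories", "tags"],
  "use_scenes": false
}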

Best practices for vision requests

To get the most accurate and useful responses when using our vision capabilities:

  1. Ask clear questions: Be specific about what you want to know about the image
  2. Use high-quality images: Clearer images lead to better analysis
  3. Optimize image size: Resize large images to reduce token usage while maintaining sufficient detail (see the resizing sketch after this list)
  4. Combine with text: Provide context along with the image for more relevant responses
  5. Experiment with prompting: Try different question formats to get the information you need

Use cases

Our vision capabilities can be used for a variety of applications:

  • Content description: Generate detailed descriptions of images
  • Visual question answering: Answer specific questions about image contents
  • Product identification: Identify products or objects in images
  • Scene understanding: Analyze and describe complex scenes
  • Educational tools: Create interactive learning experiences with visual content

Limitations

While our vision models are powerful, they have some limitations:

  • They may struggle with small text within images
  • Complex spatial relationships can be challenging
  • Very domain-specific visual tasks may require specialized models
  • The models don’t have perfect visual recognition for all objects

Next Steps