Learn how to use vision capabilities to understand images and videos with the REST API
| Parameter | Type | Description |
|---|---|---|
| model | string | The vision model to use (e.g., "animuslabs/Qwen2-VL-NSFW-Vision-1.2") |
| messages | array | Array of message objects that include text and image content |
The `messages` array has the following structure:
| Field | Type | Description |
|---|---|---|
| role | string | The role of the message's author. Can be "system", "user", or "assistant" |
| content | array | Array of content objects, which can be of type "text" or "image_url" |
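For example, a vision request that pairs a text prompt with an image could be sent as in the Python sketch below. The base URL, environment variable name, nested `image_url` object shape, and response parsing are assumptions modeled on the common OpenAI-style chat format; substitute the actual values from your API reference.

```python
import os
import requests

# Hypothetical base URL and credential name; replace with the real values.
API_URL = "https://api.example.com/v1/chat/completions"
API_KEY = os.environ["API_KEY"]

payload = {
    "model": "animuslabs/Qwen2-VL-NSFW-Vision-1.2",
    "messages": [
        {
            "role": "user",
            # Content is an array mixing "text" and "image_url" items.
            "content": [
                {"type": "text", "text": "Describe what is happening in this image."},
                # Nested {"url": ...} shape is an assumption (OpenAI-style convention).
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()

# Response path assumes an OpenAI-compatible completion object.
print(response.json()["choices"][0]["message"]["content"])
```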
Media analysis requests accept the following parameters:

| Parameter | Type | Description |
|---|---|---|
| media_url | string | URL to the image or video to analyze |
| metadata | array | Types of metadata to extract (categories, participants, actions, scene, tags) |
| use_scenes | boolean | When true, enables scene detection for videos to select one frame per scene |
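A request using these parameters might look like the sketch below. The endpoint path and the printed response shape are assumptions; only the `media_url`, `metadata`, and `use_scenes` fields come from the table above.

```python
import os
import requests

# Hypothetical endpoint path; check the API reference for the real route.
ANALYZE_URL = "https://api.example.com/v1/media/analysis"
API_KEY = os.environ["API_KEY"]

payload = {
    "media_url": "https://example.com/clip.mp4",
    # Request only the metadata types you need from:
    # categories, participants, actions, scene, tags.
    "metadata": ["categories", "actions", "tags"],
    # For videos, scene detection selects one representative frame per scene.
    "use_scenes": True,
}

response = requests.post(
    ANALYZE_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=120,
)
response.raise_for_status()

# The exact structure of the returned metadata depends on the API.
print(response.json())
```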