Image Vision

When to Use Multimodal Capabilities?

The latest LLMs from OpenAI (GPT-4o and GPT-4 Turbo), Google (Gemini 1.5 Flash and Pro), and Anthropic (the Claude 3.x family) are multimodal. This means that, in addition to text, they can also "see" images, "read" PDFs, "listen" to audio files, and "view" videos. These documents serve as inputs, just like text, when the models generate their responses.

Currently, all of these models can "see" images. In beta, the Gemini family on Vertex AI can also "read" PDFs, "listen" to audio files, and "view" videos. It is likely that in the coming months all of these models, and possibly others, will become multimodal across every document type.

For now, we recommend primarily using image vision, which works very well. To do this, simply upload one or more images in a message and ask a "Native" type Mate that uses one of the aforementioned models to describe the image, explain its content or meaning, or extract textual information from it. This method of injecting knowledge into Mates is very useful and further amplifies their collaborative and task-performing capabilities.
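
If you are curious about what happens behind the scenes, the following is a minimal sketch of a comparable request made directly against the OpenAI Python SDK, assuming a vision-capable model such as GPT-4o and a placeholder image URL; when you attach an image to a message, the Mate performs the equivalent call for you.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # A user message can mix text and image parts; vision-capable models
    # such as GPT-4o accept both and reason over them together.
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Please describe the content of this image."},
                    {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder URL
                ],
            }
        ],
    )
    print(response.choices[0].message.content)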

Steps to Follow

  1. Upload Images:
    • Click on the "attachment" button in the message to upload images.
    • The images will appear as attachments to the message. Once the message is sent, a preview is generated for each image; you can click a preview to view the image at full size.
  2. Engage with a Mate:
    • After uploading the images, send a message directing a Mate to analyze the image(s). For example:
      • "Please describe the content of this image."
      • "Can you explain the meaning of this picture?"
      • "Extract the textual information from this image."
  3. Interpret Results:
    • The Mate will use its vision capabilities to analyze the images and provide detailed descriptions, explanations, or extracted information (see the sketch after these steps for what an equivalent request looks like against a provider API).
    • This enhances the Mate's ability to collaborate and perform tasks by using visual inputs as part of its knowledge base.
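
Because the Claude 3.x family is also vision-capable, here is a similar sketch against the Anthropic Messages API, this time sending a local image as base64 data and asking for its text to be extracted; the model name and file name are illustrative only, and the Mate handles this step for you when you attach an image.

    import base64

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # The Messages API expects inline images as base64-encoded data
    # together with their media type.
    with open("photo.png", "rb") as f:  # illustrative file name
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    message = client.messages.create(
        model="claude-3-opus-20240229",  # illustrative model choice
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": image_data,
                        },
                    },
                    {"type": "text", "text": "Extract the textual information from this image."},
                ],
            }
        ],
    )
    print(message.content[0].text)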

Importance of Using Multimodal Capabilities

Leveraging the vision capabilities of multimodal LLMs is extremely useful because it allows Mates to gain a deeper understanding of visual content, which can be crucial in collaborative tasks. By incorporating images, PDFs, audio, and video into the conversation, you provide richer inputs for Mates, enhancing their ability to generate accurate and contextually relevant responses. This not only improves task performance but also fosters more dynamic and effective collaboration.

