Maira Evaluation

Overview

Maira can evaluate conversations and score them based on their accuracy.

There are two ways evaluations can run for conversations:

  1. Evaluation can run automatically after each conversation.
  2. Evaluation can run manually for specified conversations.

What is a conversation?

When we send a request with a query to POST /v1/gpt/ask and get a response, this exchange is considered a Conversation. Each conversation has a conversation_id, the query, and the response, along with much other information.

How to find a conversation_id

There are two ways to get a conversation_id:

  1. When we hit the endpoint POST /v1/gpt/ask or POST /v1/gpt/ask/vision, we can find the conversation_id of that conversation in the response body.
  2. We can hit the endpoint /v1/gpt/conversations, and in the response, we will get a list of all conversations matching the specified time range and filters, with their details, including the conversation IDs. See the example at the bottom of this tutorial.
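As a sketch of pulling conversation IDs out of these responses: the nested "detail" → "response" shape below follows the GET /v1/gpt/conversations/{conversation_id} example at the bottom of this tutorial, and should be treated as an assumption for other endpoints.

```python
# Hedged sketch: extracting conversation_id values from Maira responses.
# The nested "detail" -> "response" shape is taken from the example at
# the bottom of this tutorial; treat it as an assumption elsewhere.

def id_from_conversation_detail(body: dict) -> str:
    """Return the conversation_id from a conversation-detail response."""
    return body["detail"]["response"]["conversation_id"]

def ids_from_conversation_list(conversations: list) -> list:
    """Return the conversation_id of every conversation in a list response."""
    return [c["conversation_id"] for c in conversations]
```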

How evaluation works

We have seen in the Maira Integration tutorial that there are two types of conversation:

  1. question - Only answers the given query/question without considering previous context/conversations
  2. chat - Answers the query considering the previous context/conversations

When evaluating question type conversations, the evaluation score is based on the relevance between the query and the documents used to generate the response.

When evaluating chat type conversations, the evaluation score is based on the same relevance between the query and the documents, plus the relevance to the previous context/conversations.

Thus, a score is provided for each document referenced during answer generation (see the example at the bottom).
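The distinction between the two types shows up as the conversation_type field in the ask request. A minimal sketch, where the query texts are placeholders and the profile ID is the sample value used in the example at the bottom of this page:

```python
# Two hypothetical ask payloads differing only in conversation_type.
question_request = {
    "query": "What products does company ABC sell?",
    "conversation_type": "question",  # scored on query vs. referenced documents
    "gpt_profile_id": "19473b2d-8d96-488d-9d37-6480d4fe5b81",
}

chat_request = {
    "query": "And how much do they cost?",
    "conversation_type": "chat",  # also scored on relevance to prior context
    "gpt_profile_id": "19473b2d-8d96-488d-9d37-6480d4fe5b81",
}
```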

Automatic Evaluation

To run evaluation automatically for each conversation, we have to set "is_auto_evaluation": true in the profile settings (refer to the "Profiles" tutorial). Then, when we use that profile's profile_id with POST /v1/gpt/ask (refer to the Maira Integration tutorial), the conversation gets evaluated automatically.
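The only profile setting this page relies on is the flag itself; the rest of the profile payload is covered in the "Profiles" tutorial and omitted from this sketch:

```python
# Fragment of a profile-settings payload; all other profile fields are
# described in the "Profiles" tutorial and omitted here.
profile_settings = {
    "is_auto_evaluation": True,  # evaluate every conversation automatically
}
```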

Manual Evaluation

To run evaluation manually for a conversation, use the endpoint POST /v1/gpt/conversations/evaluations.

Here is an example request body for this endpoint:

{
  "conversation_id": "string",
  "start_date": "2032-01-01",
  "end_date": "2032-12-31"
}

NOTE

  • To evaluate a single conversation: Provide the conversation_id.
  • To evaluate multiple conversations within a date range: Provide both start_date and end_date.
  • Either provide conversation_id OR both start_date and end_date.
  • If both a conversation_id and a date range are provided, date-range filtering takes priority.
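These rules can be mirrored client-side when building the request body. A hedged sketch (the helper name is ours, not part of the API), which also reproduces the documented date-range priority:

```python
def build_evaluation_request(conversation_id=None, start_date=None, end_date=None):
    """Build a body for POST /v1/gpt/conversations/evaluations.

    Mirrors the documented rules: provide conversation_id, OR both
    start_date and end_date; if both are given, the date range wins.
    """
    if start_date is not None and end_date is not None:
        # Date-range filtering takes priority over a single conversation_id.
        return {"start_date": start_date, "end_date": end_date}
    if conversation_id is not None:
        return {"conversation_id": conversation_id}
    raise ValueError("Provide conversation_id, or both start_date and end_date")
```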

Example of a "Conversation" and "Evaluation Score"

Here is an example response from the GET /v1/gpt/conversations/{conversation_id} endpoint.

Note that under the request_body section, we can see the full prompt sent to Maira via POST /v1/gpt/ask or POST /v1/gpt/ask/vision, including the query and the profile details.

Next, we see the response and which sections were used as references to generate it.

Then, in the references section, you can see the details of each document that was referenced to generate this response. Note the parameter called similarity_score, which is the result of the evaluation: it indicates how relevant the document was found to be to the query (and, for chat conversations, to the previous context).

{
  "detail": {
    "response": {
      "process_time": 16.089167355006794,
      "request_body": {
            "user_id": "008da05175a82d8686e7a04fe2db7393",
            "member_id": "232",
            "query": "This is a sample query",
            "conversation_type": "question",
            "gpt_profile_id": "19473b2d-8d96-488d-9d37-6480d4fe5b81",
            "context_preference": {
                "preferred": {
                    "project_id": "12345",
                    "client_id": "22"
                }
            },
            "conversation_metadata": {
                "project_id": "12345",
                "client_id": "22"
            },
            "result_includes": [],
            "top_k": 20,
            "keyword_ngram": 3,
            "is_keyword_enabled": true,
            "name": "Company related information",
            "intro": "Please answer questions in the same language using the provided context as truthfully as possible. It is better to provide short or no answer than incorrect ones as it is an example conversation",
            "system": "You are an expert in company ABC and its products.",
            "model": "gpt-4o",
            "temperature": 0,
            "top_p": 1,
            "frequency_penalty": 0,
            "presence_penalty": 0,
            "stop": [
                "AI:",
                "Human:"
            ],
            "response_format": null,
            "search_max_token": 5000,
            "completion_token": 2500,
            "vision_settings": {
                "resolution": "low",
                "is_image_context_enabled": true
            },
            "chat_history_length": 3,
            "is_personalizer_only": false,
            "dataset_tags": null,
            "is_auto_evaluation": false,
            "intro_token": 44,
            "system_token": 34
      },
      "request_url": "/v1/gpt/ask",
      "response_status": 200,
      "response": "This is a sample response to a sample question",
      "sections": [
        "sample_idx_1",
        "sample_idx_12",
        "sample_idx_17",
        "sample_idx_55"
      ],
      "references": [
        {
          "section_id": "sample_idx_1",
          "index": "idx",
          "content": "Sample data that has a lot of information about the company",
          "payload": {},
          "similarity_score": "85.09",
          "dataset_id": "91bf8ad4-e3c2-4988-8985-51499672557a"
        },
        {
          "section_id": "sample_idx_12",
          "index": "idx",
          "content": "Technical specification details for the personalizer",
          "payload": {},
          "similarity_score": "80.45",
          "dataset_id": "d9f8cabc-3b3d-4ef8-bd88-abcde34f7298"
        },
        {
          "section_id": "sample_idx_17",
          "index": "idx",
          "content": "Detailed troubleshooting guide for common issues",
          "payload": {},
          "similarity_score": "75.67",
          "dataset_id": "67cd90df-dcba-4bb6-9f12-a71b9cfa8291"
        },
        {
          "section_id": "sample_idx_55",
          "index": "idx",
          "content": "Customer reviews and testimonials",
          "payload": {},
          "similarity_score": "72.35",
          "dataset_id": "fc74efab-2171-4e67-b8e1-3b6d8e67f5f1"
        }
      ],
      "tokens": 5588,
      "usage": {
        "completion_tokens": 460,
        "prompt_tokens": 5128,
        "total_tokens": 5588,
        "completion_tokens_details": {
          "reasoning_tokens": 0
        },
        "query_tokens": 168
      },
      "conversation_id": "fdc20c1a-0761-425b-a2a0-b0621786e8a2",
      "gpt_profile_id": "19473b2d-8d96-488d-9d37-6480d4fe5b81",
      "conversation_context": [],
      "created_at": "2024-09-22T13:16:36.274173+09:00"
    }
  }
}
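A conversation-detail response like the one above can be turned into per-document evaluation scores with a small sketch. Note that similarity_score appears as a string in the example, so it is converted to a float here:

```python
def evaluation_scores(response_body: dict) -> dict:
    """Map each referenced section_id to its similarity_score as a float.

    Assumes the "detail" -> "response" -> "references" shape shown in the
    example above.
    """
    references = response_body["detail"]["response"]["references"]
    return {ref["section_id"]: float(ref["similarity_score"]) for ref in references}
```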