Maira Evaluation
Overview
Maira can evaluate conversations and score them based on their accuracy.
There are two ways evaluations can run for conversations:
- Evaluation can run automatically after each conversation.
- Evaluation can run manually for specified conversations.
What is a conversation?
When we send a request with a query to POST /v1/gpt/ask and get a response, that exchange is considered a Conversation. Each conversation has a conversation_id, the query, and the response, along with other information.
How to find a conversation_id
There are two ways to get a conversation_id:
- When we hit the endpoint POST /v1/gpt/ask or POST /v1/gpt/ask/vision, the response body includes the conversation_id of that conversation (a sketch follows this list).
- We can hit the endpoint /v1/gpt/conversations, and in the response we will get a list of all conversations matching the specified time range and filters, with their details, including the conversation IDs. See the example at the bottom of this tutorial.
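For illustration, here is a minimal Python sketch of the first approach. The base URL and auth header are hypothetical placeholders, and the payload fields mirror the request_body shown in the example at the bottom of this tutorial:

import requests

# Hypothetical base URL and auth header -- replace with your deployment's values.
BASE_URL = "https://api.example.com"
HEADERS = {"Authorization": "Bearer <YOUR_API_KEY>"}

# Ask a question; field names are taken from the request_body shown
# in the example at the bottom of this tutorial.
resp = requests.post(
    f"{BASE_URL}/v1/gpt/ask",
    headers=HEADERS,
    json={
        "query": "This is a sample query",
        "conversation_type": "question",
        "gpt_profile_id": "19473b2d-8d96-488d-9d37-6480d4fe5b81",
    },
)
resp.raise_for_status()

# The response body includes the conversation_id; the exact key path
# may differ in your response schema.
conversation_id = resp.json().get("conversation_id")
print(conversation_id)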
How evaluation works
We have seen in the Maira Integration tutorial that there are two types of conversation:
- question - Only answers the given query/question, without considering previous context/conversations
- chat - Answers the query considering the previous context/conversations
When evaluating question type conversations, the evaluation score is based on the relevance between the query and the documents used to generate the response.
When evaluating chat type conversations, the evaluation score is based on that same relevance between the query and the documents, plus the relevance to the previous context/conversations.
Thus, a score is provided against each document that is referenced during answer generation (see the example at the bottom).
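For reference, the type is selected with the conversation_type field of the request body. A minimal sketch of the two variants, with made-up queries and the profile ID from the example at the bottom:

# "question": the query is scored only against the retrieved documents.
question_payload = {
    "query": "What products does company ABC offer?",
    "conversation_type": "question",
    "gpt_profile_id": "19473b2d-8d96-488d-9d37-6480d4fe5b81",
}

# "chat": scoring additionally considers the previous context/conversations.
chat_payload = {
    "query": "And which of those is the newest?",
    "conversation_type": "chat",
    "gpt_profile_id": "19473b2d-8d96-488d-9d37-6480d4fe5b81",
}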
Automatic Evaluation
To run evaluation automatically for each conversation, we have to set "is_auto_evaluation": true in the profile settings (refer to the "Profiles" tutorial). Then, when we use that profile's profile_id with POST /v1/gpt/ask (refer to the Maira Integration tutorial), the conversation gets evaluated automatically.
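A minimal sketch, assuming the profile behind this profile_id was already saved with "is_auto_evaluation": true (see the "Profiles" tutorial) and reusing the hypothetical BASE_URL and HEADERS from the earlier sketch:

# Because the profile has "is_auto_evaluation": true, the conversation
# created by this call is evaluated automatically -- no extra request needed.
resp = requests.post(
    f"{BASE_URL}/v1/gpt/ask",
    headers=HEADERS,
    json={
        "query": "This is a sample query",
        "conversation_type": "question",
        "gpt_profile_id": "19473b2d-8d96-488d-9d37-6480d4fe5b81",  # auto-evaluation profile
    },
)
resp.raise_for_status()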
Manual Evaluation
To run evaluation manually for a conversation, you can use the endpoint POST /v1/gpt/conversations/evaluations.
Here is an example request body for this endpoint (a full request sketch follows the note below):
{
"conversation_id": "string",
"start_date": "2032-01-01",
"end_date": "2032-12-31"
}
NOTE
- To evaluate a single conversation: provide the conversation_id.
- To evaluate multiple conversations within a date range: provide both start_date and end_date.
- Provide either the conversation_id OR both start_date and end_date.
- If both are provided, date-range filtering takes priority.
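As a sketch (again with the hypothetical BASE_URL and HEADERS from above), either request shape works, but not both at once:

# Option 1: evaluate a single conversation.
single = {"conversation_id": "fdc20c1a-0761-425b-a2a0-b0621786e8a2"}

# Option 2: evaluate all conversations in a date range.
# If both shapes are combined, the date range takes priority.
ranged = {"start_date": "2032-01-01", "end_date": "2032-12-31"}

resp = requests.post(
    f"{BASE_URL}/v1/gpt/conversations/evaluations",
    headers=HEADERS,
    json=single,  # or `ranged`
)
resp.raise_for_status()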
Example of a "Conversation" and "Evaluation Score"
Here is an example below, from the response of the GET /v1/gpt/conversations/{conversation_id} endpoint.
Note that under the request_body section, we can see the full "prompt" sent to Maira via POST /v1/gpt/ask or POST /v1/gpt/ask/vision, including the "Query" and the "Profile detail".
Next, we see the response and which "sections" were used as references to generate this response.
Then, in the references section, you can see the details of each document that was referenced for this response generation. Note that each reference has a parameter called similarity_score, which is the result of the "Evaluation". This score indicates how relevant the document was found to be to the query (and, for chat type conversations, to the previous context).
{
"detail": {
"response": {
"process_time": 16.089167355006794,
"request_body": {
"user_id": "008da05175a82d8686e7a04fe2db7393",
"member_id": "232",
"query": "This is a sample query",
"conversation_type": "question",
"gpt_profile_id": "19473b2d-8d96-488d-9d37-6480d4fe5b81",
"context_preference": {
"preferred": {
"project_id": "12345",
"client_id": "22"
}
},
"conversation_metadata": {
"project_id": "12345",
"client_id": "22"
},
"result_includes": [],
"top_k": 20,
"keyword_ngram": 3,
"is_keyword_enabled": true,
"name": "Company related information",
"intro": "Please answer questions in the same language using the provided context as truthfully as possible. It is better to provide short or no answer than incorrect ones as it is an example conversation",
"system": "You are an expert in company ABC and its products.",
"model": "gpt-40",
"temperature": 0,
"top_p": 1,
"frequency_penalty": 0,
"presence_penalty": 0,
"stop": [
"AI:",
"Human:"
],
"response_format": null,
"search_max_token": 5000,
"completion_token": 2500,
"vision_settings": {
"resolution": "low",
"is_image_context_enabled": true
},
"chat_history_length": 3,
"is_personalizer_only": false,
"dataset_tags": null,
"is_auto_evaluation": false,
"intro_token": 44,
"system_token": 34
},
"request_url": "/v1/gpt/ask",
"response_status": 200,
"response": "This a sample response to a sample question",
"sections": [
"sample_idx_1",
"sample_idx_12",
"sample_idx_17",
"sample_idx_55"
],
"references": [
{
"section_id": "sample_idx_1",
"index": "idx",
"content": "Sample date that has a lot of information about the company",
"payload": {},
"similarity_score": "85.09",
"dataset_id": "91bf8ad4-e3c2-4988-8985-51499672557a"
},
{
"section_id": "sample_idx_12",
"index": "idx",
"content": "Technical specification details for the personalizer",
"payload": {},
"similarity_score": "80.45",
"dataset_id": "d9f8cabc-3b3d-4ef8-bd88-abcde34f7298"
},
{
"section_id": "sample_idx_17",
"index": "idx",
"content": "Detailed troubleshooting guide for common issues",
"payload": {},
"similarity_score": "75.67",
"dataset_id": "67cd90df-dcba-4bb6-9f12-a71b9cfa8291"
},
{
"section_id": "sample_idx_55",
"index": "idx",
"content": "Customer reviews and testimonials",
"payload": {},
"similarity_score": "72.35",
"dataset_id": "fc74efab-2171-4e67-b8e1-3b6d8e67f5f1"
}
],
"tokens": 5588,
"usage": {
"completion_tokens": 460,
"prompt_tokens": 5128,
"total_tokens": 5588,
"completion_tokens_details": {
"reasoning_tokens": 0
},
"query_tokens": 168
},
"conversation_id": "fdc20c1a-0761-425b-a2a0-b0621786e8a2",
"gpt_profile_id": "19473b2d-8d96-488d-9d37-6480d4fe5b81",
"conversation_context": [],
"created_at": "2024-09-22T13:16:36.274173+09:00"
}
}
}
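To pull the evaluation scores out of a response like the one above, a small sketch (same hypothetical BASE_URL and HEADERS):

resp = requests.get(
    f"{BASE_URL}/v1/gpt/conversations/fdc20c1a-0761-425b-a2a0-b0621786e8a2",
    headers=HEADERS,
)
resp.raise_for_status()

# Each referenced document carries its own similarity_score.
for ref in resp.json()["detail"]["response"]["references"]:
    print(ref["section_id"], ref["similarity_score"])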