Maira dataset management

This tutorial goes through all the endpoints for Maira dataset management and how to use them. You will find all the endpoints in our sandbox, under the "MyGPT Dataset" section.

Concepts

  • Datasets: A Dataset is a collection of "documents" (described below). For instance, a single dataset may contain documents with information about a company, such as its address, contact details, product descriptions, policies, etc. Another dataset might contain employee information, customer information, and so on. We will see later why you might want to have different datasets. Each dataset has a unique ID.
  • Documents: Individual units of information in a dataset, such as company addresses or product descriptions. These documents are used for training the application. Every response from Maira includes a list of the document(s) referenced to generate the response. Each document has a unique identifier called IDX. IDXs are unique within the same dataset.

Dataset Example

Here is a sample dataset with four documents:

idx | Heading | Content
1 | History of Gigalogy | 2020: Gigalogy was born in Tokyo. Jan, 2020: Gigalogy opened an office in Dhaka. 2024: Released Maira
2 | Address of Gigalogy Tokyo Office | The Gigalogy Inc. Head Office is located in: 2-chōme-12-11 Ōhashi, Meguro City, Tokyo 153-0044
3 | Phone Number of Tokyo Office | Phone: +813-4500-7914
4 | Main products | The best personalizer and Maira

This dataset has four documents. Notice that the first column, idx, is the unique identifier for each document. Each document can have multiple keys; in this case, we have Heading and Content.


Next, we will go through the operations and endpoints related to managing datasets and documents for Maira.

View the datasets and documents

Note that you will be able to view your datasets only after you have created them. The creation and management of datasets and documents are discussed in later sections.

View a list of all datasets

To view a list of all datasets, use the endpoint GET /v1/gpt/datasets. You will get a list of all your datasets with their metadata. Use the start parameter to specify the index number of the first result to show and the size parameter to specify the number of results to show.
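
A minimal sketch of calling this endpoint with Python's requests library is shown below; the base URL and authentication header are placeholders that depend on your environment and credentials.

import requests

BASE_URL = "https://your-api-host"     # placeholder: your Maira API host
HEADERS = {"api-key": "YOUR_API_KEY"}  # placeholder: your authentication header

# Ask for the first 10 datasets: start is the index of the first result,
# size is the number of results to return.
resp = requests.get(
    f"{BASE_URL}/v1/gpt/datasets",
    headers=HEADERS,
    params={"start": 0, "size": 10},
)
resp.raise_for_status()
print(resp.json())  # the list of your datasets with their metadata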

View summary of a dataset

To view the summary of a particular dataset, use GET /v1/gpt/datasets/{dataset_id}. In the response, along with the dataset's metadata, you will find a summary of the documents it holds. Here is an example response:

{
  "response": {
    "dataset": {
      "name": "SampleDataset",
      "idx_column_name": "id",
      "is_idx_fillup_if_empty": true,
      "secondary_idx_column": "secondary_id",
      "image_url_column": "image_url",
      "description": "This is a sample dataset used for demonstration purposes.",
      "dataset_id": "12345",
      "filterable_fields": ["field1", "field2", "field3"],
      "created_at": "2024-08-30T10:00:00.000000+09:00",
      "updated_at": "2024-08-30T10:05:00.000000+09:00",
      "documents_count": {
        "total": 1000,
        "active": 980,
        "text_trained": 960,
        "image_trained": 20
      }
    }
  }
}

Notice that the documents_count object displays the total count of documents and their respective statuses. Let's take a closer look at its fields:

  • total: The total number of documents in the dataset.
  • active: The number of active documents in the dataset. Active documents are used for answer generation if trained; if not yet trained, they will be trained in the next training cycle. Inactive (archived) documents are not used for answer generation.
  • text_trained: The number of text-trained documents. There can be reasons why certain documents in a dataset are not trained. For example, some documents may have been added after the dataset was last trained; in that case, there will be untrained documents, which are also not used for answer generation.
  • image_trained: Similar to text_trained, this shows how many documents have gone through image training.

Note that text-trained documents are used for answer generation through the POST /v1/gpt/ask endpoint, and image-trained documents are used for answer generation through the POST /v1/gpt/ask/vision endpoint.
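
For illustration, the sketch below reads the dataset summary and uses documents_count to check how many active documents are still waiting for text training (the base URL, API key header, and dataset ID are placeholders):

import requests

BASE_URL = "https://your-api-host"     # placeholder
HEADERS = {"api-key": "YOUR_API_KEY"}  # placeholder
DATASET_ID = "12345"                   # hypothetical dataset ID

resp = requests.get(f"{BASE_URL}/v1/gpt/datasets/{DATASET_ID}", headers=HEADERS)
resp.raise_for_status()
counts = resp.json()["response"]["dataset"]["documents_count"]

# Active documents that are not yet text-trained will only be usable by
# POST /v1/gpt/ask after the next training cycle.
pending = counts["active"] - counts["text_trained"]
print(f"{pending} active document(s) not yet text-trained")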

View all documents of a dataset

To view all documents of a particular dataset, use the endpoint GET /v1/gpt/datasets/{dataset_id}/documents. Here is an example response:

{
  "response": {
    "total_hits": 521,
    "returned_hits": 3,
    "documents": [
      {
        "id": "123e4567-e89b-12d3-a456-426614174000",
        "data": {
          "idx": "doc001",
          "content": "Sample content for document 1.",
          "tokens": "50"
        },
        "active_status": "active",
        "status": "trained",
        "image_train_status": "trained",
        "created_at": "2024-01-01T12:00:00+09:00",
        "updated_at": "2024-01-02T12:00:00+09:00"
      },
      {
        "id": "234f5678-f89c-23d4-b567-526725274111",
        "data": {
          "idx": "doc002",
          "content": "Sample content for document 2.",
          "tokens": "40"
        },
        "active_status": "active",
        "status": "trained",
        "image_train_status": "trained",
        "created_at": "2024-01-02T13:00:00+09:00",
        "updated_at": "2024-01-03T13:00:00+09:00"
      },
      {
        "id": "345g6789-g90d-34e5-c678-627836375222",
        "data": {
          "idx": "doc003",
          "content": "Sample content for document 3.",
          "tokens": "45"
        },
        "active_status": "active",
        "status": "trained",
        "image_train_status": "trained",
        "created_at": "2024-01-03T14:00:00+09:00",
        "updated_at": "2024-01-04T14:00:00+09:00"
      }
    ]
  }
}
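
A small sketch of fetching and inspecting this list with Python's requests library (placeholder base URL, header, and dataset ID):

import requests

BASE_URL = "https://your-api-host"     # placeholder
HEADERS = {"api-key": "YOUR_API_KEY"}  # placeholder
DATASET_ID = "12345"                   # hypothetical dataset ID

resp = requests.get(
    f"{BASE_URL}/v1/gpt/datasets/{DATASET_ID}/documents",
    headers=HEADERS,
)
resp.raise_for_status()
body = resp.json()["response"]
print(f'{body["returned_hits"]} of {body["total_hits"]} documents returned')
for doc in body["documents"]:
    # Each document exposes its IDX, training status, and active status.
    print(doc["data"]["idx"], doc["status"], doc["active_status"])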

View a single document

To view a single document from any dataset, use the endpoint GET /v1/gpt/datasets/{dataset_id}/documents/{document_id}

This endpoint takes the dataset ID and the document ID. Here is an example response:

{
  "detail": {
    "dataset_id": "7f3036cb-59b5-4046-a23f-66f2427329df",
    "data": {
      "idx": "6dfdbb0c-1c8d-4ba5-b137-31d099908c84",
      "content": "This is the content of a text document, full of useful information",
      "tokens": "101.0"
    },
    "status": "trained",
    "image_train_status": null,
    "active_status": "active",
    "created_at": "2024-05-24T12:21:02.715715+09:00",
    "content": "idx:6dfdbb0c-1c8d-4ba5-b137-31d099908c84\n content:This is the content of a text document, full of useful information\n tokens:101.0"
  }
}

Dataset management

Create dataset

To create a new dataset, use the endpoint POST /v1/gpt/datasets. Here is an example request body:

{
  "name": "My Dataset",
  "idx_column": "id",
  "is_idx_fillup_if_empty": true,
  "secondary_idx_column": "slug",
  "image_url_column": "image_url",
  "description": "This is a dataset containing information about...",
  "tags": "tag1, tag2, tag3",
  "filterable_fields": "field1,field2,field3"
}

Let's go through the parameters:

  • name: Give a name to your dataset, so that you can easily identify it later.
  • idx_column: This specifies the column name containing the data unit's (document's) ID.
  • is_idx_fillup_if_empty: If your dataset contains data units without IDs and you want to automatically fill them with UUIDs, set is_idx_fillup_if_empty to true. By default, this parameter is set to false.
  • image_url_column: If your dataset includes images and you want to use the vision API with it, provide the image_url_column, which should specify the column name containing the data unit's image URL. In the profile, when is_image_context_enabled is set to true, the images of the relevant documents will be sent to the LLM as well, along with the image provided by the user.
  • description: The description of the dataset.
  • tags: Comma-separated list of tags for the dataset. For example, you could set an internal tag for a dataset that is meant to be used only in the company's internal applications. A dataset for customer service may have a tag customer_service. A dataset can have multiple tags, e.g. [customer_service, marketing].
  • filterable_fields: Comma-separated list of fields that you will use to filter your context during response generation.

Once you have successfully created a dataset, you will get the dataset_id in the response.
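
A sketch of creating a dataset with the request body shown above, again with a placeholder base URL and authentication header:

import requests

BASE_URL = "https://your-api-host"     # placeholder
HEADERS = {"api-key": "YOUR_API_KEY"}  # placeholder

payload = {
    "name": "My Dataset",
    "idx_column": "id",
    "is_idx_fillup_if_empty": True,
    "description": "This is a dataset containing information about...",
    "tags": "tag1, tag2, tag3",
    "filterable_fields": "field1,field2,field3",
}

resp = requests.post(f"{BASE_URL}/v1/gpt/datasets", headers=HEADERS, json=payload)
resp.raise_for_status()
print(resp.json())  # the response contains the dataset_id of the new dataset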

Update or Delete a dataset

With the PUT /v1/gpt/datasets/{dataset_id} endpoint, you can update or edit a dataset. Note that you can update not only the metadata but also the data (documents) of the dataset with this endpoint (via CSV/JSON file upload). Be cautious: if new documents have the same IDX value as existing documents, they will replace the existing documents.

To add or remove documents from a dataset, there are other endpoints, which we discuss in the sections below.

To delete a dataset, use DELETE /v1/gpt/datasets/{dataset_id} with the dataset_id.

Document management

We have multiple endpoints that support document management within and across multiple datasets efficiently. At a high level, in this tutorial, the endpoints for document management are divided into three kinds:

  1. Create documents
  2. Edit documents in bulk
  3. Edit single documents

Below we will go through each of those endpoints.

Create documents

With POST /v1/gpt/datasets/{dataset_id}/documents, you can create one or more documents under a particular dataset. You have to pass the dataset_id to specify the dataset. Here is a sample request body:

{
  "documents": [
    {
      "key1": "value1"
    },
    {
      "key1": "value1",
      "key2": "value2"
    }
  ],
  "is_background_task": true
}

Under documents, pass the list of objects to be created. A random IDX will be generated for each document, and the keys and values will be added to each document.
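
As a sketch, here is how you might create documents that follow the example dataset above (Heading and Content keys), using placeholder connection details:

import requests

BASE_URL = "https://your-api-host"     # placeholder
HEADERS = {"api-key": "YOUR_API_KEY"}  # placeholder
DATASET_ID = "12345"                   # hypothetical dataset ID

payload = {
    "documents": [
        {"Heading": "Main products", "Content": "The best personalizer and Maira"},
        {"Heading": "Phone Number of Tokyo Office", "Content": "Phone: +813-4500-7914"},
    ],
    "is_background_task": True,
}

resp = requests.post(
    f"{BASE_URL}/v1/gpt/datasets/{DATASET_ID}/documents",
    headers=HEADERS,
    json=payload,
)
resp.raise_for_status()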

Edit documents in bulk

Bulk update documents under a single dataset

To update documents in bulk within a single dataset, use PUT /v1/gpt/datasets/{dataset_id}/documents. Here is an example request body:

{
  "documents": [
    {
      "your_idx_column": "abc",
      "key1": "value1",
      "key2": "value2"
    },
    {
      "your_idx_column": "def",
      "key1": "value1",
      "key2": "value2"
    }
  ],
  "is_background_task": true
}

For example, if I have a dataset where the IDX column is id and I want to update documents 101 and 102, my request body will look like this:

{
  "documents": [
    {
      "id": 101,
      "key1": "updated value",
      "key2": "updated "
    },
    {
      "id": 102,
      "key1": "value1",
      "key2": "value2"
    }
  ],
  "is_background_task": true
}

Here key1 and key2 could be existing or new keys for the dataset.
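
Sent with Python's requests library, the bulk update above would look roughly like this (placeholder connection details):

import requests

BASE_URL = "https://your-api-host"     # placeholder
HEADERS = {"api-key": "YOUR_API_KEY"}  # placeholder
DATASET_ID = "12345"                   # hypothetical dataset ID

payload = {
    "documents": [
        {"id": 101, "key1": "updated value", "key2": "value2"},
        {"id": 102, "key1": "value1", "key2": "value2"},
    ],
    "is_background_task": True,
}

resp = requests.put(
    f"{BASE_URL}/v1/gpt/datasets/{DATASET_ID}/documents",
    headers=HEADERS,
    json=payload,
)
resp.raise_for_status()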

Bulk update document metadata within a dataset

Using the endpoint PATCH /v1/gpt/datasets/{dataset_id}/documents, you can bulk update metadata of documents within a specific dataset. Here is an example request body:

{
  "ids": [
    "idx-1",
    "idx-2"
  ],
  "is_update_all": false,
  "status": "trained",
  "image_train_status": "trained",
  "active_status": "archived"
}

Here, is_update_all determines whether all documents should be updated; the default is false. If the is_update_all parameter is set to true, the ids parameter will be ignored. To update the metadata of specific documents, provide the list of document IDs in ids.

You can specify what to update using the parameters below:

  • status: Status of text training (possible values are trained, untrained).
  • image_train_status: Status of image training (possible values are trained, untrained).
  • active_status: Status of document (possible values are archived, active).
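
For example, the sketch below archives two specific documents; it assumes you only need to send the metadata fields you want to change (placeholder connection details):

import requests

BASE_URL = "https://your-api-host"     # placeholder
HEADERS = {"api-key": "YOUR_API_KEY"}  # placeholder
DATASET_ID = "12345"                   # hypothetical dataset ID

# Archive two documents by IDX; documents not listed are left untouched.
payload = {
    "ids": ["idx-1", "idx-2"],
    "is_update_all": False,
    "active_status": "archived",
}

resp = requests.patch(
    f"{BASE_URL}/v1/gpt/datasets/{DATASET_ID}/documents",
    headers=HEADERS,
    json=payload,
)
resp.raise_for_status()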

Delete multiple documents under a dataset

To delete multiple documents under a dataset, use the endpoint DELETE /v1/gpt/datasets/{dataset_id}/documents.

Example request body:

{
  "ids": [
    "string"
  ],
  "is_delete_all": false
}

Here you have to pass the dataset_id to specify the dataset. If is_delete_all is true, the ids value is ignored. Be extremely careful here not to accidentally delete all documents in the dataset.
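
A sketch of deleting specific documents, keeping is_delete_all explicitly false as a safeguard (placeholder connection details):

import requests

BASE_URL = "https://your-api-host"     # placeholder
HEADERS = {"api-key": "YOUR_API_KEY"}  # placeholder
DATASET_ID = "12345"                   # hypothetical dataset ID

payload = {
    "ids": ["doc001", "doc002"],
    "is_delete_all": False,  # leave False unless you really mean to delete every document
}

resp = requests.delete(
    f"{BASE_URL}/v1/gpt/datasets/{DATASET_ID}/documents",
    headers=HEADERS,
    json=payload,
)
resp.raise_for_status()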

Edit single documents

Update a single document

To update a single document, use PUT /v1/gpt/datasets/{dataset_id}/documents/{document_id}. You have to specify the dataset_id and document_id in the URL path. In the request body, pass the keys and values you want to update. Here is an example request body:

{
  "data": {
    "key1": "value1",
    "key2": "value2"
  }
}
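
A sketch of sending this update for one document (placeholder connection details and a hypothetical document ID):

import requests

BASE_URL = "https://your-api-host"      # placeholder
HEADERS = {"api-key": "YOUR_API_KEY"}   # placeholder
DATASET_ID = "12345"                    # hypothetical dataset ID
DOCUMENT_ID = "123e4567-e89b-12d3-a456-426614174000"  # hypothetical document ID

payload = {"data": {"key1": "value1", "key2": "value2"}}

resp = requests.put(
    f"{BASE_URL}/v1/gpt/datasets/{DATASET_ID}/documents/{DOCUMENT_ID}",
    headers=HEADERS,
    json=payload,
)
resp.raise_for_status()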

Update metadata of a single document

A document has three metadata fields that you can update using PATCH /v1/gpt/datasets/{dataset_id}/documents/{document_id}, specifying the dataset_id and document_id. Here is an example request body:

{
  "status": "trained",
  "image_train_status": "trained",
  "active_status": "archived"
}

Delete a single document

To delete a single document, use DELETE /v1/gpt/datasets/{dataset_id}/documents/{document_id}, specifying the dataset_id and document_id.
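
And a final sketch of the single-document delete call (placeholder connection details and a hypothetical document ID):

import requests

BASE_URL = "https://your-api-host"      # placeholder
HEADERS = {"api-key": "YOUR_API_KEY"}   # placeholder
DATASET_ID = "12345"                    # hypothetical dataset ID
DOCUMENT_ID = "123e4567-e89b-12d3-a456-426614174000"  # hypothetical document ID

resp = requests.delete(
    f"{BASE_URL}/v1/gpt/datasets/{DATASET_ID}/documents/{DOCUMENT_ID}",
    headers=HEADERS,
)
resp.raise_for_status()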