Discover in this article what the ChatGPT API Moderation model is, the 7 categories it uses, and how to call and interpret them.
ChatGPT API Moderation model
The OpenAI API makes it possible to classify any text and check that it complies with OpenAI's usage policies, using a binary classification. This classification is provided by the Moderation model, which you can call with the openai package in Python.
7 categories are used in the model: hate, hate/threatening, self-harm, sexual, sexual/minors, violence, violence/graphic.
You can use them to filter inappropriate content (comments on a website, client inputs in chatbot requests…).
If you don’t know how to use the ChatGPT openai API, have a look at this article:
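If the openai package is not installed yet, here is a minimal setup sketch (assuming the pre-1.0 openai Python SDK, which exposes openai.Moderation, and an API key stored in the OPENAI_API_KEY environment variable):
# pip install "openai<1"   (the pre-1.0 SDK used throughout this article)
import os
import openai

# Read the API key from the environment instead of hard-coding it
openai.api_key = os.getenv("OPENAI_API_KEY")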
OpenAI API Moderation method
The method to call for the moderation classification is openai.Moderation.create:
import openai

openai.api_key = "YOUR_API_KEY"

# Send the text to the Moderation endpoint for classification
reponse_moderate = openai.Moderation.create(
    input="Text you want to classify"
)
reponse_moderate
The response is a JSON object:
<OpenAIObject id=modr-73iFrUcKELxPUtZr9FXVgeb78AcCX at 0x274b2e5b6a8> JSON: {
"id": "modr-73iFrUcKELxPUtZr9FXVgeb78AcCX",
"model": "text-moderation-004",
"results": [
{
"categories": {
"hate": false,
"hate/threatening": false,
"self-harm": false,
"sexual": false,
"sexual/minors": false,
"violence": false,
"violence/graphic": false
},
"category_scores": {
"hate": 1.006538423098391e-05,
"hate/threatening": 2.5508761769543753e-09,
"self-harm": 2.4448423840972566e-10,
"sexual": 5.1720278861466795e-05,
"sexual/minors": 1.4936508705432061e-06,
"violence": 7.360013682955469e-07,
"violence/graphic": 1.3201047011079936e-07
},
"flagged": false
}
]
}
In the JSON object, you have the following fields (a short parsing sketch follows the list):
- model: the model used here is “text-moderation-004”.
- results: which contains:
  - categories: for each of the 7 categories, a binary classification:
    - true: if the input text violates the given category
    - false: if it does not
  - category_scores: for each category, a score between 0 and 1 is computed. It is not a probability. The lower the score, the more acceptable the content; the higher the score, the more likely it violates the corresponding category.
  - flagged: the final classification of the input:
    - “false” if the input text does not violate OpenAI’s policies.
    - “true” if it does: if at least one category is true, this flag is set to true too.
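As an illustration, here is a minimal sketch of how these fields can be read in Python (reusing the reponse_moderate object obtained above):
result = reponse_moderate["results"][0]

# Overall decision
print("flagged:", result["flagged"])

# Which of the 7 categories were triggered
for category, violated in result["categories"].items():
    if violated:
        print("violated category:", category)

# Raw scores, useful if you want to apply your own stricter threshold
for category, score in result["category_scores"].items():
    print(f"{category}: {score:.6f}")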
Moderation API Call
Standard Call
reponse_moderate = openai.Moderation.create(
input="I love chocolate"
)
output = reponse_moderate["results"][0]
output['flagged']
False
The classification of the prompt “I love chocolate” is “false”, meaning it does not violate any of the above categories.
Here is the detailed output:
<OpenAIObject at 0x1b9b15d4108> JSON: {
"categories": {
"hate": false,
"hate/threatening": false,
"self-harm": false,
"sexual": false,
"sexual/minors": false,
"violence": false,
"violence/graphic": false
},
"category_scores": {
"hate": 2.204801852201399e-08,
"hate/threatening": 6.550743563045469e-13,
"self-harm": 1.5246037765592746e-10,
"sexual": 4.902125851913297e-07,
"sexual/minors": 2.614277405665888e-10,
"violence": 8.451511490648045e-09,
"violence/graphic": 1.1333047694739307e-10
},
"flagged": false
}
All scores are very low, thus the given categories are all “false”.
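To see at a glance which category gets the highest score for a given input, you can sort the scores (a small sketch, reusing the output variable from above):
# Sort the category scores from highest to lowest
sorted_scores = sorted(
    output["category_scores"].items(),
    key=lambda item: item[1],
    reverse=True,
)
for category, score in sorted_scores:
    print(f"{category}: {score:.2e}")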
Call with a violation
The prompt given in the following request is just for illustration. It is not a personal opinion.
reponse_moderate = openai.Moderation.create(
input="I hate religions"
)
output = reponse_moderate["results"][0]
output
<OpenAIObject at 0x1b9b15d4b28> JSON: {
"categories": {
"hate": true,
"hate/threatening": false,
"self-harm": false,
"sexual": false,
"sexual/minors": false,
"violence": false,
"violence/graphic": false
},
"category_scores": {
"hate": 0.5235283970832825,
"hate/threatening": 4.682948713252699e-07,
"self-harm": 4.25570328976832e-10,
"sexual": 3.632946610210297e-09,
"sexual/minors": 5.008460313149499e-10,
"violence": 2.7335980121279135e-05,
"violence/graphic": 5.753431620014737e-10
},
"flagged": true
}
The flagged field is “true”, meaning there is a violation. This is because the input violates the first category, “hate”, with a score of about 0.52, while all the other categories show very low scores.
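If the built-in binary categories are too permissive (or too strict) for your use case, one possible approach is to apply your own threshold on the raw scores instead. Here is a hedged sketch, where the threshold value 0.3 is an arbitrary assumption to adjust to your own tolerance:
CUSTOM_THRESHOLD = 0.3  # assumption: tune this value for your use case

def is_violation(result, threshold=CUSTOM_THRESHOLD):
    # Flag the input if any category score exceeds the custom threshold
    return any(score >= threshold for score in result["category_scores"].values())

print(is_violation(output))  # True for the "I hate religions" example above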
Some variants
When the input describes a personal belief, the classification is correct. However, when it is phrased as a general statement, the model does not classify it as violating the policies.
Here is an example where the classification is “false” even though the input has a negative connotation:
reponse_moderate = openai.Moderation.create(
input="religions are all about hate"
)
output = reponse_moderate["results"][0]
output
<OpenAIObject at 0x1b9b15b4dc8> JSON: { "categories": {
"hate": false,
"hate/threatening": false,
"self-harm": false,
"sexual": false,
"sexual/minors": false,
"violence": false,
"violence/graphic": false
},
"category_scores": {
"hate": 0.036225177347660065,
"hate/threatening": 4.437977763060985e-10,
"self-harm": 1.7395539231301882e-08,
"sexual": 1.8596454154362618e-08,
"sexual/minors": 2.5413976700860985e-08,
"violence": 2.958734341973468e-07,
"violence/graphic": 6.224222470763152e-09
},
"flagged": false
}
Here is another variant, where a simple comma can widely change the score (the classification in both cases is “true”):
reponse_moderate = openai.Moderation.create(
input="I hate religions from the core of my being"
)
output = reponse_moderate["results"][0]
output
<OpenAIObject at 0x1b9b15e8ac8> JSON: {
"categories": {
"hate": true,
"hate/threatening": false,
"self-harm": false,
"sexual": false,
"sexual/minors": false,
"violence": false,
"violence/graphic": false
},
"category_scores": {
"hate": 0.6662052869796753,
"hate/threatening": 1.774750444383244e-06,
"self-harm": 3.110680779627728e-08,
"sexual": 1.2271056171186956e-08,
"sexual/minors": 2.0201695871691072e-09,
"violence": 0.00023120403056964278,
"violence/graphic": 9.158834579636732e-09
},
"flagged": true
}
The hate score is about 0.66.
reponse_moderate = openai.Moderation.create(
input="I hate religions, from the core of my being"
)
output = reponse_moderate["results"][0]
output
Here the hate score is about 0.954 (with just an added comma):
<OpenAIObject at 0x1b9b15e8f48> JSON: {
"categories": {
"hate": true,
"hate/threatening": false,
"self-harm": false,
"sexual": false,
"sexual/minors": false,
"violence": false,
"violence/graphic": false
},
"category_scores": {
"hate": 0.9543549418449402,
"hate/threatening": 2.622578904265538e-06,
"self-harm": 3.4972796925103466e-07,
"sexual": 8.083279112724995e-08,
"sexual/minors": 2.944311905395125e-09,
"violence": 0.00024133680562954396,
"violence/graphic": 5.054770468859715e-08
},
"flagged": true
}
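To explore this sensitivity more systematically, you can compare the hate score for several phrasings side by side (a small sketch; the list of variants is just an illustration):
variants = [
    "I hate religions",
    "I hate religions from the core of my being",
    "I hate religions, from the core of my being",
]

for text in variants:
    result = openai.Moderation.create(input=text)["results"][0]
    print(f"{result['category_scores']['hate']:.4f}  flagged={result['flagged']}  {text}")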
"hate": 0.9543549418449402,
Summary
In this article, you have learned how to use the ChatGPT API Moderation model, which you can put in place in your own project or website to filter out inputs or comments that violate OpenAI's usage policies.
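For instance, a simple moderation gate in front of a chatbot could look like this (a minimal sketch; moderate_or_reject is a hypothetical helper name, not part of the OpenAI API):
def moderate_or_reject(user_input):
    # Hypothetical helper: return the text if it passes moderation, None otherwise
    result = openai.Moderation.create(input=user_input)["results"][0]
    if result["flagged"]:
        return None
    return user_input

text = moderate_or_reject("I love chocolate")
if text is None:
    print("Sorry, this message violates our content policy.")
else:
    print("Message accepted:", text)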
I hope you enjoyed reading this article. Leave me a comment 👇.