
ChatGPT API Temperature and Top_p



Learn what the ChatGPT API temperature and top_p parameters are and how to use them in Python.

What is Temperature in the ChatGPT API?

The ChatGPT API temperature parameter is a hyperparameter used in language models (such as GPT-2 and GPT-3) to control the randomness of the generated text. In the ChatGPT API it is passed to the ChatCompletion function (more details below).

It controls how much weight the model gives to low-probability words when generating the next token in the sequence.

The parameter controls the degree of randomness or creativity (depending on the task) in the generated output by rescaling the logits before the softmax function turns them into probabilities:

softmax(x_i) = exp(x_i / temperature) / ∑_j exp(x_j / temperature)

A lower value of the temperature parameter will lead to a more predictable and deterministic output, while a higher value will produce a more random and surprising output.
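To see what this formula does in practice, here is a minimal NumPy sketch (the helper name and the example logits are hypothetical, not anything returned by the API):

import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits into probabilities, scaled by a temperature."""
    scaled = np.array(logits) / temperature
    exps = np.exp(scaled - np.max(scaled))  # subtract the max for numerical stability
    return exps / exps.sum()

# Hypothetical logits for four candidate next tokens
logits = [2.0, 1.0, 0.5, 0.1]

for t in (0.2, 1.0, 1.9):
    print(t, softmax_with_temperature(logits, t).round(3))
# Low temperature concentrates probability on the top token,
# while high temperature flattens the distribution.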

According to the OpenAI documentation, for gpt-3.5-turbo the temperature can range from 0 to 2, with a default value of 1.

This value can be adjusted by the user when generating the text to fit the desired output.

Let’s see how it works. In what follows, I will use the openai Python package to call the ChatGPT API.

ChatGPT API – Python

If you don’t know how to set up and use the ChatGPT API with Python, have a look at this article:

Temperature = 1: Default value

import openai

openai.api_key = "YOUR_API_KEY"

%%time
# temperature is omitted here, so the default value of 1 is used
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user",
               "content": "What are the different examples of activation functions used in deep learning models, their definition and their formulas?"}]
)

first_response = response.choices[0].message.content
print(first_response)

I get a list of activation functions (not complete, but representative), each with a short definition and its formula. It takes 56.4 s to get the answer:

Note that the API is much slower than the web chat interface (see the “API Performance” section at the end of this article).
1. Sigmoid Function:
The sigmoid function takes any range real number and returns a value between 0 and 1. It is one of the most commonly used activation functions in deep learning.

Formula: f(x) = 1/(1+e^-x)

2. ReLU Function:
ReLU stands for Rectified Linear Unit. It is a non-linear activation function, which returns the input if it's positive, and if not, it returns zero.

Formula: f(x) = max(0,x)

3. Tanh Function:
The Tanh function is a popular activation function that is symmetric around the origin, which means it returns values between -1 and 1.

Formula: f(x) = (e^x - e^-x) / (e^x + e^-x)

4. Leaky ReLU Function:
Leaky ReLU is a variant of the ReLU function, which allows a small, non-zero gradient when the input is negative, solving the "dying ReLU" problem.

Formula: f(x) = max(0.01x, x)

5. Softmax Function:
The Softmax function is another commonly used activation function. It returns an output in the range of [0,1] and ensures that the sum of all output values is 1, which makes it ideal for multi-class classification problems.

Formula: f(x) = e^xi / ∑(e^xj)
Wall time: 56.4 s

Temperature = 0

%%time
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user",
               "content": "What are the different examples of activation functions used in deep learning models?"}],
    temperature=0
)

second_response = response.choices[0].message.content

print(second_response)

I get a more straightforward answer, and it takes 39.9 s (less than the first call), although the first call also gave valid answers.

1. Sigmoid Function: The sigmoid function is a commonly used activation function in deep learning models. It maps any input value to a value between 0 and 1. The formula for the sigmoid function is:

f(x) = 1 / (1 + e^-x)

2. ReLU Function: The Rectified Linear Unit (ReLU) function is another popular activation function used in deep learning models. It maps any input value less than 0 to 0 and any input value greater than 0 to the input value itself. The formula for the ReLU function is:

f(x) = max(0, x)

3. Tanh Function: The hyperbolic tangent (tanh) function is similar to the sigmoid function, but it maps any input value to a value between -1 and 1. The formula for the tanh function is:

f(x) = (e^x - e^-x) / (e^x + e^-x)

4. Softmax Function: The softmax function is commonly used in the output layer of deep learning models for classification tasks. It maps any input value to a probability distribution over multiple classes. The formula for the softmax function is:

f(x_i) = e^x_i / sum(e^x_j) for j = 1 to n

where n is the number of classes and x_i is the input value for class i.

5. Leaky ReLU Function: The Leaky ReLU function is a modified version of the ReLU function that allows for a small non-zero gradient for negative input values. The formula for the Leaky ReLU function is:

f(x) = max(0.01x, x)
Wall time: 39.9 s

Temperature = 1.9

%%time
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user",
               "content": "What are the different examples of activation functions used in deep learning models?"}],
    temperature=1.9
)

third_response = response.choices[0].message.content

print(third_response)

The answer has nothing to do with the question, and the request takes much longer: 4 min 54 s.

Here is a quick overview of the answer:

Now let’s have a look at another parameter, top_p, which is also used in the ChatGPT API.

What is top_p in the ChatGPT API?

The top_p parameter can also be used to control the randomness of the outputs. 

The OpenAI documentation recommends modifying either temperature or top_p, but not both.

Top_p sampling, also called nucleus sampling, sets a probability threshold (default value 1 in the API). This threshold represents the proportion of the probability mass to consider for the next token. In other words, the model keeps only the top tokens whose probabilities add up to the given threshold.

For example, if we set top_p to 0.05, the model, once it has generated the probability distribution, only considers the highest-probability tokens whose probabilities sum up to 5%. It then randomly selects the next token among these tokens, according to their likelihoods.
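To make the idea concrete, here is a minimal NumPy sketch of nucleus sampling over a hypothetical probability distribution (the function and the numbers are illustrative, not the API’s internal implementation):

import numpy as np

def top_p_sample(probs, top_p=0.9, seed=0):
    """Sample a token index from the smallest set of tokens whose
    cumulative probability reaches the top_p threshold (nucleus sampling)."""
    rng = np.random.default_rng(seed)
    order = np.argsort(probs)[::-1]                  # tokens sorted by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # smallest prefix whose mass reaches top_p
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()     # renormalize over the kept tokens
    return rng.choice(kept, p=kept_probs)

# Hypothetical distribution over five candidate tokens
probs = np.array([0.45, 0.25, 0.15, 0.10, 0.05])
print(top_p_sample(probs, top_p=0.5))  # only the two most likely tokens can be picked here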

The effect of top_p sampling also depends on the quality and size of the data used to train the model. For machine learning topics, where the training data is plentiful and of good quality, the answers do not differ much when the value of top_p changes.

You can run this code and see how the outputs differ:

import time

top_ps = [0, 0.1, 0.5, 0.8, 1]

start_time = time.time()
for top_p_v in top_ps:
    print(top_p_v)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": "What are the different examples of activation functions used in deep learning models?"}],
        top_p=top_p_v
    )
    call_time = time.time()
    print("duration", call_time - start_time)  # seconds elapsed for this call
    start_time = call_time

    print(response.choices[0].message.content)
    print("")

I give here just the first (top_p=0) and the last (top_p=1) outputs:

As you can see, the top_p=1 answer is wordier than the top_p=0 one. But fundamentally, as noted above, because the training data on this topic is plentiful and of good quality, we end up with good results in both cases.

0
duration 65.72348475456238
There are several activation functions used in deep learning models, including:

1. Sigmoid function: It maps any input value to a value between 0 and 1. It is commonly used in binary classification problems.

2. ReLU (Rectified Linear Unit) function: It maps any input value to 0 if it is negative, and to the same value if it is positive. It is commonly used in convolutional neural networks (CNNs).

3. Tanh (Hyperbolic Tangent) function: It maps any input value to a value between -1 and 1. It is commonly used in recurrent neural networks (RNNs).

4. Softmax function: It maps any input value to a probability distribution over multiple classes. It is commonly used in multi-class classification problems.

5. Leaky ReLU function: It is similar to the ReLU function, but it allows a small positive slope for negative input values. It is used to prevent the dying ReLU problem.

6. ELU (Exponential Linear Unit) function: It is similar to the ReLU function, but it allows a small negative slope for negative input values. It is used to prevent the dying ReLU problem and improve the performance of deep neural networks.

7. Swish function: It is a recently proposed activation function that is similar to the sigmoid function, but it has a smoother curve and can improve the performance of deep neural networks.
1
duration 66.14234447479248
There are several examples of activation functions used in deep learning models, including:

1. Sigmoid function: The sigmoid function is a common activation function, which maps the input to a value between 0 and 1. It is primarily used in binary classification problems.

2. ReLU: The Rectified Linear Unit (ReLU) is one of the most widely used activation functions in deep learning. It maps the input to 0 if negative or to the same value if positive.

3. Tanh function: Tanh is a hyperbolic tangent function that maps the input to a value between -1 and 1. It is used in classification and regression problems.

4. Softmax function: The softmax function is a popular activation function used at the output layer of neural networks for multi-class classification.

5. LeakyReLU: LeakyReLU is an upgraded version of ReLU, where the negative values have a small non-zero positive slope to prevent dead neurons.

6. SeLU: Scaled exponential linear units (SeLU) is an activation function designed to improve the performance of deep neural networks. It combines the benefits of ReLU and tanh without being affected by vanishing gradients.

7. Swish: Swish is a new activation function, which is reported to outperform traditional functions because of its smoothness, having a thresholding effect, and computational efficiency.

API Performance

Note that the API performance is not great: my calls take about 40 s on average, while the web GUI is much faster. I am sending small requests, with around 50 prompt_tokens on average, and getting back around 300 completion_tokens.
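If you want to check these numbers for your own calls, the usage field of the ChatCompletion response reports them. A small sketch, reusing the response object from the code above:

# Token usage reported by the API for a single call
usage = response["usage"]
print("prompt_tokens:    ", usage["prompt_tokens"])
print("completion_tokens:", usage["completion_tokens"])
print("total_tokens:     ", usage["total_tokens"])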

I only observed this slowness on the free tier. ChatGPT Plus (the paid version) seems to be faster, but I didn’t test it, so I cannot confirm that.

Summary

You have learned how to use the temperature and top_p parameters of the ChatGPT API in Python.

The temperature controls the randomness of the answer. Higher values lead to more random and surprising answers, while lower values generate the most probable outputs. I usually use the default temperature; if the results start to become too varied, I lower the value.

The top_p parameter can also be used to shape the outputs. Lower values tend to give shorter, more focused answers. OpenAI recommends not modifying both parameters at the same time.

Note that there is another way to make ChatGPT more creative: we can tell it what tone to use when answering our questions. How to influence the tone of ChatGPT answers will be explained in another article.
