Learn what tokenization is in the OpenAI API and explore the Tiktoken library in Python. Learn how to use it to make a call to the ChatGPT API.
What is tiktoken, and what is tokenization in the OpenAI API?
Tiktoken is an open-source tokenization library developed by OpenAI.
Tokenization is the process of splitting a text string into a list of tokens. Tokens can be letters, words, or groups of words, depending on the language.
For example, "I'm playing with AI models" can be transformed into this list: ["I", "'m", " playing", " with", " AI", " models"].
These tokens can then be encoded as integers.
In fact, this example reflects exactly how tiktoken works, i.e. the tokenization used in the OpenAI API.
Before using any NLP model, you need to tokenize your dataset so the model can process it.
Furthermore, OpenAI uses a technique called byte pair encoding (BPE) for tokenization. BPE originated as a data compression algorithm that repeatedly replaces the most frequent pair of bytes in a text with a single new symbol. Applied to tokenization, the same idea merges the most frequent character pairs into progressively larger tokens, which keeps the vocabulary compact and the text easy to process.
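To build some intuition, here is a toy sketch of a single BPE merge step (the corpus and the resulting merge are made up for illustration; OpenAI's actual merge tables are learned from large training corpora):

from collections import Counter

def most_frequent_pair(symbols):
    # Count adjacent symbol pairs and return the most frequent one
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]

# BPE training starts from individual characters
symbols = list("low lower lowest")

# One merge step: fuse the most frequent adjacent pair into a single symbol
pair = most_frequent_pair(symbols)
merged = []
i = 0
while i < len(symbols):
    if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
        merged.append(symbols[i] + symbols[i + 1])
        i += 2
    else:
        merged.append(symbols[i])
        i += 1

print(pair)    # ('l', 'o') -- the pair merged in this step
print(merged)  # the text rewritten with 'lo' as a single symbol

Repeating this merge step builds larger and larger tokens, which is how a BPE vocabulary is grown.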
Why use tokenization in the OpenAI API – the Tiktoken library?
You can use tiktoken to count tokens, because:
- You need to know whether the text you are sending is too long for the model to process
- You need to have an understanding of the costs associated with OpenAI API calls, as they are calculated on a per-token basis.
For example, if you are using the GPT-3.5-turbo model, you will be charged $0.002 / 1K tokens.
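As a quick back-of-the-envelope example with that price (prices vary by model and change over time, so treat the numbers as illustrative):

# Rough cost estimate: 1,500 prompt tokens at $0.002 per 1K tokens
num_tokens = 1500
price_per_1k = 0.002
print(f"Estimated cost: ${num_tokens / 1000 * price_per_1k:.4f}")  # Estimated cost: $0.0030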
You can play with it here, on the OpenAI platform: https://platform.openai.com/tokenizer
How to count the number of tokens using the tiktoken library?
Install and Import
First you need to install it:
pip install tiktoken
Then you import the library and start using it:
import tiktoken
Encoding
OpenAI models use different encodings: cl100k_base, p50k_base, and gpt2.
These encodings depend on the model you are using:
For gpt-4, gpt-3.5-turbo, text-embedding-ada-002, you need to use cl100k_base.
All this information is already built into the tiktoken library, so you don't need to remember it. You can get the encoding using two methods:
- If you know the exact encoding name:
encoding = tiktoken.get_encoding("cl100k_base")
- Alternatively, you can let tiktoken pick the suitable encoding for the model you are using:
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
print(encoding)
<Encoding 'cl100k_base'>
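For example, you can print the encoding tiktoken maps each model to (the model list below is just a sample of models available at the time of writing):

import tiktoken

for model in ["gpt-4", "gpt-3.5-turbo", "text-embedding-ada-002", "text-davinci-003"]:
    print(model, "->", tiktoken.encoding_for_model(model).name)

gpt-4 -> cl100k_base
gpt-3.5-turbo -> cl100k_base
text-embedding-ada-002 -> cl100k_base
text-davinci-003 -> p50k_base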
Tokenization
Let’s tokenize this text:
text = "I'm playing with AI models"
This will return a list of token integers:
tokens_integer=encoding.encode(text)
tokens_integer
[40, 2846, 5737, 449, 15592, 4211]
Now, we can count the number of tokens generated by our text:
print(f"{len(tokens_integer)} is the number of tokens in my text")
6 is the number of tokens in my text
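If you count tokens often, you can wrap this logic in a small helper function (a minimal sketch; the function name count_tokens is my own, not part of tiktoken):

import tiktoken

def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    # Return the number of tokens the given model's encoding produces for the text
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

print(count_tokens("I'm playing with AI models"))

6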
It's worth mentioning that we can obtain the corresponding token string for each integer token by using the encoding.decode_single_token_bytes() function (each string is returned as a bytes object, hence the b prefix):
tokens_string = [encoding.decode_single_token_bytes(token) for token in tokens_integer]
tokens_string
[b'I', b"'m", b' playing', b' with', b' AI', b' models']
Did you notice the space before each word? In tiktoken's encodings, the leading space is part of the token itself.
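You can also decode the integer tokens back to the original string in one step with encoding.decode():

print(encoding.decode(tokens_integer))

I'm playing with AI models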
Compare the number of tokens using tiktoken before and after the API call
Now let's apply tokenization to OpenAI API calls.
We have a message and want to count the number of tokens we are sending.
We also want to know the number of tokens the API actually takes into account.
According to the OpenAI documentation, one needs to use the model gpt-3.5-turbo-0301 instead of gpt-3.5-turbo, because the latter changes over time (the same goes for gpt-4-0314 instead of gpt-4).
The way the number of tokens is counted differs from one model to another.
Let's have a look at how to do it with the gpt-3.5-turbo-0301 model.
- Count the number of tokens in the message to be sent using the API:
message = [{
    "role": "user",
    "content": "Explain to me how tolenization is working in OpenAi models?",
}]
tokens_per_message = 4  # every message follows <|start|>{role/name}\n{content}<|end|>\n

num_tokens = 0
num_tokens += tokens_per_message
for key, value in message[0].items():
    num_tokens += len(encoding.encode(value))
    print(f"{len(encoding.encode(value))} is the number of tokens included in {key}")
num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>

print(f"{num_tokens} is the number of tokens to be sent in our request")
Results:
1 is the number of tokens included in role
15 is the number of tokens included in content
23 is the number of tokens to be sent in our request
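The same logic generalizes to a list with several messages. Here is a sketch of such a helper (the function name is my own; the constants 4 and 3 are specific to gpt-3.5-turbo-0301, and other models may use slightly different values):

import tiktoken

def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301"):
    # Estimate the prompt tokens for a list of chat messages
    encoding = tiktoken.encoding_for_model(model)
    tokens_per_message = 4  # every message follows <|start|>{role/name}\n{content}<|end|>\n
    num_tokens = 0
    for msg in messages:
        num_tokens += tokens_per_message
        for key, value in msg.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":  # a name replaces the role, costing one token less
                num_tokens -= 1
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens

print(num_tokens_from_messages(message))

23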
- Count the number of tokens taken into account by the OpenAI API in our call
If you don't know how to use the OpenAI API (the openai Python library), read my article on it first.
import openai

openai.api_key = 'YOUR_API_KEY'

response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo-0301',
    messages=message,
    temperature=0,
    max_tokens=200,
)
num_tokens_api = response["usage"]["prompt_tokens"]
print(f"{num_tokens_api} number of tokens used by the API")
23 number of tokens used by the API
The number of tokens is the same as what we calculated using ‘tiktoken’.
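The usage field also reports the completion and total token counts, so you can check the full accounting of the call:

print(response["usage"])

Given the numbers above, you should see prompt_tokens: 23, completion_tokens: 200, and total_tokens: 223.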
Furthermore, let's count the number of tokens in the ChatGPT answer:
resp = response["choices"][0]["message"]["content"]
len(encoding.encode(resp))
200
200 is also exactly the value we passed as max_tokens in the request.
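Since the answer consumed exactly max_tokens tokens, it was almost certainly cut off mid-sentence. You can confirm this by checking finish_reason in the response:

print(response["choices"][0]["finish_reason"])

If it prints length, the answer was truncated at the max_tokens limit; stop means the model finished naturally.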
Summary
In this article, you have learned how the tiktoken library works with the OpenAI API. An upcoming article will guide you through an end-to-end project that uses this library, starting from collecting a raw text dataset, tokenizing and embedding it, and finally using gpt-3.5-turbo to ask questions and obtain answers like in the ChatGPT web UI.