Enable ChatGPT To Answer Recent Events

ChatGPT

As you may know, ChatGPT models were trained in data only up until September 2021. So what are the steps to follow if you want to enable ChatGPT to answer recent events? For example the 2022 FIFA World Cup winner in Qatar?

This is exactly what I’m going to show you in this article. You can then use these steps to train GPT models on your own data, and get answer-questions chat adapted to your own use.

How to use Wikipedia API to build a dataset using Python?

Before outlining the various steps to enable ChatGPT, using the GPT-3.5-Turbo model, to answer to questions regarding recent events, I will gather information on the 2022 FIFA World Cup in Qatar through the use of the Wikipedia API.

To discover how I did it, have a look at this article (because including the details here will be very long):

How to build a corpus using Wikipedia API in Python

Here is the dataset I built and prepared to be used:

df=pd.read_csv("embeddings_2022_fifa_world_cup.csv",index_col=False)  
## When loading df from csv, need to convert string embedding to list type
import ast
df['embedding'] = df['embedding'].apply(ast.literal_eval)
df.head()

Having obtained the desired corpus, my objective now is to ask a question and retrieve the answer from the information it contains. However, prior to reaching this stage, let me establish the meaning of a similarity score.

Similarity Score

In the following, we will calculate a similarity score between the question we want to ask and every item in the corpus, which enables us to provide the GPT-3.5 model with the most pertinent context for generating an answer to our inquiry.

We will be using cosine distance to calculate this score.

First we will generate an embedding vector (x) for our question, and then compare it to the embedding vector for each input in our corpus (y), by using this relationship:

similarity_score = 1 – spatial.distance.cosine(x, y)

from scipy import spatial

def similarity_score_top_inputs(
    query: str, #our question
    df: pd.DataFrame, #inputs corpus
    top_n: int = 50,
    embedding_model : str="text-embedding-ada-002",
    similarity_score_func=lambda x, y: 1 - spatial.distance.cosine(x, y),
) -> tuple[list[float],list[str]]:
    """Returns top_n inputs similar to the query."""

    #Embedding vector of our query
    query_embedding_response = openai.Embedding.create(
        model=embedding_model,
        input=query,
    )
    query_embedding = query_embedding_response["data"][0]["embedding"]
 
    strings_similarity_score = [
        (similarity_score_func(query_embedding, row["embedding"]), row["text"])
        for i, row in df.iterrows()
    ]
   
    strings_similarity_score.sort(key=lambda x: x[0], reverse=True)
    similarity_score, inputs= zip(*strings_similarity_score)
   
    return similarity_score[:top_n], inputs[:top_n]

Here are the top 10 inputs similar to our query “Who is the winner of world cup 2022?” :

similarity_scores, inputs= similarity_score_top_inputs("Who is the winner of world cup 2022?", df, top_n=10)
for similarity_score, string in zip(similarity_scores, inputs):
    print(f"{similarity_score=:.3f}, \n{string[:200]}\n")

Here, I print only the top 3 outputs:

similarity_score=0.858,
2022 FIFA World Cup

The 2022 FIFA World Cup was the 22nd FIFA World Cup, the quadrennial world 
championship for national football teams organized by FIFA. It took place in 
Qatar from 20 November to 1…

similarity_score=0.856,
2022 FIFA World Cup seeding

Personnel involved

The 32 nations involved in the 2022 World Cup (29 of which were known) were drawn
into eight groups of four. Two of the remaining three spots were fill…

similarity_score=0.855,
2022 FIFA World Cup final

Background

Argentina had won the World Cup twice before, in 1978 and 1986. They had also 
finished as losing finalists thrice, in 1930, 1990 and 2014. After the 2014 final l…

Prompt preparation

We will send in the OpenAI API, the whole relevant sources to our question:

We will use the similarity function that we just built, to get the most relevant inputs (top 10 for example) to our question.
We will build a message text to incorporate this context
We will check also the number of tokens to send using Tiktoken library

def num_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def message_preparation(
    query: str,
    df: pd.DataFrame,
    token_budget: int,
    model: str,
    top_n: int = 10
) -> str:
    """Return an organized message to ask to GPT, with relevant inputs text pulled from the corpus 'df'."""
    similarity_score, inputs_string = similarity_score_top_inputs(query, df, top_n)
   
    question = f"\n\nQuestion: {query}"

    introduction = 'Use the below articles from Wikipedia to answer the question. If the answer cannot be found in the articles, write "I could not find an answer."'
    message = introduction + '\n\nWikipedia article section:\n'
   
    for input_str in inputs_string:
        next_article = f'"""\n{input_str}\n"""\n'
        whole_message = message + next_article + question
        if (
            num_tokens(whole_message, model=model)
            > token_budget
        ):
            break
        else:
            message += next_article
    return message + question

Here is how the prompt will look like for our question “Who is the winner of world cup 2022?” by returning the top 3 relevant inputs from our corpus:

query = "Who is the winner of the World Cup 2022?"
print(message_preparation(query, df, token_budget=4096, model='gpt-3.5-turbo', top_n=3))

Before asking GPT-3.5-Turbo model by giving the detailed context that we just built, let’s try asking using the standard way we already know:

Ask ChatGPT, GPT-3.5-Turbo model, without context

messages = [
        {"role": "user", "content": 'Who is the winner of world cup 2022?'},
]
response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0
    )
response_message = response["choices"][0]["message"]["content"]
response_message

This is what it would appear like if I had not taken into account the context in my message and had relied solely on the data that the model was trained on up until September 2021:

As an AI language model, I do not have the ability to predict future events. 
Therefore, I cannot provide an answer to this question.