Blogging about AI and LLMs (Large Language Models) in Finance

Enable ChatGPT To Answer Recent Events


As you may know, ChatGPT models were trained on data only up to September 2021. So what steps should you follow if you want ChatGPT to answer questions about recent events, such as who won the 2022 FIFA World Cup in Qatar?

This is exactly what I’m going to show you in this article. You can then apply the same steps to your own data and get a question-answering chat adapted to your own use. The model itself is not retrained: the relevant text is passed to it as context in the prompt.

How to use the Wikipedia API to build a dataset with Python?

Before outlining the steps that enable ChatGPT, using the GPT-3.5-Turbo model, to answer questions about recent events, I will gather information on the 2022 FIFA World Cup in Qatar through the Wikipedia API.

To discover how I did it, have a look at this article (including the details here would make this post very long):

Here is the dataset I built and prepared to be used: 

import ast

import pandas as pd

df = pd.read_csv("embeddings_2022_fifa_world_cup.csv", index_col=False)
# When loading df from CSV, the embeddings come back as strings; convert them to lists
df['embedding'] = df['embedding'].apply(ast.literal_eval)
df.head()
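
The conversion is needed because a CSV file stores every cell as text, so a list of floats comes back as a string such as "[0.1, -0.2, 0.3]". Here is a minimal sketch of that round trip using only the standard library. The two-row CSV is made up for illustration and is not the real dataset:

```python
import ast
import csv
import io

# Hypothetical miniature of the embeddings file: two rows, 3-dimensional vectors.
csv_text = (
    'text,embedding\n'
    '"Argentina won the final.","[0.1, -0.2, 0.3]"\n'
    '"Qatar hosted the tournament.","[0.0, 0.5, -0.5]"\n'
)

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Straight after reading, each embedding is still a plain string.
raw_type = type(rows[0]["embedding"])

# ast.literal_eval parses the string into a Python list of floats
# without executing arbitrary code, unlike eval.
for row in rows:
    row["embedding"] = ast.literal_eval(row["embedding"])

print(raw_type.__name__, "->", type(rows[0]["embedding"]).__name__)
print(rows[0]["embedding"])
```

The same idea applies to the pandas version above: `apply(ast.literal_eval)` performs this parsing once per row of the `embedding` column.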

Having obtained the desired corpus, my objective now is to ask a question and retrieve the answer from the information it contains. However, prior to reaching this stage, let me establish the meaning of a similarity score.

Similarity Score

In the following, we will calculate a similarity score between the question we want to ask and every item in the corpus, which enables us to provide the GPT-3.5 model with the most pertinent context for generating an answer to our inquiry.

We will be using cosine distance to calculate this score.

First we will generate an embedding vector (x) for our question, and then compare it to the embedding vector for each input in our corpus (y), by using this relationship:

similarity_score = 1 - spatial.distance.cosine(x, y)

import openai  # written against the pre-1.0 openai package (openai.Embedding)
import pandas as pd
from scipy import spatial

def similarity_score_top_inputs(
    query: str,  # our question
    df: pd.DataFrame,  # inputs corpus, with "text" and "embedding" columns
    top_n: int = 50,
    embedding_model: str = "text-embedding-ada-002",
    similarity_score_func=lambda x, y: 1 - spatial.distance.cosine(x, y),
) -> tuple[tuple[float, ...], tuple[str, ...]]:
    """Return the top_n similarity scores and the matching corpus inputs."""

    # Embedding vector of our query
    query_embedding_response = openai.Embedding.create(
        model=embedding_model,
        input=query,
    )
    query_embedding = query_embedding_response["data"][0]["embedding"]

    # Similarity score between the query and every input in the corpus
    strings_similarity_score = [
        (similarity_score_func(query_embedding, row["embedding"]), row["text"])
        for i, row in df.iterrows()
    ]

    # Most similar inputs first
    strings_similarity_score.sort(key=lambda x: x[0], reverse=True)
    similarity_score, inputs = zip(*strings_similarity_score)

    return similarity_score[:top_n], inputs[:top_n]
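
To make the score concrete, here is a small sketch that computes cosine similarity by hand and ranks a toy corpus, without scipy or the OpenAI API. The 2-dimensional vectors and their labels are invented for illustration, and `cosine_similarity` is a hypothetical helper that should give the same value as `1 - spatial.distance.cosine(x, y)` for non-zero vectors:

```python
import math

def cosine_similarity(x: list[float], y: list[float]) -> float:
    """Cosine similarity: dot(x, y) / (|x| * |y|), i.e. 1 - cosine distance."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Toy 2-dimensional "embeddings" (real ada-002 vectors have 1536 dimensions).
query_vec = [1.0, 0.0]
corpus = {
    "same direction": [2.0, 0.0],
    "45 degrees away": [1.0, 1.0],
    "orthogonal": [0.0, 3.0],
}

# Score every input against the query, then sort with the most similar first.
ranked = sorted(
    ((cosine_similarity(query_vec, vec), label) for label, vec in corpus.items()),
    reverse=True,
)
for score, label in ranked:
    print(f"{score:.3f}  {label}")
```

The score depends only on the angle between the vectors, not on their length, which is why `[2.0, 0.0]` scores 1.0 against `[1.0, 0.0]`.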

Here are the top 10 inputs similar to our query “Who is the winner of world cup 2022?” :

similarity_scores, inputs= similarity_score_top_inputs("Who is the winner of world cup 2022?", df, top_n=10)
for similarity_score, string in zip(similarity_scores, inputs):
    print(f"{similarity_score=:.3f}, \n{string[:200]}\n")

Here, I print only the top 3 outputs:

similarity_score=0.858,
2022 FIFA World Cup

The 2022 FIFA World Cup was the 22nd FIFA World Cup, the quadrennial world 
championship for national football teams organized by FIFA. It took place in 
Qatar from 20 November to 1…

similarity_score=0.856,
2022 FIFA World Cup seeding

Personnel involved

The 32 nations involved in the 2022 World Cup (29 of which were known) were drawn
into eight groups of four. Two of the remaining three spots were fill…

similarity_score=0.855,
2022 FIFA World Cup final

Background

Argentina had won the World Cup twice before, in 1978 and 1986. They had also 
finished as losing finalists thrice, in 1930, 1990 and 2014. After the 2014 final l…

Prompt preparation

We will send the OpenAI API the sources most relevant to our question:

  • We will use the similarity function we just built to get the most relevant inputs (the top 10, for example) for our question.
  • We will build a message text that incorporates this context.
  • We will also check the number of tokens to send, using the tiktoken library.

import tiktoken

def num_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def message_preparation(
    query: str,
    df: pd.DataFrame,
    token_budget: int,
    model: str,
    top_n: int = 10
) -> str:
    """Return an organized message to ask to GPT, with relevant inputs text pulled from the corpus 'df'."""
    similarity_score, inputs_string = similarity_score_top_inputs(query, df, top_n)
   
    question = f"\n\nQuestion: {query}"

    introduction = 'Use the below articles from Wikipedia to answer the question. If the answer cannot be found in the articles, write "I could not find an answer."'
    message = introduction + '\n\nWikipedia article section:\n'
   
    for input_str in inputs_string:
        next_article = f'"""\n{input_str}\n"""\n'
        whole_message = message + next_article + question
        if (
            num_tokens(whole_message, model=model)
            > token_budget
        ):
            break
        else:
            message += next_article
    return message + question
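
The packing loop in `message_preparation` can be tried offline with a stand-in counter. In this rough sketch, `count_words` and `pack_context` are hypothetical names, and counting whitespace-separated words is only an assumption that replaces the real tiktoken count:

```python
def count_words(text: str) -> int:
    """Stand-in for num_tokens: counts whitespace-separated words (an assumption)."""
    return len(text.split())

def pack_context(header: str, sections: list[str], question: str, budget: int) -> str:
    """Append sections in ranked order until the next one would exceed the budget."""
    message = header
    for section in sections:
        candidate = message + "\n" + section
        if count_words(candidate + "\n" + question) > budget:
            break  # stop at the first section that does not fit
        message = candidate
    return message + "\n" + question

sections = [
    "Argentina won the final on penalties.",        # 6 words
    "France were the losing finalists in Qatar.",   # 7 words
    "The draw placed 32 teams into eight groups.",  # 8 words
]
prompt = pack_context("Use these articles:", sections, "Question: Who won?", budget=20)
print(prompt)
```

As in the original function, the loop stops at the first section that does not fit, so a shorter section further down the ranking is not considered. The budget also counts prompt tokens only: if it equals the model's full context window, no room is left for the reply, so a somewhat smaller budget may be safer.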

Here is what the prompt looks like for our question “Who is the winner of the World Cup 2022?” when we return the top 3 relevant inputs from our corpus:

query = "Who is the winner of the World Cup 2022?"
print(message_preparation(query, df, token_budget=4096, model='gpt-3.5-turbo', top_n=3))

Before asking the GPT-3.5-Turbo model with the detailed context we just built, let’s try asking the standard way, without any context:

Ask ChatGPT, GPT-3.5-Turbo model, without context

messages = [
        {"role": "user", "content": 'Who is the winner of world cup 2022?'},
]
response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0
    )
response_message = response["choices"][0]["message"]["content"]
response_message

This is the answer when the message carries no context and the model relies solely on its training data, which stops in September 2021:

As an AI language model, I do not have the ability to predict future events. 
Therefore, I cannot provide an answer to this question.

Enable ChatGPT to answer recent events: using the GPT-3.5-Turbo model with detailed context about the World Cup

Now let’s incorporate the context we just built into the “messages” parameter of our request to the OpenAI API:

def chat(
    query: str,
    df: pd.DataFrame = df,
    model: str = "gpt-3.5-turbo",
    token_budget: int = 4096,
    top_n: int = 50
) -> str:
    """Ask GPT-3.5-Turbo model our question by giving a detailed message with a relevant context."""
    message = message_preparation(query, df, token_budget=token_budget, model=model, top_n=top_n)

    messages = [
        {"role": "user", "content": message},
    ]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0
    )
    response_message = response["choices"][0]["message"]["content"]
    return response_message
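
The `openai.Embedding.create` and `openai.ChatCompletion.create` calls used in this article belong to the pre-1.0 interface of the `openai` Python package; versions 1.0 and later replaced them with a client object. Here is a hedged sketch of the same chat request in the newer style. `ask_with_context` is a hypothetical helper name, and the client is passed in as an argument so that the function can be tried without network access:

```python
from types import SimpleNamespace

def ask_with_context(client, message: str, model: str = "gpt-3.5-turbo") -> str:
    """Send one user message through an openai>=1.0 style client and return the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": message}],
        temperature=0,
    )
    return response.choices[0].message.content

# Real usage (needs the openai package >= 1.0 and an OPENAI_API_KEY):
#   from openai import OpenAI
#   message = message_preparation(query, df, token_budget=4096, model="gpt-3.5-turbo")
#   answer = ask_with_context(OpenAI(), message)

# Offline stand-in that mimics the response shape, for a dry run only.
def fake_create(model, messages, temperature):
    reply = SimpleNamespace(content=f"echo: {messages[0]['content']}")
    return SimpleNamespace(choices=[SimpleNamespace(message=reply)])

fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=fake_create))
)
print(ask_with_context(fake_client, "Who is the winner of world cup 2022?"))
```

The stand-in only echoes the prompt; it is there to show the shape of the call and of the response, not to imitate the model.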

Now we can ask ChatGPT questions about the World Cup and get its answers:

Winner:

chat('Who is the winner of world cup 2022?')
Argentina.

Here is a more detailed answer, and it’s really brilliant:

chat('Who is the winner of world cup 2022? answer with details' )
Argentina won the 2022 FIFA World Cup after defeating France 4-2 on penalties
following a 3-3 draw after extra time in the final. It was Argentina’s third 
title and their first since 1986, as well being the first nation from outside of 
Europe to win the tournament since 2002. French player Kylian Mbappé became the 
first player to score a hat-trick in a World Cup final since Geoff Hurst in the 
1966 final and won the Golden Boot as he scored the most goals (eight) during the 
tournament. Argentine captain Lionel Messi was voted the tournament's best 
player, winning the Golden Ball. Teammates Emiliano Martínez and Enzo Fernández 
won the Golden Glove, awarded to the tournament's best goalkeeper, and the Young
Player Award, awarded to the tournament's best young player, respectively.

Defeated team: 

chat('Who is the defeated team at the final of world cup 2022?')
The defeated team at the final of World Cup 2022 was France.

Moroccan team:

chat('How well did the Moroccan team in the world cup 2022?')
The Moroccan team advanced to the round of 16 and later reached the semi-finals in the 2022 FIFA World Cup. They became the first African team to win their group since Nigeria in 1998. It was also the first time Morocco advanced to the knockout stage since 1986.

chat('What are the different teams that Morocco has met during the world cup 2022?')
Morocco has played against Croatia, Belgium, and Canada during the 2022 FIFA World Cup.

The answer is not complete (it didn’t mention Spain, Portugal, France…), probably because of the way the Wikipedia API returns the tables in the articles.

German team:

chat('At which stage of the competition the german team was defeated in the world cup 2022?')
The German team was eliminated from the group stage of the 2022 FIFA World Cup.

Summary

Steps to follow:

  • Build the dataset with the Wikipedia API: prepare it, split the text where it is too long, and generate an embedding for each input
  • Compute the similarity score between the query and each input in the corpus, using embeddings and cosine distance
  • Prepare the final prompt to send to the GPT-3.5-Turbo model, incorporating the top n inputs most similar to our question as context
  • Ask GPT-3.5-Turbo with the whole context

✅ Full version of the notebook here 👉: chat_GPT_models_on_recent_events.ipynb
