Detect AI Content with LLM Part 2: Data Generation and Pre-processing

Dataset Used

For this project, I used data from Kaggle, which comprises test_essays.csv, train_essays.csv and train_prompts.csv. {test|train}_essays.csv consists of four columns: id (unique identifier), prompt_id (identifies the prompt the essay was written in response to), text (the essay) and generated (1 if generated by an LLM, 0 if written by a human). train_prompts.csv includes prompt_id (unique identifier), prompt_name (title of the prompt), instructions (instructions given to students) and source_text (the text of the article the essays were written in response to).

train_essays.csv has 1378 rows of data, but only 3 of them were generated by an LLM. To balance the training dataset, I decided to use GPT-3.5, GPT-4, Llama and Falcon to generate 200 essays each for every prompt.
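The imbalance is easy to confirm by counting the labels. Here is a minimal sketch, assuming pandas is installed. The small DataFrame is made-up stand-in data with the same column names as train_essays.csv, not the real file:

```python
import pandas as pd

# Toy stand-in for train_essays.csv (same columns, invented rows).
toy_essays = pd.DataFrame({
    "id": ["a1", "a2", "a3", "a4", "a5"],
    "prompt_id": [0, 0, 1, 1, 1],
    "text": ["essay one", "essay two", "essay three", "essay four", "essay five"],
    "generated": [0, 0, 0, 0, 1],
})

# Count essays per label: 0 = written by a human, 1 = generated by an LLM.
label_counts = toy_essays["generated"].value_counts()
print(label_counts)

# Share of LLM-generated essays in the data.
llm_share = (toy_essays["generated"] == 1).mean()
print(f"LLM-generated share: {llm_share:.1%}")
```

With the real data, the toy DataFrame would be replaced by pd.read_csv('train_essays.csv').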

Data Generation

First, I used the OpenAI API to access their GPT-3.5 model. Below is the code I used to generate essays based on the prompts in train_prompts.csv.

GPT-3.5-turbo

import asyncio
from openai import AsyncOpenAI
import pandas as pd

client = AsyncOpenAI(
    api_key="my_gpt_api_key",
)

df = pd.read_csv('train_prompts.csv')

essays_per_prompt = 200  # number of essays to generate for each prompt

pre_prompt = "You will be provided with a theme for an essay. You need to write an essay based on the theme provided as if you were a student. Your essay needs to be unique and convincing. Output nothing but the essay."

async def generate_essays(prompt):
    generated_essays = []
    # Put a blank line between the instructions and the prompt so they do not run together.
    prompt_content = f"{pre_prompt}\n\n{prompt}"
    for _ in range(essays_per_prompt):
        response = await client.chat.completions.create(
            messages=[
                {
                    "role": "user",
                    "content": prompt_content
                }
            ],
            model="gpt-3.5-turbo",
            temperature=0.75
        )
        generated_essay = response.choices[0].message.content
        generated_essays.append(generated_essay)
    return generated_essays

async def main():
    tasks = [generate_essays(
        f"Prompt: {row['prompt_name']}\nInstructions: {row['instructions']}\nSource Text: {row['source_text']}"
    ) for index, row in df.iterrows()]
    all_generated_essays = await asyncio.gather(*tasks)
    new_df = pd.DataFrame(columns=['Prompt', 'Generated_Essay'])

    for i, essays_for_prompt in enumerate(all_generated_essays):
        prompt_name = f'{i}'
        prompt_df = pd.DataFrame({'Prompt': [prompt_name] * len(essays_for_prompt), 'Generated_Essay': essays_for_prompt})
        new_df = pd.concat([new_df, prompt_df], ignore_index=True)

    new_df.to_csv('result_gpt.csv', index=False)

asyncio.run(main())

Description of the code:

Import the necessary modules: asyncio, AsyncOpenAI and pandas.
Set up the OpenAI client with AsyncOpenAI.
Load data from train_prompts.csv, define essays_per_prompt and add a pre_prompt to give the model context.
Define the generate_essays function, which generates the specified number of essays for one prompt through asynchronous API calls to the gpt-3.5-turbo model. The temperature parameter controls the randomness of the generated text; 0.75 is a reasonable starting point.
Define the main function, which runs generate_essays concurrently for every prompt, then organises the results into a DataFrame and writes it to a CSV file for later use in training.
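In the code above, each prompt gets its own task, but the 200 requests inside a task still run one after another. One way to send several requests at once without flooding the API is to cap the number of in-flight calls with asyncio.Semaphore. The sketch below shows only the pattern: fake_generate is a hypothetical stand-in for the client.chat.completions.create call, and the limit of 5 is an arbitrary assumption, not a value recommended by OpenAI.

```python
import asyncio

MAX_CONCURRENT = 5  # assumed cap on simultaneous requests; tune to your rate limit

# Bookkeeping for the demo only, to show the cap is respected.
stats = {"in_flight": 0, "peak": 0}

async def fake_generate(prompt: str, i: int) -> str:
    """Hypothetical stand-in for one API call; sleeps instead of using the network."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    return f"essay {i} for {prompt}"

async def generate_bounded(prompt: str, n: int, semaphore: asyncio.Semaphore) -> list:
    async def one(i: int) -> str:
        # At most MAX_CONCURRENT coroutines are inside this block at a time.
        async with semaphore:
            return await fake_generate(prompt, i)
    # gather returns the results in the order the coroutines were passed in.
    return await asyncio.gather(*(one(i) for i in range(n)))

async def demo() -> list:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    return await generate_bounded("toy prompt", 20, semaphore)

essays = asyncio.run(demo())
print(len(essays), "essays, peak concurrency:", stats["peak"])
```

In the real script, the body of fake_generate would be the API call, and one semaphore could be shared across all prompts so the cap applies to the whole run.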

The free tier of my OpenAI API account does not support this amount of work, so I upgraded my account to increase the limit. The total execution time for this code is approximately 4-5 hours.

GPT-4

The code for GPT-4 is identical to the GPT-3.5-turbo code, except that the model name is replaced with “gpt-4”. Even though I upgraded my account to Tier 1, the requests still sometimes exceeded the rate limit set by OpenAI. To work around this, I generated the data in smaller batches of fewer than 200 essays per run.
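Another common way to cope with rate limits is to retry a failed request after a growing delay (exponential backoff). The following is a minimal sketch of that idea, not the approach used for this project. with_backoff, FakeRateLimitError and flaky_call are hypothetical names made up for the illustration, and the retry count and delays are arbitrary assumptions.

```python
import asyncio
import random

class FakeRateLimitError(Exception):
    """Hypothetical stand-in for the rate-limit exception raised by an API client."""

async def with_backoff(make_call, retry_on, max_retries=5, base_delay=1.0):
    """Await make_call(); on a retry_on error, wait and retry with a doubling delay."""
    for attempt in range(max_retries + 1):
        try:
            return await make_call()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries, let the caller see the error
            # The delay doubles on each attempt; random jitter spreads retries out.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            await asyncio.sleep(delay)

# Demo: a call that fails twice with the fake error, then succeeds.
attempts = {"count": 0}

async def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeRateLimitError("simulated rate limit")
    return "essay text"

result = asyncio.run(with_backoff(flaky_call, FakeRateLimitError, base_delay=0.01))
print(result, "after", attempts["count"], "attempts")
```

With the openai Python package (version 1 or later), the exception to pass as retry_on would be openai.RateLimitError, and make_call would wrap the client.chat.completions.create call.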

Llama 70b and Falcon 180b

By the time I got to Llama 70b and Falcon 180b, I had already paid OpenAI the equivalent of two Yakun breakfast sets and waited several hours for the data to be generated. So, instead of investing more money and time in generating new data, I searched the internet for existing LLM-generated essays based on the same prompts. Fortunately, I found this dataset: https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b

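However, if you are interested in generating the data yourself, I do have Python code for the Llama model. The models are hosted on Replicate’s cloud (https://replicate.com/meta/llama-2-70b-chat). All you need to do is request an API token from Replicate and then use the code below. Note that the model identifier in the code points to a 13B chat model (a16z-infra/llama13b-v2-chat), so replace it with the identifier of the 70B model if that is the one you want.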

import os
import replicate
import pandas as pd

os.environ["REPLICATE_API_TOKEN"] = "my_replicate_api_key"

df = pd.read_csv('train_prompts.csv')
results = []

for index, row in df.iterrows():
    pre_prompt = "You will be provided with a theme for an essay. You need to write an essay based on the theme provided as if you were a student. Your essay needs to be unique and convincing. Output nothing but the essay."
    prompt_input = f"Prompt: {row['prompt_name']} Instructions: {row['instructions']} Source Text: {row['source_text']}"

    for _ in range(200):
        output = replicate.run(
            'a16z-infra/llama13b-v2-chat:df7690f1994d94e96ad9d568eac121aecf50684a0b0963b25a41cc40061269e5',
            input={
                "prompt": f"{pre_prompt} {prompt_input} ",
                "temperature": 0.75,
                "top_p": 0.9,
                "max_length": 4096,
                "repetition_penalty": 1
            }
        )
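        # The output arrives as a stream of text chunks that already carry their own spacing,
        # so join them without a separator to avoid stray spaces inside words.
        full_response = "".join(output)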
        
        print(full_response)

        results.append({
            'Response': full_response,
            'generated': 1
        })

results_df = pd.DataFrame(results)

results_df.to_csv('result_llama2.csv', index=False)
print("Generated responses saved to result_llama2.csv")
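The loop above keeps every essay in memory and writes the CSV only at the very end, so an error late in a long run would lose everything generated so far. One possible safeguard is to append each essay to the file as soon as it arrives. This is a minimal sketch using only the standard library; append_row is a hypothetical helper, and the demo writes placeholder text to a temporary directory rather than real model output.

```python
import csv
import os
import tempfile

def append_row(path, row, fieldnames):
    """Append one result to a CSV file, writing the header only if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerow(row)

# Demo with placeholder essays instead of real model output.
demo_path = os.path.join(tempfile.mkdtemp(), "result_llama2_demo.csv")
for i in range(3):
    append_row(
        demo_path,
        {"Response": f"placeholder essay {i}", "generated": 1},
        fieldnames=["Response", "generated"],
    )

# Read the file back to check that all rows were saved.
with open(demo_path, newline="", encoding="utf-8") as f:
    saved_rows = list(csv.DictReader(f))
print(len(saved_rows), "rows saved")
```

In the generation loop, append_row would take the place of results.append, and the final to_csv call would no longer be needed.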

Data Cleaning (Down-Sampling) and Data Preparation (Train Test Split)

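# dataset_raw is the initial dataset (its loading code is not shown here).
# Down-sample every class to the size of the smallest class.
min_count = dataset_raw['generated'].value_counts().min()
balanced_dataset = dataset_raw.groupby('generated').apply(lambda x: x.sample(min_count))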

dataset = balanced_dataset.reset_index(drop=True)

dataset['generated'].value_counts()

dataset['prompt_id'].value_counts()

dataset.head()

from sklearn.model_selection import train_test_split

df = dataset.reset_index(drop=True)

x = df['text']
y = df['generated']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)
x_train.reset_index(drop=True, inplace=True)
x_test.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)

Balancing the data:
The initial dataset, dataset_raw, is grouped by the ‘generated’ column. For each group, a lambda function samples as many rows as there are in the smallest group. This balances the dataset by ensuring an equal number of samples from each category.
Resetting the index:
The index of the dataset is reset to avoid any potential issues with indexing.
Train test split:
The dataset is split into training and testing sets with a 75%-25% ratio and a fixed random seed (random_state=42).
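One detail worth noting is that the down-sampling code calls x.sample(...) without a seed, so the balanced subset changes from run to run even though the train/test split itself is seeded. Here is a minimal sketch of a reproducible variant, using pandas' DataFrameGroupBy.sample with a random_state. The toy DataFrame stands in for dataset_raw, and its rows are invented.

```python
import pandas as pd

# Toy stand-in for dataset_raw: 6 human essays (0) and 3 generated ones (1).
toy_raw = pd.DataFrame({
    "text": [f"essay {i}" for i in range(9)],
    "generated": [0, 0, 0, 0, 0, 0, 1, 1, 1],
})

# Size of the smallest class; every class is down-sampled to this many rows.
min_count = toy_raw["generated"].value_counts().min()

# Sample min_count rows per class with a fixed seed so reruns give the same subset.
balanced = (
    toy_raw.groupby("generated")
    .sample(n=min_count, random_state=42)
    .reset_index(drop=True)
)

print(balanced["generated"].value_counts())
```

train_test_split also accepts a stratify argument (for example stratify=y), which keeps the class ratio the same in the training and test sets. The original code does not use it.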
