Compute Engine
With data generation and pre-processing complete, it is time to fine-tune our model on the collected data. Before starting the training, I thought I could simply use a Google Colab Free Tier account for all the computation, since TPUs are available there. However, that assumption proved to be incorrect. When I tried fine-tuning XLNet on a Google Colab TPU, the process ran for more than half a day without finishing three epochs, until it finally returned an error because my browser disconnected from the internet. How frustrating!
So after multiple failed attempts at fine-tuning models on the Google Colab Free Tier, I sought a more elegant alternative: ASPIRE2A from the National Supercomputing Centre. To expedite the process, I reduced the amount of data from 2000 to 200, because even ASPIRE2A took 8 minutes to fine-tune the model on 3 data points. At that rate, tuning XLNet on 2000 data points for a total of 3 epochs would take an estimated 11+ days, which would exceed the maximum wall time for my user account on ASPIRE2A.
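The 11-day figure comes from a simple linear extrapolation. Here is a minimal sketch of that arithmetic, assuming the 8 minutes measured for 3 data points covers a single epoch and that runtime scales linearly with the number of data points and epochs. Both are assumptions, and the function name is made up for illustration.

```python
def estimate_days(minutes_measured, samples_measured, total_samples, epochs):
    """Linearly extrapolate fine-tuning time (in days) from a small timing run."""
    minutes_per_sample = minutes_measured / samples_measured
    total_minutes = minutes_per_sample * total_samples * epochs
    return total_minutes / (60 * 24)


# Assumption: 8 minutes per epoch for 3 data points, as measured on ASPIRE2A.
full_run = estimate_days(8, 3, total_samples=2000, epochs=3)
reduced_run = estimate_days(8, 3, total_samples=200, epochs=3)

print(f"2000 data points: about {full_run:.1f} days")
print(f" 200 data points: about {reduced_run:.1f} days")
```

Under these assumptions the full dataset works out to roughly 11 days, which matches the estimate above.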
ASPIRE2A
I was introduced to ASPIRE2A during an admission test for the university’s High Performance Computing Club. Honestly, before the test, I was unaware of its existence. Basically, it is a high-performance computing cluster that is offered to all university students and researchers in Singapore. You register for an account, and then you can submit computationally intensive tasks to the compute nodes, requesting multiple nodes with multiple CPUs/GPUs and as much RAM as the task needs. For my fine-tuning process, I transferred the .ipynb file to my ASPIRE2A account, loaded the Python module, installed the necessary dependencies, submitted the job to the queue, and then safely turned off my laptop. I checked the results a few hours to a day later.
My PBS file
PBS (Portable Batch System) files are scripts used in high-performance computing environments to submit and manage batch jobs. As a user, you configure the PBS file and then submit the job to the server with the command qsub. If you want to work interactively on a compute node, you can use qsub -I to start an interactive session. If you want to ssh into the compute nodes later, remember to set up the project ID.
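#!/bin/bash
# Job name
#PBS -N AI
# Request one chunk with 128 CPU cores and 500 GB of memory
#PBS -l select=1:ncpus=128:mem=500G
# Maximum run time (hh:mm:ss)
#PBS -l walltime=24:00:00
# Merge stderr into stdout and write both to a single file
#PBS -j oe
#PBS -o out-run-min-electra.txt
# Submit to the "ai" queue
#PBS -q ai

# Move to the project folder, load Python and activate the virtual environment
cd ai
module load python/3.10.9
source venv/bin/activate

# Execute the notebook from top to bottom and save an executed copy
jupyter nbconvert --execute --to notebook index_min_electra.ipynb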
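Since I ran one notebook per model, the same script has to be rewritten with a different notebook and output file each time. Below is a rough sketch of how that could be templated in Python. The helper render_pbs_script and its parameters are hypothetical (not part of PBS or ASPIRE2A), and the values simply mirror the script above.

```python
def render_pbs_script(job_name, ncpus, mem, walltime, queue, workdir,
                      notebook, output_file, python_module="python/3.10.9"):
    """Return the text of a PBS job script that executes one notebook."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",
        f"#PBS -l select=1:ncpus={ncpus}:mem={mem}",
        f"#PBS -l walltime={walltime}",
        "#PBS -j oe",
        f"#PBS -o {output_file}",
        f"#PBS -q {queue}",
        f"cd {workdir}",
        f"module load {python_module}",
        "source venv/bin/activate",
        f"jupyter nbconvert --execute --to notebook {notebook}",
    ]
    return "\n".join(lines) + "\n"


# Values copied from the PBS file shown above.
script = render_pbs_script(
    job_name="AI", ncpus=128, mem="500G", walltime="24:00:00", queue="ai",
    workdir="ai", notebook="index_min_electra.ipynb",
    output_file="out-run-min-electra.txt",
)
print(script)
```

The generated text can be saved to a file and submitted with qsub in the usual way.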
Thoughts
If you’re not currently a student in Singapore, you can of course opt for paid Google Cloud compute nodes, or maybe AWS compute nodes, or even set up a server in your room to run the tasks. But honestly, if you’re in Singapore, having the chance to play with the resources of the National Supercomputing Centre is pretty exciting, so make sure you register for an account and play with it! P.S. Let’s hope that by the time you register, they still give you the same amount of credits for your user account ><
Result Comparisons
For this project, I fine-tune XLNet, RoBERTa and ELECTRA, and then evaluate each of them on test data that has not been fed into the fine-tuning process.
<To Do 1: Add more explanation in result comparison, focusing on model architecture and training methodology.>
<To Do 2: Add hybrid model in another post>
So far, the best-performing model is ELECTRA, with an accuracy of 98%.
XLNet
Accuracy on the test set: 0.96
XLNet Code
from transformers import XLNetTokenizer, XLNetForSequenceClassification
from torch.utils.data import Dataset, DataLoader
import torch
from sklearn.metrics import accuracy_score

# Load the pretrained tokenizer and classification model
tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
model = XLNetForSequenceClassification.from_pretrained('xlnet-large-cased')

# x_train, x_test, y_train and y_test come from the pre-processing step
x_train_str = x_train.astype(str).tolist()
x_test_str = x_test.astype(str).tolist()

# Tokenize input data (pad to the longest sequence, truncate to the model limit)
X_train_encodings = tokenizer(x_train_str, truncation=True, padding=True, return_tensors="pt")
X_test_encodings = tokenizer(x_test_str, truncation=True, padding=True, return_tensors="pt")


class CustomDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # The encodings are already tensors (return_tensors="pt"), so index them directly
        item = {
            'input_ids': self.encodings['input_ids'][idx],
            'token_type_ids': self.encodings['token_type_ids'][idx],
            'attention_mask': self.encodings['attention_mask'][idx],
            'labels': torch.as_tensor(self.labels[idx])
        }
        return item


train_dataset = CustomDataset(X_train_encodings, y_train)
test_dataset = CustomDataset(X_test_encodings, y_test)
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Training loop
num_epochs = 3
for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        token_type_ids = batch['token_type_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        # Forward pass
        outputs = model(input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Evaluation
model.eval()
predictions = []
with torch.no_grad():
    for batch in test_loader:
        input_ids = batch['input_ids'].to(device)
        token_type_ids = batch['token_type_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        # Forward pass for evaluation
        outputs = model(input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
        logits = outputs.logits
        # Append predictions (index of the highest logit per example)
        predictions.extend(torch.argmax(logits, dim=1).cpu().numpy())

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")
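The evaluation step boils down to two small operations: pick the class with the highest logit for each test example, then count how often that matches the true label. Here is a dependency-free sketch of the same idea. The logits and labels are made-up numbers for illustration, not results from my run.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Made-up logits for four test examples and two classes.
logits = [[2.1, -0.3], [0.2, 1.7], [-1.0, 0.4], [0.9, 0.1]]
y_true = [0, 1, 0, 0]

y_pred = [argmax(row) for row in logits]
print(y_pred)                    # [0, 1, 1, 0]
print(accuracy(y_true, y_pred))  # 0.75
```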
RoBERTa
Accuracy on the test set: 0.86
RoBERTa Code
from transformers import RobertaTokenizer, RobertaForSequenceClassification, get_linear_schedule_with_warmup
from torch.utils.data import Dataset, DataLoader
import torch
from sklearn.metrics import accuracy_score

# Load the tokenizer and a RoBERTa checkpoint with a classification head
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-emotion")

# x_train, x_test, y_train and y_test come from the pre-processing step
x_train_str = x_train.astype(str).tolist()
x_test_str = x_test.astype(str).tolist()

# Tokenize input data
X_train_encodings = tokenizer(x_train_str, truncation=True, padding=True, return_tensors="pt")
X_test_encodings = tokenizer(x_test_str, truncation=True, padding=True, return_tensors="pt")


class RobertaDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # The encodings are already tensors (return_tensors="pt"), so index them directly
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.as_tensor(self.labels[idx])
        return item


# Convert the training labels to a tensor
y_train_tensor = torch.tensor(y_train)

# Create training and testing datasets
train_dataset = RobertaDataset(X_train_encodings, y_train_tensor)
test_dataset = RobertaDataset(X_test_encodings, y_test)

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Set up scheduler (linear decay, no warmup)
num_epochs = 3
total_steps = len(train_loader) * num_epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

# Training loop
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        scheduler.step()
    average_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch + 1}/{num_epochs}, Average Training Loss: {average_loss}")

# Evaluation loop
model.eval()
all_preds = []
all_labels = []
with torch.no_grad():
    for batch in test_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        preds = torch.argmax(logits, dim=1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

# Calculate accuracy
accuracy = accuracy_score(all_labels, all_preds)
print(f"Accuracy on the test set: {accuracy}")
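With num_warmup_steps=0, the linear schedule used above decays the learning rate from its starting value down to zero over total_steps. Here is a small pure-Python sketch of the multiplier it applies at each step. It is written to mirror the documented behaviour of get_linear_schedule_with_warmup, not copied from the library, and the function name and step count are made up for illustration.

```python
def linear_schedule_factor(step, num_training_steps, num_warmup_steps=0):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup phase: ramp up linearly from 0 to 1
        return step / max(1, num_warmup_steps)
    # Decay phase: ramp down linearly from 1 to 0
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-5
total_steps = 12  # hypothetical: 4 batches per epoch * 3 epochs

for step in (0, 3, 6, 9, 12):
    lr = base_lr * linear_schedule_factor(step, total_steps)
    print(f"step {step:2d}: lr = {lr:.2e}")
```

So halfway through training the model is already learning at half the initial rate, and the last updates are tiny.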
ELECTRA
Accuracy on the test set: 0.96
ELECTRA Code
from transformers import ElectraTokenizer, ElectraForSequenceClassification, get_linear_schedule_with_warmup
from torch.utils.data import Dataset, DataLoader
import torch
from sklearn.metrics import accuracy_score

# Load the pretrained tokenizer and classification model
tokenizer = ElectraTokenizer.from_pretrained("google/electra-base-discriminator")
model = ElectraForSequenceClassification.from_pretrained("google/electra-base-discriminator")

# x_train, x_test, y_train and y_test come from the pre-processing step
x_train_str = x_train.astype(str).tolist()
x_test_str = x_test.astype(str).tolist()

# Tokenize input data
X_train_encodings = tokenizer(x_train_str, truncation=True, padding=True, return_tensors="pt")
X_test_encodings = tokenizer(x_test_str, truncation=True, padding=True, return_tensors="pt")


class ElectraDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # The encodings are already tensors (return_tensors="pt"), so index them directly
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.as_tensor(self.labels[idx])
        return item


# Convert the training labels to a tensor
y_train_tensor = torch.tensor(y_train)

# Create training and testing datasets
train_dataset = ElectraDataset(X_train_encodings, y_train_tensor)
test_dataset = ElectraDataset(X_test_encodings, y_test)

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Set up scheduler (linear decay, no warmup)
num_epochs = 3
total_steps = len(train_loader) * num_epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

# Training loop
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        scheduler.step()
    average_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch + 1}/{num_epochs}, Average Training Loss: {average_loss}")

# Evaluation loop
model.eval()
all_preds = []
all_labels = []
with torch.no_grad():
    for batch in test_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        preds = torch.argmax(logits, dim=1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

# Calculate accuracy
accuracy = accuracy_score(all_labels, all_preds)
print(f"Accuracy on the test set: {accuracy}")
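All three scripts tokenize with truncation=True and padding=True, which is where the attention_mask passed to the model comes from. As a rough illustration of what those two flags do, here is a simplified pure-Python sketch. The token IDs are made up and pad_batch is a hypothetical helper. The real tokenizers also keep the special end-of-sequence token when truncating, and XLNet pads on the left rather than the right, so treat this as the general idea only.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Right-pad token-ID lists to a common length and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    target = longest if max_length is None else min(longest, max_length)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:target]                    # simplified truncation
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs for a short and a longer sentence.
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```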