Literature Review - Large Language Model (LLM)

In the context of classifying AI-generated content, Large Language Models (LLMs) are commonly used because they can capture the nuances of language and achieve effective general-purpose language understanding. In practice, LLMs are pre-trained on massive unlabeled datasets before being deployed for specific downstream tasks, with few changes to the architecture after pre-training. Training a Large Language Model requires substantial computational power; models are usually trained in a distributed fashion across multiple GPUs or TPUs for days to months.

The Pioneer of LLMs - ELMo

One of the pioneering Large Language Models pre-trained on an unlabeled corpus before fine-tuning for specific downstream tasks is Embeddings from Language Models (ELMo).

What’s so special about ELMo?

  1. ELMo takes context into account.
    • Language models before ELMo simply produced an embedding based on the literal spelling of a word. They did not factor in how, when, or to whom the word is being used.
    • ELMo uses bidirectional Long Short-Term Memory (LSTM) networks to read the sentence “left-to-right” and then “right-to-left” as two separate passes. It then concatenates the hidden states from both passes to “understand” the full context (see the sketch after this list).
  2. ELMo is a general-purpose pre-trained language model that is open source
    • People can download the code from GitHub and fine-tune it themselves.
    • ELMo is trained on a massive corpus, hence it has learned a great deal of linguistic knowledge and performs well in a variety of domains.
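
A minimal sketch of the bidirectional idea from point 1 is shown below, using PyTorch. The dimensions, random embeddings, and variable names are toy values for illustration only, not ELMo's actual configuration.

```python
# Run one LSTM left-to-right and another right-to-left over the same sentence,
# then concatenate the hidden states so each token's vector carries context
# from both directions (the core ELMo idea, heavily simplified).
import torch
import torch.nn as nn

embed_dim, hidden_dim, seq_len = 16, 32, 6
embeddings = torch.randn(1, seq_len, embed_dim)               # one toy sentence

forward_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
backward_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

fwd_out, _ = forward_lstm(embeddings)                         # left-to-right pass
bwd_out, _ = backward_lstm(torch.flip(embeddings, dims=[1]))  # right-to-left pass
bwd_out = torch.flip(bwd_out, dims=[1])                       # re-align to original order

contextual = torch.cat([fwd_out, bwd_out], dim=-1)            # concatenated hidden states
print(contextual.shape)                                       # (1, 6, 64): one context-aware vector per token
```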

But why did ELMo lose its advantage over time?

While ELMo’s use of bidirectional LSTMs to learn context is impressive, the approach has challenges that led to its loss of the limelight.

  1. Locality Bias and Lack of Simultaneous Context Comprehension
    • ELMo’s bidirectional LSTM architecture introduces locality bias, the tendency to focus more on recent information and give less consideration to distant context. This bias can limit its ability to truly grasp the overall context of a sentence.
    • ELMo is trained bidirectionally at different times, “left-to-right” and then “right-to-left”. Compared to later models that are trained bidirectionally and simultaneously, ELMo’s approach may result in a less nuanced understanding of context.
  2. Slowness in Training Due to LSTM Architecture
    • ELMo is also slow to train, as it relies on back-propagation through both long- and short-term memory states to update all the parameters.
    • In an LSTM, the longer the sentence, the greater the number of sequential operations required to relate signals from two arbitrary input or output positions.
  3. Evolution of Transformer Architecture
    • To address these challenges, researchers at Google came up with the Transformer architecture.
    • This marked a shift from the sequential processing of LSTM to a more parallelized and scalable approach, improving both training speed and the model’s ability to capture long-range dependencies.
    • In the Transformer, relating signals from two arbitrary input or output positions is reduced to a constant number of operations by using the self-attention mechanism.
    • The self-attention mechanism assigns weights based on the similarity between positions, so every position can attend directly to every other position in a single step rather than through a chain of recurrent operations (a sketch follows this list). With the Transformer, training the model is both faster and more effective.
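
As a concrete illustration of the bullet above, here is a minimal sketch of scaled dot-product self-attention in NumPy. The toy matrices and dimensions are made up for the example; a real Transformer also adds multiple heads, masking, and learned projections inside larger layers.

```python
# Scaled dot-product self-attention: every position attends to every other
# position in one step, which removes the long sequential paths of an LSTM.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # project inputs to queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # similarity between all pairs of positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V                               # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                          # 5 tokens, 8-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (5, 8): one output vector per token
```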

All in all, while ELMo pioneered contextual embeddings, its limitations in terms of training efficiency and context comprehension paved the way for newer models, like those based on the Transformer architecture, to take center stage in the ever-evolving landscape of natural language processing.

Transformer

Under the shared high-level idea of pre-training a language model on a large corpus using the Transformer, different unsupervised pre-training objectives have been explored. Among them, Auto Regressive (AR) and Auto Encoding (AE) objectives are the two most popular.

Auto Regressive

In AR, the model is fed an input sequence of words and tries to predict the next suitable word by estimating a probability distribution over the vocabulary. The most notable example is the Generative Pre-Trained Transformer (GPT) family.
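
The sketch below illustrates the autoregressive idea with a toy bigram probability table standing in for a real language model; the table, words, and probabilities are invented purely for illustration.

```python
# Autoregressive generation: repeatedly sample the next word from a
# probability distribution conditioned on what has been generated so far.
import random

# Toy conditional distributions P(next word | current word), illustrative only.
bigram_probs = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.2, "ran": 0.8},
    "sat": {"down": 1.0},
    "ran": {"away": 1.0},
}

def generate(start, max_len=5):
    sequence = [start]
    while sequence[-1] in bigram_probs and len(sequence) < max_len:
        dist = bigram_probs[sequence[-1]]
        words, probs = zip(*dist.items())
        sequence.append(random.choices(words, weights=probs)[0])  # sample the next word
    return " ".join(sequence)

print(generate("the"))
```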

AR Example 1: GPT

GPT 1:
GPT 1 developed ELMo’s idea further by using the Transformer instead of a bidirectional LSTM.

GPT 2: GPT 2 has the same architectural features as GPT 1, but it was trained on a much larger input dataset and achieved some major improvements. GPT 2 incorporates additional context, such as Parts of Speech (noun or verb) and subject-object detection, instead of solely predicting the next word in the sequence based on previous words. It also uses superior sampling algorithms such as top-p sampling, temperature scaling, and unconditional sampling when generating new text (a sketch follows below). These algorithms help prevent the generation of overly diverse or nonsensical text, providing a balance between creativity and coherence.
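
Below is a minimal sketch of temperature scaling and top-p (nucleus) sampling applied to a toy next-token distribution. The vocabulary and logits are made up; a real model would produce the logits, and the cutoff values are illustrative defaults rather than GPT 2's actual settings.

```python
# Temperature scaling flattens or sharpens the distribution; top-p sampling
# keeps only the smallest set of tokens whose cumulative probability reaches p.
import numpy as np

def sample_top_p(logits, vocab, temperature=0.8, top_p=0.9):
    scaled = logits / temperature                            # temperature scaling
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                                     # softmax
    order = np.argsort(probs)[::-1]                          # most likely tokens first
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]   # smallest nucleus covering top_p
    nucleus = probs[keep] / probs[keep].sum()                # renormalize inside the nucleus
    return vocab[np.random.choice(keep, p=nucleus)]

vocab = np.array(["cat", "dog", "banana", "quantum", "mat"])
logits = np.array([2.0, 1.5, 0.2, -1.0, 0.8])
print(sample_top_p(logits, vocab))
```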

GPT 3:
GPT 3 is trained on an even larger corpus of text data, splits the model across multiple accelerators for training, and implements few-shot learning so that GPT 3 can quickly adapt to new tasks based on its general understanding of language, even though it has not been explicitly trained on them.

GPT 3.5:
In GPT 3.5, they incorporate Reinforcement Learning from Human Feedback (RLHF), in which human-generated feedback provides additional rewards or penalties that influence the learning process, in an attempt to ensure the model generates safe and responsible answers.

GPT 4:
In GPT 4, the model accepts both text and image inputs while generating text outputs.

Auto Encoding

In comparison, AE does not perform explicit probability (density) estimation of the next word; instead, it aims to reconstruct the original data from a corrupted input.

AE Example 1: BERT

BERT is the pioneer in introducing AE. BERT uses two unsupervised tasks as pre-training objectives: one is Next Sentence Prediction and the other is the Masked Language Model.

Next Sentence Prediction: Next Sentence Prediction is a binarized classification task in which the model predicts whether one sentence follows another in a given pair. In addition, a special token called [CLS] is added, and its representation is used to store the prediction (see the sketch below).
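
The sketch below shows how Next Sentence Prediction training pairs could be assembled. The 50/50 positive/negative sampling and the [CLS]/[SEP] token layout follow the BERT paper; the helper names and example sentences are invented for illustration.

```python
# Build one Next Sentence Prediction example: half of the time the second
# sentence really follows the first, half of the time it is a random sentence.
import random

def build_nsp_example(sent_a, sent_b, corpus):
    """Return (tokens, is_next) for one NSP training example."""
    if random.random() < 0.5:
        next_sent, is_next = sent_b, 1                 # positive pair: real continuation
    else:
        next_sent, is_next = random.choice(corpus), 0  # negative pair: random sentence
    tokens = ["[CLS]"] + sent_a.split() + ["[SEP]"] + next_sent.split() + ["[SEP]"]
    return tokens, is_next

corpus = ["the cat sat on the mat", "stocks fell sharply today"]
print(build_nsp_example("he opened the door", "the cat sat on the mat", corpus))
```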

Masked Language Model: The Masked Language Model in BERT means the model randomly masks some percentage of the input tokens and then recovers the original tokens from the corrupted version. Many early language models, especially those based on recurrent neural networks (AR), were trained left-to-right or right-to-left, meaning they predicted one word at a time based on the preceding words. However, this is not ideal because the predicted word is not influenced by the entire context in which it appears. To address this limitation, the Masked Language Model was introduced, emphasizing the importance of comprehending the bidirectional context of words within a sentence. This modification enhances the model’s ability to capture the nuanced relationships between words and their context, resulting in more robust language understanding.
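
A minimal sketch of BERT-style masking is shown below. It assumes the 15% selection rate and the 80/10/10 replacement split reported in the BERT paper, and uses token strings instead of vocabulary ids to keep the example readable.

```python
# Corrupt the input for the Masked Language Model objective: a fraction of
# tokens is selected, and each selected token is replaced with [MASK], a
# random token, or kept unchanged; the model must recover the originals.
import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            labels.append(tok)                            # the model must recover this token
            r = random.random()
            if r < 0.8:
                inputs.append("[MASK]")                   # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(vocab))       # 10%: replace with a random token
            else:
                inputs.append(tok)                        # 10%: keep the original token
        else:
            inputs.append(tok)
            labels.append(None)                           # not selected; no loss at this position
    return inputs, labels

vocab = ["cat", "dog", "mat", "sat", "the", "on"]
print(mask_tokens("the cat sat on the mat".split(), vocab))
```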

In the Masked Language Model, however, a potential issue arises because the [MASK] token does not appear during fine-tuning, creating a mismatch between pre-training and fine-tuning. To mitigate this, the Google AI Brain Team came up with a new model called XLNet: Generalized Autoregressive Pretraining for Language Understanding.

Another (AE+AR) Example: XLNet

XLNet introduces a new pre-training method called Permutation Language Modeling (PLM), distinguishing itself from traditional methods like BERT. Instead of masking tokens, it predicts a token’s identity by considering all tokens in the sequence, effectively capturing bidirectional context while avoiding the issues associated with predicting masked tokens.

More details:
Operating on a generalized Auto Regressive (AR) foundation, XLNet incorporates a permutation language modeling objective to synergize the benefits of both AR and Auto Encoding (AE) methods. Unlike conventional AR models that use a fixed forward or backward factorization order, XLNet maximizes the expected log-likelihood of a sequence with respect to all possible permutations of the factorization order. Because of this permutation operation, the context for each position can consist of tokens from both left and right, fostering a more comprehensive understanding of sentence structure (see the sketch below). In summary, XLNet, as a generalized AR language model, avoids relying on data corruption during pre-training, hence it does not suffer from the pretrain-finetune discrepancy that BERT experiences.
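
The sketch below illustrates only the factorization-order idea behind permutation language modeling. The sentence and the simple printout are toy stand-ins; in the real XLNet, the predictions come from a Transformer with two-stream attention rather than the list of (context, target) steps built here.

```python
# Sample one factorization order (a permutation of positions) and list the
# prediction steps: each token is predicted given only the tokens that come
# before it in the sampled order, mixing left and right context.
import random

def permutation_lm_targets(tokens):
    """Return (context, target, position) steps under one sampled permutation."""
    order = list(range(len(tokens)))
    random.shuffle(order)                          # one sampled factorization order z
    steps = []
    for t, pos in enumerate(order):
        context_positions = sorted(order[:t])      # positions already "seen" under z
        context = [tokens[i] for i in context_positions]
        steps.append((context, tokens[pos], pos))
    return steps

for context, target, pos in permutation_lm_targets("new york is a city".split()):
    print(f"predict '{target}' at position {pos} given {context}")
```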

AE Example 2: RoBERTa

All of the self-training methods mentioned above have brought significant performance gains on benchmark tests. Later, because of the challenge of determining which aspects of the methods contribute the most, Facebook AI rolled out RoBERTa: A Robustly Optimized BERT Pretraining Approach, which includes a careful evaluation of the effects of hyperparameter tuning and training set size. In their research, they found that BERT was significantly undertrained.

In RoBERTa, they trained the model longer, with bigger batches over more data. They also removed Next Sentence Prediction, as they found that using full segments of sentences without the Next Sentence Prediction objective slightly improves performance on downstream tasks. In addition, they substituted static masking with dynamic masking. Originally in BERT, to avoid using the same static mask for each training instance in every epoch, the training data was duplicated ten times so that each sequence is masked in 10 different ways over the 40 epochs of training. In RoBERTa, dynamic masking generates the masking pattern every time a sequence is fed to the model (see the sketch below). This becomes crucial when pre-training for more steps or with larger datasets.
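
The contrast between the two schemes is sketched below with a toy mask() helper; the corpus, masking rate, and number of duplicated copies are simplified, illustrative stand-ins for the procedures described in the BERT and RoBERTa papers.

```python
# Static masking fixes the patterns at preprocessing time (a limited number of
# pre-computed copies); dynamic masking draws a fresh pattern every time a
# sequence is served to the model.
import random

def mask(tokens, rate=0.15):
    # toy masking: hide each token with probability `rate`
    return [("[MASK]" if random.random() < rate else t) for t in tokens]

corpus = ["the cat sat on the mat".split(), "stocks fell sharply today".split()]

# Static masking (BERT): pre-compute ten masked copies of the data up front.
static_copies = [[mask(seq) for seq in corpus] for _ in range(10)]

# Dynamic masking (RoBERTa): mask on the fly, so every epoch sees new patterns.
def dynamic_batches(corpus, epochs):
    for _ in range(epochs):
        for seq in corpus:
            yield mask(seq)

for batch in dynamic_batches(corpus, epochs=2):
    print(batch)
```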

Another Example: ELECTRA

Although Masked Language Modeling brings significant performance gains, models such as BERT require extensive computational resources for optimal effectiveness. Hence, Google and Stanford proposed a more sample-efficient pre-training task called replaced token detection in 2020 - ELECTRA. In ELECTRA, a different approach is taken; rather than masking the input, they corrupt it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, they train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not (see the sketch below). In their research, they found that models work best with a generator 1/4 to 1/2 the size of the discriminator. Although their approach is reminiscent of training the discriminator of a GAN, their method is not adversarial: the generator producing corrupted tokens is trained with maximum likelihood, because applying GANs to text is difficult given that GANs were originally designed for continuous data rather than discrete data like text.
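
A minimal sketch of replaced token detection is given below, with random sampling standing in for the generator and no actual discriminator network. The vocabulary, masking rate, and helper names are invented for illustration; the point is only how the labels for the discriminator are derived.

```python
# Corrupt the input by sampling plausible replacements for some positions
# (the generator's role), then label each position as replaced or original.
# A discriminator would be trained to predict these labels for every token.
import random

VOCAB = ["the", "cat", "dog", "sat", "ran", "on", "mat", "rug"]

def generator_fill(tokens, mask_rate=0.15):
    """Return (corrupted_tokens, replaced_labels) for one training example."""
    corrupted, replaced = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            sample = random.choice(VOCAB)          # the generator's plausible guess
            corrupted.append(sample)
            replaced.append(sample != tok)         # label: was this token changed?
        else:
            corrupted.append(tok)
            replaced.append(False)
    return corrupted, replaced

original = "the cat sat on the mat".split()
corrupted, labels = generator_fill(original)
# The discriminator predicts `labels` (replaced vs. original) for each token
# in `corrupted`, rather than recovering the original token identities.
print(list(zip(corrupted, labels)))
```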