How does ChatGPT work? Tracing the evolution of AIGC

Updated: Jan 24

Originally published on December 31, 2022 at DTonomy

Key AIGC Technology

Author Peter Luo

RNN Seq2Seq

For a long time, AIGC was dominated by the RNN-based Seq2Seq model, which consists of two RNN networks: the first acts as the encoder and the second as the decoder. The text generated by RNN Seq2Seq is usually of poor quality, often with grammatical errors or unclear semantics, mainly because errors made at one decoding step are passed on to later steps and amplified.

RNN Seq2Seq Model
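To make the encoder/decoder split concrete, here is a minimal sketch of an RNN Seq2Seq forward pass with greedy decoding. It is only an illustration: the weights are random and untrained, and the sizes, token ids, and function names (`rnn_step`, `encode`, `greedy_decode`) are assumptions made for this example. The point to notice is that each predicted token is fed back in as the next decoder input, which is how an early mistake can propagate through the rest of the output.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN = 6, 8   # toy sizes, chosen arbitrarily for the sketch
BOS, EOS = 0, 1        # hypothetical begin/end-of-sequence token ids

# Randomly initialised (untrained) parameters, for illustration only.
E = rng.normal(size=(VOCAB, HIDDEN)) * 0.1       # embedding table
W_enc = rng.normal(size=(HIDDEN, HIDDEN)) * 0.1  # encoder input weights
U_enc = rng.normal(size=(HIDDEN, HIDDEN)) * 0.1  # encoder recurrent weights
W_dec = rng.normal(size=(HIDDEN, HIDDEN)) * 0.1  # decoder input weights
U_dec = rng.normal(size=(HIDDEN, HIDDEN)) * 0.1  # decoder recurrent weights
W_out = rng.normal(size=(HIDDEN, VOCAB)) * 0.1   # hidden state -> vocabulary scores


def rnn_step(x, h, W, U):
    """One vanilla RNN update: new hidden state from input x and old state h."""
    return np.tanh(x @ W + h @ U)


def encode(src_ids):
    """First RNN: read the source tokens one by one into a single state vector."""
    h = np.zeros(HIDDEN)
    for token in src_ids:
        h = rnn_step(E[token], h, W_enc, U_enc)
    return h


def greedy_decode(h, max_len=5):
    """Second RNN: emit tokens one at a time, feeding each prediction back in."""
    output, prev = [], BOS
    for _ in range(max_len):
        h = rnn_step(E[prev], h, W_dec, U_dec)
        prev = int(np.argmax(h @ W_out))  # the next step is conditioned on this choice
        if prev == EOS:
            break
        output.append(prev)
    return output


generated = greedy_decode(encode([2, 3, 4]))
print(generated)
```

Because both loops run one token at a time, neither the encoder nor the decoder can be parallelised across positions.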

In 2017, the Transformer architecture was introduced and quickly gained popularity because it captures complex feature representations and trains more efficiently than RNN models. In particular, it processes all positions of a sequence in parallel rather than one step at a time, which shifted the focus of text-writing algorithm research toward the Transformer. A series of pre-trained models followed, and these have become the leading AIGC technologies. The following sections give an overview of these models.

Transformer Architecture
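The parallelism mentioned above comes from self-attention, which relates every position to every other position with a few matrix products instead of a step-by-step loop. The following is a minimal single-head sketch in NumPy, not a full Transformer layer: there is no masking, no multi-head split, and no feed-forward block, and the names `self_attention`, `Wq`, `Wk`, and `Wv` as well as the toy sizes are assumptions made for this illustration.

```python
import numpy as np


def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a whole sequence at once.

    X has shape (seq_len, d_model); every row (position) is handled in the
    same matrix products, with no loop over time steps.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)   # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over each row
    return weights @ V, weights


rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                            # toy sizes
X = rng.normal(size=(seq_len, d_model))            # stand-in token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape, weights.shape)
```

Row i of `weights` holds how much position i attends to each position in the sequence; the pre-trained models discussed next differ largely in which of those entries they allow to be non-zero.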


UniLM

UniLM, short for Unified Language Model, is a generative BERT-style model released by Microsoft Research in 2019. Unlike traditional Seq2Seq models, it uses only a single BERT-style Transformer encoder and has no separate Decoder component. It combines the training objectives of several other models: left-to-right LM (L2R-LM, as in ELMo and GPT), right-to-left LM (R2L-LM, as in ELMo), bidirectional LM (BI-LM, as in BERT), and sequence-to-sequence LM (Seq2Seq-LM), hence the name “Unified”.

UniLM Model Architecture (Source)

UniLM’s pre-training is divided into three parts: Left-to-Right, Bidirectional, and Seq-to-Seq.

The three methods differ only in the Transformer’s attention mask matrix:

  • For Seq-to-Seq, words in the first sentence can attend to every word in the first sentence but not to the second sentence; each word in the second sentence can attend to the whole first sentence, to itself, and to the words before it in the second sentence, but not to the words after it;

  • For Left-to-Right, the Transformer’s Attention only covers the word itself and the words before it, never the words after it, so the mask matrix is a lower triangular matrix;

  • For Bidirectional, the Transformer’s Attention pays attention to all words and includes the NSP task, just like the original BERT.
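The three masks in the list above can be sketched in a few lines of NumPy. This is an illustration of the masking pattern only, not UniLM’s actual code; the function name `unilm_mask`, the mode strings, and the convention that `mask[i, j] == 1` means "position i may attend to position j" are assumptions made for this example.

```python
import numpy as np


def unilm_mask(mode, src_len, tgt_len=0):
    """Build an attention mask; mask[i, j] == 1 means position i may attend to j.

    src_len is the length of the first sentence and tgt_len the length of the
    second one (only needed for the Seq-to-Seq mode).
    """
    n = src_len + tgt_len
    if mode == "bidirectional":
        # Every word attends to every word, as in the original BERT.
        return np.ones((n, n), dtype=int)
    if mode == "left_to_right":
        # Each word attends to itself and the words before it: lower triangular.
        return np.tril(np.ones((n, n), dtype=int))
    if mode == "seq_to_seq":
        mask = np.zeros((n, n), dtype=int)
        # Every position can see the whole first sentence ...
        mask[:, :src_len] = 1
        # ... and second-sentence words additionally see themselves and earlier
        # second-sentence words. First-sentence rows stay blind to the second.
        mask[src_len:, src_len:] = np.tril(np.ones((tgt_len, tgt_len), dtype=int))
        return mask
    raise ValueError(f"unknown mode: {mode}")


print(unilm_mask("seq_to_seq", src_len=2, tgt_len=2))
# [[1 1 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

In practice such a mask is applied by setting the attention scores of the disallowed (zero) entries to a large negative value before the softmax.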

In the UniLM pre-training process, each of these three methods is trained for 1/3 of the time. Compared to the original BERT, the added unidirectional LM pre-training enhances the text representation ability, and the added Seq-to-Seq LM pre-training also enables UniLM to perform well in text generation/writing tasks.


T5

T5, short for Text-to-Text Transfer Transformer, is a model proposed by Google in 2020. Its central idea is to use Seq2Seq text generation to solve all downstream tasks: e.g., Q&A, summarization, classification, translation, matching, continuation, coreference resolution, etc. This approach lets all tasks share the same model, the same loss function, and the same hyperparameters.

T5 is an Encoder-Decoder model built on a multilayer Transformer. This is its main difference from the other model families: the GPT family is an autoregressive language model (AutoRegressive LM) containing only the Decoder, and BERT is an autoencoding language model (AutoEncoder LM) containing only the Encoder.

Diagram of text-to-text framework. Every task uses text as input to the model, which is trained to generate some target text. The tasks include translation (green), linguistic acceptability (red), sentence similarity (yellow), and document summarization (blue) (Source).

The pre-training of T5 is divided into two parts, unsupervised and supervised.

  • Unsupervised training

The unsupervised part is an MLM-style objective similar to BERT’s, except that BERT masks individual words while T5 masks a segment of consecutive words, i.e., a text span. Each masked span is replaced in the input by a single sentinel symbol (<X>, <Y>, and so on), so the corrupted input does not reveal how many words were removed. In the Decoder part, only the masked spans are output, each introduced by its sentinel symbol, with a final sentinel (<Z>) closing the sequence; the unmasked words are not reproduced. This has several advantages. First, it increases the difficulty of pre-training: predicting a continuous text span of unknown length is clearly harder than predicting a single word, which makes the text representations of the trained language model more general and more adaptable to fine-tuning on poor-quality data. Second, generation tasks also produce output sequences of unknown length, and the pre-training of T5 is well matched to this setting. This pre-training task used in T5 is also known as CTR (Corrupted Text Reconstruction).
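The span-masking scheme just described can be sketched as follows. This is a simplified illustration rather than T5’s actual preprocessing: the spans are chosen by hand instead of sampled at random, tokens are whitespace-separated words, and the function name `span_corrupt` is hypothetical. The sentinel names <X>, <Y>, and <Z> follow the text above.

```python
def span_corrupt(tokens, spans, sentinels=("<X>", "<Y>", "<Z>")):
    """Replace each span with one sentinel and build the matching target.

    spans: sorted, non-overlapping (start, end) index pairs, end exclusive.
    Returns (corrupted_input, target) as space-joined strings.
    """
    if len(spans) >= len(sentinels):
        raise ValueError("need one sentinel per span plus a closing sentinel")
    inputs, targets, cursor = [], [], 0
    for sentinel, (start, end) in zip(sentinels, spans):
        inputs.extend(tokens[cursor:start])  # keep the unmasked words
        inputs.append(sentinel)              # whole span collapses to one symbol
        targets.append(sentinel)             # target announces which span follows
        targets.extend(tokens[start:end])    # then spells out the missing words
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(sentinels[len(spans)])    # closing sentinel ends the target
    return " ".join(inputs), " ".join(targets)


tokens = "Thank you for inviting me to your party last week .".split()
corrupted, target = span_corrupt(tokens, spans=[(2, 4), (8, 9)])
print(corrupted)  # Thank you <X> me to your party <Y> week .
print(target)     # <X> for inviting <Y> last <Z>
```

Note that the two-word span "for inviting" and the one-word span "last" each become a single symbol in the input, so the input alone does not reveal how long the missing spans are.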

  • Supervised training

The supervised part covers four major categories of tasks, drawn from benchmarks including GLUE and SuperGLUE: machine translation, question answering, summarization, and classification. The core idea is to combine these datasets and tasks into a single text-to-text task. To achieve this, a different prefix is designed for each task and fed to the model together with the task text. For example, to translate “That is good.” from English to German, the model is given the input “translate English to German: That is good.” during training and is trained to produce the target “Das ist gut.”. At prediction time, the same prefixed input is supplied and the model is expected to output “Das ist gut.”. Here, “translate English to German:” is the prefix added for this translation task.
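The prefix mechanism amounts to simple string formatting, as the sketch below suggests. The helper name `to_text_to_text` is hypothetical, and the "summarize" prefix and its sample sentence are made up for illustration; only the translation example comes from the text above.

```python
def to_text_to_text(prefix, source, target=None):
    """Turn any task into an (input text, target text) pair via a task prefix.

    During training a target is supplied; at prediction time it is None and
    the model is expected to generate it.
    """
    return f"{prefix}: {source}", target


# Training example for translation: prefixed input plus the expected output.
train_input, train_target = to_text_to_text(
    "translate English to German", "That is good.", "Das ist gut."
)
print(train_input)   # translate English to German: That is good.
print(train_target)  # Das ist gut.

# The same helper serves a different task by changing only the prefix
# (illustrative prefix and text, not taken from the article).
summary_input, _ = to_text_to_text("summarize", "A long article about T5.")
print(summary_input)  # summarize: A long article about T5.
```

Since every task is reduced to "text in, text out", the same model, loss function, and decoding procedure can be reused across all of them.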


BART

BART stands for Bidirectional and Auto-Regressive Transformers. It is a model proposed by Facebook in 2020. As its name suggests, it combines a bidirectional encoder with an auto-regressive decoder. Building on the standard Seq2Seq Transformer model, BART takes the Bidirectional Encoder from BERT and the Left-to-Right Decoder from GPT, which makes it better suited to text generation scenarios than BERT. Compared with GPT, it also has access to bidirectional context information.

BART Model Architecture (Source)