Breaking the Barriers of BERT and GPT: Introducing MASS
A New Era of Pre-training for Natural Language Generation
In the realm of natural language understanding, BERT and GPT have long been the gold standard for pre-training models. However, when it comes to natural language generation tasks, these models often fall short of optimal performance. The reason lies in their architecture: BERT pre-trains only an encoder and GPT only a decoder, so neither jointly pre-trains the encoder-decoder pair that sequence-to-sequence generation relies on, resulting in suboptimal results on generation tasks.
Enter MASS: Masked Sequence to Sequence Pre-training
To address this limitation, Microsoft Research Asia has introduced a new pre-training method called MASS (Masked Sequence to Sequence Pre-training). MASS is specifically designed for natural language generation tasks, where the goal is to predict a sequence of words given a sequence of input words.
How MASS Works
In MASS, a consecutive segment of words in the input sentence is masked on the encoder side. The decoder is then trained to predict exactly that masked segment, while the words that remain visible to the encoder are hidden from the decoder's input. Because the decoder sees none of the unmasked source words, it is forced to extract information from the encoder, which promotes joint training of the encoder-decoder structure.
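The masking scheme above can be sketched in a few lines of Python. This is a minimal illustration of building one MASS training example; the function name `mass_example` and the `[MASK]` placeholder token are illustrative choices, not identifiers from the paper or its released code.

```python
import random

MASK = "[MASK]"

def mass_example(tokens, k, start=None):
    """Build one MASS training example: mask a k-token consecutive span
    on the encoder side; the decoder predicts exactly that span."""
    if start is None:
        start = random.randrange(len(tokens) - k + 1)
    span = tokens[start:start + k]  # the fragment the decoder must predict
    enc_input = tokens[:start] + [MASK] * k + tokens[start + k:]
    # The decoder input is the span shifted right by one position; the
    # unmasked source tokens are NOT fed to the decoder, forcing it to
    # rely on the encoder's representation of the visible context.
    dec_input = [MASK] + span[:-1]
    target = span
    return enc_input, dec_input, target
```

For example, masking two tokens starting at position 1 of "the quick brown fox jumps" gives the encoder `["the", "[MASK]", "[MASK]", "fox", "jumps"]`, while the decoder receives `["[MASK]", "quick"]` and must output `["quick", "brown"]`.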
Advantages of MASS
MASS has several advantages over traditional pre-training methods:
- Improved joint training: By masking words on the encoder side and hiding the unmasked words from the decoder, MASS forces the decoder to rely on the encoder's representations, promoting joint training of the encoder-decoder structure.
- Enhanced source understanding: To help the decoder predict the masked segment, the encoder is forced to understand the meaning of the unmasked words, improving its ability to represent the source text sequence.
- Better language modeling: The decoder is trained to predict consecutive segments rather than scattered individual words, which builds stronger language modeling capability.
Unified Framework for Pre-training
MASS has a significant hyper-parameter k, which controls the length of the masked segment. By adjusting k, MASS serves as a general pre-training framework that subsumes the training methods of both BERT and GPT: with k = 1, predicting a single masked word resembles BERT's masked language model; with k equal to the sentence length, the encoder sees only masks and the decoder predicts every word from left to right, resembling GPT's standard language model.
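The two extremes of k can be demonstrated concretely. The sketch below (the helper name `mass_masks` and the `[MASK]` placeholder are illustrative, not from the paper) shows how k = 1 recovers a BERT-style masked-word objective and k = sentence length recovers a GPT-style left-to-right objective:

```python
MASK = "[MASK]"

def mass_masks(tokens, k, start=0):
    """Encoder input, decoder input, and target span for a given k."""
    span = tokens[start:start + k]
    enc_input = tokens[:start] + [MASK] * k + tokens[start + k:]
    dec_input = [MASK] + span[:-1]  # decoder sees only the (shifted) span
    return enc_input, dec_input, span

sent = "machine translation is fun".split()

# k = 1: one word masked, encoder sees the rest -> BERT-style masked LM
print(mass_masks(sent, k=1, start=2))

# k = len(sent): encoder sees only masks, decoder predicts every word
# from left to right -> GPT-style standard language model
print(mass_masks(sent, k=len(sent)))
```

Intermediate values of k interpolate between the two, which is what makes MASS a unified framework rather than a single fixed objective.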
Experimental Results
Experiments were conducted on four tasks: unsupervised machine translation, low-resource machine translation, text summarization, and dialogue generation. The results showed that MASS outperformed BERT-style and GPT-style pre-training on all four tasks, with particularly significant improvements in unsupervised machine translation and dialogue generation.
Conclusion
MASS is a new pre-training method specifically designed for natural language generation tasks. Its ability to promote joint training of encoders and decoders, strengthen source-side understanding, and improve the decoder's language modeling capability makes it a powerful tool for natural language processing. With its unified framework and the single adjustable hyper-parameter k, MASS has the potential to become a standard pre-training method for natural language generation tasks.
Future Work
The authors plan to extend MASS to include other sequence-to-sequence tasks, such as voice and video generation. They also aim to explore the use of MASS in more natural language tasks, including sentiment analysis and named entity recognition.