CNN is the Future: A Novel Seq2Seq Architecture for Machine Translation

Recurrent neural networks (RNNs) have long been the backbone of sequence-to-sequence (Seq2Seq) modeling for tasks such as machine translation, speech recognition, and text summarization. However, RNNs have limitations: their sequential nature prevents parallelization across time steps and makes long-range dependencies harder to learn, which hinders performance on large-scale tasks. In this paper, we propose a novel Seq2Seq architecture built entirely from convolutional neural networks (CNNs), which we call FAIR (Facebook AI Research).

Advantages of FAIR

Compared to the traditional RNN-based Seq2Seq architecture, FAIR has several advantages. Firstly, computation over all elements of a sequence can be fully parallelized during training, making better use of GPU hardware. Secondly, the number of non-linearities applied to the input is fixed and independent of sequence length, which makes optimization easier. Finally, gated linear units (GLUs) ease the propagation of gradients through the network, and each decoder layer is equipped with its own attention module.
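As a minimal sketch of the gating mechanism (NumPy; shapes are illustrative assumptions), a GLU splits the channels of a layer's output in half and uses a sigmoid of one half to gate the other. The ungated half gives the gradient a linear path through the layer:

```python
import numpy as np

def glu(x):
    """Gated linear unit: split the channel (last) dimension in half and
    gate one half with a sigmoid of the other. Output has half the channels."""
    a, b = np.split(x, 2, axis=-1)
    return a * (1.0 / (1.0 + np.exp(-b)))

# A convolution feeding a GLU produces 2*C channels so the gated
# output has C channels. Gate logits of 0 give sigmoid(0) = 0.5:
x = np.array([[1.0, 2.0, 0.0, 0.0]])
print(glu(x))  # [[0.5 1.0]]
```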

Experimental Results

We evaluated FAIR on several large datasets: WMT’16 English-Romanian, WMT’14 English-German, and WMT’14 English-French. Our results show that FAIR outperforms the previous best architecture on all three tasks, by 1.9 BLEU on WMT’16 English-Romanian, 1.6 BLEU on WMT’14 English-French, and 0.5 BLEU on WMT’14 English-German. Furthermore, our model is roughly an order of magnitude faster than the previous best architecture on both GPU and CPU hardware.

Architecture of FAIR

The architecture of FAIR consists of two main components: the encoder and the decoder. The encoder is a convolutional network that takes in the input sequence and produces a hierarchical representation of the input. The decoder is a convolutional network that takes in the encoder’s output and generates the output sequence.

Encoder

The encoder is a convolutional network composed of multiple convolutional blocks. Each block contains a one-dimensional convolutional layer followed by a non-linearity. The output of each block is a hierarchical representation of the input sequence, in which each output element depends on a fixed number of input elements; stacking blocks enlarges this receptive field.
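A single encoder block might be sketched as follows (NumPy; the shapes, the residual connection, and the use of a GLU as the non-linearity are assumptions in the spirit of the text, not the exact published layer). With kernel width k, each output element depends on exactly k input elements:

```python
import numpy as np

def glu(x):
    a, b = np.split(x, 2, axis=-1)
    return a * (1.0 / (1.0 + np.exp(-b)))

def conv1d_same(x, w, bias):
    """1-D convolution over time. x: (T, C_in), w: (k, C_in, C_out).
    Zero-padding keeps the output length equal to T."""
    k = w.shape[0]
    pad = (k - 1) // 2
    xp = np.pad(x, ((pad, k - 1 - pad), (0, 0)))
    return np.stack([np.tensordot(xp[t:t + k], w, axes=([0, 1], [0, 1]))
                     for t in range(x.shape[0])]) + bias

def encoder_block(x, w, bias):
    """Conv -> GLU, plus a residual connection. The convolution outputs
    2*C channels so the GLU's gated output matches x's C channels."""
    return glu(conv1d_same(x, w, bias)) + x
```

Because the receptive field is fixed, perturbing a distant input element leaves an output element outside its window unchanged; depth, not sequence length, controls how much context each position sees.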

Decoder

The decoder is a convolutional network that takes the encoder’s output and generates the output sequence. It consists of multiple convolutional blocks, similar to the encoder, but its convolutions are causal: the output at a given position depends only on positions already generated. In addition, each decoder block has its own attention module, which allows it to focus on specific elements of the input sequence.
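One standard way to keep a decoder convolution from looking at future positions (an implementation assumption, sketched in NumPy) is to pad only on the left, so the output at step t depends only on steps up to t:

```python
import numpy as np

def causal_conv1d(x, w, bias):
    """Left-padded 1-D convolution: the output at time t sees only x[<= t].
    x: (T, C_in), w: (k, C_in, C_out)."""
    k = w.shape[0]
    xp = np.pad(x, ((k - 1, 0), (0, 0)))  # pad the past, never the future
    return np.stack([np.tensordot(xp[t:t + k], w, axes=([0, 1], [0, 1]))
                     for t in range(x.shape[0])]) + bias
```

Changing a future input leaves all earlier outputs unchanged, which is what lets the decoder be trained on whole target sequences in parallel while remaining autoregressive at inference time.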

Attention Mechanism

The attention mechanism in FAIR is a multi-step attention mechanism: every decoder block attends over the input at each time step. To compute attention, the current decoder state is scored against each encoder output; the resulting weights are used to form a weighted sum of the encoder outputs, which is fed back into the decoder.
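A minimal attention step can be sketched as follows (NumPy; the function names and the dot-product scoring are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(decoder_state, encoder_out):
    """Score the decoder state against every encoder output, normalize
    the scores, and return the weighted sum of encoder outputs."""
    weights = softmax(decoder_state @ encoder_out.T)  # one weight per source position
    context = weights @ encoder_out                   # weighted sum of encoder outputs
    return context, weights
```

Giving every decoder block its own copy of this module is what makes the attention "multi-step": later blocks can re-attend to the source conditioned on what earlier blocks retrieved.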

Initialization

The weights of the network are initialized with a careful scheme that keeps the variance of activations stable throughout the entire forward and backward pass. Weights are drawn from a normal distribution with a mean of 0 and a standard deviation of 0.1.
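The variance-preserving idea can be illustrated with a Glorot-style sketch (an assumption standing in for the exact scheme): scaling the standard deviation by sqrt(1/fan_in) keeps the activation variance roughly constant across a linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weight(n_in, n_out):
    # Zero-mean normal, scaled so that y = x @ W preserves variance for
    # zero-mean, unit-variance inputs: Var(y_j) = n_in * (1 / n_in) = 1.
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

x = rng.normal(size=(4096, 512))   # unit-variance activations
y = x @ init_weight(512, 512)
print(round(float(y.var()), 1))    # close to 1.0
```

Without the sqrt(1/fan_in) scaling, activation variance would grow or shrink with each layer, compounding across a deep stack of blocks.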

Conclusion

In this paper, we proposed a novel Seq2Seq architecture built entirely from CNNs, which we call FAIR. Our results show that FAIR outperforms the previous best architecture on several large datasets, including WMT’16 English-Romanian, WMT’14 English-German, and WMT’14 English-French, and is roughly an order of magnitude faster on both GPU and CPU hardware. We believe that FAIR has the potential to be a game-changer in machine translation and other sequence-to-sequence tasks.

Table 1: BLEU improvement of FAIR over the previous best system (GNMT) on each WMT task.

Model            WMT’16 English-Romanian   WMT’14 English-German   WMT’14 English-French
FAIR vs. GNMT    +1.9 BLEU                 +0.5 BLEU               +1.6 BLEU
