CNN for NLP

Introduction

Chandan Kumar
Jan 28, 2021

Convolutional neural networks (CNNs) are another type of neural network architecture, one that operates very differently from RNNs. CNNs are particularly good at pattern-matching tasks and are increasingly popular in the NLP community.

A CNN is a type of neural network built around a mathematical operation called convolution, which, put simply, detects local patterns that are useful for the task at hand. A CNN usually consists of one or more convolutional layers (which perform the convolution) and pooling layers, which aggregate the result of the convolution.
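To make that concrete, here is a minimal PyTorch sketch of a convolution followed by pooling over a sequence of token embeddings; the names and dimensions are made up for illustration and are not from the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative dimensions only.
batch_size, seq_len, emb_dim, n_filters, kernel_size = 2, 6, 256, 128, 3

embedded = torch.randn(batch_size, seq_len, emb_dim)        # [batch, seq, emb]
conv = nn.Conv1d(in_channels=emb_dim, out_channels=n_filters,
                 kernel_size=kernel_size, padding=1)

# Conv1d expects [batch, channels, seq], so move the embedding dim to channels.
conved = conv(embedded.permute(0, 2, 1))                    # [batch, n_filters, seq]

# Pooling aggregates the convolution result over the whole sequence.
pooled = F.max_pool1d(conved, kernel_size=conved.shape[2])  # [batch, n_filters, 1]
print(pooled.squeeze(2).shape)                              # torch.Size([2, 128])
```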

RNN- vs CNN-based approach

  • An RNN consumes the input tokens one by one, in sequence; a CNN can’t do the same.
  • A CNN processes all of the input tokens simultaneously, at once. In this case, we have to work out the order of the input tokens on our own.
  • When implementing a CNN, we therefore have to keep track of the token positions (indexes) ourselves.

In a sequence-to-sequence model with RNNs, the encoder compresses the entire input sequence into one combined context vector Z and passes it to the decoder.

In a sequence-to-sequence model with CNNs, the encoder instead passes two representations to the decoder, encoder conved and encoder combined, one of each per input token. In this example, with 6 input tokens (including <sos> and <eos>), that is 12 context vectors in total.
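As a rough sketch of the shapes involved (the tensor names and sizes below are assumptions for illustration, not code from the notebook):

```python
import torch

# Hypothetical shapes just to make the counting concrete.
batch_size, src_len, emb_dim = 1, 6, 256   # 6 tokens including <sos> and <eos>

# The CNN encoder returns two representations per source token:
encoder_conved   = torch.randn(batch_size, src_len, emb_dim)   # output of the conv blocks
encoder_combined = torch.randn(batch_size, src_len, emb_dim)   # conved summed with the embeddings

# 2 vectors per token * 6 tokens = 12 context vectors in total,
# instead of a single compressed vector Z as in the RNN encoder.
```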

Token and position embedding

Since a CNN is not a recurrent model, the input tokens are processed simultaneously. To keep track of the positions of the input tokens, a position vector (embedding) has to be added to each token embedding. This is done with an element-wise sum.
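A minimal sketch of that element-wise sum, with made-up vocabulary size and dimensions:

```python
import torch
import torch.nn as nn

# Vocab size, max length and emb_dim are assumptions for the example.
vocab_size, max_len, emb_dim = 10_000, 100, 256

tok_embedding = nn.Embedding(vocab_size, emb_dim)
pos_embedding = nn.Embedding(max_len, emb_dim)

src = torch.randint(0, vocab_size, (1, 6))         # [batch, src_len]
pos = torch.arange(src.shape[1]).unsqueeze(0)      # positions 0..5, shape [1, src_len]

# The element-wise sum injects the position information the CNN otherwise lacks.
embedded = tok_embedding(src) + pos_embedding(pos)  # [batch, src_len, emb_dim]
```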

Residual connection

The output of the convolutional block is added to the block’s input with an element-wise sum. This sum is what produces encoder combined.

Why do we need to do this?

It covers the case where a convolutional layer has a problem, for example when gradients stop flowing through it. With this skip-connection route in place, gradients still have a direct path back through the network.

This is why such a layer is called a residual layer.
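A minimal sketch of the skip connection itself, assuming the block’s input and output have the same shape; the sqrt(0.5) scaling is commonly used in the convolutional seq2seq setup to keep the variance of the sum stable, but treat the exact factor as a detail of this sketch.

```python
import torch
import torch.nn as nn

def residual(block_output, block_input, scale=0.5 ** 0.5):
    # Element-wise sum of a block's output and its input (the skip connection).
    # sqrt(0.5) roughly halves the variance of the sum.
    return (block_output + block_input) * scale

x = torch.randn(1, 256, 6)       # [batch, hid_dim, seq_len], made-up shape
conved = nn.Identity()(x)        # stand-in for a real conv block
out = residual(conved, x)        # gradients can always flow back through `+ block_input`
```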

Use of GLU Activation function

We usually use ReLU, which returns one output for one input.

GLU, on the other hand, splits its input in half along the channel dimension: one half passes through sigmoid() and is multiplied element-wise with the other half to produce the output. The output therefore has only half as many channels as the input, which is something we don’t want while processing text. It means losing half the data.

The authors use GLU and report better results with it. However, because it halves the representation, we apply a trick during convolution to get rid of this problem.

If the data comes in with 256 dimensions, we double it during convolution: 256 * 2 = 512. GLU then halves it back down, so the reduction no longer matters to us.

Example: out_channels is 2 * hid_dim
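A hedged sketch of that trick in PyTorch, with hid_dim and kernel size chosen just for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Double the channels in the convolution, then let GLU halve them again.
hid_dim, kernel_size = 256, 3

conv = nn.Conv1d(in_channels=hid_dim,
                 out_channels=2 * hid_dim,            # double: 256 * 2 = 512
                 kernel_size=kernel_size,
                 padding=(kernel_size - 1) // 2)      # keep the sequence length

x = torch.randn(1, hid_dim, 6)        # [batch, hid_dim, seq_len]
conved = conv(x)                      # [batch, 2 * hid_dim, seq_len]
gated = F.glu(conved, dim=1)          # GLU halves the channels back to hid_dim
print(gated.shape)                    # torch.Size([1, 256, 6])
```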

With Multiple layers

This block can be repeated multiple times, which is why the convolutional layer is shown as N x conv_block.
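One way to express the stack, as a sketch (the real conv_block also wraps GLU, the residual sum and, in the decoder, attention):

```python
import torch.nn as nn

# Made-up hyperparameters for illustration.
hid_dim, kernel_size, n_layers = 256, 3, 10

convs = nn.ModuleList([
    nn.Conv1d(hid_dim, 2 * hid_dim, kernel_size, padding=(kernel_size - 1) // 2)
    for _ in range(n_layers)
])
```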

Decoder

The decoder is similar to the encoder, the main difference being that encoder conved and encoder combined also come in as inputs.

Decoder Convolution Blocks

Notice that two padding tokens are added, and only at the beginning of the sequence, whereas the encoder pads on both sides. Why is that?

Here, only one token (e.g. <sos>) has been generated so far, so we need to convolve on <sos> alone, which is why the two padding tokens are placed in front of it. From that it predicts the next word, ‘two’.
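A sketch of that decoder-side padding, with made-up dimensions; zeros stand in here for the embedded <pad> token:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Pad (kernel_size - 1) positions at the *beginning* only,
# so a position can never see tokens that come after it.
hid_dim, kernel_size = 256, 3
conv = nn.Conv1d(hid_dim, 2 * hid_dim, kernel_size)   # no symmetric padding here

x = torch.randn(1, hid_dim, 1)                        # only <sos> generated so far
pad = torch.zeros(1, hid_dim, kernel_size - 1)        # stand-in for embedded <pad> tokens
padded = torch.cat([pad, x], dim=2)                   # [batch, hid_dim, kernel_size]

conved = F.glu(conv(padded), dim=1)                   # [batch, hid_dim, 1]
```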

Attention

Attention is applied at two points: one is tied to the emb dim → hid dim projection and the other to the hid dim → out dim projection, that is, before the convolution and after the convolution.

The conved and combined vectors coming in from the encoder are what the decoder attends over; they tell the network which parts of the source to put focus (attention) on.
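Below is a simplified, hypothetical sketch of that attention step: it scores decoder positions against encoder conved and builds a context from encoder combined, assuming all tensors already share one feature size (the real model projects between emb dim and hid dim around this step).

```python
import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_conved, encoder_combined):
    # decoder_state:    [batch, trg_len, dim]
    # encoder_conved:   [batch, src_len, dim]  -> used to score source positions
    # encoder_combined: [batch, src_len, dim]  -> used to build the context
    energy = torch.matmul(decoder_state, encoder_conved.permute(0, 2, 1))
    attention = F.softmax(energy, dim=2)                 # [batch, trg_len, src_len]
    context = torch.matmul(attention, encoder_combined)  # weighted sum of combined vectors
    return attention, context

dec = torch.randn(1, 4, 256)       # 4 target positions, made-up dim
conved = torch.randn(1, 6, 256)    # 6 source positions
combined = torch.randn(1, 6, 256)
attn, ctx = attend(dec, conved, combined)   # attn: [1, 4, 6], ctx: [1, 4, 256]
```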

That’s all for now folks!

The entire implementation can be found in the Colab notebook here.
