| | NBViewer | Google Colab |
|---|---|---|
| Old Code - French to English | | |
| New Code - English to French w/ GloVe Embeddings | | |
In case PyTorch someday decides to remove the `data.zip` file, I've added a copy of it to this repository.
Some of the dataset code was changed so that it works with the PyTorch Lightning DataModule and Model. For example, `build_vocab_from_iterator` is used to build a `Vocab` object, which is later used with the pretrained word embeddings to map our vocabulary onto GloVe's vocabulary.
```python
def prepare_langs(self, lang_file='eng-fra', reverse=True):
    with urlopen(self.zip_url) as f:
        with BytesIO(f.read()) as b, ZipFile(b) as datazip:
            lang1, lang2 = lang_file.split('-')
            pairs = readPairs(datazip, lang1, lang2, reverse)

    print("Read %s sentence pairs" % len(pairs))
    pairs = filterPairs(pairs)
    print("Trimmed to %s sentence pairs" % len(pairs))
    print("Counting words...")

    input_sentences, target_sentences = zip(*pairs)
    input_lang = build_vocab_from_iterator(
        [sentence.split(' ') for sentence in input_sentences],
        specials=special_tokens
    )
    output_lang = build_vocab_from_iterator(
        [sentence.split(' ') for sentence in target_sentences],
        specials=special_tokens
    )

    setattr(input_lang, 'name', lang2 if reverse else lang1)
    setattr(output_lang, 'name', lang1 if reverse else lang2)
    setattr(input_lang, 'n_words', len(input_lang))
    setattr(output_lang, 'n_words', len(output_lang))

    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)

    return input_lang, output_lang, pairs
```
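For context, here is a minimal structural sketch of how `prepare_langs` could sit inside a LightningDataModule. The class name, constructor arguments, and the DataLoader wiring below are my assumptions for illustration, not the repository's actual code.

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader

class TranslationDataModule(pl.LightningDataModule):  # hypothetical name
    def __init__(self, zip_url, batch_size=32):
        super().__init__()
        self.zip_url = zip_url        # used by prepare_langs via urlopen
        self.batch_size = batch_size

    # prepare_langs (shown above) would be defined here as a method

    def setup(self, stage=None):
        # build the vocabs and the sentence pairs once before training
        self.input_lang, self.output_lang, self.pairs = self.prepare_langs()

    def train_dataloader(self):
        # a real collate_fn would turn sentence pairs into padded index tensors
        return DataLoader(self.pairs, batch_size=self.batch_size, shuffle=True)
```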
The encoder of a seq2seq network is an RNN that outputs some value for every word from the input sentence. For every input word the encoder outputs a vector and a hidden state, and uses the hidden state for the next input word.
If only the context vector is passed between the encoder and decoder, that single vector carries the burden of encoding the entire sentence.
Attention allows the decoder network to “focus” on a different part of the encoder’s outputs for every step of the decoder’s own outputs. First we calculate a set of attention weights. These will be multiplied by the encoder output vectors to create a weighted combination. The result (called `attn_applied` in the code) should contain information about that specific part of the input sequence, and thus help the decoder choose the right output words.

Calculating the attention weights is done with another feed-forward layer, `attn`, using the decoder’s input and hidden state as inputs. Because there are sentences of all sizes in the training data, to actually create and train this layer we have to choose a maximum sentence length (input length, for encoder outputs) that it can apply to. Sentences of the maximum length will use all the attention weights, while shorter sentences will only use the first few.
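As a concrete reference, here is a minimal sketch of that attention step, closely following the PyTorch tutorial's `AttnDecoderRNN`; the `max_length=10` default and the dropout value are illustrative assumptions rather than this repository's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=10):
        super().__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        # attn maps [embedded ; hidden] -> one score per encoder position
        self.attn = nn.Linear(hidden_size * 2, max_length)
        self.attn_combine = nn.Linear(hidden_size * 2, hidden_size)
        self.dropout = nn.Dropout(dropout_p)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, input, hidden, encoder_outputs):
        # input: [1, 1] token index; hidden: [1, 1, H]; encoder_outputs: [max_length, H]
        embedded = self.dropout(self.embedding(input).view(1, 1, -1))
        # attention weights from the decoder's input embedding and hidden state
        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), dim=1)), dim=1)
        # weighted combination of the encoder outputs -> attn_applied
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))
        output = torch.cat((embedded[0], attn_applied[0]), dim=1)
        output = self.attn_combine(output).unsqueeze(0)
        output, hidden = self.gru(F.relu(output), hidden)
        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights
```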
GloVe Embeddings

Global Vectors for Word Representation, or GloVe, is an “unsupervised learning algorithm for obtaining vector representations for words.” Simply put, GloVe allows us to take a corpus of text, and intuitively transform each word in that corpus into a position in a high-dimensional space. This means that similar words will be placed together.
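As a tiny illustration of that idea (not from the original post), you can compare cosine similarities between GloVe vectors; the 6B/100-d variant here is chosen only because it is a small download.

```python
import torch.nn.functional as F
from torchtext.vocab import GloVe

glove = GloVe(name='6B', dim=100)   # downloads the vectors on first use

king, queen, banana = glove['king'], glove['queen'], glove['banana']
print(F.cosine_similarity(king, queen, dim=0))    # relatively high: related words sit close together
print(F.cosine_similarity(king, banana, dim=0))   # relatively low: unrelated words sit far apart
```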
I found this nice way of using embeddings with `GloVe` vectors:
```python
from torchtext.vocab import GloVe, vocab
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
import torch
import torch.nn as nn

# define your model that accepts pretrained embeddings
class TextClassificationModel(nn.Module):
    def __init__(self, pretrained_embeddings, num_class, freeze_embeddings=False):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.EmbeddingBag.from_pretrained(pretrained_embeddings,
                                                         freeze=freeze_embeddings,
                                                         sparse=True)
        self.fc = nn.Linear(pretrained_embeddings.shape[1], num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

train_iter = AG_NEWS(split='train')
num_class = len(set([label for (label, _) in train_iter]))

unk_token = "<unk>"
unk_index = 0
glove_vectors = GloVe()
# min_freq=0 keeps the token whose stoi index is 0, so the vocab stays aligned with the vectors
glove_vocab = vocab(glove_vectors.stoi, min_freq=0)
glove_vocab.insert_token("<unk>", unk_index)
# this is necessary, otherwise it will throw a runtime error if an OOV token is queried
glove_vocab.set_default_index(unk_index)
pretrained_embeddings = glove_vectors.vectors
# prepend an all-zero row for the <unk> token so indices line up with the vocab
pretrained_embeddings = torch.cat((torch.zeros(1, pretrained_embeddings.shape[1]), pretrained_embeddings))

# instantiate model with pre-trained glove vectors
glove_model = TextClassificationModel(pretrained_embeddings, num_class)

tokenizer = get_tokenizer("basic_english")
train_iter = AG_NEWS(split='train')
example_text = next(train_iter)[1]
tokens = tokenizer(example_text)
indices = glove_vocab(tokens)
text_input = torch.tensor(indices)
offset_input = torch.tensor([0])
model_output = glove_model(text_input, offset_input)
```
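If you run the snippet above, a quick sanity check (assuming AG_NEWS's four classes) is that the model returns one row of class scores per bag:

```python
print(model_output.shape)  # expected: torch.Size([1, 4]) -- one bag, one score per class
```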
And for using Pretrained Embeddings with an existing Vocab object
```python
import torch.nn as nn
import torchtext

min_freq = 5
special_tokens = ['<unk>', '<pad>']

vocab = torchtext.vocab.build_vocab_from_iterator(train_data['tokens'],
                                                  min_freq=min_freq,
                                                  specials=special_tokens)
# train_data['tokens'] is a list of lists of strings, e.g. [['hello', 'world'], ['goodbye', 'moon']],
# where ['hello', 'world'] is the token list for the first example in the training set.

pad_index = vocab['<pad>']  # index of the padding token, used as padding_idx below

pretrained_vectors = torchtext.vocab.FastText()
pretrained_embedding = pretrained_vectors.get_vecs_by_tokens(vocab.get_itos())
# vocab.get_itos() returns a list of strings (tokens), where the token at position i is the one
# the vocab maps to index i.
# get_vecs_by_tokens looks up the pre-trained vector for each token in that list,
# therefore pretrained_embedding is a fully "aligned" embedding matrix.

class NBoW(nn.Module):
    def __init__(self, vocab_size, embedding_dim, output_dim, pad_index):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_index)
        self.fc = nn.Linear(embedding_dim, output_dim)

    def forward(self, text):
        # text = [batch size, seq len]
        embedded = self.embedding(text)
        # embedded = [batch size, seq len, embedding dim]
        pooled = embedded.mean(dim=1)
        # pooled = [batch size, embedding dim]
        prediction = self.fc(pooled)
        # prediction = [batch size, output dim]
        return prediction

vocab_size = len(vocab)
embedding_dim = 300
output_dim = n_classes

model = NBoW(vocab_size, embedding_dim, output_dim, pad_index)
# super basic model here; the important thing is that the nn.Embedding layer is created as
# nn.Embedding(vocab_size, embedding_dim) with embedding_dim = 300, since that is the
# dimensionality of the FastText embeddings.

model.embedding.weight.data = pretrained_embedding
# overwrite the model's initial embedding matrix weights with the pre-trained FastText embeddings
```
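If you also want to keep those copied vectors fixed during training (as the earlier `EmbeddingBag` example does via `freeze_embeddings`), one option, not shown in the original snippet, is to turn off their gradients:

```python
# optional: freeze the pretrained embeddings so they are not updated during training
model.embedding.weight.requires_grad = False
```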
And this is how I integrated GloVe Embeddings into this assignment
```python
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, use_pretrained=False, vocab_itos=None):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        if use_pretrained and vocab_itos is None:
            raise ValueError('`use_pretrained=True` with `vocab_itos=None`, please provide the vocab itos list')

        if use_pretrained:
            glove_vec = torchtext.vocab.GloVe(name='6B')
            glove_emb = glove_vec.get_vecs_by_tokens(vocab_itos)
            self.embedding = nn.Embedding.from_pretrained(glove_emb, padding_idx=PAD_token)
        else:
            self.embedding = nn.Embedding(input_size, hidden_size)

        assert self.embedding.embedding_dim == hidden_size, \
            f'hidden_size must equal embedding dim, found hidden_size={hidden_size}, embedding_dim={self.embedding.embedding_dim}'

        self.gru = nn.GRU(hidden_size, hidden_size)
```
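And this is roughly how the encoder would be instantiated with the vocab built earlier; the `dm` handle and `hidden_size=300` are assumptions here, 300 being the dimensionality of the GloVe 6B vectors that torchtext loads by default.

```python
# hypothetical usage sketch: `dm` is the datamodule exposing prepare_langs from above
input_lang, output_lang, pairs = dm.prepare_langs('eng-fra', reverse=True)

encoder = EncoderRNN(
    input_size=input_lang.n_words,      # vocab size
    hidden_size=300,                    # must match the GloVe embedding dim (see the assert)
    use_pretrained=True,
    vocab_itos=input_lang.get_itos(),   # token list aligned with the vocab's indices
)
```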
Consider the task of sequence prediction: you want to predict the next element of a sequence $e_t$ given the previous elements of this sequence $e_1, \dots, e_{t-1}$. Teacher forcing is about forcing the predictions to be based on the correct history (i.e. the correct sequence of past elements) rather than on the predicted history (which may not be correct). To be more concrete, let $\hat{e}_i$ denote the $i$-th predicted element of the sequence and let $e_i$ be the corresponding ground truth. Then, if you use teacher forcing, to predict $e_t$, rather than using the predictions $\hat{e}_1, \dots, \hat{e}_{t-1}$, you would use the ground truths $e_1, \dots, e_{t-1}$. (ai.stackexchange)
Here’s another explanation:

Teacher forcing is like a teacher correcting a student as the student is being trained on a new concept. Because the right input is given by the teacher during training, the student learns the new concept faster and more efficiently.

When training with teacher forcing, we randomly choose whether to force; when we do, we feed the decoder the actual target output of the previous time step instead of the decoder's own predicted output from the previous time step.
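In code, that random choice is typically a single comparison against a ratio hyperparameter; the PyTorch tutorial uses 0.5, which is what I assume here:

```python
import random

teacher_forcing_ratio = 0.5  # probability of forcing on any given example
use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False
```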
```python
if use_teacher_forcing:
    # Teacher forcing: feed the target as the next input
    for di in range(target_length):
        decoder_output, decoder_hidden, decoder_attention = self.attn_decoder(
            decoder_input, decoder_hidden, encoder_outputs)
        loss += self.criterion(decoder_output, target_tensor[di])
        decoder_input = target_tensor[di]  # Teacher forcing
else:
    # Without teacher forcing: use its own predictions as the next input
    for di in range(target_length):
        decoder_output, decoder_hidden, decoder_attention = self.attn_decoder(
            decoder_input, decoder_hidden, encoder_outputs)
        topv, topi = decoder_output.topk(1)
        decoder_input = topi.squeeze().detach()  # detach from history as input
        loss += self.criterion(decoder_output, target_tensor[di])
        if decoder_input.item() == EOS_token:
            break
```
But why do we really have to do this?
Let's assume we have a slightly trained network for the encoder and the decoder.
And these are our sentences:
```
SRC: <SOS> hi satyajit how are you ? <EOS>
TGT: <SOS> salut satyajit comment vas-tu ? <EOS>
```
After the entire `SRC` is fed to the encoder word by word, we end up with some embeddings, which will be fairly meaningless since the model isn't trained that well yet.

This is what the decoder will see:
| INPUT | PRED |
|---|---|
| [SOS] | a |
| [SOS] a | a ?? |
| [SOS] a ?? | a ?? ?? |
See how difficult it is for the decoder RNN to decode from these meaningless sequences; this makes the model unstable and very difficult to train. That is why we randomly feed the target sentence itself to the decoder:
| INPUT (TEACHER FORCED) | PRED |
|---|---|
| [SOS] | ?? |
| [SOS] ?? | ?? |
| [SOS] ?? satyajit | ?? satyajit |
| [SOS] ?? satyajit how | ?? satyajit comment |
Something like the above: since the decoder is fed the actual target words as its previous inputs, it gets to learn better.
```
[KEY: > input, = target, < output]

> he s not going .
= il ne s y rend pas .
< il ne s y y . <EOS>

> we re not happy .
= nous ne sommes pas heureuses .
< nous ne sommes pas heureux . <EOS>

> we re too old .
= nous sommes trop vieux .
< nous sommes trop vieux . <EOS>

> i m not a crook .
= je ne suis pas un escroc .
< je ne suis pas un . <EOS>

> you re free of all responsibility .
= vous etes liberee de toute responsabilite .
< vous etes liberee de toute responsabilite . <EOS>

> i m sorry we re completely sold out .
= je suis desole nous avons ete devalises .
< je suis desole nous avons tout vendu . <EOS>

> you are the one .
= vous etes l elu .
< vous etes celui la . <EOS>

> they re all dead .
= elles sont toutes mortes .
< ils sont tous des . <EOS>

> he s always late for school .
= il est toujours en retard a l ecole .
< il est toujours en retard a l ecole . <EOS>

> he is busy .
= il a a faire .
< il a l l l l <EOS>
```
```
input = i m very impressed by your work .
output = je suis tres par par votre travail . <EOS>

input = we re smart .
output = nous sommes intelligents . <EOS>

input = i m still hungry .
output = j ai toujours faim . <EOS>

input = he is very eager to go there .
output = il est tres sensible de partir . <EOS>

input = i m sorry we re completely sold out .
output = je suis desole nous avons tout vendu . <EOS>
```
Thanks for reading, have a great day 😄
:wq satyajit