Embeddings in Keras: Train vs. Pretrained

Most state-of-the-art NLP applications, such as machine translation and summarization, are now based on recurrent neural networks (RNNs). More often than not, we need to choose a word representation beforehand.

Here are two ways of creating word representations:

  1. One-hot Encoding: A simple method is to represent each word with a one-hot vector. Suppose your vocabulary contains 50K words; the nth word is then represented as a 50K-dimensional vector that is all 0s except for a 1 at the nth position. With such a large vocabulary, however, this sparse representation is very inefficient.

  2. Word Embeddings (⭐️): Ideally, you'd want similar words to have similar representations, making it easy for the model to generalize what it learns about a word to all similar words. For example, the representation for "car" should be more similar to "lorry" than to, say, "pasta". This is the idea behind word embeddings (see the short sketch after this list).
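
To make the contrast concrete, here's a minimal sketch of the two representations (the vocabulary size, embedding size, and word index are illustrative, and the embedding values are random placeholders rather than learned weights):

import numpy as np

vocab_size = 50000        # illustrative vocabulary size
embedding_dim = 8         # illustrative embedding size
word_index = 7            # hypothetical index of "car" in the vocabulary

# One-hot: a sparse 50K-dimensional vector, all 0s except a single 1
one_hot_vec = np.zeros(vocab_size)
one_hot_vec[word_index] = 1.0

# Embedding: a small dense vector; in practice its values are learned
embedding_vec = np.random.uniform(-0.05, 0.05, embedding_dim)

print(one_hot_vec.shape)    # (50000,)
print(embedding_vec.shape)  # (8,)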

In a Nutshell:

  • Word embeddings provide a dense representation of words and their relative meanings.

  • They are an improvement over the sparse representations used in simpler bag-of-words models.

  • Word embeddings can be learned from text data and reused among projects. They can also be learned as part of fitting a neural network on text data.


Let's explore two different ways to add an embedding layer in Keras:

  1. Train your own embedding layer
  2. Use a pretrained embedding (like GloVe)

Import Dependencies and Load Toy Data

import re
import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import one_hot
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
# Define documents
docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!',
        'Weak', 'Poor effort!', 'not good', 'poor work', 'Could have done better.']

# Define class labels
labels = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

1. Train Your Own Embedding (oe) Layer

Note that we're using the Keras Sequential API here to do the job; the "oe" suffix on the variable names below stands for "own embedding".

Encode the documents in docs as integer sequences with Keras' one_hot() function (despite its name, it hashes each word to an integer index rather than producing one-hot vectors, so different words can collide when the vocabulary size is small):

own_embedding_vocab_size = 10
encoded_docs_oe = [one_hot(d, own_embedding_vocab_size) for d in docs]
print(encoded_docs_oe)

Output:

[[2, 6], [5, 9], [2, 9], [4, 9], [2], [7], [2, 9], [3, 5], [2, 9], [5, 9, 6, 3]]

Pad the documents so they are all the same length:

maxlen = 5
padded_docs_oe = pad_sequences(encoded_docs_oe, maxlen=maxlen, padding='post')
print(padded_docs_oe)

Output:

[[2 6 0 0 0]
 [5 9 0 0 0]
 [2 9 0 0 0]
 [4 9 0 0 0]
 [2 0 0 0 0]
 [7 0 0 0 0]
 [2 9 0 0 0]
 [3 5 0 0 0]
 [2 9 0 0 0]
 [5 9 6 3 0]]

Define the model:

model = Sequential()
model.add(Embedding(input_dim=own_embedding_vocab_size, # 10
                    output_dim=32, 
                    input_length=maxlen))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

Compile and train the model:

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])  # Compile the model
print(model.summary())  # Summarize the model
model.fit(padded_docs_oe, labels, epochs=50, verbose=0)  # Fit the model
loss, accuracy = model.evaluate(padded_docs_oe, labels, verbose=0)  # Evaluate the model
print('Accuracy: %0.3f' % accuracy)
> _________________________________________________________________
> Layer (type)                 Output Shape              Param #   
> =================================================================
> embedding_1 (Embedding)      (None, 5, 32)             320       
> _________________________________________________________________
> flatten_1 (Flatten)          (None, 160)               0         
> _________________________________________________________________
> dense_1 (Dense)              (None, 1)                 161       
> =================================================================
> Total params: 481
> Trainable params: 481
> Non-trainable params: 0
> _________________________________________________________________
> None
> Accuracy: 0.800
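
As a quick sanity check (not part of the original listing), you can score a new phrase with the trained model. The phrase below is just an illustration, and the printed probability will vary from run to run:

# Encode and pad a new document exactly like the training data
new_docs = ['Good work done']
encoded_new = [one_hot(d, own_embedding_vocab_size) for d in new_docs]
padded_new = pad_sequences(encoded_new, maxlen=maxlen, padding='post')

# predict() returns the sigmoid output, i.e. the probability of the positive class
print(model.predict(padded_new))  # e.g. [[0.7...]] -> closer to 1 means "positive"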

2. Use a Pretrained GloVe Embedding (ge) Layer

Note that we're using the Keras functional API here to do the job; the "ge" suffix on the variable names below stands for "GloVe embedding".

Download the pretrained GloVe vectors and use the load_glove_embeddings() helper function to load them:

from load_glove_embeddings import load_glove_embeddings

word2index, embedding_matrix = load_glove_embeddings('data_embeddings/en/glove.6B.50d.txt', embedding_dim=50)
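
The load_glove_embeddings() helper isn't shown in this post. If you don't have it handy, a minimal sketch of such a loader might look like the following. It assumes the standard GloVe text format (one word followed by its vector per line) and reserves index 0 for padding, which is consistent with the 20,000,050 non-trainable parameters reported in the model summary further down:

def load_glove_embeddings(path, embedding_dim=50):
    """Minimal sketch: build a word -> index map and an index -> vector matrix."""
    word2index = {}
    vectors = [np.zeros(embedding_dim, dtype='float32')]  # row 0 is reserved for padding
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word2index[parts[0]] = len(vectors)            # word indices start at 1
            vectors.append(np.asarray(parts[1:], dtype='float32'))
    return word2index, np.vstack(vectors)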

Encode the documents in docs as sequences of GloVe word indices with our custom_tokenize() function, which requires the word2index mapping from the previous step:

def custom_tokenize(docs):
    output_matrix = []
    for d in docs:
        indices = []
        for w in d.split():
            # Strip punctuation, lowercase, and look up the word's GloVe index
            indices.append(word2index[re.sub(r'[^\w\s]', '', w).lower()])
        output_matrix.append(indices)
    return output_matrix
    
# Encode docs with our special "custom_tokenize" function
encoded_docs_ge = custom_tokenize(docs)
print(encoded_docs_ge)

Output:

[[143, 751], [219, 161], [353, 968], [3082, 161], [4345], [2690], [992, 968], [36, 219], [992, 161], [94, 33, 751, 439]]
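
Note that custom_tokenize() raises a KeyError for any word that isn't in the GloVe vocabulary. The toy docs above are fully covered, but for real data a more defensive variant might fall back to the padding index 0 for unknown words (the fallback choice here is just an illustration):

def custom_tokenize_safe(docs):
    output_matrix = []
    for d in docs:
        # Map out-of-vocabulary words to index 0 (the padding row)
        indices = [word2index.get(re.sub(r'[^\w\s]', '', w).lower(), 0) for w in d.split()]
        output_matrix.append(indices)
    return output_matrix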

Pad the documents so they are all the same length:

# Pad documents to a max length of 5 words
maxlen = 5
padded_docs_ge = pad_sequences(encoded_docs_ge, maxlen=maxlen, padding='post')
print(padded_docs_ge)

Output:

[[ 143  751    0    0    0]
 [ 219  161    0    0    0]
 [ 353  968    0    0    0]
 [3082  161    0    0    0]
 [4345    0    0    0    0]
 [2690    0    0    0    0]
 [ 992  968    0    0    0]
 [  36  219    0    0    0]
 [ 992  161    0    0    0]
 [  94   33  751  439    0]]

Define the model (note that the embedding_matrix variable supplies the pretrained weights here, and trainable=False keeps them frozen):

from keras.models import Model
from keras.layers import Input

embedding_layer = Embedding(input_dim=embedding_matrix.shape[0],
                            output_dim=embedding_matrix.shape[1], 
                            input_length=maxlen,
                            weights=[embedding_matrix], 
                            trainable=False, 
                            name='embedding_layer')

i = Input(shape=(maxlen,), dtype='int32', name='main_input')
x = embedding_layer(i)
x = Flatten()(x)
o = Dense(1, activation='sigmoid')(x)

model = Model(inputs=i, outputs=o)

Compile and train the model:

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])  # Compile the model
print(model.summary())  # Summarize the model
model.fit(padded_docs_ge, labels, epochs=50, verbose=0)  # Fit the model
loss, accuracy = model.evaluate(padded_docs_ge, labels, verbose=0)  # Evaluate the model
print('Accuracy: %0.3f' % accuracy)
> _________________________________________________________________
> Layer (type)                 Output Shape              Param #   
> =================================================================
> main_input (InputLayer)      (None, 5)                 0         
> _________________________________________________________________
> embedding_layer (Embedding)  (None, 5, 50)             20000050  
> _________________________________________________________________
> flatten_2 (Flatten)          (None, 250)               0         
> _________________________________________________________________
> dense_2 (Dense)              (None, 1)                 251       
> =================================================================
> Total params: 20,000,301
> Trainable params: 251
> Non-trainable params: 20,000,050
> _________________________________________________________________
> None
> Accuracy: 1.000

In a Nutshell

Here's the main difference:

  1. Own Embedding:
embedding_layer_1 = Embedding(input_dim=own_embedding_vocab_size, 
                              output_dim=32, 
                              input_length=maxlen)
  2. Pretrained Embedding (requires embedding_matrix):
embedding_layer_2 = Embedding(input_dim=embedding_matrix.shape[0],
                              output_dim=embedding_matrix.shape[1], 
                              input_length=maxlen,
                              weights=[embedding_matrix], 
                              trainable=False)

If you enjoyed this post and want to buy me a cup of coffee...

The thing is, I'll always accept a cup of coffee. So feel free to buy me one.

Cheers! ☕️