The motivation to write this piece came from the FastAI course, chapter 7 on collaborative filtering. I had used embeddings in the past and understood their relevance and context in the NLP world, but in the lecture Jeremy takes you under the hood and explains what an embedding layer is. It was an aha! moment for me and motivated me to go back to some old code I had saved and run it again.

In image processing tasks, we have images, and images have RGB channels that quantify the redness, greenness, and blueness of the image; these can be fed to a neural net or any other machine learning algorithm to learn about the image.

But there are problems that deal with categorical data, such as:

  • written text, in which the relationship and context among the words aren't provided to us, or
  • recommendation systems in which the relationship between the users and products isn't available to us in any usable format.

The relationship could be users from a particular age cohort, geography, or sex liking a particular product; or it could be words said in a particular context: "Coffee is as good as water" carries a different connotation in an airline review than in a health magazine.

How can we determine these relationships and contexts, so the ML models can use them? The answer is, we don't. We let the ML models learn them.

  1. The method is straightforward: we assign random values to, say, users and products (either a single float or a vector of values for each user and each product); we call these random values latent factors.
  2. With some mathematical calculation, typically a dot product, we find the approximate interaction between the user and the product.
  3. We compare this approximate interaction with the observed "interaction values" such as ratings, time spent, money spent, or any other relevant metric, and find how far off our approximations are.
  4. We use stochastic gradient descent to find the gradients, update the values we randomly assigned in step 1, and repeat steps 2, 3, and 4 (see the sketch after this list).
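
To make this concrete, here is a minimal, hypothetical NumPy sketch of the four steps above; the users, products, ratings, and dimensions are all made up for illustration, and real libraries wrap the same idea far more efficiently.

import numpy as np

# Toy setup: 4 users, 3 products, latent factors of length 5 (all made-up numbers)
rng = np.random.default_rng(42)
user_factors = rng.normal(scale=0.1, size=(4, 5))        # step 1: random latent factors
product_factors = rng.normal(scale=0.1, size=(3, 5))

# Observed interactions: (user_index, product_index, rating)
ratings = [(0, 0, 4.0), (0, 2, 1.0), (1, 1, 5.0), (2, 0, 3.0), (3, 2, 2.0)]

lr = 0.05
for epoch in range(100):
    for u, p, r in ratings:
        pred = user_factors[u] @ product_factors[p]       # step 2: dot product = approximate interaction
        err = pred - r                                    # step 3: how far off are we?
        u_grad = err * product_factors[p]                 # step 4: gradient of the squared error
        p_grad = err * user_factors[u]
        user_factors[u] -= lr * u_grad                    #         update both latent vectors
        product_factors[p] -= lr * p_grad

# The rows of user_factors and product_factors are now the learned embeddings
print(user_factors[0] @ product_factors[0])               # should be close to 4.0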

These updated values, or vectors of a certain length, are the embeddings. They encode the intrinsic relationships between the various factors and the context of the particular problem at hand.

In the beginning, those random values didn't mean anything, as they were chosen randomly; but by the end of training they do, because they have learned the hidden relationships from the existing data.

This YouTube video has more information, especially the part from 1:18:30 to 1:20:10

To calculate the result/interaction for a particular product and user combination, we have to look up the index of the product in our product latent factor matrix and the index of the user in our user latent factor matrix; then we can take the dot product of the two latent factor vectors. But looking up an index is not an operation deep learning models know how to do: they know how to do matrix products and activation functions, but not lookups.

We can, however, represent an index lookup as a matrix product. The trick is to replace our indices with one-hot-encoded vectors, e.g. for index 2 in a length-5 vector (counting positions from 1), the encoded vector would be [0, 1, 0, 0, 0].

An embedding is a computational shortcut for multiplying something by a one-hot-encoded vector.
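
A quick NumPy illustration of this shortcut (the latent-factor values here are arbitrary, purely for the demonstration): multiplying a one-hot vector by a matrix returns exactly the row at that index.

import numpy as np

latent_factors = np.arange(15.0).reshape(5, 3)   # 5 items, 3 latent factors each (arbitrary values)

index = 2                                        # zero-based, as Python counts
one_hot = np.zeros(5)
one_hot[index] = 1.0                             # the one-hot encoding of the index

via_matmul = one_hot @ latent_factors            # matrix product with the one-hot vector
via_lookup = latent_factors[index]               # plain index lookup

print(np.allclose(via_matmul, via_lookup))       # True: the two give the same vector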

Multiplying by one-hot-encoded vectors is fine for a handful of indices, but in practice creating that many encoded vectors causes memory-management issues.

Most deep learning libraries avoid this problem by including a special layer that performs the lookup: it indexes into a matrix using an integer, and its gradient is calculated so that it is the same as that of a matrix multiplication with a one-hot-encoded vector.

This layer is called Embedding.
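
In PyTorch, for example, this layer is nn.Embedding. A minimal sketch (the sizes here are arbitrary) showing that it is just a learnable lookup table whose result matches the one-hot matrix product:

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=5, embedding_dim=3)   # a table of 5 vectors, 3 components each

indices = torch.tensor([2, 4])                           # plain integer indices, no one-hot needed
vectors = emb(indices)                                   # looks up rows 2 and 4 of emb.weight
print(vectors.shape)                                     # torch.Size([2, 3])

# The same result via an explicit one-hot matrix product
one_hot = nn.functional.one_hot(indices, num_classes=5).float()
print(torch.allclose(one_hot @ emb.weight, vectors))     # True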

Now that we have some understanding of what an embedding is, let's try to create our own.


Problem Statement: Create word embeddings for the Game of Thrones books

Library to use: Word2Vec from Gensim

The data is available on Kaggle here; it is a plain-text version of the five Game of Thrones books.

Downloads

!pip install gensim
!pip install nltk
Requirement already satisfied: gensim in /opt/conda/lib/python3.7/site-packages (4.0.1)
Requirement already satisfied: numpy>=1.11.3 in /opt/conda/lib/python3.7/site-packages (from gensim) (1.21.6)
Requirement already satisfied: scipy>=0.18.1 in /opt/conda/lib/python3.7/site-packages (from gensim) (1.7.3)
Requirement already satisfied: smart-open>=1.8.1 in /opt/conda/lib/python3.7/site-packages (from gensim) (5.2.1)
Requirement already satisfied: nltk in /opt/conda/lib/python3.7/site-packages (3.7)
Requirement already satisfied: regex>=2021.8.3 in /opt/conda/lib/python3.7/site-packages (from nltk) (2021.11.10)
Requirement already satisfied: click in /opt/conda/lib/python3.7/site-packages (from nltk) (8.0.4)
Requirement already satisfied: tqdm in /opt/conda/lib/python3.7/site-packages (from nltk) (4.64.0)
Requirement already satisfied: joblib in /opt/conda/lib/python3.7/site-packages (from nltk) (1.0.1)
Requirement already satisfied: importlib-metadata in /opt/conda/lib/python3.7/site-packages (from click->nltk) (4.12.0)
Requirement already satisfied: typing-extensions>=3.6.4 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata->click->nltk) (4.3.0)
Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata->click->nltk) (3.8.0)

Imports and File reads

import glob
import codecs
import nltk
from nltk.corpus import stopwords
import re
import multiprocessing


import gensim.models.word2vec as w2v
import sklearn.manifold
import pandas as pd
path = "../input/game-of-thrones-book-files/"
book_filenames = sorted(glob.glob(path + "*.txt"))
print("Found books:")
for file in book_filenames:
    print(file[len(path):])  # strip the directory prefix, leaving just the file name
Found books:
got1.txt
got2.txt
got3.txt
got4.txt
got5.txt

Combine the books into one long string

corpus_raw = u""
for book_filename in book_filenames:
    print("Reading '{0}'...".format(book_filename))
    with codecs.open(book_filename, "r", "utf-8") as book_file:
        corpus_raw += book_file.read()
    print("Corpus is now {0} characters long".format(len(corpus_raw)))
    print()
Reading '../input/game-of-thrones-book-files/got1.txt'...
Corpus is now 1770659 characters long

Reading '../input/game-of-thrones-book-files/got2.txt'...
Corpus is now 4071041 characters long

Reading '../input/game-of-thrones-book-files/got3.txt'...
Corpus is now 6391405 characters long

Reading '../input/game-of-thrones-book-files/got4.txt'...
Corpus is now 8107945 characters long

Reading '../input/game-of-thrones-book-files/got5.txt'...
Corpus is now 9719485 characters long

Data Preprocessing and Cleaning

Preprocess the data

Split the corpus into sentences

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
nltk.download("punkt")
nltk.download("stopwords")
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
True

This was somehow taking too long, so I created the raw_sentences list on my local machine and uploaded it for use here.

raw_sentences = tokenizer.tokenize(corpus_raw)

Reading the uploaded file

raw_sentences = []

with open(r'../input/sentences-corpus/sentences.txt', 'r') as fp:
    for line in fp:
        x = line[:-1]  # drop the trailing newline character
        raw_sentences.append(x)
len(raw_sentences)
188855

From sentences to lists of words, removing any unnecessary special characters using re

def sentence_to_wordlist(raw):
    clean = re.sub("[^a-zA-Z]", " ", raw)  # replace anything that is not a letter with a space
    words = clean.split()
    return words
stop_words = set(stopwords.words('english'))

Remove the stop words, such as prepositions, conjunctions, etc.

filtered_sentences = []
for raw_sentence in raw_sentences:
    if len(raw_sentence) > 0:
        a = sentence_to_wordlist(raw_sentence)
        filtered_sentences.append([w for w in a if not w.lower() in stop_words])
len(filtered_sentences)
157925
c = 0
for x in filtered_sentences:
    c = c + len(x)

print("The filtered corpus contains {0:,} tokens".format(c))
The filtered corpus contains 917,562 tokens

The full book corpus has 1,818,103 tokens, while the filtered, sanitised version has 917,562 tokens.

Word2Vec Model Training

Define the parameters for the Word2vec model

num_workers = multiprocessing.cpu_count() # For multi-threading
downsampling = 1e-3 # Downsample rate for frequently occurring words, so they don't dominate the vocab
seed = 42 # For replication of the results
wordVectors = w2v.Word2Vec(
    sg=1, # 1 for skip-gram, 0 for CBOW
    seed=seed,
    workers=num_workers,
    vector_size=300, # Dimensionality of each word vector (can be any number)
    min_count=5, # The minimum number of occurrences for a word to be included in the vocabulary
    window=7, # How many neighbours of the word are taken into account for the context
    sample=downsampling
)

Build the vocab

wordVectors.build_vocab(filtered_sentences)
len(wordVectors.wv)
16967

_This is where we train the model; we could also pass wordVectors.corpus_total_words (recorded by build_vocab) for the total_words parameter_

wordVectors.train(filtered_sentences, total_words= len(wordVectors.wv), epochs= 15)
(13178142, 13763430)
wordVectors.wv.vectors.shape
(16967, 300)

The model is trained; let's see what we have created. The shape tells us that there are 16,967 word vectors, each with 300 components.
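
As a quick sanity check, an individual word's embedding can be pulled out by indexing the trained KeyedVectors with the word itself (using a word we will query again below):

# Fetch the learned 300-dimensional vector for a single word
stark_vector = wordVectors.wv["Stark"]
print(stark_vector.shape)   # (300,)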

Dense Matrix - T-SNE

The embeddings have 300 dimensions, so unless you're an extra-terrestrial being, you won't be able to visualise them. We'll use T-SNE to compress the 300 dimensions into 2 dimensions

tsne = sklearn.manifold.TSNE(n_components=2, random_state=0)
all_word_vectors_matrix = wordVectors.wv.vectors
all_word_vectors_matrix
array([[-0.15183511, -0.10445416, -0.15541989, ...,  0.14572449,
         0.04010338, -0.00964977],
       [ 0.09000773,  0.00260416, -0.21182461, ...,  0.19929099,
         0.07410478,  0.26346382],
       [ 0.00926181, -0.13957332, -0.12223979, ...,  0.12079053,
        -0.02541121,  0.05712756],
       ...,
       [ 0.00106871, -0.02786902, -0.03142978, ...,  0.0080044 ,
         0.01172699,  0.01934569],
       [ 0.00177492, -0.01036997, -0.0088923 , ...,  0.00428336,
         0.007435  ,  0.00769483],
       [ 0.00204526, -0.00720066, -0.00651641, ..., -0.00069571,
        -0.0011584 ,  0.00415785]], dtype=float32)

Here we compress the 300 dimensions down to 2. What TSNE does is find the nearest neighbours and preserve the neighbourhood structure while compressing, whereas PCA loses much of that local structure when compressing.

all_word_vectors_matrix_2d = tsne.fit_transform(all_word_vectors_matrix)
/opt/conda/lib/python3.7/site-packages/sklearn/manifold/_t_sne.py:783: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2.
  FutureWarning,
/opt/conda/lib/python3.7/site-packages/sklearn/manifold/_t_sne.py:793: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.
  FutureWarning,
all_word_vectors_matrix_2d
array([[-22.1214   , -40.20547  ],
       [-22.737371 , -51.337734 ],
       [ -1.2034475, -43.987854 ],
       ...,
       [-39.63466  ,  45.280827 ],
       [-68.378525 , -17.104765 ],
       [-23.636723 ,  10.487    ]], dtype=float32)

Create a dataframe for plotting

df = pd.DataFrame(
    [
        (word, coords[0], coords[1])
        for word, coords in [
            (word, all_word_vectors_matrix_2d[wordVectors.wv.key_to_index[word]])
            for word in wordVectors.wv.index_to_key
        ]
    ],
    columns=["word", "x", "y"]
)

The dataframe has the word and the x, y coordinates, which can be plotted as a scatter plot.

df.head(5)
word x y
0 said -22.121401 -40.205471
1 would -22.737371 -51.337734
2 one -1.203447 -43.987854
3 Lord -21.235813 -23.119661
4 could -22.683886 -51.672825

Plotting

Bokeh and Altair provide much nicer, interactive visualisations for exploring the embedding space.

from bokeh.plotting import figure, output_file, show, output_notebook
from bokeh.palettes import Category20
from bokeh.models import ColumnDataSource, Range1d, LabelSet, Label, CustomJS, Div, Button
from bokeh.layouts import column, row, widgetbox
from bokeh.models.widgets import Toggle
from bokeh import events
from bokeh.palettes import Spectral6
output_notebook()
from bokeh.transform import linear_cmap
from bokeh.transform import log_cmap

# create a colour map
#colourmap = {}
#for idx, cat in enumerate(categories): colourmap[cat] = Category20[len(categories)][idx]
#colors = [colourmap[x] for x in points['word']]



# create data source

source = ColumnDataSource(data=dict(
    x=df['x'],
    y=df['y'],
    name=df['word']
    
))

TOOLTIPS = [
    ("word", "@name"),
]



output_notebook()
#output_file(filename="TSNE.html", title="Word Embeddings Game of Thrones")


# create a new plot
p = figure(
   tools="pan,box_zoom,reset,save",
   title="Word Embeddings Visualisation Using TSNE",
   tooltips=TOOLTIPS,
   plot_width=1000, plot_height=1000
)

mapper = linear_cmap(field_name='x', palette="Viridis256" ,low=-80 ,high=60)

# add some renderers
p.circle('x', 'y', source=source, line_color=mapper, color=mapper)

labels = LabelSet(x='x', y='y', text='name', level='underlay',
              x_offset=5, y_offset=5, source=source, render_mode='canvas', text_font_size="8pt")

#p.add_layout(labels)

# show the results
show(p)

[bokeh-plot.png: scatter plot of the word embeddings projected to 2-D with T-SNE]

For an interactive Bokeh plot, please click here.

Let's look at some results

This lookup finds the plot coordinates of a word we are interested in.

df[df["word"] == "Jon"]
word x y
9 Jon -8.131994 -53.089184
wordVectors.wv.most_similar("Stark")
[('Ned', 0.871069610118866),
 ('Eddard', 0.8589808344841003),
 ('ward', 0.8371943831443787),
 ('Greyjoy', 0.8356505036354065),
 ('Arryn', 0.8288456201553345),
 ('Edmure', 0.7985324859619141),
 ('direwolf', 0.7942730784416199),
 ('Winterfell', 0.7936611175537109),
 ('Robb', 0.7925195693969727),
 ('Tully', 0.7905474305152893)]
wordVectors.wv.most_similar("Jon")
[('Sam', 0.7944141030311584),
 ('Catelyn', 0.7163089513778687),
 ('Snow', 0.7153167724609375),
 ('Qhorin', 0.7027862668037415),
 ('Theon', 0.6976580619812012),
 ('Jojen', 0.6847004890441895),
 ('Bear', 0.683414101600647),
 ('cry', 0.6755016446113586),
 ('Ned', 0.673988938331604),
 ('Mormont', 0.6729429364204407)]
wordVectors.wv.most_similar("Tyrion")
[('Varys', 0.7423267960548401),
 ('Cersei', 0.7259597778320312),
 ('Jaime', 0.7211229205131531),
 ('Bronn', 0.7203950881958008),
 ('Dany', 0.7153236865997314),
 ('Littlefinger', 0.6874520778656006),
 ('Catelyn', 0.6540576815605164),
 ('Davos', 0.6521568298339844),
 ('queen', 0.6493181586265564),
 ('moment', 0.6428911685943604)]

Of course, the results are as they should be


wordVectors.wv.most_similar("Cersei")
[('Jaime', 0.9132179021835327),
 ('dwarf', 0.8904335498809814),
 ('Imp', 0.8688753247261047),
 ('Lancel', 0.8603962659835815),
 ('Joffrey', 0.8597180843353271),
 ('Joff', 0.8585003614425659),
 ('queen', 0.8486377000808716),
 ('sister', 0.8455819487571716),
 ('Littlefinger', 0.8399965763092041),
 ('Kingslayer', 0.8309287428855896)]


Altair Visuals

import altair as alt
alt.data_transformers.disable_max_rows()
DataTransformerRegistry.enable('default')

source = df

alt.Chart(source).mark_circle(size=60).encode(
    x='x',
    y='y',
    tooltip=['word']
).properties(
    width=900,
    height=800
).interactive()

Try changing the parameters used to create the model and see how the results are affected.
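
As a starting point, here is one hypothetical variation (not run in this notebook): switch to CBOW with smaller vectors and a tighter window, reusing filtered_sentences and the settings defined above, and compare the neighbour lists with the skip-gram results.

cbowVectors = w2v.Word2Vec(
    sg=0,               # 0 for CBOW instead of skip-gram
    seed=seed,
    workers=num_workers,
    vector_size=100,    # smaller embeddings
    min_count=5,
    window=3,           # tighter context window
    sample=downsampling
)
cbowVectors.build_vocab(filtered_sentences)
cbowVectors.train(filtered_sentences,
                  total_examples=cbowVectors.corpus_count,
                  epochs=15)

# Compare the neighbourhoods with the skip-gram model trained earlier
print(cbowVectors.wv.most_similar("Stark"))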

It's interesting how embeddings bridge the gap between words and numbers. In one of the models at work, I even passed them as a predictor to a regression model, and the beta value was significant!

End of Notebook