Welcome to the wonderful world of Word Vectors, also referred to as Word Embeddings. In this post, we will cover a general introduction to Word Vectors and take a peek at some of the common algorithms and Python packages that provide pre-trained Word Vectors and also support the development of custom Word Vector models.
While there are plenty of introductions available on the topic of Word Vectors, I found the introduction covered in this pair of blog posts useful for gaining a quick overview: ‘Get busy with Word Embeddings’ by Shane Lynn and ‘How to develop Word Embedding using Python’ by Jason Brownlee. This post covers an overview built on the essential excerpts from those two posts.
Let's start with what Word Vectors are all about.
A word embedding is an approach that provides a dense vector representation of words, one which captures something about their meaning.
Word embeddings are an improvement over simpler encoding schemes such as bag-of-words counts and frequencies, which produce large, sparse vectors (mostly 0 values) that describe documents but not the meaning of the words.
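To make the sparsity concrete, here is a toy sketch in plain Python (the two-document corpus is made up for illustration) that builds the kind of count-based bag-of-words vectors described above; note how most entries are zero even for this tiny vocabulary:

```python
from collections import Counter

# A tiny corpus: each document becomes a vector over the whole vocabulary.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Build the shared vocabulary across all documents.
vocab = sorted({word for doc in docs for word in doc.split()})

# Count-based (bag-of-words) encoding: one dimension per vocabulary word.
def bow_vector(doc):
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocab]

vectors = [bow_vector(doc) for doc in docs]
print(vocab)       # ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors[0])  # [1, 0, 0, 1, 1, 1, 2]
```

With a realistic vocabulary of tens of thousands of words, almost every dimension of such a vector is zero, and nothing in the representation relates "cat" to "dog".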
Word embeddings work by using an algorithm to train a set of fixed-length dense and continuous-valued vectors based on a large corpus of text. Each word is represented by a point in the embedding space and these points are learned and moved around based on the words that surround the target word.
It is defining a word by the company that it keeps that allows the word embedding to learn something about the meaning of words. The vector space representation of the words provides a projection where words with similar meanings are locally clustered within the space.
The use of word embeddings over other text representations is one of the key methods that has led to breakthrough performance with deep neural networks on problems like machine translation.
Key characteristics of Word Vectors
- Word Vectors help capture semantics: The numbers in the word vector represent the word’s distributed weight across dimensions. Each dimension represents a meaning and the word’s numerical weight on that dimension captures the closeness of its association with and to that meaning. Thus, the semantics of the word are embedded across the dimensions of the vector.
- Semantically similar words have similar vectors and are closer in the vector space: The direction of the vectors is especially significant.
- Words as vectors lend themselves to mathematical operations: For example, we can add and subtract vectors. The popular example here is showing that by using word vectors we can determine that:
king – man + woman = queen
- Every word has a unique word embedding (or “vector”), which is just a list of numbers for each word.
- The word embeddings are multidimensional; typically for a good model, embeddings are between 50 and 500 in length.
- For each word, the embedding captures the “meaning” of the word.
- Similar words end up with similar embedding values.
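The arithmetic in the king/queen example above can be sketched with NumPy. The 3-dimensional vectors below are hand-picked for illustration only (real embeddings are learned and 50-500 dimensional), but they show the mechanics: subtract, add, then find the nearest word by cosine similarity.

```python
import numpy as np

# Hand-crafted toy vectors; the dimensions loosely encode
# (royalty, gender, other). Real embeddings are learned from data.
vectors = {
    "king":  np.array([0.9,  0.8, 0.1]),
    "queen": np.array([0.9, -0.8, 0.1]),
    "man":   np.array([0.1,  0.8, 0.0]),
    "woman": np.array([0.1, -0.8, 0.0]),
}

def cosine(a, b):
    # Cosine similarity: direction matters more than magnitude.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land closest to queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
closest = max(vectors, key=lambda w: cosine(vectors[w], target))
print(closest)  # queen
```

With a real pre-trained model (e.g. in Gensim), the equivalent query is `model.most_similar(positive=['king', 'woman'], negative=['man'])`.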
Word Embedding Applications
Word embeddings have found use across the complete spectrum of NLP tasks.
- In conjunction with modelling techniques such as artificial neural networks, word embeddings have massively improved text classification accuracy in many domains including customer service, spam detection, document classification etc.
- Word embeddings are used to improve the quality of language translations, by aligning single-language word embeddings using a transformation matrix. See this example for an explanation of translation across four languages (English, German, Spanish, French).
- Word vectors are also used to improve the accuracy of document search and information retrieval applications, where search strings no longer require exact keyword matches and can be robust to spelling variations.
How to implement Word Vectors?
There are broadly two options to use Word Vectors:
- Use pre-trained models that you can download online (easiest)
- Train custom models using your own data and the Word2Vec or GloVe or any other algorithm
What software packages can be used for NLP implementation using Word Vectors?
Two Python natural language processing (NLP) libraries are most commonly used for implementations using Word Vectors:
- spaCy: spaCy is a natural language processing (NLP) library for Python designed for fast performance, and with word embedding models built in, it's perfect for a quick and easy start.
- Gensim: Gensim is an open source Python library for natural language processing, with a focus on topic modeling. Gensim was developed and is maintained by the Czech natural language processing researcher Radim Řehůřek and his company RaRe Technologies. It supports an implementation of the Word2Vec word embedding for learning new word vectors from text. It also provides tools for loading pre-trained word embeddings in a few formats and for using and querying a loaded embedding.
Word Vector Training
This principle, that words of similar meaning tend to appear with similar context words around them, is the basis of word embedding training algorithms.
There are two primary approaches to training word embedding models:
- Distributed Semantic Models: These models are based on the co-occurrence / proximity of words together in large bodies of text. A co-occurrence matrix is formed for a large text corpus (an NxN matrix with values denoting the probability that words occur closely together), and this matrix is factorised (using SVD / PCA / similar) to form a word vector matrix. Word embedding modelling techniques using this approach are known as “count approaches”.
- Neural Network Models: Neural network approaches are generally “predict approaches”, where models are constructed to predict the context words from a centre word, or the centre word from a set of context words.
Predict approaches tend to outperform count approaches in general, and some of the most popular word embedding algorithms, such as Word2Vec with its Skip-Gram and Continuous Bag of Words (CBOW) architectures, are predict-type approaches.
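The “count approach” described above can be sketched end to end with NumPy (toy corpus and a ±1-word window, chosen for illustration): build a co-occurrence matrix, then factorise it with SVD to get low-dimensional word vectors.

```python
import numpy as np

# Toy corpus for a "count approach".
corpus = [
    ["i", "like", "deep", "learning"],
    ["i", "like", "nlp"],
    ["i", "enjoy", "flying"],
]

vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts within a +/-1 word window:
# C[i, j] = number of times word j appears next to word i.
C = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for pos, word in enumerate(sent):
        for ctx in sent[max(0, pos - 1):pos] + sent[pos + 1:pos + 2]:
            C[index[word], index[ctx]] += 1

# Truncated SVD: keep the top-k singular directions as word vectors.
U, S, Vt = np.linalg.svd(C)
k = 2
embeddings = U[:, :k] * S[:k]   # each row is a 2-d word vector
print(embeddings.shape)          # (7, 2)
```

Methods like GloVe refine this idea by working with weighted log co-occurrence statistics rather than raw counts, but the count-matrix-then-factorise skeleton is the same.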
What are common algorithms used for Word Vectors?
The two most commonly used word embedding methods are:
- word2vec by researchers at Google
- GloVe by researchers at Stanford.
For a deeper dive into the algorithms:
Firth’s (1957) distributional hypothesis:
“You shall know a word by the company it keeps”
- Coursera Deep Learning course video on Word Embeddings.
- Google Tensorflow Tutorial on Word Embeddings.
- Excellent breakdown of the Skip-Gram algorithm: Chris McCormick – The Skip-Gram Model
- “The amazing power of word vectors” – Adrian Colyer