If you have been researching Natural Language Search or Natural Language Processing (NLP), you may have heard of word2vec. If you’ve not, we have you covered. This article digs into what you need to know about a word2vec model to be conversant with the data scientists in your life.

What Is Word2Vec?

Word2Vec is a machine learning technique that has been around since 2013, courtesy of Tomas Mikolov and his data science team at Google. It relies on a shallow neural network (two layers, rather than a deep one) to train a computer to learn about your language (vocabulary, expressions, context, etc.) using a corpus (content library).

What Are Word Embeddings?

Word embeddings are the vectors created to represent words, their context, and the relationships between individual words. Embedding is the process of converting words into dense vectors. This process assigns values to each word along several different dimensions, creating a dense vector that isn’t just a string of 0s but actual coordinates in abstract space.

[For the geeks out there: a sparse vector is one in which most of the values are zero. A dense vector is one in which most of the values are non-zero.]
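
To make the sparse/dense distinction concrete, here is a minimal sketch. The two vectors and the helper name `zero_fraction` are made up for illustration; they do not come from any trained model.

```python
# Hypothetical vectors for illustration only; the numbers are made up.
sparse_vector = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]   # mostly zeros
dense_vector = [0.12, -0.40, 0.88, 0.05, -0.63]  # mostly non-zero


def zero_fraction(vector):
    """Return the share of entries in the vector that are exactly zero."""
    return sum(1 for value in vector if value == 0) / len(vector)


print(zero_fraction(sparse_vector))  # 0.9
print(zero_fraction(dense_vector))   # 0.0
```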

How Do We Learn Language As Individuals?

So you’ve had the brief: word2vec and word embeddings. But before we go any further – let’s take a step back first to consider how we learn as individuals.

Ever have someone come up to you with a question – and you have no idea what they are talking about? Chances are they have not given you enough context to understand what they really need. Maybe they are using a unique word to describe the problem, or maybe they are using a single word that changes in meaning based on the context in which it is used. For example, the word ‘bowl’ has at least two completely different meanings (a dish, or the act of rolling a ball), and its interpretation depends heavily on context.

We can try and decode what a person wants by asking questions until we can finally winnow down what they want.

Understanding natural language is not a trivial matter. And as humans, we don’t realize how incredibly complicated language is, which is why it is hard to teach computers natural language processing.

Structural Relationships Between Words / Context

Words rarely have meaning in isolation. If I say “leopard print” – it might be referring to the indentation of the animal’s paw on a given surface, or maybe it’s a description of a pattern that replicates a leopard’s skin. Then again, it might be a poster of the magnificent beast itself.

Without any context, it’s hard to know what a given word means. That’s where word embedding models come in. They capture the relationships that certain words have with the other words in the same phrase or sentence.

When paired with clothing or accessories – we understand that “leopard print” is describing colors and markings. When paired with sizes like 24 x 36, we understand that it is a poster, and when we use it with words like sand or mud, we know that we are tracking the animal.

Semantic Relationships Between Words

Your brain – and the word2vec model – understands the semantic relationship between words.

For instance, it understands that word pairs like “king” and “queen”, “blue” and “yellow”, “running” and “jogging” each have a special relationship. Word pairs like “king” and “running,” or “queen” and “yellow,” don’t! Our brain also understands that the semantic relationship between “shoes” and “socks” is different from, say, the relationship between “shoes” and “sandals,” or “shoes” and “running.”

Your brain knows that the word “queen” has a certain relationship with the word “England” that it doesn’t have with the word “California” (unless we are talking about Gen Zs in California talking to each other). The word “toast” has a relationship with the word “French” that it doesn’t have with the word “Spanish.”

Now imagine the area of your brain that contains language as a well-organized storage space. Words like “French” and “toast” would be located closer together in that space than the words “toast” and “Spanish.”

Target Words, Word Vectors, Vector Representation

Finally, every word evokes a set of associations that are partially shared by all speakers of the language and partially a result of your personal experiences/geography/etc.

For example, your associations with the word “milk” might be white, 1%, breakfast, cereal, cow, almond, and if you have experience with dairy allergies, danger, ambulance, anaphylaxis.

The ability to draw these associations is the result of complex neurological computations honed by our brain’s neural network over thousands of years of evolution.

Does Word2vec Use One Hot Vector Encoding?

No. Instead, it makes use of word embeddings, because one-hot vector encoding has two main limitations:

  • Inability to understand ‘obvious’ relationships
  • Incompatible with a large corpus

From all the context shared above, you now have a greater appreciation and understanding of the complexity of the human language. You know that it’s not enough to just feed computers the dictionary definition of words as training data and hope for the best.

Yet, NLP had to start somewhere, and this is where it started. Since computers understand only numbers, in order to teach them natural language, we have to represent words numerically. For a long time, this was done by representing each target word as a string of zeros with a 1 in a different position for each word. This method is called “one-hot vector encoding.”

So if you have a vocabulary of four words, they would be represented like this:
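
A minimal sketch of that encoding, assuming the four words are “dog,” “cat,” “play,” and “Rome” (the words used as examples in this section) and that each word’s position follows the order of the list. The function name `one_hot` is illustrative only.

```python
# Assumed four-word vocabulary, taken from the examples in this section.
vocabulary = ["dog", "cat", "play", "Rome"]


def one_hot(word, vocab):
    """Return a list of zeros with a single 1 at the word's index in vocab."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector


one_hot_vectors = {word: one_hot(word, vocabulary) for word in vocabulary}

for word, vector in one_hot_vectors.items():
    print(f"{word:>5}: {vector}")
#   dog: [1, 0, 0, 0]
#   cat: [0, 1, 0, 0]
#  play: [0, 0, 1, 0]
#  Rome: [0, 0, 0, 1]
```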

This computational linguistics method creates a unique representation for each target word and therefore helps the system easily distinguish “dogs” from “cats” and “play” from “Rome.” But there are two problems with it.

First, you guessed it, this method has no way of encoding the relationships between words that we as humans take for granted. It has no way of knowing that DOG and CAT are similar in a way that DOG and ROME are not, or that CAT and PLAY have a special relationship that CAT and ROME do not.

Second, this vector representation is problematic if you have a very large corpus. What if your vocabulary size is 10,000 words instead of four? Every vector would then need 10,000 entries, all but one of them zero. Many machine learning models don’t work well with this type of data, as it makes training the model really hard.
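
Both limitations can be seen in a few lines. This is a sketch under assumptions: the four-word vocabulary is the one used as an example in this section, the word index 42 is arbitrary, and the dot product stands in for “how related do these two vectors look.”

```python
def one_hot(index, size):
    """Return a one-hot vector of the given size with a 1 at index."""
    return [1 if position == index else 0 for position in range(size)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Limitation 1: every pair of different words looks equally unrelated.
vocab = ["dog", "cat", "play", "Rome"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}
print(dot(vectors["dog"], vectors["cat"]))   # 0
print(dot(vectors["dog"], vectors["Rome"]))  # 0

# Limitation 2: the vector length grows with the vocabulary size.
big_vector = one_hot(42, 10_000)  # 42 is an arbitrary word index
print(len(big_vector), sum(big_vector))  # 10000 1
```

Because the dot product of any two different one-hot vectors is zero, “dog” is exactly as far from “cat” as it is from “Rome.”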

Today, both these problems are solved with the help of a modern NLP technique called word embeddings.

Word Embedding Methodology

As an example, let’s see how this method would encode the meaning of the following five words:

Aardvark

Black

Cat

Duvet

Zombie

Each word can be assigned a value between 0 and 1 along several different dimensions, for example, “animal”, “fluffiness”, “dangerous”, and “spooky.”

Each semantic feature is a single dimension in an abstract multidimensional semantic space and is represented as its own axis. It is possible to create word vectors using anywhere from 50 to 500 dimensions (or any other number of dimensions really…).

Each word is then given specific coordinates within that space based on its specific values on the features in question. The good news is, this is not a manual job. The computer assigns these coordinates based on how often it “sees” the co-occurrences of words. 

For example, the words “cat” and “aardvark” are very close on the “animal” axis but are far from each other on the scale of fluffiness, and the words “cat” and “duvet” are similar on the scale of fluffiness but not on any other scale.
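
The comparison above can be sketched with hand-picked numbers. Every value below is hypothetical and chosen only to match the description in the text; a real model learns its coordinates from data, and its dimensions are not labeled this neatly. The helper names `axis_gap` and `cosine_similarity` are illustrative.

```python
import math

DIMENSIONS = ["animal", "fluffiness", "dangerous", "spooky"]

# Hypothetical values between 0 and 1, assigned by hand for illustration.
embeddings = {
    "aardvark": [0.97, 0.03, 0.15, 0.04],
    "black":    [0.07, 0.01, 0.20, 0.65],
    "cat":      [0.98, 0.98, 0.45, 0.35],
    "duvet":    [0.01, 0.84, 0.12, 0.02],
    "zombie":   [0.74, 0.05, 0.98, 0.93],
}


def axis_gap(word_a, word_b, axis):
    """Absolute difference between two words along one named dimension."""
    i = DIMENSIONS.index(axis)
    return abs(embeddings[word_a][i] - embeddings[word_b][i])


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# "cat" and "aardvark": close on "animal", far apart on "fluffiness".
print(axis_gap("cat", "aardvark", "animal"))
print(axis_gap("cat", "aardvark", "fluffiness"))

# "cat" and "duvet": close on "fluffiness", far apart on "animal".
print(axis_gap("cat", "duvet", "fluffiness"))
print(axis_gap("cat", "duvet", "animal"))

# One number that summarizes closeness across all four dimensions.
print(cosine_similarity(embeddings["cat"], embeddings["aardvark"]))
```

Cosine similarity is one common way to compare whole vectors; it is used here as an example measure, not as the only option.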

Word embedding algorithms excel at encoding a variety of semantic relationships between words. Synonyms (words that have a similar meaning) will be located very close to each other. 

The flip side is that antonyms often end up very close to each other in that same space. That’s how Word2vec works: words that appear in the same context (and antonyms usually do) are mapped to the same area of space.

Other semantic relationships between words, for example, hyponymy (a subtype relationship, e.g. “spoon” is a hyponym of “cutlery”) will also be encoded.

This method also helps establish the relationships a specific target word has with the words around it, for example:

  • A leopard print was seen in the mud.
  • That’s a beautiful leopard print on your coat.
  • I bought a leopard print with a black frame.

The system will encode that the target word “leopard print” appears within sentences that have the words “mud,” “coat” and “frame.” (The span of surrounding words it looks at is called a word window, and older models like Word2vec use a small one, covering words within 3-5 words of the target.) By looking at what comes before and after a target word, the computer is able to learn additional information about each word and locate it as precisely as possible in abstract vector space.

So to summarize, under this method, words are analyzed to see how often they appear near each other (co-occurrence). Thus, word embedding algorithms are capable of capturing the context of a word in a document, semantic and syntactic similarity, relation with other words, and so on.
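
The windowed co-occurrence idea can be sketched as a simple counter over the three “leopard print” sentences. This is an illustration under assumptions: the window size of 5, the choice to treat “leopard print” as a single token, and the function names are all hypothetical. Word2vec itself does not store raw counts like these; it learns its vectors by training a small neural network to predict neighboring words.

```python
import re
from collections import Counter

sentences = [
    "A leopard print was seen in the mud.",
    "That's a beautiful leopard print on your coat.",
    "I bought a leopard print with a black frame.",
]

WINDOW = 5  # assumed window size: words within 5 positions of the target


def tokenize(sentence):
    """Lowercase, join "leopard print" into one token, and split into words."""
    text = sentence.lower().replace("leopard print", "leopard_print")
    return re.findall(r"[a-z_']+", text)


def context_counts(target, sentences, window):
    """Count the words that appear within `window` positions of the target."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for position, token in enumerate(tokens):
            if token != target:
                continue
            start = max(0, position - window)
            before = tokens[start:position]
            after = tokens[position + 1:position + 1 + window]
            counts.update(before + after)
    return counts


counts = context_counts("leopard_print", sentences, WINDOW)
print(counts["mud"], counts["coat"], counts["frame"])  # 1 1 1
```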

As you can see below, by assigning similar words similar vector representations, this method was able to encode the meaning of words and cluster them according to categories, such as cities, food, countries, and so on.

Over the next couple of months, we will delve into more machine learning models that come into play with deep learning and NLP.
