Word Embeddings: Theory and Analysis

When implementing natural language processing tasks, we deal with various kinds of discrete types, the most common being words. Words come from a finite set called the vocabulary. Other examples are characters, part-of-speech tags, named entities, named entity types, parse features, items in a product catalog, and more.

Any input feature that belongs to a finite or countably infinite set is considered a discrete type.


Representing discrete types like words as dense vectors is a major milestone and drives the success of deep learning in natural language processing.

Terms like representation learning and embedding refer to learning this mapping from a discrete type to a point in a vector space.

When the discrete type is a word, the dense vector representation is called a word embedding.

Evolution of Word Embeddings

Key Concepts in Word Embeddings

  1. Distributional Hypothesis - words with similar meanings tend to occur in similar contexts.
  2. Dimensionality Reduction - word embeddings are dense vectors of much lower dimensionality than the vocabulary, which reduces computational complexity and makes them suitable for large-scale NLP.
  3. Semantic Representation - embeddings place words in a continuous vector space where geometric relationships between vectors reflect semantic relationships between words.
  4. Contextual Information
    - embeddings are learned from words that co-occur in a given context
    - this helps models understand the meaning of a word based on its surrounding words
  5. Generalization
    - embeddings generalize well to unseen/rare words because they learn to represent words based on their context

Word embeddings are so powerful in NLP tasks that they have earned the title of the "Sriracha of NLP": we can add word embeddings to almost any NLP task and expect its performance to improve.

"Sriracha of NLP" is a metaphor expressing that word embeddings are useful in almost everything.

Just like Sriracha sauce can improve the flavor of many different foods, word embeddings can enhance performance across a wide variety of NLP tasks, from classification to translation to sentiment analysis.

Here's a simplified example of word embeddings where each word is represented as a 3-dimensional vector:

Word Vector
cat [0.2, -0.6, 0.8]
dog [0.6, 0.2, 0.5]
apple [0.8, -0.2, -0.3]
orange [0.7, -0.2, -0.5]
happy [-0.4, 0.9, 0.2]
sad [0.3, -0.8, -0.5]

Here, each word is associated with a unique vector. The values in the vector represent the word's position in a continuous 3-dimensional vector space.
Words with similar meaning/context are expected to have similar vector representations.

cat and dog are close together, whereas happy and sad point in roughly opposite directions.
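As a quick check, here is a minimal NumPy sketch (the toy vectors above are purely illustrative, and cosine similarity is defined formally later in this section) that ranks the other words by similarity to a query word:

```python
import numpy as np

# Toy 3-dimensional embeddings from the table above (illustrative values only).
embeddings = {
    "cat":    np.array([0.2, -0.6, 0.8]),
    "dog":    np.array([0.6, 0.2, 0.5]),
    "apple":  np.array([0.8, -0.2, -0.3]),
    "orange": np.array([0.7, -0.2, -0.5]),
    "happy":  np.array([-0.4, 0.9, 0.2]),
    "sad":    np.array([0.3, -0.8, -0.5]),
}

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank the remaining words by similarity to the query word.
query = "cat"
neighbors = sorted(
    ((word, cosine(embeddings[query], vec)) for word, vec in embeddings.items() if word != query),
    key=lambda pair: pair[1],
    reverse=True,
)
print(neighbors)  # "dog" should rank well above "happy" for this toy data
```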


Example of Word Embedding

Each word is a point in the embedding space. Embeddings enable semantic operations via vector arithmetic, such as obtaining the capital city of a given country (e.g., Paris − France + Germany ≈ Berlin).

One-hot encoding

  • Each word is represented as a vector of 0s with a single 1 at the index of the word.
  • High dimensionality: each vector is the same length as the size of the vocabulary.
  • One-hot representations are not learned but heuristically constructed.
  • The vectors are high-dimensional and sparse, often 10^5 or 10^6 dimensions or even higher.
  • Expensive in computational complexity and memory.
  • Like other count-based representations (e.g., co-occurrence counts), they are static and not trainable.

For eg: If vocab = ['cat', 'dog', 'mouse']

Word Vector
cat [1, 0, 0]
dog [0, 1, 0]
mouse [0, 0, 1]

Here, cat and dog have no relationship and no shared semantic meaning. One-hot encoding is useful for encoding identity but not meaning.
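A minimal sketch of one-hot encoding for this toy vocabulary (plain NumPy; the helper name is ours):

```python
import numpy as np

vocab = ["cat", "dog", "mouse"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # A vector of zeros with a single 1 at the word's index in the vocabulary.
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("cat"))  # [1. 0. 0.]
print(one_hot("dog"))  # [0. 1. 0.]

# Any two distinct one-hot vectors are orthogonal, so the representation
# encodes identity only and carries no notion of similarity.
print(np.dot(one_hot("cat"), one_hot("dog")))  # 0.0
```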

Dense Word Embeddings

  • aka distributed representations: low-dimensional, learned vectors
  • Word2Vec, GloVe and FastText are the classic models of this kind
  • trained from data, typically by predicting a word from its context or a context from its word (CBOW, Skip-Gram)
  • the modern approach: small, trainable, generalizable and meaningful
  • capture semantic similarity between words
  • map each word to a low-dimensional dense vector
  • ready for fine-tuning and far more efficient than one-hot encoding

Dense word embeddings have several benefits over one-hot encoding:

  1. Dimensionality and Efficiency: reducing dimensionality makes computation more efficient.
  2. Generalization and Semantic Similarity: dense embeddings generalize to related words and capture semantic similarity.
  3. Avoiding the Curse of Dimensionality: very high-dimensional inputs cause real problems in machine learning and optimization, often called the "curse of dimensionality".
  4. Fine-tuning: dense embeddings can be fine-tuned on task-specific data.
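As a minimal sketch (assuming PyTorch, which is not mentioned above but is a common choice), a trainable embedding layer is simply a lookup table of dense vectors, one row per vocabulary item, whose rows are updated during training:

```python
import torch
import torch.nn as nn

vocab_size = 10_000   # illustrative vocabulary size
embedding_dim = 100   # illustrative embedding dimensionality

# A learnable lookup table: vocab_size rows, each a dense vector of length embedding_dim.
embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)

# Look up the dense vectors for a batch of word indices.
word_indices = torch.tensor([2, 541, 9876])
vectors = embedding(word_indices)
print(vectors.shape)  # torch.Size([3, 100])

# The table is a parameter, so it receives gradients and can be trained
# end to end or fine-tuned on task-specific data.
print(embedding.weight.requires_grad)  # True
```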

Word2Vec - Word to Vector

  • Language modeling technique that maps words to vectors
  • method to generate word embeddings
  • widely used in NLP tasks, developed by Google in 2013
  • Word2Vec utilizes 2 architectures
    • CBOW - Continuous Bag of Words
    • Skip-Gram
  • CBOW predicts the current word given the context words within a specific window
  • Skip-Gram predicts the surrounding context words within a specific window, given the current word
  • applications include topic categorization, sentiment analysis and Named Entity Recognition (NER)
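A minimal training sketch using gensim (the corpus and hyperparameters below are illustrative; in gensim, sg=0 selects CBOW and sg=1 selects Skip-Gram):

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus: a list of tokenized sentences.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["dogs", "and", "cats", "are", "common", "pets"],
]

model = Word2Vec(
    sentences,
    vector_size=50,  # embedding dimensionality
    window=2,        # context window size
    min_count=1,     # keep every word in this tiny corpus
    sg=1,            # 1 = Skip-Gram, 0 = CBOW
    epochs=50,
)

print(model.wv["cat"].shape)              # (50,)
print(model.wv.similarity("cat", "dog"))  # cosine similarity of the two learned vectors
```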

GloVe - Global Vectors for Word Representation

  • unsupervised learning algorithm for generating dense vector representations, also known as "embeddings"
  • primarily used to capture semantic relationships between words by analyzing their co-occurrence patterns in a large text corpus

At its core, the idea is to map each word into a continuous vector space where both the magnitude and direction of the vectors reflect meaningful semantic relationships.

For eg: king - man + woman ≈ queen (see the sketch after the list below)

  • count-based model from Stanford (2014)
  • captures fine-grained similarity better than Word2Vec
  • trained on massive corpora like Common Crawl (840B tokens)
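Here is a minimal sketch of that analogy using pretrained GloVe vectors available through gensim's downloader ("glove-wiki-gigaword-50" is one of several available models; the first call downloads the vectors, and the exact neighbors and scores depend on the model used):

```python
import gensim.downloader as api

# Load pretrained 50-dimensional GloVe vectors (downloaded on first use).
glove = api.load("glove-wiki-gigaword-50")

# Vector arithmetic: king - man + woman should land near "queen".
result = glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
print(result)  # expected to include ('queen', ...) near the top
```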

FastText

  • built by Facebook AI Research (FAIR) in 2016
  • extension of Word2Vec
  • represents each word as a bag of character n-grams, so it learns subword embeddings

For eg: "apple" with n = 3 becomes "<ap", "app", "ppl", "ple", "le>", where < and > mark the word boundaries (see the sketch after this list)

The final word vector is the sum of its subword n-gram vectors.

  • handles out-of-vocabulary and rare words
  • extends the Skip-Gram and CBOW models
  • great for morphologically rich languages or low-resource languages like Nepali
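A minimal sketch of this character n-gram decomposition (pure Python; the boundary markers follow the FastText convention and the function name is ours):

```python
def char_ngrams(word, n=3):
    # Add boundary markers so prefixes and suffixes get their own distinct n-grams.
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(char_ngrams("apple"))
# ['<ap', 'app', 'ppl', 'ple', 'le>']

# In FastText, the vector for "apple" is built from the vectors of these
# n-grams, which is why unseen or rare words can still get sensible embeddings.
```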

Model      Type          Learns from                Invented by
Word2Vec   Predictive    Local context window       Google (2013)
GloVe      Count-based   Global co-occurrence       Stanford (2014)
FastText   Predictive    Local context + n-grams    Facebook AI (2016)

Semantic Similarity

Semantic similarity refers to the degree to which the meanings of two pieces of text (whether words, phrases, sentences, or larger chunks of text) are similar.

When trained well, embeddings place semantically similar words closer together in high dimensional space.

For eg: vectors for cat and dog should be closer than vectors for cat and car.

In embedding space, semantic similarity is commonly measured by cosine similarity:

cosine_similarity(A, B) = (A ⋅ B) / (∥A∥ × ∥B∥)

where

  • A⋅B is the dot product of vectors A and B.
  • ∥A∥ and ∥B∥ are the magnitudes (norms) of vectors A and B.

The result ranges from -1 to 1:

1 = same direction; high similarity
0 = orthogonal; no similarity
-1 = opposite direction
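A minimal NumPy sketch of this measure and its boundary values (the function name is ours):

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1, 0], [2, 0]))   #  1.0 -> same direction, high similarity
print(cosine_similarity([1, 0], [0, 3]))   #  0.0 -> orthogonal, no similarity
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 -> opposite direction
```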

For eg:
words like ["king", "queen", "prince", "princess"] are grouped together to form a royal cluster
["cat", "dog", "rabbit"] form another animal cluster
["sun", "moon", "star"] form celestial cluster

Cluster reflects semantic fields.

For eg:

Word Vector
king [0.25, 0.12, 0.65]
queen [0.24, 0.13, 0.66]
apple [-0.53, 0.78, 0.12]

1. king vs queen

king = [0.25, 0.12, 0.65] and queen = [0.24, 0.13, 0.66]

  • Dot product = (0.25 × 0.24) + (0.12 × 0.13) + (0.65 × 0.66)
    = 0.06 + 0.0156 + 0.429
    = 0.5046
  • Norm of king = √(0.25² + 0.12² + 0.65²)
    = √(0.0625 + 0.0144 + 0.4225)
    = √0.4994 = 0.7067
  • Norm of queen = √(0.24² + 0.13² + 0.66²)
    = √(0.0576 + 0.0169 + 0.4356)
    = √0.5101 = 0.7142
  • Cosine similarity = 0.5046 / (0.7067 × 0.7142)
    = 0.5046 / 0.5047
    = 0.9998

2. king vs apple

king = [0.25, 0.12, 0.65] and apple = [-0.53, 0.78, 0.12]

  • Dot product = (0.25 × -0.53) + (0.12 × 0.78) + (0.65 × 0.12)
    = -0.1325 + 0.0936 + 0.078
    = 0.0391
  • Norm of king = 0.7067 (from above)
  • Norm of apple = √((-0.53)² + 0.78² + 0.12²)
    = √(0.2809 + 0.6084 + 0.0144)
    = √0.9037 = 0.9506
  • Cosine similarity = 0.0391 / (0.7067 × 0.9506)
    = 0.0391 / 0.6717
    = 0.0582

3. queen vs apple

queen = [0.24, 0.13, 0.66] and apple = [-0.53, 0.78, 0.12]

  • Dot product = (0.24 × -0.53) + (0.13 × 0.78) + (0.66 × 0.12)
    = -0.1272 + 0.1014 + 0.0792
    = 0.0534
  • Norm of queen = 0.7142 (from above)
  • Norm of apple = 0.9506 (from above)
  • Cosine similarity = 0.0534 / (0.7142 × 0.9506)
    = 0.0534 / 0.6789
    = 0.0787

Similarity (king, queen) ≈ 0.9998 (highly similar)
Similarity (king, apple) ≈ 0.0582 (not semantically similar)
Similarity (queen, apple) ≈ 0.0787 (not semantically similar)
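These hand calculations can be checked with a few lines of NumPy (vectors copied from the table above; small rounding differences are expected):

```python
import numpy as np

vectors = {
    "king":  np.array([0.25, 0.12, 0.65]),
    "queen": np.array([0.24, 0.13, 0.66]),
    "apple": np.array([-0.53, 0.78, 0.12]),
}

for a, b in [("king", "queen"), ("king", "apple"), ("queen", "apple")]:
    va, vb = vectors[a], vectors[b]
    # Cosine similarity: dot product over the product of the norms.
    sim = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
    print(f"similarity({a}, {b}) = {sim:.4f}")

# Expected output (up to rounding): 0.9998, 0.0582, 0.0787
```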

Dimensionality in Embeddings

Each word is represented as a vector of length d, where d is the embedding dimensionality.

For eg: If d = 50, then "cat" = [0.23, -0.15, ..., 0.02] ∈ R^50

If d = 300, the vector has 300 features.
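A minimal sketch of what d means in practice (the vocabulary size and dimensions below are illustrative): the full embedding table is a vocab_size × d matrix, so d directly controls memory and compute.

```python
import numpy as np

vocab_size = 100_000  # illustrative vocabulary size

for d in (50, 300):
    # Randomly initialized embedding table of shape (vocab_size, d);
    # in a real model these values would be learned.
    table = np.random.randn(vocab_size, d).astype(np.float32)
    print(f"d={d}: shape {table.shape}, ~{table.nbytes / 1e6:.0f} MB as float32")

# d=50  -> (100000, 50),  ~20 MB
# d=300 -> (100000, 300), ~120 MB
```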

Role of Dimensionality in Embeddings

  1. Capacity to encode meaning
  2. Trade-off between efficiency and expressiveness
    1. low dimensions are fast to compute and memory efficient but lose detail
    2. high dimensions give richer feature representations but slower training
  3. "Curse of dimensionality" - in very high dimensions, distances between vectors become less meaningful.
  4. Traditional dimensionality reduction doesn't scale
    1. Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) do not scale when the dimensionality is on the order of millions. Hence, learned dense embeddings like Word2Vec or GloVe are trained end to end and scale better.

To choose the dimensionality of a word embedding, consider these factors:

  1. nature of data
  2. computational resources
  3. performance requirements
  4. experimentation and evaluation metrics