Word Embeddings: Theory and Analysis

When implementing natural language processing tasks, we deal with various kinds of discrete types, the most common being words. Words come from a finite set called the vocabulary. Other examples are characters, part-of-speech tags, named entities, named entity types, parse features, items in a product catalog, and more.

Any input feature that belongs to a finite or countably infinite set is considered a discrete type.


Representing discrete types like words as dense vectors is a major milestone and drives the success of deep learning in natural language processing.

Terms like representation learning and embedding refer to learning this mapping from a discrete type to a point in a vector space.

When the discrete type is a word, the dense vector representation is called a word embedding.

Evolution of Word Embeddings

Key Concepts in Word Embeddings

  1. Distributional Hypothesis - words with similar meanings tend to occur in similar contexts.
  2. Dimensionality Reduction - word embeddings are dense vectors of much lower dimensionality than the vocabulary, which reduces computational complexity and makes them suitable for large-scale NLP.
  3. Semantic Representation - embeddings place words in a continuous vector space where geometric relationships between vectors reflect semantic relationships between words.
  4. Contextual Information
    - embeddings are learned from words that co-occur in a given context
    - this helps models understand the meaning of a word based on its surrounding words
  5. Generalization
    - embeddings generalize well to unseen/rare words because they learn to represent words based on their context

Word embeddings are so powerful in NLP tasks that they have earned the title of the "Sriracha of NLP": we can add word embeddings to almost any NLP task and expect its performance to improve.

"Sriracha of NLP" is a metaphor expressing that word embeddings are useful in almost everything.

Just like Sriracha sauce can improve the flavor of many different foods, word embeddings can enhance performance across a wide variety of NLP tasks, from classification to translation to sentiment analysis.

Here's a simplified example of word embeddings where each word is represented as a 3-dimensional vector:

Word Vector
cat [0.2, -0.6, 0.8]
dog [0.6, 0.2, 0.5]
apple [0.8, -0.2, -0.3]
orange [0.7, -0.2, -0.5]
happy [-0.4, 0.9, 0.2]
sad [0.3, -0.8, -0.5]

Here, each word is associated with a unique vector. The values in the vector represent the word's position in a continuous 3-dimensional vector space.
Words with similar meaning/context are expected to have similar vector representations.

cat and dog are close together, whereas happy and sad point in roughly opposite directions.
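As a quick check, here is a minimal NumPy sketch (the toy vectors above are purely illustrative, and cosine similarity is defined formally later in this section) that ranks the other words by similarity to a query word:

```python
import numpy as np

# Toy 3-dimensional embeddings from the table above (illustrative values only).
embeddings = {
    "cat":    np.array([0.2, -0.6, 0.8]),
    "dog":    np.array([0.6, 0.2, 0.5]),
    "apple":  np.array([0.8, -0.2, -0.3]),
    "orange": np.array([0.7, -0.2, -0.5]),
    "happy":  np.array([-0.4, 0.9, 0.2]),
    "sad":    np.array([0.3, -0.8, -0.5]),
}

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank the remaining words by similarity to the query word.
query = "cat"
neighbors = sorted(
    ((word, cosine(embeddings[query], vec)) for word, vec in embeddings.items() if word != query),
    key=lambda pair: pair[1],
    reverse=True,
)
print(neighbors)  # "dog" should rank well above "happy" for this toy data
```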


Example of Word Embedding

Each word is a point in the embedding space. Embeddings enable semantic operations via vector arithmetic, such as obtaining the capital city of a given country (e.g., Paris − France + Germany ≈ Berlin).

One-hot encoding

  • Each word is represented as a vector of 0s with a single 1 at the index of the word.
  • High dimensionality: each vector is the same length as the size of the vocabulary.
  • One-hot representations are not learned but heuristically constructed.
  • The vectors are high-dimensional and sparse, often 10^5 or 10^6 dimensions or even higher.
  • Expensive in computational complexity and memory.
  • Like other count-based representations (e.g., co-occurrence counts), they are static and not trainable.

For eg: If vocab = ['cat', 'dog', 'mouse']

Word Vector
cat [1, 0, 0]
dog [0, 1, 0]
mouse [0, 0, 1]

Here, cat and dog have no relationship and no shared semantic meaning. One-hot encoding is useful for encoding identity but not meaning.
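A minimal sketch of one-hot encoding for this toy vocabulary (plain NumPy; the helper name is ours):

```python
import numpy as np

vocab = ["cat", "dog", "mouse"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # A vector of zeros with a single 1 at the word's index in the vocabulary.
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("cat"))  # [1. 0. 0.]
print(one_hot("dog"))  # [0. 1. 0.]

# Any two distinct one-hot vectors are orthogonal, so the representation
# encodes identity only and carries no notion of similarity.
print(np.dot(one_hot("cat"), one_hot("dog")))  # 0.0
```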

Dense Word Embeddings

  • aka distributed representations: low-dimensional, learned vectors
  • Word2Vec, GloVe and FastText are the classic models of this kind
  • trained from data, typically by predicting a word from its context or a context from its word (CBOW, Skip-Gram)
  • the modern approach: small, trainable, generalizable and meaningful
  • capture semantic similarity between words
  • map each word to a low-dimensional dense vector
  • ready for fine-tuning and far more efficient than one-hot encoding

Dense word embeddings have several benefits over one-hot encoding:

  1. Dimensionality and Efficiency: reducing dimensionality makes computation more efficient.
  2. Generalization and Semantic Similarity: dense embeddings generalize to related words and capture semantic similarity.
  3. Avoiding the Curse of Dimensionality: very high-dimensional inputs cause real problems in machine learning and optimization, often called the "curse of dimensionality".
  4. Fine-tuning: dense embeddings can be fine-tuned on task-specific data.
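As a minimal sketch (assuming PyTorch, which is not mentioned above but is a common choice), a trainable embedding layer is simply a lookup table of dense vectors, one row per vocabulary item, whose rows are updated during training:

```python
import torch
import torch.nn as nn

vocab_size = 10_000   # illustrative vocabulary size
embedding_dim = 100   # illustrative embedding dimensionality

# A learnable lookup table: vocab_size rows, each a dense vector of length embedding_dim.
embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)

# Look up the dense vectors for a batch of word indices.
word_indices = torch.tensor([2, 541, 9876])
vectors = embedding(word_indices)
print(vectors.shape)  # torch.Size([3, 100])

# The table is a parameter, so it receives gradients and can be trained
# end to end or fine-tuned on task-specific data.
print(embedding.weight.requires_grad)  # True
```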

Word2Vec - Word to Vector

  • Language modeling technique that maps words to vectors
  • method to generate word embeddings
  • widely used in NLP tasks, developed by Google in 2013
  • Word2Vec utilizes 2 architectures
    • CBOW - Continuous Bag of Words
    • Skip-Gram
  • CBOW predicts the current word given the context words within a specific window
  • Skip-Gram predicts the surrounding context words within a specific window, given the current word
  • applications include topic categorization, sentiment analysis and Named Entity Recognition (NER)
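A minimal training sketch using gensim (the corpus and hyperparameters below are illustrative; in gensim, sg=0 selects CBOW and sg=1 selects Skip-Gram):

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus: a list of tokenized sentences.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["dogs", "and", "cats", "are", "common", "pets"],
]

model = Word2Vec(
    sentences,
    vector_size=50,  # embedding dimensionality
    window=2,        # context window size
    min_count=1,     # keep every word in this tiny corpus
    sg=1,            # 1 = Skip-Gram, 0 = CBOW
    epochs=50,
)

print(model.wv["cat"].shape)              # (50,)
print(model.wv.similarity("cat", "dog"))  # cosine similarity of the two learned vectors
```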

GloVe - Global Vectors for Word Representation

  • unsupervised learning algorithm for generating dense vector representations, also known as "embeddings"
  • primarily used to capture semantic relationships between words by analyzing their co-occurrence patterns in a large text corpus

At its core, the idea is to map each word into a continuous vector space where both the magnitude and direction of the vectors reflect meaningful semantic relationships.

For eg: king - man + woman ≈ queen (see the sketch after the list below)

  • count-based model from Stanford (2014)
  • captures fine-grained similarity better than Word2Vec
  • trained on massive corpora like Common Crawl (840B tokens)
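Here is a minimal sketch of that analogy using pretrained GloVe vectors available through gensim's downloader ("glove-wiki-gigaword-50" is one of several available models; the first call downloads the vectors, and the exact neighbors and scores depend on the model used):

```python
import gensim.downloader as api

# Load pretrained 50-dimensional GloVe vectors (downloaded on first use).
glove = api.load("glove-wiki-gigaword-50")

# Vector arithmetic: king - man + woman should land near "queen".
result = glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
print(result)  # expected to include ('queen', ...) near the top
```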

FastText

  • built by Facebook AI Research (FAIR) in 2016
  • extension of Word2Vec
  • represents each word as a bag of character n-grams, so it learns subword embeddings

For eg: "apple" with n = 3 becomes "<ap", "app", "ppl", "ple", "le>", where < and > mark the word boundaries (see the sketch after this list)

The final word vector is the sum of its subword n-gram vectors.

  • handles out-of-vocabulary and rare words
  • extends the Skip-Gram and CBOW models
  • great for morphologically rich languages or low-resource languages like Nepali
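A minimal sketch of this character n-gram decomposition (pure Python; the boundary markers follow the FastText convention and the function name is ours):

```python
def char_ngrams(word, n=3):
    # Add boundary markers so prefixes and suffixes get their own distinct n-grams.
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(char_ngrams("apple"))
# ['<ap', 'app', 'ppl', 'ple', 'le>']

# In FastText, the vector for "apple" is built from the vectors of these
# n-grams, which is why unseen or rare words can still get sensible embeddings.
```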

Model      Type          Learns from                Invented by
Word2Vec   Predictive    Local context window       Google (2013)
GloVe      Count-based   Global co-occurrence       Stanford (2014)
FastText   Predictive    Local context + n-grams    Facebook AI (2016)

Semantic Similarity

Semantic similarity refers to the degree to which the meanings of two pieces of text (whether words, phrases, sentences, or larger chunks of text) are similar.

When trained well, embeddings place semantically similar words closer together in high dimensional space.

For eg: vectors for cat and dog should be closer than vectors for cat and car.

In embedding space, semantic similarity is commonly measured by cosine similarity:

cosine_similarity(A, B) = (A ⋅ B) / (∥A∥ × ∥B∥)

where

  • A⋅B is the dot product of vectors A and B.
  • ∥A∥ and ∥B∥ are the magnitudes (norms) of vectors A and B.

The result ranges from -1 to 1:

1 = same direction; high similarity
0 = orthogonal; no similarity
-1 = opposite direction
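A minimal NumPy sketch of this measure and its boundary values (the function name is ours):

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1, 0], [2, 0]))   #  1.0 -> same direction, high similarity
print(cosine_similarity([1, 0], [0, 3]))   #  0.0 -> orthogonal, no similarity
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 -> opposite direction
```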

For eg:
words like ["king", "queen", "prince", "princess"] are grouped together to form a royal cluster
["cat", "dog", "rabbit"] form another animal cluster
["sun", "moon", "star"] form celestial cluster

Cluster reflects semantic fields.

For eg:

Word Vector
king [0.25, 0.12, 0.65]
queen [0.24, 0.13, 0.66]
apple [-0.53, 0.78, 0.12]

1. king vs queen

king = [0.25, 0.12, 0.65] and queen = [0.24, 0.13, 0.66]

  • Dot product = (0.25 × 0.24) + (0.12 × 0.13) + (0.65 × 0.66)
    = 0.06 + 0.0156 + 0.429
    = 0.5046
  • Norm of king = √(0.25² + 0.12² + 0.65²)
    = √(0.0625 + 0.0144 + 0.4225)
    = √0.4994 = 0.7067
  • Norm of queen = √(0.24² + 0.13² + 0.66²)
    = √(0.0576 + 0.0169 + 0.4356)
    = √0.5101 = 0.7142
  • Cosine similarity = 0.5046 / (0.7067 × 0.7142)
    = 0.5046 / 0.5047
    = 0.9998

2. king vs apple

king = [0.25, 0.12, 0.65] and apple = [-0.53, 0.78, 0.12]

  • Dot product = (0.25 × -0.53) + (0.12 × 0.78) + (0.65 × 0.12)
    = -0.1325 + 0.0936 + 0.078
    = 0.0391
  • Norm of king = 0.7067 (from above)
  • Norm of apple = √((-0.53)² + 0.78² + 0.12²)
    = √(0.2809 + 0.6084 + 0.0144)
    = √0.9037 = 0.9506
  • Cosine similarity = 0.0391 / (0.7067 × 0.9506)
    = 0.0391 / 0.6717
    = 0.0582

3. queen vs apple

queen = [0.24, 0.13, 0.66] and apple = [-0.53, 0.78, 0.12]

  • Dot product = (0.24 × -0.53) + (0.13 × 0.78) + (0.66 × 0.12)
    = -0.1272 + 0.1014 + 0.0792
    = 0.0534
  • Norm of queen = 0.7142 (from above)
  • Norm of apple = 0.9506 (from above)
  • Cosine similarity = 0.0534 / (0.7142 × 0.9506)
    = 0.0534 / 0.6789
    = 0.0787

Similarity (king, queen) ≈ 0.9998 (highly similar)
Similarity (king, apple) ≈ 0.0582 (not semantically similar)
Similarity (queen, apple) ≈ 0.0787 (not semantically similar)
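These hand calculations can be checked with a few lines of NumPy (vectors copied from the table above; small rounding differences are expected):

```python
import numpy as np

vectors = {
    "king":  np.array([0.25, 0.12, 0.65]),
    "queen": np.array([0.24, 0.13, 0.66]),
    "apple": np.array([-0.53, 0.78, 0.12]),
}

for a, b in [("king", "queen"), ("king", "apple"), ("queen", "apple")]:
    va, vb = vectors[a], vectors[b]
    # Cosine similarity: dot product over the product of the norms.
    sim = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
    print(f"similarity({a}, {b}) = {sim:.4f}")

# Expected output (up to rounding): 0.9998, 0.0582, 0.0787
```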

Dimensionality in Embeddings

Each word is represented as a vector of length d, where d is the embedding dimensionality.

For eg: If d = 50, then "cat" = [0.23, -0.15, ..., 0.02] ∈ R^50

If d = 300, the vector has 300 features.
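A minimal sketch of what d means in practice (the vocabulary size and dimensions below are illustrative): the full embedding table is a vocab_size × d matrix, so d directly controls memory and compute.

```python
import numpy as np

vocab_size = 100_000  # illustrative vocabulary size

for d in (50, 300):
    # Randomly initialized embedding table of shape (vocab_size, d);
    # in a real model these values would be learned.
    table = np.random.randn(vocab_size, d).astype(np.float32)
    print(f"d={d}: shape {table.shape}, ~{table.nbytes / 1e6:.0f} MB as float32")

# d=50  -> (100000, 50),  ~20 MB
# d=300 -> (100000, 300), ~120 MB
```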

Role of Dimensionality in Embeddings

  1. Capacity to encode meaning
  2. Trade-off between efficiency and expressiveness
    1. low dimensions are fast to compute and memory efficient but lose detail
    2. high dimensions give richer feature representations but slower training
  3. "Curse of dimensionality" - in very high dimensions, distances between vectors become less meaningful.
  4. Traditional dimensionality reduction doesn't scale
    1. Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) do not scale when the dimensionality is on the order of millions. Hence, learned dense embeddings like Word2Vec or GloVe are trained end to end and scale better.

To choose the dimensionality of a word embedding, consider these factors:

  1. nature of data
  2. computational resources
  3. performance requirements
  4. experimentation and evaluation metrics