Text Embedding
6. Text Embedding Techniques
This document explores various techniques for text embedding, a crucial step in Natural Language Processing (NLP) that transforms text into numerical representations suitable for machine learning models.
6.1 Document Embedding
Document embedding aims to represent an entire document as a single dense vector. This allows us to capture the overall meaning and context of a document, enabling tasks like document similarity, classification, and clustering.
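A toy sketch of this idea, using gensim's Doc2Vec (one possible tool, described in Section 6.1.1) under illustrative parameters and corpus; real applications need far more data:

```python
# A toy sketch of document embedding with gensim's Doc2Vec: every document
# is mapped to a single dense vector, and documents are compared by
# cosine similarity.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_docs = [
    "the cat sat on the mat",
    "a cat was sitting on a mat",
    "stock prices fell sharply today",
]

# Each document receives a unique tag (the "paragraph ID" used by Doc2Vec).
tagged = [TaggedDocument(words=doc.split(), tags=[i])
          for i, doc in enumerate(raw_docs)]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

# Infer a vector for an unseen document and retrieve the most similar
# training documents.
new_vec = model.infer_vector("the cat is on the mat".split())
print(model.dv.most_similar([new_vec], topn=2))
```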
6.1.1 Pre-Trained Embeddings
Pre-trained embeddings are vector representations of words or documents that have been learned from massive text corpora. They capture general semantic relationships and can be used directly (a brief loading sketch follows the list of common sources below) or fine-tuned for specific downstream tasks.
Advantages:
Capture rich semantic and syntactic information.
Reduce the need for large, task-specific training datasets.
Provide a good starting point for various NLP tasks.
Common Sources:
Word2Vec: Learns word embeddings by predicting surrounding words or the target word given its context.
GloVe (Global Vectors for Word Representation): Combines global matrix factorization with local context window methods.
FastText: Extends Word2Vec by considering character n-grams, making it effective for out-of-vocabulary words and morphologically rich languages.
BERT (Bidirectional Encoder Representations from Transformers): A transformer-based model that generates context-aware word embeddings, capturing nuances of word meaning based on its surrounding words.
Doc2Vec (Paragraph Vectors): An extension of Word2Vec that learns a dense vector for each paragraph or document by training a paragraph vector, identified by a unique document ID, jointly with the word vectors.
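As a quick illustration of leveraging pre-trained embeddings directly, the sketch below uses gensim's downloader API (an assumed dependency, not part of this document) to load a small pre-trained GloVe model and query it:

```python
# A sketch of using pre-trained embeddings directly, via gensim's downloader
# API. "glove-wiki-gigaword-50" is a small pre-trained GloVe model fetched
# on first use (~65 MB).
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")  # returns a KeyedVectors object

# Nearest neighbours in the vector space reflect semantic similarity.
print(glove.most_similar("computer", topn=5))

# Cosine similarity between two word vectors.
print(glove.similarity("coffee", "tea"))
```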
6.1.2 Word Embeddings
Word embeddings represent individual words as dense numerical vectors in a multi-dimensional space. Words with similar meanings are expected to be closer to each other in this vector space.
Popular Word Embedding Models:
Word2Vec:
Skip-gram: Predicts surrounding words given a target word. This model is generally better for infrequent words.
Continuous Bag-of-Words (CBOW): Predicts the target word given its surrounding context words. This model is generally better for frequent words.
Example (Conceptual): If you have trained Word2Vec on a large corpus, you might observe relationships like:
vector("king") - vector("man") + vector("woman") ≈ vector("queen")
GloVe:
Relies on word co-occurrence statistics from a corpus. It aims to capture ratios of word-word co-occurrence probabilities.
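To make the idea concrete, the toy sketch below (not GloVe's actual training code) builds the kind of global co-occurrence counts that GloVe factorizes; the window size and corpus are arbitrary:

```python
# A toy illustration of the global co-occurrence counts that GloVe
# factorizes: for each word, count how often other words appear within a
# symmetric context window.
from collections import defaultdict

corpus = [["ice", "is", "cold", "and", "solid"],
          ["steam", "is", "hot", "and", "gaseous"]]
window = 2

cooc = defaultdict(lambda: defaultdict(float))
for sentence in corpus:
    for i, word in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                # GloVe down-weights distant context words by 1/distance.
                cooc[word][sentence[j]] += 1.0 / abs(i - j)

# GloVe then fits word vectors whose dot products (plus biases) approximate
# the logarithm of these co-occurrence counts.
print(dict(cooc["ice"]))
```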
FastText:
Represents words as a bag of character n-grams. This allows it to generate embeddings for words not seen during training (out-of-vocabulary words) by summing the embeddings of their constituent n-grams.
Example (Conceptual): With boundary markers added, the word "apple" becomes "<apple>", and its character 3-grams are:
{"<ap", "app", "ppl", "ple", "le>"}
The word's embedding is then the sum of the embeddings of these n-grams (plus the embedding of the full word "<apple>" itself).
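A minimal sketch with gensim's FastText implementation (an assumed library choice), showing that an unseen word still receives an embedding because its character n-grams overlap with words seen in training:

```python
# A minimal gensim FastText sketch. Because word vectors are built from
# character n-grams, the model can produce an embedding even for a word it
# never saw during training.
from gensim.models import FastText

corpus = [
    ["apples", "and", "oranges", "are", "fruit"],
    ["he", "eats", "an", "apple", "every", "day"],
]

model = FastText(corpus, vector_size=50, window=3, min_count=1,
                 min_n=3, max_n=5)  # character n-gram lengths

# "applez" never appears in the corpus, but its n-grams overlap with those
# of "apple" and "apples", so FastText still returns a vector for it.
oov_vector = model.wv["applez"]
print(oov_vector.shape)
print(model.wv.similarity("apple", "applez"))
```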
Contextual Embeddings (e.g., BERT, ELMo, GPT):
Unlike static word embeddings, contextual embeddings generate a different vector representation for the same word depending on its context within a sentence. This is a significant advancement because it captures polysemy (words with multiple meanings).
Example: The word "bank" would have different embeddings in the sentences:
"I went to the bank to deposit money."
"The river bank was eroded."