Text Embedding
6. Text Embedding Techniques
This document explores various techniques for text embedding, a crucial step in Natural Language Processing (NLP) that transforms text into numerical representations suitable for machine learning models.
6.1 Document Embedding
Document embedding aims to represent an entire document as a single dense vector. This allows us to capture the overall meaning and context of a document, enabling tasks like document similarity, classification, and clustering.
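A toy sketch of this idea, using gensim's Doc2Vec (one possible tool, described in Section 6.1.1) under illustrative parameters and corpus; real applications need far more data:

```python
# A toy sketch of document embedding with gensim's Doc2Vec: every document
# is mapped to a single dense vector, and documents are compared by
# cosine similarity.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_docs = [
    "the cat sat on the mat",
    "a cat was sitting on a mat",
    "stock prices fell sharply today",
]

# Each document receives a unique tag (the "paragraph ID" used by Doc2Vec).
tagged = [TaggedDocument(words=doc.split(), tags=[i])
          for i, doc in enumerate(raw_docs)]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

# Infer a vector for an unseen document and retrieve the most similar
# training documents.
new_vec = model.infer_vector("the cat is on the mat".split())
print(model.dv.most_similar([new_vec], topn=2))
```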
6.1.1 Pre-Trained Embeddings
Pre-trained embeddings are vector representations of words or documents that have been learned from massive text corpora. They capture general semantic relationships and can be used directly (a brief loading sketch follows the list of common sources below) or fine-tuned for specific downstream tasks.
Advantages:
Capture rich semantic and syntactic information.
Reduce the need for large, task-specific training datasets.
Provide a good starting point for various NLP tasks.
Common Sources:
Word2Vec: Learns word embeddings by predicting surrounding words or the target word given its context.
GloVe (Global Vectors for Word Representation): Combines global matrix factorization with local context window methods.
FastText: Extends Word2Vec by considering character n-grams, making it effective for out-of-vocabulary words and morphologically rich languages.
BERT (Bidirectional Encoder Representations from Transformers): A transformer-based model that generates context-aware word embeddings, capturing nuances of word meaning based on its surrounding words.
Doc2Vec (Paragraph Vectors): An extension of Word2Vec that learns a dense vector for each paragraph or document by training a paragraph vector, identified by a unique document ID, jointly with the word vectors.
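As a quick illustration of leveraging pre-trained embeddings directly, the sketch below uses gensim's downloader API (an assumed dependency, not part of this document) to load a small pre-trained GloVe model and query it:

```python
# A sketch of using pre-trained embeddings directly, via gensim's downloader
# API. "glove-wiki-gigaword-50" is a small pre-trained GloVe model fetched
# on first use (~65 MB).
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")  # returns a KeyedVectors object

# Nearest neighbours in the vector space reflect semantic similarity.
print(glove.most_similar("computer", topn=5))

# Cosine similarity between two word vectors.
print(glove.similarity("coffee", "tea"))
```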
6.1.2 Word Embeddings
Word embeddings represent individual words as dense numerical vectors in a multi-dimensional space. Words with similar meanings are expected to be closer to each other in this vector space.
Popular Word Embedding Models:
Word2Vec:
Skip-gram: Predicts surrounding words given a target word. This model is generally better for infrequent words.
Continuous Bag-of-Words (CBOW): Predicts the target word given its surrounding context words. This model is generally better for frequent words.
Example (Conceptual): If you have trained Word2Vec on a large corpus, you might observe relationships like:
vector("king") - vector("man") + vector("woman") ≈ vector("queen")
GloVe:
Relies on word co-occurrence statistics from a corpus. It aims to capture ratios of word-word co-occurrence probabilities.
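To make the idea concrete, the toy sketch below (not GloVe's actual training code) builds the kind of global co-occurrence counts that GloVe factorizes; the window size and corpus are arbitrary:

```python
# A toy illustration of the global co-occurrence counts that GloVe
# factorizes: for each word, count how often other words appear within a
# symmetric context window.
from collections import defaultdict

corpus = [["ice", "is", "cold", "and", "solid"],
          ["steam", "is", "hot", "and", "gaseous"]]
window = 2

cooc = defaultdict(lambda: defaultdict(float))
for sentence in corpus:
    for i, word in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                # GloVe down-weights distant context words by 1/distance.
                cooc[word][sentence[j]] += 1.0 / abs(i - j)

# GloVe then fits word vectors whose dot products (plus biases) approximate
# the logarithm of these co-occurrence counts.
print(dict(cooc["ice"]))
```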
FastText:
Represents words as a bag of character n-grams. This allows it to generate embeddings for words not seen during training (out-of-vocabulary words) by summing the embeddings of their constituent n-grams.
Example (Conceptual): With boundary markers added, the word "apple" becomes "<apple>", and its character 3-grams are:
{"<ap", "app", "ppl", "ple", "le>"}
The word's embedding is then the sum of the embeddings of these n-grams (plus the embedding of the full word "<apple>" itself).
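A minimal sketch with gensim's FastText implementation (an assumed library choice), showing that an unseen word still receives an embedding because its character n-grams overlap with words seen in training:

```python
# A minimal gensim FastText sketch. Because word vectors are built from
# character n-grams, the model can produce an embedding even for a word it
# never saw during training.
from gensim.models import FastText

corpus = [
    ["apples", "and", "oranges", "are", "fruit"],
    ["he", "eats", "an", "apple", "every", "day"],
]

model = FastText(corpus, vector_size=50, window=3, min_count=1,
                 min_n=3, max_n=5)  # character n-gram lengths

# "applez" never appears in the corpus, but its n-grams overlap with those
# of "apple" and "apples", so FastText still returns a vector for it.
oov_vector = model.wv["applez"]
print(oov_vector.shape)
print(model.wv.similarity("apple", "applez"))
```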
Contextual Embeddings (e.g., BERT, ELMo, GPT):
Unlike static word embeddings, contextual embeddings generate a different vector representation for the same word depending on its context within a sentence. This is a significant advancement because it captures polysemy (words with multiple meanings).
Example: The word "bank" would have different embeddings in the sentences:
"I went to the bank to deposit money."
"The river bank was eroded."