Finding Similar Sentences with Sentence-BERT

Sentence-BERT (SBERT) is a powerful tool for identifying semantically similar sentences. This capability is highly valuable in various applications, such as building intelligent customer support systems. For instance, in an e-commerce scenario, users might ask questions like "When will my order arrive?" or "Can I cancel my order?". Instead of manually mapping these incoming user queries to a knowledge base of Frequently Asked Questions (FAQs), SBERT can automate this process by matching new questions to existing FAQs based on their meaning.

This guide will walk you through the process of finding the most similar sentence from a predefined set of questions using a pre-trained Sentence-BERT model.

Prerequisites

Before you begin, ensure you have the necessary libraries installed.

pip install sentence-transformers numpy

Step-by-Step Implementation

Step 1: Install and Import Required Libraries

We need SentenceTransformer for loading and using SBERT models, and util from the same library for calculating cosine similarity. numpy is used for numerical operations, specifically finding the index of the maximum similarity score.

from sentence_transformers import SentenceTransformer, util
import numpy as np

Step 2: Load a Pre-trained Sentence-BERT Model

Sentence-BERT offers various pre-trained models. Here we use bert-base-nli-mean-tokens, one of the original SBERT models trained on natural language inference (NLI) data for semantic similarity. Newer models such as all-MiniLM-L6-v2 are smaller, faster, and generally score higher on similarity benchmarks; you can swap one in simply by changing the model name.

# Load a pre-trained model optimized for semantic similarity
model = SentenceTransformer('bert-base-nli-mean-tokens')

Step 3: Define the Master FAQ Dictionary

This list, stored in master_dict, holds your predefined common customer support questions and acts as your knowledge base.

# Master list of common customer support questions (FAQs)
master_dict = [
    'How to cancel my order?',
    'Please let me know about the cancellation policy?',
    'Do you provide refund?',
    'What is the estimated delivery date of the product?',
    'Why is my order missing?',
    'How do I report the delivery of incorrect items?'
]

Step 4: Define the Incoming User Query

This is the new question submitted by a user that you want to match to an existing FAQ.

# An incoming user query
inp_question = 'When is my product getting delivered?'

Step 5: Generate Sentence Embeddings

Sentence-BERT converts sentences into dense vector representations (embeddings) that capture their semantic meaning. We'll encode both the incoming user query and all questions in our master dictionary. Setting convert_to_tensor=True ensures the embeddings are returned as PyTorch tensors, which are directly compatible with the util.pytorch_cos_sim function.

# Encode the input query into a vector representation
inp_question_representation = model.encode(inp_question, convert_to_tensor=True)

# Encode all the questions in the master dictionary into vector representations
master_dict_representation = model.encode(master_dict, convert_to_tensor=True)
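
If you want to sanity-check this step, you can print the tensor shapes: bert-base-nli-mean-tokens produces 768-dimensional embeddings, so with six FAQs in the master dictionary you should see something like the following (an optional check, not required for the rest of the guide).

# Optional sanity check of the embedding shapes
print(inp_question_representation.shape)   # torch.Size([768])
print(master_dict_representation.shape)    # torch.Size([6, 768])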

Step 6: Calculate Cosine Similarity

Cosine similarity measures the cosine of the angle between two non-zero vectors. In the context of sentence embeddings, a higher cosine similarity score indicates that the sentences are more semantically similar. We'll use util.pytorch_cos_sim to compute the similarity between the input query's embedding and each embedding in the master dictionary. Then, np.argmax finds the index of the highest similarity score, which corresponds to the most similar sentence in master_dict (the tensor is moved to the CPU first so NumPy can read it, which matters if the model ran on a GPU).

# Compute cosine similarity between the input query and all master dictionary questions.
# The result is a tensor where each element is the similarity score between
# the input query and a question in the master dictionary.
similarity = util.pytorch_cos_sim(inp_question_representation, master_dict_representation)

# Find the index of the most similar sentence by taking the index of the maximum
# similarity score (moving the tensor to the CPU first so NumPy can read it)
most_similar_idx = np.argmax(similarity.cpu().numpy())

# Retrieve and print the most similar sentence from the master dictionary
print("Most similar question:", master_dict[most_similar_idx])

Output:

Most similar question: What is the estimated delivery date of the product?

As demonstrated, Sentence-BERT successfully understood the semantic meaning of the input query "When is my product getting delivered?" and accurately matched it to the most relevant FAQ: "What is the estimated delivery date of the product?".
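
In a production FAQ matcher you will usually also want the similarity score itself, for example to fall back to a human agent when nothing in the knowledge base is a good match. Below is a minimal sketch building on the variables above; the 0.5 threshold is an illustrative assumption you would tune on real queries.

# Look up the best score alongside the index; escalate if it is too low
best_score = float(similarity[0][most_similar_idx])
if best_score >= 0.5:  # illustrative threshold; tune on your own data
    print(f"Matched FAQ: {master_dict[most_similar_idx]} (score={best_score:.2f})")
else:
    print("No sufficiently similar FAQ found; escalating to a human agent.")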

Why This Works

  • Dense Embeddings: Sentence-BERT generates fixed-size dense vector representations (embeddings) for sentences. These embeddings are not just bag-of-words representations; they capture the nuanced meaning, context, and relationships between words within a sentence.

  • Semantic Understanding: Unlike traditional keyword-matching approaches, SBERT models are trained on vast amounts of text data and learn to represent sentences in a way that reflects their semantic content. Sentences with similar meanings will have embeddings that are close to each other in the vector space.

  • Efficient Similarity Computation: Cosine similarity is a computationally efficient method for comparing these dense embeddings. It measures the angular distance between vectors, providing a score that indicates how similar their directions (and thus their meanings) are. A score closer to 1 means higher similarity, while a score closer to 0 indicates less similarity.
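
To make the cosine similarity calculation concrete, here is the formula applied to two toy 3-dimensional vectors; real SBERT embeddings have hundreds of dimensions, but the arithmetic is identical.

import numpy as np

# Two toy "embeddings"; real SBERT vectors have e.g. 768 dimensions
a = np.array([0.2, 0.8, 0.1])
b = np.array([0.25, 0.75, 0.05])

# Cosine similarity = dot product divided by the product of the vector lengths
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(float(cos_sim), 3))  # close to 1.0: the vectors point in almost the same direction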

Use Cases

This approach is highly beneficial for:

  • Semantic Search: Finding documents or passages that are semantically relevant to a query, even if they don't share exact keywords.

  • Question Answering: Matching user questions to relevant answers or knowledge base articles.

  • FAQ Matching: Automatically routing customer queries to the most appropriate FAQ entry.

  • Duplicate Detection: Identifying sentences or short texts that convey the same meaning.

  • Clustering Text Data: Grouping similar sentences or documents together.
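
As an illustration of the semantic search and FAQ matching use cases, the sentence_transformers library also provides util.semantic_search, which returns the top-k corpus matches for each query as dictionaries with a corpus_id and a score. The sketch below uses a placeholder model name and a tiny three-sentence corpus.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')  # small, fast model; any SBERT model works here

corpus = [
    'How to cancel my order?',
    'Do you provide refund?',
    'What is the estimated delivery date of the product?'
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode('When is my product getting delivered?', convert_to_tensor=True)

# Each hit is a dict with a 'corpus_id' (index into corpus) and a 'score'
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit['corpus_id']], round(hit['score'], 3))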

Interview Questions

Here are some common interview questions related to this topic:

  1. How does Sentence-BERT help in matching semantically similar sentences? Sentence-BERT converts sentences into dense vector embeddings that capture semantic meaning. Similar sentences have embeddings that are close in the vector space, allowing for efficient similarity comparison using metrics like cosine similarity.

  2. Which Sentence-BERT model is suitable for semantic similarity tasks? Models like bert-base-nli-mean-tokens, all-MiniLM-L6-v2, or models specifically fine-tuned on STS (Semantic Textual Similarity) datasets are generally suitable.

  3. What is the role of cosine similarity in finding similar sentences? Cosine similarity measures the angle between two sentence embeddings (vectors). It quantifies how similar the semantic direction of the sentences is. A higher cosine similarity score indicates greater semantic similarity.

  4. How do you encode sentences using Sentence-BERT for similarity comparison? You use the model.encode() method from the sentence_transformers library, passing the sentence(s) as input. It's often beneficial to set convert_to_tensor=True for compatibility with similarity calculation functions.

  5. Explain how you would use Sentence-BERT in an e-commerce FAQ matching system.

    1. Encode all existing FAQs into sentence embeddings.

    2. When a new user query arrives, encode it into a sentence embedding.

    3. Calculate the cosine similarity between the user query embedding and all FAQ embeddings.

    4. Identify the FAQ with the highest similarity score as the most relevant match.

  6. What does the output embedding vector represent in Sentence-BERT? The output embedding vector is a dense numerical representation of the sentence's meaning, capturing its semantic content, context, and relationships between words.

  7. How does Sentence-BERT differ from traditional keyword matching? Keyword matching relies on exact word matches and struggles with synonyms, paraphrasing, and understanding context. Sentence-BERT understands the meaning of sentences, allowing it to match semantically similar sentences even if they use different words.

  8. How can you optimize Sentence-BERT for real-time question-answer retrieval?

    • Model Choice: Use smaller, faster models (e.g., MiniLM variants).

    • Pre-computation: Pre-compute and store FAQ embeddings.

    • Indexing: Use vector databases or approximate nearest-neighbor libraries (e.g., FAISS, Annoy, Pinecone) for efficient nearest neighbor search on embeddings (see the sketch after this list).

    • Batching: Process incoming queries in batches if possible.

  9. What are the benefits of using dense sentence embeddings in NLP tasks? They enable semantic understanding, facilitate efficient similarity calculations, capture contextual nuances, and allow for powerful applications like semantic search, question answering, and clustering.

  10. How do you interpret the similarity score from the cosine similarity calculation? Scores range from -1 to 1 (though for sentence embeddings, they are typically between 0 and 1).

    • 1: Identical meaning.

    • Close to 1: Very similar meaning.

    • Around 0.5: Moderately similar.

    • Close to 0: Little to no semantic similarity.

    • Close to -1: Vectors pointing in opposite directions (rarely encountered in practice; cosine scores between SBERT embeddings of natural sentences seldom drop far below 0).
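
To make the pre-computation and indexing points from question 8 concrete, here is a minimal sketch that encodes the FAQ list once, stores the embeddings in a FAISS index (installed separately, e.g. pip install faiss-cpu), and encodes only the incoming query at request time. The model name, FAQ list, and choice of an exact inner-product index are illustrative; normalizing the embeddings makes inner product equivalent to cosine similarity.

import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

faqs = [
    'How to cancel my order?',
    'Do you provide refund?',
    'What is the estimated delivery date of the product?'
]

# Pre-compute FAQ embeddings once; normalization makes inner product == cosine similarity
faq_embeddings = model.encode(faqs, normalize_embeddings=True)

index = faiss.IndexFlatIP(faq_embeddings.shape[1])  # exact inner-product index
index.add(faq_embeddings)

# At query time only the incoming question is encoded
query = model.encode(['When is my product getting delivered?'], normalize_embeddings=True)
scores, ids = index.search(query, 1)
print(faqs[ids[0][0]], float(scores[0][0]))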