Introduction

Functionality

Sent2Vec is a neural network.

Before discussing its principles, we can temporarily understand Sent2Vec as a mapping $f: \text{sentence} \mapsto \vec{v}$ from a sentence to a vector. A vector is an ordered arrangement of numbers, which we denote with an arrow over a letter, for example, $\vec{v} = (1, 2, 3)$. The dimension of a vector is defined as the number of elements it contains, written $\dim \vec{v}$; in the previous example, we have $\dim \vec{v} = 3$.

An $n$-dimensional vector can also represent the coordinates of a point in $n$-dimensional space; for example, $(1, 2)$ can be a point in a plane (two-dimensional space), while $(1, 2, 3)$ could represent a point in three-dimensional space (if you consider the corner of a room as the origin, measured in meters) that is $1\,\mathrm{m}$ from one wall, $2\,\mathrm{m}$ from another, and $3\,\mathrm{m}$ from the floor. Similarly, you can envision an arrow starting at the origin and ending at the point represented by the vector; this arrow can also be expressed using the same vector.

For two vectors of the same dimension, the closer their represented points (or arrows) are, the smaller the difference between the vectors; Sent2Vec encodes every sentence into a vector of exactly this kind of common, fixed dimension. Based on this, by calculating the angle between the vectors of two sentences in arrow form, or the distance between them in point form, we can quickly assess their semantic similarity.
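
As a minimal sketch of these two comparisons (using NumPy, with two made-up vectors standing in for sentence embeddings):

```python
import numpy as np

# Two made-up vectors standing in for the embeddings of two sentences.
a = np.array([0.9, 0.1, 0.3])
b = np.array([0.8, 0.2, 0.4])

# Angle-based similarity: cosine of the angle between the two arrows.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Distance-based similarity: Euclidean distance between the two points.
distance = np.linalg.norm(a - b)

print(f"cosine similarity:  {cosine:.3f}")   # closer to 1 means more similar
print(f"euclidean distance: {distance:.3f}") # closer to 0 means more similar
```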

Fuzzy Matching

Fuzzy matching, or similarity matching, aims to find the degree of similarity between two or more objects. In natural language processing, fuzzy matching is commonly used in text search, information retrieval, and recommendation systems.

Example: King, Queen, Male, Female

Word2Vec can be understood as Sent2Vec at the word level, primarily used to generate word vectors, mapping each word to a dense vector space. These vectors capture semantic and syntactic relationships between words. In fuzzy matching, we can assess the semantic similarity of two words by calculating the similarity between their word vectors.

Suppose we have already trained a Word2Vec model $w \mapsto \vec{w}$. In Word2Vec, we have:

$$\vec{w}_{\text{king}} - \vec{w}_{\text{man}} \approx \vec{w}_{\text{queen}} - \vec{w}_{\text{woman}}$$

This equation indicates that the vector transformation from "king" to "man" is approximately the same as from "queen" to "woman" in the vector space. This reflects the analogy between words. Calculating the vector similarity between "king" and "queen" and between "man" and "woman" will reveal a high degree of similarity between them.

This means that the model understands the similarity between "female king" and "queen", thus achieving fuzzy matching between the two expressions.
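
As an illustrative sketch (assuming a gensim Word2Vec model trained on a general-purpose corpus; the file path is hypothetical, and the exact neighbors and scores depend entirely on the training data):

```python
from gensim.models import Word2Vec

# Hypothetical path; any trained gensim Word2Vec model will do.
model = Word2Vec.load("word2vec.model")

# Direct similarity between word vectors.
print(model.wv.similarity("king", "queen"))
print(model.wv.similarity("man", "woman"))

# The classic analogy: king - man + woman ≈ queen.
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```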

Similarly, Sent2Vec can achieve fuzzy matching between semantically similar sentences, enabling fuzzy matching between user inputs and the content of a component database, thereby avoiding the failures that strict (exact) matching suffers when the user input changes even slightly.

Principles

In more professional terms, the Sent2Vec model is an unsupervised learning model for generating sentence vectors (sentence embedding). Its core function is to map variable-length text sequences to a fixed-dimensional dense vector space, ensuring that semantically similar sentences have smaller distances in this vector space.

Word2Vec

Before training the Sent2Vec model, we typically need to prepare a Word2Vec model to encode the words that appear in sentences.

Word Occurrence Probability

Let us denote a word as $w$ and a sentence consisting of $N$ words as $S$, where the words are sequentially recorded as $w_1, w_2, \ldots, w_N$. The overall probability of this sentence is:

$$P(S) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_N \mid w_1, w_2, \ldots, w_{N-1})$$

where $P(w_i \mid w_1, \ldots, w_{i-1})$ denotes the probability of $w_i$ appearing immediately after $w_1, \ldots, w_{i-1}$. We use $\#(S)$ to represent the occurrence count of a sentence $S$ in the collected corpus. By the law of large numbers, given sufficient samples, we have:

$$P(S) \approx \frac{\#(S)}{\sum_{S'} \#(S')}$$

where the denominator runs over all sentences in the corpus.

However, for sufficiently long texts, we can assume that only nearby words influence the current word; for example, when "nearby" refers to the previous word, we have:

$$P(S) \approx P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_2) \cdots P(w_N \mid w_{N-1})$$
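
As a toy sketch of this bigram (Markov) approximation, estimating the conditional probabilities from corpus counts (the corpus here is made up and far too small for real estimates; it only illustrates the mechanics):

```python
from collections import Counter

# A tiny made-up corpus; real estimates need far more data.
corpus = [
    "a queen rules a kingdom",
    "a king rules a kingdom",
    "a queen is a female",
]
tokens = [s.split() for s in corpus]

# Count single words and adjacent word pairs (bigrams).
unigrams = Counter(w for s in tokens for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in tokens for i in range(len(s) - 1))

def p_next(w_prev: str, w: str) -> float:
    """Estimate P(w | w_prev) from corpus counts."""
    return bigrams[(w_prev, w)] / unigrams[w_prev]

# P(S) under the bigram assumption (ignoring the initial factor P(w_1) for brevity).
sentence = "a queen rules a kingdom".split()
p = 1.0
for prev, cur in zip(sentence, sentence[1:]):
    p *= p_next(prev, cur)
print(p)
```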

Context of Similar Words

Consider the following two sentences:

A woman-king is a female who rules a kingdom.
A queen      is a female who rules a kingdom.

It can be seen that the contexts of the two phrases with similar meanings, "woman-king" and "queen," are likely to be identical. Generally, we consider that words with similar meanings have similar contexts (because of their semantic similarity, replacing these words should not significantly affect their compatibility with the context). This means that if $w_a$ and $w_b$ are semantically similar, then according to the method of expressing the context word occurrence probability mentioned above, we should have:

$$P(w \mid w_a) \approx P(w \mid w_b) \quad \text{for every word } w$$

Predicting the Next Word

To represent words, we consider encoding each word $w$ into an $n$-dimensional feature vector $\vec{w}$; we wish to achieve this through a function $f$, i.e., $\vec{w} = f(w)$. We also consider a model that predicts the probability distribution of the next word based on the previous word; specifically, we want another function $g$ that maps a feature vector to such a distribution, and this function should possess continuous, smooth, and other desirable properties. Thus, we seek $f$ and $g$ that satisfy:

$$P(w_i \mid w_{i-1}) \approx g(f(w_{i-1}))_{w_i}$$

where the subscript $w_i$ picks out the component of the distribution corresponding to the word $w_i$.

Here, $g$ can be represented by a matrix $U$ with one row per vocabulary word, followed by a normalization such as softmax: $g(\vec{w}) = \operatorname{softmax}(U\vec{w})$.

We aim to make this prediction with a fixed set of parameters. As mentioned in the equation above, for semantically similar words $w_a$ and $w_b$, we have:

$$P(w \mid w_a) \approx P(w \mid w_b)$$

Since the parameters in $g$ are fixed, it follows that:

$$g(f(w_a)) \approx g(f(w_b))$$

Thus, $g$ remains the same on both sides, leading to:

$$f(w_a) \approx f(w_b)$$

In other words, semantically similar words are mapped to nearby feature vectors.

One-Hot Encoding and the Matrix $W$

We can perform one-hot encoding for each word. Assuming the one-hot vector $\vec{e}_i$ for the $i$-th word has a value of $1$ only in the $i$-th dimension (and $0$ elsewhere), then for a vocabulary of size $V$, we construct an $n \times V$ matrix $W$ whose $i$-th column is the feature vector of the $i$-th word:

$$W = \begin{pmatrix} f(w_1) & f(w_2) & \cdots & f(w_V) \end{pmatrix}$$

Thus, the function $f$ from the previous section can be computed as:

$$f(w_i) = W \vec{e}_i$$
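
A minimal NumPy sketch of this (with made-up sizes $n = 4$ and $V = 5$; the point is only that multiplying by a one-hot vector selects a column of $W$):

```python
import numpy as np

n, V = 4, 5                      # embedding dimension and vocabulary size (made up)
rng = np.random.default_rng(0)
W = rng.normal(size=(n, V))      # columns are the word feature vectors

i = 2                            # index of some word w_i in the vocabulary
e_i = np.zeros(V)
e_i[i] = 1.0                     # one-hot vector for w_i

# f(w_i) = W e_i: the matrix-vector product just picks out column i.
assert np.allclose(W @ e_i, W[:, i])
print(W @ e_i)
```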

Training

Training from Scratch

Prediction vs. Actual

Before training the Word2Vec model, we need to perform some preparatory work, such as collecting a large set of common sentences, splitting each sentence into its individual words (tokenization), and removing words that appear too infrequently.

Next, we randomly initialize matrices $W$ and $U$, enabling us to derive functions $f$ and $g$. Subsequently, we can compute the predicted probabilities using:

$$P(w_j \mid w_i) \approx g(f(w_i))_{w_j} = \operatorname{softmax}(U W \vec{e}_i)_{w_j}$$

The predicted probabilities can then be compared to the empirical probabilities derived from the collected corpus; maximizing the likelihood of the corpus under the model allows us to update the matrices (for example, by gradient descent) until convergence.
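
A bare-bones NumPy sketch of this loop (a toy bigram softmax model with made-up sizes and data; real Word2Vec training uses efficiency tricks such as negative sampling that are omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
V, n, lr = 5, 4, 0.1   # vocabulary size, embedding dimension, learning rate (made up)

# Toy training data: (previous word index, next word index) pairs from a corpus.
pairs = [(0, 1), (1, 2), (2, 0), (0, 1), (1, 3), (3, 4)]

W = rng.normal(scale=0.1, size=(n, V))  # input embeddings: f(w_i) = W[:, i]
U = rng.normal(scale=0.1, size=(V, n))  # output weights:   g(v) = softmax(U @ v)

def softmax(z):
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

for epoch in range(200):
    loss = 0.0
    for i, j in pairs:
        v = W[:, i]                      # f(w_i)
        p = softmax(U @ v)               # predicted distribution over next words
        loss -= np.log(p[j])             # negative log-likelihood of the true next word
        # Gradients of cross-entropy loss through a softmax output.
        d = p.copy()
        d[j] -= 1.0                      # dL/d(Uv) = p - one_hot(j)
        grad_U = np.outer(d, v)
        grad_v = U.T @ d
        U -= lr * grad_U
        W[:, i] -= lr * grad_v
    if epoch % 50 == 0:
        print(f"epoch {epoch}: loss {loss:.3f}")
```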

Sent2Vec Training

In summary, the training of the Sent2Vec model is a straightforward extension of the Word2Vec model. The only difference is that the training process is designed to learn sentence embeddings rather than word embeddings.

  1. Randomly Sample Sentences: Randomly select sentences from the corpus and encode each of them into vectors using the word-level encoder $f$.
  2. Sentence Pairing: Create sentence pairs; each positive pair consists of semantically similar sentences, while negative pairs consist of semantically dissimilar sentences.
  3. Feature Extraction: For each sentence pair, compute the features for the sentence vectors.
  4. Matrix Representation: Represent the training features using matrices $W$ and $U$.
  5. Stochastic Gradient Descent: Utilize stochastic gradient descent to update the parameters of matrices $W$ and $U$ to minimize the loss.
  6. Output Vectors: Finally, after training, obtain the sentence vectors for further applications.
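
As an inference-time sketch of step 6 (averaging word vectors as a simplified stand-in for a trained Sent2Vec model, which additionally averages in n-gram vectors; the model path is hypothetical, reusing the gensim Word2Vec model from earlier):

```python
import numpy as np
from gensim.models import Word2Vec

# Hypothetical path; any trained gensim Word2Vec model will do.
model = Word2Vec.load("word2vec.model")

def embed_sentence(sentence: str) -> np.ndarray:
    """Simplified sentence embedding: average the vectors of in-vocabulary words."""
    vectors = [model.wv[w] for w in sentence.lower().split() if w in model.wv]
    return np.mean(vectors, axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = embed_sentence("A woman-king is a female who rules a kingdom")
b = embed_sentence("A queen is a female who rules a kingdom")
print(cosine(a, b))   # a high cosine similarity indicates a fuzzy match
```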

Applications

Sent2Vec can be applied in various scenarios, including:

  1. Semantic Textual Similarity: Measuring the similarity between two sentences for search engines or recommendation systems.
  2. Text Classification: Classifying texts into categories based on their semantic meaning.
  3. Information Retrieval: Improving search results by finding semantically similar documents.
  4. Chatbots and Dialogue Systems: Enhancing the understanding of user input by capturing the meaning of sentences.
  5. Natural Language Generation: Assisting in generating human-like text responses based on input semantics.
  6. Document Clustering: Grouping documents based on semantic similarity to facilitate organization and retrieval.

By leveraging the power of Sent2Vec, we can delve deeper into understanding language and enhance various applications that rely on semantic understanding.

Conclusion

Sent2Vec represents a significant advancement in generating sentence embeddings. By capturing semantic relationships between sentences, it provides a foundation for numerous natural language processing applications. Understanding its principles and functionality is crucial for effectively utilizing this model in various contexts. As natural language processing continues to evolve, tools like Sent2Vec will play a pivotal role in enhancing our understanding of language and communication.
