Introduction
Transforming document context into vectors provide several advantages such as:
- calculate similarity between documents with vector distance
- find similar groups of documents by applying a cluster algorithm
- get a visual representation of similar documents with dimensionality reduction
There are several designs to represent a document as a vector, one of the simplest is a word-count per document.
Also how to measure similarity between document vectors.
Transforming a Document into a Vector
In this approach, each vector position represents a words and each value a frequency count (word-count).
This way the text below:
I like to eat hamburguer.
Buying peperoni hamburguer to eat.
Is mapped as:
I | like | to | eat | hamburguer | buying | peperoni |
---|---|---|---|---|---|---|
1 | 1 | 2 | 2 | 2 | 1 | 1 |
Similarity Among Documents
Using the idea describe above, several documents can be mapped as vectors and placed in a table such as:
document | I | do | play | weekend | ball | peperoni | sometimes | … |
---|---|---|---|---|---|---|---|---|
D1 | 1 | 2 | 2 | 0 | 0 | 2 | 7 | … |
D2 | 12 | 0 | 0 | 2 | 0 | 0 | 11 | … |
D3 | 8 | 12 | 0 | 0 | 1 | 1 | 9 | … |
… | … | … | … | … | … | … | … | … |
Given that documents are represented as vector, the distance between 2 vectors can be calculated as:
Euclidean distance \(dist(D1, D2)=\sqrt{ \sum_{i=1}^{n} (D1 - D2)^2 }\)
Cosine similarity \(\cos(\theta)=\frac{D1 \cdot D2}{\left \| D1 \right \| \left \| D2 \right \|}\) \(\cos(\theta)=\frac{ \sum_{i=1}^{n} D1_i D2_i }{ \sqrt{ \sum_{i=1}^{n} {D1_i}^2 } \cdot \sqrt{ \sum_{i=1}^{n} {D2_i}^2 } }\)
PS: Prefer cosine similarity is preferred since it works well even with vectors with different magnitude.
PS: Consider converting document vector to unitary vector before using with euclidean distance
Find Document Clusters
K-Means with Euclidean Distance
Euclidean distance provides a distance function between 2 document-vectors D1 and D2 by considering each dimension distance separately.
Besides an average vector can be calculated by averaging a dataset of document vectors in a table \(DT\).
\(DT=\frac{ \sum_{i=1}^{n} DT_i }{ n }\)
Given the documents:
document | I | do | play | weekend | ball | peperoni | sometimes |
---|---|---|---|---|---|---|---|
D1 | 1 | 2 | 2 | 0 | 0 | 2 | 7 |
D2 | 12 | 0 | 0 | 2 | 0 | 0 | 11 |
D3 | 8 | 12 | 0 | 0 | 1 | 1 | 9 |
The sum of those documents are:
I | do | play | weekend | ball | peperoni | sometimes | |
---|---|---|---|---|---|---|---|
sum | 21 | 14 | 2 | 2 | 1 | 3 | 27 |
with n = 3 (number of documents) then average vector is:
I | do | play | weekend | ball | peperoni | sometimes | |
---|---|---|---|---|---|---|---|
average | 7 | 4.667 | 0.667 | 0.667 | 0.333 | 1 | 9 |
Given a enclidean distance and average functions above, then a cluster mechanism can be built on top of a cluster algorithm such as K-means whose standard steps are described as:
- \(K\) initial vectors are randomly generated inside data range as centroids
- Each document-vector is associated to the closest centroid using the distance function (K clusters)
- Inside each cluster \(K_i\), a new centroid is calculated with average vector function
- Repeat steps 2 and 3 until the distance of new centroid’s locations from original is less than a threshold
Visual Representation with Dimensionality Reduction
Visualising documents in 2 dimensions can provide a graphical and intuitive way to communicate how or perhaps why groups of documents are created.
First each document vector need to be mapped to a 2 dimension vector so it can be visualised with PCA (principal component analysis).
Details on PCA can be found here.
With the 2D document vectors, one can come up with a visualisation such as scatterplot to present to an audience or gain more insigths on the data:
Conclusion
Learning how to convert document to vector can help perform analysis more quickly. Some applications are cited above however there are more, for instance, document-vectors that belong to the same category such as finance, movies, tech, etc can be summed up together to describe a vector information of that group and perhaps help classify another document to the nearest category using a distance function.
The new document to be classified just needs to be converted to a vector and then this vector needs to be compared by distance to the document-category vectors.
Another application is to detect plagiarism, by measuring the distance between a given input document to a dataset of known documents.