
[latexpage]In Natural Language Processing, we sometimes need to estimate the similarity between text documents. There are several ways to do this, such as Cosine Distance, Jaccard Similarity and Euclidean Distance. In this article we discuss the Jaccard Similarity and work through an example.

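The Jaccard similarity of two documents is defined as the size of the intersection of their word sets divided by the size of the union of those sets. In other words, it is the number of unique words the two documents have in common, divided by the total number of unique words across both documents.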
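
The definition above can be written compactly in set notation. Here $A$ and $B$ are symbols introduced for illustration only; they stand for the sets of unique words in the first and second document:

\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|} \]

The vertical bars denote the number of elements in a set.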

The Jaccard similarity score ranges from 0 to 1. If the two documents being compared are identical, the score is 1. If the two documents have no words in common, the score is 0.
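
The two boundary cases can be checked with a few lines of Python. This is only a minimal sketch: the helper name `jaccard` and the sample word sets are made up for illustration, and the helper assumes at least one of the two sets is non-empty.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two word sets (hypothetical helper).

    Assumes the union is non-empty; two empty sets would divide by zero.
    """
    return len(a & b) / len(a | b)


# Same set of words on both sides: intersection equals union
identical = jaccard({"round", "and", "green"}, {"round", "and", "green"})

# No shared words: the intersection is empty
disjoint = jaccard({"apple", "green"}, {"orange", "round"})

print(identical)  # 1.0
print(disjoint)   # 0.0
```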

Consider the following two example documents:

doc1 = “This apple round and green”

doc2 = “The orange is round and orange”

First, we must convert each document to lower case and tokenise it to obtain a set of unique words. Note that "orange" appears twice in doc2 but only once in its set.

[php]doc1 = {'this', 'apple', 'round', 'and', 'green'}[/php]

[php]doc2 = {'the', 'orange', 'is', 'round', 'and'}[/php]
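
With the two word sets in hand, the intersection and union can be inspected directly using Python's built-in set operators. The sketch below assumes the words have already been lower-cased; the variable names `common` and `all_words` are illustrative.

```python
# Unique, lower-cased words of each example document
words_doc1 = {"this", "apple", "round", "and", "green"}
words_doc2 = {"the", "orange", "is", "round", "and"}

# Words that appear in both documents
common = words_doc1 & words_doc2

# Words that appear in at least one document
all_words = words_doc1 | words_doc2

print(sorted(common))               # ['and', 'round']
print(len(common), len(all_words))  # 2 8
```

Two shared words out of eight unique words is the ratio that the Jaccard similarity reports.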

The Jaccard Similarity metric can be used to determine the similarity between two text documents by measuring how many words they have in common relative to the total number of unique words across both documents.

[php]def Jaccard_Similarity_fnct(doc1, doc2): 
    
    # List the unique words in the documents
    words_doc1 = set(doc1.lower().split()) 
    words_doc2 = set(doc2.lower().split())
    
    # Find the intersection of words list of doc1 & doc2
    intersection = words_doc1.intersection(words_doc2)

    # Find the union of words list of doc1 & doc2
    union = words_doc1.union(words_doc2)
        
    # Calculate Jaccard similarity score 
    # using length of intersection set divided by length of union set
    return float(len(intersection)) / len(union)

doc1 = "This apple round and green"
doc2 = "The orange is round and orange"
Jaccard_Similarity_fnct(doc1,doc2)[/php]

The Jaccard similarity between doc1 and doc2 is 0.25: the two documents share 2 words ("round" and "and") out of 8 unique words in total.

However, the following documents yield a Jaccard similarity score of approximately 0.44:

[php]doc1 = "This apple round and is a fruit"
doc2 = "The orange fruit is round and orange"
Jaccard_Similarity_fnct(doc1,doc2)[/php]
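
To see where the 0.44 comes from, the score for this second pair can be recomputed as an exact fraction. This is a rough sketch using Python's standard `fractions` module; the variable names are illustrative, and the lower-case and whitespace-split steps are assumed to match the function above.

```python
from fractions import Fraction

# Lower-case and split on whitespace to get the unique words
words1 = set("This apple round and is a fruit".lower().split())
words2 = set("The orange fruit is round and orange".lower().split())

shared = words1 & words2    # words in both documents
combined = words1 | words2  # words in either document

# Keep the score as an exact ratio before rounding
score = Fraction(len(shared), len(combined))

print(sorted(shared))          # ['and', 'fruit', 'is', 'round']
print(score)                   # 4/9
print(round(float(score), 2))  # 0.44
```

The documents share 4 of 9 unique words, so the score is 4/9, which rounds to 0.44.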