
How to compute the similarity between two text documents?

  • I am looking at working on an NLP project, in any programming language (though Python will be my preference). I want to take two documents and determine how similar they are.
      August 1, 2020 3:28 PM IST
    1
  • I have tried using the NLTK package in Python to find the similarity between two or more text documents. One common use case is to check all the bug reports on a product to see if two bug reports are duplicates.

     

    A document is characterised by a vector where the value of each dimension corresponds to the number of times that term appears in the document. Cosine similarity then gives a useful measure of how similar two documents are likely to be in terms of their subject matter.
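
    For illustration, here is a minimal sketch of cosine similarity over term-count vectors (the helper name and the toy sentences are mine, not taken from the bug reports):

    import math
    from collections import Counter

    def cosine_similarity(counts_a, counts_b):
        # dot product over the terms the two documents share
        dot = sum(counts_a[t] * counts_b[t] for t in counts_a if t in counts_b)
        norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
        norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
        return dot / (norm_a * norm_b)

    doc_a = Counter('the crash happens when the browser starts'.split())
    doc_b = Counter('the browser crash happens on startup'.split())
    print(cosine_similarity(doc_a, doc_b))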

    So I downloaded a few bugs from https://bugzilla.mozilla.org/show_bug.cgi?id=bugid

    The first step is to import all the relevant packages, open each file, read its lines, and tokenise the words. Convert the words to lower case.

    Use the Porter stemmer to stem the words. Stemming is the process of reducing inflected words to their stem or root form: "runs" and "running" both get reduced to the root "run".

    Remove stop words like "a" and "the". In natural language processing, such low-information words are referred to as stop words.

    Then count the occurrences of each word in the document.



    Then calculate the cosine similarity between two different bug reports.
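
    Reusing the cosine_similarity helper from the sketch above, a rough outline of the whole pipeline might look like this (the file names are placeholders, and NLTK's punkt and stopwords data are assumed to be available):

    import nltk
    from collections import Counter
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    nltk.download('punkt')       # tokeniser models, if not already present
    nltk.download('stopwords')   # stop word list

    stemmer = PorterStemmer()
    stop_words = set(stopwords.words('english'))

    def term_counts(path):
        # read the file, lowercase, tokenise, drop stop words, stem, count
        with open(path, encoding='utf-8') as f:
            tokens = nltk.word_tokenize(f.read().lower())
        return Counter(stemmer.stem(t) for t in tokens
                       if t.isalpha() and t not in stop_words)

    bug_a = term_counts('bug_599831.txt')    # placeholder file names
    bug_b = term_counts('bug_1055525.txt')
    print(cosine_similarity(bug_a, bug_b))   # helper from the earlier sketch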


    The output shows that Bug#599831 and Bug#1055525 are more similar than the rest of the pairs.

      September 12, 2020 10:52 AM IST
    1
  • Bayesian filters are designed for exactly this purpose. That's the technology you'll find in most tools that identify spam.

    For example, to detect a language (snippet from http://sebsauvage.net/python/snyppets/#bayesian):

    from reverend.thomas import Bayes
    guesser = Bayes()
    guesser.train('french','La souris est rentrée dans son trou.')
    guesser.train('english','my tailor is rich.')
    guesser.train('french','Je ne sais pas si je viendrai demain.')
    guesser.train('english','I do not plan to update my website soon.')
    
    >>> print(guesser.guess('Jumping out of cliffs it not a good idea.'))
    [('english', 0.99990000000000001), ('french', 9.9999999999988987e-005)]
    
    >>> print(guesser.guess('Demain il fera très probablement chaud.'))
    [('french', 0.99990000000000001), ('english', 9.9999999999988987e-005)]

    But it works to detect any type of text you train it for: technical text, songs, jokes, etc., as long as you can provide enough material to let the tool learn what your documents look like.

      September 11, 2020 5:49 PM IST
    0
  • If these are pure text documents, or you have a method to extract the text from the documents, you can use a technique called shingling.

    You first compute a unique hash for each document. If the hashes are the same, you are done.

    If not, you break each document down into smaller chunks. These are your 'shingles.'

    Once you have the shingles, you can then compute identity hashes for each shingle and compare the hashes of the shingles to determine if the documents are actually the same.
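
    A minimal sketch of the shingling idea, assuming the documents are plain strings and using word-level shingles of an arbitrary size of 4:

    import hashlib

    def shingle_hashes(text, k=4):
        # hash every run of k consecutive words (the "shingles")
        words = text.split()
        shingles = (' '.join(words[i:i + k]) for i in range(len(words) - k + 1))
        return {hashlib.md5(s.encode('utf-8')).hexdigest() for s in shingles}

    def shingle_similarity(text_a, text_b, k=4):
        # quick exit: identical documents have identical whole-document hashes
        if hashlib.md5(text_a.encode('utf-8')).hexdigest() == \
           hashlib.md5(text_b.encode('utf-8')).hexdigest():
            return 1.0
        a, b = shingle_hashes(text_a, k), shingle_hashes(text_b, k)
        # Jaccard overlap of the two shingle-hash sets
        return len(a & b) / len(a | b) if (a | b) else 0.0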

    The other technique you can use is to generate n-grams of the entire documents, compute the number of similar n-grams in each document, and produce a weighted score for each document. Basically, an n-gram splits a word into smaller chunks: with 3-grams, 'apple' would become ' ap', 'app', 'ppl', 'ple', 'le '. This approach can become quite computationally expensive over a large number of documents or over two very large documents. Of course, common n-grams such as 'the', ' th', 'th ', etc. need to be weighted so that they score lower.
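
    And a rough sketch of the character n-gram variant with 3-grams, where very common n-grams are down-weighted by their overall frequency (just one simple weighting choice):

    from collections import Counter

    def char_ngrams(text, n=3):
        # pad with a space so word boundaries show up in the n-grams
        padded = ' ' + text.lower() + ' '
        return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

    def ngram_similarity(text_a, text_b, n=3):
        a, b = char_ngrams(text_a, n), char_ngrams(text_b, n)
        combined = a + b
        # shared n-grams, each weighted by 1 / (its total count in both documents)
        shared = sum(min(a[g], b[g]) / combined[g] for g in a.keys() & b.keys())
        norm = sum(c / combined[g] for g, c in a.items()) + \
               sum(c / combined[g] for g, c in b.items())
        return 2 * shared / norm if norm else 0.0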

      September 11, 2020 5:51 PM IST
    0
  • import nltk, string
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    nltk.download('punkt') # if necessary...
    
    
    stemmer = nltk.stem.porter.PorterStemmer()
    remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
    
    def stem_tokens(tokens):
        return [stemmer.stem(item) for item in tokens]
    
    def normalize(text):
        '''remove punctuation, lowercase, stem'''
        return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))
    
    # TF-IDF vectoriser that applies the normaliser above and drops English stop words
    vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
    
    def cosine_sim(text1, text2):
        tfidf = vectorizer.fit_transform([text1, text2])
        # rows are L2-normalised by default, so the dot product gives cosine similarity
        return ((tfidf * tfidf.T).A)[0, 1]
    
    
    print(cosine_sim('a little bird', 'a little bird'))
    print(cosine_sim('a little bird', 'a little bird chirps'))
    print(cosine_sim('a little bird', 'a big dog barks'))
      September 12, 2020 10:55 AM IST
    0
  • I found this can be done easily with spaCy. Once the documents have been parsed, the similarity() method can be used to find the cosine similarity between the document vectors.

    import spacy

    # 'en' was a spaCy 1.x/2.x shortcut; newer versions need a model with word
    # vectors instead, e.g. spacy.load('en_core_web_md')
    nlp = spacy.load('en')
    doc1 = nlp(u'Hello hi there!')
    doc2 = nlp(u'Hello hi there!')
    doc3 = nlp(u'Hey whatsup?')

    print(doc1.similarity(doc2)) # 0.999999954642
    print(doc2.similarity(doc3)) # 0.699032527716
    print(doc1.similarity(doc3)) # 0.699032527716
      September 12, 2020 11:21 AM IST
    0
  • For syntactic similarity, there are three easy ways of detecting similarity (a word-vector sketch follows the list):

    • Word2Vec
    • GloVe
    • TF-IDF or CountVectorizer
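
    For the word-vector options, a minimal sketch using gensim's downloadable GloVe vectors (gensim and the model name are my assumptions, not part of the answer above):

    import gensim.downloader as api

    # small pre-trained GloVe model; downloads on first use
    word_vectors = api.load('glove-wiki-gigaword-50')

    def known_words(text):
        # keep only words the model has a vector for
        return [w for w in text.lower().split() if w in word_vectors]

    doc1 = known_words('the cat sat on the mat')
    doc2 = known_words('a kitten is sitting on the rug')

    # n_similarity averages each set of word vectors and returns their cosine
    print(word_vectors.n_similarity(doc1, doc2))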

    For semantic similarity, one can use BERT embeddings and try different word pooling strategies to get a document embedding, then apply cosine similarity to the document embeddings.
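
    A minimal sketch of that idea using the sentence-transformers library, which wraps the encoding and pooling steps (the library and model name are my assumptions, not part of the answer above):

    from sentence_transformers import SentenceTransformer, util

    # pre-trained model that mean-pools BERT-style token embeddings
    model = SentenceTransformer('all-MiniLM-L6-v2')

    docs = ['The app crashes when it starts.',
            'The application fails on startup.']
    embeddings = model.encode(docs)

    # cosine similarity between the two document embeddings
    print(util.cos_sim(embeddings[0], embeddings[1]).item())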

    A more advanced approach can use BERTScore to measure similarity.



    Research Paper Link: https://arxiv.org/abs/1904.09675

      September 12, 2020 11:24 AM IST
    0
  • To find sentence similarity with very little data and still get high accuracy, you can use the Python package below, which uses pre-trained BERT models:

    pip install similar-sentences
     
      September 12, 2020 12:38 PM IST
    0