
How to tokenize a group of words in Python

  • I am developing an application in Python that gives job recommendations based on an uploaded resume. I am trying to tokenize the resume before processing it further, and I want to tokenize groups of words. For example, "Data Science" is a keyword, but when I tokenize I get "data" and "science" separately. How can I overcome this? Is there a library that does this kind of extraction in Python?

     
      October 26, 2021 1:22 PM IST
    0
  • If you want to tokenize every word in the resume on a delimiter such as a space, so that your example input "Data Science" gives the output ["data", "science"], the following function lowercases a string and splits its contents on spaces, returning a list of strings.
    def tokenize(resume_string):
        return resume_string.lower().split(" ")
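If the goal is instead to keep a keyword such as "data science" together, one possible approach is to split as above and then merge any run of words that matches a known phrase. The following is only a minimal sketch: the function name tokenize_with_phrases and the KEYWORDS list are hypothetical, and it assumes you already have a list of multi-word keywords to look for. It does not strip punctuation.

```python
def tokenize_with_phrases(text, phrases):
    """Split text on whitespace, keeping known multi-word phrases as single tokens."""
    words = text.lower().split()
    # Store each phrase as a tuple of words for fast lookup.
    phrase_set = {tuple(p.lower().split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)

    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate first so a longer phrase wins over a shorter one.
        for n in range(min(max_len, len(words) - i), 1, -1):
            candidate = tuple(words[i:i + n])
            if candidate in phrase_set:
                tokens.append(" ".join(candidate))
                i += n
                break
        else:
            # No phrase starts at this position, so keep the single word.
            tokens.append(words[i])
            i += 1
    return tokens


# Hypothetical keyword list; in practice this would come from your own skills vocabulary.
KEYWORDS = ["data science", "machine learning"]
tokens = tokenize_with_phrases(
    "Working in Data Science and machine learning for years", KEYWORDS
)
print(tokens)
```

With this input, "data science" and "machine learning" each come back as one token, while the remaining words stay separate.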
      October 27, 2021 2:09 PM IST
    0
  • It looks like you want to generate n-grams (bigrams in particular). If that's the case, the following is one way to achieve this:

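    from nltk import ngrams

    resume = '... working in the data science field for years ...'
    n = 2  # bigrams
    # Split on whitespace, then slide a window of n words over the result.
    bigrams = ngrams(resume.split(), n)
    for grams in bigrams:
        print(grams)  # each item is a tuple of n consecutive words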
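For reference, the same idea can be sketched without NLTK, followed by a filter that keeps only the bigrams that are known keywords. The helper make_ngrams and the KNOWN_PHRASES set are hypothetical names used for illustration, and the sketch assumes you maintain your own set of keywords.

```python
def make_ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    # zip stops at the shortest slice, which yields len(words) - n + 1 windows.
    return list(zip(*(words[i:] for i in range(n))))


resume = "working in the data science field for years"
bigrams = make_ngrams(resume.split(), 2)

# Hypothetical keyword set, stored as tuples so it can be compared to the bigrams directly.
KNOWN_PHRASES = {("data", "science")}
matches = [" ".join(gram) for gram in bigrams if gram in KNOWN_PHRASES]
print(matches)
```

Most bigrams in a resume are noise ("in the", "for years"), so filtering against a keyword set, or ranking by frequency, is usually needed before the n-grams are useful for matching jobs.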
    
      October 28, 2021 4:40 PM IST
    0