Understanding min_df and max_df in scikit CountVectorizer

  • I have five text files that I feed into a CountVectorizer. When specifying min_df and max_df on the CountVectorizer instance, what exactly do the min/max document frequencies mean? Is it the frequency of a word within its particular text file, or the frequency of the word across the entire corpus (all 5 files)?

    How is it different when min_df and max_df are provided as integers or as floats?

    The documentation doesn't seem to provide a thorough explanation, nor does it supply an example demonstrating the use of min_df and/or max_df. Could someone provide an explanation or an example demonstrating min_df and max_df?

      August 26, 2021 11:31 PM IST
    0
  • max_df is used for removing terms that appear too frequently, also known as "corpus-specific stop words". For example:

    max_df = 0.50 means "ignore terms that appear in more than 50% of the documents".
    max_df = 25 means "ignore terms that appear in more than 25 documents".
    The default max_df is 1.0, which means "ignore terms that appear in more than 100% of the documents". Thus, the default setting does not ignore any terms.

    min_df is used for removing terms that appear too infrequently. For example:

    min_df = 0.01 means "ignore terms that appear in less than 1% of the documents".
    min_df = 5 means "ignore terms that appear in fewer than 5 documents".
    The default min_df is 1, which means "ignore terms that appear in fewer than 1 document". Thus, the default setting does not ignore any terms.
      November 3, 2021 2:03 PM IST
    0
  • The goal of min_df is to ignore words that have too few occurrences to be considered meaningful. For example, your text may contain names of people that appear in only one or two documents. In some applications this may qualify as noise and can be excluded from further analysis. Similarly, you can ignore words that are too common with max_df.
    Instead of using a minimum/maximum term frequency (total occurrences of a word) to eliminate words, min_df and max_df look at how many documents contain a term, better known as document frequency. The threshold values can be an absolute count (e.g. 1, 2, 3, 4) or a proportion of documents (e.g. 0.25, i.e. 25% of the documents).
      November 8, 2021 5:22 PM IST
    0
  • max_df is the upper cutoff on document frequency, while min_df is the lower cutoff. If we want to remove more common words, we set max_df to a lower value between 0 and 1. If we want to remove more rare words, we set min_df to a higher value.
      August 27, 2021 7:14 PM IST
    0
  • I would add one more point for understanding min_df and max_df in tf-idf.

    If you go with the default values, meaning all terms are considered, you will certainly generate more tokens, so your clustering process (or whatever else you do with those terms later) will take longer.

    BUT the quality of your clustering should NOT be reduced.

    One might think that keeping all terms (e.g. very frequent terms or stop words) would lower the quality, but with tf-idf it doesn't: the tf-idf measure naturally assigns a low score to those terms, effectively making them uninfluential (as they appear in many documents).

    So to sum up, pruning terms via min_df and max_df improves performance, not the quality of the clusters (as an example).

    The crucial point is that if you set min and max incorrectly, you lose important terms and thus lower the quality. So if you are unsure about the right thresholds (they depend on your document set), or if you are confident in your machine's processing capabilities, leave the min_df and max_df parameters unchanged.
      August 31, 2021 12:28 PM IST
    0