I have five text files that I input to a CountVectorizer. When specifying min_df and max_df to the CountVectorizer instance what does the min/max document frequency exactly means? Is it the frequency of a word in its particular text file or is it the frequency of the word in the entire overall corpus (5 txt files)?
How is it different when min_df and max_df are provided as integers or as floats?
The documentation doesn't seem to provide a thorough explanation nor does it supply an example to demonstrate the use of min_df and/or max_df. Could someone provide an explanation or example demonstrating min_df or max_df.
MIN_DF
is to ignore words that have very few occurrences to be considered meaningful. For example, in your text you may have names of people that may appear in only 1 or two documents. In some applications, this may qualify as noise and could be eliminated from further analysis. Similarly, you can ignore words that are too common with MAX_DF
.MIN_DF
and MAX_DF
look at how many documents contained a term, better known as document frequency. The threshold values can be an absolute value (e.g. 1, 2, 3, 4) or a value representing proportion of documents (e.g. 0.25 meaning, ignore words that have appeared in 25% of the documents) .The goal of MIN_DF is to ignore words that have very few occurrences to be considered meaningful. For example, in your text you may have names of people that may appear in only 1 or two documents. In some applications, this may qualify as noise and could be eliminated from further analysis. Similarly, you can ignore words that are too common with MAX_DF.
Instead of using a minimum/maximum term frequency (total occurrences of a word) to eliminate words, MIN_DF and MAX_DF look at how many documents contained a term, better known as document frequency. The threshold values can be an absolute value (e.g. 1, 2, 3, 4) or a value representing proportion of documents (e.g. 0.25 meaning, ignore words that have appeared in 25% of the documents) .