Estimating document polarity using R's qdap package without sent

  • I'd like to apply qdap's polarity function to a vector of documents, each of which could contain multiple sentences, and obtain the corresponding polarity for each document. For example:

    library(qdap)
    polarity(DATA$state)$all$polarity
    # Results:
    # [1] -0.8165 -0.4082 0.0000 -0.8944 0.0000 0.0000 0.0000 -0.5774 0.0000
    # [10] 0.4082 0.0000
    # Warning message:
    # In polarity(DATA$state) :
    # Some rows contain double punctuation. Suggested use of `sentSplit` function.

    This warning can't be ignored, because polarity seems to add up the scores of each sentence in the document, which can push document-level polarity scores outside the [-1, 1] bounds.

    I'm aware that I could first run sentSplit and then average across the sentences, perhaps weighting polarity by word count, but (1) this is inefficient (it takes roughly 4x as long as running on the full documents with the warning), and (2) it is unclear how the sentences should be weighted. That option would look something like this:

    DATA$id <- seq(nrow(DATA)) # For identifying and aggregating documents
    sentences <- sentSplit(DATA, "state")
    library(data.table) # For aggregation
    pol.dt <- data.table(polarity(sentences$state)$all)
    pol.dt[, id := sentences$id]
    document.polarity <- pol.dt[, sum(polarity * wc) / sum(wc), "id"]
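
    The word-count weighting in the last line can also be sketched in base R, without data.table. The numbers below are made up for illustration (they are not qdap output), and `sent` is a hypothetical stand-in for the per-sentence polarity table.

    ```r
    # Hypothetical per-sentence scores: two sentences in document 1, one in document 2.
    sent <- data.frame(
      id       = c(1, 1, 2),
      polarity = c(0.5, -0.5, 0.4),
      wc       = c(4, 2, 6)
    )

    # Word-count-weighted mean polarity per document id.
    doc_polarity <- sapply(
      split(sent, sent$id),
      function(d) weighted.mean(d$polarity, d$wc)
    )
    doc_polarity
    ```

    weighted.mean(x, w) computes sum(x * w) / sum(w), the same expression used in the data.table call.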

    I was hoping I could run polarity on a version of the vector with periods removed, but it seems that sentSplit does more than that. Removing periods works on DATA but not on other sets of text (I'm unsure of the full set of sentence breaks other than periods).

    So, I suspect the best way of approaching this is to make each element of the document vector look like one long sentence. How would I do this, or is there another way?
      June 12, 2019 11:18 AM IST
    • Raji Reddy A: Removing the end marks is extra work if you just want to get rid of the warning. Your results are the same, so it seems you simply don't want the warning. If you're working interactively, you could just ignore it; it's only a flag saying that this could be bad.
      June 14, 2019


  • Looks like removing punctuation and other extras tricks polarity into thinking the vector is a single sentence:

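    library(qdap)
    library(tm) # Provides removePunctuation, removeNumbers and stripWhitespace

    # Lower-case the text, then strip extra whitespace, numbers and punctuation
    SimplifyText <- function(x) {
      removePunctuation(removeNumbers(stripWhitespace(tolower(x))))
    }
    polarity(SimplifyText(DATA$state))$all$polarity
    # Result (no warning)
    # [1] -0.8165 -0.4472 0.0000 -1.0000 0.0000 0.0000 0.0000 -0.5774 0.0000
    # [10] 0.4082 0.0000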
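
    removePunctuation, removeNumbers and stripWhitespace are provided by the tm package. Where that dependency is unwanted, a similar clean-up can be sketched with base R regular expressions. `simplify_text_base` is a hypothetical name, and this is an approximation of the function above, not a drop-in replacement. Like it, this version discards commas and apostrophes.

    ```r
    # Hypothetical base-R approximation of SimplifyText (no tm required).
    simplify_text_base <- function(x) {
      x <- tolower(x)
      x <- gsub("[[:digit:]]+", "", x)   # drop numbers
      x <- gsub("[[:punct:]]+", "", x)   # drop punctuation, including end marks
      x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace
      trimws(x)
    }

    simplify_text_base("Computer is fun. Not too fun.")
    # [1] "computer is fun not too fun"
    ```

    Each element then contains no end marks, so it should no longer look like several sentences to polarity.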
      June 14, 2019 12:37 PM IST
  • Max found a bug in this version of qdap (1.3.4): it counted a placeholder as a word, which affects the equation because the denominator is sqrt(n), where n is the word count. This was corrected in 1.3.5, which is why the two outputs did not match.

    Here is the output:

    library(qdap)
    counts(polarity(DATA$state))[, "polarity"]

    ## > counts(polarity(DATA$state))[, "polarity"]
    ## [1] -0.8164966 -0.4472136 0.0000000 -1.0000000 0.0000000 0.0000000 0.0000000
    ## [8] -0.5773503 0.0000000 0.4082483 0.0000000
    ## Warning message:
    ## In polarity(DATA$state) :
    ## Some rows contain double punctuation. Suggested use of `sentSplit` function.

    In this case, using strip makes no difference. It may matter in more complex situations involving amplifiers, negators, negatives, and commas. Here is an example:

    ## > counts(polarity("Really, I hate it"))[, "polarity"]
    ## [1] -0.5
    ## > counts(polarity(strip("Really, I hate it")))[, "polarity"]
    ## [1] -0.9

    See the documentation for more.
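
    To see how one extra counted word changes a score, here is a minimal base-R sketch of the sqrt(n) scaling mentioned at the top of this answer. It assumes the numerator is a plain sum of word weights (-1 per negative word, with no amplifiers or negators involved), and `raw_polarity` is a hypothetical helper, not a qdap function.

    ```r
    # Hypothetical helper (not part of qdap): summed word weights over sqrt(word count).
    raw_polarity <- function(weight_sum, n_words) weight_sum / sqrt(n_words)

    # One negative word in a 5-word row, versus the same row with a
    # placeholder wrongly counted as a sixth word.
    correct  <- raw_polarity(-1, 5)
    inflated <- raw_polarity(-1, 6)

    round(c(correct = correct, inflated = inflated), 4)
    ```

    The two values round to -0.4472 and -0.4082, which line up with the second element of the two differing outputs in this thread. That is consistent with the placeholder explanation, though the weights here are assumed and not taken from qdap's internals.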
      June 12, 2019 11:23 AM IST
  • qdap (Rinker, 2013) is an R package designed to assist in quantitative discourse analysis. The package stands as a bridge between qualitative transcripts of dialogue and statistical analysis and visualization. qdap was born out of a frustration with current discourse analysis programs. Packaged programs are a closed system, meaning the researcher using the method has little, if any, influence on the program applied to her data.

    R already has thousands of excellent packages for statistics and visualization. qdap is designed to stand as a bridge between the qualitative discourse of a transcript and the computational power and freedom that R offers. As qdap returns the power to the researcher it will also allow the researcher to be more efficient and thus effective and productive in data analysis. The qdap package provides researchers with the tools to analyze data and more importantly is a dynamic system governed by the data, shaped by theory, and continuously refined by the field.
      September 2, 2021 1:40 PM IST