tm package error “Cannot convert DocumentTermMatrix into normal matrix”

  • I have created a DocumentTermMatrix that contains 1859 documents (rows) and 25722 terms (columns). To perform further calculations on this matrix I need to convert it into a regular matrix, which I want to do with the as.matrix() command. However, it fails with the following error: cannot allocate vector of size 364.8 MB.

    > corp
    A corpus with 1859 text documents
    > mat<-DocumentTermMatrix(corp)
    > dim(mat)
    [1] 1859 25722
    > is(mat)
    [1] "DocumentTermMatrix"
    > mat2<-as.matrix(mat)
    Error: cannot allocate vector of size 364.8 MB
    > object.size(mat)
    5502000 bytes

    For some reason the size of the object seems to increase dramatically whenever it is converted to a regular matrix. How can I avoid this?

    Or is there an alternative way to perform regular matrix operations on a DocumentTermMatrix?
      June 11, 2019 4:39 PM IST
  • The quick and dirty way is to export your data into a sparse matrix object from an external package like Matrix.

    > attributes(dtm)
    $names
    [1] "i"        "j"        "v"        "nrow"     "ncol"     "dimnames"
    
    $class
    [1] "DocumentTermMatrix"    "simple_triplet_matrix"
    
    $Weighting
    [1] "term frequency" "tf"​


    The dtm object has i, j and v attributes, which are the internal triplet representation of your DocumentTermMatrix: i and j hold the row and column indices of the nonzero entries, and v holds their values. Only the nonzero cells are stored, which is why object.size() reports a few megabytes, while a dense matrix needs 1859 × 25722 × 8 bytes ≈ 364.8 MB, exactly the allocation that fails. Use:

    library("Matrix") 
    mat <- sparseMatrix(
               i=dtm$i,
               j=dtm$j, 
               x=dtm$v,
               dims=c(dtm$nrow, dtm$ncol)
               )

     

    and you're done.
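
    If you also want to keep the document and term labels, they are stored in dtm$dimnames and sparseMatrix() accepts a dimnames argument. This is an optional addition, not part of the recipe above:

    mat <- sparseMatrix(i = dtm$i, j = dtm$j, x = dtm$v,
                        dims     = c(dtm$nrow, dtm$ncol),
                        dimnames = dtm$dimnames)  # carries over Docs and Terms labels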

    A naive comparison between your objects:

    > mat[1, 1:100]
    > head(as.vector(dtm[1, ]), 100)

    will both give you exactly the same output.

      September 3, 2021 1:38 PM IST
  • Since you only have 1859 documents, the distance matrix you need to compute is fairly small. Using the slam package (in particular, functions that operate on a simple_triplet_matrix, such as crossapply_simple_triplet_matrix), you can compute the distance matrix directly instead of converting the DTM into a dense matrix first. This means you have to compute the Jaccard similarity yourself; a sketch follows below. I have successfully done something similar for a cosine distance matrix on a large number of documents.
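
    A minimal sketch of that idea, assuming documents are the rows of dtm. It uses slam's tcrossprod_simple_triplet_matrix and row_sums rather than crossapply_simple_triplet_matrix, but the principle is the same:

    library("slam")

    # Jaccard only cares about term presence/absence, so binarize the
    # counts while staying in the sparse triplet representation.
    bin <- dtm
    bin$v <- rep(1L, length(bin$v))

    # Pairwise intersection sizes |A n B|: a dense 1859 x 1859 matrix,
    # small enough to hold in memory.
    inter <- tcrossprod_simple_triplet_matrix(bin)

    # Number of distinct terms per document, |A|, computed on the sparse form.
    sizes <- row_sums(bin)

    # Jaccard similarity |A n B| / |A u B|, with |A u B| = |A| + |B| - |A n B|.
    # (The Jaccard distance is 1 - jaccard.)
    jaccard <- inter / (outer(sizes, sizes, "+") - inter)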
      June 11, 2019 4:40 PM IST
  • The number of documents should not be a problem, but you may want to try removing sparse terms; this can substantially reduce the dimension of the document-term matrix.

    inspect(removeSparseTerms(dtm, 0.7))

    It removes every term with a sparsity above 0.7, i.e. terms that are absent from more than 70 % of the documents.

    Another option is to specify a minimum word length and a minimum document frequency when you create the document-term matrix:

    a.dtm <- DocumentTermMatrix(a.corpus,
                 control = list(weighting   = weightTfIdf,
                                wordLengths = c(2, Inf),                  # minimum word length of 2
                                bounds      = list(global = c(5, Inf)))) # keep terms occurring in at least 5 documents

    (In current tm versions, wordLengths and bounds replace the older minWordLength and minDocFreq control options.)


    Use inspect(dtm) before and after your changes and you will see a huge difference; more importantly, you won't ruin the significant relations hidden in your documents and terms.
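
    For example, a quick before/after check on the matrix from the question (the resulting column count is hypothetical; it depends on your corpus):

    dim(dtm)                                   # 1859 25722
    dtm.small <- removeSparseTerms(dtm, 0.7)   # drop very sparse terms
    dim(dtm.small)                             # 1859 and far fewer columns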


      September 2, 2021 1:38 PM IST