I have a number of columns that I would like to remove from a data frame. I know that we can delete them individually using something like:

df$x <- NULL

But I was hoping to do this with fewer commands.
Also, I know that I could drop columns using integer indexing like this:
df <- df[, -c(3, 5)]
But I am concerned that the relative position of my variables may change.
Given how powerful R is, I figured there might be a better way than dropping each column one by one.
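For concreteness, a small sketch of the name-based, multi-column drop I'm hoping exists (toy data; the column names are illustrative):

```r
# Toy data frame; I'd like to drop x, y, and z by name in one step,
# without caring about their positions.
df <- data.frame(x = 1:3, y = 4:6, z = 7:9, keep = 10:12)

drop_cols <- c("x", "y", "z")
df <- df[, !(names(df) %in% drop_cols), drop = FALSE]

names(df)  # "keep"
```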
Now that .NET v3.5 SP1 has been released (along with VS2008 SP1), we now have access to the .NET entity framework.
My question is this. When trying to decide between using the Entity Framework and LINQ to SQL as an ORM, what's the difference?
The way I understand it, the Entity Framework (when used with LINQ to Entities) is a 'big brother' to LINQ to SQL. If this is the case, what advantages does it have? What can it do that LINQ to SQL can't do on its own?
When using machine learning in R and generating a formula such as ~ . with a data argument, what does the "." indicate? For example:
fit <- svm(factor(outcome) ~ ., data = train, probability = TRUE)
pre <- predict(fit, test, decision.value = TRUE, probability = TRUE)
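A small check of what the dot expands to: a model fit with ~ . matches one that spells out every remaining column (toy data, and lm in place of svm so the sketch is self-contained):

```r
# "." in a formula means "all columns of data except the response",
# so y ~ . and y ~ a + b fit the same model here:
df <- data.frame(y = rnorm(10), a = rnorm(10), b = rnorm(10))

f1 <- lm(y ~ ., data = df)
f2 <- lm(y ~ a + b, data = df)

all.equal(coef(f1), coef(f2))  # TRUE
```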
Is there any possible way to call all the R packages/libraries and functions (packages like raster, rgdal, maptools, etc.) from the .NET Framework, so that I can access all the features of R and run R scripts from a .NET front end?
I know this is not a new concept by any stretch in R, and I have browsed the High Performance and Parallel Computing Task View. With that said, I am asking this question from a point of ignorance, as I have no formal training in computer science and am entirely self-taught.
Recently I collected data from the Twitter Streaming API and currently the raw JSON sits in a 10 GB text file. I know there have been great strides in adapting R to handle big data, so how would you go about this problem? Here are just a handful of the tasks that I am looking to do:
Read and process the data into a data frame
Basic descriptive analysis, including text mining (frequent terms, etc.)
Plotting
Is it possible to use R entirely for this, or will I have to write some Python to parse the data and throw it into a database, in order to take random samples small enough to fit into R?
Simply, any tips or pointers that you can provide will be greatly appreciated. Again, I won't take offense if you describe solutions at a 3rd-grade level.
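For reference, one route I'm considering is jsonlite's stream_in(), which reads line-delimited JSON in pages rather than all at once ("tweets.json" is a placeholder path, and I'm assuming one tweet object per line):

```r
library(jsonlite)

# Read the whole file page by page into one data frame:
tweets <- stream_in(file("tweets.json"), pagesize = 10000)

# Or process each chunk as it arrives, never holding the full 10 GB:
stream_in(file("tweets.json"), pagesize = 5000,
          handler = function(chunk) {
            # `chunk` is a data frame of up to 5000 tweets;
            # e.g. tally frequent terms here and accumulate the counts
          })
```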
I have created a DocumentTermMatrix that contains 1859 documents (rows) and 25722 terms (columns). In order to perform further calculations on this matrix I need to convert it to a regular matrix. I want to use the as.matrix() command. However, it returns the following error: cannot allocate vector of size 364.8 MB.

> corp
A corpus with 1859 text documents
> mat <- DocumentTermMatrix(corp)
> dim(mat)
[1]  1859 25722
> is(mat)
[1] "DocumentTermMatrix"
> mat2 <- as.matrix(mat)
Error: cannot allocate vector of size 364.8 MB
> object.size(mat)
5502000 bytes

For some reason the size of the object seems to increase dramatically whenever it is transformed to a regular matrix. How can I avoid this? Or is there an alternative way to perform regular matrix operations on a DocumentTermMatrix?
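One workaround I'm considering: skip the dense conversion and build a sparse matrix from the triplet slots (i, j, v) that a DocumentTermMatrix exposes. A minimal sketch, with a toy triplet standing in for the real DTM:

```r
library(Matrix)

# Toy triplet: document index i, term index j, count v. A real
# DocumentTermMatrix stores the same slots as dtm$i, dtm$j, dtm$v.
i <- c(1, 1, 2)
j <- c(1, 3, 2)
v <- c(2, 1, 5)

sparse <- sparseMatrix(i = i, j = j, x = v, dims = c(2, 3),
                       dimnames = list(c("doc1", "doc2"),
                                       c("apple", "banana", "cherry")))

colSums(sparse)  # apple: 2, banana: 5, cherry: 1
```

For the real object this would be sparseMatrix(i = dtm$i, j = dtm$j, x = dtm$v, dims = dim(dtm), dimnames = dimnames(dtm)), keeping the 1859 x 25722 matrix at a few MB instead of 365.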
It seems like R is really designed to handle datasets that it can pull entirely into memory. What R packages are recommended for signal processing and machine learning on very large datasets that cannot be pulled into memory?
If R is simply the wrong way to do this, I am open to other robust, free suggestions (e.g. SciPy, if there is some nice way to handle very large datasets).
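For what it's worth, the direction I've looked at so far is file-backed matrices via the bigmemory family, where the data lives on disk and only a small window sits in RAM (the dimensions and file names below are illustrative, and I haven't verified this scales to my case):

```r
library(bigmemory)
library(biganalytics)

# A file-backed matrix: the 1e6 x 10 doubles live in X.bin on disk,
# described by X.desc, rather than in R's heap.
X <- filebacked.big.matrix(nrow = 1e6, ncol = 10, type = "double",
                           backingfile = "X.bin",
                           descriptorfile = "X.desc")

colmean(X)  # columnwise statistics computed out of core
```

biganalytics also offers biglm.big.matrix() for regression on such objects, which is the kind of modelling I'd want.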
I'd like to apply qdap's polarity function to a vector of documents, each of which could contain multiple sentences, and obtain the corresponding polarity for each document. For example:

library(qdap)
polarity(DATA$state)$all$polarity
# Results: -0.8165 -0.4082 0.0000 -0.8944 0.0000 0.0000 0.0000 -0.5774 0.0000 0.4082 0.0000
# Warning message:
# In polarity(DATA$state) : Some rows contain double punctuation.
#   Suggested use of `sentSplit` function.

This warning can't be ignored, as it seems to add the polarity scores of each sentence in the document. This can result in document-level polarity scores outside the bounds.

I'm aware of the option to first run sentSplit and then average across the sentences, perhaps weighting polarity by word count, but this is (1) inefficient (it takes roughly 4x as long as running on the full documents with the warning), and (2) it is unclear how to weight sentences. This option would look something like this:

DATA$id <- seq(nrow(DATA))  # For identifying and aggregating documents
...
I am attempting to use the tm package to convert a vector of text strings to a corpus element.
My code looks something like this
Corpus(d1$Yes)
where d1$Yes is a factor with 124 levels, each containing a text string.
For example, d1$Yes = "So we can get the boat out!"
I'm receiving the following error: "Error: inherits(x, "Source") is not TRUE"
I'm not sure how to remedy this.
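For reference, a minimal reproducible version: wrapping the vector in VectorSource (Corpus appears to want a Source object, which matches the inherits(x, "Source") check in the error) runs without the error on a toy factor:

```r
library(tm)

yes <- factor(c("So we can get the boat out!", "Another response"))

# Corpus() expects a Source, not a bare vector; convert the factor
# to character first and wrap it:
corp <- Corpus(VectorSource(as.character(yes)))

length(corp)  # 2
```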
I'm trying to use the mice package in R for a project and discovered that the pooled results seemed to change the dummy code I had for one of the variables in the output.

To elaborate, let's say I have a factor, foo, with two levels: 0 and 1. Using a regular lm would typically yield an estimate for foo1. Using mice and the pool function, however, yields an estimate for foo2. I included a reproducible example below using the nhanes dataset from the mice package. Any ideas why this might be occurring?

require(mice)

# Create age as: 0, 1, 2
nhanes$age <- as.factor(nhanes$age - 1)
head(nhanes)
#   age  bmi hyp chl
# 1   0   NA  NA  NA
# 2   1 22.7   1 187
# 3   0   NA   1 187
# 4   2   NA  NA  NA
# 5   0 20.4   1 113
# 6   2   NA  NA 184

# Use a regular lm with missing data just to see output;
# age1 and age2 come up as expected
lm(chl ~ age + bmi, data = nhanes)
# Call:
# lm(formula = chl ~ age + bmi, data = nhanes)
#
# Coefficients:
# (Intercept)         age1         age2          bmi
#     -28.948       55.810      104.724        6.921

imp <- mice(nhanes)
str(complete(imp))  # still the same coding
fit <- ...
I am trying to get the second-to-last value in each row of a data frame, meaning the first job a person has had. (Job1_latest is the most recent job; people had a different number of jobs in the past, and I want to get the first one.) I managed to get the last value per row with the code below:

first_job <- function(x) tail(x, 1)
first_job <- apply(data, 1, first_job)
structure(list(Index = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59),
FromJob = c("Senior Machine Learning Engineer", "Senior Machine Learning Engineer",
"Senior Machine Learning Engineer", "Senior Machine Learning Engineer",
"Senior Machine Learning Engineer", "Python Data Engineer (m/w/d)",
"Python Data Engineer (m/w/d)", "Python Data Engineer (m/w/d)",
"Lead Backend Developer (f/m/d)", "Lead Backend Developer (f/m/d)",
"Lead...
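A toy version of what I'm after, stepping back one position from the last non-missing value in each row (the column names are illustrative):

```r
# Jobs listed most-recent-first, padded with NA for people with
# fewer jobs:
jobs <- data.frame(Job1_latest = c("Engineer", "Analyst"),
                   Job2        = c("Intern",   "Developer"),
                   Job3        = c("Barista",  NA),
                   stringsAsFactors = FALSE)

# Last non-missing value per row, and the one just before it:
last_val        <- apply(jobs, 1, function(x) tail(na.omit(x), 1))
second_last_val <- apply(jobs, 1, function(x) tail(na.omit(x), 2)[1])

unname(last_val)         # "Barista"  "Developer"
unname(second_last_val)  # "Intern"   "Analyst"
```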
I'm using R's caret package to do some grid search and model evaluation. I have a custom evaluation metric that is a weighted average of absolute error. Weights are assigned at the observation level.
X <- c(1, 1, 2, 0, 1)  # feature 1
w <- c(1, 2, 2, 1, 1)  # weights
Y <- 1:5               # target, continuous

# Assume I run a model using X as features and Y as target and get
# a vector of predictions; my metric is then:
weighted_mae <- function(target, predictions, weights) {
  v <- sum(abs(target - predictions) * weights) / sum(weights)
  return(v)
}
Here an example is given of how to use summaryFunction to define a custom evaluation metric for caret's train(). To quote:

The trainControl function has an argument called summaryFunction that specifies a function for computing performance. The function should have these arguments:
data is a reference for a data frame or matrix with columns called obs and pred for the observed and predicted outcome values (either numeric data for regression or character values for classification). Currently, class...
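A sketch of how I'd plug the weighted-MAE metric in, assuming the rowIndex column that caret adds to data can be used to look up each held-out observation's weight (the metric name "WMAE" and the toy data are my own):

```r
library(caret)

set.seed(1)
df <- data.frame(X = rnorm(100))
df$Y <- 2 * df$X + rnorm(100)
w  <- runif(100)  # observation-level weights, in training-row order

# Custom summary: weighted mean absolute error over the held-out rows.
wmaeSummary <- function(data, lev = NULL, model = NULL) {
  wt <- w[data$rowIndex]
  c(WMAE = sum(abs(data$obs - data$pred) * wt) / sum(wt))
}

ctrl <- trainControl(method = "cv", number = 5,
                     summaryFunction = wmaeSummary)

fit <- train(Y ~ X, data = df, method = "lm",
             trControl = ctrl, metric = "WMAE", maximize = FALSE)
```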
For my stats class, I'm using R to do some of the math for my term project. The class doesn't call for it, but I want to supplement myself by learning R, which is my weaker language.
Using this data: skittle-data.csv (Every row was an individual bag of skittles submitted by each student)
I'm trying to generate some charts and other things to satisfy the assignment. While doing so, I noticed that in determining the total number of skittles I was off by 1.
When I load the csv into a data frame, I sum each row, and then sum those row totals to get the grand total, like this:
skittles = read.csv("skittle-data.csv", header = TRUE)
columnTotals = colSums(skittles, na.rm=FALSE, dims = 1)
rowTotals = rowSums(skittles, na.rm=FALSE, dims = 1)
total = sum(rowTotals, na.rm=FALSE, dims = 1)
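While writing this up I noticed something that may be the culprit: unlike colSums() and rowSums(), sum() has no dims argument, so a named dims = 1 falls into ... and gets added to the total:

```r
# colSums()/rowSums() accept dims, but sum() does not; extra values
# passed through ... are simply summed, so dims = 1 silently adds 1:
sum(c(2, 3), dims = 1)  # 6, not 5
sum(c(2, 3))            # 5
```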
There don't seem to be many options for deploying predictive models in production, which is surprising given the explosion in Big Data.
I understand that the open-source PMML can be used to export models as an XML specification. This can then be used for in-database scoring/prediction. However it seems that to make this work you need to use the PMML plugin by Zementis which means the solution is not truly open source. Is there an easier open way to map PMML to SQL for scoring?
Another option would be to use JSON instead of XML to output model predictions. But in this case, where would the R model sit? I'm assuming it would always need to be mapped to SQL...unless the R model could sit on the same server as the data and then run against that incoming data using an R script?
Any other options out there?
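For completeness, the export side from R is simple enough with the pmml package; it's the open-source scoring side I can't find (the model and file name below are illustrative):

```r
library(pmml)
library(XML)

# Fit any supported model and serialise it to a PMML (XML) document
# that an external scoring engine could consume:
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
saveXML(pmml(fit), file = "iris_lm.pmml")
```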
I need to detect the language of many short texts using R. I am using the textcat package, which finds which among many (say 30) European languages each text is written in. However, I know my texts are either French or English (or, more generally, a small subset of the languages handled by textcat). How could I add this knowledge when calling textcat functions? Thanks.
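What I've tried so far: textcat() takes its profile database through the p argument, so I subset the shipped profiles to just the two languages (I'm assuming TC_byte_profiles and this kind of subsetting are the supported route):

```r
library(textcat)

# Restrict matching to English and French only:
profs <- TC_byte_profiles[names(TC_byte_profiles) %in% c("english", "french")]

textcat(c("This is clearly an English sentence.",
          "Ceci est clairement une phrase française."),
        p = profs)
```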