QBoard » Statistical modeling » Stats - Conceptual » Linear Regression and group by in R

Linear Regression and group by in R

  • I want to do a linear regression in R using the lm() function. My data is an annual time series with one field for year (22 years) and another for state (50 states). I want to fit a regression for each state so that at the end I have a vector of lm responses. I can imagine doing for loop for each state then doing the regression inside the loop and adding the results of each regression to a vector. That does not seem very R-like, however. In SAS I would do a 'by' statement and in SQL I would do a 'group by'. What's the R way of doing this?
      September 21, 2020 12:17 PM IST
    0
  • In my opinion is a mixed linear model a better approach for this kind of data. The code below given in the fixed effect the overall trend. The random effects indicate how the trend for each individual state differ from the global trend. The correlation structure takes the temporal autocorrelation into account. Have a look at Pinheiro & Bates (Mixed Effects Models in S and S-Plus).

    library(nlme)
    lme(response ~ year, random = ~year|state, correlation = corAR1(~year))
      September 21, 2020 2:46 PM IST
    1
  • ## make fake data
        library(data.table)
        set.seed(1)
        dat <- data.table(x=runif(100), y=runif(100), grp=rep(1:2,50))
    
    ##calculate the regression coefficient r^2
        dat[,summary(lm(y~x))$r.squared,by=grp]
           grp         V1
        1:   1 0.01465726
        2:   2 0.02256595

    as well as all the other output from summary(lm):

    dat[,list(r2=summary(lm(y~x))$r.squared , f=summary(lm(y~x))$fstatistic[1] ),by=grp]
       grp         r2        f
    1:   1 0.01465726 0.714014
    2:   2 0.02256595 1.108173
      September 21, 2020 2:48 PM IST
    1
  • Here's one way using the lme4 package.

    library(lme4)
     d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
                     year=rep(1:10, 2),
                     response=c(rnorm(10), rnorm(10)))
    
     xyplot(response ~ year, groups=state, data=d, type='l')
    
     fits <- lmList(response ~ year | state, data=d)
     fits
    #------------
    Call: lmList(formula = response ~ year | state, data = d)
    Coefficients:
       (Intercept)        year
    CA -1.34420990  0.17139963
    NY  0.00196176 -0.01852429
    
    Degrees of freedom: 20 total; 16 residual
    Residual standard error: 0.8201316​
      September 21, 2020 2:43 PM IST
    0
  • Here's an approach using the plyr package:

    d <- data.frame(
      state = rep(c('NY', 'CA'), 10),
      year = rep(1:10, 2),
      response= rnorm(20)
    )
    
    library(plyr)
    # Break up d by state, then fit the specified model to each piece and
    # return a list
    models <- dlply(d, "state", function(df) 
      lm(response ~ year, data = df))
    
    # Apply coef to each model and return a data frame
    ldply(models, coef)
    
    # Print the summary of each model
    l_ply(models, summary, .print = TRUE)
      September 21, 2020 2:44 PM IST
    0
  • The lm() function above is an simple example. By the way, I imagine that your database has the columns as in the following form:

    year state var1 var2 y...

    In my point of view, you can to use the following code:

    require(base) 
    library(base) 
    attach(data) # data = your data base
                 #state is your label for the states column
    modell<-by(data, data$state, function(data) lm(y~I(1/var1)+I(1/var2)))
    summary(modell)
     
      September 21, 2020 2:56 PM IST
    0