Logistic regression: R vs Julia

This is a quick comparison of fitting a simple logistic regression in R and Julia. To follow along you will need Julia and the JuliaCall package in R. The following line allows you to define a Julia chunk in R Markdown.

knitr::knit_engines$set(julia = JuliaCall::eng_juliacall)

In R

Standard logistic regression is implemented easily in R with the glm function. In this example we’ll use the following data, which give a continuous variable x and the numbers of successes, succ, and failures, fail. The total population size for each observation of x is then successes plus failures.

obs <- data.frame(x = c(1.1, 3.3, 5.3, 2.1, 7.3, 3.6, 5.9, 4.6, 3.5, 4.3),
                  succ = c(23, 24, 17, 26, 15, 22, 20, 19, 19, 21),
                  fail = c(39, 35, 32, 28, 25, 31, 35, 27, 29, 31))
obs$pop <- obs$succ + obs$fail
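
For reference, the model being fitted throughout is standard binomial logistic regression: for observation i,

$$
\mathrm{succ}_i \sim \mathrm{Binomial}(\mathrm{pop}_i,\, p_i), \qquad \log\!\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_i.
$$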

For data aggregated in this way (so that there isn’t a row for every individual binary outcome, but instead a binomial summary for each value of x) we can fit logistic regression in a number of ways in R. One option is to give the success proportions as the response and the total population sizes via the weights argument; the other is to give a two-column matrix of successes and failures as the response. Both options are shown here.

fit_a <- glm(succ / pop ~ x, data = obs, family = binomial, weights = obs$pop)
fit_b <- glm(cbind(succ, fail) ~ x, data = obs, family = binomial)

These function calls both produce the same estimates of the coefficients, deviance, etc.

identical(coefficients(fit_a), coefficients(fit_b)) 
## [1] TRUE
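
As an aside, these aggregated fits are equivalent to fitting the same model to fully disaggregated 0/1 data. A minimal sketch of that check (this expand-and-refit code is illustrative and not part of the original comparison):

# expand each row into succ ones and fail zeros, paired with that row's x
long <- data.frame(
  x = rep(obs$x, times = obs$pop),
  y = rep(rep(c(1, 0), nrow(obs)), times = c(rbind(obs$succ, obs$fail)))
)
fit_c <- glm(y ~ x, data = long, family = binomial)
all.equal(coefficients(fit_a), coefficients(fit_c))  # should agree to numerical tolerance

The coefficients agree, though the reported deviance differs, because the saturated model changes when the data are disaggregated.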

In Julia

Julia is a relatively new high-performance language that uses just-in-time compilation, so it is really speedy at a lot of things that R just isn’t. To define data frames as in R and to fit GLMs, two packages are needed.

# packages required for GLMs and data frames
Pkg.add("GLM")
Pkg.add("DataFrames")
using GLM
using DataFrames

To define the same data as used above in Julia we use DataFrame. This is pretty similar to R, but notice that element-wise vector operations need the broadcasting dot syntax, e.g. ./ for division.

obs = DataFrame(x = [1.1, 3.3, 5.3, 2.1, 7.3, 3.6, 5.9, 4.6, 3.5, 4.3],
                succ = [23.0, 24, 17, 26, 15, 22, 20, 19, 19, 21],
                fail = [39.0, 35, 32, 28, 25, 31, 35, 27, 29, 31])
obs[:pop] = obs[:succ] .+ obs[:fail]
obs[:theta] = obs[:succ] ./ obs[:pop]
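
(A side note: the obs[:pop] column syntax used here belongs to older DataFrames.jl versions; on recent releases the equivalent would be along these lines, untested here:)

obs.pop = obs.succ .+ obs.fail
obs.theta = obs.succ ./ obs.pop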

Fitting the regression is also very similar to fit_a in R (the two-column response approach does not apply in Julia): we model the success proportions theta, passing the population sizes via wts.

fit = glm(@formula(theta ~ x), obs, Binomial(), LogitLink(), wts = obs[:pop])

Sanity check between R and Julia

Check that the two languages produce the same estimates. In R:

summary(fit_a)
## 
## Call:
## glm(formula = succ/pop ~ x, family = binomial, data = obs, weights = obs$pop)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.85032  -0.20715   0.07557   0.19383   0.99042  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.26432    0.22604  -1.169    0.242
## x           -0.03828    0.05281  -0.725    0.469
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2.7722  on 9  degrees of freedom
## Residual deviance: 2.2458  on 8  degrees of freedom
## AIC: 49.781
## 
## Number of Fisher Scoring iterations: 3

and in Julia:

fit
## StatsModels.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Binomial{Float64},GLM.LogitLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
## 
## Formula: theta ~ 1 + x
## 
## Coefficients:
##                Estimate Std.Error   z value Pr(>|z|)
## (Intercept)   -0.264318  0.226044  -1.16932   0.2423
## x            -0.0382826 0.0528096 -0.724917   0.4685

# examples of accessors for specific model quantities
coef(fit)
## 2-element Array{Float64,1}:
##  -0.264318 
##  -0.0382826
deviance(fit)
## 2.2458348828330963
stderr(fit)
## 2-element Array{Float64,1}:
##  0.226044 
##  0.0528096
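
For a direct side-by-side check, the corresponding accessors on the R fit are shown below (output omitted; the values match the Julia ones above to the printed precision). Note also that newer versions of GLM.jl renamed the standard-error accessor from stderr to stderror.

# the same quantities from the R fit
coefficients(fit_a)
deviance(fit_a)
summary(fit_a)$coefficients[, "Std. Error"]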

R vs Julia

So what’s the point of using Julia over R for fitting GLMs, especially when the Julia documentation for this is severely lacking by comparison? Well, Julia is fast at a lot of things R isn’t, so if we needed to fit lots of regression models sequentially, we would hope to see a significant speed-up from using Julia over R.

In this example, let’s assume we have 10000 sets of data coming in sequentially, and each time we want to fit the logistic regression and predict the outcome for a new observation with x = 1. Implement this in R as:

r_lg_pred <- function() {
  obs <- data.frame(x = rnorm(10),
                    succ = sample(20:30, 10),
                    fail = sample(10:20, 10))
  obs$pop <- obs$succ + obs$fail
  fit <- glm(succ / pop ~ x, data = obs, family = binomial, weights = obs$pop)
  # predict on the response (probability) scale, matching GLM.jl's predict below
  predict(fit, newdata = data.frame(x = 1.0), type = "response")
}
r_store_pred <- rep(NA, 10000)

and in Julia as:

function jul_lg_pred()
  obs = DataFrame(x = randn(10),
                  succ = rand(20.0:30, 10),
                  fail = rand(10.0:20, 10))
  obs[:pop] = obs[:succ] .+ obs[:fail]
  obs[:theta] = obs[:succ] ./ obs[:pop]
  fit = glm(@formula(theta ~ x), obs, Binomial(), LogitLink(), wts = obs[:pop])
  predict(fit, DataFrame(x = [1.0]))[1]
end
jul_store_pred = zeros(Float64, 10000)

Timed runs show a potentially useful speed-up in Julia:

system.time(
  for (i in 1:10000) {
    r_store_pred[i] <- r_lg_pred()
  }
)
##    user  system elapsed 
##  17.369   0.004  17.373
@elapsed for i in 1:10000
           jul_store_pred[i] = jul_lg_pred()
         end
## 2.829691777
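
One caveat with timing Julia this way: the first call to jul_lg_pred includes JIT compilation. A fairer benchmark runs a warm-up call before timing, along these lines (a sketch; over 10000 iterations the compilation overhead is small anyway):

jul_lg_pred()  # warm-up call so compilation isn't included in the timing
@elapsed for i in 1:10000
    jul_store_pred[i] = jul_lg_pred()
end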