This is a quick comparison of simple logistic regression fitting in R and julia. To reproduce it you will need julia and the `JuliaCall` package in R. The following line allows you to define a `julia` chunk in R markdown.

`knitr::knit_engines$set(julia = JuliaCall::eng_juliacall)`
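With the engine registered, julia code can be run directly from chunks in the `.Rmd` file. A minimal sketch of such a file (the chunk labels here are illustrative):

````markdown
```{r setup}
library(JuliaCall)
knitr::knit_engines$set(julia = JuliaCall::eng_juliacall)
```

```{julia}
# this chunk is evaluated by julia via JuliaCall
1 + 1
```
````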

## In R

Standard logistic regression is implemented easily in R with the `glm` function. In this example we’ll use the following data, which gives a continuous variable `x` and the numbers of successes, `succ`, and failures, `fail`. The total population size for each observation of `x` is then successes plus failures.

```
obs <- data.frame(x = c(1.1, 3.3, 5.3, 2.1, 7.3, 3.6, 5.9, 4.6, 3.5, 4.3),
                  succ = c(23, 24, 17, 26, 15, 22, 20, 19, 19, 21),
                  fail = c(39, 35, 32, 28, 25, 31, 35, 27, 29, 31))
obs$pop <- obs$succ + obs$fail
```

For data aggregated in this way (so that there isn’t a row for every individual binary outcome, but instead a summary of the binomial outcome) we can fit logistic regression in a number of ways in R. One option is to give success proportions and total population sizes via the `weights` argument, and the other is to give a two-column matrix of successes and failures. Both options are shown here.

```
fit_a <- glm(succ / pop ~ x, data = obs, family = binomial, weights = obs$pop)
fit_b <- glm(cbind(succ, fail) ~ x, data = obs, family = binomial)
```

These function calls both produce the same estimates of the coefficients, deviance, etc.

```
identical(coefficients(fit_a), coefficients(fit_b))
## [1] TRUE
```
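Since both calls fit the same underlying model, the agreement is not limited to the coefficients; assuming the fits above, derived quantities such as the deviance and fitted values match too:

```r
# same model, two parameterisations: derived quantities agree as well
all.equal(deviance(fit_a), deviance(fit_b))
## [1] TRUE
all.equal(fitted(fit_a), fitted(fit_b))
## [1] TRUE
```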

## In julia

Julia is a relatively new high-performance language that uses just-in-time compilation, so it is really speedy at a lot of things that R just isn’t. To define dataframes as in R and to fit GLMs, two packages are needed.

```
# packages required for GLMs and dataframes
Pkg.add("GLM")
Pkg.add("DataFrames")
using GLM
using DataFrames
```

To define the same data as used above in julia we use `DataFrame`. This is pretty similar to R, but notice that if you want elementwise vector operations you need the dotted syntax, e.g. `./` and `.+`.

```
obs = DataFrame(x = [1.1, 3.3, 5.3, 2.1, 7.3, 3.6, 5.9, 4.6, 3.5, 4.3],
                succ = [23.0, 24, 17, 26, 15, 22, 20, 19, 19, 21],
                fail = [39.0, 35, 32, 28, 25, 31, 35, 27, 29, 31])
obs[:pop] = obs[:succ] .+ obs[:fail]
obs[:theta] = obs[:succ] ./ obs[:pop]
```

Fitting the regression is also very similar to option `a` in R (the two-column approach does not carry over to julia), using the `wts` keyword.

`fit = glm(@formula(theta ~ x), obs, Binomial(), LogitLink(), wts = obs[:pop])`

## Sanity check between R and julia

Check that R and julia produce the same estimates. In R:

```
summary(fit_a)
##
## Call:
## glm(formula = succ/pop ~ x, family = binomial, data = obs, weights = obs$pop)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -0.85032  -0.20715   0.07557   0.19383   0.99042
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.26432    0.22604  -1.169    0.242
## x           -0.03828    0.05281  -0.725    0.469
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 2.7722  on 9  degrees of freedom
## Residual deviance: 2.2458  on 8  degrees of freedom
## AIC: 49.781
##
## Number of Fisher Scoring iterations: 3
```

and in julia:

```
fit
## StatsModels.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Binomial{Float64},GLM.LogitLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
##
## Formula: theta ~ 1 + x
##
## Coefficients:
##                Estimate  Std.Error   z value Pr(>|z|)
## (Intercept)   -0.264318   0.226044  -1.16932   0.2423
## x           -0.0382826  0.0528096 -0.724917   0.4685
# examples of accessor functions for specific model quantities
coef(fit)
## 2-element Array{Float64,1}:
## -0.264318
## -0.0382826
deviance(fit)
## 2.2458348828330963
stderr(fit)
## 2-element Array{Float64,1}:
## 0.226044
## 0.0528096
```

## R vs julia

So what’s the point of using julia over R for fitting GLMs, especially when the julia documentation for this is sparse by comparison? Well, julia is fast at a lot of things R isn’t, so if we needed to fit many regression models sequentially, we would hope for a significant speed-up from julia.

In this example, let’s assume we have 10000 sets of data arriving sequentially, and each time we want to fit the logistic regression and predict the outcome for a new observation with `x = 1`. Implement this in R as:

```
r_lg_pred <- function() {
  obs <- data.frame(x = rnorm(10),
                    succ = sample(20:30, 10),
                    fail = sample(10:20, 10))
  obs$pop <- obs$succ + obs$fail
  fit <- glm(succ / pop ~ x, data = obs, family = binomial, weights = obs$pop)
  predict(fit, newdata = data.frame(x = 1.0))
}
r_store_pred <- rep(NA, 10000)
```

and in julia as:

```
function jul_lg_pred()
    obs = DataFrame(x = randn(10),
                    succ = rand(20.0:30, 10),
                    fail = rand(10.0:20, 10))
    obs[:pop] = obs[:succ] .+ obs[:fail]
    obs[:theta] = obs[:succ] ./ obs[:pop]
    fit = glm(@formula(theta ~ x), obs, Binomial(), LogitLink(), wts = obs[:pop])
    predict(fit, DataFrame(x = [1.0]))[1]
end
jul_store_pred = zeros(Float64, 10000)
```

Timed runs show a potentially useful speed-up in julia:

```
system.time(
  for (i in 1:10000) {
    r_store_pred[i] <- r_lg_pred()
  }
)
##    user  system elapsed
##  17.369   0.004  17.373
```

```
@elapsed for i in 1:10000
    jul_store_pred[i] = jul_lg_pred()
end
## 2.829691777
```