Welcome to the R Workshop @ Memorial HKR

The R project for statistical computing is a free open source statistical programming language and project. Follow these steps to get started:

  1. Download and install R from the Dalhousie CRAN (Comprehensive R Archive Network) Mirror http://mirror.its.dal.ca/cran/
  2. Download and install the Open Source Rstudio Desktop License https://www.rstudio.com/products/rstudio/download/. RStudio is a free IDE (Integrated Development Environment) built for R. It makes your life easier!

Watch a cheezy R Studio video below

RStudio Overview – 1:30 from RStudio, Inc. on Vimeo.

Open RStudio

You should see something that looks like in the video.

1. What is R?

R was first written as a research project by Ross Ihaka and Robert Gentleman, and is now under active development by a group of statisticians called ‘the R core team’. R is available free of charge and is distributed under the terms of the Free Software Foundation’s GNU General Public License. You can download the program from the Comprehensive R Archive Network (CRAN). Ready-to-run ‘binaries’ are available for Windows, Mac OS X, and Linux. The source code is also available for download and can be compiled for other platforms. German Rodriguez

2. Why use R?

The main advantages of R are the fact that R is freeware and that there is a lot of help available online.

  • Strengths
    • free and open source, supported by a strong user community
    • highly extensible and flexible
    • implementation of modern statistical methods
    • assists with replication and extension of your work
  • Weaknesses
    • slow with very large data sets
    • non-standard programming paradigms

3. Important Notes

  1. First a quick note that the first sections of this document borrow from Torfs & Bruaer – A very short introduction to R, Spector – An introduction to R, and Rodriguez – Introducing R.
  2. We will mostly be working in the RStudio environment. You can find many cheat sheets if you can’t quite remember the correct code. That said, there is almost always more than one way to do the same thing in R so find a way that works for you.
  3. Coding style is important. Follow the Hadley or Goolge style guide. Leave comments so your team can pass the “hit by a bus test.”

4. Let’s get started

First R Commands

Calculator

You can use R as a basic calculator. If you pass the command 2+2 it will return 4.

2 + 2
[1] 4

You can use arithmetic and mathematical operators in your commands

Code Actionf
+ Addition
- Subtraction
_*_ Multiplication
/ Division
^ Raise to the power of

You can also do more complex calculations

2^6/32 - 1
[1] 1

Assignment and Variables

You can use the “<-” symbol to assign a variable. This is not a variable in the social science/statistics sense. Think if it as storing something to be used again later. So if we want to store the number 4 and call it a we can. Then we ask R to return the value of a we get 4.

a <- 4
a
[1] 4

With our newly stored a we can do calculations.

a * 20
[1] 80

Scalars, vectors, and matrices

Like in many other programs, R organizes numbers in

  • Scalars (a single number – 0-dimensional)
  • Vectors (a row of numbers, also called arrays – 1-dimensional)
  • Matrices (like a table – 2- dimensional)

We can define a vector with the numbers 1, 2, 3, and 4 with the function c, which is short for concatenate (paste together).

vector <- c(1, 2, 3, 4)
vector
[1] 1 2 3 4

Matrices and dataframes will be introduced later…

Functions

We can apply functions to the vector we just created. For example, if we wanted to compute the mean of the vector we could write

(1 + 2 + 3 + 4)/4
[1] 2.5

That is long and error prone. So we can call a function from Base R named mean to do the same thing.

mean(vector)
[1] 2.5

Within the brackets you specify the arguments. Arguments give extra information to the function.

There a lots of pre-existing arguments in R. If you are thinking about it, someone probably already wrote a function for it. Here is a non-exaustive list of mathematical and statistical functions borrowed from Wagner – BIO360.

Code Action
mean(x) Average
median(x) Median
var(x) Variance
sd(x) Standard Deviation
log(x) Logarithm of x
exp(x) Exponential function ex
sin(x) Sin
cos(x) Cosine
tan(x) Tangent
min(x) Minimum value
max(x) Maximum value
sum(x) Sum of values
cumsum(x) Cummulative sum of values
length(x) Length of vector

5. Help

To get help for a specific function type help(rnorm). This will provide you with the documentation. You can also do this on the bottom right panel in RStudio.

If you want an example you can type example(rnorm) to get an example. Typically, the help documentation will provide an example.

One of the best things about R is that everyone is in the same boat. There are tons of online tutorials and help to be found. Some good places to go are:

6. Packages/Libraries

R can do many statistical and data analyses. They are organized in so-called packages or libraries. With the standard installation, most common packages are installed. There are lots of packages. It’s probable that if you have thought about the analysis, there is a package that is able to do it already. There are two basic steps to using a package:

  • Installing the package install.packages("ggplot2")
  • Loading the package library(ggplot2)

Matrices and Data Frames

Matrices

R understands matrices. This can be used to your advantage. The following function creates a 3 by 4 matrix and fills it by columns with the numbers 1 to 12:

M <- matrix(1:12, 3, 4)
M
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

The elements of a matrix may be addressed using the row and column subscripts in square brackets, separated by a comma. Thus, M[1,1] is the first element of M.

A subscript may be left blank to select an entire row or column: M[1,] is the first row and M[,1] is the first column of M. Any of the subscripts can be a vector, so M[1:2,1:2] is the upper-left 2 by 2 corner of M.

M[1:2, 1:2]
     [,1] [,2]
[1,]    1    4
[2,]    2    5

Data Frames

Data frames are probably what you are used to working with. This is the typical data format other statistical software, like SPSS and Stata, use. In R, are a data frame can be comprised of many different data types (i.e., numeric, string, factor) so long as all variables are the same length. Data frames also typically have labelled headers with some meaning, unlike the matrix example above.

7. Working with real data

Set a working directory

Mac setwd(“/Users/dfuller/Desktop/”)
PC setwd(“C:\Users\Andrie\R\win-library\”)

You can import pretty much any type of data into R with the foreign package. It allows you to import and work with data without owning the software. Once you do that you can also write the data to .csv or another reasonable data format. Some examples:

Package Code Action
haven read.sav SPSS data
foreign read.dta Stata data
read_excel readxl Excel data

You can read many file formats using the haven package.

Example of read.csv

cchs <- read.csv("cchs_RWorkshop.csv")

Example of read_sav

library(haven)

cchs_spss <- read_sav("cchs_RWorkshop.sav")

Example of readxl

library(readxl)

cchs_xl <- read_excel("cchs_RWorkshop.xlsx", col_types = c("numeric", "text", 
    "text", "numeric", "numeric"))

Can also write data to many different formats SPSS, Stata, SAS, csv…

Example 1. Survey type data

Exploring the Data

Now we have some data. The variables are as follows:

  • CASEID
    • Unique identifier for each participant
  • verdate
    • Date the survey was complete
  • geogprv
    • Province
  • hwtghtm
    • Height in Meters
  • hwtgwtk
    • Weight in Kilograms

We can also use head to get a quick view of the data frame

head(cchs)
  CASEID  verdate geogprv hwtghtm hwtgwtk
1      1 20130913      35   1.575   65.25
2      2 20130913      59   1.905   99.00
3      3 20130913      35   1.803   77.40
4      4 20130913      46   1.727   85.50
5      5 20130913      24   1.803   81.00
6      6 20130913      48   1.727   78.75

We will mostly be working the Hadley/Rstudio way to do this. That said, one of the nice/horrific things about R is that there are many many ways to do the same thing.

Some basic concepts:

  • Unlike other stats programs R makes you call the data and variable together. So if I want the weight variable from the cchs data you can do that a few different ways:
  • cchs$hwtghtm – Call the data and variable by name
  • cchs[,4] – Use matrix logic to call the 4th variable
  • Sometimes packages have a data = to call the data and then you can call the varible name at the correct places. This is common in regression type packages.

Descriptive Statistics

library(psych)
psych::describe(cchs$hwtghtm)
   vars      n mean   sd median trimmed  mad  min max range skew kurtosis
X1    1 124929 2.01 1.62   1.68    1.69 0.11 1.27  10  8.73 4.71    20.24
   se
X1  0
psych::describe(cchs[, 4])
   vars      n mean   sd median trimmed  mad  min max range skew kurtosis
X1    1 124929 2.01 1.62   1.68    1.69 0.11 1.27  10  8.73 4.71    20.24
   se
X1  0
library(stargazer)
stargazer(cchs[c("hwtghtm")], type = "text")

============================================
Statistic    N    Mean  St. Dev.  Min   Max 
--------------------------------------------
hwtghtm   124,929 2.014  1.620   1.270 9.999
--------------------------------------------

Graph the Data

library(ggplot2)

scatter <- ggplot(cchs, aes(x = hwtghtm, y = hwtgwtk)) + geom_point() + theme(axis.line = element_line(colour = "black"), 
    panel.background = element_blank(), axis.title = element_text(size = 16), 
    text = element_text(size = 14)) + xlab("Height in Meters") + ylab("Weight in Kilograms")
plot(scatter)

Wooo what the heck is happening here? Looks like the data are coded in a specific way (SPSS style) for missing data. With 999.96 and 999.99 for weight and 9.996 and 9.999 for height.

Missing Values

“The value NA is used to represent missing values for input in R. An NA can be assigned to a variable directly to create a missing value, but to test for missing values, the function is.na must be used. R propogates missing values throughout computations, so often computations performed on data con- taining missing values will result in more missing values. Some of the basic statistical functions (like mean, min, max, etc.) have an argument called na.rm, which, if set to TRUE, will remove the NAs from your data before calculations are performed. In addition, the statistical modeling functions (like aov, glm, loess, etc.) provide an argument called na.action. This argument can be set to a function which will be called to operate on the data before it is processed. One very useful choice for the na.action argument is na.omit. This function (which can be called idependently of the statistical modelling functions) will remove all the rows of a data frame or matrix which contain any missing values.” (Spector – An introduction to R)

Recoding data

Recoding a values from a continous variable.

library(car)

# Recoding the height data
cchs$height_m <- car::recode(cchs$hwtghtm, "9.996=NA;
                      9.999=NA;", 
    as.factor.result = FALSE)

# Recoding the weight data
cchs$weight_kg <- car::recode(cchs$hwtgwtk, "999.99=NA;
                      999.96=NA;", 
    as.factor.result = FALSE)

Creating a new variable from 2 numerical variables.

cchs$bmi <- (cchs$weight_kg/cchs$height_m^2)

Categorizing a continuous variable.

library(car)

cchs$c_bmi <- car::recode(cchs$bmi, "lo:18.499='1. Underweight';
                      18.5:25='2. Normal';
                      25.00001:30='3. Overweight';
                      30.00001:hi='4. Obese';", 
    as.factor.result = TRUE)

## Let's also recode geogprv so it makes more sense

cchs$province <- car::recode(cchs$geogprv, "10='NL';
                      11='PEI';
                      12='NS';
                      13='NB';
                      24='QC';
                      35='ON';
                      46='MN';
                      47='SK';
                      48='AB';
                      59='BC';
                      60='TR';
                      96:99=NA;", 
    as.factor.result = TRUE)

Let’s graph that data again

library(ggplot2)

scatter <- ggplot(cchs, aes(x = height_m, y = weight_kg)) + geom_point(aes(colour = factor(c_bmi)), 
    alpha = 1/5) + theme(axis.line = element_line(colour = "black"), panel.background = element_blank(), 
    axis.title = element_text(size = 16), text = element_text(size = 14)) + 
    xlab("Height in Meters") + ylab("Weight in Kilograms")
plot(scatter)

Here I added aes(colour = factor(c_bmi)) to the geom_point command to get colours by BMI category. There are many ways to change the colour scheme but I’m not going into that at the moment.

Correlation Table

What if we wanted to look at the correlation between height and weight now that we have cleaned the data.

library(PerformanceAnalytics)
library(dplyr)

### Bit of jump here but we are going to subset columns
bmi_corr <- select(cchs, height_m, weight_kg, bmi)

### Run the correlation chart on the new data frame
chart.Correlation(bmi_corr, histogram = TRUE, pch = 19)

Subsetting

Image want to run a t-test with only 2 provinces, NL and BC. To simplify that task I can create a new data frame with only NL and BC. Then I can run the t-test relatively easily.

library(dplyr)

bmi_NL_BC <- filter(cchs, province == c("NL", "BC"))

table(bmi_NL_BC$province)

  AB   BC   MN   NB   NL   NS   ON  PEI   QC   SK   TR 
   0 7694    0    0 1747    0    0    0    0    0    0 
T-Test

Now that we have created the new data frame we can run a t-test comparing the 2 provinces. Also nice to graph the data.

# Sort by group then ID
bmi_NL_BC <- arrange(bmi_NL_BC, province, CASEID)

# t-test
t.test(bmi ~ province, bmi_NL_BC)

    Welch Two Sample t-test

data:  bmi by province
t = -12, df = 2200, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.089 -1.512
sample estimates:
mean in group BC mean in group NL 
            25.4             27.2 

Let’s make a graph of the t-test data

library(ggplot2)

histo_ttest <- ggplot(bmi_NL_BC, aes(bmi, colour = factor(province))) + geom_density(alpha = 0.5) + 
    theme(axis.line = element_line(colour = "black"), panel.background = element_blank(), 
        axis.title = element_text(size = 16), text = element_text(size = 14)) + 
    xlab("BMI") + ylab("Density")
plot(histo_ttest)

Collapsing by

Let’s say I want to get the mean and SD of BMI for each province. I can do that with dplyr.

library(dplyr)

bmi_prov <- cchs %>% group_by(province) %>% summarise(mean_BMI = mean(bmi, na.rm = TRUE), 
    sd_BMI = sd(bmi, na.rm = TRUE), total = n())
bmi_prov
# A tibble: 11 × 4
   province mean_BMI sd_BMI total
     <fctr>    <dbl>  <dbl> <int>
1        AB    26.31  5.493 11321
2        BC    25.45  4.970 15413
3        MN    26.86  5.680  6962
4        NB    27.00  5.662  4786
5        NL    27.11  5.427  3625
6        NS    26.79  5.591  4629
7        ON    26.07  5.336 42915
8       PEI    26.64  5.319  1774
9        QC    25.47  5.031 23260
10       SK    26.83  5.478  7161
11       TR    26.36  5.589  3083
Running a regression
regressionModel <- lm(height_m ~ weight_kg, data = cchs)
summary(regressionModel)

Call:
lm(formula = height_m ~ weight_kg, data = cchs)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.3591 -0.0578  0.0014  0.0605  0.3375 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.4628429  0.0011057    1323   <2e-16 ***
weight_kg   0.0030080  0.0000144     208   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0862 on 116980 degrees of freedom
  (7947 observations deleted due to missingness)
Multiple R-squared:  0.271, Adjusted R-squared:  0.271 
F-statistic: 4.34e+04 on 1 and 116980 DF,  p-value: <2e-16

Here we see an example of the data = call we talked about earlier.

We use summary to get the summary information from the model. We can also use anova or plot to look at model fit.

anova(regressionModel)
Analysis of Variance Table

Response: height_m
              Df Sum Sq Mean Sq F value Pr(>F)    
weight_kg      1    322     322   43383 <2e-16 ***
Residuals 116980    868       0                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## plot(regressionModel)

Example 2. Repeated Measures

Data are approximations from Froman et al. Experiment 1.

Generate simulated data

library(dplyr)
froman <- NULL  # Create a NULL data.frame
froman$participant <- list(1:40)  # Create a list
froman <- as.data.frame(froman)  # Convert list to data.frame
froman$position <- c("neutral", "pronated")  # Create a variable that is neutral or pronated
froman$position <- as.factor(froman$position)
froman <- arrange(froman, position)  # Sort the position variable
froman$oclock <- c("six", "twelve")  # Create a variable called oclock
froman$oclock <- as.factor(froman$oclock)

# Create an mMax variable using the rnorm function and a mega ifelse
# function
set.seed(100)  # Make the data the same for each rnorm command
froman$mMax <- ifelse(froman$position == "neutral" & froman$oclock == "six", 
    rnorm(10, mean = 81.2, sd = 10), ifelse(froman$position == "pronated" & 
        froman$oclock == "six", rnorm(10, mean = 65.8, sd = 10), ifelse(froman$position == 
        "neutral" & froman$oclock == "twelve", rnorm(10, mean = 5.3, sd = 3), 
        ifelse(froman$position == "pronated" & froman$oclock == "twelve", rnorm(10, 
            mean = 3.6, sd = 3), NA))))

froman <- arrange(froman, oclock, position)

# Create a participant variable to capture the repeated measures
froman$participant <- ifelse(froman$position == "neutral" & froman$oclock == 
    "six", seq(c(1:10)), ifelse(froman$position == "pronated" & froman$oclock == 
    "six", seq(c(1:10)), ifelse(froman$position == "neutral" & froman$oclock == 
    "twelve", seq(c(1:10)), ifelse(froman$position == "pronated" & froman$oclock == 
    "twelve", seq(c(1:10)), NA))))
froman$participant <- as.factor(froman$participant)

# Sort by participant
froman <- arrange(froman, position, oclock)

write.csv(froman, "froman.csv")

Now we have some data. The variables are as follows:

  • positions
    • Neutral or Pronated
  • oclock
    • six or twelve
  • mMax
    • Max of normalized MEP value
  • participant
    • Participant number
Get some descriptive statistics by position and oclock
library(dplyr)
froman %>% group_by(position, oclock) %>% summarise(mean_mMax = mean(mMax, na.rm = TRUE), 
    sd_mMax = sd(mMax, na.rm = TRUE), total = n())
Source: local data frame [4 x 5]
Groups: position [?]

  position oclock mean_mMax sd_mMax total
    <fctr> <fctr>     <dbl>   <dbl> <int>
1  neutral    six    77.458   3.625    10
2  neutral twelve     6.246   1.408    10
3 pronated    six    63.218   3.991    10
4 pronated twelve     5.287   2.317    10
Graph the data using ggplot2
library(ggplot2)

#### Bar Graph
bargraph <- ggplot(froman, aes(factor(oclock), mMax, fill = factor(position))) + 
    geom_bar(stat = "identity", position = position_dodge()) + theme(axis.line = element_line(colour = "black"), 
    panel.background = element_blank(), axis.title = element_text(size = 16), 
    text = element_text(size = 14)) + xlab("Grip") + ylab("MEP mMax")
plot(bargraph)

#### BoxPlot Graph
boxplot <- ggplot(froman, aes(factor(oclock), mMax, fill = factor(position))) + 
    geom_boxplot() + theme(axis.line = element_line(colour = "black"), panel.background = element_blank(), 
    axis.title = element_text(size = 16), text = element_text(size = 14)) + 
    xlab("Grip") + ylab("MEP mMax")
plot(boxplot)

### Histogram
histogram_froman <- ggplot(froman, aes(mMax, colour = factor(position))) + geom_density() + 
    theme(axis.line = element_line(colour = "black"), panel.background = element_blank(), 
        axis.title = element_text(size = 16), text = element_text(size = 14)) + 
    xlab("Grip") + ylab("MEP mMax") + facet_wrap(~oclock)
plot(histogram_froman)

ANOVA
library(ez)

### Need to figure out between versus within for the study...
anova_results <- ezANOVA(dv = mMax, wid = participant, within = oclock, between = position, 
    data = froman, return_aov = TRUE, detaile = TRUE)

anova_results  #show the summary table
$ANOVA
           Effect DFn DFd     SSn   SSd       F         p p<.05    ges
1     (Intercept)   1  18 57918.9 139.1 7496.68 4.820e-25     * 0.9944
2        position   1  18   577.5 139.1   74.75 7.974e-08     * 0.6379
3          oclock   1  18 41694.3 188.7 3976.48 1.424e-22     * 0.9922
4 position:oclock   1  18   440.9 188.7   42.05 4.240e-06     * 0.5736

$aov

Call:
aov(formula = formula(aov_formula), data = data)

Grand Mean: 38.05

Stratum 1: participant

Terms:
                position Residuals
Sum of Squares     577.5     139.1
Deg. of Freedom        1        18

Residual standard error: 2.78
1 out of 2 effects not estimable
Estimated effects are balanced

Stratum 2: participant:oclock

Terms:
                oclock position:oclock Residuals
Sum of Squares   41694             441       189
Deg. of Freedom      1               1        18

Residual standard error: 3.238
Estimated effects may be unbalanced