R Workshop
Daniel Fuller
Welcome to the R Workshop @ Memorial HKR
The R project for statistical computing is a free open source statistical programming language and project. Follow these steps to get started:
- Download and install R from the Dalhousie CRAN (Comprehensive R Archive Network) Mirror http://mirror.its.dal.ca/cran/
- Download and install the Open Source Rstudio Desktop License https://www.rstudio.com/products/rstudio/download/. RStudio is a free IDE (Integrated Development Environment) built for R. It makes your life easier!
Watch a cheezy R Studio video below
RStudio Overview – 1:30 from RStudio, Inc. on Vimeo.
1. What is R?
R was first written as a research project by Ross Ihaka and Robert Gentleman, and is now under active development by a group of statisticians called ‘the R core team’. R is available free of charge and is distributed under the terms of the Free Software Foundation’s GNU General Public License. You can download the program from the Comprehensive R Archive Network (CRAN). Ready-to-run ‘binaries’ are available for Windows, Mac OS X, and Linux. The source code is also available for download and can be compiled for other platforms. German Rodriguez
2. Why use R?
The main advantages of R are the fact that R is freeware and that there is a lot of help available online.
- Strengths
- free and open source, supported by a strong user community
- highly extensible and flexible
- implementation of modern statistical methods
- assists with replication and extension of your work
- Weaknesses
- slow with very large data sets
- non-standard programming paradigms
3. Important Notes
- First a quick note that the first sections of this document borrow from Torfs & Bruaer – A very short introduction to R, Spector – An introduction to R, and Rodriguez – Introducing R.
- We will mostly be working in the RStudio environment. You can find many cheat sheets if you can’t quite remember the correct code. That said, there is almost always more than one way to do the same thing in R so find a way that works for you.
- Coding style is important. Follow the Hadley or Goolge style guide. Leave comments so your team can pass the “hit by a bus test.”
4. Let’s get started
First R Commands
Calculator
You can use R as a basic calculator. If you pass the command 2+2 it will return 4.
2 + 2
[1] 4
You can use arithmetic and mathematical operators in your commands
Code | Actionf |
---|---|
+ | Addition |
- | Subtraction |
_*_ | Multiplication |
/ | Division |
^ | Raise to the power of |
You can also do more complex calculations
2^6/32 - 1
[1] 1
Assignment and Variables
You can use the “<-” symbol to assign a variable. This is not a variable in the social science/statistics sense. Think if it as storing something to be used again later. So if we want to store the number 4 and call it a we can. Then we ask R to return the value of a we get 4.
a <- 4
a
[1] 4
With our newly stored a we can do calculations.
a * 20
[1] 80
Scalars, vectors, and matrices
Like in many other programs, R organizes numbers in
- Scalars (a single number – 0-dimensional)
- Vectors (a row of numbers, also called arrays – 1-dimensional)
- Matrices (like a table – 2- dimensional)
We can define a vector with the numbers 1, 2, 3, and 4 with the function c
, which is short for concatenate (paste together).
vector <- c(1, 2, 3, 4)
vector
[1] 1 2 3 4
Matrices and dataframes will be introduced later…
Functions
We can apply functions to the vector we just created. For example, if we wanted to compute the mean of the vector we could write
(1 + 2 + 3 + 4)/4
[1] 2.5
That is long and error prone. So we can call a function from Base R named mean
to do the same thing.
mean(vector)
[1] 2.5
Within the brackets you specify the arguments
. Arguments give extra information to the function.
There a lots of pre-existing arguments in R. If you are thinking about it, someone probably already wrote a function for it. Here is a non-exaustive list of mathematical and statistical functions borrowed from Wagner – BIO360.
Code | Action |
---|---|
mean(x) | Average |
median(x) | Median |
var(x) | Variance |
sd(x) | Standard Deviation |
log(x) | Logarithm of x |
exp(x) | Exponential function ex |
sin(x) | Sin |
cos(x) | Cosine |
tan(x) | Tangent |
min(x) | Minimum value |
max(x) | Maximum value |
sum(x) | Sum of values |
cumsum(x) | Cummulative sum of values |
length(x) | Length of vector |
5. Help
To get help for a specific function type help(rnorm)
. This will provide you with the documentation. You can also do this on the bottom right panel in RStudio.
If you want an example you can type example(rnorm)
to get an example. Typically, the help documentation will provide an example.
One of the best things about R is that everyone is in the same boat. There are tons of online tutorials and help to be found. Some good places to go are:
- UCLA Institute for Digital Research and Education for data analysis examples with annotated output.
- stackoverflow/r for asking and finding answers to R related questions. Make sure to create a reproducible example.
- Generally Google using the term R will also be very helpful.
6. Packages/Libraries
R can do many statistical and data analyses. They are organized in so-called packages or libraries. With the standard installation, most common packages are installed. There are lots of packages. It’s probable that if you have thought about the analysis, there is a package that is able to do it already. There are two basic steps to using a package:
- Installing the package
install.packages("ggplot2")
- Loading the package
library(ggplot2)
Matrices and Data Frames
Matrices
R understands matrices. This can be used to your advantage. The following function creates a 3 by 4 matrix and fills it by columns with the numbers 1 to 12:
M <- matrix(1:12, 3, 4)
M
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
The elements of a matrix may be addressed using the row and column subscripts in square brackets, separated by a comma. Thus, M[1,1]
is the first element of M.
A subscript may be left blank to select an entire row or column: M[1,]
is the first row and M[,1]
is the first column of M. Any of the subscripts can be a vector, so M[1:2,1:2]
is the upper-left 2 by 2 corner of M.
M[1:2, 1:2]
[,1] [,2]
[1,] 1 4
[2,] 2 5
Data Frames
Data frames are probably what you are used to working with. This is the typical data format other statistical software, like SPSS and Stata, use. In R, are a data frame can be comprised of many different data types (i.e., numeric, string, factor) so long as all variables are the same length. Data frames also typically have labelled headers with some meaning, unlike the matrix example above.
7. Working with real data
Set a working directory
Mac setwd(“/Users/dfuller/Desktop/”)
PC setwd(“C:\Users\Andrie\R\win-library\”)
You can import pretty much any type of data into R with the foreign
package. It allows you to import and work with data without owning the software. Once you do that you can also write the data to .csv or another reasonable data format. Some examples:
Package | Code | Action |
---|---|---|
haven | read.sav | SPSS data |
foreign | read.dta | Stata data |
read_excel | readxl | Excel data |
You can read many file formats using the haven
package.
Example of read.csv
cchs <- read.csv("cchs_RWorkshop.csv")
Example of read_sav
library(haven)
cchs_spss <- read_sav("cchs_RWorkshop.sav")
Example of readxl
library(readxl)
cchs_xl <- read_excel("cchs_RWorkshop.xlsx", col_types = c("numeric", "text",
"text", "numeric", "numeric"))
Can also write data to many different formats SPSS, Stata, SAS, csv…
Example 1. Survey type data
Exploring the Data
Now we have some data. The variables are as follows:
- CASEID
- Unique identifier for each participant
- verdate
- Date the survey was complete
- geogprv
- Province
- hwtghtm
- Height in Meters
- hwtgwtk
- Weight in Kilograms
We can also use head
to get a quick view of the data frame
head(cchs)
CASEID verdate geogprv hwtghtm hwtgwtk
1 1 20130913 35 1.575 65.25
2 2 20130913 59 1.905 99.00
3 3 20130913 35 1.803 77.40
4 4 20130913 46 1.727 85.50
5 5 20130913 24 1.803 81.00
6 6 20130913 48 1.727 78.75
We will mostly be working the Hadley/Rstudio way to do this. That said, one of the nice/horrific things about R is that there are many many ways to do the same thing.
Some basic concepts:
- Unlike other stats programs R makes you call the data and variable together. So if I want the
weight
variable from thecchs
data you can do that a few different ways: cchs$hwtghtm
– Call the data and variable by namecchs[,4]
– Use matrix logic to call the 4th variable- Sometimes packages have a
data =
to call the data and then you can call the varible name at the correct places. This is common in regression type packages.
Descriptive Statistics
library(psych)
psych::describe(cchs$hwtghtm)
vars n mean sd median trimmed mad min max range skew kurtosis
X1 1 124929 2.01 1.62 1.68 1.69 0.11 1.27 10 8.73 4.71 20.24
se
X1 0
psych::describe(cchs[, 4])
vars n mean sd median trimmed mad min max range skew kurtosis
X1 1 124929 2.01 1.62 1.68 1.69 0.11 1.27 10 8.73 4.71 20.24
se
X1 0
library(stargazer)
stargazer(cchs[c("hwtghtm")], type = "text")
============================================
Statistic N Mean St. Dev. Min Max
--------------------------------------------
hwtghtm 124,929 2.014 1.620 1.270 9.999
--------------------------------------------
Graph the Data
library(ggplot2)
scatter <- ggplot(cchs, aes(x = hwtghtm, y = hwtgwtk)) + geom_point() + theme(axis.line = element_line(colour = "black"),
panel.background = element_blank(), axis.title = element_text(size = 16),
text = element_text(size = 14)) + xlab("Height in Meters") + ylab("Weight in Kilograms")
plot(scatter)
Wooo what the heck is happening here? Looks like the data are coded in a specific way (SPSS style) for missing data. With 999.96 and 999.99 for weight and 9.996 and 9.999 for height.
Missing Values
“The value NA is used to represent missing values for input in R. An NA can be assigned to a variable directly to create a missing value, but to test for missing values, the function is.na must be used. R propogates missing values throughout computations, so often computations performed on data con- taining missing values will result in more missing values. Some of the basic statistical functions (like mean, min, max, etc.) have an argument called na.rm, which, if set to TRUE, will remove the NAs from your data before calculations are performed. In addition, the statistical modeling functions (like aov, glm, loess, etc.) provide an argument called na.action. This argument can be set to a function which will be called to operate on the data before it is processed. One very useful choice for the na.action argument is na.omit. This function (which can be called idependently of the statistical modelling functions) will remove all the rows of a data frame or matrix which contain any missing values.” (Spector – An introduction to R)
Recoding data
Recoding a values from a continous variable.
library(car)
# Recoding the height data
cchs$height_m <- car::recode(cchs$hwtghtm, "9.996=NA;
9.999=NA;",
as.factor.result = FALSE)
# Recoding the weight data
cchs$weight_kg <- car::recode(cchs$hwtgwtk, "999.99=NA;
999.96=NA;",
as.factor.result = FALSE)
Creating a new variable from 2 numerical variables.
cchs$bmi <- (cchs$weight_kg/cchs$height_m^2)
Categorizing a continuous variable.
library(car)
cchs$c_bmi <- car::recode(cchs$bmi, "lo:18.499='1. Underweight';
18.5:25='2. Normal';
25.00001:30='3. Overweight';
30.00001:hi='4. Obese';",
as.factor.result = TRUE)
## Let's also recode geogprv so it makes more sense
cchs$province <- car::recode(cchs$geogprv, "10='NL';
11='PEI';
12='NS';
13='NB';
24='QC';
35='ON';
46='MN';
47='SK';
48='AB';
59='BC';
60='TR';
96:99=NA;",
as.factor.result = TRUE)
Let’s graph that data again
library(ggplot2)
scatter <- ggplot(cchs, aes(x = height_m, y = weight_kg)) + geom_point(aes(colour = factor(c_bmi)),
alpha = 1/5) + theme(axis.line = element_line(colour = "black"), panel.background = element_blank(),
axis.title = element_text(size = 16), text = element_text(size = 14)) +
xlab("Height in Meters") + ylab("Weight in Kilograms")
plot(scatter)
Here I added aes(colour = factor(c_bmi))
to the geom_point
command to get colours by BMI category. There are many ways to change the colour scheme but I’m not going into that at the moment.
Correlation Table
What if we wanted to look at the correlation between height and weight now that we have cleaned the data.
library(PerformanceAnalytics)
library(dplyr)
### Bit of jump here but we are going to subset columns
bmi_corr <- select(cchs, height_m, weight_kg, bmi)
### Run the correlation chart on the new data frame
chart.Correlation(bmi_corr, histogram = TRUE, pch = 19)
Subsetting
Image want to run a t-test with only 2 provinces, NL and BC. To simplify that task I can create a new data frame with only NL and BC. Then I can run the t-test relatively easily.
library(dplyr)
bmi_NL_BC <- filter(cchs, province == c("NL", "BC"))
table(bmi_NL_BC$province)
AB BC MN NB NL NS ON PEI QC SK TR
0 7694 0 0 1747 0 0 0 0 0 0
T-Test
Now that we have created the new data frame we can run a t-test comparing the 2 provinces. Also nice to graph the data.
# Sort by group then ID
bmi_NL_BC <- arrange(bmi_NL_BC, province, CASEID)
# t-test
t.test(bmi ~ province, bmi_NL_BC)
Welch Two Sample t-test
data: bmi by province
t = -12, df = 2200, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.089 -1.512
sample estimates:
mean in group BC mean in group NL
25.4 27.2
Let’s make a graph of the t-test data
library(ggplot2)
histo_ttest <- ggplot(bmi_NL_BC, aes(bmi, colour = factor(province))) + geom_density(alpha = 0.5) +
theme(axis.line = element_line(colour = "black"), panel.background = element_blank(),
axis.title = element_text(size = 16), text = element_text(size = 14)) +
xlab("BMI") + ylab("Density")
plot(histo_ttest)
Collapsing by
Let’s say I want to get the mean and SD of BMI for each province. I can do that with dplyr
.
library(dplyr)
bmi_prov <- cchs %>% group_by(province) %>% summarise(mean_BMI = mean(bmi, na.rm = TRUE),
sd_BMI = sd(bmi, na.rm = TRUE), total = n())
bmi_prov
# A tibble: 11 × 4
province mean_BMI sd_BMI total
<fctr> <dbl> <dbl> <int>
1 AB 26.31 5.493 11321
2 BC 25.45 4.970 15413
3 MN 26.86 5.680 6962
4 NB 27.00 5.662 4786
5 NL 27.11 5.427 3625
6 NS 26.79 5.591 4629
7 ON 26.07 5.336 42915
8 PEI 26.64 5.319 1774
9 QC 25.47 5.031 23260
10 SK 26.83 5.478 7161
11 TR 26.36 5.589 3083
Running a regression
regressionModel <- lm(height_m ~ weight_kg, data = cchs)
summary(regressionModel)
Call:
lm(formula = height_m ~ weight_kg, data = cchs)
Residuals:
Min 1Q Median 3Q Max
-0.3591 -0.0578 0.0014 0.0605 0.3375
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.4628429 0.0011057 1323 <2e-16 ***
weight_kg 0.0030080 0.0000144 208 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.0862 on 116980 degrees of freedom
(7947 observations deleted due to missingness)
Multiple R-squared: 0.271, Adjusted R-squared: 0.271
F-statistic: 4.34e+04 on 1 and 116980 DF, p-value: <2e-16
Here we see an example of the data =
call we talked about earlier.
We use summary to get the summary information from the model. We can also use anova
or plot
to look at model fit.
anova(regressionModel)
Analysis of Variance Table
Response: height_m
Df Sum Sq Mean Sq F value Pr(>F)
weight_kg 1 322 322 43383 <2e-16 ***
Residuals 116980 868 0
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## plot(regressionModel)
Example 2. Repeated Measures
Data are approximations from Froman et al. Experiment 1.
Generate simulated data
library(dplyr)
froman <- NULL # Create a NULL data.frame
froman$participant <- list(1:40) # Create a list
froman <- as.data.frame(froman) # Convert list to data.frame
froman$position <- c("neutral", "pronated") # Create a variable that is neutral or pronated
froman$position <- as.factor(froman$position)
froman <- arrange(froman, position) # Sort the position variable
froman$oclock <- c("six", "twelve") # Create a variable called oclock
froman$oclock <- as.factor(froman$oclock)
# Create an mMax variable using the rnorm function and a mega ifelse
# function
set.seed(100) # Make the data the same for each rnorm command
froman$mMax <- ifelse(froman$position == "neutral" & froman$oclock == "six",
rnorm(10, mean = 81.2, sd = 10), ifelse(froman$position == "pronated" &
froman$oclock == "six", rnorm(10, mean = 65.8, sd = 10), ifelse(froman$position ==
"neutral" & froman$oclock == "twelve", rnorm(10, mean = 5.3, sd = 3),
ifelse(froman$position == "pronated" & froman$oclock == "twelve", rnorm(10,
mean = 3.6, sd = 3), NA))))
froman <- arrange(froman, oclock, position)
# Create a participant variable to capture the repeated measures
froman$participant <- ifelse(froman$position == "neutral" & froman$oclock ==
"six", seq(c(1:10)), ifelse(froman$position == "pronated" & froman$oclock ==
"six", seq(c(1:10)), ifelse(froman$position == "neutral" & froman$oclock ==
"twelve", seq(c(1:10)), ifelse(froman$position == "pronated" & froman$oclock ==
"twelve", seq(c(1:10)), NA))))
froman$participant <- as.factor(froman$participant)
# Sort by participant
froman <- arrange(froman, position, oclock)
write.csv(froman, "froman.csv")
Now we have some data. The variables are as follows:
- positions
- Neutral or Pronated
- oclock
- six or twelve
- mMax
- Max of normalized MEP value
- participant
- Participant number
Get some descriptive statistics by position and oclock
library(dplyr)
froman %>% group_by(position, oclock) %>% summarise(mean_mMax = mean(mMax, na.rm = TRUE),
sd_mMax = sd(mMax, na.rm = TRUE), total = n())
Source: local data frame [4 x 5]
Groups: position [?]
position oclock mean_mMax sd_mMax total
<fctr> <fctr> <dbl> <dbl> <int>
1 neutral six 77.458 3.625 10
2 neutral twelve 6.246 1.408 10
3 pronated six 63.218 3.991 10
4 pronated twelve 5.287 2.317 10
Graph the data using ggplot2
library(ggplot2)
#### Bar Graph
bargraph <- ggplot(froman, aes(factor(oclock), mMax, fill = factor(position))) +
geom_bar(stat = "identity", position = position_dodge()) + theme(axis.line = element_line(colour = "black"),
panel.background = element_blank(), axis.title = element_text(size = 16),
text = element_text(size = 14)) + xlab("Grip") + ylab("MEP mMax")
plot(bargraph)
#### BoxPlot Graph
boxplot <- ggplot(froman, aes(factor(oclock), mMax, fill = factor(position))) +
geom_boxplot() + theme(axis.line = element_line(colour = "black"), panel.background = element_blank(),
axis.title = element_text(size = 16), text = element_text(size = 14)) +
xlab("Grip") + ylab("MEP mMax")
plot(boxplot)
### Histogram
histogram_froman <- ggplot(froman, aes(mMax, colour = factor(position))) + geom_density() +
theme(axis.line = element_line(colour = "black"), panel.background = element_blank(),
axis.title = element_text(size = 16), text = element_text(size = 14)) +
xlab("Grip") + ylab("MEP mMax") + facet_wrap(~oclock)
plot(histogram_froman)
ANOVA
library(ez)
### Need to figure out between versus within for the study...
anova_results <- ezANOVA(dv = mMax, wid = participant, within = oclock, between = position,
data = froman, return_aov = TRUE, detaile = TRUE)
anova_results #show the summary table
$ANOVA
Effect DFn DFd SSn SSd F p p<.05 ges
1 (Intercept) 1 18 57918.9 139.1 7496.68 4.820e-25 * 0.9944
2 position 1 18 577.5 139.1 74.75 7.974e-08 * 0.6379
3 oclock 1 18 41694.3 188.7 3976.48 1.424e-22 * 0.9922
4 position:oclock 1 18 440.9 188.7 42.05 4.240e-06 * 0.5736
$aov
Call:
aov(formula = formula(aov_formula), data = data)
Grand Mean: 38.05
Stratum 1: participant
Terms:
position Residuals
Sum of Squares 577.5 139.1
Deg. of Freedom 1 18
Residual standard error: 2.78
1 out of 2 effects not estimable
Estimated effects are balanced
Stratum 2: participant:oclock
Terms:
oclock position:oclock Residuals
Sum of Squares 41694 441 189
Deg. of Freedom 1 1 18
Residual standard error: 3.238
Estimated effects may be unbalanced