Welcome to the R Workshop @ Memorial HKR
The R project for statistical computing is a free open source statistical programming language and project. Follow these steps to get started:
- Download and install R from the Dalhousie CRAN (Comprehensive R Archive Network) Mirror http://mirror.its.dal.ca/cran/
- Download and install the Open Source Rstudio Desktop License https://www.rstudio.com/products/rstudio/download/. RStudio is a free IDE (Integrated Development Environment) built for R. It makes your life easier!
You should see something that looks like in the video.
- If yes, great
- If no, email me.
1. What is R?
R was first written as a research project by Ross Ihaka and Robert Gentleman, and is now under active development by a group of statisticians called ‘the R core team’. R is available free of charge and is distributed under the terms of the Free Software Foundation’s GNU General Public License. You can download the program from the Comprehensive R Archive Network (CRAN). Ready-to-run ‘binaries’ are available for Windows, Mac OS X, and Linux. The source code is also available for download and can be compiled for other platforms. German Rodriguez
2. Why use R?
The main advantages of R are the fact that R is freeware and that there is a lot of help available online.
- free and open source, supported by a strong user community
- highly extensible and flexible
- implementation of modern statistical methods
- assists with replication and extension of your work
- slow with very large data sets
- non-standard programming paradigms
3. Important Notes
- First a quick note that the first sections of this document borrow from Torfs & Bruaer – A very short introduction to R, Spector – An introduction to R, and Rodriguez – Introducing R.
- We will mostly be working in the RStudio environment. You can find many cheat sheets if you can’t quite remember the correct code. That said, there is almost always more than one way to do the same thing in R so find a way that works for you.
- Coding style is important. Follow the Hadley or Goolge style guide. Leave comments so your team can pass the “hit by a bus test.”
4. Let’s get started
First R Commands
You can use R as a basic calculator. If you pass the command 2+2 it will return 4.
2 + 2
You can use arithmetic and mathematical operators in your commands
|^||Raise to the power of|
You can also do more complex calculations
2^6/32 - 1
Assignment and Variables
You can use the “<-” symbol to assign a variable. This is not a variable in the social science/statistics sense. Think if it as storing something to be used again later. So if we want to store the number 4 and call it a we can. Then we ask R to return the value of a we get 4.
a <- 4 a
With our newly stored a we can do calculations.
a * 20
Scalars, vectors, and matrices
Like in many other programs, R organizes numbers in
- Scalars (a single number – 0-dimensional)
- Vectors (a row of numbers, also called arrays – 1-dimensional)
- Matrices (like a table – 2- dimensional)
We can define a vector with the numbers 1, 2, 3, and 4 with the function
c, which is short for concatenate (paste together).
vector <- c(1, 2, 3, 4) vector
 1 2 3 4
Matrices and dataframes will be introduced later…
We can apply functions to the vector we just created. For example, if we wanted to compute the mean of the vector we could write
(1 + 2 + 3 + 4)/4
That is long and error prone. So we can call a function from Base R named
mean to do the same thing.
Within the brackets you specify the
arguments. Arguments give extra information to the function.
There a lots of pre-existing arguments in R. If you are thinking about it, someone probably already wrote a function for it. Here is a non-exaustive list of mathematical and statistical functions borrowed from Wagner – BIO360.
|log(x)||Logarithm of x|
|exp(x)||Exponential function ex|
|sum(x)||Sum of values|
|cumsum(x)||Cummulative sum of values|
|length(x)||Length of vector|
To get help for a specific function type
help(rnorm). This will provide you with the documentation. You can also do this on the bottom right panel in RStudio.
If you want an example you can type
example(rnorm) to get an example. Typically, the help documentation will provide an example.
One of the best things about R is that everyone is in the same boat. There are tons of online tutorials and help to be found. Some good places to go are:
R can do many statistical and data analyses. They are organized in so-called packages or libraries. With the standard installation, most common packages are installed. There are lots of packages. It’s probable that if you have thought about the analysis, there is a package that is able to do it already. There are two basic steps to using a package:
- Installing the package
- Loading the package
Matrices and Data Frames
R understands matrices. This can be used to your advantage. The following function creates a 3 by 4 matrix and fills it by columns with the numbers 1 to 12:
M <- matrix(1:12, 3, 4) M
[,1] [,2] [,3] [,4] [1,] 1 4 7 10 [2,] 2 5 8 11 [3,] 3 6 9 12
The elements of a matrix may be addressed using the row and column subscripts in square brackets, separated by a comma. Thus,
M[1,1] is the first element of M.
A subscript may be left blank to select an entire row or column:
M[1,] is the first row and
M[,1] is the first column of M. Any of the subscripts can be a vector, so
M[1:2,1:2] is the upper-left 2 by 2 corner of M.
[,1] [,2] [1,] 1 4 [2,] 2 5
Data frames are probably what you are used to working with. This is the typical data format other statistical software, like SPSS and Stata, use. In R, are a data frame can be comprised of many different data types (i.e., numeric, string, factor) so long as all variables are the same length. Data frames also typically have labelled headers with some meaning, unlike the matrix example above.
7. Working with real data
Set a working directory
You can import pretty much any type of data into R with the
foreign package. It allows you to import and work with data without owning the software. Once you do that you can also write the data to .csv or another reasonable data format. Some examples:
You can read many file formats using the
cchs <- read.csv("cchs_RWorkshop.csv")
library(haven) cchs_spss <- read_sav("cchs_RWorkshop.sav")
library(readxl) cchs_xl <- read_excel("cchs_RWorkshop.xlsx", col_types = c("numeric", "text", "text", "numeric", "numeric"))
Can also write data to many different formats SPSS, Stata, SAS, csv…
Example 1. Survey type data
Exploring the Data
Now we have some data. The variables are as follows:
- Unique identifier for each participant
- Date the survey was complete
- Height in Meters
- Weight in Kilograms
We can also use
head to get a quick view of the data frame
CASEID verdate geogprv hwtghtm hwtgwtk 1 1 20130913 35 1.575 65.25 2 2 20130913 59 1.905 99.00 3 3 20130913 35 1.803 77.40 4 4 20130913 46 1.727 85.50 5 5 20130913 24 1.803 81.00 6 6 20130913 48 1.727 78.75
We will mostly be working the Hadley/Rstudio way to do this. That said, one of the nice/horrific things about R is that there are many many ways to do the same thing.
Some basic concepts:
- Unlike other stats programs R makes you call the data and variable together. So if I want the
weightvariable from the
cchsdata you can do that a few different ways:
cchs$hwtghtm– Call the data and variable by name
cchs[,4]– Use matrix logic to call the 4th variable
- Sometimes packages have a
data =to call the data and then you can call the varible name at the correct places. This is common in regression type packages.
vars n mean sd median trimmed mad min max range skew kurtosis X1 1 124929 2.01 1.62 1.68 1.69 0.11 1.27 10 8.73 4.71 20.24 se X1 0
vars n mean sd median trimmed mad min max range skew kurtosis X1 1 124929 2.01 1.62 1.68 1.69 0.11 1.27 10 8.73 4.71 20.24 se X1 0
library(stargazer) stargazer(cchs[c("hwtghtm")], type = "text")
============================================ Statistic N Mean St. Dev. Min Max -------------------------------------------- hwtghtm 124,929 2.014 1.620 1.270 9.999 --------------------------------------------
Graph the Data
library(ggplot2) scatter <- ggplot(cchs, aes(x = hwtghtm, y = hwtgwtk)) + geom_point() + theme(axis.line = element_line(colour = "black"), panel.background = element_blank(), axis.title = element_text(size = 16), text = element_text(size = 14)) + xlab("Height in Meters") + ylab("Weight in Kilograms") plot(scatter)
Wooo what the heck is happening here? Looks like the data are coded in a specific way (SPSS style) for missing data. With 999.96 and 999.99 for weight and 9.996 and 9.999 for height.
“The value NA is used to represent missing values for input in R. An NA can be assigned to a variable directly to create a missing value, but to test for missing values, the function is.na must be used. R propogates missing values throughout computations, so often computations performed on data con- taining missing values will result in more missing values. Some of the basic statistical functions (like mean, min, max, etc.) have an argument called na.rm, which, if set to TRUE, will remove the NAs from your data before calculations are performed. In addition, the statistical modeling functions (like aov, glm, loess, etc.) provide an argument called na.action. This argument can be set to a function which will be called to operate on the data before it is processed. One very useful choice for the na.action argument is na.omit. This function (which can be called idependently of the statistical modelling functions) will remove all the rows of a data frame or matrix which contain any missing values.” (Spector – An introduction to R)
Recoding a values from a continous variable.
library(car) # Recoding the height data cchs$height_m <- car::recode(cchs$hwtghtm, "9.996=NA; 9.999=NA;", as.factor.result = FALSE) # Recoding the weight data cchs$weight_kg <- car::recode(cchs$hwtgwtk, "999.99=NA; 999.96=NA;", as.factor.result = FALSE)
Creating a new variable from 2 numerical variables.
cchs$bmi <- (cchs$weight_kg/cchs$height_m^2)
Categorizing a continuous variable.
library(car) cchs$c_bmi <- car::recode(cchs$bmi, "lo:18.499='1. Underweight'; 18.5:25='2. Normal'; 25.00001:30='3. Overweight'; 30.00001:hi='4. Obese';", as.factor.result = TRUE) ## Let's also recode geogprv so it makes more sense cchs$province <- car::recode(cchs$geogprv, "10='NL'; 11='PEI'; 12='NS'; 13='NB'; 24='QC'; 35='ON'; 46='MN'; 47='SK'; 48='AB'; 59='BC'; 60='TR'; 96:99=NA;", as.factor.result = TRUE)
Let’s graph that data again
library(ggplot2) scatter <- ggplot(cchs, aes(x = height_m, y = weight_kg)) + geom_point(aes(colour = factor(c_bmi)), alpha = 1/5) + theme(axis.line = element_line(colour = "black"), panel.background = element_blank(), axis.title = element_text(size = 16), text = element_text(size = 14)) + xlab("Height in Meters") + ylab("Weight in Kilograms") plot(scatter)
Here I added
aes(colour = factor(c_bmi)) to the
geom_point command to get colours by BMI category. There are many ways to change the colour scheme but I’m not going into that at the moment.
What if we wanted to look at the correlation between height and weight now that we have cleaned the data.
library(PerformanceAnalytics) library(dplyr) ### Bit of jump here but we are going to subset columns bmi_corr <- select(cchs, height_m, weight_kg, bmi) ### Run the correlation chart on the new data frame chart.Correlation(bmi_corr, histogram = TRUE, pch = 19)
Image want to run a t-test with only 2 provinces, NL and BC. To simplify that task I can create a new data frame with only NL and BC. Then I can run the t-test relatively easily.
library(dplyr) bmi_NL_BC <- filter(cchs, province == c("NL", "BC")) table(bmi_NL_BC$province)
AB BC MN NB NL NS ON PEI QC SK TR 0 7694 0 0 1747 0 0 0 0 0 0
Now that we have created the new data frame we can run a t-test comparing the 2 provinces. Also nice to graph the data.
# Sort by group then ID bmi_NL_BC <- arrange(bmi_NL_BC, province, CASEID) # t-test t.test(bmi ~ province, bmi_NL_BC)
Welch Two Sample t-test data: bmi by province t = -12, df = 2200, p-value <2e-16 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -2.089 -1.512 sample estimates: mean in group BC mean in group NL 25.4 27.2
Let’s make a graph of the t-test data
library(ggplot2) histo_ttest <- ggplot(bmi_NL_BC, aes(bmi, colour = factor(province))) + geom_density(alpha = 0.5) + theme(axis.line = element_line(colour = "black"), panel.background = element_blank(), axis.title = element_text(size = 16), text = element_text(size = 14)) + xlab("BMI") + ylab("Density") plot(histo_ttest)
Let’s say I want to get the mean and SD of BMI for each province. I can do that with
library(dplyr) bmi_prov <- cchs %>% group_by(province) %>% summarise(mean_BMI = mean(bmi, na.rm = TRUE), sd_BMI = sd(bmi, na.rm = TRUE), total = n()) bmi_prov
# A tibble: 11 × 4 province mean_BMI sd_BMI total <fctr> <dbl> <dbl> <int> 1 AB 26.31 5.493 11321 2 BC 25.45 4.970 15413 3 MN 26.86 5.680 6962 4 NB 27.00 5.662 4786 5 NL 27.11 5.427 3625 6 NS 26.79 5.591 4629 7 ON 26.07 5.336 42915 8 PEI 26.64 5.319 1774 9 QC 25.47 5.031 23260 10 SK 26.83 5.478 7161 11 TR 26.36 5.589 3083
Running a regression
regressionModel <- lm(height_m ~ weight_kg, data = cchs) summary(regressionModel)
Call: lm(formula = height_m ~ weight_kg, data = cchs) Residuals: Min 1Q Median 3Q Max -0.3591 -0.0578 0.0014 0.0605 0.3375 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.4628429 0.0011057 1323 <2e-16 *** weight_kg 0.0030080 0.0000144 208 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.0862 on 116980 degrees of freedom (7947 observations deleted due to missingness) Multiple R-squared: 0.271, Adjusted R-squared: 0.271 F-statistic: 4.34e+04 on 1 and 116980 DF, p-value: <2e-16
Here we see an example of the
data = call we talked about earlier.
We use summary to get the summary information from the model. We can also use
plot to look at model fit.
Analysis of Variance Table Response: height_m Df Sum Sq Mean Sq F value Pr(>F) weight_kg 1 322 322 43383 <2e-16 *** Residuals 116980 868 0 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Example 2. Repeated Measures
Data are approximations from Froman et al. Experiment 1.
Generate simulated data
library(dplyr) froman <- NULL # Create a NULL data.frame froman$participant <- list(1:40) # Create a list froman <- as.data.frame(froman) # Convert list to data.frame froman$position <- c("neutral", "pronated") # Create a variable that is neutral or pronated froman$position <- as.factor(froman$position) froman <- arrange(froman, position) # Sort the position variable froman$oclock <- c("six", "twelve") # Create a variable called oclock froman$oclock <- as.factor(froman$oclock) # Create an mMax variable using the rnorm function and a mega ifelse # function set.seed(100) # Make the data the same for each rnorm command froman$mMax <- ifelse(froman$position == "neutral" & froman$oclock == "six", rnorm(10, mean = 81.2, sd = 10), ifelse(froman$position == "pronated" & froman$oclock == "six", rnorm(10, mean = 65.8, sd = 10), ifelse(froman$position == "neutral" & froman$oclock == "twelve", rnorm(10, mean = 5.3, sd = 3), ifelse(froman$position == "pronated" & froman$oclock == "twelve", rnorm(10, mean = 3.6, sd = 3), NA)))) froman <- arrange(froman, oclock, position) # Create a participant variable to capture the repeated measures froman$participant <- ifelse(froman$position == "neutral" & froman$oclock == "six", seq(c(1:10)), ifelse(froman$position == "pronated" & froman$oclock == "six", seq(c(1:10)), ifelse(froman$position == "neutral" & froman$oclock == "twelve", seq(c(1:10)), ifelse(froman$position == "pronated" & froman$oclock == "twelve", seq(c(1:10)), NA)))) froman$participant <- as.factor(froman$participant) # Sort by participant froman <- arrange(froman, position, oclock) write.csv(froman, "froman.csv")
Now we have some data. The variables are as follows:
- Neutral or Pronated
- six or twelve
- Max of normalized MEP value
- Participant number
Get some descriptive statistics by position and oclock
library(dplyr) froman %>% group_by(position, oclock) %>% summarise(mean_mMax = mean(mMax, na.rm = TRUE), sd_mMax = sd(mMax, na.rm = TRUE), total = n())
Source: local data frame [4 x 5] Groups: position [?] position oclock mean_mMax sd_mMax total <fctr> <fctr> <dbl> <dbl> <int> 1 neutral six 77.458 3.625 10 2 neutral twelve 6.246 1.408 10 3 pronated six 63.218 3.991 10 4 pronated twelve 5.287 2.317 10
Graph the data using ggplot2
library(ggplot2) #### Bar Graph bargraph <- ggplot(froman, aes(factor(oclock), mMax, fill = factor(position))) + geom_bar(stat = "identity", position = position_dodge()) + theme(axis.line = element_line(colour = "black"), panel.background = element_blank(), axis.title = element_text(size = 16), text = element_text(size = 14)) + xlab("Grip") + ylab("MEP mMax") plot(bargraph)
#### BoxPlot Graph boxplot <- ggplot(froman, aes(factor(oclock), mMax, fill = factor(position))) + geom_boxplot() + theme(axis.line = element_line(colour = "black"), panel.background = element_blank(), axis.title = element_text(size = 16), text = element_text(size = 14)) + xlab("Grip") + ylab("MEP mMax") plot(boxplot)
### Histogram histogram_froman <- ggplot(froman, aes(mMax, colour = factor(position))) + geom_density() + theme(axis.line = element_line(colour = "black"), panel.background = element_blank(), axis.title = element_text(size = 16), text = element_text(size = 14)) + xlab("Grip") + ylab("MEP mMax") + facet_wrap(~oclock) plot(histogram_froman)
library(ez) ### Need to figure out between versus within for the study... anova_results <- ezANOVA(dv = mMax, wid = participant, within = oclock, between = position, data = froman, return_aov = TRUE, detaile = TRUE) anova_results #show the summary table
$ANOVA Effect DFn DFd SSn SSd F p p<.05 ges 1 (Intercept) 1 18 57918.9 139.1 7496.68 4.820e-25 * 0.9944 2 position 1 18 577.5 139.1 74.75 7.974e-08 * 0.6379 3 oclock 1 18 41694.3 188.7 3976.48 1.424e-22 * 0.9922 4 position:oclock 1 18 440.9 188.7 42.05 4.240e-06 * 0.5736 $aov Call: aov(formula = formula(aov_formula), data = data) Grand Mean: 38.05 Stratum 1: participant Terms: position Residuals Sum of Squares 577.5 139.1 Deg. of Freedom 1 18 Residual standard error: 2.78 1 out of 2 effects not estimable Estimated effects are balanced Stratum 2: participant:oclock Terms: oclock position:oclock Residuals Sum of Squares 41694 441 189 Deg. of Freedom 1 1 18 Residual standard error: 3.238 Estimated effects may be unbalanced