Download and install R from this website http://cran.utstat.utoronto.ca/
Regular R can be a bit tricky to use. This is why a more user-friendly interface called R Studio was developed. It has all of the same functions, but rather than using only the command line, you can use the command line as well as various buttons and menus. Please follow the instructions below to download R Studio.
Go to https://www.rstudio.com/products/rstudio/download/#download
Look for R Studio Desktop and click “Download”
Select the appropriate installer for your operating system (Mac, Windows, etc.)
Now we are ready to start!
To keep the materials and documents from this tutorial organized on your computer, I recommend you create a new R project. You may do this by following the steps below:
1. Create a new folder on your desktop (or anywhere on your computer), or just move the folder I shared to your preferred location on your computer.
2. Open R Studio and click the menu item File > New Project…
3. Choose “Existing Directory” and navigate to your project folder.
4. Choose “Create Project”.
5. Check that a “.Rproj” file is in your project folder (created in step 1).
To get started working in R Studio, you must first open a new R script. You can do this by clicking the button in the far left corner of the top panel. It is white and has a small green plus sign. Click on the button, then select “R Script”. Alternatively, you can press Shift + Cmd + N on a Mac (Shift + Ctrl + N on Windows).
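If you need help with a specific function, type a question mark followed by the function name. For example, this opens the help page for the mean function:
?mean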
If you are not sure what package a function is in or whether the package is loaded, you can search all packages for help. For example, you can learn about the plot function using this command:
??plot
Note: R is open source and has a very large user base, so there is a lot of documentation online. As a result, a Google search for the function or error message you need help with is often a more effective solution. Also see the cheatsheets in the project folder.
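The two help operators above are shorthand for ordinary functions. The short sketch below uses only base R; `apropos()` and `args()` are extras that are not covered above, shown here because they are quick ways to look something up without leaving the console.

```r
# ?mean is shorthand for help("mean"): it opens the help page for one function
# ??plot is shorthand for help.search("plot"): it searches all installed packages
# (both are left as comments here because they open a separate help window)

# apropos() lists the names of available objects that contain a given string
matches <- apropos("mean")
print(head(matches))

# args() shows the arguments a function accepts without opening the help page
print(args(sd))
```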
Whenever you open R (especially when you are not working inside a project as we are today), you’ll want to set your working directory. This means identifying the folder where any output (scripts, graphs, etc.) that you may generate will be saved.
First, check where your working directory is currently located on your computer:
getwd()
Next, you can set your working directory to wherever you want it. I am setting my working directory to my desktop.
setwd("~/Desktop")
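As a small sketch of the same idea, the code below remembers the current working directory, switches to another folder only if that folder exists, and then switches back. It uses `tempdir()` purely as a stand-in for a folder that is guaranteed to exist, and the file name `my_plot.png` is hypothetical.

```r
# Remember the current working directory so it can be restored later
old_wd <- getwd()

# tempdir() is only a stand-in for a folder that is sure to exist
target <- tempdir()

# Only switch if the folder exists; setwd() stops with an error otherwise
if (dir.exists(target)) {
  setwd(target)
}
print(getwd())

# file.path() builds paths with the right separator on any operating system
out_file <- file.path(getwd(), "my_plot.png")  # hypothetical file name
print(out_file)

# Go back to where we started
setwd(old_wd)
```

Restoring the original directory at the end keeps the rest of your script from silently saving output in the wrong place.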
To complete certain operations, you’ll often need to download a package containing the required functions. If you are using a package for the first time, you must first install it. To do this, click on “Packages” in the bottom panel to the right of where your code appears, then click “Install” in the top left-hand corner and type the name of the package you would like to install. Once the package is installed, you will need to “call on the package” to use it. Note that you will NOT need to reinstall your packages when you close and reopen R, but you will need to call on your packages again (i.e., with library()). For example, let’s install and call on the package “ggplot2”:
install.packages("ggplot2")
library(ggplot2)
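Because a package only needs to be installed once, it can be convenient to install it only when it is missing. The sketch below shows one possible way to do that. `ensure_package()` is a hypothetical helper written for this example, not a built-in R function, and the demonstration uses the `stats` package (which ships with R) so that nothing is downloaded.

```r
# ensure_package() is a hypothetical helper, not part of R itself
ensure_package <- function(pkg) {
  # requireNamespace() returns FALSE (instead of stopping) if pkg is not installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # needs an internet connection
  }
  # character.only = TRUE lets library() accept the package name as a string
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with R, so nothing is downloaded in this demonstration;
# in this tutorial you would call ensure_package("ggplot2") instead
ensure_package("stats")
print("package:stats" %in% search())
```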
Most commonly, you will use R to examine, analyze and visualize data collected in an experiment. You can create a data frame yourself (by hand, or with functions that generate random or simulated values), but usually you would import your data into R (from an Excel sheet or text file).
As mentioned above, you usually won’t be creating data files from scratch. We are just doing it once here to practice. You can create a data frame using the command data.frame().
Let’s say we want to generate data for a group of students from an introductory Psychology course. There are 5 students in total and we have recorded their midterm grades and their majors.
myDataFrame = data.frame(MidtermGrade = c(20,49,100,75,80), Major = c("Psychology", "Biology", "Psychology", "Psychology", "Geography"))
Once you have run the command above, you should see “myDataFrame” appear in the Global Environment pane (on the top right in the default R Studio layout). Your N (obs.) and number of variables are also listed there.
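The same information shown in the Global Environment can also be requested from the console. A minimal sketch, rebuilding the data frame from above so that it runs on its own:

```r
# Same data frame as in the tutorial
myDataFrame <- data.frame(
  MidtermGrade = c(20, 49, 100, 75, 80),
  Major = c("Psychology", "Biology", "Psychology", "Psychology", "Geography")
)

n_students  <- nrow(myDataFrame)  # number of observations (rows)
n_variables <- ncol(myDataFrame)  # number of variables (columns)
print(c(n_students, n_variables))

# str() lists each column with its type and its first few values
str(myDataFrame)
```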
If you have a data file saved on your computer, you can load it into your R session. R works best with .csv files, but it is also possible to load .txt or Excel files. You can load a dataset by clicking on the “Import Dataset” button above the Global Environment, then clicking “From Text” and selecting the dataset you want to load. Alternatively, you can use a command such as the one below. Here, I am loading a dataset called Disney Vices that lists how much time Disney heroes spend consuming tobacco in different movies, the names of the movies, the length of the movies, and my rating of them as a child and as an adult on a scale from 1 to 7.
disney_vices <- read.csv("~/Desktop/Post-Doc I/R Tutorial/disney_vices.csv")
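The path above only works on my computer, so here is a sketch that runs anywhere. It first writes a tiny stand-in file to a temporary folder and then reads it back with read.csv(). The file name `demo_vices.csv` and the values in it are made up for illustration and are not the real Disney Vices data.

```r
# Write a tiny stand-in file (made-up values, not the real disney_vices data)
demo_path <- file.path(tempdir(), "demo_vices.csv")
demo <- data.frame(
  Movie = c("Movie A", "Movie B", "Movie C"),
  Tobacco_Seconds = c(0, 12, 3)
)
write.csv(demo, demo_path, row.names = FALSE)

# Check that the file exists before trying to read it
if (file.exists(demo_path)) {
  demo_loaded <- read.csv(demo_path)
} else {
  stop("File not found: ", demo_path)
}
print(demo_loaded)
```

Checking `file.exists()` first gives a clearer error message when the path contains a typo, which is a very common reason for read.csv() to fail.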
When you load in a new data frame, you may be interested in seeing how many variables you have, what they are called, etc. You can view your data frame in a separate window using the View() command. Just type the name of your data frame inside the parentheses, for example View(disney_vices).
Some data frames are too large to view conveniently in R. Instead, you can have R print parts of the data frame to the console (the window below the script you are typing in). Below are some examples.
head(disney_vices) ##shows the first few rows of your data frame as well as column names
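head() accepts an optional `n` argument, and tail() does the same for the end of the data frame. The sketch below uses a small stand-in data frame with made-up values (not the real disney_vices data) so that it runs on its own.

```r
# Stand-in data frame with made-up values (not the real disney_vices data)
demo <- data.frame(
  Movie = paste("Movie", 1:10),
  My_Rating = c(4, 7, 2, 5, 3, 6, 1, 4, 5, 3)
)

first_three <- head(demo, n = 3)  # first 3 rows instead of the default 6
last_two    <- tail(demo, n = 2)  # last 2 rows
print(first_three)
print(last_two)

# dim() returns the number of rows and columns in one call
print(dim(demo))
```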
Similarly, you might only want to look at what variables are included in your data frame. To do this, you can have R list all the column names.
colnames(disney_vices)
## [1] "Movie" "Length_Minutes" "Tobacco_Seconds"
## [4] "Alcohol_Seconds" "My_Rating" "Animals_MainCharacters"
## [7] "Release_Season" "Avg_Viewing_HeartRate"
To learn more about the levels of a variable, you can have R either count up the number of unique levels (i.e., categories) or print out the names of the unique categories.
length(unique(disney_vices$Release_Season)) # length() counts the number of elements in whatever is inside the ()
## [1] 4
unique(disney_vices$Release_Season) # unique prints out each unique category name
## [1] "Summer" "Fall" "Winter" "Spring"
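If you also want to know how many observations fall into each category, table() is a common next step. It is not covered above, so treat this as an optional extra. The vector below is a stand-in with made-up values, not the real Release_Season column.

```r
# Stand-in vector with made-up values (not the real Release_Season column)
season <- c("Summer", "Fall", "Summer", "Winter", "Spring", "Summer", "Fall")

# table() counts how many observations fall into each category
season_counts <- table(season)
print(season_counts)

# The number of categories matches length(unique(...))
print(length(season_counts) == length(unique(season)))
```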
When R reads in data, it tries to identify the type of variable in each column. Common types include numeric (numbers), integer (whole numbers), character (text), logical (TRUE/FALSE) and factor (categories).
For the types of statistics we usually run, we want numeric data or factors (these are normally used to group observations into a fixed number of unique categories or levels; more on that when we discuss inferential statistics). Sometimes it can be helpful to verify that a variable is in fact the right type and, if it is not, to transform the variable. There are a couple of ways to approach this. We can ask R what type a variable is by using either the str() or the typeof() command.
str(disney_vices$Movie) #gives you a snapshot of what the variable looks like
## chr [1:50] "101 Dalmations" "A Troll in Central Park" "Aladdin" ...
typeof(disney_vices$Movie) #only tells you what type of variable it is
## [1] "character"
When you just want to verify that a variable is in fact numeric (or any other specific type of variable), you can ask R directly. Here, we give R a statement (as if we were asking: “Hey R, the variable ‘Movie’ in the disney_vices dataset is numeric, right?”) and R will tell us if it is TRUE or FALSE.
is.numeric(disney_vices$Movie)
## [1] FALSE
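When a variable turns out to be the wrong type, it can be transformed with functions such as as.numeric() and as.factor(). The sketch below uses stand-in vectors with made-up values. Whether as.numeric() alone is enough for a real column depends on what the raw values look like (text that does not look like a number becomes NA).

```r
# Stand-in vectors with made-up values (not the real disney_vices columns)
length_chr <- c("79", "76", "90")            # numbers stored as text
season_chr <- c("Summer", "Fall", "Summer")  # categories stored as text

# as.numeric() converts text that looks like numbers into numeric values
length_num <- as.numeric(length_chr)
print(is.numeric(length_num))
print(mean(length_num))

# as.factor() converts text into a factor with one level per unique category
season_fct <- as.factor(season_chr)
print(is.factor(season_fct))
print(levels(season_fct))
```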
You can assign values to variables. Your variable name can be (almost) anything. Assignment can be used to create new stand-alone variables or to add variables to a data frame.
NewVar = 5
NewVar <- 5
CityVar <- "Montreal"
MultiNewVar <- c(5, 3, 2)
MultiCityVar <- c("Montreal", "New York", "Berlin")
MultiNewVar
## [1] 5 3 2
Count <- c(0, 2, 4, 5, 6, 7)
NewCount <- Count + 1
EXERCISE 1: You can also add a variable to an existing data frame. Try to add two variables to the disney_vices data frame. The first should be called “location” and code that all the data in this data frame was collected in Montreal. The second variable should be called “year” and code that the data was collected in 2021.
disney_vices$location <- "Montreal"
disney_vices$year <- 2021
EXERCISE 2: Try to add a new column to the disney_vices data frame that multiplies each observation in the My_Rating column by two. This can be done in two ways.
disney_vices$NewRate <- 2*disney_vices$My_Rating
disney_vices$NewRate2 <- disney_vices$My_Rating + disney_vices$My_Rating
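To convince yourself that both approaches give the same result, you can ask R to compare the two columns. A minimal sketch on a stand-in data frame with made-up ratings (not the real disney_vices data):

```r
# Stand-in data frame with made-up ratings (not the real disney_vices data)
demo <- data.frame(My_Rating = c(1, 4, 7, 3))

demo$NewRate  <- 2 * demo$My_Rating               # method 1: multiply by two
demo$NewRate2 <- demo$My_Rating + demo$My_Rating  # method 2: add the column to itself

# all.equal() compares two vectors, allowing for tiny rounding differences
same_result <- isTRUE(all.equal(demo$NewRate, demo$NewRate2))
print(same_result)
```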
Notes on variables:
No spaces or special characters are allowed in variable names (underscores and periods are fine, as in disney_vices).
When you run two different commands with the same variable name, R will use the second one to overwrite the first. So make sure you do not reuse a variable name for different values.
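Both notes can be checked directly in the console. The sketch below is only an illustration: make.names() and exists() are base R functions that are not covered above, and `SomeUnusedName123` is a hypothetical name assumed not to exist in your session.

```r
# Overwriting: the second assignment replaces the first
CityVar <- "Montreal"
CityVar <- "Berlin"
print(CityVar)

# make.names() shows how R would turn an invalid name into a valid one
fixed_name <- make.names("my variable")
print(fixed_name)

# exists() checks whether a name is already in use before you assign to it
print(exists("CityVar"))
print(exists("SomeUnusedName123"))  # hypothetical name, assumed not to exist
```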
There are a couple of different ways to get descriptive statistics for your data. The summary() command provides an overview for either the whole data frame at once or a single variable at a time.
summary(disney_vices) ##provides some summary statistics about your variables
##     Movie           Length_Minutes     Tobacco_Seconds  Alcohol_Seconds
##  Length:50          Length:50          Min.   :  0.00   Min.   :  0.00
##  Class :character   Class :character   1st Qu.:  0.00   1st Qu.:  0.00
##  Mode  :character   Mode  :character   Median :  5.50   Median :  1.50
##                                        Mean   : 57.44   Mean   : 32.46
##                                        3rd Qu.: 69.25   3rd Qu.: 39.00
##                                        Max.   :548.00   Max.   :414.00
##
##    My_Rating    Animals_MainCharacters Release_Season     Avg_Viewing_HeartRate
##  Min.   :1.00   Length:50              Length:50          Min.   : 40.00
##  1st Qu.:3.00   Class :character       Class :character   1st Qu.: 65.75
##  Median :4.00   Mode  :character       Mode  :character   Median : 76.00
##  Mean   :4.18                                             Mean   : 79.40
##  3rd Qu.:5.00                                             3rd Qu.: 98.00
##  Max.   :7.00                                             Max.   :120.00
##                                                           NA's   :2
##    location              year         NewRate         NewRate2
##  Length:50          Min.   :2021   Min.   : 2.00   Min.   : 2.00
##  Class :character   1st Qu.:2021   1st Qu.: 6.00   1st Qu.: 6.00
##  Mode  :character   Median :2021   Median : 8.00   Median : 8.00
##                     Mean   :2021   Mean   : 8.36   Mean   : 8.36
##                     3rd Qu.:2021   3rd Qu.:10.00   3rd Qu.:10.00
##                     Max.   :2021   Max.   :14.00   Max.   :14.00
##
##if you have a lot of variables, you can run this on one variable at a time
summary(disney_vices$Tobacco_Seconds)
##     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     0.00    0.00    5.50   57.44   69.25  548.00
Alternatively, R lets you easily find a variable’s mean, median, range, variance, standard deviation, etc. Let’s go over some examples. NOTE: Most common descriptive statistics have a command associated with them in base R (no additional packages required). If you are ever unsure about how to obtain a descriptive statistic in R, the best thing is to Google it. R is extremely well documented online, so you’ll easily find answers to your questions. For example, you could search for “find standard deviation in R” and you’ll find lots of answers.
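As a quick illustration of the base R commands mentioned above, here are the median and range (plus a few related commands) on a stand-in vector. The values are made up and are not a real disney_vices column.

```r
# Stand-in vector with made-up values (not a real disney_vices column)
ratings <- c(1, 3, 4, 4, 5, 7)

print(median(ratings))    # middle value
print(range(ratings))     # smallest and largest value
print(min(ratings))       # smallest value only
print(max(ratings))       # largest value only
print(quantile(ratings))  # minimum, quartiles and maximum
```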
EXERCISE 3: Try to find the mean, standard deviation and variance of the My_Rating and Tobacco_Seconds variables. There are two different ways to find the standard deviation in R; try both of them.
#My_Rating
mean(as.numeric(disney_vices$My_Rating)) #mean
## [1] 4.18
##mean() is the command
##as.numeric() is an additional command that converts the variable we are about to give R to numeric (so that it is possible to obtain the mean)
##disney_vices is the data frame that we want R to call on
##$ indicates that we want R to look at a specific column within that data frame
##My_Rating is the specific column that we are interested in here
var(as.numeric(disney_vices$My_Rating)) #variance
## [1] 2.681224
sd(as.numeric(disney_vices$My_Rating)) #standard deviation
## [1] 1.637444
sqrt(var(as.numeric(disney_vices$My_Rating))) #alternative way to get standard deviation
## [1] 1.637444
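The summary() output earlier reports two missing values (NA's). By default, mean(), var() and sd() return NA when a variable contains missing values, so you need to tell R to drop them with `na.rm = TRUE`. A minimal sketch on a stand-in vector with made-up values:

```r
# Stand-in vector with made-up values and two missing entries
heart_rate <- c(72, NA, 85, 90, NA, 64)

# Without na.rm, the result is NA because missing values are present
print(mean(heart_rate))

# na.rm = TRUE drops the missing values before calculating
hr_mean <- mean(heart_rate, na.rm = TRUE)
hr_sd   <- sd(heart_rate, na.rm = TRUE)
print(hr_mean)
print(hr_sd)

# sum(is.na()) counts how many values are missing
n_missing <- sum(is.na(heart_rate))
print(n_missing)
```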
Homework for next session: please send me a data set that you would like to work on. Make sure there is no identifying information in the data (i.e., no names, emails, sensitive medical info, etc.). It’s ok if the data has not been cleaned yet, as we will cover data cleaning in the next session and you will be able to apply what you have learned. Ideally, your data set should contain at least one continuous dependent variable (such as reaction time) and at least one continuous independent variable. Please email me your data set ahead of next week’s session along with a short description. Of course, this is completely optional. If you do not feel comfortable sharing a data set or do not have access to one, you will be able to follow along using the disney_vices data set and others that I will provide later on.
Next week: Data cleaning and exploratory analysis