Session 1: Introduction

Downloading R and RStudio

Download and install R from this website: http://cran.utstat.utoronto.ca/

Regular R can be a bit tricky to use, which is why many people work in RStudio, a more user-friendly interface for R. It has all of the same functions, but rather than using only the command line, you can use the command line as well as various buttons. Please follow the instructions below to download RStudio.

  1. Go to https://www.rstudio.com/products/rstudio/download/#download

  2. Look for RStudio Desktop and click “Download”

  3. Select the appropriate installer for your operating system (Mac, Windows, etc.)

Now we are ready to start!

Getting Started

To keep the materials and documents from this tutorial organized on your computer, I recommend you create a new R project. You may do this by following the steps below:

  1. Create a new folder on your desktop (or anywhere on your computer) or just move the folder I shared to your preferred location on your computer

  2. Open RStudio and click the menu item File > New Project…

  3. Choose “Existing Directory” and navigate to your project folder

  4. Choose “Create Project”

  5. Check that a “.Rproj” file is in your project folder (created in step 1)

To get started working in RStudio, you must first open a new R script. You may do this by clicking on the button in the far left corner of the top panel. It is white and has a small green plus sign. Click on the button, then select “R Script”. Alternatively, you can hit shift + cmd + N (shift + ctrl + N on Windows).

What’s What In RStudio (clockwise)

  • Text Editor: This is the upper left window in RStudio. It is where you create your R file by typing code and comments. Whatever you see here is what is getting saved to your R script when you close the program.
  • Environment: This is the top right window in RStudio. It lists the variables, functions and data frames present in the current R session. Any data file that you load in or create in your session will appear here. You can view a data frame by clicking on its name in the list. It will open in another tab in the panel at the top of the text editor window. Alternatively, you can do this using the “View()” command.
  • Console: This is the lower left window in RStudio. It is where you give R commands and it actually works the same way you would interact with R on the command line or terminal - you give a command and R spits out an answer. Importantly, commands that only appear in the console but not in the text editor will not be saved to your script.
  • The file browser tab: The default tab in the lower right window is a basic file browser. It will show the contents of your current working directory. You can also manually change the location of your working directory here. You will notice that there are several tabs at the top. The plots tab is where any graphics generated in your script will show up. The packages tab is where you can install and call on packages (by checking the box next to the name). The help tab is where documentation will show up when you have looked something up by using “?” along with the name of a function (more on help below).

Tips & Tricks

  • To empty your environment (i.e., remove any data frames, functions, etc. that you are no longer using), use remove(list = ls())
  • R is case sensitive (e.g., the command view() does not exist but View() does).
  • To test whether something is equal to something else, use “==”. To test whether something is not equal to something else, use “!=”
  • When creating a new data frame or variable, you can assign it to a label using “<-” or “=”. Be sure to store anything you create to an easy-to-understand label.
  • The dollar sign means you are calling on something contained in something else, such as a column in a data frame. For example, to call on the participant column in our data frame, you might use “dataframe$participant”
  • Anything you type in an R script is considered code and R will try to run it. If you want to add a note, it must be preceded by a “#” symbol
  • Help: When you are not sure how to use a function in R or something is not working, you can consult the R documentation embedded in RStudio. If you are looking for help with a function from a package that is already loaded, type “?” immediately followed by the name of the function. If you are looking for help but are unsure whether the associated package is loaded, type “??” immediately followed by the name of the function. For example, the function to find the mean is a part of base R so no package is needed. If you need help with this function, you can use the code below and an explanation will appear in the window on the bottom right:
?mean

If you are not sure what package a function is in or whether the package is loaded, you can search all packages for help. For example, you can learn about the plot function using this command:

??plot

Note: R is open source and has a very large user base, so there is a lot of documentation online. As a result, a Google search for the function or error message you need help with is often the most effective solution. Also see the cheatsheets in the project folder.

  • R Markdown: This document was created using R Markdown. You can open it in your browser. It consists of both text and code elements. The chunks of code are in grey boxes and can easily be copied and pasted into your R script. R Markdown can be useful when you want to share model output, plots or explanations of your data and findings with others. It allows you to create HTML, Word or PDF documents that contain your comments in a more readable format, as well as the code you choose to include and any output this generates. Please let me know if you would like to learn more about R Markdown and I can prepare a small tutorial.
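
Several of the tips above can be tried out in a few lines. This is only a sketch: the data frame tips_df and its columns are made up for illustration and are not part of the tutorial materials.

```r
# Assign a small, made-up data frame to a label (hypothetical example data)
tips_df <- data.frame(participant = c("p01", "p02", "p03"),
                      score = c(4, 7, 7))

# The dollar sign pulls one column out of the data frame
tips_df$participant

# "==" and "!=" compare values and return TRUE or FALSE for each element
is_seven <- tips_df$score == 7
not_seven <- tips_df$score != 7
is_seven

# R is case sensitive: tips_df exists, Tips_df does not
exists("tips_df")   # TRUE
exists("Tips_df")   # FALSE

# remove() deletes objects you no longer need
remove(not_seven)
```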

Setting the Working Directory:

Whenever you open R (especially when you are not working inside a project as we are today), you’ll want to set your working directory. This means identifying the folder where any output (scripts, graphs, etc.) that you may generate will be saved.

First, check where your working directory is currently located on your computer:

getwd() 

Next, you can set your working directory to wherever you want it. I am setting my working directory to my desktop.

setwd("~/Desktop")
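
If you only want to change the working directory temporarily, it can be handy to store the old location first so that you can return to it. The sketch below assumes nothing about your own folders: it uses tempdir(), the temporary folder R creates for every session, purely as a stand-in destination.

```r
# Remember where we are right now
old_wd <- getwd()

# Move to a stand-in folder (tempdir() is R's per-session temporary folder)
setwd(tempdir())
getwd()        # now prints the temporary folder
list.files()   # lists the files in the current working directory

# Go back to where we started
setwd(old_wd)
```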

Packages:

In order to complete certain operations, you’ll often need to download a package containing the required functions. If you are using a package for the first time, you must first install it. To do this, click on packages in the bottom panel to the right of where your code appears, then click install in the top left-hand corner and type the name of the package you would like to install. Once the package is installed, you will need to “call on the package” to use it. Note that you will NOT need to reinstall your packages when you close and reopen R, but you will need to call on your packages again (i.e., using library()). For example, let’s install and call on the package “ggplot2”:

install.packages("ggplot2") 
library(ggplot2)
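
Because install.packages() only needs to run once per computer, some people check whether a package is already installed before loading it. Below is one possible way to do that with requireNamespace() from base R. The message texts are just placeholders.

```r
pkg <- "ggplot2"

# requireNamespace() returns TRUE if the package is installed, FALSE otherwise
if (requireNamespace(pkg, quietly = TRUE)) {
  # character.only = TRUE lets library() accept the name stored in pkg
  library(pkg, character.only = TRUE)
  message(pkg, " is installed and now loaded")
} else {
  message(pkg, " is not installed yet; run install.packages(\"", pkg, "\") first")
}
```

The install step is deliberately left out of the sketch so that running it never downloads anything without you asking for it.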

Data Frames

Most commonly, you will use R to examine, analyze and visualize data collected in an experiment. You can create a data frame yourself (by hand, or using functions that generate random or simulated data), but usually you would import your data into R (from an Excel sheet or text file).

Creating A Data Frame

As mentioned above, you usually won’t be creating data files from scratch. We are just doing it once here to practice. You can create a data frame using the command data.frame().

Let’s say we want to generate data for a group of students from an introductory Psychology course. There are 5 students in total and we have recorded their midterm grades and their majors.

myDataFrame = data.frame(MidtermGrade = c(20,49,100,75,80), Major = c("Psychology", "Biology", "Psychology", "Psychology", "Geography"))

Once you have run the command above, you should see “myDataFrame” appear in the Global Environment on the top right. Your N (obs.) and number of variables are also listed there.
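
To double check what R created, you can also ask for the number of rows and columns directly instead of reading them off the Environment window. A small sketch that rebuilds the same data frame so that it runs on its own:

```r
myDataFrame = data.frame(MidtermGrade = c(20, 49, 100, 75, 80),
                         Major = c("Psychology", "Biology", "Psychology",
                                   "Psychology", "Geography"))

nrow(myDataFrame)          # number of observations: 5
ncol(myDataFrame)          # number of variables: 2
myDataFrame$MidtermGrade   # prints the grades column on its own
```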

Loading In A Data Frame

If you have a data file saved on your computer, you can load it into your R session. R works best with .csv files but it is also possible to load in .txt or Excel files. You can load in a dataset by clicking on the “Import Dataset” button above the Global Environment, then clicking on “From Text” and selecting the dataset you want to load in. Alternatively, you can use a command such as the one below. Here, I am loading a dataset called Disney Vices that lists how much time Disney heroes spend consuming tobacco in different movies, the names of the movies, the length of the movies and my rating of them as a child and as an adult on a scale from 1-7.

disney_vices <- read.csv("~/Desktop/Post-Doc I/R Tutorial/disney_vices.csv")
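
If you do not have a .csv file at hand, you can try out read.csv() on a file you write yourself. The sketch below does not use the disney_vices data: it saves a tiny made-up data frame (toy_movies, a hypothetical name) to a temporary file and reads it back in.

```r
# Tiny made-up data frame, for illustration only
toy_movies <- data.frame(Movie = c("Movie A", "Movie B"),
                         Length_Minutes = c(80, 95))

# Write it to a temporary .csv file; row.names = FALSE avoids an extra column
csv_path <- tempfile(fileext = ".csv")
write.csv(toy_movies, csv_path, row.names = FALSE)

# Read it back in, exactly as you would with your own file path
toy_loaded <- read.csv(csv_path)
toy_loaded
```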

Inspecting Your Data

When you load in a new data frame, you may be interested in seeing how many variables you have, what they are called, etc. You can view your data frame in a separate window using the View() command. Just type the name of your data frame inside the parentheses, e.g. View(disney_vices).

Some data frames are too large to view conveniently in a separate window. Instead, you can have R print parts of them to the console (the window below the script you are typing in). Below are some examples:

head(disney_vices) ##shows the first few rows of your data frame as well as column names
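
head() also accepts the number of rows to print, and tail() does the same for the end of a data frame. Since the disney_vices file may not be on your computer, this sketch uses mtcars, a small data set that ships with R and is not part of the tutorial materials.

```r
first_three <- head(mtcars, 3)   # first 3 rows instead of the default 6
last_three <- tail(mtcars, 3)    # last 3 rows
first_three

dim(mtcars)                      # number of rows, then number of columns
```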

Similarly, you might only want to look at which variables are included in your data frame. To do this, you can have R list all the column names:

colnames(disney_vices)
## [1] "Movie"                  "Length_Minutes"         "Tobacco_Seconds"       
## [4] "Alcohol_Seconds"        "My_Rating"              "Animals_MainCharacters"
## [7] "Release_Season"         "Avg_Viewing_HeartRate"

To learn more about the levels of a variable, you can have R either count up the number of unique levels (i.e., categories) or print out the names of the unique categories:

length(unique(disney_vices$Release_Season)) #length counts the number of elements in whatever is inside the ()
## [1] 4
unique(disney_vices$Release_Season) # unique prints out each unique category name
## [1] "Summer" "Fall"   "Winter" "Spring"
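
If you also want to know how many observations fall into each category, table() counts them for you. The season values below are made up for illustration and do not come from the disney_vices file.

```r
# Made-up season labels (hypothetical data)
seasons <- c("Summer", "Fall", "Summer", "Winter", "Spring", "Summer")

length(unique(seasons))          # number of unique categories: 4

# table() counts how often each category occurs
season_counts <- table(seasons)
season_counts
```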

Variables

Types of Variables:

When R reads in data, it tries to identify the type of variable in each column. R will categorize your data into the following types:

  • numeric data consists of numbers such as integers (e.g., 1, -3, 33, 0) or doubles (e.g., 0.3, 12.4, -0.04, 1.0).
  • character data consists of letters or words such as “a”, “f”, “English”, “word” (e.g., categorical data).
  • logical values can take on one of two values: TRUE or FALSE. These can also be represented as 1 or 0 (e.g., accuracy scores).

For the types of statistics we usually run, we want numeric data or factors (these are normally used to group variables into a fixed number of unique categories or levels - more on that when we discuss inferential statistics). Sometimes it can be helpful to verify that a variable is in fact the right type of variable and if it is not, to transform the variable. There are a couple of ways to approach this. We can ask R what type a variable is by using either the str() or the typeof() command.

str(disney_vices$Movie) #gives you a snapshot of what the variable looks like
##  chr [1:50] "101 Dalmations" "A Troll in Central Park" "Aladdin" ...
typeof(disney_vices$Movie) #only tells you what type of variable it is
## [1] "character"

When you just want to verify that a variable is in fact numeric (or any other specific type of variable), you can ask R directly. Here, we give R a statement (as if we were asking: “Hey R, the variable “Movie” in the disney_vices dataset is numeric, right?”) and R will tell us if it is TRUE or FALSE.

is.numeric(disney_vices$Movie)
## [1] FALSE
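
If a variable turns out to be the wrong type, base R has conversion functions such as as.numeric() and as.factor(). A minimal sketch with made-up values (not taken from the tutorial data):

```r
# Numbers that were read in as text (hypothetical values)
length_text <- c("80", "95", "102")
is.numeric(length_text)           # FALSE

# Convert the text to numbers
length_num <- as.numeric(length_text)
is.numeric(length_num)            # TRUE
mean(length_num)                  # now the mean can be computed

# Turn a character variable into a factor with a fixed set of levels
season <- as.factor(c("Summer", "Fall", "Summer"))
levels(season)                    # "Fall" "Summer"
```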

Creating Variables

You can assign values to variables. Your variable name can be (almost) anything. This can be used to create new stand-alone variables or to add variables to a data frame.

  • Below are two possible options to assign numeric values to a variable called “NewVar”. These yield identical results
NewVar = 5
NewVar <- 5
  • You can also assign nominal values, such as a city name, to a variable
CityVar <- "Montreal"
  • As well, you can assign multiple numeric or nominal values to a variable name
MultiNewVar <- c(5, 3, 2)
MultiCityVar <- c("Montreal", "New York", "Berlin")
  • If you run the name of any of the variables we have just created, R will print out the contents. Your workspace now also contains these items.
MultiNewVar
## [1] 5 3 2
  • You can perform various mathematical operations on existing variables. For example, some experiment software might begin counts with 0, rather than 1. When this is the case, we usually want to add 1 to every observation to make the count more intuitive (i.e., have it start with 1)
Count <- c(0, 2, 4, 5, 6, 7)
NewCount <- Count + 1

EXERCISE 1: You can also add a variable to an existing data frame. Try to add two variables to the disney_vices data frame. The first should be called “location” and code that all the data in this data frame was collected in Montreal. The second variable should be called “year” and code that the data was collected in 2021.

disney_vices$location  <- "Montreal"
disney_vices$year <- 2021

EXERCISE 2: Try to add a new column to the disney_vices data frame that multiplies each observation in the My_Rating column by two. This can be done in two ways.

disney_vices$NewRate <- 2 * disney_vices$My_Rating
disney_vices$NewRate2 <- disney_vices$My_Rating + disney_vices$My_Rating

Notes on variables:

  • Variable names cannot contain spaces, and the only special characters allowed are periods and underscores

  • When you run two different commands with the same variable name, R will use the second one to overwrite the first. So make sure you do not duplicate variable names if they contain different values
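
Both notes can be checked in a couple of lines. The variable names here are arbitrary examples.

```r
# Running a second command with the same name overwrites the first value
reaction_time <- 10
reaction_time <- 25   # same name again: the 10 is gone
reaction_time         # 25

# Underscores work in names (as in disney_vices); spaces do not
reaction_time_ms <- 250
# reaction time ms <- 250   # would give an error because of the spaces
```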

Descriptive Statistics

There are a couple of different ways to get descriptive statistics for your data. The summary() command provides an overview for either the whole data frame at once or a single variable at a time.

summary(disney_vices) ##provides some summary statistics about your variables
##     Movie           Length_Minutes     Tobacco_Seconds  Alcohol_Seconds 
##  Length:50          Length:50          Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Mode  :character   Median :  5.50   Median :  1.50  
##                                        Mean   : 57.44   Mean   : 32.46  
##                                        3rd Qu.: 69.25   3rd Qu.: 39.00  
##                                        Max.   :548.00   Max.   :414.00  
##                                                                         
##    My_Rating    Animals_MainCharacters Release_Season     Avg_Viewing_HeartRate
##  Min.   :1.00   Length:50              Length:50          Min.   : 40.00       
##  1st Qu.:3.00   Class :character       Class :character   1st Qu.: 65.75       
##  Median :4.00   Mode  :character       Mode  :character   Median : 76.00       
##  Mean   :4.18                                             Mean   : 79.40       
##  3rd Qu.:5.00                                             3rd Qu.: 98.00       
##  Max.   :7.00                                             Max.   :120.00       
##                                                           NA's   :2            
##    location              year         NewRate         NewRate2    
##  Length:50          Min.   :2021   Min.   : 2.00   Min.   : 2.00  
##  Class :character   1st Qu.:2021   1st Qu.: 6.00   1st Qu.: 6.00  
##  Mode  :character   Median :2021   Median : 8.00   Median : 8.00  
##                     Mean   :2021   Mean   : 8.36   Mean   : 8.36  
##                     3rd Qu.:2021   3rd Qu.:10.00   3rd Qu.:10.00  
##                     Max.   :2021   Max.   :14.00   Max.   :14.00  
## 
##if you have a lot of variables, you can run this on one variable at a time
summary(disney_vices$Tobacco_Seconds)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    5.50   57.44   69.25  548.00

Alternatively, R lets you easily find a variable’s mean, median, range, variance, standard deviation, etc. Let’s go over some examples. NOTE: Most common descriptive statistics have a command associated with them in base R (no additional packages required). If you are ever unsure about how to obtain a descriptive statistic in R, the best thing is to Google it. R code is extremely well documented online so you’ll easily find answers to your questions. For example, you could type in “find standard deviation in R” and you’ll find lots of answers.

EXERCISE 3: Try to find the mean, standard deviation and variance of the My_Rating and Tobacco_Seconds variables. There are two different ways to find the standard deviation in R; try both of them.

#My_Rating
mean(as.numeric(disney_vices$My_Rating)) #mean
## [1] 4.18
##mean() is the command
##as.numeric() is an additional command that converts the variable we give it to numeric, in case R read it in as text (so that it is possible to obtain the mean)
##disney_vices is the data frame that we want R to call on
##$ indicates that we want R to look at a specific column within that data frame
##My_Rating is the specific column that we are interested in here
var(as.numeric(disney_vices$My_Rating)) #variance
## [1] 2.681224
sd(as.numeric(disney_vices$My_Rating)) #standard deviation
## [1] 1.637444
sqrt(var(as.numeric(disney_vices$My_Rating))) #alternative way to get standard deviation
## [1] 1.637444
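
One thing to watch for: several of these commands return NA when the variable contains missing values (the summary above shows 2 NA's in Avg_Viewing_HeartRate). The argument na.rm = TRUE tells R to drop the missing values first. A sketch with made-up heart rates, not the values from the data set:

```r
# Made-up heart rates with one missing value (hypothetical data)
heart_rate <- c(60, 75, NA, 90)

mean(heart_rate)                   # NA, because of the missing value
mean(heart_rate, na.rm = TRUE)     # 75
median(heart_rate, na.rm = TRUE)   # 75
range(heart_rate, na.rm = TRUE)    # 60 90
sd(heart_rate, na.rm = TRUE)       # 15
```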

Homework for next session: please send me a data set that you would like to work on. Make sure there is no identifying information in the data (i.e., no names, emails, sensitive medical info, etc.). It’s ok if the data has not been cleaned yet: we will cover data cleaning in the next session, so you will be able to apply what you have learned. Ideally, your data set should contain at least one continuous dependent variable (such as reaction time) and at least one continuous independent variable. Please email me your data set ahead of next week’s session along with a short description. Of course, this is completely optional. If you do not feel comfortable sharing a data set or do not have access to one, you will be able to follow along using the disney_vices data set and others that I will provide later on.

Next week: Data cleaning and exploratory analysis