Session 2: Data Cleaning and Exploratory Analysis (EDA)

We will continue to use the disney_vices data set in this session. So please start by loading this in to you R session.

We will also be using an additional data set called season_info, so load that in as well.

Let’s also add in a chunk of code to ensure that our environment is clean, we are working in the correct directory and all required packages are installed and loaded.

#clean environment
remove(list = ls())

#check working directory
getwd() 
## [1] "/Users/naominewmacbook/Desktop/Post-Doc I/R Tutorial"
#load in data
disney_vices <- read.csv("~/Desktop/Post-Doc I/R Tutorial/disney_vices.csv")
season_info <- read.csv("~/Desktop/Post-Doc I/R Tutorial/season_info.csv")

#load packages (we won't need all of them today, but it's good practice to always begin by loading in some frequently used packages)
library(lme4)
library(lmerTest)
library(dplyr)
library(plyr)
library(ggplot2)
library(effects)
library(png)

#tell R not to use scientific notation
options(scipen=999)

Data Cleaning

R offers powerful data cleaning tools that can be used to transform large, messy data sets (e.g., data scraped from websites, corpora, user data, etc.) into easy-to-analyze data frames. When working with R, you will often encounter the term Tidyverse, which refers to a collection of R packages for data cleaning. Generally, when data scientists speak about tidy data (i.e., the goal of the tidyverse), it means that the following criteria are met:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table

Source: R for Data Science

Most of the experimental software we use in psycholignuistics generate relatively clean data. Nevertheless,when collecting experimental data, we sometimes end up with imperfect data that we want to exclude from our analysis (e.g., when a participant didn’t finish a task or a stimulus was not displaying properly). Similarly, we may want to group our data to make it easier to use for analysis or reporting it in a manuscript (i.e., comparing younger and older adults). R provides easy to use tools to reshape, clean, subset and recode your data. We will go over some of the most commonly used ones below using examples from the disney_vices dataset

Goal for today’s session: We want to gather all of the information about the disney movies and do an exploratory data analysis. We have two separate data sets, the disney vices file contains information organized by movie and the season info file contains information organized by season. In order to combine them, we must first clean and streamline the data.

First, let’s take a look at the data in season_info.csv

head(season_info)

If we want to combine this data file with the disney vices file, we must identify at least one variable that is common to both data files (in R, this is often refered to as the “join by” column). In this case, both data files contain a column that codes for the season in which a movie was released. However, the values are not identical, which will cause the joining of the two data files to fail. Specifically, the Season variable in the season_info file differs from the Release_Season variable in disney_vices in three important ways: column name, spelling of specific observations, labels used for specific observations. Below, we will go over the data cleanign steps required to handle each of these issues.

1. Column Names:

Let’s adopt the labels used in disney_vices and make the following transformations accordingly.

This command follows a similar logic to the subsetting command and recoding variables commands below that also makes use of the square brackets.

names(season_info)[names(season_info) == 'Season'] <- 'Release_Season'

# names(season_info) looks into all the names in the season_info
# [names(season_info) == 'Season'] extracts the variable name you want to check
# <- 'Release_Season' assigns the new variable name

To check if this worked, just use the colnames() command on season_info to see all current column names

colnames(season_info)
## [1] "Release_Season"     "TotalMovieReleases" "AvgTemp"

2. Recoding Variables:

You or a participant might have made minor coding mistakes when entering information during data collection. In our case, the season_info contains a different spelling of “spring” and uses the word “Autumn” instead of “Fall”. Before we can combine season_info with disney_vices, we must streamline this.

season_info$Release_Season[season_info$Release_Season == "Autumn"] <- "Fall"
season_info$Release_Season[season_info$Release_Season == "spring"] <- "Spring"

Excercise 1: Inspect the “Length_Minutes” column in the disney vices dataframe using one or more of the commands we have discussed. What potential issue to you see?

str(disney_vices$Length_Minutes) ##this variable codes duration. Therefore it should be numeric rather than character. This indicates that something might be off
##  chr [1:50] "79" "76" "90" "75" "89" "82" "77" "75" "94" "74" "74" "69" ...
unique(disney_vices$Length_Minutes) ## we can now see that one of the numbers is written as a word
##  [1] "79"          "76"          "90"          "75"          "89"         
##  [6] "82"          "77"          "94"          "74"          "69"         
## [11] "84"          "64"          "120"         "92"          "78"         
## [16] "73"          "72"          "71"          "88"          "81"         
## [21] "83"          "80"          "70"          "seventy-one" "86"

How can we fix this?

disney_vices$Length_Minutes[disney_vices$Length_Minutes == "seventy-one"] <- 71 
# within the disney_vices column called Length_Minutes, find only the observation that matches "seventy-one", then replace this observation with 71
unique(disney_vices$Length_Minutes) # have R print out the unique values to check if it worked
##  [1] "79"  "76"  "90"  "75"  "89"  "82"  "77"  "94"  "74"  "69"  "84"  "64" 
## [13] "120" "92"  "78"  "73"  "72"  "71"  "88"  "81"  "83"  "80"  "70"  "86"
disney_vices$Length_Minutes <- as.numeric(disney_vices$Length_Minutes) # convert the variable to numeric 

3. Merging/ Combining Data:

We are now ready to add the season_info data frame to the disney_vices data frame. Specifically, we will join the two data frames using the release season.

We will use the left_join() command, which can detect the best column to join by. However, I would recommend specifying this column. This can help identify possible sources of error when dealing with larger data files.

Note that this is a dplyr function. As mentioned above, there are other ways to accomplish the same thing, but I find the dplyr package approach to be most intuitive

disney_vices_compelete <- left_join(disney_vices, season_info, by = "Release_Season") 
# the result of the left_join command is being assigned to the label disney_vices_compelete
# the order in which the data frames are listed within the parenthesis is important

4. Missing Data:

R codes missing data (e.g., when a participant skips a question or forgets to press a button in time) as NA. When a column contains NA values, we cannot run descriptive statistics on it. Depending on why data is missing, we can either replace NA with a 0 (or another value, if appropriate) or eliminate the whole row

First, we must identify which columns contain missing data. This command will print out the rows that contain missing data in one or more columns. The output has two rows. When inspecting each of the columns, you will see that both of those rows are missing an observation in the Avg_Viewing_HeartRate column

disney_vices_compelete[!complete.cases(disney_vices_compelete),]

In order to be able to obtain the mean, SD, etc. for this variable, we must remove the cells containing missing data. We can either remove them permanently from the data frame or only remove them for a specific operation.

To create new dataset without any missing data (containing 48, rather than 50 rows):

newdata <- na.omit(disney_vices_compelete)

Only remove missing data from a specific column:

newdata1 <- disney_vices_compelete[!is.na(disney_vices_compelete$Avg_Viewing_HeartRate),]

Remove missing data in a specific column for the purpose or running a command (does not remove missing data from the data frame):

mean(na.omit(disney_vices$Avg_Viewing_HeartRate))
## [1] 79.39583

5. Reorganizing Data:

We may want to regroup or reorganize data in one column based on values in another column. For example, let’s say we want to create a variable that groups movies based on how much alcohol and tobacco is seen in them. To do this, we can draw on information from the Alcohol_Seconds and tobacco_Seconds columns. We will categorize movies that show alcohol OR tobacco for more than the median number of seconds as “high vices” and movies that show alcohol AND tobacco for less than the median number of seconds as “low vices”. The new variables will be called vices_category

disney_vices_compelete$vices_category <- ifelse(disney_vices_compelete$Tobacco_Seconds > median(disney_vices_compelete$Tobacco_Seconds)|disney_vices_compelete$Alcohol_Seconds > median(disney_vices_compelete$Alcohol_Seconds), "high vices", "low vices")
#to the left of the arrow, we define the new column name in the df
#read the ifelse command as "if the a given observation in the column called tobacco_seconds within disney_vices_compelete is larger than the median of the same column OR (|) if a given observation in the column called Alcohol_seconds within disney_vices_compelete is larger than the median of the same column, then label this observation "high vices". If this is not the case, label the observation "low vices"

If we wanted to categorize movies that show alcohol AND tobacco for more than the median number of seconds as “high vices” and movies that show alcohol AND tobacco for less than the median number of seconds as “low vices”, we only have to switch the “|” for a “&”

disney_vices_compelete$vices_category2 <- ifelse(disney_vices_compelete$Tobacco_Seconds > median(disney_vices_compelete$Tobacco_Seconds) & disney_vices_compelete$Alcohol_Seconds > median(disney_vices_compelete$Alcohol_Seconds), "high vices", "low vices")

6. Subsetting:

The process of retrieving just the parts of large files which are of interest for a specific purpose. To do this, we tell R to only include the observation where a certain condition is met. In fact, there are multiple ways to do this (you’ll find that this is very often the case in R). Below, I will show two commonly used subsetting commands. For example, let’s say we only want to look at movies released in the Fall. The command below is telling R that we want to create a new data frame that consists only of observations where the release season column indicates that the movie was released in the fall

Fall <- subset(disney_vices_compelete, Release_Season == "Fall")
#Fall is the name we assign to the new subset
#subset is the command
#disney_vices_compelete is the original data frame that we are pulling from
#Release_Season is the name of the column that contains the information relevant to our subset
#Release_Season == "Fall" describes the condition that must be met in order 

Alternatively, you can use this command to get the same result

Fall <- disney_vices_compelete[disney_vices_compelete$Release_Season == "Fall", ]

Excercise 2: Create a subset that only includes movies with animals as the main character that were released in the Summer or Winter

Excercise 3: Create a subset that only includes movies with the words “cat” or “dog” in the title that are more than 75 minutes long

7. Summarizing/ aggregating data:

Once our data is cleaned up, we can reformat it to make it easier to look at summary statistics and plot it. You can use the ddply function to do summarize your data (with mean, standard deviation, etc.), broken down by group. For example, we can aggregate our data to look at the mean and standard deviation of movie ratings by release season and vices category

SummaryTable <- ddply(disney_vices_compelete, .(Release_Season, vices_category), summarise, 
    N=length(unique(Movie)),
    Mean_Rating=mean(My_Rating),
    SD_Rating=sd(My_Rating),
    SE=SD_Rating/sqrt(N))

You can flexibly add different variables and functions to this command. Just be sure to assign unique and informative names to each.

Next week: data visualization: use of the ggplot package to create figures that may be used in presentations and manuscripts