Chapter 14 New year, new data - Workshop 4 - Athletes, Week 2

Good news! There are only three chapters of content remaining. Each chapter from this point onwards is associated with a timetabled 2 hour workshop. However if you don’t complete a chapter in the timetabled slot you are expected to complete it at home. Here we have material for Workshop 4 in Week 2 of Semester 2.

14.1 A new project

We will be working on a new data set this term to walk you through a full simple analysis workflow. This will help you prepare for the Skills for Biologists portfolio task and act as a good primer for completing other reports that you will be asked to produce throughout your university career. Over the next three workshops we will work through the complete analytical workflow needed to analyse this data set. You can ask for feedback from either myself or demonstrators during your workshop sessions.

In this project you are interested in exploring the differences and relationships in weight, height and red blood cell in a group of male and female athletes.

14.1.1 Project setup

Generally I recommend starting a new R Project every time you start working on new and unrelated data sets. Go to posit Cloud, open a New RStudio Project and name it athletes. Spend a few minutes setting up your workspace, refer back to Chapters 4.2 to remind yourself of how to do this. Remember you will need to create sensible places to save scripts, data and figures. You will also need to freshly install and load any required packages.

For this workshop series you will need the following packages:

tidyverse
patchwork

14.1.2 Script set up

Its fairly safe to say we will be creating some new scripts so open a new R script and set it up as described in Chapter 4.3. The project title will be An analysis of red blood cell counts among athletes. Make sure you have used library() to load your freshly installed packages.

14.1.3 The data

The data set we will be working on was collected from 202 Australian athletes (102 identified as male (m), 100 identified as female (f)) (data set adapted from Telford and Cunningham, 1991).

The following variables were measured for each athlete:

height (ht) was recorded in cm
weight (wt) was recorded in kg
red blood cell count (rcc) in 10¹²l^-1

You can download the new data set here. When downloaded you will need to unzip the file, save it somewhere sensible on your computer and then load it in to posit cloud (see Chapter 4.4 if you are unsure). If you have trouble downloading the data via the link I have also uploaded it to the Workbook and Data Sets folder in the Data Sciences Learning Module in your Skills module blackboard page, the file is called athletes.csv.

14.2 Check the data

Once you have imported and loaded your athletes.csv spreadsheet into your new r project you will need to spend a few minutes checking it over to make sure that R has correctly identified and formatted your variables.

14.2.1 Task 1 - Make sure your data are clean and tidy

First of all make sure your data have imported correctly, do you have the expected numbers of rows and columns?

Hint - consider using functions such as ncol(), nrow(), colnames() to check this, for more details on how consider revisiting Chapters 4 and 6.

Then check that R has correctly identified the data types for each variable. Do you need to adjust any variables to factors?

Hint - consider using functions such as head() to check and as.factor() to identify variables as factors where required, consider revisiting Chapters 4.4 and 8.1 for further assistance

Now you can check your variable names, are they nice and concise, do you want to change them?

Hint - you can use the rename() function, as seen in Chapter 4.5 to change variable names if you wish to.

Now we know our that R has read our data correctly and we have all of our variables named as we would like we can run a few further checks. Frequently you will find that your data have originally been manually entered into a spreadsheet, maybe from hand drafted notes in a field or lab notebook. Every time something is copied there is the opportunity for error. Maybe a row has been accidentally duplicated, or there is a typo or maybe some data is missed out altogether. There are a few tricks we can use to check for each of these and to make sure we are confident in the fidelity of our dataset.

Checking for duplicates - When you are manually entering data into a spreadsheet, it is very easy to accidentally enter the same row twice. With very large data sets this is something that is very hard to pick out by eye. Thankfully R has some very useful functions to check for this. Try running;

athletes %>%
  duplicated() # check for duplicated rows

Note that %>% used here is just another notation for piping, rather like the + that ggplot uses. This means that the output from one function is fed directly into the next function.

This chunk of code will spit out a long list of TRUE (row is duplicated) or FALSE (row is not duplicated) statements. Again not very human readable, especially if we have a very large data set. Try amending the code to;

athletes %>%
  duplicated() %>% # check for duplicated rows
  sum() # Sums any TRUE statements in the list

Do we have any duplicated rows?

Checking for typos - As with duplicates it is very easy to enter a typo when manually entering data into a spreadsheet. Generally if you have been collecting continuous data you will have an idea of what a sensible upper and lower bound within your data set should be. We can use the summarise() function to see what these are within this data set, as shown below;

athletes %>% 
  summarise(min=min(wt, na.rm=TRUE), # reports the minimum value in the wt variable and ignores cells with N/A
            max=max(wt, na.rm=TRUE)) # reports the maximum value in the wt variable and ignores cells with N/A

Try manipulating the above chunk to report the upper and lower bounds for the height variable in your data set. Do these values all look reasonable to you?

But how can we check for typos in categorical data? We can use the distinct() function to identify all of the options stored under a categorical variable name. Try using the following chunk;

athletes %>% 
  distinct(sex) # reports the categories stored under sex

So do you think there are any potential typos in your data set?

Checking for missing data - finally sometimes when entering data manually, you may miss or delete a spreadsheet cell by mistake, leaving it empty. Again this is really difficult to spot by eye in a large data set. Try running the following chunk of code, can you work out what each line does? Try adding comments to it in your script yourself.

athletes %>% 
  is.na() %>% 
  sum()

Click-me to check your code interpretation

So here you are piping your initial data set athletes into the is.na() function which is then looking for cells containing N/A and reporting a TRUE/FALSE data frame (where TRUE indicates N/A). We are then piping that output straight into the sum() function which is summing the number of TRUE values.

Do we have any missing data?

14.3 Explore the data

We finished last terms workshops by looking at how we can cannibalise code to fit our purposes. We will be spending the first part of todays workshop going over a few of these skills on our shiney new data set! It may be helpful to look through some of the scripts in Chapter 13 if you get stuck.

14.3.1 Task 2 - Look at your data distributions

Its always helpful to look at how your data is distributed, as this could have implication for how you tackle downstream analysis.

Create a histogram for the following variables, start with a 6 bins and then make a second set of histogams with 20 bins:

Athlete height
Athlete weight
Athlete red blood cell count

Pitt Stop; pause your analysis and check your interpretation Compare your six histograms; How would you describe each plot? How does changing the bin size effect the shape of the plot? Look at your histogram for red blood cell count with 20 bins, why do you think it looks like this?

Click-me to check your interpretation

Your histograms for height and weight both appear normally distributed. Changing the number of bins doesn’t really alter this. However when you look at the histogram for red blood cell count with 6 bins it looks like it might be skewed, with 20 bins however we get more resolution over the data distribution and two peaks appear. The red blood cell count data are bi-modal. This suggest that we may need to think about our data a little more. These histograms were performed on all the data, which includes both male and female athletes, the bi-modal distribution suggests we should consider splitting the data and analysing male and female athletes separately.

So lets have a think about how we could split out data set up to examine male and female athletes separately. Try using this chunk;

# Pipe your data set to the filter function, this will pull out the male athletes and store then in a new object.
male_athletes <- athletes %>%
  filter(sex == "m")

Make a similar data frame for female athletes using the above chunk. Then create two new histograms for:

Male athlete red blood cell count
Female athlete red blood cell count

Pitt Stop; pause your analysis and check your interpretation How would you describe these two new histograms?

14.3.2 Task 3 - Look for differences and relationships

Now that we have a better idea for how our data are distributed we can start to ask some questions of the data set. First of all think, what kind of plot would you build to explore the following questions? We covered this in Chapter 7.

Is there a difference between male and female athlete height?
Is there a difference between male and female athlete weight?
Is there a difference between male and female athlete red blood cell count?
Is there a relationship between height and weight in male and female athletes?
Is there a relationship between red blood cell count and weight in male and female athletes?

Click-me to check your plot choices

Is there a difference between male and female athlete height? - box plot or bar chart
Is there a difference between male and female athlete weight? - box plot or bar chart
Is there a difference between male and female athlete red blood cell count? - box plot or bar chart
Is there a relationship between height and weight in male and female athletes? - scatter plot
Is there a relationship between red blood cell count and weight in male and female athletes? - scatter plot

If you are unsure about any of the logic behind these choices - please ask a demonstrator

Build a draft plot to explore each of your questions.

For your scatter plots add some code to colour the points by sex and add a regression line using geom_smooth().

If you would like to make your axes to be super fancy with full superscript lettering you can incorporate this line of code to describe your axis labels;

labs(x="Weight (kg)\n", y = expression(Red~Blood~Cell~Count~" "~10^{12}~l^{-1})) +

Pitt Stop; pause your analysis and check your interpretation How would you interpret each of these plots? Write down a sentence or two to describe the pattern displayed in each. Then look at the scatter plot that explores the relationhsip between red blood cell count and weight in male and female athletes. What does the regression line indicate? Is there a relationship between weight and red blood cell count or could there be another variable effecting the relationship? Remember we know the red blood cell count data are bimodal, how might this effect what we are seeing?

Click-me to check your interpretation

Just from glancing at the data, median weight and height are slightly higher in male athletes, this difference is greater when looking at red blood cell counts. There appears to be a positive relationship between weight and height in both males and female athletes, i.e. if we plotted male and female athletes separately, it looks like there would still be a positive relationship between height and weight (we will do this in a moment). When looking at the relationship between red blood cell count and weight in male and female athletes, initially there appears to be a positive relationship, but looking at the positioning of male and female points it looks like within each sex the relationship may be less strong. If we think back to our histograms, the red blood cell count data followed a bi-modal distribution, so we need to think about how we explore these data a bit more. In this scatter plot we cant really tell if either of the variables weight or sex explain most of the variation in red blood cell count. So we need to make some more plots to look at the relationship between weight and red blood cell count in male and female athletes separately.

So it looks like we need to tease our scatter plots apart a little more to make sure our understanding of the data is good. Have a go at making plots for the following four questions (make sure you include a regression line);

Is there a relationship between height and weight in female athletes?
Is there a relationship between height and weight in male athletes?
Is there a relationship between red blood cell count and weight in female athletes?
Is there a relationship between red blood cell count and weight in male athletes?

Pitt Stop; pause your analysis and check your interpretation Has the relationship changed between weight and height in male and female athletes? Has the relationship changed between red blood cell count and weight in male and female athletes? Do you understand why it was important to explore these scatter graphs a little more? Do you understand why seeing the bi-modal distribution in our histograms was a clue, suggesting that we needed to explore red blood cell count in males and females individually?

Click-me to check interpretation

Essentially, the relationship between height and weight in male and female athletes has not changed, and we didn’t really expect it too. By looking at these scatter plots we can see that height explains most of the variation in weight in this data set, makes sense right? Taller people are usually going to be heavier then shorter people. When we look at the relationship between weight and red blood cell count in male and female athletes separately we now see that there is, essentially, no relationship. So weight doesn’t explain the variation in red blood cell count, heavier people don’t necessarily have more red blood cells. So the positive relationship we saw in our first scatter plot was likely explained by sex not weight.

14.4 Wrapping up

In this chapter we have loaded a new data set into a new R project. We have checked over the data to makes sure that its been loaded correctly and that there are no duplicated rows, typos or missing data. We have then moved on further to exploring the data set, looking at its shape/distribution and identifying any early patterns. As a really important take home message you can hopefully see and appreciate why its important to think about interpretation at each stage of your analysis. If we had not broken our data set down and analysed male and female athletes separately, we may have ended up drawing false conclusions from our data set and this could have derailed much or our downstream analysis as well.

14.5 Before you leave!

Log out of posit Cloud and make sure you save your script!

14.6 References

Pedersen, T. L. (2020). Patchwork: The composer of plots. https://CRAN.R-project.org/package=patchwork
Telford, R.D. and Cunningham, R.B. 1991. Sex, sport and body-size dependency of hematology in highly trained athletes. Medicine and Science in Sports and Exercise 23: 788-794.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, and Dewey Dunnington. 2021. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.