Data Frames

While R has many different data types, the one that is central to much of the power and popularity of R is the data.frame. A data.frame looks a bit like an R matrix in that it has two dimensions, rows and columns. However, data.frames are usually viewed as a set of columns representing variables and the rows representing the values of those variables. Importantly, a data.frame may contain different data types in each of its columns; matrices must contain only one data type. This distinction is important to remember, as there are specific approaches to working with R data.frames that may be different than those for working with matrices.

Learning goals

Understand how data.frames are different from matrices.
Know a few functions for examing the contents of a data.frame.
List approaches for subsetting data.frames.
Be able to load and save tabular data from and to disk.
Show how to create a data.frames from scratch.

Learning objectives

Load the yeast growth dataset into R using read.csv.
Examine the contents of the dataset.
Use subsetting to find genes that may be involved with nutrient metabolism and transport.
Summarize data measurements by categories.

Dataset

The data used here are borrowed directly from the fantastic Bioconnector tutorials and are a cleaned up version of the data from Brauer et al. Coordination of Growth Rate, Cell Cycle, Stress Response, and Metabolic Activity in Yeast (2008) Mol Biol Cell 19:352-367. These data are from a gene expression microarray, and in this paper the authors examine the relationship between growth rate and gene expression in yeast cultures limited by one of six different nutrients (glucose, leucine, ammonium, sulfate, phosphate, uracil). If you give yeast a rich media loaded with nutrients except restrict the supply of a single nutrient, you can control the growth rate to any rate you choose. By starving yeast of specific nutrients you can find genes that:

Raise or lower their expression in response to growth rate. Growth-rate dependent expression patterns can tell us a lot about cell cycle control, and how the cell responds to stress. The authors found that expression of >25% of all yeast genes is linearly correlated with growth rate, independent of the limiting nutrient. They also found that the subset of negatively growth-correlated genes is enriched for peroxisomal functions, and positively correlated genes mainly encode ribosomal functions.
Respond differently when different nutrients are being limited. If you see particular genes that respond very differently when a nutrient is sharply restricted, these genes might be involved in the transport or metabolism of that specific nutrient.

The dataset can be downloaded directly from:

brauer2007_tidy.csv

We are going to read this dataset into R and then use it as a playground for learning about data.frames.

Reading in data

R has many capabilities for reading in data. Many of the functions have names that help us to understand what data format is to be expected. In this case, the filename that we want to read ends in .csv, meaning comma-separated-values. The read.csv() function reads in .csv files. As usual, it is worth reading help('read.csv') to get a better sense of the possible bells-and-whistles.

The read.csv() function can read directly from a URL, so we do not need to download the file directly. This dataset is relatively large (about 16MB), so this may take a bit depending on your network connection speed.

ydat <- read.csv('https://raw.githubusercontent.com/bioconnector/workshops/master/data/brauer2007_tidy.csv')

Our variable, ydat, now “contains” the downloaded and read data. We can check to see what data type read.csv gave us:

class(ydat)

## [1] "data.frame"

Inspecting data.frames

Our ydat variable is a data.frame. As I mentioned, the dataset is fairly large, so we will not be able to look at it all at once on the screen. However, R gives us many tools to inspect a data.frame.

Overviews of content
- head() to show first few rows
- tail() to show last few rows
Size
- dim() for dimensions (rows, columns)
- nrow()
- ncol()
- object.size() for power users interested in the memory used to store an object
Data and attribute summaries
- colnames() to get the names of the columns
- rownames() to get the “names” of the rows–may not be present
- summary() to get per-column summaries of the data in the data.frame.

head(ydat)

##   symbol systematic_name nutrient rate expression                           bp
## 1   SFB2         YNL049C  Glucose 0.05      -0.24        ER to Golgi transport
## 2   <NA>         YNL095C  Glucose 0.05       0.28   biological process unknown
## 3   QRI7         YDL104C  Glucose 0.05      -0.02 proteolysis and peptidolysis
## 4   CFT2         YLR115W  Glucose 0.05      -0.33      mRNA polyadenylylation*
## 5   SSO2         YMR183C  Glucose 0.05       0.05              vesicle fusion*
## 6   PSP2         YML017W  Glucose 0.05      -0.69   biological process unknown
##                              mf
## 1    molecular function unknown
## 2    molecular function unknown
## 3 metalloendopeptidase activity
## 4                   RNA binding
## 5              t-SNARE activity
## 6    molecular function unknown

tail(ydat)

##        symbol systematic_name nutrient rate expression
## 198425   DOA1         YKL213C   Uracil  0.3       0.14
## 198426   KRE1         YNL322C   Uracil  0.3       0.28
## 198427   MTL1         YGR023W   Uracil  0.3       0.27
## 198428   KRE9         YJL174W   Uracil  0.3       0.43
## 198429   UTH1         YKR042W   Uracil  0.3       0.19
## 198430   <NA>         YOL111C   Uracil  0.3       0.04
##                                                bp
## 198425    ubiquitin-dependent protein catabolism*
## 198426      cell wall organization and biogenesis
## 198427      cell wall organization and biogenesis
## 198428     cell wall organization and biogenesis*
## 198429 mitochondrion organization and biogenesis*
## 198430                 biological process unknown
##                                         mf
## 198425          molecular function unknown
## 198426 structural constituent of cell wall
## 198427          molecular function unknown
## 198428          molecular function unknown
## 198429          molecular function unknown
## 198430          molecular function unknown

dim(ydat)

## [1] 198430      7

nrow(ydat)

## [1] 198430

ncol(ydat)

## [1] 7

colnames(ydat)

## [1] "symbol"          "systematic_name" "nutrient"        "rate"           
## [5] "expression"      "bp"              "mf"

summary(ydat)

##     symbol          systematic_name      nutrient              rate       
##  Length:198430      Length:198430      Length:198430      Min.   :0.0500  
##  Class :character   Class :character   Class :character   1st Qu.:0.1000  
##  Mode  :character   Mode  :character   Mode  :character   Median :0.2000  
##                                                           Mean   :0.1752  
##                                                           3rd Qu.:0.2500  
##                                                           Max.   :0.3000  
##    expression             bp                 mf           
##  Min.   :-6.500000   Length:198430      Length:198430     
##  1st Qu.:-0.290000   Class :character   Class :character  
##  Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.003367                                        
##  3rd Qu.: 0.290000                                        
##  Max.   : 6.640000

In RStudio, there is an additional function, View() (note the capital “V”) that opens the first 1000 rows (default) in the RStudio window, akin to a spreadsheet view.

View(ydat)

Accessing variables (columns) and subsetting

In R, data.frames can be subset similarly to other two-dimensional data structures. The [ in R is used to denote subsetting of any kind. When working with two-dimensional data, we need two values inside the [ ] to specify the details. The specification is [rows, columns]. For example, to get the first three rows of ydat, use:

ydat[1:3, ]

##   symbol systematic_name nutrient rate expression                           bp
## 1   SFB2         YNL049C  Glucose 0.05      -0.24        ER to Golgi transport
## 2   <NA>         YNL095C  Glucose 0.05       0.28   biological process unknown
## 3   QRI7         YDL104C  Glucose 0.05      -0.02 proteolysis and peptidolysis
##                              mf
## 1    molecular function unknown
## 2    molecular function unknown
## 3 metalloendopeptidase activity

Note how the second number, the columns, is blank. R takes that to mean “all the columns”. Similarly, we can combine rows and columns specification arbitrarily.

ydat[1:3, 1:3]

##   symbol systematic_name nutrient
## 1   SFB2         YNL049C  Glucose
## 2   <NA>         YNL095C  Glucose
## 3   QRI7         YDL104C  Glucose

Because selecting a single variable, or column, is such a common operation, there are two shortcuts for doing so with data.frames. The first, the $ operator works like so:

# Look at the column names, just to refresh memory
colnames(ydat)

## [1] "symbol"          "systematic_name" "nutrient"        "rate"           
## [5] "expression"      "bp"              "mf"

# Note that I am using "head" here to limit the output
head(ydat$symbol)

## [1] "SFB2" NA     "QRI7" "CFT2" "SSO2" "PSP2"

# What is the actual length of "symbol"?
length(ydat$symbol)

## [1] 198430

The second is related to the fact that, in R, data.frames are also lists. We subset a list by using [[]] notation. To get the second column of ydat, we can use:

head(ydat[[2]])

## [1] "YNL049C" "YNL095C" "YDL104C" "YLR115W" "YMR183C" "YML017W"

Alternatively, we can use the column name:

head(ydat[["systematic_name"]])

## [1] "YNL049C" "YNL095C" "YDL104C" "YLR115W" "YMR183C" "YML017W"

Some data exploration

There are a couple of columns that include numeric values. Which columns are numeric?

class(ydat$symbol)

## [1] "character"

class(ydat$rate)

## [1] "numeric"

class(ydat$expression)

## [1] "numeric"

Make histograms of: - the expression values - the rate values

What does the table() function do? Could you use that to look a the rate column given that that column appears to have repeated values?

What rate corresponds to the most nutrient-starved condition?

More advanced indexing and subsetting

We can use, for example, logical values (TRUE/FALSE) to subset data.frames.

head(ydat[ydat$symbol == 'LEU1', ])

##      symbol systematic_name nutrient rate expression   bp   mf
## NA     <NA>            <NA>     <NA>   NA         NA <NA> <NA>
## NA.1   <NA>            <NA>     <NA>   NA         NA <NA> <NA>
## NA.2   <NA>            <NA>     <NA>   NA         NA <NA> <NA>
## NA.3   <NA>            <NA>     <NA>   NA         NA <NA> <NA>
## NA.4   <NA>            <NA>     <NA>   NA         NA <NA> <NA>
## NA.5   <NA>            <NA>     <NA>   NA         NA <NA> <NA>

tail(ydat[ydat$symbol == 'LEU1', ])

##          symbol systematic_name nutrient rate expression   bp   mf
## NA.47244   <NA>            <NA>     <NA>   NA         NA <NA> <NA>
## NA.47245   <NA>            <NA>     <NA>   NA         NA <NA> <NA>
## NA.47246   <NA>            <NA>     <NA>   NA         NA <NA> <NA>
## NA.47247   <NA>            <NA>     <NA>   NA         NA <NA> <NA>
## NA.47248   <NA>            <NA>     <NA>   NA         NA <NA> <NA>
## NA.47249   <NA>            <NA>     <NA>   NA         NA <NA> <NA>

What is the problem with this approach? It appears that there are a bunch of NA values. Taking a quick look at the symbol column, we see what the problem.

summary(ydat$symbol)

##    Length     Class      Mode 
##    198430 character character

Using the is.na() function, we can make filter further to get down to values of interest.

head(ydat[ydat$symbol == 'LEU1' & !is.na(ydat$symbol), ])

##       symbol systematic_name nutrient rate expression                   bp
## 1526    LEU1         YGL009C  Glucose 0.05      -1.12 leucine biosynthesis
## 7043    LEU1         YGL009C  Glucose 0.10      -0.77 leucine biosynthesis
## 12555   LEU1         YGL009C  Glucose 0.15      -0.67 leucine biosynthesis
## 18071   LEU1         YGL009C  Glucose 0.20      -0.59 leucine biosynthesis
## 23603   LEU1         YGL009C  Glucose 0.25      -0.20 leucine biosynthesis
## 29136   LEU1         YGL009C  Glucose 0.30       0.03 leucine biosynthesis
##                                           mf
## 1526  3-isopropylmalate dehydratase activity
## 7043  3-isopropylmalate dehydratase activity
## 12555 3-isopropylmalate dehydratase activity
## 18071 3-isopropylmalate dehydratase activity
## 23603 3-isopropylmalate dehydratase activity
## 29136 3-isopropylmalate dehydratase activity

Sometimes, looking at the data themselves is not that important. Using dim() is one possibility to look at the number of rows and columns after subsetting.

dim(ydat[ydat$expression > 3, ])

## [1] 714   7

Find the high expressed genes when leucine-starved. For this task we can also use subset which allows us to treat column names as R variables (no $ needed).

subset(ydat, nutrient == 'Leucine' & rate == 0.05 & expression > 3)

##        symbol systematic_name nutrient rate expression
## 133768   QDR2         YIL121W  Leucine 0.05       4.61
## 133772   LEU1         YGL009C  Leucine 0.05       3.84
## 133858   BAP3         YDR046C  Leucine 0.05       4.29
## 135186   <NA>         YPL033C  Leucine 0.05       3.43
## 135187   <NA>         YLR267W  Leucine 0.05       3.23
## 135288   HXT3         YDR345C  Leucine 0.05       5.16
## 135963   TPO2         YGR138C  Leucine 0.05       3.75
## 135965   YRO2         YBR054W  Leucine 0.05       4.40
## 136102   GPG1         YGL121C  Leucine 0.05       3.08
## 136109  HSP42         YDR171W  Leucine 0.05       3.07
## 136119   HXT5         YHR096C  Leucine 0.05       4.90
## 136151   <NA>         YJL144W  Leucine 0.05       3.06
## 136152   MOH1         YBL049W  Leucine 0.05       3.43
## 136153   <NA>         YBL048W  Leucine 0.05       3.95
## 136189  HSP26         YBR072W  Leucine 0.05       4.86
## 136231   NCA3         YJL116C  Leucine 0.05       4.03
## 136233   <NA>         YBR116C  Leucine 0.05       3.28
## 136486   <NA>         YGR043C  Leucine 0.05       3.07
## 137443   ADH2         YMR303C  Leucine 0.05       4.15
## 137448   ICL1         YER065C  Leucine 0.05       3.54
## 137451   SFC1         YJR095W  Leucine 0.05       3.72
## 137569   MLS1         YNL117W  Leucine 0.05       3.76
##                                               bp
## 133768                       multidrug transport
## 133772                      leucine biosynthesis
## 133858                      amino acid transport
## 135186                                  meiosis*
## 135187                biological process unknown
## 135288                          hexose transport
## 135963                       polyamine transport
## 135965                biological process unknown
## 136102                       signal transduction
## 136109                       response to stress*
## 136119                          hexose transport
## 136151                   response to dessication
## 136152                biological process unknown
## 136153                                      <NA>
## 136189                       response to stress*
## 136231 mitochondrion organization and biogenesis
## 136233                                      <NA>
## 136486                biological process unknown
## 137443                             fermentation*
## 137448                          glyoxylate cycle
## 137451                       fumarate transport*
## 137569                          glyoxylate cycle
##                                            mf
## 133768         multidrug efflux pump activity
## 133772 3-isopropylmalate dehydratase activity
## 133858        amino acid transporter activity
## 135186             molecular function unknown
## 135187             molecular function unknown
## 135288          glucose transporter activity*
## 135963          spermine transporter activity
## 135965             molecular function unknown
## 136102             signal transducer activity
## 136109               unfolded protein binding
## 136119          glucose transporter activity*
## 136151             molecular function unknown
## 136152             molecular function unknown
## 136153                                   <NA>
## 136189               unfolded protein binding
## 136231             molecular function unknown
## 136233                                   <NA>
## 136486                 transaldolase activity
## 137443         alcohol dehydrogenase activity
## 137448              isocitrate lyase activity
## 137451 succinate:fumarate antiporter activity
## 137569               malate synthase activity

Aggregating data

Aggregating data, or summarizing by category, is a common way to look for trends or differences in measurements between categories. Use aggregate to find the mean expression by gene symbol.

head(aggregate(ydat$expression, by=list( ydat$symbol), mean))

##   Group.1           x
## 1    AAC1  0.52888889
## 2    AAC3 -0.21628571
## 3   AAD10  0.43833333
## 4   AAD14 -0.07166667
## 5   AAD16  0.24194444
## 6    AAD4 -0.79166667

# or 
head(aggregate(expression ~ symbol, mean, data=ydat))

##   symbol  expression
## 1   AAC1  0.52888889
## 2   AAC3 -0.21628571
## 3  AAD10  0.43833333
## 4  AAD14 -0.07166667
## 5  AAD16  0.24194444
## 6   AAD4 -0.79166667

Creating a data.frame from scratch

Sometimes it is useful to combine related data into one object. For example, let’s simulate some data.

smoker = factor(rep(c("smoker", "non-smoker"), each=50))
smoker_numeric = as.numeric(smoker)
x = rnorm(100)
risk = x + 2*smoker_numeric

We have two varibles, risk and smoker that are related. We can make a data.frame out of them:

smoker_risk = data.frame(smoker = smoker, risk = risk)
head(smoker_risk)

##   smoker     risk
## 1 smoker 3.314167
## 2 smoker 4.211780
## 3 smoker 3.296419
## 4 smoker 5.228781
## 5 smoker 4.422127
## 6 smoker 3.930748

R also has plotting shortcuts that work with data.frames to simplify plotting

plot( risk ~ smoker, data=smoker_risk)

Saving a data.frame

Once we have a data.frame of interest, we may want to save it. The most portable way to save a data.frame is to use one of the write functions. In this case, let’s save the data as a .csv file.

write.csv(smoker_risk, "smoker_risk.csv")