5 Rectangular Data

5.0.1 Matrices and Data Frames

  • A matrix is a rectangular array. It can be viewed as a collection of column vectors all of the same length and the same type (i.e. numeric, character or logical).

  • A data frame is also a rectangular array. All of the columns must be the same length, but they may be of different types.

  • The rows and columns of a matrix or data frame can be given names.

  • However, the two are implemented differently in R; many operations work for one but not the other.
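As a minimal sketch of that last point, the following compares a small matrix and a small data frame. The object names `m_demo` and `df_demo` and their contents are illustrative assumptions, not objects used elsewhere in this section.

```r
# Hypothetical example objects
m_demo  <- cbind(id = 1:3, score = c(2.5, 3.1, 4.8))      # numeric matrix
df_demo <- data.frame(id = 1:3, label = c("a", "b", "c")) # mixed column types

is.matrix(m_demo)       # a matrix is an atomic vector with dimensions
is.data.frame(df_demo)  # a data frame is a list of columns
is.list(m_demo)
is.list(df_demo)

# The $ operator works on the data frame ...
df_demo$id
# ... but raises an error on the matrix; try() captures it
dollar_on_matrix <- try(m_demo$id, silent = TRUE)
dollar_fails <- inherits(dollar_on_matrix, "try-error")
dollar_fails
```

Bracket indexing with a column name, such as `m_demo[, "id"]`, works for both kinds of object.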

5.1 Matrix Operations

5.1.1 Matrix Operations

x<-1:10 
y<-rnorm(10)
# make a matrix by column binding two numeric vectors
mat<-cbind(x,y)
mat
##        x          y
##  [1,]  1  0.3842634
##  [2,]  2 -1.2536470
##  [3,]  3 -1.4098608
##  [4,]  4  1.7139083
##  [5,]  5 -0.1876347
##  [6,]  6 -0.3343492
##  [7,]  7  1.6553040
##  [8,]  8  0.5974976
##  [9,]  9 -1.8443792
## [10,] 10  1.4203497
# And the names of the rows and columns
rownames(mat)
## NULL
colnames(mat)
## [1] "x" "y"

5.1.2 Matrix Operations

Indexing for matrices works as for vectors except that we now need to include both the row and column (in that order).

# The 2nd element of the 1st row of mat
mat[1,2]
##         y 
## 0.3842634
# The first ROW of mat
mat[1,]
##         x         y 
## 1.0000000 0.3842634
# The first COLUMN of mat
mat[,1]
##  [1]  1  2  3  4  5  6  7  8  9 10
# and all elements of mat that are > 4; note no comma
mat[mat>4]
## [1]  5  6  7  8  9 10
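A few further indexing forms may be worth trying. This is a sketch on a small hypothetical matrix (`idx_demo` is an assumed name), showing selection by column name, negative indices, and the `drop` argument of `[`.

```r
# Illustrative matrix, not the mat object from the text
idx_demo <- cbind(x = 1:5, y = c(10, 20, 30, 40, 50))

# Select by row number and column name
idx_demo[2, "y"]

# A negative index drops that row and keeps the rest
idx_demo[-1, ]

# A single column is simplified to a plain vector by default ...
col_vec <- idx_demo[, 1]
dim(col_vec)

# ... unless drop = FALSE is given, which keeps a 5 x 1 matrix
col_mat <- idx_demo[, 1, drop = FALSE]
dim(col_mat)
```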

5.1.3 Matrix Operations

# create a matrix with 2 columns and 10 rows
# filled with random normal deviates
m = matrix(rnorm(20),nrow=10)
# multiply all values in the matrix by 20
m = m*20
# and add 100 to the first column of m
m[,1] = m[,1] + 100
# summarize m
summary(m)
##        V1               V2          
##  Min.   : 44.58   Min.   :-42.4628  
##  1st Qu.: 93.26   1st Qu.:-19.8712  
##  Median :100.72   Median :  3.6691  
##  Mean   : 95.27   Mean   : -0.4271  
##  3rd Qu.:106.33   3rd Qu.: 17.3559  
##  Max.   :111.29   Max.   : 31.2418
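Column-wise summaries can also be computed directly. The sketch below rebuilds a matrix in the same way as above (the name `m_demo` and the seed are arbitrary assumptions) and then uses `colMeans` and `apply`.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
m_demo <- matrix(rnorm(20), nrow = 10)

# Arithmetic with a scalar is applied element by element
m_demo <- m_demo * 20
# Assigning to a column slice changes only that column
m_demo[, 1] <- m_demo[, 1] + 100

# Mean of each column
col_means <- colMeans(m_demo)
# apply() with MARGIN = 2 runs a function over each column
col_sds <- apply(m_demo, 2, sd)

col_means
col_sds
```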

5.2 Data Frames

5.2.1 Matrices Versus Data Frames

mat<-cbind(x,y)
class(mat[,1])          
## [1] "numeric"
z = paste0('a',1:10)
tab<-cbind(x,y,z)
# class of the combined object (R >= 4.0.0 prints "matrix" "array")
class(tab)
## [1] "matrix"
mode(tab[,1])
## [1] "character"
head(tab,4)
##      x   y                   z   
## [1,] "1" "0.384263423333191" "a1"
## [2,] "2" "-1.25364703104426" "a2"
## [3,] "3" "-1.40986082624299" "a3"
## [4,] "4" "1.7139082579506"   "a4"
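The coercion seen here can be reproduced on a smaller scale. In this sketch the vectors `num` and `chr` are made-up examples; binding them yields a character matrix, and `as.numeric` converts a column back.

```r
# Hypothetical input vectors
num <- c(1.5, 2.5, 3.5)
chr <- c("a", "b", "c")

# A matrix holds a single type, so the numbers become strings
mixed <- cbind(num, chr)
typeof(mixed)

# The numeric values can be recovered from the character column
recovered <- as.numeric(mixed[, "num"])
recovered
```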

5.2.2 Matrices Versus Data Frames

tab<-data.frame(x,y,z)
class(tab)
## [1] "data.frame"
head(tab)
##   x          y  z
## 1 1  0.3842634 a1
## 2 2 -1.2536470 a2
## 3 3 -1.4098608 a3
## 4 4  1.7139083 a4
## 5 5 -0.1876347 a5
## 6 6 -0.3343492 a6
mode(tab[,1])
## [1] "numeric"
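# z was converted to a factor on creation; since R 4.0.0 the default
# (stringsAsFactors = FALSE) keeps it as "character"
class(tab[,3])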
## [1] "factor"
rownames(tab)           
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
rownames(tab)<-paste0("row",1:10)
rownames(tab)
##  [1] "row1"  "row2"  "row3"  "row4"  "row5"  "row6"  "row7"  "row8" 
##  [9] "row9"  "row10"
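Whether a character column is stored as a factor depends on the `stringsAsFactors` argument of `data.frame`. The sketch below sets it explicitly so the result does not depend on the R version; the object names are illustrative assumptions.

```r
labels_demo <- paste0("a", 1:3)

# Strings converted to a factor
df_fac <- data.frame(id = 1:3, lab = labels_demo, stringsAsFactors = TRUE)
# Strings kept as character
df_chr <- data.frame(id = 1:3, lab = labels_demo, stringsAsFactors = FALSE)

class(df_fac$lab)
class(df_chr$lab)

# Each column keeps its own type, unlike a matrix
sapply(df_chr, class)
```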

5.2.3 Data Frames, Continued

  • Data frame columns can be referred to by name using the “dollar sign” operator.

    tab$x
    ##  [1]  1  2  3  4  5  6  7  8  9 10
    tab$y
    ##  [1]  0.3842634 -1.2536470 -1.4098608  1.7139083 -0.1876347 -0.3343492
    ##  [7]  1.6553040  0.5974976 -1.8443792  1.4203497
  • Column names can be set, which can be useful for referring to data later

    colnames(tab)
    ## [1] "x" "y" "z"
    colnames(tab) = paste0('col',1:3)
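The access forms above can be compared side by side. This is a sketch on a small hypothetical data frame (`df_demo` is an assumed name); it also shows that an old column name stops working once the columns are renamed.

```r
df_demo <- data.frame(x = 1:3, y = c(0.5, 1.5, 2.5))

# Three equivalent ways to extract the column named y
by_dollar  <- df_demo$y
by_bracket <- df_demo[["y"]]
by_index   <- df_demo[, "y"]

# After renaming, the old name no longer matches any column
colnames(df_demo) <- c("col1", "col2")
old_name <- df_demo$y      # NULL
new_name <- df_demo$col2   # the same values as before
```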

5.2.4 Exercise: Subsetting Data Frames

Try these

ncol(tab)
nrow(tab)
dim(tab)
summary(tab)
tab[1:3,]
tab[,2:3]
tab[,1]>7
tab[tab[,1]>7,]
tab[tab[,1]>7,3]
tab[tab[,1]>7,2:3]
# the columns were renamed to col1, col2, col3 above
tab[tab$col1>7,3]
tab$col3[tab$col1>3]

5.3 Basic Textual Input and Output

5.3.1 Reading and Writing Data Frames to Disk

  • The write.table function and friends write a data.frame or matrix to disk as a text file.

    write.table(tab,file='tab.txt',sep="\t",col.names=TRUE)
    # remove tab from the workspace
    rm(tab)
    # make sure it is gone
    ls(pattern="tab")
    ## character(0)
  • The read.table function and friends read a text file into a data.frame.

    tab = read.table('tab.txt',sep="\t",header=TRUE)
    head(tab,3)
    ##      col1       col2 col3
    ## row1    1  0.3842634   a1
    ## row2    2 -1.2536470   a2
    ## row3    3 -1.4098608   a3
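The round trip can be checked without leaving a file behind. The sketch below writes to a temporary file; the data frame `df_out` and its contents are made up for illustration.

```r
# Temporary file path, so nothing is left in the working directory
out_file <- tempfile(fileext = ".txt")

# Hypothetical data frame with explicit row names
df_out <- data.frame(id = 1:3, value = c(0.25, 0.5, 0.75),
                     row.names = paste0("row", 1:3))

# Row names are written by default, so the header line has one field
# fewer than the data lines
write.table(df_out, file = out_file, sep = "\t", col.names = TRUE)

# read.table then takes the first column as the row names
df_in <- read.table(out_file, sep = "\t", header = TRUE)

round_trip_ok <- isTRUE(all.equal(df_out, df_in))
round_trip_ok

unlink(out_file)  # remove the temporary file
```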

5.4 Lists and Objects

5.4.1 Lists

  • A list is a collection of objects that may be of the same or different types.

  • The objects generally have names, and may be indexed either by name (e.g. my.list$name3) or component number (e.g. my.list[[3]]).

  • A data frame is a list of matched column vectors.
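A short sketch of indexing by name and by component number follows. The list `my.list` takes its name from the bullet above, but its contents are assumed for illustration.

```r
# Hypothetical contents for my.list
my.list <- list(name1 = 1, name2 = "b", name3 = c(1, 2, 3))

by_name     <- my.list$name3       # index by name with $
by_position <- my.list[[3]]        # index by component number
by_string   <- my.list[["name3"]]  # index by name given as a string

names(my.list)
```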

5.4.2 Lists in Practice

  • Create a list, noting the different data types involved.

    a = list(1,"b",c(1,2,3))
    a
    ## [[1]]
    ## [1] 1
    ## 
    ## [[2]]
    ## [1] "b"
    ## 
    ## [[3]]
    ## [1] 1 2 3
    length(a)
    ## [1] 3
    class(a)
    ## [1] "list"
    a[[3]]
    ## [1] 1 2 3
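Single and double brackets behave differently on a list, which is a common source of confusion. This sketch uses a copy of the list above under the assumed name `a_demo`.

```r
a_demo <- list(1, "b", c(1, 2, 3))

# Single brackets return a list containing the selected components
sub_list <- a_demo[3]
class(sub_list)

# Double brackets return the component itself
element <- a_demo[[3]]
class(element)

# Assigning to the next free position appends a component
a_demo[[4]] <- TRUE
length(a_demo)
```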

5.4.3 Lists in Practice

  • A data frame is a list.

    # test if our friend "tab" is a list
    is.list(tab)
    ## [1] TRUE
    tab[[2]]
    ##  [1]  0.3842634 -1.2536470 -1.4098608  1.7139083 -0.1876347 -0.3343492
    ##  [7]  1.6553040  0.5974976 -1.8443792  1.4203497
    names(tab)
    ## [1] "col1" "col2" "col3"
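Because a data frame is a list of columns, list tools such as `length` and `sapply` work column by column. The sketch below uses a small stand-in data frame (`df_list` is an assumed name), not the `tab` object itself.

```r
df_list <- data.frame(col1 = 1:3,
                      col2 = c(0.1, 0.2, 0.3),
                      col3 = c("a1", "a2", "a3"),
                      stringsAsFactors = FALSE)

# The length of a data frame is its number of columns
n_components <- length(df_list)

# sapply() visits each column in turn
col_classes <- sapply(df_list, class)
col_classes

# Double brackets and $ reach the same column
same_column <- identical(df_list[[2]], df_list$col2)
```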

5.4.4 Summary of Simple Data Types

Data type    Stores
real         floating point numbers
integer      integers
complex      complex numbers
factor       categorical data
character    strings
logical      TRUE or FALSE
NA           missing
NULL         empty
function     function type
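The types in this table can be inspected with `class`. The values below are arbitrary examples chosen for this sketch; note that R reports real numbers as "numeric" and that a bare `NA` is a logical value.

```r
simple_classes <- c(
  real      = class(1.5),
  integer   = class(2L),          # the L suffix makes an integer
  complex   = class(1+2i),
  factor    = class(factor("a")),
  character = class("a"),
  logical   = class(TRUE),
  missing   = class(NA),          # a bare NA is logical
  empty     = class(NULL),
  func      = class(sum)
)
simple_classes
```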

5.4.5 Summary of Aggregate Data Types

Data type     Stores
vector        one-dimensional data, single data type
matrix        two-dimensional data, single data type
data frame    two-dimensional data, multiple data types
list          list of data types, not all need to be the same type
object        a list with attributes and potentially slots and methods
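As a rough illustration of the last rows, a data frame can be seen as a list carrying a few attributes. The object name `df_obj` and its contents are assumptions made for this sketch.

```r
# A vector has no dimensions; a matrix has two
dim(1:3)
dim(matrix(1:6, nrow = 2))

# Hypothetical data frame
df_obj <- data.frame(x = 1:2, y = c("a", "b"))

# The attributes are what turn a plain list into a data frame
attr_names <- names(attributes(df_obj))
attr_names

# Removing the class leaves the underlying list of columns
plain <- unclass(df_obj)
is.list(plain)
is.data.frame(plain)
```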