4 Introduction to R data structures
As in many programming languages, understanding how data are stored and manipulated is important to getting the most out of R. In these next few sections, we will introduce some basic R data types and structures as well as some general approaches for working with them.
4.1 Vectors
In R, even a single value is a vector of length 1.
z = 1
z
## [1] 1
length(z)
## [1] 1
In the code above, we “assigned” the value 1 to the variable named z. Typing z by itself is an “expression” that returns a result which is, in this case, the value that we just assigned. The length function takes an R object and returns its length. There are numerous ways of asking R about what an object represents, and length is one of them.
Vectors can contain numbers, strings (character data), logical values (TRUE and FALSE), or other “atomic” data types (table 4.1). Vectors cannot contain a mix of types! We will introduce another data structure, the R list, for situations when we need to store a mix of base R data types.
Data type | Stores |
---|---|
numeric | floating point numbers |
integer | integers |
complex | complex numbers |
factor | categorical data |
character | strings |
logical | TRUE or FALSE |
NA | missing |
NULL | empty |
function | function type |
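As a quick check of the types listed above, the sketch below asks R for the class of a few small vectors and then shows what happens when types are mixed. The variable names are illustrative only and are not used elsewhere in this chapter.

```r
# class() reports what kind of data a vector holds
num <- c(1.5, 2, 3)
int <- 1:3
chr <- c("a", "b")
lgl <- c(TRUE, FALSE)

class(num)
## [1] "numeric"
class(int)
## [1] "integer"
class(chr)
## [1] "character"
class(lgl)
## [1] "logical"

# mixing types does not fail; R converts every element to one common type
mixed <- c(1, "a", TRUE)
class(mixed)
## [1] "character"
```

Because a vector holds a single type, the number and the logical value in the last example are both stored as strings.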
4.1.1 Creating vectors
Character vectors (also sometimes called “string” vectors) are entered with each value surrounded by single or double quotes; either is acceptable, but they must match. They are always displayed by R with double quotes. Here are some examples of creating vectors:
# examples of vectors
c('hello','world')
## [1] "hello" "world"
c(1,3,4,5,1,2)
## [1] 1 3 4 5 1 2
c(1.12341e7,78234.126)
## [1] 11234100.00 78234.13
c(TRUE,FALSE,TRUE,TRUE)
## [1] TRUE FALSE TRUE TRUE
# note how in the next case the TRUE is converted to "TRUE"
# with quotes around it.
c(TRUE,'hello')
## [1] "TRUE" "hello"
We can also create vectors as “regular sequences” of numbers. For example:
# create a vector of integers from 1 to 10
x = 1:10
# and backwards
x = 10:1
The seq function can create more flexible regular sequences. You did read the help for seq, right?
# create a vector of numbers from 1 to 4 skipping by 0.3
y = seq(1,4,0.3)
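A few more calls to seq may help show how its arguments interact. This is a small sketch using the named arguments from, to, by, and length.out; the variable names s1, s2, and s3 are made up for the example.

```r
# step size given by name
s1 <- seq(from = 1, to = 4, by = 0.5)
s1
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0

# ask for a number of values instead of a step size
s2 <- seq(from = 0, to = 1, length.out = 5)
s2
## [1] 0.00 0.25 0.50 0.75 1.00

# a decreasing sequence needs a negative step
s3 <- seq(from = 10, to = 1, by = -3)
s3
## [1] 10  7  4  1
```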
We can also create a new vector by concatenating existing vectors.
# create a sequence by concatenating two other sequences
z = c(y,x)
z
## [1] 1.0 1.3 1.6 1.9 2.2 2.5 2.8 3.1 3.4 3.7 4.0 10.0 9.0 8.0
## [15] 7.0 6.0 5.0 4.0 3.0 2.0 1.0
4.1.2 Vector Operations
Operations on a single vector are typically done element-by-element. For example, if we add 2 to a vector, 2 is added to each element of the vector and a new vector of the same length is returned.
x = 1:10
x + 2
## [1] 3 4 5 6 7 8 9 10 11 12
If the operation involves two vectors, the following rules apply. If the vectors are the same length: R simply applies the operation to each pair of elements.
x + x
## [1] 2 4 6 8 10 12 14 16 18 20
If the vectors are different lengths, but one length is a multiple of the other, R reuses the shorter vector as needed.
x = 1:10
y = c(1,2)
x * y
## [1] 1 4 3 8 5 12 7 16 9 20
If the vectors are different lengths, and one length is not a multiple of the other, R still reuses the shorter vector as needed but delivers a warning.
x = 1:10
y = c(2,3,4)
x * y
## Warning in x * y: longer object length is not a multiple of shorter object
## length
## [1] 2 6 12 8 15 24 14 24 36 20
Typical operations include multiplication (“*”), addition (“+”), subtraction (“-”), division (“/”), and exponentiation (“^”). Many functions in R also operate on a whole vector one element at a time; these are called “vectorized”.
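To make the idea of a vectorized function concrete, here is a minimal sketch that contrasts functions returning one result per element with functions that reduce a vector to a single value. The input vector is arbitrary.

```r
x <- c(1, 4, 9, 16)

# sqrt() is vectorized: it returns one result per element
sqrt(x)
## [1] 1 2 3 4

# exponentiation works element by element too
x^2
## [1]   1  16  81 256

# by contrast, sum() and mean() reduce a vector to a single value
sum(x)
## [1] 30
mean(x)
## [1] 7.5
```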
4.1.3 Logical Vectors
Logical vectors are vectors composed of only the values TRUE and FALSE. Note the all-upper-case and no quotation marks.
a = c(TRUE,FALSE,TRUE)
# we can also create a logical vector from a numeric vector
# 0 becomes FALSE, every other number becomes TRUE
b = c(1,0,217)
d = as.logical(b)
d
## [1] TRUE FALSE TRUE
# test if a and d are the same at every element
all.equal(a,d)
## [1] TRUE
# We can also convert from logical to numeric
as.numeric(a)
## [1] 1 0 1
4.1.4 Logical Operators
Some operators like <, >, ==, >=, <=, != can be used to create logical vectors.
# create a numeric vector
x = 1:10
# testing whether x > 5 creates a logical vector
x > 5
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
x <= 5
## [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
x != 5
## [1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
x == 5
## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
We can also assign the results to a variable:
y = (x == 5)
y
## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
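Comparisons like the ones above can also be combined. The sketch below uses the element-wise operators & (and) and | (or), along with the helper functions any, all, and which; these are not covered elsewhere in this section, so treat it as a brief preview rather than a full treatment.

```r
x <- 1:10

# element-wise AND: TRUE where both conditions hold
x > 3 & x < 7
## [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE

# element-wise OR: TRUE where at least one condition holds
x < 3 | x > 8
## [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE

# summarize a logical vector
any(x > 9)
## [1] TRUE
all(x > 0)
## [1] TRUE

# positions of the TRUE values
which(x > 7)
## [1]  8  9 10
```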
4.1.5 Indexing Vectors
In R, an index is used to refer to a specific element or set of elements in a vector (or other data structure). R uses [ and ] to perform indexing, although other approaches to getting subsets of larger data structures are common in R.
x = seq(0,1,0.1)
# create a new vector from the 4th element of x
x[4]
## [1] 0.3
We can even use other vectors to perform the “indexing”.
x[c(3,5,6)]
## [1] 0.2 0.4 0.5
y = 3:6
x[y]
## [1] 0.2 0.3 0.4 0.5
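Two further uses of [ and ] are worth a short sketch: a negative index drops elements, and an index on the left-hand side of an assignment replaces them. The replacement values 10 and 20 are arbitrary.

```r
x <- seq(0, 1, 0.1)

# a negative index drops elements instead of selecting them
x[-1]
## [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

# indexing on the left-hand side of an assignment replaces elements
x[c(1, 2)] <- c(10, 20)
x[1:3]
## [1] 10.0 20.0  0.2
```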
4.1.6 Indexing Vectors and Logical Vectors
Combining the concept of indexing with the concept of logical vectors results in a very powerful combination.
# use help('rnorm') to figure out what is happening next
myvec = rnorm(10)
# create logical vector that is TRUE where myvec is >0.25
gt1 = (myvec > 0.25)
sum(gt1)
## [1] 2
# and use our logical vector to create a vector of myvec values that are >0.25
myvec[gt1]
## [1] 0.8500783 1.2182233
# or <=0.25 using the logical "not" operator, "!"
myvec[!gt1]
## [1] -0.52361528 -1.96734990 0.09722437 -0.09474050 -0.92634247 0.10862202
## [7] -0.01423129 -1.10639317
# shorter, one line approach
myvec[myvec > 0.25]
## [1] 0.8500783 1.2182233
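Because rnorm draws random numbers, the values above will differ from run to run. A minimal sketch of one way to make such an example repeatable is to call set.seed first; the seed value 123 and the variable name keep are arbitrary choices for illustration.

```r
# fix the random seed so the same "random" numbers are drawn each time
set.seed(123)   # 123 is an arbitrary choice
myvec <- rnorm(10)

# logical vector marking the values greater than 0.25
keep <- myvec > 0.25

# the two subsets together account for every element
length(myvec[keep]) + length(myvec[!keep]) == length(myvec)
## [1] TRUE

# the number of selected values equals the number of TRUEs
length(myvec[keep]) == sum(keep)
## [1] TRUE
```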
4.2 String Handling in R
4.2.1 Concatenating Strings
R uses the paste function to concatenate strings.
paste("abc","def")
## [1] "abc def"
paste("abc","def",sep="THISSEP")
## [1] "abcTHISSEPdef"
paste0("abc","def")
## [1] "abcdef"
paste(c("X","Y"),1:10)
## [1] "X 1" "Y 2" "X 3" "Y 4" "X 5" "Y 6" "X 7" "Y 8" "X 9" "Y 10"
paste(c("X","Y"),1:10,sep="_")
## [1] "X_1" "Y_2" "X_3" "Y_4" "X_5" "Y_6" "X_7" "Y_8" "X_9" "Y_10"
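The examples above all return one string per element. As a small sketch of the difference between the sep and collapse arguments of paste, the code below joins a result into a single string; the vector letters3 is made up for the example.

```r
letters3 <- c("a", "b", "c")

# sep joins the arguments element by element; the result is still a vector
paste(letters3, 1:3, sep = "-")
## [1] "a-1" "b-2" "c-3"

# collapse joins the elements of the result into one string
paste(letters3, collapse = "+")
## [1] "a+b+c"

# both can be used together
paste(letters3, 1:3, sep = "-", collapse = ", ")
## [1] "a-1, b-2, c-3"
```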
4.2.2 More String Functions
Number of characters in a string
nchar('abc')
## [1] 3
nchar(c('abc','d',123456))
## [1] 3 1 6
Extract substrings
substr('This is a good sentence.',start=10,stop=15)
## [1] " good "
String replacement
sub('This','That','This is a good sentence.')
## [1] "That is a good sentence."
Finding matching strings
grep('bcd',c('abcdef','abcd','bcde','cdef','defg'))
## [1] 1 2 3
grep('bcd',c('abcdef','abcd','bcde','cdef','defg'),value=TRUE)
## [1] "abcdef" "abcd" "bcde"
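Two close relatives of the functions above are gsub and grepl. The sketch below, using a made-up sentence, shows that sub replaces only the first match while gsub replaces all of them, and that grepl returns a logical vector that can be used for indexing.

```r
s <- "This is a good sentence. This is another."

# sub() replaces only the first match
sub("This", "That", s)
## [1] "That is a good sentence. This is another."

# gsub() replaces every match
gsub("This", "That", s)
## [1] "That is a good sentence. That is another."

# grepl() returns a logical vector, handy for indexing
words <- c("abcdef", "abcd", "bcde", "cdef", "defg")
grepl("bcd", words)
## [1]  TRUE  TRUE  TRUE FALSE FALSE
words[grepl("bcd", words)]
## [1] "abcdef" "abcd"   "bcde"
```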
4.3 Special Data Types
4.3.1 Missing Values, AKA “NA”
R has a special value, “NA”, that represents a “missing” value in a vector or other data structure.
x = 1:5
x
## [1] 1 2 3 4 5
length(x)
## [1] 5
is.na(x)
## [1] FALSE FALSE FALSE FALSE FALSE
x[2] = NA
x
## [1] 1 NA 3 4 5
length(x)
## [1] 5
is.na(x)
## [1] FALSE TRUE FALSE FALSE FALSE
x[!is.na(x)]
## [1] 1 3 4 5
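Missing values also affect calculations. As a brief sketch, the code below shows that arithmetic and summaries involving NA return NA unless the missing values are removed, here with the na.rm argument of sum and mean.

```r
x <- c(1, NA, 3, 4, 5)

# arithmetic with NA gives NA
x + 1
## [1]  2 NA  4  5  6
sum(x)
## [1] NA

# na.rm = TRUE drops missing values before computing
sum(x, na.rm = TRUE)
## [1] 13
mean(x, na.rm = TRUE)
## [1] 3.25

# comparing with NA yields NA, so use is.na() to test for missing values
x == NA
## [1] NA NA NA NA NA
```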
4.3.2 Factors
A factor is a special type of vector, normally used to hold a categorical variable in many statistical functions.
Such vectors have class “factor”.
Factors are primarily used in Analysis of Variance (ANOVA). When a factor is used as a predictor variable, the corresponding indicator variables are created.
Note of caution: factors in R often appear to be character vectors when printed, but you will notice that they do not have double quotes around them. They are stored in R as integer codes, each paired with a label (a “level”), so a factor will sometimes behave like a numeric vector.
4.3.3 Factors in Practice
# create the character vector
citizen<-c("uk","us","no","au","uk","us","us","no","au")
# convert to factor
citizenf<-factor(citizen)
citizen
## [1] "uk" "us" "no" "au" "uk" "us" "us" "no" "au"
citizenf
## [1] uk us no au uk us us no au
## Levels: au no uk us
# convert factor back to character vector
as.character(citizenf)
## [1] "uk" "us" "no" "au" "uk" "us" "us" "no" "au"
# convert to numeric vector
as.numeric(citizenf)
## [1] 3 4 2 1 3 4 4 2 1
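The numeric conversion above returns the internal codes, which matters most when the labels themselves look like numbers. The following sketch uses a hypothetical factor named dose to show the pitfall and one common way around it.

```r
# a factor whose labels look like numbers (hypothetical data)
dose <- factor(c("10", "20", "10", "50"))

# as.numeric() returns the internal codes, not the labels
as.numeric(dose)
## [1] 1 2 1 3

# convert to character first to recover the original numbers
as.numeric(as.character(dose))
## [1] 10 20 10 50
```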
4.3.4 Factors in Practice
# R stores many data structures as vectors with "attributes" and "class"
attributes(citizenf)
## $levels
## [1] "au" "no" "uk" "us"
##
## $class
## [1] "factor"
class(citizenf)
## [1] "factor"
# note that after unclassing, we can see the
# underlying numeric structure again
unclass(citizenf)
## [1] 3 4 2 1 3 4 4 2 1
## attr(,"levels")
## [1] "au" "no" "uk" "us"
table(citizenf)
## citizenf
## au no uk us
## 2 2 2 3
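The levels shown above are in alphabetical order, which is the default. As a minimal sketch, the levels argument of factor can set a different order; the name citizenf2 and the chosen order are illustrative only.

```r
citizen <- c("uk","us","no","au","uk","us","us","no","au")

# by default, levels are sorted alphabetically
levels(factor(citizen))
## [1] "au" "no" "uk" "us"

# the levels argument sets the order explicitly
citizenf2 <- factor(citizen, levels = c("us", "uk", "no", "au"))
levels(citizenf2)
## [1] "us" "uk" "no" "au"

# the integer codes follow the new order
as.numeric(citizenf2)
## [1] 2 1 3 4 2 1 1 3 4
```

The order of the levels determines the order of the columns produced by table, so setting it explicitly can make summaries easier to read.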