10 R: First Impressions
Type values and mathematical formulas into R’s command prompt
1 + 1
## [1] 2
Assign values to symbols (variables)
x = 1
x + x
## [1] 2
Invoke functions such as c(), which takes any number of values and returns a single vector
x = c(1, 2, 3)
x
## [1] 1 2 3
R functions, such as sqrt(), often operate efficiently on vectors
y = sqrt(x)
y
## [1] 1.000000 1.414214 1.732051
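To make 'operate on vectors' concrete, the sketch below compares a single call to sqrt() on a whole vector with an explicit loop over its elements. The names y_vectorized and y_loop are introduced here only for illustration.

```r
x <- c(1, 2, 3)

# Vectorized: one call handles every element of x
y_vectorized <- sqrt(x)

# Equivalent explicit loop, shown only for comparison
y_loop <- numeric(length(x))
for (i in seq_along(x)) {
    y_loop[i] <- sqrt(x[i])
}

# Both approaches give the same values
all.equal(y_vectorized, y_loop)
```

The vectorized form is usually shorter and easier to read, which is one reason it is preferred in everyday R code.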
There are often several ways to accomplish a task in R
x = c(1, 2, 3)
x
## [1] 1 2 3
x <- c(4, 5, 6)
x
## [1] 4 5 6
x <- 7:9
x
## [1] 7 8 9
10:12 -> x
x
## [1] 10 11 12
Sometimes R does ‘surprising’ things that can be fun to figure out
x <- c(1, 2, 3) -> y
x
## [1] 1 2 3
y
## [1] 1 2 3
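One way to figure out the 'surprise': R's documented operator precedence (see ?Syntax) ranks -> above <-, so the expression is grouped as x <- (c(1, 2, 3) -> y). The sketch below spells out the equivalent steps; the name expr is introduced here only for illustration.

```r
# The combined expression assigns to y first, then to x;
# written as separate steps it is equivalent to:
y <- c(1, 2, 3)
x <- y

# quote() captures the expression without evaluating it,
# so printing it shows how R parsed the original line
expr <- quote(x <- c(1, 2, 3) -> y)
print(expr)

identical(x, y)
```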
10.1 R Data types: vector and list
‘Atomic’ vectors
Types include integer, numeric (floating-point; real), complex, logical, character, raw (bytes)
people <- c("Lori", "Nitesh", "Valerie", "Herve")
people
## [1] "Lori" "Nitesh" "Valerie" "Herve"
Atomic vectors can be named
population <- c(Buffalo=259000, Rochester=210000, `New York`=8400000)
population
##   Buffalo Rochester  New York
##    259000    210000   8400000
log10(population)
##   Buffalo Rochester  New York
##  5.413300  5.322219  6.924279
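Names are more than labels in the printed output; they can also be used to look up elements. A minimal sketch using the same population vector and base R functions:

```r
population <- c(Buffalo = 259000, Rochester = 210000, `New York` = 8400000)

names(population)                      # the names, as a character vector
population["Rochester"]                # select one element by name
population[c("New York", "Buffalo")]   # select and reorder by name
unname(population)                     # same values, names removed
```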
Statistical concepts like NA (“not available”)
truthiness <- c(TRUE, FALSE, NA)
truthiness
## [1] TRUE FALSE NA
Logical concepts like ‘and’ (&), ‘or’ (|), and ‘not’ (!)
!truthiness
## [1] FALSE TRUE NA
truthiness | !truthiness
## [1] TRUE TRUE NA
truthiness & !truthiness
## [1] FALSE FALSE NA
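The NA results above follow one rule: NA stands for an unknown value, so a logical operation returns NA only when that unknown value could change the answer. A short sketch of the four basic cases, plus two base R helpers for handling missing values:

```r
NA | TRUE     # TRUE: true whatever the missing value is
NA & FALSE    # FALSE: false whatever the missing value is
NA | FALSE    # NA: depends on the missing value
NA & TRUE     # NA: depends on the missing value

truthiness <- c(TRUE, FALSE, NA)
is.na(truthiness)                # locate the missing values
any(truthiness, na.rm = TRUE)    # ignore missing values when summarizing
```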
Numerical concepts like infinity (Inf) or not-a-number (NaN, e.g., 0 / 0)
undefined_numeric_values <- c(NA, 0/0, NaN, Inf, -Inf)
undefined_numeric_values
## [1] NA NaN NaN Inf -Inf
sqrt(undefined_numeric_values)
## Warning in sqrt(undefined_numeric_values): NaNs produced
## [1] NA NaN NaN Inf NaN
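Because NA, NaN, and Inf behave differently, it helps to test for each explicitly. The sketch below uses base R predicates on a vector like the one above, with one ordinary number added for contrast; the name values is introduced here only for illustration.

```r
values <- c(NA, 0/0, NaN, Inf, -Inf, 1)

is.na(values)        # TRUE for NA and also for NaN
is.nan(values)       # TRUE only for NaN
is.infinite(values)  # TRUE for Inf and -Inf
is.finite(values)    # TRUE only for ordinary numbers
```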
Common string manipulations
toupper(people)
## [1] "LORI" "NITESH" "VALERIE" "HERVE"
substr(people, 1, 3)
## [1] "Lor" "Nit" "Val" "Her"
R is a green consumer – recycling short vectors to align with long vectors
x <- 1:3
x * 2 # '2' (vector of length 1) recycled to c(2, 2, 2)
## [1] 2 4 6
truthiness | NA
## [1] TRUE NA NA
truthiness & NA
## [1] NA FALSE NA
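Recycling is not limited to vectors of length 1. The sketch below shows a length-2 vector recycled against a length-6 vector, and what happens when the lengths do not divide evenly; the name partial is introduced here only for illustration.

```r
x <- 1:6
x + c(10, 20)    # c(10, 20) is recycled to c(10, 20, 10, 20, 10, 20)

# When the longer length is not a multiple of the shorter one,
# R still recycles but issues a warning (suppressed here)
partial <- suppressWarnings(1:3 + c(10, 20))
partial
```

The warning in the second case is worth heeding: unintended recycling is a common source of silent errors.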
It’s very common to nest operations, which can be simultaneously compact, confusing, and expressive ([: subset; <: less than)
substr(tolower(people), 1, 3)
## [1] "lor" "nit" "val" "her"
population[population < 1000000]
##   Buffalo Rochester
##    259000    210000
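When a nested expression is confusing, it can be unpacked into named intermediate steps. A sketch of the two nested calls above, one step at a time; the intermediate names (lower, prefix, is_small, small) are introduced here only for illustration.

```r
people <- c("Lori", "Nitesh", "Valerie", "Herve")
population <- c(Buffalo = 259000, Rochester = 210000, `New York` = 8400000)

# substr(tolower(people), 1, 3), one step at a time
lower <- tolower(people)         # inner call is evaluated first
prefix <- substr(lower, 1, 3)    # outer call uses its result

# population[population < 1000000], one step at a time
is_small <- population < 1000000   # logical vector, one value per city
small <- population[is_small]      # keep elements where is_small is TRUE
```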
Lists
The list type can contain other vectors, including other lists
frenemies = list(
    friends=c("Larry", "Richard", "Vivian"),
    enemies=c("Dick", "Mike")
)
frenemies
## $friends
## [1] "Larry"   "Richard" "Vivian"
##
## $enemies
## [1] "Dick" "Mike"
[ subsets one list to create another list, [[ extracts a list element
frenemies[1]
## $friends
## [1] "Larry"   "Richard" "Vivian"
frenemies[c("enemies", "friends")]
## $enemies
## [1] "Dick" "Mike"
##
## $friends
## [1] "Larry"   "Richard" "Vivian"
frenemies[["enemies"]]
## [1] "Dick" "Mike"
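The difference between [ and [[ is easiest to see by checking what each returns. A minimal sketch using the same frenemies list:

```r
frenemies <- list(
    friends = c("Larry", "Richard", "Vivian"),
    enemies = c("Dick", "Mike")
)

is.list(frenemies["enemies"])         # TRUE: [ returns a list of length 1
is.character(frenemies[["enemies"]])  # TRUE: [[ returns the element itself
frenemies$enemies                     # $ also extracts an element by name
frenemies[["enemies"]][2]             # index into the extracted vector
```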
Factors
Character-like vectors, but with values restricted to specific levels
sex = factor(c("Male", "Male", "Female"), levels=c("Female", "Male", "Hermaphrodite"))
sex
## [1] Male   Male   Female
## Levels: Female Male Hermaphrodite
sex == "Female"
## [1] FALSE FALSE TRUE
table(sex)
## sex
##        Female          Male Hermaphrodite
##             1             2             0
sex[sex == "Female"]
## [1] Female
## Levels: Female Male Hermaphrodite
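A factor stores integer codes that point into its levels, which is why unused levels such as Hermaphrodite survive subsetting. The sketch below inspects that representation with base R functions:

```r
sex <- factor(c("Male", "Male", "Female"),
              levels = c("Female", "Male", "Hermaphrodite"))

levels(sex)        # all permitted values, whether observed or not
as.integer(sex)    # underlying codes: position of each value in levels(sex)
as.character(sex)  # back to a plain character vector
droplevels(sex)    # same values, with unobserved levels removed
```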
10.2 Classes: data.frame and beyond
Variables are often related to one another in a highly structured way, e.g., two ‘columns’ of data in a spreadsheet
x = rnorm(1000) # 1000 random normal deviates
y = x + rnorm(1000) # another 1000 deviates, as a function of x
plot(y ~ x) # relationship between x and y
Convenient to manipulate them together
data.frame(): like columns in a spreadsheet
df = data.frame(X=x, Y=y)
head(df) # first 6 rows
##             X          Y
## 1 -0.23926790 -0.3937137
## 2 -0.07412402 -0.3799202
## 3  0.99624820  1.1667748
## 4  0.39307812  0.8211909
## 5  0.57966738  2.3522918
## 6  0.42136109  0.2216827
plot(Y ~ X, df) # same as above
See all data with View(df). Summarize data with summary(df)
summary(df)
##        X                   Y
##  Min.   :-3.488112   Min.   :-3.69944
##  1st Qu.:-0.669377   1st Qu.:-0.90624
##  Median : 0.004582   Median : 0.03182
##  Mean   : 0.007979   Mean   : 0.05199
##  3rd Qu.: 0.703727   3rd Qu.: 1.02445
##  Max.   : 3.362927   Max.   : 4.96088
Easy to manipulate data in a coordinated way, e.g., access column X with $ and subset for just those values greater than 0
positiveX = df[df$X > 0,]
head(positiveX)
##           X         Y
## 3 0.9962482 1.1667748
## 4 0.3930781 0.8211909
## 5 0.5796674 2.3522918
## 6 0.4213611 0.2216827
## 7 0.9418160 1.2713186
## 8 0.6829870 2.8699771
plot(Y ~ X, positiveX)
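The row subset df[df$X > 0,] combines two steps: building a logical vector and using it to select rows. The sketch below separates them and shows base R's subset() as an alternative. The set.seed() call and the names keep and positiveX2 are assumptions added here so the example is reproducible and self-contained; the random values will differ from those printed above.

```r
set.seed(123)    # assumption: fixed seed, only for reproducibility
x <- rnorm(1000)
y <- x + rnorm(1000)
df <- data.frame(X = x, Y = y)

keep <- df$X > 0                  # logical vector, one value per row
positiveX <- df[keep, ]           # rows where keep is TRUE, all columns
positiveX2 <- subset(df, X > 0)   # same selection using subset()

head(df[keep, "Y"])               # a single column comes back as a vector
```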
R is introspective – ask it about itself
class(df)
## [1] "data.frame"
dim(df)
## [1] 1000 2
colnames(df)
## [1] "X" "Y"
matrix(): a related class, where all elements have the same type (a data.frame() requires elements within a column to be the same type, but elements between columns can be different types).
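The single-type rule for a matrix shows up as soon as types are mixed. In the sketch below, adding a character column coerces every element of the matrix to character, while a data.frame keeps each column's own type. The names m, m_chr, and mixed are introduced here only for illustration.

```r
m <- matrix(1:6, nrow = 2)          # 2 x 3 matrix of integers
typeof(m)

m_chr <- cbind(m, c("a", "b"))      # adding a character column...
typeof(m_chr)                       # ...coerces all elements to character

mixed <- data.frame(n = 1:2, s = c("a", "b"), stringsAsFactors = FALSE)
sapply(mixed, class)                # each column keeps its own type
```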
A scatterplot makes one want to fit a linear model (do a regression analysis)
- Use a formula to describe the relationship between variables
- Variables in the formula are looked up in the second argument (here, the data.frame df)
fit <- lm(Y ~ X, df)
Visualize the points, and add the regression line
plot(Y ~ X, df)
abline(fit, col="red", lwd=3)
Summarize the fit as an ANOVA table
anova(fit)
## Analysis of Variance Table
##
## Response: Y
##            Df Sum Sq Mean Sq F value    Pr(>F)
## X           1 982.56  982.56  1007.4 < 2.2e-16 ***
## Residuals 998 973.42    0.98
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
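Besides the ANOVA table, the fitted coefficients are often of direct interest. The sketch below refits the same kind of model and extracts the estimates with standard functions from the stats package. The set.seed() call is an assumption added here for reproducibility, so the numbers will not match the table above.

```r
set.seed(123)    # assumption: fixed seed, only for reproducibility
x <- rnorm(1000)
y <- x + rnorm(1000)
df <- data.frame(X = x, Y = y)
fit <- lm(Y ~ X, df)

coef(fit)                 # estimated intercept and slope
confint(fit)              # confidence intervals (95% by default)
summary(fit)$r.squared    # proportion of variance explained
```

Since y was simulated as x plus noise, the estimated slope should be close to 1 and the intercept close to 0.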
N.B. – ‘Type I’ sums-of-squares, so order of independent variables matters; use drop1() for ‘Type III’. See DataCamp Quick-R
Introspection – what class is fit? What methods can I apply to an object of that class?
class(fit)
## [1] "lm"
methods(class=class(fit))
##  [1] add1           alias          anova          case.names
##  [5] coerce         confint        cooks.distance deviance
##  [9] dfbeta         dfbetas        drop1          dummy.coef
## [13] effects        extractAIC     family         formula
## [17] hatvalues      influence      initialize     kappa
## [21] labels         logLik         model.frame    model.matrix
## [25] nobs           plot           predict        print
## [29] proj           qr             residuals      rstandard
## [33] rstudent       show           simulate       slotsFromS3
## [37] summary        variable.names vcov
## see '?methods' for accessing help and source code
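Each name in that list is a function that can be called on the fitted object. The sketch below tries two of them, residuals() and predict(), on a small simulated data set. The seed, the sample size of 100, and the names dat, new_points, and predicted are assumptions chosen here for illustration.

```r
set.seed(123)    # assumption: fixed seed, only for reproducibility
dat <- data.frame(X = rnorm(100))
dat$Y <- dat$X + rnorm(100)
fit <- lm(Y ~ X, dat)

head(residuals(fit))    # observed Y minus fitted Y, one value per row

# Predicted Y at new values of X; newdata must contain a column named X
new_points <- data.frame(X = c(-1, 0, 1))
predicted <- predict(fit, newdata = new_points)
predicted
```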
10.3 Help!
Help available in RStudio or interactively
Check out the help page for rnorm()
?rnorm
‘Usage’ section describes how the function can be used
rnorm(n, mean = 0, sd = 1)
Arguments, some with default values. Arguments matched first by name, then position
‘Arguments’ section describes what the arguments are supposed to be
‘Value’ section describes return value
‘Examples’ section illustrates use
Help pages often include citations to relevant technical documentation, references to related functions, and obscure details
Can be intimidating, but in the end actually very useful
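The argument-matching rule from the 'Usage' section (first by name, then by position) can be checked directly with rnorm(n, mean = 0, sd = 1). In the sketch below the seed is reset before each call so that all three calls draw the same random numbers; the names a, b, and d are introduced here only for illustration.

```r
set.seed(1)
a <- rnorm(5, 10, 2)                   # by position: n = 5, mean = 10, sd = 2

set.seed(1)
b <- rnorm(sd = 2, mean = 10, n = 5)   # by name: order does not matter

set.seed(1)
d <- rnorm(5, sd = 2, 10)              # sd matched by name first, then
                                       # 5 and 10 fill n and mean by position

identical(a, b)
identical(a, d)
```

Naming arguments beyond the first one or two tends to make calls easier to read and less fragile.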