10 R: First Impressions

Type values and mathematical formulas into R’s command prompt

1 + 1
## [1] 2

Assign values to symbols (variables)

x = 1
x + x
## [1] 2

Invoke functions such as c(), which takes any number of values and returns a single vector

x = c(1, 2, 3)
x
## [1] 1 2 3

R functions, such as sqrt(), often operate efficiently on vectors

y = sqrt(x)
y
## [1] 1.000000 1.414214 1.732051
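
To make "operate on vectors" concrete, the sketch below compares a single vectorized call to sqrt() with an explicit element-by-element loop. The names roots_vectorized and roots_loop are made up for this illustration and are not part of the original notes.

```r
x <- c(1, 2, 3)

# Vectorized: one call handles every element of x
roots_vectorized <- sqrt(x)

# Equivalent explicit loop, shown only for comparison
roots_loop <- numeric(length(x))
for (i in seq_along(x)) {
    roots_loop[i] <- sqrt(x[i])
}

identical(roots_vectorized, roots_loop)
## [1] TRUE
```

The vectorized form is shorter and is usually the idiomatic choice in R.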

There are often several ways to accomplish a task in R

x = c(1, 2, 3)
x
## [1] 1 2 3
x <- c(4, 5, 6)
x
## [1] 4 5 6
x <- 7:9
x
## [1] 7 8 9
10:12 -> x
x
## [1] 10 11 12

Sometimes R does ‘surprising’ things that can be fun to figure out

x <- c(1, 2, 3) -> y
x
## [1] 1 2 3
y
## [1] 1 2 3
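
One way to start figuring out the 'surprise' is to ask R how it parsed the expression. The sketch below captures the expression with quote(), which does not evaluate it, and then looks at its pieces; the name expr is only for illustration.

```r
# Capture the expression unevaluated, then inspect its components
expr <- quote(x <- c(1, 2, 3) -> y)

expr[[1]]    # the outermost function being called
expr[[2]]    # its first argument: the target of the outer assignment
expr[[3]]    # its second argument: itself an assignment
```

The parser rewrites `value -> name` as `name <- value`, so the inner call assigns the vector to y, and the value of that assignment is then assigned to x.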

10.1 R Data types: vector and list

‘Atomic’ vectors

  • Types include integer, numeric (floating-point; real), complex, logical, character, raw (bytes)

    people <- c("Lori", "Nitesh", "Valerie", "Herve")
    people
    ## [1] "Lori"    "Nitesh"  "Valerie" "Herve"
  • Atomic vectors can be named

    population <- c(Buffalo=259000, Rochester=210000, `New York`=8400000)
    population
    ##   Buffalo Rochester  New York 
    ##    259000    210000   8400000
    log10(population)
    ##   Buffalo Rochester  New York 
    ##  5.413300  5.322219  6.924279
  • Statistical concepts like NA (“not available”)

    truthiness <- c(TRUE, FALSE, NA)
    truthiness
    ## [1]  TRUE FALSE    NA
  • Logical concepts like ‘and’ (&), ‘or’ (|), and ‘not’ (!)

    !truthiness
    ## [1] FALSE  TRUE    NA
    truthiness | !truthiness
    ## [1] TRUE TRUE   NA
    truthiness & !truthiness
    ## [1] FALSE FALSE    NA
  • Numerical concepts like infinity (Inf) or not-a-number (NaN, e.g., 0 / 0)

    undefined_numeric_values <- c(NA, 0/0, NaN, Inf, -Inf)
    undefined_numeric_values
    ## [1]   NA  NaN  NaN  Inf -Inf
    sqrt(undefined_numeric_values)
    ## Warning in sqrt(undefined_numeric_values): NaNs produced
    ## [1]  NA NaN NaN Inf NaN
  • Common string manipulations

    toupper(people)
    ## [1] "LORI"    "NITESH"  "VALERIE" "HERVE"
    substr(people, 1, 3)
    ## [1] "Lor" "Nit" "Val" "Her"
  • R is a green consumer – recycling short vectors to align with long vectors

    x <- 1:3
    x * 2            # '2' (vector of length 1) recycled to c(2, 2, 2)
    ## [1] 2 4 6
    truthiness | NA
    ## [1] TRUE   NA   NA
    truthiness & NA
    ## [1]    NA FALSE    NA
  • It’s very common to nest operations, which can be simultaneously compact, confusing, and expressive ([: subset; <: less than)

    substr(tolower(people), 1, 3)
    ## [1] "lor" "nit" "val" "her"
    population[population < 1000000]
    ##   Buffalo Rochester 
    ##    259000    210000
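
Recycling also applies when the shorter vector has more than one element. The following is a small sketch with made-up values (the name scaled is hypothetical) showing the shorter vector being repeated to match the longer one.

```r
x <- 1:6
scaled <- x * c(1, 10)    # c(1, 10) is recycled to c(1, 10, 1, 10, 1, 10)
scaled
## [1]  1 20  3 40  5 60

# Recycling still happens when the longer length is not a multiple of the
# shorter one, but R issues a warning about the mismatch
uneven <- suppressWarnings(1:3 + 1:2)
uneven
## [1] 2 4 4
```

suppressWarnings() is used here only to keep the example quiet; in real code the warning is often a useful sign of a mistake.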

Lists

  • The list type can contain other vectors, including other lists

    frenemies = list(
        friends=c("Larry", "Richard", "Vivian"),
        enemies=c("Dick", "Mike")
    )
    frenemies
    ## $friends
    ## [1] "Larry"   "Richard" "Vivian" 
    ## 
    ## $enemies
    ## [1] "Dick" "Mike"
  • [ subsets one list to create another list, [[ extracts a list element

    frenemies[1]
    ## $friends
    ## [1] "Larry"   "Richard" "Vivian"
    frenemies[c("enemies", "friends")]
    ## $enemies
    ## [1] "Dick" "Mike"
    ## 
    ## $friends
    ## [1] "Larry"   "Richard" "Vivian"
    frenemies[["enemies"]]
    ## [1] "Dick" "Mike"
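
The difference between [ and [[ is easiest to see by checking the class of each result. This sketch rebuilds the frenemies list from above so that it runs on its own.

```r
frenemies <- list(
    friends = c("Larry", "Richard", "Vivian"),
    enemies = c("Dick", "Mike")
)

class(frenemies["enemies"])      # single bracket: still a list
## [1] "list"
class(frenemies[["enemies"]])    # double bracket: the element itself
## [1] "character"
frenemies$enemies                # `$` extracts by name, much like [[
## [1] "Dick" "Mike"
lengths(frenemies)               # number of items in each element
## friends enemies 
##       3       2 
```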

Factors

  • Character-like vectors, but with values restricted to specific levels

    sex = factor(c("Male", "Male", "Female"),
                 levels=c("Female", "Male", "Hermaphrodite"))
    sex
    ## [1] Male   Male   Female
    ## Levels: Female Male Hermaphrodite
    sex == "Female"
    ## [1] FALSE FALSE  TRUE
    table(sex)
    ## sex
    ##        Female          Male Hermaphrodite 
    ##             1             2             0
    sex[sex == "Female"]
    ## [1] Female
    ## Levels: Female Male Hermaphrodite
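
A factor stores integer codes that index into its levels, which is why unused levels such as Hermaphrodite persist after subsetting. The sketch below, which recreates sex so that it is self-contained, shows the pieces and one way to discard unused levels.

```r
sex <- factor(c("Male", "Male", "Female"),
              levels = c("Female", "Male", "Hermaphrodite"))

levels(sex)          # the allowed values, in order
## [1] "Female"        "Male"          "Hermaphrodite"
as.integer(sex)      # underlying integer codes index into levels()
## [1] 2 2 1
droplevels(sex)      # discard levels that do not occur in the data
## [1] Male   Male   Female
## Levels: Female Male
```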

10.2 Classes: data.frame and beyond

Variables are often related to one another in a highly structured way, e.g., two ‘columns’ of data in a spreadsheet

x = rnorm(1000)       # 1000 random normal deviates
y = x + rnorm(1000)   # another 1000 deviates, as a function of x
plot(y ~ x)           # relationship between x and y

Convenient to manipulate them together

  • data.frame(): like columns in a spreadsheet

    df = data.frame(X=x, Y=y)
    head(df)           # first 6 rows
    ##             X          Y
    ## 1 -0.23926790 -0.3937137
    ## 2 -0.07412402 -0.3799202
    ## 3  0.99624820  1.1667748
    ## 4  0.39307812  0.8211909
    ## 5  0.57966738  2.3522918
    ## 6  0.42136109  0.2216827
    plot(Y ~ X, df)    # same as above

  • See all data with View(df). Summarize data with summary(df)

    summary(df)
    ##        X                   Y           
    ##  Min.   :-3.488112   Min.   :-3.69944  
    ##  1st Qu.:-0.669377   1st Qu.:-0.90624  
    ##  Median : 0.004582   Median : 0.03182  
    ##  Mean   : 0.007979   Mean   : 0.05199  
    ##  3rd Qu.: 0.703727   3rd Qu.: 1.02445  
    ##  Max.   : 3.362927   Max.   : 4.96088
  • Easy to manipulate data in a coordinated way, e.g., access column X with $ and subset for just those values greater than 0

    positiveX = df[df$X > 0,]
    head(positiveX)
    ##           X         Y
    ## 3 0.9962482 1.1667748
    ## 4 0.3930781 0.8211909
    ## 5 0.5796674 2.3522918
    ## 6 0.4213611 0.2216827
    ## 7 0.9418160 1.2713186
    ## 8 0.6829870 2.8699771
    plot(Y ~ X, positiveX)

  • R is introspective – ask it about itself

    class(df)
    ## [1] "data.frame"
    dim(df)
    ## [1] 1000    2
    colnames(df)
    ## [1] "X" "Y"
  • matrix() creates a related class, in which all elements have the same type (a data.frame() requires elements within a column to share a type, but different columns can have different types).
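
The type rule for matrices versus data.frames can be checked directly. In this sketch the objects m and d and their contents are made up for illustration; stringsAsFactors = FALSE is passed explicitly so the result does not depend on the R version.

```r
# A matrix holds one type, so mixing numbers and strings coerces everything
m <- cbind(id = 1:3, name = c("a", "b", "c"))
typeof(m)
## [1] "character"

# A data.frame keeps a separate type for each column
d <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)
class(d$id)
## [1] "integer"
class(d$name)
## [1] "character"
```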

A scatterplot makes one want to fit a linear model (do a regression analysis)

  • Use a formula to describe the relationship between variables
  • Variables named in the formula (Y and X) are looked up in the second argument, the data.frame df

    fit <- lm(Y ~ X, df)
  • Visualize the points, and add the regression line

    plot(Y ~ X, df)
    abline(fit, col="red", lwd=3)

  • Summarize the fit as an ANOVA table

    anova(fit)
    ## Analysis of Variance Table
    ## 
    ## Response: Y
    ##            Df Sum Sq Mean Sq F value    Pr(>F)    
    ## X           1 982.56  982.56  1007.4 < 2.2e-16 ***
    ## Residuals 998 973.42    0.98                      
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • N.B. – ‘Type I’ sums-of-squares, so order of independent variables matters; use drop1() for ‘Type III’. See DataCamp Quick-R

  • Introspection – what class is fit? What methods can I apply to an object of that class?

    class(fit)
    ## [1] "lm"
    methods(class=class(fit))
    ##  [1] add1           alias          anova          case.names    
    ##  [5] coerce         confint        cooks.distance deviance      
    ##  [9] dfbeta         dfbetas        drop1          dummy.coef    
    ## [13] effects        extractAIC     family         formula       
    ## [17] hatvalues      influence      initialize     kappa         
    ## [21] labels         logLik         model.frame    model.matrix  
    ## [25] nobs           plot           predict        print         
    ## [29] proj           qr             residuals      rstandard     
    ## [33] rstudent       show           simulate       slotsFromS3   
    ## [37] summary        variable.names vcov          
    ## see '?methods' for accessing help and source code
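
A few of these functions can be tried on a fitted model right away. The sketch below regenerates data in the same way as above, with an arbitrary seed (set.seed(123) is an assumption added for reproducibility, not part of the original), so the numbers will differ from the output shown earlier.

```r
set.seed(123)                        # arbitrary seed, for reproducibility
x <- rnorm(1000)
y <- x + rnorm(1000)
df <- data.frame(X = x, Y = y)

fit <- lm(Y ~ X, df)                 # X and Y are looked up in df

coef(fit)                            # intercept and slope estimates
head(residuals(fit))                 # observed minus fitted values
new_points <- data.frame(X = c(-1, 0, 1))
predict(fit, new_points)             # fitted Y at new values of X
```

Because y was simulated as x plus noise, the estimated slope should be close to 1 and the intercept close to 0.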

10.3 Help!

Help is available in RStudio or interactively

  • Check out the help page for rnorm()

    ?rnorm
  • ‘Usage’ section describes how the function can be used

    rnorm(n, mean = 0, sd = 1)
  • Arguments may have default values (here mean = 0 and sd = 1). Arguments are matched first by name, then by position

  • ‘Arguments’ section describes what the arguments are supposed to be

  • ‘Value’ section describes return value

  • ‘Examples’ section illustrates use

  • Help pages often include citations to relevant technical documentation, references to related functions, and obscure details

  • Can be intimidating, but in the end actually very useful
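
The matching rule from the 'Usage' discussion can be tested by calling rnorm() in several equivalent ways. This is a sketch: the seed is arbitrary, and the names by_position, by_name and mixed are made up for illustration.

```r
args(rnorm)                               # show the signature at the prompt

set.seed(1)                               # reset the seed before each call
by_position <- rnorm(3, 10, 2)            # n, mean, sd matched by position
set.seed(1)
by_name <- rnorm(3, sd = 2, mean = 10)    # named arguments in any order
set.seed(1)
mixed <- rnorm(sd = 2, 3, mean = 10)      # names first; the leftover 3 becomes n

identical(by_position, by_name)
## [1] TRUE
identical(by_position, mixed)
## [1] TRUE
```

Naming arguments other than the first one or two tends to make calls easier to read, even though matching by position works.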