We will explore a subset of data collected by the CDC through its extensive Behavioral Risk Factor Surveillance System ([BRFSS][]) telephone survey. Check out the link for more information. We’ll look at a subset of the data.
First, we need to get the data. Either download the data from THIS LINK or have R do it directly from the command-line (preferred):
download.file('https://raw.githubusercontent.com/seandavi/ITR/master/BRFSS-subset.csv',
destfile = 'BRFSS-subset.csv')
path <- file.choose() # look for BRFSS-subset.csv
stopifnot(file.exists(path))
brfss <- read.csv(path)
Using the data exploration techniques you have seen to explore the brfss dataset.
You may want to investigate individual columns visually using
plotting like hist(). For categorical data, consider using
something like table().
R read Year as an integer value, but it’s
really a factor
brfss$Year <- factor(brfss$Year)
brfssFemale <- brfss[brfss$Sex == "Female",]
summary(brfssFemale)
## Age Weight Sex Height
## Min. :18.00 Min. : 36.29 Length:12039 Min. :105.0
## 1st Qu.:37.00 1st Qu.: 57.61 Class :character 1st Qu.:157.5
## Median :52.00 Median : 65.77 Mode :character Median :163.0
## Mean :51.92 Mean : 69.05 Mean :163.3
## 3rd Qu.:67.00 3rd Qu.: 77.11 3rd Qu.:168.0
## Max. :99.00 Max. :272.16 Max. :200.7
## NA's :103 NA's :560 NA's :140
## Year
## 1990:5718
## 2010:6321
##
##
##
##
##
plot(Weight ~ Year, brfssFemale)

t.test(Weight ~ Year, brfssFemale)
##
## Welch Two Sample t-test
##
## data: Weight by Year
## t = -27.133, df = 11079, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 1990 and group 2010 is not equal to 0
## 95 percent confidence interval:
## -8.723607 -7.548102
## sample estimates:
## mean in group 1990 mean in group 2010
## 64.81838 72.95424
brfss2010Male <- subset(brfss, Year == 2010 & Sex == "Male")
summary(brfss2010Male)
## Age Weight Sex Height Year
## Min. :18.00 Min. : 36.29 Length:3679 Min. :135 1990: 0
## 1st Qu.:45.00 1st Qu.: 77.11 Class :character 1st Qu.:173 2010:3679
## Median :57.00 Median : 86.18 Mode :character Median :178
## Mean :56.25 Mean : 88.85 Mean :178
## 3rd Qu.:68.00 3rd Qu.: 99.79 3rd Qu.:183
## Max. :99.00 Max. :278.96 Max. :218
## NA's :30 NA's :49 NA's :31
hist(brfss2010Male$Weight)

hist(brfss2010Male$Height)

plot(Weight ~ Height, brfss2010Male)

fit <- lm(Weight ~ Height, brfss2010Male)
fit
##
## Call:
## lm(formula = Weight ~ Height, data = brfss2010Male)
##
## Coefficients:
## (Intercept) Height
## -86.8747 0.9873
Summarize as ANOVA table
anova(fit)
## Analysis of Variance Table
##
## Response: Weight
## Df Sum Sq Mean Sq F value Pr(>F)
## Height 1 197664 197664 693.8 < 2.2e-16 ***
## Residuals 3617 1030484 285
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(Weight ~ Height, brfss2010Male)
abline(fit, col="blue", lwd=2)
# Substitute your own weight and height...
points(73 * 2.54, 178 / 2.2, col="red", cex=4, pch=20)

class(fit) # 'noun'
methods(class=class(fit)) # 'verb'
plot(fit)
# Note that the "plot" above does not have a ".lm"
# However, R will use "plot.lm". Why?
?plot.lm