We will explore a subset of data collected by the CDC through its extensive Behavioral Risk Factor Surveillance System ([BRFSS][]) telephone survey. Check out the link for more information. We’ll look at a subset of the data.
First, we need to get the data. Either download the data from THIS LINK or have R do it directly from the command-line (preferred):
download.file('https://raw.githubusercontent.com/seandavi/ITR/master/BRFSS-subset.csv',
destfile = 'BRFSS-subset.csv')
path <- file.choose() # look for BRFSS-subset.csv
stopifnot(file.exists(path))
brfss <- read.csv(path)
Using the data exploration techniques you have seen to explore the brfss dataset.
You may want to investigate individual columns visually using
plotting like hist()
. For categorical data, consider using
something like table()
.
R read Year
as an integer value, but it’s
really a factor
brfss$Year <- factor(brfss$Year)
brfssFemale <- brfss[brfss$Sex == "Female",]
summary(brfssFemale)
## Age Weight Sex Height
## Min. :18.00 Min. : 36.29 Length:12039 Min. :105.0
## 1st Qu.:37.00 1st Qu.: 57.61 Class :character 1st Qu.:157.5
## Median :52.00 Median : 65.77 Mode :character Median :163.0
## Mean :51.92 Mean : 69.05 Mean :163.3
## 3rd Qu.:67.00 3rd Qu.: 77.11 3rd Qu.:168.0
## Max. :99.00 Max. :272.16 Max. :200.7
## NA's :103 NA's :560 NA's :140
## Year
## 1990:5718
## 2010:6321
##
##
##
##
##
plot(Weight ~ Year, brfssFemale)
t.test(Weight ~ Year, brfssFemale)
##
## Welch Two Sample t-test
##
## data: Weight by Year
## t = -27.133, df = 11079, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 1990 and group 2010 is not equal to 0
## 95 percent confidence interval:
## -8.723607 -7.548102
## sample estimates:
## mean in group 1990 mean in group 2010
## 64.81838 72.95424
brfss2010Male <- subset(brfss, Year == 2010 & Sex == "Male")
summary(brfss2010Male)
## Age Weight Sex Height Year
## Min. :18.00 Min. : 36.29 Length:3679 Min. :135 1990: 0
## 1st Qu.:45.00 1st Qu.: 77.11 Class :character 1st Qu.:173 2010:3679
## Median :57.00 Median : 86.18 Mode :character Median :178
## Mean :56.25 Mean : 88.85 Mean :178
## 3rd Qu.:68.00 3rd Qu.: 99.79 3rd Qu.:183
## Max. :99.00 Max. :278.96 Max. :218
## NA's :30 NA's :49 NA's :31
hist(brfss2010Male$Weight)
hist(brfss2010Male$Height)
plot(Weight ~ Height, brfss2010Male)
fit <- lm(Weight ~ Height, brfss2010Male)
fit
##
## Call:
## lm(formula = Weight ~ Height, data = brfss2010Male)
##
## Coefficients:
## (Intercept) Height
## -86.8747 0.9873
Summarize as ANOVA table
anova(fit)
## Analysis of Variance Table
##
## Response: Weight
## Df Sum Sq Mean Sq F value Pr(>F)
## Height 1 197664 197664 693.8 < 2.2e-16 ***
## Residuals 3617 1030484 285
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(Weight ~ Height, brfss2010Male)
abline(fit, col="blue", lwd=2)
# Substitute your own weight and height...
points(73 * 2.54, 178 / 2.2, col="red", cex=4, pch=20)
class(fit) # 'noun'
methods(class=class(fit)) # 'verb'
plot(fit)
# Note that the "plot" above does not have a ".lm"
# However, R will use "plot.lm". Why?
?plot.lm