11 Exercise 1: BRFSS Survey Data

We will explore a subset of data collected by the CDC through its extensive Behavioral Risk Factor Surveillance System (BRFSS) telephone survey. See the CDC's BRFSS site for more information.

Use file.choose() to find the path to the file ‘BRFSS-subset.csv’.
```
path <- file.choose()
```

Input the data using read.csv(), assigning the result to a variable brfss.
```
brfss <- read.csv(path)
```
Use commands like class(), head(), dim(), and summary() to explore the data.
- What variables have been measured?
- Can you guess at the units used for, e.g., Weight and Height?
```
class(brfss)
head(brfss)
dim(brfss)
summary(brfss)
```

Use the $ operator to extract the ‘Sex’ column, and summarize the number of males and females in the survey using table(). Do the same for ‘Year’, and then cross-tabulate Sex and Year together.

table(brfss$Sex)

## 
## Female   Male 
##  12039   7961

table(brfss$Year)

## 
##  1990  2010 
## 10000 10000

table(brfss$Sex, brfss$Year)

##         
##          1990 2010
##   Female 5718 6321
##   Male   4282 3679

with(brfss, table(Sex, Year))                # same, but easier

##         Year
## Sex      1990 2010
##   Female 5718 6321
##   Male   4282 3679

Use aggregate() to summarize the mean weight of each group. What about the median weight of each group? What about the number of observations in each group?

with(brfss, aggregate(Weight, list(Year, Sex), mean, na.rm=TRUE))

##   Group.1 Group.2        x
## 1    1990  Female 64.81838
## 2    2010  Female 72.95424
## 3    1990    Male 81.17999
## 4    2010    Male 88.84657

with(brfss, aggregate(Weight, list(Year=Year, Sex=Sex), mean, na.rm=TRUE))

##   Year    Sex        x
## 1 1990 Female 64.81838
## 2 2010 Female 72.95424
## 3 1990   Male 81.17999
## 4 2010   Male 88.84657
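For the median and the number of observations, the same aggregate() call can be reused with a different summary function. The sketch below runs on a small made-up data frame (`toy` is a hypothetical stand-in, not BRFSS data) so that it is self-contained. Note that length() counts missing values too, so the count here uses a small helper function instead.

```r
# Toy stand-in for brfss; the values are made up for illustration only
toy <- data.frame(
    Year   = c(1990, 1990, 1990, 2010, 2010, 2010),
    Sex    = c("Female", "Female", "Male", "Female", "Male", "Male"),
    Weight = c(60, NA, 80, 70, 90, 86)
)

# Median weight per group; na.rm=TRUE is passed through to median()
med <- with(toy, aggregate(Weight, list(Year=Year, Sex=Sex), median, na.rm=TRUE))

# Number of non-missing weights per group; length() would also count NA
n <- with(toy, aggregate(Weight, list(Year=Year, Sex=Sex),
                         function(x) sum(!is.na(x))))

med
n
```

On the real data, replacing `toy` with `brfss` should give the corresponding group medians and counts.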

Use a formula and the aggregate() function to describe the relationship between Year, Sex, and Weight.

aggregate(Weight ~ Year + Sex, brfss, mean)  # same, but more informative

##   Year    Sex   Weight
## 1 1990 Female 64.81838
## 2 2010 Female 72.95424
## 3 1990   Male 81.17999
## 4 2010   Male 88.84657

aggregate(. ~ Year + Sex, brfss, mean)       # all variables

##   Year    Sex      Age   Weight   Height
## 1 1990 Female 46.09153 64.84333 163.2914
## 2 2010 Female 57.07807 73.03178 163.2469
## 3 1990   Male 43.87574 81.19496 178.2242
## 4 2010   Male 56.25465 88.91136 178.0139

Create a subset of the data consisting of only the 1990 observations. Perform a t-test comparing the weight of males and females (“‘Weight’ as a function of ‘Sex’”, written Weight ~ Sex).

brfss_1990 <- brfss[brfss$Year == 1990,]
t.test(Weight ~ Sex, brfss_1990)

## 
##  Welch Two Sample t-test
## 
## data:  Weight by Sex
## t = -58.734, df = 9214, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -16.90767 -15.81554
## sample estimates:
## mean in group Female   mean in group Male 
##             64.81838             81.17999

t.test(Weight ~ Sex, brfss, subset = Year == 1990)

## 
##  Welch Two Sample t-test
## 
## data:  Weight by Sex
## t = -58.734, df = 9214, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -16.90767 -15.81554
## sample estimates:
## mean in group Female   mean in group Male 
##             64.81838             81.17999

What about differences between weights of males (or females) in 1990 versus 2010? Check out the help page ?t.test.formula. Is there a way of performing a t-test on brfss without explicitly creating the object brfss_1990?
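One way to approach the 1990 versus 2010 comparison is to swap the roles of the variables: put Year on the right-hand side of the formula and use the subset argument to select one sex. The sketch below uses a small made-up data frame (`toy` is hypothetical, not BRFSS data) so that it runs on its own.

```r
# Toy stand-in for brfss; the values are made up for illustration only
toy <- data.frame(
    Year   = rep(c(1990, 2010), each = 4, times = 2),
    Sex    = rep(c("Male", "Female"), each = 8),
    Weight = c(78, 80, 82, 84, 86, 88, 90, 92,
               62, 64, 66, 68, 70, 72, 74, 76)
)

# Weight as a function of Year, restricted to males via 'subset';
# no intermediate data frame is created
res <- t.test(Weight ~ Year, toy, subset = Sex == "Male")

res$estimate   # group means for 1990 and 2010
res$p.value
```

With the real data, the analogous call would be t.test(Weight ~ Year, brfss, subset = Sex == "Male"), and likewise with "Female".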

Use boxplot() to plot the weights of the Male individuals. Can you transform weight, e.g., sqrt(Weight) ~ Year? Interpret the results. Do similar boxplots for the t-tests of the previous question.
```
boxplot(Weight ~ Year, brfss, subset = Sex == "Male",
        main="Males")
```
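A transformation can be written directly on the left-hand side of the formula. The sketch below applies sqrt() to a small made-up data frame (`toy` and its values are hypothetical) and keeps the value that boxplot() returns invisibly, so the group medians can be inspected as numbers as well as on the plot.

```r
# Toy stand-in for the male subset of brfss; values are made up
toy <- data.frame(
    Year   = rep(c(1990, 2010), each = 4),
    Sex    = "Male",
    Weight = c(64, 81, 81, 100, 81, 100, 100, 121)
)

# Same call as for the untransformed weights, with sqrt() in the formula
b <- boxplot(sqrt(Weight) ~ Year, toy, subset = Sex == "Male",
             main="Males", ylab="sqrt(Weight)")

b$names        # group labels, one per box
b$stats[3, ]   # third row of 'stats' holds the median of each group
```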

Use hist() to plot a histogram of weights of the 1990 Female individuals.

hist(brfss_1990[brfss_1990$Sex == "Female", "Weight"],
     main="Females, 1990", xlab="Weight")