6  Exploratory data analysis

6.1 What is exploratory data analysis?

Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

EDA is a crucial step in the data analysis process. It allows us to understand the data, identify patterns, and develop hypotheses. EDA is also useful for identifying outliers, missing values, and other data quality issues.

It is worth noting that EDA is not a formal process. There are no strict rules or guidelines for conducting EDA. Instead, it is an iterative process that involves exploring the data from multiple angles to gain a comprehensive understanding of the data.

6.2 The insurance dataset

We are going to load up our insurance.csv dataset again and explore it in detail.

insurance_url <- "https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv"
insurance <- readr::read_csv(insurance_url)
Rows: 1338 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): sex, smoker, region
dbl (4): age, bmi, children, charges

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

At this point, you might want to refresh your memory of the contents of the insurance dataset.

  • How many rows does thd dataset have?
  • How many columns?
  • What are the column names for the dataset?
  • What are the data types in each column?
Possible solutions to the questions
nrow(insurance)
ncol(insurance)
# You could also use dim(insurance)
colnames(insurance)
# the `tibble` object gives column types
head(insurance)
# OR
summary(insurance)
# OR
lapply(insurance, class)

Before going further, let’s add the obese column again.

insurance$obese <- ifelse(insurance$bmi > 30, "obese", "not obese")

Check the structure of the insurance dataset again to ensure that you got what you wanted, a new column with either ‘obese’ or ‘not obese’ in each row.

6.3 Explore each variable (column)

When presented with a dataframe like the insurance data, we may want to interrogate the various columns to figure out what is in them. In this dataset, the columns are of two flavors. We can use the terms column and variable interchangably here since each column represents the measurements of that variable on a person.

The sex, smoker, and region variables are all categorical variables, in that they represent categories of “sex”, “smoker”, and “region”. The age, bmi, children, and charge variables represent numbers.

6.3.1 Categorical variables

For categorial variables, there are some useful functions to help summarize the data. But to do so, we need to be able to pull out the individual columns. Let’s take the “region” column as an example. All of the following will get us the region column.

insurance$region
insurance[["region"]]
insurance[, "region"]

The unique() function returns all unique values of a variable. The table() function counts the number of each value. While min() and max() are perhaps not that meaningful, they can sometimes be useful, even for categorical variables.

Let’s apply these to the region variable/column.

unique(insurance$region)
[1] "southwest" "southeast" "northwest" "northeast"
table(insurance[["region"]])

northeast northwest southeast southwest 
      324       325       364       325 
min(insurance$region)
[1] "northeast"
max(insurance$region)
[1] "southwest"

Can you explain what the min() and max() are doing here?

Do the same exercise with sex, smoker, and obese variables.

6.3.2 Numeric variables

The other variables in our insurance dataset are numeric. Note that they are not all continuous numbers, though, with some of them being quite discrete (there are no fractional numbers of children).

For numerical variables, we can start to use statistics to summarize the data.

What is a statistic?

A statistic is a single value that summarizes a dataset. We use them all the time in data analysis. The mean, median, standard deviation, mode, etc. are all univariate statistics. Correlation is an example of a bivariate statistic that summarizes the relationship between two variables. Statistics like the t-statistic summarize the differences in centrality between two samples.

The summary() function gets us a bunch of statistics quickly.

summary(insurance)
      age            sex                 bmi           children    
 Min.   :18.00   Length:1338        Min.   :15.96   Min.   :0.000  
 1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000  
 Median :39.00   Mode  :character   Median :30.40   Median :1.000  
 Mean   :39.21                      Mean   :30.66   Mean   :1.095  
 3rd Qu.:51.00                      3rd Qu.:34.69   3rd Qu.:2.000  
 Max.   :64.00                      Max.   :53.13   Max.   :5.000  
    smoker             region             charges         obese          
 Length:1338        Length:1338        Min.   : 1122   Length:1338       
 Class :character   Class :character   1st Qu.: 4740   Class :character  
 Mode  :character   Mode  :character   Median : 9382   Mode  :character  
                                       Mean   :13270                     
                                       3rd Qu.:16640                     
                                       Max.   :63770                     

We can also apply individual statistical measures to a single column.

mean(insurance$age)
[1] 39.20703

And even though the children variable/column is a numeric column, we may still be interested in the unique values or a table showing the distribution of the number of children per patient.

unique(insurance$children)
[1] 0 1 3 2 5 4
table(insurance$children)

  0   1   2   3   4   5 
574 324 240 157  25  18 

We may also be interested in seeing the distribution of a numeric variable graphically. The histogram hist() is probably the most common way of examining the distribution of a numeric variable.

hist(insurance$charges)

Plot the historgram of the other numeric variables.

Another approach is to use a boxplot.

boxplot(insurance$age)

6.4 Relationships between variables

We can also start to look at the relationships between variables. Let’s start with the relationship between age and charges.

plot(insurance$age, insurance$charges)

What do you see in the plot?

How about the relathionship between bmi and charges?

plot(insurance$bmi, insurance$charges)

What do you see in this plot?

6.4.1 Categorical vs. categorical

We can also look at the relationships between categorical variables. One way to do this is to use a contingency table.

table(insurance$sex, insurance$smoker)
        
          no yes
  female 547 115
  male   517 159

What do you see in this table? Is this a useful way to look at the data?

Another way to look at the relationship between two categorical variables is to use a mosaic plot. What is a mosaic plot? A mosaic plot is a special type of stacked bar chart that shows percentages of data in groups. The plot is a graphical representation of a contingency table. How are mosaic plots used? Mosaic plots are used to show relationships and to provide a visual comparison of groups.

mosaicplot(~ sex + smoker, data = insurance)

What do you see in this plot?

The formula interface

The ~ symbol is used to specify the formula interface in R. The formula interface is used in many R functions to specify the relationship between variables. In this case, we are specifying that we want to look at the relationship between sex and smoker in the insurance dataset.

In other uses, the formula interface is used to specify the relationship between the dependent and independent variables in a regression model. For example, y ~ x specifies that y is the dependent variable and x is the independent variable.

6.4.2 Categorical vs. numeric

We can also look at the relationship between a categorical and a numeric variable. One way to do this is to use a boxplot.

boxplot(charges ~ smoker, data = insurance)

What do you see in this plot?

Another way to look at the relationship between a categorical and a numeric variable is to use a violin plot. What is a violin plot? A violin plot is a method of plotting numeric data and can be considered a combination of the box plot and the kernel density plot.

This is our first introduction to the ggplot2 package. The ggplot2 package is a plotting system for R, based on the grammar of graphics. It is a powerful and flexible package that allows you to create complex plots with relatively little code.

library(ggplot2)
ggplot(insurance, aes(x = smoker, y = charges)) +
    geom_violin()

We’ll be using ggplot2 a lot in the future, so this is just a “hello world” example.