6 Reading and writing data files
6.1 Introduction
In this chapter, we will discuss how to read and write data files in R. Data files are essential for storing and sharing data across different platforms and applications. R provides a variety of functions and packages to read and write data files in different formats, such as text files, CSV files, Excel files. By mastering these functions, you can efficiently import and export data in R, enabling you to perform data analysis and visualization tasks effectively.
6.2 RStudio Projects: Organizing Your Work
Before diving into reading and writing files, it’s essential to understand how to organize your work effectively. RStudio Projects provide a powerful way to keep your files, scripts, and data organized in a self-contained workspace.
6.2.1 What are RStudio Projects?
An RStudio Project is a special folder that contains all the files associated with a particular analysis or research project. When you create a project, RStudio creates a .Rproj
file that serves as the anchor for your project workspace. This approach offers several key benefits:
- Consistent working directory: The project folder automatically becomes your working directory
- File organization: All related files (scripts, data, outputs) are kept together
- Reproducibility: Others can easily run your code without worrying about file paths
- Version control integration: Projects work seamlessly with Git and GitHub
6.2.2 Creating an RStudio Project
You can create a new RStudio Project in several ways:
-
File menu: Go to
File > New Project...
- Project dropdown: Click the project dropdown in the top-right corner and select “New Project”
- Choose New Directory: Create a project in a new folder.
When creating a project, you have three main options:
- New Directory: Create a fresh project folder
- Existing Directory: Turn an existing folder into a project
- Version Control: Clone a repository from GitHub or other version control systems
6.2.3 Project Structure Best Practices
A well-organized project typically follows a consistent structure (that YOU define). Here’s a common structure that you might consider:
my_analysis_project/
├── my_analysis_project.Rproj
├── data/
│ ├── raw/
│ └── processed/
├── scripts/
├── notebooks/
├── outputs/
│ ├── figures/
│ └── tables/
├── README.md
└── .gitignore
This structure separates raw data from processed data, keeps scripts organized, and provides clear locations for outputs.
6.2.4 Working Directories and File Paths
One of the most significant advantages of using RStudio Projects is that they solve the common problem of file path management. When you open a project, RStudio automatically sets the working directory to the project folder. This means:
Relative paths make your code portable—anyone who opens your project will be able to run your scripts without modifying file paths.
6.2.5 The here
Package for Robust File Paths [optional]
For even more robust file path management, especially in complex projects or when using R Markdown documents, the here
package is invaluable:
install.packages("here")
library(here)
# The here() function always refers to the project root
df <- read.csv(here("data", "dataset.csv"))
The here
package works by finding the .Rproj
file and treating that location as the project root, regardless of where your current script is located within the project hierarchy.
6.2.6 Projects and Reproducibility
RStudio Projects can play a key (but optional) role in creating reproducible analyses. When you share a project folder (or push it to GitHub), collaborators can:
- Download/clone the entire project
- Open the
.Rproj
file - Run your scripts without any setup or path modifications
This seamless workflow is essential for collaborative research and makes your work more credible and verifiable.
Now that we understand how to organize our work with RStudio Projects, let’s explore how to read and write the data files that will live within these organized project structures.
6.3 CSV files
Comma-Separated Values (CSV) files are a common file format for storing tabular data. They consist of rows and columns, with each row representing a record and each column representing a variable or attribute. CSV files are widely used for data storage and exchange due to their simplicity and compatibility with various software applications. In R, you can read and write CSV files using the read.csv()
and write.csv()
functions, respectively. A commonly used alternative is to use the readr
package, which provides faster and more user-friendly functions for reading and writing CSV files.
6.3.1 Writing a CSV file
Since we are going to use the readr
package, we need to install it first. You can install the readr
package using the following command:
install.packages("readr")
Once the package is installed, you can load it into your R session using the library()
function:
Since we don’t have a CSV file sitting around, let’s create a simple data frame to write to a CSV file. Here’s an example data frame:
df <- data.frame(
id = c(1, 2, 3, 4, 5),
name = c("Alice", "Bob", "Charlie", "David", "Eve"),
age = c(25, 30, 35, 40, 45)
)
Now, you can write this data frame to a CSV file using the write_csv()
function from the readr
package. Here’s how you can do it:
write_csv(df, "data.csv")
You can check the current working directory to see if the CSV file was created successfully. If you want to specify a different directory or file path, you can provide the full path in the write_csv()
function. If you have created an RStudio Project, the file will be saved in the project directory by default.
6.3.2 Reading a CSV file
Now that we have a CSV file, let’s read it back into R using the read_csv()
function from the readr
package. Here’s how you can do it:
df2 <- read_csv("data.csv")
Rows: 5 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): name
dbl (2): id, age
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
You can check the structure of the data frame df2
to verify that the data was read correctly:
df2
# A tibble: 5 × 3
id name age
<dbl> <chr> <dbl>
1 1 Alice 25
2 2 Bob 30
3 3 Charlie 35
4 4 David 40
5 5 Eve 45
The readr
package can read CSV files with various delimiters, headers, and data types, making it a versatile tool for handling tabular data in R. It can also read CSV files directly from web locations like so:
df3 <- read_csv("https://data.cdc.gov/resource/pwn4-m3yp.csv")
Rows: 1000 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): state
dbl (6): tot_cases, new_cases, tot_deaths, new_deaths, new_historic_cases, ...
dttm (3): date_updated, start_date, end_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The dataset that you just downloaded is described here: Covid-19 data from CDC. As you can see, the read_csv()
function automatically detects the structure of the CSV file and imports it into a data frame in R. Using this approach, you can easily read CSV files from various sources, including local files and online datasets. Particularly for analysis of public datasets, this method is very useful as it allows you to access and analyze data without needing to download it manually and provides a high degree of reproducibility in your analyses.
6.4 Excel files
Microsoft Excel files are another common file format for storing tabular data. Excel files can contain multiple sheets, formulas, and formatting options, making them a popular choice for data storage and analysis. In R, you can read and write Excel files using the readxl
package. This package provides functions to import and export data from Excel files, enabling you to work with Excel data in R.
6.4.1 Reading an Excel file
To read an Excel file in R, you need to install and load the readxl
package. You can install the readxl
package using the following command:
install.packages("readxl")
Once the package is installed, you can load it into your R session using the library()
function:
Now, you can read an Excel file using the read_excel()
function from the readxl
package. We don’t have an excel file available, so let’s download one from the internet. Unlike with CSV riles, R needs to download the file first before reading it. Here’s an example of how to use R to download an Excel file:
download.file('https://www.w3resource.com/python-exercises/pandas/excel/SaleData.xlsx', 'SaleData.xlsx')
You can check the current working directory to see if the Excel file was downloaded successfully:
dir('.')
[1] "_book"
[2] "_extensions"
[3] "_freeze"
[4] "_quarto.yml"
[5] "310_microbiome.qmd"
[6] "additional_resources.qmd"
[7] "ai_tools.qmd"
[8] "appendix.qmd"
[9] "atac-seq"
[10] "atac-seq-talk.qmd"
[11] "bibliography.bib"
[12] "bioc_ranges.qmd"
[13] "bioc-summarizedexperiment.qmd"
[14] "BRFSS-subset.csv"
[15] "CSHLDataCourse.qmd"
[16] "data"
[17] "data_structures_overview.qmd"
[18] "data_vis_hadley.qmd"
[19] "data.csv"
[20] "data.xlsx"
[21] "dataframes_intro_cache"
[22] "dataframes_intro_files"
[23] "dataframes_intro.qmd"
[24] "dataviz.qmd"
[25] "DESCRIPTION"
[26] "dplyr_intro_msleep.qmd"
[27] "dplyr.qmd"
[28] "drawings"
[29] "eda_and_univariate_brfss_cache"
[30] "eda_and_univariate_brfss_files"
[31] "eda_and_univariate_brfss.qmd"
[32] "eda_overview.qmd"
[33] "eda_with_pca.qmd"
[34] "factors.qmd"
[35] "genomic_ranges_tutorial.qmd"
[36] "GenomicRanges Bioconductor educational examples_.md"
[37] "geoquery_cache"
[38] "geoquery_files"
[39] "geoquery.qmd"
[40] "git_and_github.qmd"
[41] "greenleaf.bw"
[42] "GSE2553_series_matrix.txt.gz"
[43] "GSE74089_series_matrix.txt.gz"
[44] "images"
[45] "index.aux"
[46] "index.idx"
[47] "index.lof"
[48] "index.log"
[49] "index.lot"
[50] "index.pdf"
[51] "index.qmd"
[52] "index.tex"
[53] "index.toc"
[54] "intro_to_rstudio.qmd"
[55] "intro.qmd"
[56] "kmeans.qmd"
[57] "license.qmd"
[58] "lists.qmd"
[59] "machine_learning"
[60] "machine_learning_mlr3_cache"
[61] "machine_learning_mlr3_files"
[62] "matrices.qmd"
[63] "matrix_exercises.qmd"
[64] "ml_practical.qmd"
[65] "NAMESPACE"
[66] "norm.qmd"
[67] "packages_and_dice.qmd"
[68] "protein_simulation_basics.qmd"
[69] "QTDublinIrish.otf"
[70] "r_basics.qmd"
[71] "r_intro_mechanics_cache"
[72] "r_intro_mechanics.qmd"
[73] "ranges_and_signals_cache"
[74] "ranges_and_signals_files"
[75] "ranges_and_signals.qmd"
[76] "ranges_exercises_cache"
[77] "ranges_exercises.qmd"
[78] "RBiocBook.Rproj"
[79] "reading_and_writing.qmd"
[80] "reading_and_writing.rmarkdown"
[81] "references.qmd"
[82] "SaleData.xlsx"
[83] "SE.svg"
[84] "simulation_basics_files"
[85] "simulation_basics.qmd"
[86] "single_cell"
[87] "t-statistic-simulation-exercises.qmd"
[88] "t-stats-and-tests.qmd"
[89] "talks"
[90] "tikz2d0f3bd7aa39.log"
[91] "tikz2e8e2666c9c0.log"
[92] "tikz3ed045a6259e.log"
[93] "tikzdc60f2f7d7b.log"
[94] "tikzf28a6369b1c3.log"
[95] "tikzf696458ac6c2.log"
[96] "tikzf83c5476af83.log"
[97] "transcriptdb.qmd"
[98] "vectors.qmd"
[99] "visualization_guide.qmd"
[100] "working_with_distributions.qmd"
Now, you can read the Excel file into R using the read_excel()
function:
df_excel <- read_excel("SaleData.xlsx")
You can check the structure of the data frame df_excel
to verify that the data was read correctly:
df_excel
# A tibble: 45 × 8
OrderDate Region Manager SalesMan Item Units Unit_price Sale_amt
<dttm> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2018-01-06 00:00:00 East Martha Alexander Tele… 95 1198 113810
2 2018-01-23 00:00:00 Central Hermann Shelli Home… 50 500 25000
3 2018-02-09 00:00:00 Central Hermann Luis Tele… 36 1198 43128
4 2018-02-26 00:00:00 Central Timothy David Cell… 27 225 6075
5 2018-03-15 00:00:00 West Timothy Stephen Tele… 56 1198 67088
6 2018-04-01 00:00:00 East Martha Alexander Home… 60 500 30000
7 2018-04-18 00:00:00 Central Martha Steven Tele… 75 1198 89850
8 2018-05-05 00:00:00 Central Hermann Luis Tele… 90 1198 107820
9 2018-05-22 00:00:00 West Douglas Michael Tele… 32 1198 38336
10 2018-06-08 00:00:00 East Martha Alexander Home… 60 500 30000
# ℹ 35 more rows
The readxl
package provides various options to read Excel files with multiple sheets, specific ranges, and data types, making it a versatile tool for handling Excel data in R.
6.4.2 Writing an Excel file
To write an Excel file in R, you can use the write_xlsx()
function from the writexl
package. You can install the writexl
package using the following command:
install.packages("writexl")
Once the package is installed, you can load it into your R session using the library()
function:
The write_xlsx()
function allows you to write a data frame to an Excel file. Here’s an example:
write_xlsx(df, "data.xlsx")
You can check the current working directory to see if the Excel file was created successfully. If you want to specify a different directory or file path, you can provide the full path in the write_xlsx()
function.
# see what the current working directory is
getwd()
[1] "/Users/davsean/Documents/git/RBiocBook"
# and check to see that the file was created
dir(pattern = "data.xlsx")
[1] "data.xlsx"
And, of course, if you find the file on your computer, you can open it in Excel to verify that the data was written correctly.
6.5 Additional options
- Google Sheets: You can read and write data from Google Sheets using the
googlesheets4
package. This package provides functions to interact with Google Sheets, enabling you to import and export data from Google Sheets to R. - JSON files: You can read and write JSON files using the
jsonlite
package. This package provides functions to convert R objects to JSON format and vice versa, enabling you to work with JSON data in R. - Database files: You can read and write data from database files using the
DBI
andRSQLite
packages. These packages provide functions to interact with various database systems, enabling you to import and export data from databases to R.