6  Reading and writing data files

Published

June 1, 2024

Modified

July 8, 2025

6.1 Introduction

In this chapter, we will discuss how to read and write data files in R. Data files are essential for storing and sharing data across different platforms and applications. R provides a variety of functions and packages to read and write data files in different formats, such as text files, CSV files, Excel files. By mastering these functions, you can efficiently import and export data in R, enabling you to perform data analysis and visualization tasks effectively.

6.2 RStudio Projects: Organizing Your Work

Before diving into reading and writing files, it’s essential to understand how to organize your work effectively. RStudio Projects provide a powerful way to keep your files, scripts, and data organized in a self-contained workspace.

6.2.1 What are RStudio Projects?

An RStudio Project is a special folder that contains all the files associated with a particular analysis or research project. When you create a project, RStudio creates a .Rproj file that serves as the anchor for your project workspace. This approach offers several key benefits:

  • Consistent working directory: The project folder automatically becomes your working directory
  • File organization: All related files (scripts, data, outputs) are kept together
  • Reproducibility: Others can easily run your code without worrying about file paths
  • Version control integration: Projects work seamlessly with Git and GitHub

6.2.2 Creating an RStudio Project

You can create a new RStudio Project in several ways:

  1. File menu: Go to File > New Project...
  2. Project dropdown: Click the project dropdown in the top-right corner and select “New Project”
  3. Choose New Directory: Create a project in a new folder.

When creating a project, you have three main options:

  • New Directory: Create a fresh project folder
  • Existing Directory: Turn an existing folder into a project
  • Version Control: Clone a repository from GitHub or other version control systems

6.2.3 Project Structure Best Practices

A well-organized project typically follows a consistent structure (that YOU define). Here’s a common structure that you might consider:

my_analysis_project/
├── my_analysis_project.Rproj
├── data/
│   ├── raw/
│   └── processed/
├── scripts/
├── notebooks/
├── outputs/
│   ├── figures/
│   └── tables/
├── README.md
└── .gitignore

This structure separates raw data from processed data, keeps scripts organized, and provides clear locations for outputs.

6.2.4 Working Directories and File Paths

One of the most significant advantages of using RStudio Projects is that they solve the common problem of file path management. When you open a project, RStudio automatically sets the working directory to the project folder. This means:

# Instead of using absolute paths like this:
df <- read.csv("/Users/username/Documents/my_analysis/data/dataset.csv")

# You can use relative paths like this:
df <- read.csv("data/dataset.csv")

Relative paths make your code portable—anyone who opens your project will be able to run your scripts without modifying file paths.

6.2.5 The here Package for Robust File Paths [optional]

For even more robust file path management, especially in complex projects or when using R Markdown documents, the here package is invaluable:

install.packages("here")
library(here)

# The here() function always refers to the project root
df <- read.csv(here("data", "dataset.csv"))

The here package works by finding the .Rproj file and treating that location as the project root, regardless of where your current script is located within the project hierarchy.

6.2.6 Projects and Reproducibility

RStudio Projects can play a key (but optional) role in creating reproducible analyses. When you share a project folder (or push it to GitHub), collaborators can:

  1. Download/clone the entire project
  2. Open the .Rproj file
  3. Run your scripts without any setup or path modifications

This seamless workflow is essential for collaborative research and makes your work more credible and verifiable.

Now that we understand how to organize our work with RStudio Projects, let’s explore how to read and write the data files that will live within these organized project structures.

6.3 CSV files

Comma-Separated Values (CSV) files are a common file format for storing tabular data. They consist of rows and columns, with each row representing a record and each column representing a variable or attribute. CSV files are widely used for data storage and exchange due to their simplicity and compatibility with various software applications. In R, you can read and write CSV files using the read.csv() and write.csv() functions, respectively. A commonly used alternative is to use the readr package, which provides faster and more user-friendly functions for reading and writing CSV files.

6.3.1 Writing a CSV file

Since we are going to use the readr package, we need to install it first. You can install the readr package using the following command:

Once the package is installed, you can load it into your R session using the library() function:

Since we don’t have a CSV file sitting around, let’s create a simple data frame to write to a CSV file. Here’s an example data frame:

df <- data.frame(
  id = c(1, 2, 3, 4, 5),
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age = c(25, 30, 35, 40, 45)
)

Now, you can write this data frame to a CSV file using the write_csv() function from the readr package. Here’s how you can do it:

write_csv(df, "data.csv")

You can check the current working directory to see if the CSV file was created successfully. If you want to specify a different directory or file path, you can provide the full path in the write_csv() function. If you have created an RStudio Project, the file will be saved in the project directory by default.

# see what the current working directory is
getwd()
[1] "/Users/davsean/Documents/git/RBiocBook"
# and check to see that the file was created
dir(pattern = "data.csv")
[1] "data.csv"

6.3.2 Reading a CSV file

Now that we have a CSV file, let’s read it back into R using the read_csv() function from the readr package. Here’s how you can do it:

df2 <- read_csv("data.csv")
Rows: 5 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): name
dbl (2): id, age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

You can check the structure of the data frame df2 to verify that the data was read correctly:

df2
# A tibble: 5 × 3
     id name      age
  <dbl> <chr>   <dbl>
1     1 Alice      25
2     2 Bob        30
3     3 Charlie    35
4     4 David      40
5     5 Eve        45

The readr package can read CSV files with various delimiters, headers, and data types, making it a versatile tool for handling tabular data in R. It can also read CSV files directly from web locations like so:

df3 <- read_csv("https://data.cdc.gov/resource/pwn4-m3yp.csv")
Rows: 1000 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): state
dbl  (6): tot_cases, new_cases, tot_deaths, new_deaths, new_historic_cases, ...
dttm (3): date_updated, start_date, end_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The dataset that you just downloaded is described here: Covid-19 data from CDC. As you can see, the read_csv() function automatically detects the structure of the CSV file and imports it into a data frame in R. Using this approach, you can easily read CSV files from various sources, including local files and online datasets. Particularly for analysis of public datasets, this method is very useful as it allows you to access and analyze data without needing to download it manually and provides a high degree of reproducibility in your analyses.

6.4 Excel files

Microsoft Excel files are another common file format for storing tabular data. Excel files can contain multiple sheets, formulas, and formatting options, making them a popular choice for data storage and analysis. In R, you can read and write Excel files using the readxl package. This package provides functions to import and export data from Excel files, enabling you to work with Excel data in R.

6.4.1 Reading an Excel file

To read an Excel file in R, you need to install and load the readxl package. You can install the readxl package using the following command:

Once the package is installed, you can load it into your R session using the library() function:

Now, you can read an Excel file using the read_excel() function from the readxl package. We don’t have an excel file available, so let’s download one from the internet. Unlike with CSV riles, R needs to download the file first before reading it. Here’s an example of how to use R to download an Excel file:

download.file('https://www.w3resource.com/python-exercises/pandas/excel/SaleData.xlsx', 'SaleData.xlsx')

You can check the current working directory to see if the Excel file was downloaded successfully:

dir('.')
  [1] "_book"                                              
  [2] "_extensions"                                        
  [3] "_freeze"                                            
  [4] "_quarto.yml"                                        
  [5] "310_microbiome.qmd"                                 
  [6] "additional_resources.qmd"                           
  [7] "ai_tools.qmd"                                       
  [8] "appendix.qmd"                                       
  [9] "atac-seq"                                           
 [10] "atac-seq-talk.qmd"                                  
 [11] "bibliography.bib"                                   
 [12] "bioc_ranges.qmd"                                    
 [13] "bioc-summarizedexperiment.qmd"                      
 [14] "BRFSS-subset.csv"                                   
 [15] "CSHLDataCourse.qmd"                                 
 [16] "data"                                               
 [17] "data_structures_overview.qmd"                       
 [18] "data_vis_hadley.qmd"                                
 [19] "data.csv"                                           
 [20] "data.xlsx"                                          
 [21] "dataframes_intro_cache"                             
 [22] "dataframes_intro_files"                             
 [23] "dataframes_intro.qmd"                               
 [24] "dataviz.qmd"                                        
 [25] "DESCRIPTION"                                        
 [26] "dplyr_intro_msleep.qmd"                             
 [27] "dplyr.qmd"                                          
 [28] "drawings"                                           
 [29] "eda_and_univariate_brfss_cache"                     
 [30] "eda_and_univariate_brfss_files"                     
 [31] "eda_and_univariate_brfss.qmd"                       
 [32] "eda_overview.qmd"                                   
 [33] "eda_with_pca.qmd"                                   
 [34] "factors.qmd"                                        
 [35] "genomic_ranges_tutorial.qmd"                        
 [36] "GenomicRanges Bioconductor educational examples_.md"
 [37] "geoquery_cache"                                     
 [38] "geoquery_files"                                     
 [39] "geoquery.qmd"                                       
 [40] "git_and_github.qmd"                                 
 [41] "greenleaf.bw"                                       
 [42] "GSE2553_series_matrix.txt.gz"                       
 [43] "GSE74089_series_matrix.txt.gz"                      
 [44] "images"                                             
 [45] "index.aux"                                          
 [46] "index.idx"                                          
 [47] "index.lof"                                          
 [48] "index.log"                                          
 [49] "index.lot"                                          
 [50] "index.pdf"                                          
 [51] "index.qmd"                                          
 [52] "index.tex"                                          
 [53] "index.toc"                                          
 [54] "intro_to_rstudio.qmd"                               
 [55] "intro.qmd"                                          
 [56] "kmeans.qmd"                                         
 [57] "license.qmd"                                        
 [58] "lists.qmd"                                          
 [59] "machine_learning"                                   
 [60] "machine_learning_mlr3_cache"                        
 [61] "machine_learning_mlr3_files"                        
 [62] "matrices.qmd"                                       
 [63] "matrix_exercises.qmd"                               
 [64] "ml_practical.qmd"                                   
 [65] "NAMESPACE"                                          
 [66] "norm.qmd"                                           
 [67] "packages_and_dice.qmd"                              
 [68] "protein_simulation_basics.qmd"                      
 [69] "QTDublinIrish.otf"                                  
 [70] "r_basics.qmd"                                       
 [71] "r_intro_mechanics_cache"                            
 [72] "r_intro_mechanics.qmd"                              
 [73] "ranges_and_signals_cache"                           
 [74] "ranges_and_signals_files"                           
 [75] "ranges_and_signals.qmd"                             
 [76] "ranges_exercises_cache"                             
 [77] "ranges_exercises.qmd"                               
 [78] "RBiocBook.Rproj"                                    
 [79] "reading_and_writing.qmd"                            
 [80] "reading_and_writing.rmarkdown"                      
 [81] "references.qmd"                                     
 [82] "SaleData.xlsx"                                      
 [83] "SE.svg"                                             
 [84] "simulation_basics_files"                            
 [85] "simulation_basics.qmd"                              
 [86] "single_cell"                                        
 [87] "t-statistic-simulation-exercises.qmd"               
 [88] "t-stats-and-tests.qmd"                              
 [89] "talks"                                              
 [90] "tikz2d0f3bd7aa39.log"                               
 [91] "tikz2e8e2666c9c0.log"                               
 [92] "tikz3ed045a6259e.log"                               
 [93] "tikzdc60f2f7d7b.log"                                
 [94] "tikzf28a6369b1c3.log"                               
 [95] "tikzf696458ac6c2.log"                               
 [96] "tikzf83c5476af83.log"                               
 [97] "transcriptdb.qmd"                                   
 [98] "vectors.qmd"                                        
 [99] "visualization_guide.qmd"                            
[100] "working_with_distributions.qmd"                     

Now, you can read the Excel file into R using the read_excel() function:

df_excel <- read_excel("SaleData.xlsx")

You can check the structure of the data frame df_excel to verify that the data was read correctly:

df_excel
# A tibble: 45 × 8
   OrderDate           Region  Manager SalesMan  Item  Units Unit_price Sale_amt
   <dttm>              <chr>   <chr>   <chr>     <chr> <dbl>      <dbl>    <dbl>
 1 2018-01-06 00:00:00 East    Martha  Alexander Tele…    95       1198   113810
 2 2018-01-23 00:00:00 Central Hermann Shelli    Home…    50        500    25000
 3 2018-02-09 00:00:00 Central Hermann Luis      Tele…    36       1198    43128
 4 2018-02-26 00:00:00 Central Timothy David     Cell…    27        225     6075
 5 2018-03-15 00:00:00 West    Timothy Stephen   Tele…    56       1198    67088
 6 2018-04-01 00:00:00 East    Martha  Alexander Home…    60        500    30000
 7 2018-04-18 00:00:00 Central Martha  Steven    Tele…    75       1198    89850
 8 2018-05-05 00:00:00 Central Hermann Luis      Tele…    90       1198   107820
 9 2018-05-22 00:00:00 West    Douglas Michael   Tele…    32       1198    38336
10 2018-06-08 00:00:00 East    Martha  Alexander Home…    60        500    30000
# ℹ 35 more rows

The readxl package provides various options to read Excel files with multiple sheets, specific ranges, and data types, making it a versatile tool for handling Excel data in R.

6.4.2 Writing an Excel file

To write an Excel file in R, you can use the write_xlsx() function from the writexl package. You can install the writexl package using the following command:

install.packages("writexl")

Once the package is installed, you can load it into your R session using the library() function:

The write_xlsx() function allows you to write a data frame to an Excel file. Here’s an example:

write_xlsx(df, "data.xlsx")

You can check the current working directory to see if the Excel file was created successfully. If you want to specify a different directory or file path, you can provide the full path in the write_xlsx() function.

# see what the current working directory is
getwd()
[1] "/Users/davsean/Documents/git/RBiocBook"
# and check to see that the file was created
dir(pattern = "data.xlsx")
[1] "data.xlsx"

And, of course, if you find the file on your computer, you can open it in Excel to verify that the data was written correctly.

6.5 Additional options

  • Google Sheets: You can read and write data from Google Sheets using the googlesheets4 package. This package provides functions to interact with Google Sheets, enabling you to import and export data from Google Sheets to R.
  • JSON files: You can read and write JSON files using the jsonlite package. This package provides functions to convert R objects to JSON format and vice versa, enabling you to work with JSON data in R.
  • Database files: You can read and write data from database files using the DBI and RSQLite packages. These packages provide functions to interact with various database systems, enabling you to import and export data from databases to R.