6 Reading and writing data files
Almost no analysis starts inside R. Your data lives somewhere else first — an expression matrix exported from an instrument, a sample sheet a collaborator typed into Excel, a public dataset posted on a government website. Before you can do anything useful, you have to get that tabular data into R, and once you’ve cleaned or summarized it, you’ll often need to get it back out to share with someone who does not use R.
This chapter is about that round trip. You’ll write a small table of genes and their expression values to disk, read it back, pull a real dataset straight from a URL, and do the same with Excel files. None of it is hard once you know which function to reach for — and knowing that is most of the battle.
6.1 What you’ll learn
- Organize an analysis in an RStudio Project so your file paths stay portable.
- Write a data frame to a CSV file with
write_csv()and read it back withread_csv(). - Read a CSV directly from a URL, without downloading it by hand.
- Read and write Excel files with
read_excel()andwrite_xlsx(). - Recognize when other formats (Google Sheets, JSON, databases) need a different package.
6.2 Organizing your work with RStudio Projects
Before you read or write a single file, it helps to decide where those files live. An RStudio Project is a folder that holds everything for one analysis — scripts, data, and outputs — together in one place. When you create a project, RStudio drops a .Rproj file into the folder, and that file anchors your workspace.
Working in a project buys you a few things:
- A consistent working directory. Open the project and your working directory is automatically the project folder. No more guessing where R thinks it is.
- Everything in one place. Scripts, data, and outputs travel together.
- Reproducibility. A collaborator can open your project and run your code without rewriting any paths.
- Version control. Projects work cleanly with Git and GitHub.
6.2.1 Creating a project
You can make a new project from File > New Project…, or from the project dropdown in the top-right corner of RStudio. You’ll be offered three choices:
- New Directory — start a fresh project folder.
- Existing Directory — turn a folder you already have into a project.
- Version Control — clone a repository from GitHub.
6.2.2 A sensible project layout
A good project has a predictable shape — one that you define and then stick to. A common starting point:
my_analysis_project/
├── my_analysis_project.Rproj
├── data/
│ ├── raw/
│ └── processed/
├── scripts/
├── notebooks/
├── outputs/
│ ├── figures/
│ └── tables/
├── README.md
└── .gitignore
This keeps raw data separate from processed data, scripts in one place, and outputs where you can find them later.
6.2.3 Why this fixes file paths
The biggest practical payoff is path management. When you open a project, RStudio sets the working directory to the project folder, so you can refer to files relative to that folder instead of spelling out the full path from the root of your hard drive.
An absolute path like /Users/username/Documents/... only exists on your computer. Email that script to a colleague and the very first line fails for them. Relative paths (data/dataset.csv), anchored by a project, are the habit that keeps your code portable.
here package for robust paths
In larger projects — especially ones with notebooks tucked into subfolders — the here package makes paths bulletproof. here() always resolves from the project root (the folder with the .Rproj file), no matter which subfolder your script sits in:
install.packages("here")
library(here)
expr <- read.csv(here("data", "dataset.csv"))When you share a project folder (or push it to GitHub), a collaborator can clone it, open the .Rproj file, and run your scripts with no path edits at all. That seamless hand-off is what makes an analysis credible and repeatable.
Now that you have a place to put files, let’s read and write some.
6.3 CSV files
A CSV (“comma-separated values”) file is the lingua franca of tabular data: plain text, one row per line, columns separated by commas. Nearly every tool — Excel, Python, databases, sequencing pipelines — can read and write CSV, which is exactly why it’s so useful for moving data around.
Base R reads and writes CSV with read.csv() and write.csv(). We’ll use the readr package instead, which provides read_csv() and write_csv(). They are faster, give cleaner output, and guess column types more sensibly. Install it once:
install.packages("readr")Then load it at the start of your session:
6.3.1 Writing a CSV file
We need something to write, so let’s build a small biological data frame: a handful of genes, each with a symbol and a measured expression value.
expr <- data.frame(
gene_id = c("YAL001C", "YAL002W", "YAL003W", "YAL005C", "YAL007C"),
symbol = c("TFC3", "VPS8", "EFB1", "SSA1", "ERP2"),
expression = c(8.2, 5.1, 11.7, 9.4, 6.8)
)
expr gene_id symbol expression
1 YAL001C TFC3 8.2
2 YAL002W VPS8 5.1
3 YAL003W EFB1 11.7
4 YAL005C SSA1 9.4
5 YAL007C ERP2 6.8
Each row is one gene; the columns mix text (gene_id, symbol) and numbers (expression), which is the everyday shape of real expression data. Now write it to disk with write_csv():
write_csv(expr, "data.csv")That call created a file named data.csv in your working directory. You can confirm it landed where you expect:
# Where am I working right now?
getwd()[1] "/workspaces/RBiocBook"
# Did the file get created?
dir(pattern = "data.csv")[1] "data.csv"
If you wanted the file somewhere else, you’d pass a fuller path (for example "outputs/tables/data.csv"). Inside an RStudio Project, that path is relative to the project folder.
6.3.2 Reading a CSV file
Reading the file back is the mirror image — read_csv():
expr_in <- read_csv("data.csv")Rows: 5 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): gene_id, symbol
dbl (1): expression
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Notice the message readr prints: it tells you how it guessed each column’s type (text vs. number). Let’s look at what came back:
expr_in# A tibble: 5 × 3
gene_id symbol expression
<chr> <chr> <dbl>
1 YAL001C TFC3 8.2
2 YAL002W VPS8 5.1
3 YAL003W EFB1 11.7
4 YAL005C SSA1 9.4
5 YAL007C ERP2 6.8
The genes and their expression values round-tripped intact. The gene_id and symbol columns came back as character, expression as a number (dbl) — readr inferred all of that automatically.
6.3.3 Reading a CSV straight from the web
read_csv() is not limited to files on your own disk. Hand it a URL and it will fetch and parse the file in one step. This matters because a great deal of useful data lives on public websites, and reading it directly — rather than downloading by hand — keeps your analysis reproducible: anyone who runs your script pulls the same data from the same place.
As an example, the palmerpenguins project publishes its penguin body measurements as a public CSV on GitHub:
penguins <- read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv")Rows: 344 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
penguins# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>
The dataset — bill and flipper measurements for three penguin species — is documented here. The point is not the penguins themselves: it’s that read_csv() detected the file’s structure over the network and handed you a data frame, exactly as it did for your local file. Any public CSV, biological or otherwise, comes in the same way.
6.4 Excel files
Excel spreadsheets (.xlsx) are everywhere in biology — sample sheets, plate layouts, hand-curated gene lists. Unlike CSV, an Excel file can hold multiple sheets, formulas, and formatting, so it needs a dedicated package. Reading is handled by readxl; writing by writexl.
6.4.1 Reading an Excel file
Install and load readxl:
install.packages("readxl")We don’t have an Excel file lying around, and — unlike read_csv() — readxl can’t read straight from a URL. So we first download the file, then read it from disk. Here’s a small public example spreadsheet:
download.file(
"https://www.w3resource.com/python-exercises/pandas/excel/SaleData.xlsx",
"SaleData.xlsx",
mode = "wb"
)The mode = "wb" (“write binary”) matters for non-text files like .xlsx; without it, the download can be corrupted on some platforms. Confirm it arrived:
dir(pattern = "SaleData.xlsx")[1] "SaleData.xlsx"
Now read it with read_excel():
sales <- read_excel("SaleData.xlsx")
sales# A tibble: 45 × 8
OrderDate Region Manager SalesMan Item Units Unit_price Sale_amt
<dttm> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2018-01-06 00:00:00 East Martha Alexander Tele… 95 1198 113810
2 2018-01-23 00:00:00 Central Hermann Shelli Home… 50 500 25000
3 2018-02-09 00:00:00 Central Hermann Luis Tele… 36 1198 43128
4 2018-02-26 00:00:00 Central Timothy David Cell… 27 225 6075
5 2018-03-15 00:00:00 West Timothy Stephen Tele… 56 1198 67088
6 2018-04-01 00:00:00 East Martha Alexander Home… 60 500 30000
7 2018-04-18 00:00:00 Central Martha Steven Tele… 75 1198 89850
8 2018-05-05 00:00:00 Central Hermann Luis Tele… 90 1198 107820
9 2018-05-22 00:00:00 West Douglas Michael Tele… 32 1198 38336
10 2018-06-08 00:00:00 East Martha Alexander Home… 60 500 30000
# ℹ 35 more rows
read_excel() has options for choosing a specific sheet (sheet =), a cell range (range =), or how column types are guessed — handy when a spreadsheet has more than one tab or some messy header rows.
6.4.2 Writing an Excel file
To go the other direction, use write_xlsx() from writexl. Install and load it:
install.packages("writexl")Then write our gene expression data frame to an .xlsx file:
write_xlsx(expr, "data.xlsx")And confirm it was created:
If you open data.xlsx in Excel, you’ll see the same five genes and their expression values you started with.
6.5 Additional formats
CSV and Excel cover most of what you’ll meet, but R reads and writes many other formats too. When you hit one, reach for the matching package:
-
Google Sheets —
googlesheets4reads and writes sheets directly from your Google account. -
JSON —
jsonliteconverts between R objects and JSON, common for data pulled from web APIs. -
Databases —
DBItogether with a backend likeRSQLitelets you query a database and pull results straight into a data frame.
The pattern is always the same: find the package for the format, then look for its read_* and write_* functions.
6.6 Exercises
Use the expr data frame from this chapter (five yeast genes with a symbol and an expression value) unless a problem says otherwise.
-
Round-trip a CSV. Write
exprto a file calledgenes.csv, then read it into a new object calledgenes_in. Confirm the file exists withdir().NoteSolutionRows: 5 Columns: 3 ── Column specification ──────────────────────────────────────────────────────── Delimiter: "," chr (2): gene_id, symbol dbl (1): expression ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.dir(pattern = "genes.csv")[1] "genes.csv"write_csv()saves it;read_csv()brings it back. The file name is just a string, so you can call it anything you like. -
Read a column back. From
genes_in, print just thesymbolcolumn.NoteSolutiongenes_in$symbol[1] "TFC3" "VPS8" "EFB1" "SSA1" "ERP2"The
$operator pulls one column out of a data frame as a vector — the same trick you used in the data frames chapter. -
Save to Excel. Write
exprtogenes.xlsxand confirm it was created.NoteSolutionwrite_xlsx(expr, "genes.xlsx") dir(pattern = "genes.xlsx")[1] "genes.xlsx"write_xlsx()fromwritexlwrites a.xlsxfile with no need for Excel to be installed. -
Why a relative path? A colleague sends you a script whose first line is
read.csv("/Users/dana/project/data/counts.csv"). It fails on your machine. Why, and what would you change it to?NoteSolutionThat absolute path points to a folder on Dana’s computer that doesn’t exist on yours. Inside an RStudio Project, a relative path like
read.csv("data/counts.csv")works for anyone who opens the project, because the working directory is the project folder.
6.7 Summary
You can now move tabular data in and out of R in both directions:
-
Organize first. An RStudio Project gives you a stable working directory so relative paths (
data/file.csv) stay portable; absolute paths do not. -
CSV —
write_csv()saves a data frame,read_csv()reads it back, andread_csv()will even read straight from a URL for reproducible access to public data. -
Excel —
read_excel()reads.xlsx(download first; it can’t read a URL), andwrite_xlsx()writes one. -
Other formats — Google Sheets, JSON, and databases each have a package (
googlesheets4,jsonlite,DBI/RSQLite) that follows the sameread_*/write_*pattern.
With data flowing freely in and out, you’re ready to spend your time on the part that matters: the analysis.