April 15, 2015

Relation Databases and SQL

Definitions

  • Database or Schema: a set of connected or related Tables
  • Table: A two-dimentional collection of data arranged into Rows and Columns
  • Row: A single entry in a Table
  • Field or column: A column in a Table

Example Table

ident personal family
dyer William Dyer
pb Frank Pabodie
lake Anderson Lake
roe Valentina Roerich
danforth Frank Danforth

Example Table Schema

Tables in SQL are defined using a specific Data Description Language (DDL). Here is the DDL for the previous table:

CREATE TABLE Person (
    ident text,
    personal text,
    family text);

SQL Tutorial, in R

Setup

  • Start RStudio or R
  • Download the example database
download.file('http://files.software-carpentry.org/survey.db',
              destfile='survey.db')
  • Install RSQLite:
install.packages('RQLite')

The last step will install the DBI package as well as RSQLite.

The DBI Package

  • The DBI package is the gateway to relational databases in R.
  • DBI is a "virtual package" in that it supplies only the methods, but not the implementation for interacting with relational databases.
  • Implementations are database specific and include:
    • RSQLite
    • RMySQL
    • RPostgreSQL
    • RODBC
    • bigquery

Connecting to a database:

con = dbConnect(SQLite(),"survey.db")
summary(con)
## <SQLiteConnection>
##   SQLite version:      3.8.6
##   Database name:       
##   Loadable extensions: 
##   File open flags:     
##   VFS:

The con object is now a "live" connection to our "survey" database.

Background of Survey database

In the late 1920s and early 1930s, William Dyer, Frank Pabodie, and Valentina Roerich led expeditions to the Pole of Inaccessibility in the South Pacific, and then onward to Antarctica. Two years ago, their expeditions were found in a storage locker at Miskatonic University. We have scanned and OCR'd the data they contain, and we now want to store that information in a way that will make search and analysis easy.

The Survey Database DDL

CREATE TABLE Person(
  ident    text,
    personal text,
    family   text
);
CREATE TABLE Site(
    name text,
    lat  real,
    long real
);
CREATE TABLE Visited(
    ident integer,
    site  text,
    dated text
);
CREATE TABLE Survey(
    taken   integer,
    person  text,
    quant   text,
    reading real
);

SELECTing

dbGetQuery(con,"select * from Person")
dbGetQuery(con,"select ident,family,personal from Person")
dbGetQuery(con,"select * from Person")
dbGetQuery(con,"select ident,family from Person")
dbGetQuery(con,"select ident,ident,ident from Person")

ORDERing

dbGetQuery(con,"select * from Person")
dbGetQuery(con,"select * from person order by ident")
dbGetQuery(con,"select * from person order by ident desc")

Filtering rows (SQL WHERE)

dbGetQuery(con,"select * from Survey")
dbGetQuery(con,"select * from Survey limit 10")
dbGetQuery(con,"select * from Survey where person='lake'")
dbGetQuery(con,"select * from Site limit 10")
dbGetQuery(con,"select * from Site where lat>0")

Counting and Grouping

dbGetQuery(con,"select count(*) from Survey")
dbGetQuery(con,"select person,count(*) from Survey group by person")

Joining

ident personal family
dyer William Dyer
pb Frank Pabodie
lake Anderson Lake
taken person quant reading
619 dyer rad 9.82
619 dyer sal 0.13
622 dyer rad 7.80

Joining

knitr::kable(dbGetQuery(con,paste("select personal,family,taken,quant",
                                  ",reading from Person join Survey",
                                  "on Person.ident=Survey.Person limit 5")))
personal family taken quant reading
William Dyer 619 rad 9.82
William Dyer 619 sal 0.13
William Dyer 622 rad 7.80
William Dyer 622 sal 0.09
Frank Pabodie 734 rad 8.41

The SQL standard

SQL standards

  • Note that SQL includes many data types and functions.
  • Data types in SQL are somewhat standardized, but different database engines may offer a subset or superset of SQL standard data types.
  • Functionality and compliance with SQL standards varies by database engine.

dplyr

Setup

  • The NYC flights data This dataset contains all 336776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transporation Statistics, and is documented in ?nycflights13
install.packages(c("dplyr","nycflights13"))
library(dplyr)
library(nycflights13)
dim(flights)
## [1] 336776     16

Setup

head(flights,6)
## Source: local data frame [6 x 16]
## 
##   year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013     1   1      517         2      830        11      UA  N14228
## 2 2013     1   1      533         4      850        20      UA  N24211
## 3 2013     1   1      542         2      923        33      AA  N619AA
## 4 2013     1   1      544        -1     1004       -18      B6  N804JB
## 5 2013     1   1      554        -6      812       -25      DL  N668DN
## 6 2013     1   1      554        -4      740        12      UA  N39463
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
##   (dbl), distance (dbl), hour (dbl), minute (dbl)

dplyr Verbs

  • select
  • arrange
  • filter

dplyr select

res = select(flights,year,month,day,origin)
head(res,3)
## Source: local data frame [3 x 4]
## 
##   year month day origin
## 1 2013     1   1    EWR
## 2 2013     1   1    LGA
## 3 2013     1   1    JFK

dplyr arrange

res = arrange(flights,desc(arr_delay),dep_time)
head(res,3)
## Source: local data frame [3 x 16]
## 
##   year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013     1   9      641      1301     1242      1272      HA  N384HA
## 2 2013     6  15     1432      1137     1607      1127      MQ  N504MQ
## 3 2013     1  10     1121      1126     1239      1109      MQ  N517MQ
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
##   (dbl), distance (dbl), hour (dbl), minute (dbl)

dplyr filter

res = filter(flights,origin=="JFK")
head(res,3)
## Source: local data frame [3 x 16]
## 
##   year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013     1   1      542         2      923        33      AA  N619AA
## 2 2013     1   1      544        -1     1004       -18      B6  N804JB
## 3 2013     1   1      557        -3      838        -8      B6  N593JB
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
##   (dbl), distance (dbl), hour (dbl), minute (dbl)

Pipelining

res = filter(flights,origin=="JFK") %>%
  arrange(desc(arr_delay),dep_time) %>%
  select(arr_time,arr_delay,flight,origin)
head(res)
## Source: local data frame [6 x 4]
## 
##   arr_time arr_delay flight origin
## 1     1242      1272     51    JFK
## 2     1607      1127   3535    JFK
## 3     1457      1007    177    JFK
## 4     1044       989   3075    JFK
## 5     1342       931   2391    JFK
## 6     1210       856   2391    JFK

Databases and dplyr

Create database

my_db <- src_sqlite("my_db.sqlite3", create = T)
flights_sqlite <- copy_to(my_db, flights, temporary = FALSE, indexes = list(
  c("year", "month", "day"), "carrier", "tailnum"))

Working with database

flights_sqlite <- tbl(nycflights13_sqlite(), "flights")
flights_sqlite
## Source: sqlite 3.8.6 [/var/folders/21/b_rp6qyj1_b1j5cp8qby0tnr0000gn/T//RtmpzSfzpz/nycflights13.sqlite]
## From: flights [336,776 x 16]
## 
##    year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1  2013     1   1      517         2      830        11      UA  N14228
## 2  2013     1   1      533         4      850        20      UA  N24211
## 3  2013     1   1      542         2      923        33      AA  N619AA
## 4  2013     1   1      544        -1     1004       -18      B6  N804JB
## 5  2013     1   1      554        -6      812       -25      DL  N668DN
## 6  2013     1   1      554        -4      740        12      UA  N39463
## 7  2013     1   1      555        -5      913        19      B6  N516JB
## 8  2013     1   1      557        -3      709       -14      EV  N829AS
## 9  2013     1   1      557        -3      838        -8      B6  N593JB
## 10 2013     1   1      558        -2      753         8      AA  N3ALAA
## ..  ...   ... ...      ...       ...      ...       ...     ...     ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
##   (dbl), distance (dbl), hour (dbl), minute (dbl)

Working with database

res = filter(flights_sqlite,origin=="JFK") %>%
  arrange(desc(arr_delay),dep_time) %>%
  select(arr_time,arr_delay,flight,origin)
head(res)
##   arr_time arr_delay flight origin
## 1     1242      1272     51    JFK
## 2     1607      1127   3535    JFK
## 3     1457      1007    177    JFK
## 4     1044       989   3075    JFK
## 5     1342       931   2391    JFK
## 6     1210       856   2391    JFK