April 15, 2015
| ident | personal | family |
|---|---|---|
| dyer | William | Dyer |
| pb | Frank | Pabodie |
| lake | Anderson | Lake |
| roe | Valentina | Roerich |
| danforth | Frank | Danforth |
Tables in SQL are defined using a specific Data Description Language (DDL). Here is the DDL for the previous table:
CREATE TABLE Person (
ident text,
personal text,
family text);
download.file('http://files.software-carpentry.org/survey.db',
destfile='survey.db')
install.packages('RQLite')
The last step will install the DBI package as well as RSQLite.
DBI package is the gateway to relational databases in R.DBI is a "virtual package" in that it supplies only the methods, but not the implementation for interacting with relational databases.con = dbConnect(SQLite(),"survey.db") summary(con)
## <SQLiteConnection> ## SQLite version: 3.8.6 ## Database name: ## Loadable extensions: ## File open flags: ## VFS:
The con object is now a "live" connection to our "survey" database.
In the late 1920s and early 1930s, William Dyer, Frank Pabodie, and Valentina Roerich led expeditions to the Pole of Inaccessibility in the South Pacific, and then onward to Antarctica. Two years ago, their expeditions were found in a storage locker at Miskatonic University. We have scanned and OCR'd the data they contain, and we now want to store that information in a way that will make search and analysis easy.
CREATE TABLE Person(
ident text,
personal text,
family text
);
CREATE TABLE Site(
name text,
lat real,
long real
);
CREATE TABLE Visited(
ident integer,
site text,
dated text
);
CREATE TABLE Survey(
taken integer,
person text,
quant text,
reading real
);
dbGetQuery(con,"select * from Person") dbGetQuery(con,"select ident,family,personal from Person") dbGetQuery(con,"select * from Person") dbGetQuery(con,"select ident,family from Person") dbGetQuery(con,"select ident,ident,ident from Person")
dbGetQuery(con,"select * from Person") dbGetQuery(con,"select * from person order by ident") dbGetQuery(con,"select * from person order by ident desc")
dbGetQuery(con,"select * from Survey") dbGetQuery(con,"select * from Survey limit 10") dbGetQuery(con,"select * from Survey where person='lake'") dbGetQuery(con,"select * from Site limit 10") dbGetQuery(con,"select * from Site where lat>0")
dbGetQuery(con,"select count(*) from Survey") dbGetQuery(con,"select person,count(*) from Survey group by person")
| ident | personal | family |
|---|---|---|
| dyer | William | Dyer |
| pb | Frank | Pabodie |
| lake | Anderson | Lake |
| taken | person | quant | reading |
|---|---|---|---|
| 619 | dyer | rad | 9.82 |
| 619 | dyer | sal | 0.13 |
| 622 | dyer | rad | 7.80 |
knitr::kable(dbGetQuery(con,paste("select personal,family,taken,quant",
",reading from Person join Survey",
"on Person.ident=Survey.Person limit 5")))
| personal | family | taken | quant | reading |
|---|---|---|---|---|
| William | Dyer | 619 | rad | 9.82 |
| William | Dyer | 619 | sal | 0.13 |
| William | Dyer | 622 | rad | 7.80 |
| William | Dyer | 622 | sal | 0.09 |
| Frank | Pabodie | 734 | rad | 8.41 |
install.packages(c("dplyr","nycflights13"))
library(dplyr) library(nycflights13) dim(flights)
## [1] 336776 16
head(flights,6)
## Source: local data frame [6 x 16] ## ## year month day dep_time dep_delay arr_time arr_delay carrier tailnum ## 1 2013 1 1 517 2 830 11 UA N14228 ## 2 2013 1 1 533 4 850 20 UA N24211 ## 3 2013 1 1 542 2 923 33 AA N619AA ## 4 2013 1 1 544 -1 1004 -18 B6 N804JB ## 5 2013 1 1 554 -6 812 -25 DL N668DN ## 6 2013 1 1 554 -4 740 12 UA N39463 ## Variables not shown: flight (int), origin (chr), dest (chr), air_time ## (dbl), distance (dbl), hour (dbl), minute (dbl)
res = select(flights,year,month,day,origin) head(res,3)
## Source: local data frame [3 x 4] ## ## year month day origin ## 1 2013 1 1 EWR ## 2 2013 1 1 LGA ## 3 2013 1 1 JFK
res = arrange(flights,desc(arr_delay),dep_time) head(res,3)
## Source: local data frame [3 x 16] ## ## year month day dep_time dep_delay arr_time arr_delay carrier tailnum ## 1 2013 1 9 641 1301 1242 1272 HA N384HA ## 2 2013 6 15 1432 1137 1607 1127 MQ N504MQ ## 3 2013 1 10 1121 1126 1239 1109 MQ N517MQ ## Variables not shown: flight (int), origin (chr), dest (chr), air_time ## (dbl), distance (dbl), hour (dbl), minute (dbl)
res = filter(flights,origin=="JFK") head(res,3)
## Source: local data frame [3 x 16] ## ## year month day dep_time dep_delay arr_time arr_delay carrier tailnum ## 1 2013 1 1 542 2 923 33 AA N619AA ## 2 2013 1 1 544 -1 1004 -18 B6 N804JB ## 3 2013 1 1 557 -3 838 -8 B6 N593JB ## Variables not shown: flight (int), origin (chr), dest (chr), air_time ## (dbl), distance (dbl), hour (dbl), minute (dbl)
res = filter(flights,origin=="JFK") %>% arrange(desc(arr_delay),dep_time) %>% select(arr_time,arr_delay,flight,origin) head(res)
## Source: local data frame [6 x 4] ## ## arr_time arr_delay flight origin ## 1 1242 1272 51 JFK ## 2 1607 1127 3535 JFK ## 3 1457 1007 177 JFK ## 4 1044 989 3075 JFK ## 5 1342 931 2391 JFK ## 6 1210 856 2391 JFK
my_db <- src_sqlite("my_db.sqlite3", create = T)
flights_sqlite <- copy_to(my_db, flights, temporary = FALSE, indexes = list(
c("year", "month", "day"), "carrier", "tailnum"))
flights_sqlite <- tbl(nycflights13_sqlite(), "flights") flights_sqlite
## Source: sqlite 3.8.6 [/var/folders/21/b_rp6qyj1_b1j5cp8qby0tnr0000gn/T//RtmpzSfzpz/nycflights13.sqlite] ## From: flights [336,776 x 16] ## ## year month day dep_time dep_delay arr_time arr_delay carrier tailnum ## 1 2013 1 1 517 2 830 11 UA N14228 ## 2 2013 1 1 533 4 850 20 UA N24211 ## 3 2013 1 1 542 2 923 33 AA N619AA ## 4 2013 1 1 544 -1 1004 -18 B6 N804JB ## 5 2013 1 1 554 -6 812 -25 DL N668DN ## 6 2013 1 1 554 -4 740 12 UA N39463 ## 7 2013 1 1 555 -5 913 19 B6 N516JB ## 8 2013 1 1 557 -3 709 -14 EV N829AS ## 9 2013 1 1 557 -3 838 -8 B6 N593JB ## 10 2013 1 1 558 -2 753 8 AA N3ALAA ## .. ... ... ... ... ... ... ... ... ... ## Variables not shown: flight (int), origin (chr), dest (chr), air_time ## (dbl), distance (dbl), hour (dbl), minute (dbl)
res = filter(flights_sqlite,origin=="JFK") %>% arrange(desc(arr_delay),dep_time) %>% select(arr_time,arr_delay,flight,origin) head(res)
## arr_time arr_delay flight origin ## 1 1242 1272 51 JFK ## 2 1607 1127 3535 JFK ## 3 1457 1007 177 JFK ## 4 1044 989 3075 JFK ## 5 1342 931 2391 JFK ## 6 1210 856 2391 JFK