Christina Maimone
2019-08-12
- Overall Note
- Connection
- Using DBI
- Get Database Information
- Execute Queries
- Modifying a Database
- Transactions
- Close Connection
- Use dplyr
- R Markdown
- PL/R
R is generally better suited to selecting data from databases than to creating database tables or loading data into them, but there are functions for all of these operations.
The DBI package has the core functionality for connecting R to database servers. There are then packages that implement that functionality for each specific implementation of SQL. A package for PostgreSQL is RPostgres.
if (!'RPostgres' %in% installed.packages()) {
  install.packages("RPostgres")
}
library(RPostgres)
We connect with a function call like the following.
Note: this code was generated on my local machine connected to a local copy of the database. Your connection details will be different. Note also that I have permission to modify this database.
con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental")
We will need a connection like this for any of the methods of connecting below. You can have multiple connection objects to different databases – just call the variables something different.
Note that the above example doesn't have a username or password because the database is local and doesn't require one. If you're using a remote database, you don't want to hard-code your credentials into your code. There are ways to set environment variables or, for PostgreSQL specifically, use a .pgpass file to store this information. But RStudio also gives you the option to pop up a box to enter the information:
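Sketching the environment-variable approach (a minimal example; the variable names `DVDRENTAL_USER` and `DVDRENTAL_PASSWORD` are arbitrary ones chosen here for illustration — set them in a `~/.Renviron` file or your shell profile, never in code you commit):

```r
library(RPostgres)

# Credentials come from environment variables rather than the script itself.
# Sys.getenv() returns "" if the variable is unset, so a failed connection
# here usually means the environment wasn't configured.
con2 <- dbConnect(RPostgres::Postgres(),
                  host = "localhost",
                  dbname = "dvdrental",
                  user = Sys.getenv("DVDRENTAL_USER"),
                  password = Sys.getenv("DVDRENTAL_PASSWORD"))
```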
con2 <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental",
                  user=rstudioapi::askForPassword("Database username"),
                  password=rstudioapi::askForPassword("Database password"))
We can use the basic functions in the DBI library:
Get Database Information
Note that the following db-prefixed functions are exported from the DBI and RPostgres namespaces.
dbListTables(con)
## [1] "film_actor" "address"
## [3] "city" "actor"
## [5] "film_category" "inventory"
## [7] "actor_info" "category"
## [9] "country" "customer"
## [11] "customer_list" "film_list"
## [13] "language" "rental"
## [15] "nicer_but_slower_film_list" "staff"
## [17] "sales_by_film_category" "store"
## [19] "payment" "sales_by_store"
## [21] "staff_list" "film"
## [23] "measure"
dbListFields(con, "actor")
## [1] "actor_id" "first_name" "last_name" "last_update"
Execute Queries
actor_subset <- dbGetQuery(con, "select * from actor where actor_id > 50")
head(actor_subset)
## actor_id first_name last_name last_update
## 1 51 Gary Phoenix 2013-05-26 14:47:57
## 2 52 Carmen Hunt 2013-05-26 14:47:57
## 3 53 Mena Temple 2013-05-26 14:47:57
## 4 54 Penelope Pinkett 2013-05-26 14:47:57
## 5 55 Fay Kilmer 2013-05-26 14:47:57
## 6 56 Dan Harris 2013-05-26 14:47:57
Note that we don’t need a ; at the end of the query.
If we want an entire table, there’s a function for that:
actor <- dbReadTable(con, "actor")
head(actor)
## actor_id first_name last_name last_update
## 1 1 Penelope Guiness 2013-05-26 14:47:57
## 2 2 Nick Wahlberg 2013-05-26 14:47:57
## 3 3 Ed Chase 2013-05-26 14:47:57
## 4 6 Bette Nicholson 2013-05-26 14:47:57
## 5 7 Grace Mostel 2013-05-26 14:47:57
## 6 8 Matthew Johansson 2013-05-26 14:47:57
If you want part of your query to be determined by a variable – especially if it’s a variable supplied or defined as input or by a user (not you) – you should guard against SQL injection (someone trying to attack your database by tricking your code into running malicious SQL statements) by using a parameterized query:
# YES
myquery <- dbSendQuery(con, "select * from actor where actor_id = $1")
dbBind(myquery, list(4))
dbFetch(myquery)
## actor_id first_name last_name last_update
## 1 4 Jenn Davis 2019-08-12 19:24:02
The $1 stands in for a value you’ll substitute in. For multiple parameters, use $2, etc. (The $1 notation is for PostgreSQL – other types of databases use ? or other symbols.)
When you’re done with the results from a prepared query, clear the result:
dbClearResult(myquery)
# NO! - at least not if the variable can be manipulated by a user
dbSendQuery(con, paste0("select * from actor where actor_id=", myvar))
For more, see http://db.rstudio.com/best-practices/run-queries-safely/.
Parameterizing statements also lets you reuse them with different values.
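For example (a sketch against the same actor table), one prepared query can be bound and fetched repeatedly before it is cleared:

```r
# Prepare once, then bind and fetch with different values each time
myquery <- dbSendQuery(con, "select first_name, last_name from actor where actor_id = $1")
for (id in c(2, 3, 5)) {
  dbBind(myquery, list(id))  # rebinding replaces the previous parameter value
  print(dbFetch(myquery))
}
dbClearResult(myquery)  # clear once, after the last fetch
```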
You may also need to use dbClearResult() after other calls to dbSendQuery() that return a result (select statements, table creations, inserts, updates) – but not after dbGetQuery(). Otherwise you may get warning messages like:
Warning message:
In result_create(conn@ptr, statement) :
  Closing open result set, cancelling previous query
These warnings are OK, but they can get annoying.
Modifying a Database
If you’re not a superuser on the dvdrental database, just try connecting to a database you can modify. Then the basic function is dbSendQuery for any command you want to execute where you aren’t retrieving results.
Note that by default, statements take effect immediately - they are not in a transaction that you need to commit. To use transactions, see below.
res <- dbSendQuery(con, statement="update actor set first_name='Jenn' where actor_id=4")
print(res)  # contains info on result of update
dbClearResult(res)  # prevent warning messages
To create a table, you can pass dbWriteTable a data frame:
mytbl <- data.frame(number=1:10, letter=LETTERS[1:10])
dbWriteTable(con, "mynewtable", mytbl)
or you could specify the table with SQL and execute it with dbSendQuery, but this can get cumbersome.
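For completeness, here is roughly what the SQL route looks like, using DBI's dbExecute() (which runs a statement and clears the result for you); treat this as a sketch:

```r
# Define the table structure explicitly in SQL
dbExecute(con, "
  CREATE TABLE mynewtable (
    number integer,
    letter varchar(1)
  )")

# Then insert rows, also with SQL
dbExecute(con, "INSERT INTO mynewtable VALUES (1, 'A'), (2, 'B')")
```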
To remove a table:
dbRemoveTable(con, "mynewtable")
Transactions
There are also methods for managing transactions if you need them: dbBegin, dbRollback, and dbCommit. Transactions are key when you need to be sure that a sequence of SQL commands (e.g. UPDATE, CREATE, DROP, DELETE, etc.) executes correctly before the changes are made permanent (i.e. “committed”).
dbBegin(con)
dbWriteTable(con, "mynewtable", mytbl)
dbRollback(con)
dbGetQuery(con, "SELECT * FROM mynewtable")
The above will produce an error:
Error in result_create(conn@ptr, statement) :
  Failed to prepare query: ERROR: relation "mynewtable" does not exist
LINE 1: SELECT * FROM mynewtable
because the transaction was rolled back, not committed.
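The commit counterpart looks like this (a sketch; dbCommit() makes the changes permanent, after which the select succeeds):

```r
dbBegin(con)
dbWriteTable(con, "mynewtable", mytbl)
dbCommit(con)  # this time the table creation is made permanent
dbGetQuery(con, "SELECT count(*) FROM mynewtable")
dbRemoveTable(con, "mynewtable")  # clean up
```

DBI also provides dbWithTransaction(con, {...}), which commits if the enclosed code succeeds and rolls back if it throws an error.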
Close Connection
Connections will get closed when you quit R, but it’s good practice to explicitly close them.
dbDisconnect(con)
For more complete info, see the RStudio databases site.
needToInstall <- c("tidyverse")
needToInstall <- needToInstall[which(!needToInstall %in% installed.packages())]
if (length(needToInstall) > 0) {
  sapply(needToInstall, install.packages)
}
library(tidyverse)
First, connect as normal:
con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental")
Get a reference to a table:
actortbl <- tbl(con, "actor")
If we look at this object, it doesn’t have data in it:
str(actortbl)
## List of 2
## $ src:List of 2
## ..$ con :Formal class 'PqConnection' [package "RPostgres"] with 3 slots
## .. .. ..@ ptr :<externalptr>
## .. .. ..@ bigint : chr "integer64"
## .. .. ..@ typnames:'data.frame': 446 obs. of 2 variables:
## .. .. .. ..$ oid : int [1:446] 16 17 18 19 20 21 22 23 24 25 ...
## .. .. .. ..$ typname: chr [1:446] "bool" "bytea" "char" "name" ...
## ..$ disco: NULL
## ..- attr(*, "class")= chr [1:4] "src_PqConnection" "src_dbi" "src_sql" "src"
## $ ops:List of 2
## ..$ x : 'ident' chr "actor"
## ..$ vars: chr [1:4] "actor_id" "first_name" "last_name" "last_update"
## ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
## - attr(*, "class")= chr [1:5] "tbl_PqConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
It just has connection information. dplyr will try to perform operations within the database where it can, instead of pulling all of the data into R.
Yet you can print the object and see observations:
actortbl
## # Source: table<actor> [?? x 4]
## # Database: postgres [christina@localhost:5432/dvdrental]
## actor_id first_name last_name last_update
## <int> <chr> <chr> <dttm>
## 1 1 Penelope Guiness 2013-05-26 14:47:57
## 2 2 Nick Wahlberg 2013-05-26 14:47:57
## 3 3 Ed Chase 2013-05-26 14:47:57
## 4 6 Bette Nicholson 2013-05-26 14:47:57
## 5 7 Grace Mostel 2013-05-26 14:47:57
## 6 8 Matthew Johansson 2013-05-26 14:47:57
## 7 9 Joe Swank 2013-05-26 14:47:57
## 8 10 Christian Gable 2013-05-26 14:47:57
## 9 11 Zero Cage 2013-05-26 14:47:57
## 10 12 Karl Berry 2013-05-26 14:47:57
## # … with more rows
It retrieves rows as needed, and also gives you a nice display in notebooks (a special kind of R Markdown file) – output looks a little different in the console and in R Markdown files like this.
You can use dplyr verbs to work with the table objects from the database, as if they were data frames (or tibbles).
actortbl %>%
  select(actor_id, first_name, last_name) %>%
  filter(actor_id > 150)
## # Source: lazy query [?? x 3]
## # Database: postgres [christina@localhost:5432/dvdrental]
## actor_id first_name last_name
## <int> <chr> <chr>
## 1 151 Geoffrey Heston
## 2 152 Ben Harris
## 3 153 Minnie Kilmer
## 4 154 Meryl Gibson
## 5 155 Ian Tandy
## 6 156 Fay Wood
## 7 157 Greta Malden
## 8 158 Vivien Basinger
## 9 159 Laura Brody
## 10 160 Chris Depp
## # … with more rows
The above generates and executes the SQL needed to get the result. It turns filter into a select statement with the appropriate where clause.
rentaltbl <- tbl(con, "rental")
rentaltbl %>% group_by(customer_id) %>% summarize(count=n())
## # Source: lazy query [?? x 2]
## # Database: postgres [christina@localhost:5432/dvdrental]
## customer_id count
## <int> <integr64>
## 1 87 30
## 2 184 23
## 3 477 22
## 4 273 35
## 5 550 32
## 6 394 22
## 7 51 33
## 8 272 20
## 9 70 18
## 10 190 27
## # … with more rows
What does the above correspond to as a SQL query?

select customer_id, count(*) from rental group by customer_id;
rentaltbl %>% group_by(customer_id) %>% summarize(count=n()) %>% show_query()
## <SQL>
## SELECT "customer_id", COUNT(*) AS "count"
## FROM "rental"
## GROUP BY "customer_id"
You can use collect to pull down all of the data (tell dplyr to stop being lazy).
# First, without collecting
df1 <- rentaltbl %>% group_by(customer_id) %>% summarize(count=n())
df1
## # Source: lazy query [?? x 2]
## # Database: postgres [christina@localhost:5432/dvdrental]
## customer_id count
## <int> <integr64>
## 1 87 30
## 2 184 23
## 3 477 22
## 4 273 35
## 5 550 32
## 6 394 22
## 7 51 33
## 8 272 20
## 9 70 18
## 10 190 27
## # … with more rows
Looks OK, except:
df1[1,]
Gives you:
Error in df1[1, ] : incorrect number of dimensions
It’s the wrong dimensions because df1 isn’t actually a data.frame:
str(df1)
## List of 2
## $ src:List of 2
## ..$ con :Formal class 'PqConnection' [package "RPostgres"] with 3 slots
## .. .. ..@ ptr :<externalptr>
## .. .. ..@ bigint : chr "integer64"
## .. .. ..@ typnames:'data.frame': 446 obs. of 2 variables:
## .. .. .. ..$ oid : int [1:446] 16 17 18 19 20 21 22 23 24 25 ...
## .. .. .. ..$ typname: chr [1:446] "bool" "bytea" "char" "name" ...
## ..$ disco: NULL
## ..- attr(*, "class")= chr [1:4] "src_PqConnection" "src_dbi" "src_sql" "src"
## $ ops:List of 4
## ..$ name: chr "summarise"
## ..$ x :List of 4
## .. ..$ name: chr "group_by"
## .. ..$ x :List of 2
## .. .. ..$ x : 'ident' chr "rental"
## .. .. ..$ vars: chr [1:7] "rental_id" "rental_date" "inventory_id" "customer_id" ...
## .. .. ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
## .. ..$ dots:List of 1
## .. .. ..$ customer_id: symbol customer_id
## .. ..$ args:List of 1
## .. .. ..$ add: logi FALSE
## .. ..- attr(*, "class")= chr [1:3] "op_group_by" "op_single" "op"
## ..$ dots:List of 1
## .. ..$ count: language ~n()
## .. .. ..- attr(*, ".Environment")=<environment: 0x7fa71e16dce8>
## ..$ args: list()
## ..- attr(*, "class")= chr [1:3] "op_summarise" "op_single" "op"
## - attr(*, "class")= chr [1:5] "tbl_PqConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
It is telling us we need to collect the data first to actually pull it into R.
# Then with collecting
df2 <- rentaltbl %>% group_by(customer_id) %>% summarize(count=n()) %>% collect()
df2
## # A tibble: 599 x 2
## customer_id count
## <int> <integr64>
## 1 87 30
## 2 184 23
## 3 477 22
## 4 273 35
## 5 550 32
## 6 394 22
## 7 51 33
## 8 272 20
## 9 70 18
## 10 190 27
## # … with 589 more rows
df2[1,]
## # A tibble: 1 x 2
## customer_id count
## <int> <integr64>
## 1 87 30
You can also use dplyr’s commands to join:
custtbl <- tbl(con, "customer")
addrtbl <- tbl(con, "address")
custtbl %>%
  inner_join(addrtbl, by="address_id") %>%
  filter(postal_code == '52137') %>%
  select(first_name, last_name, postal_code)
## # Source: lazy query [?? x 3]
## # Database: postgres [christina@localhost:5432/dvdrental]
## first_name last_name postal_code
## <chr> <chr> <chr>
## 1 James Gannon 52137
## 2 Freddie Duggan 52137
You could create a table with copy_to (if you have the correct permissions):
mytbl <- data.frame(number=1:10, letter=LETTERS[1:10])
copy_to(con, mytbl, "mynewtable")
By default, it creates a temporary table. But this is a setting you can change, and you can also specify what columns to index on the table.
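For example (a sketch using copy_to's documented arguments for database backends):

```r
copy_to(con, mytbl, "mynewtable",
        temporary = FALSE,         # persist the table beyond this session
        indexes = list("number"),  # index the number column
        overwrite = TRUE)          # replace the table if it already exists
```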
Disconnect like we normally do:
dbDisconnect(con)
R Markdown lets you execute SQL queries directly. You first set up a DBI connection like above, and then, instead of having R chunks of code, you can have SQL chunks of code:
```{r}
library(RPostgres)
con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental")
```
```{sql, connection=con}
select * from actor where actor_id > 75;
```
```{r}
dbDisconnect(con)
```
Here is the above, actually executed in RMarkdown:
library(RPostgres)
con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental")
select * from actor where actor_id > 75;
actor_id | first_name | last_name | last_update |
---|---|---|---|
76 | Angelina | Astaire | 2013-05-26 14:47:57 |
77 | Cary | Mcconaughey | 2013-05-26 14:47:57 |
78 | Groucho | Sinatra | 2013-05-26 14:47:57 |
79 | Mae | Hoffman | 2013-05-26 14:47:57 |
80 | Ralph | Cruz | 2013-05-26 14:47:57 |
81 | Scarlett | Damon | 2013-05-26 14:47:57 |
82 | Woody | Jolie | 2013-05-26 14:47:57 |
83 | Ben | Willis | 2013-05-26 14:47:57 |
84 | James | Pitt | 2013-05-26 14:47:57 |
85 | Minnie | Zellweger | 2013-05-26 14:47:57 |
dbDisconnect(con)
For more details, see knitr Language Engines: SQL.
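One chunk option worth knowing: output.var captures a SQL chunk's result as an R data frame for use in later chunks. A sketch (the variable name actor_subset is arbitrary):

```{sql, connection=con, output.var="actor_subset"}
select * from actor where actor_id > 75;
```

```{r}
head(actor_subset)
```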
Database administrators can install functionality in a PostgreSQL database to allow you to write R functions directly in the database, and then call them with normal SQL queries. This is done with PL/R. Enabling this functionality can be risky, because R potentially gives users access to files on the database server. Database admins are usually conservative about allowing PL/R on a system, but it can be very useful in production systems. You can use it to generate reports, compute statistical methods, and even create plots.
We aren’t covering PL/R (or even writing SQL functions more generally), but it’s good to know this functionality exists if you’re ever working with a large production system.