Local BLAST from R
In the previous post I discussed a bigger project, now let’s see a smaller part of it. BLAST (or Basic Local Alignment Search Tool) is defined on its webpage as:
Most of the cases it is used for finding similar sequences (~genes) from other, well-annotated, online available organisms and in this case the web interface is perfect. However if we want to use for more heavy jobs or with custom databases, then there is an option to download executable and run in a local machine (or on a server of course). In our case the database will be built from the primers, which is only available locally, so we need to use this option. If you want to download follow the instructions here.
As we installed we have access through /path/to/blast/ncbi-blast-2.6.0+/bin/blastn
for blastn (nucleotide BLAST). It is nice, but what we want is to control from R. If someone submit a sequence in the Shiny interface, that is sent to BLAST on the system, then the result come back to R. Fortunately in R there are some command for this, namely system()
and system2()
.
The differences between the two generally is how to give the command line arguments. In system()
, you insert one string as you would do in a Unix-alike shell. In system2()
the command
argument holds only the command name you wish to run and the arg
argument contains a character vector of the command’s argument. Let me show in an example where you would like to run:
> /path/to/blast/ncbi-blast-2.6.0+/bin/blastn -db blast_database -query input.fasta -outfmt 6 -evalue 1e-6 -ungapped
In R it looks like these.
system(command = "/path/to/blast/ncbi-blast-2.6.0+/bin/blastn -db blast_database -query input.fasta -outfmt 6 -evalue 1e-6 -ungapped")
system2(command = "/path/to/blast/ncbi-blast-2.6.0+/bin/blastn",
args = c("-db blast_database -query input.fasta -outfmt 6 -evalue 10e-6 -ungapped"))
The good point in system2()
is the possibility we can more easily manipulate the arguments (and the connection is also better).
blastn = "/path/to/blast/ncbi-blast-2.6.0+/bin/blastn"
blast_db = "blast_database"
input = "input_file"
evalue = 1e-6
format = 6
system2(command = blastn,
args = c("-db", blast_db,
"-query", input,
"-outfmt", format,
"-evalue", evalue,
"-ungapped"))
It is very promising so far, but we need to capture the output in R. For this we have to use wait = TRUE
. Then R will wait until this process is finished, not just send out the task and go on the next line. also set the stdout
(standard output) to capture the result as a character vector, otherwise it is sent to the R console. Let’s capture this in the blast_out
object.
blast_out <- system2(command = blastn,
args = c("-db", blast_db,
"-query", input,
"-outfmt", format,
"-evalue", evalue,
"-ungapped"),
wait = TRUE,
stdout = TRUE)
So far so good, but if you look in the blast_out
object, it should look like something like this:
[1] "Homo_Lin28A\tENST00000254231.4\t100.00\t207\t0\t0\t1\t207\t121\t327\t2e-110\t 398"
[2] "Homo_Lin28A\tENST00000326279.10\t100.00\t207\t0\t0\t1\t207\t121\t327\t2e-110\t 398"
It is not exactly what we wanted. We would like to have some kind of data frame, but as you can see the output is literally a tabular formatted output captured in a character vector. But no worries, we have some tricks here.
“Classical” way
The textConnection
function needs an input text, then it is copied to create a connection object. For detailed information write ?textConnection
. From this connection we can use read.table()
to create a data frame.
blast_out <- read.table(textConnection(blast_out))
And we have a data frame with the blast output.
readr
package (e.g. read_tsv()
) cannot read from this connection due to its non-binary state. (Or at least I could not figure out how.)The “tidy” way
If we want to work and stay in the tidyverse and create a tibble, there are two options. The first is to follow the codes above and use as_tibble()
or from the original blast_out
character vector proceed with as_tibble()
and separate()
based on the \t
.
library(tidyverse)
tidy_blast <- blast_out %>%
textConnection() %>%
read.table() %>%
as_tibble()
Or (Personally I like this the best)
colnames <- c("qseqid",
"sseqid",
"pident",
"length",
"mismatch",
"gapopen",
"qstart",
"qend",
"sstart",
"send",
"evalue",
"bitscore")
tidy_blast <- blast_out %>%
as_tibble() %>%
separate(col = value,
into = colnames,
sep = "\t",
convert = TRUE)
Great, now we have a nice data frame/tibble with all the blast outputs we can process, manipulate, plot, etc all in R.
tidy_blast
## # A tibble: 2 × 12
## qseqid sseqid pident length mismatch gapopen qstart
## * <chr> <chr> <dbl> <int> <int> <int> <int>
## 1 Homo_Lin28A HomoENST00000254231.4 100 207 0 0 1
## 2 Homo_Lin28A HomoENST00000326279.10 100 207 0 0 1
## # ... with 5 more variables: qend <int>, sstart <int>, send <int>,
## # evalue <dbl>, bitscore <int>
Put together all and we will got this, without any intermediate object.
blastn = "/path/to/blast/ncbi-blast-2.6.0+/bin/blastn"
blast_db = "blast_database"
input = "input_file"
evalue = 1e-6
format = 6
colnames <- c("qseqid",
"sseqid",
"pident",
"length",
"mismatch",
"gapopen",
"qstart",
"qend",
"sstart",
"send",
"evalue",
"bitscore")
blast_out <- system2(command = "blastn",
args = c("-db", blast_db,
"-query", input,
"-outfmt", format,
"-evalue", evalue,
"-ungapped"),
wait = TRUE,
stdout = TRUE) %>%
as_tibble() %>%
separate(col = value,
into = colnames,
sep = "\t",
convert = TRUE)
From here we can put all of these in a function, even extend the blast options on your wish and use in our everyday analysis or as me, put this on a Shiny website with some extra feature we will discuss later on. In the next post I plan to continue on the visualization of the blast output.