This post contains an adapted R script based on prep_n_run_GOplot.pl
from Trinity, the denovo transcriptome assembler, for the times R cannot read in the produced EC.*
files.
prep_n_run_GOplot.pl
is used to produce GOplot
visualizing differential expression, sorted by GO terms (see http://wencke.github.io/). It uses the DE_analysis.DE_subset
and DE.subset.GOseq.enriched
files to produce two other files, EC.david.
and EC.genelist
used in GOplot.
In most cases, this script words properly. However, sometimes when the text annotation contains special characters such as , EC.david
and EC.genelist
do not read properly into R when prep_n_run_GOplot.pl
calles GOplot.Rscript
.
Based on the perl script, I adapted a R version, prep_for_GOplot.R
to produce EC.david
and EC.genelist
shown below.
This is a command-line version which will require the argparse
package in order to read in arguments from the command line.
# to make EC. genelist, has columns ID, logFC and adj.P.val
# DE.subset equivalents are the row name, logFC and FDR
suppressPackageStartupMessages(library(argparse))
parser <- ArgumentParser()
parser$add_argument("-s", "--DE_subset", help = "the subset file from the two-way comparison for both", required=T)
parser$add_argument("-e", "--DE_go", help = "the enriched go file from the same two-way comparison", required=T)
parser$add_argument("--comparison_prefix", help = "the string to add to output files, should be the names of the two-way comparison", required=T)
# parser$print_help()
args <- parser$parse_args()
subset.genelist<-read.table(args$DE_subset, sep = "\t", header=T, quote = "")
subset.genelist$gene<-rownames(subset.genelist)
make.ec.genelist<-data.frame("ID" = subset.genelist$gene, "logFC" = subset.genelist$logFC, "adj.P.val" = subset.genelist$FDR)
write.csv(make.ec.genelist, paste0(args$comparison_prefix,".EC.genelist"),row.names = F)
# to make EC.david, has columns Category, ID, Term, GEnes, adj_pval
# DE.subset.Goseq.enriched equivalents are , , , , over_represented_FDR
subset.david<-read.table(args$DE_go, sep = "\t", header=T, quote = "")
## get all go-terms without any information
missing_go <- subset.david$category[is.na(subset.david$term)]
print(paste("go terms being skipped due to no information", as.character(missing_go)))
subset.david<-subset.david[!is.na(subset.david$term),]
make.ec.david<-data.frame("Category" = subset.david$ontology, "ID" = subset.david$category, "Term" =subset.david$term, "Genes" = subset.david$gene_ids, "adj_pval" = subset.david$over_represented_FDR)
write.csv(make.ec.david, paste0(args$comparison_prefix, ".EC.david"), row.names=F)
Run the script as Rscript prep_for_GOplot.R --help
to see the help page. The script takes in three arguments, DE_analysis.DE_subset
and DE.subset.GOseq.enriched
, and a prefix for the file names.
Afte running this script, you will have to run GOplot
manually rather than through prep_n_run_GOplot.pl
. There is example code on the package website and Trinity also has the script GOplot.Rscript
in the directory $TRINITY/Analysis/DifferentialExpression
.
In summary, I used the argparse
package to adapt parts of a perl script by porting it into R instead. The script can be run on the command line just like the other Trinity scripts.