The recommended way to figure out the best parameters to use in stacks for denovo RAD-seq analysis is to test parameter combinations for M and n. (Paris et al. 2017, Rochette and Catchen 2017).
The authors of stacks recommend testing M and n from 1 to 9 and then visualizing the distribution of loci against the number of SNPs on the loci.
stacks provides a script to extract distributions using stacks-dist-extract
but you will have to run it repeatedly by hand for every parameter.
Here is a bash script with a loop to extract the distribution of SNPs for each parameter combination and a R script to visualize the distributions in R similar to Rocheete and Catchen’s figure.
In this script, it is assumed that folder names follow Rochette and Catchen’s protocol (stacks.M1, stacks.M2 etc)
Home
+-- stacks.M1
| +-- populations.r80
| +-- populations.log
| +-- populations.log.distribs
| +-- ind.1.alleles.tsv.gz
| +-- ind.1.matches.bam
| +-- ind.1.snps.tsv.gz
| +-- ind.1.tags.tsv.gz
| ...
|-- stacks.M2
|-- stacks.M3
|-- stacks.M4
...
#!/bin/bash
# load stacks for those who are on clusters, this is for SLURM
module load stacks/2.0b
# change the numbers in the next line to match the directory values
for i in 1 2 3 4 5 6 7 8 9
do
stacks-dist-extract stacks.M${i}/populations.r80/populations.log.distribs snps_per_loc_postfilters > ${i}_snp_distribution.tsv
done
There is also a pure awk
solution not involving stacks scripts.
#!/bin/bash
# load stacks for those who are on clusters, this is for SLURM
module load stacks/2.0b
for i in 1 2 3 4 5 6 7 8 9
do
awk '/snps_per_loc_postfilters/{flag=1; next} /END/{flag=0} flag' stacks.M${i}/populations.r80/populations.log.distribs > ${i}_snp_distribution.tsv
done
Then in R, make a data frame combining all parameter distributions before visualizaing in ggplot2
. Here, the data comes from a GBS analysis of garlic mustard (Alliaria petiolata).
count <- 1
files <- list.files(path = "", pattern="*_snp_distribution.tsv", full.names = T)
for (i in files[1:9]){
table <- read.delim(i, skip=1, header=T)
table$n_loci_percent<- table$n_loci/sum(table$n_loci)
table$m<- count
write.table(table, "distributions.tsv", append=T, row.names=F, col.names = F)
snp_count <- data.frame("m"= count, "n_snps"=sum(table$n_loci))
write.table(snp_count, "total_count.tsv", append=T, row.names=F, col.names = F)
count <- count + 1
}
total_count.tsv
is used to display total SNP by parameter.
library(ggplot2)
snp_count<-read.delim("total_count.tsv", sep=" ", header=F)
names(snp_count)<-c("m", "n_snps")
snp_count$m<-as.factor(snp_count$m)
p<-ggplot(data=snp_count, aes(x=m, y=n_snps)) +
geom_point() + theme_classic() + scale_y_continuous(limits = c(0, 3000))
p
distributions.tsv
is used to show parameter distributions.
snp_table<-read.delim("distributions.tsv", sep=" ", header=F)
names(snp_table)<- c("n_snps","n_loci", "n_loci_percent", "m")
snp_table$n_loci_percent<-snp_table$n_loci_percent*100
snp_table$n_snps<-ifelse(snp_table$n_snps < 9, snp_table$n_snps, "9 +")
snp_table$n_snps<-as.factor(snp_table$n_snps)
snp_table$m<-as.factor(snp_table$m)
q<-ggplot(data = snp_table) +
geom_col(aes(x=n_snps, y=n_loci_percent, fill=m), position="dodge") + theme_classic()
q