Treemix
treemix
was written by Joseph K. Pickrell and Jonathan K. Pritchard and is available here. If given a set of allele frequencies from a number of populations, it will return the maximum likelihood tree for the set of populations, and optionally attempt to infer a number of admixture events.
Treemix assumes unlinked SNPs and we are thus first going to prune the file for SNPs in high LD. We learned how to do this previously in more detail in the PCA tutorial, but we will repeat it again here. One could also account for linkage in treemix
by setting -k 1000
- i.e setting a large blocksize for the jacknife resampling. However, our current SNP dataset is too large for us anyway. Furthermore, treemix
does not like missing data. We will therefore remove sites with missing data and perform linkage pruning.
FILE=dogs
mkdir treemix
cd treemix
vcftools --gzvcf /home/data/vcf/$FILE.vcf.gz --max-missing 1 --recode --stdout | gzip > $FILE.noN.vcf.gz
For LD-pruning, you can use the LD-pruning script as you already know how to linkage prune.
wget https://github.com/joanam/scripts/raw/master/ldPruning.sh
chmod +x ldPruning.sh
./ldPruning.sh $FILE.noN.vcf.gz
gzip $FILE.noN.LDpruned.vcf
FILE=$FILE.LDpruned
treemix
requires a special input format. We will use a script that generates this input file from a vcf
and also a clust
file - which is a file that provides the information on which sample belongs to which taxon (see here for some further information). The clust
file contains three columns, whereby the first and the second column indicate the name of the individual and the third column indicates the taxon name. We can thus easily generate the clust
file from our .pop
file that we generated previously for ADMIXTOOLS
:
bcftools query -l $FILE.vcf.gz | awk '{split($1,pop,"."); print $1"\t"$1"\t"pop[2]}' > dogs.clust
With this clust
file ready, we can run the conversion script. This script requires plink2treemix.py from here:
vcf2treemix.sh $FILE.vcf.gz $FILE.clust
Once it is finished, we can already run treemix
. For treemix
we need to specify how many migration edges it should add. We want to run it with 0 to 5 migration edges and we set the Golden Jackal as the outgroup. Because we only use a single individual per taxon, we need to use the flag -noss
. Otherwise, treemix will apply a correction for small sample size which is too conservative. Ideally, we would really have multiple individuals of each taxon.
for i in {0..5}
do
treemix -i $FILE.treemix.frq.gz -m $i -o $FILE.$i -root GoldenJackal -bootstrap -k 500 -noss > treemix_${i}_log &
done
Now we need to download all output files treemix
has produced to our local machines for visualization of the results in R
. You should also download the plotting_funcs.R
script which allows you to plot treemix
results. This R script is provided with treemix
. In addition, you need to download /home/data/vcf/dogs.list which just contains all the populations/breeds (one per line).
First, we need to set the working directory and give a prefix for the file names:
setwd("~/treemix/") # of course this needs to be adjusted
prefix="dogs.LDpruned"
We also need to load the following packages:
library(RColorBrewer)
library(R.utils)
source("plotting_funcs.R") # here you need to add the path
Now, we can plot the 6 runs of treemix
side-by-side:
par(mfrow=c(2,3))
for(edge in 0:5){
plot_tree(cex=0.8,paste0(prefix,".",edge))
title(paste(edge,"edges"))
}
To check which parts of the tree are not well modelled for the different runs, we can plot the residuals. For that we need a file with the different taxa listed.
for(edge in 0:5){
plot_resid(stem=paste0(prefix,".",edge),pop_order="dogs.list")
}
Note, that there are also other ways of inferring admixture. One nice method is making admixture graphs.