Title: FastNet: Fast and accurate inference of
phylogenetic networks using large-scale genomic
sequence data

Authors: Hussein A. Hejase & Natalie VandePol & Gregory A. Bonito & Kevin J. Liu

LICENSE: All data and scripts are distributed under the terms of the GNU General Public License as published by the Free Software Foundation. You can redistribute or modify them under the terms of the GNU General Public License, either version 3 of the License or any later version.

This file describes the simulated and empirical data and the scripts used to run the analysis:

Simulation Study

————————————

1- Generate model trees using r8s.

2- Remove branch lengths of model trees using remove-bl.R

3- simulate.R, simulate-ret.R, random_network.R
These R scripts take the model trees simulated by r8s as input and generate the ms command for the model network.

4- Run the following command to parse the true gene trees:
sh parse_gene_trees.sh \<num species\> \<height\> \<migration_rate\> \<theta\> \<numRep\>

5- run_seq_gen.sh is a bash script that uses seq-gen to simulate DNA sequence evolution along a set of gene trees generated by ms.
To run it, use the following command: sh run_seq_gen.sh \<num species\> \<height\> \<migration_rate\> \<theta\> \<number of replicates\>
where theta is 0.08.

6- run_parse.sh is a bash script that parses the sequence alignments generated by seq-gen and uses them as input to FastTree to infer a gene tree for each DNA sequence alignment. To run it, use the following command: sh run_parse.sh \<num species\> \<height of model phylogeny\> \<migration_rate\> \<theta\> \<number of replicates\>

7- Run the following script to get the gene trees without the outgroup:
Rscript get_inferred_gene_trees.R

8- Run the following script to get gene trees with the outgroup:
Rscript get_inferred_gene_trees_with_outgroup.R
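
Steps 4 to 6 take the same five positional arguments in the same order. The following is a minimal dry-run sketch, not part of the repository, that prints the three commands rather than running them. The variable names and the `sim_commands` helper are hypothetical, and the values other than theta=0.08 are borrowed from the FastNet arguments listed in this README on the assumption that the simulation used the same settings.

```shell
#!/bin/sh
# Assumed example values; only theta=0.08 is stated for the simulation steps.
num_species=21
height=5
migration_rate=5
theta=0.08
num_rep=20

# Hypothetical helper: print the commands for steps 4-6 without executing them.
sim_commands() {
  for script in parse_gene_trees.sh run_seq_gen.sh run_parse.sh; do
    echo "sh $script $num_species $height $migration_rate $theta $num_rep"
  done
}

sim_commands
```

Printing the commands first makes it easy to check the argument order before launching the actual scripts.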

FastNet
————

Arguments:
path=\< current path \>
taxa=21
height=5
migration=5
theta=0.08
numRep=20
subproblem_size=5
ret=1 or 2 or 3
genetrees=1000
sample_size=1
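
As a minimal sketch, the arguments above can be set as shell variables before running the steps below. The `check_ret` guard is a hypothetical addition, not part of the FastNet scripts, and `ret=1` is simply one of the three allowed values.

```shell
#!/bin/sh
# Values as listed in the Arguments section of this README.
path=$(pwd)          # the README gives this as the current path
taxa=21
height=5
migration=5
theta=0.08
numRep=20
subproblem_size=5
ret=1                # the README allows 1, 2, or 3
genetrees=1000
sample_size=1

# Hypothetical guard: accept only the ret values the README lists.
check_ret() {
  case "$1" in
    1|2|3) return 0 ;;
    *) echo "ret must be 1, 2, or 3 (got: $1)" >&2; return 1 ;;
  esac
}

check_ret "$ret" && echo "ret=$ret is valid"
```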

How to run FastNet?
————————————
1. Run ASTRAL to get a guide tree:
sh run_ASTRAL.sh $path $taxa $height $migration $theta $numRep

2. Root the ASTRAL tree:
Rscript root_ASTRAL_tree.R $path $taxa $height $migration $theta $numRep

3. Decompose the full set of taxa into disjoint subproblems:
Rscript generate_subproblems.R $path $taxa $height $migration $theta $numRep $subproblem_size

4. Create NEXUS files for the disjoint subproblems to run MLE:
sh create_nex.sh $path $taxa $ret $genetrees

5. Create datasets: for each dataset, sample 1 taxon from each subproblem:
Rscript get_samples.R $path $taxa $height $migration $theta $numRep $sample_size
sh run_candidate.sh $path $taxa $ret $genetrees cand

6. Create datasets for all possible combinations of disjoint subproblems:
Rscript combine_subproblems.R $path $taxa $height $migration $theta $numRep
sh run_candidate.sh $path $taxa $ret $genetrees comb

7. Run the inference procedure on an HPCC cluster

8. Parse network and MLE scores for all subproblems:
for i in `seq 0 $ret`;
do
  sh parse_network_subproblems.sh $path/$taxa/genetrees $i 
  sh parse_network_candidates.sh $path/$taxa/genetrees $sample_size $i $numRep
  sh parse_network_combine.sh $path/$taxa/genetrees $i $numRep 
  Rscript select_network_Lscore_candidate.R $path/$taxa/genetrees $numRep $sample_size $i
  Rscript select_network_Lscore_subproblems.R $path/$taxa/genetrees $numRep $sample_size $i $taxa $height $migration $theta
  Rscript select_network_Lscore_combine.R $path/$taxa/genetrees $numRep $i
done

9. Run merge.R to merge subproblems into a phylogeny
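
The loop in step 8 uses `seq 0 $ret`, which counts from 0 up to and including `$ret`, so the parsing and selection scripts run `ret + 1` times. The short sketch below only lists the indices visited; `loop_indices` is a hypothetical helper, and `ret=2` is an example value. The README does not say what each index means; presumably it is the number of reticulations considered.

```shell
#!/bin/sh
ret=2   # example value; the README allows 1, 2, or 3

# Hypothetical helper: list the indices the step 8 loop visits for a given ret.
loop_indices() {
  seq 0 "$1" | tr '\n' ' ' | sed 's/ $//'
}

echo "ret=$ret visits: $(loop_indices "$ret")"
```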