Skip to content

Latest commit

 

History

History
190 lines (117 loc) · 15.3 KB

README.md

File metadata and controls

190 lines (117 loc) · 15.3 KB

Rosids_AJB-D-19-00298

Estimating rates and patterns of diversification with incomplete sampling: A case study in the rosids

Miao Sun, Ryan A. Folk, Matthew A. Gitzendanner, Pamela S. Soltis, Zhiduan Chen, Douglas E. Soltis, Robert P. Guralnick

Note:
Scripts and documentation are provided here with an assumption that users have basic knowledge of UNIX shell, R, and Python, including changing the working directory, and pointing to the input and output path to link data and properly execute the scripts. In order to fully reproduce the results in our study, high performance computing clusters (HiPerGator at University of Florida) must be used for BAMM and RPANDA analysis.

Here we lay out below each of our scripts based on the order of the workflow as described in the main text. All the scripts and their relative folders are descriptively named based on their primary functionalities in the analyses, and detailed under each highlighted bullet point.

Datasets

  • trees

    This folder contains four subfolders: "rosids_9k-tip", "rosids_20k-tip", "rosids_100k-tip" and "whole_tree". See "readme" for some naming conventions, and such conventions are ture all across data and scripts in this repo.

    As described in the main text, all analyses were run in parallel across 17 subclades corresponding to all rosid orders recognized in APG IV (2016), hence all the 17 subclades trees are inside 9k-, 20k-, and 100k-tip folders, respectively.

    We also used the whole tree (see "whole_tree" folder) for DR and RPANDA analyses.
    All trees are ultrametric.

  • clade_age
    Crown age of major clades in rosids were extracted from 9k-, 20k- and 100k-tip trees, respecitvely using R script "./Scripts/misc/get_crown_age.R".

  • diversification_data

    Raw data and summarized results from three commonly used methods, both parametric (RPANDA and BAMM) and semiparametric (DR). Details see below under the name of each folder. Our general convention is that the summarized results are located at the root of each method directory.

    • RPANDA
      A folder with all the resultant data generated by RPANDA analyses for 9k-, 20k-,and 100k-tip trees, including Rdata from birth-death and pure-birth models, summarized Akaike-weight, and diversification rates. Raw data is retained inside folders, and summarized data (.csv format) under directory of main folder. Any files with "17" tag means data was analyzed at subclade level, while "whole" means the whole rosid tree is analyzed.

    • BAMM
      Two main folders under this directory --- "BAMM_BD"" and "BAMM_PB". They contain main results from BAMM analyses under birth-death ("BD") and pure-birth ("PB") models, respectively. We only used 17 subclades from 20k-tip tree for testing and comparing how tip rates and speciation rates were influenced by different diversification models.
      Note: given the large volume of BAMM event data (e.g., 2.84 GB for "Fabales" event data) and limitations on the file size that can be commited via git, only summarized rate files were uploaded.

    • DR
      A folder with all the DR rates calculated by R script "./Scripts/DR/DR_statistic.R". DR rates from 17 subclades are retained in 9k, 20k, and 100k tagged folders, respectively. DR rate for the whole tree are kept in "whole_tree" folder. Mean Dr rate for eahc order are summarized in csv file under "DR" directory.

    • Cucurbitaceae_Test_Case
      This folder contains four sub-folders: "Cucurbitaceae_original_subclade_tree", "Cucurbitaceae_random_sampling", "Cucurbitaceae_random_sampling_addin", and "Cucurbitaceae_representive_sampling".

      • Cucurbitaceae_original_subclade_tree
        Diversification results from one single Cucurtbitaceae tree extract from 20k-tip rosid tree, including 528 tips (tree file in "data" folder).

      • Cucurbitaceae_random_sampling
        Diversification results from the simulation of randomly missing species by randomly dropping extant species from the original Cucurbitaceae tree (Cucurbitaceae_original_subclade_tree) at four sampling levels (10%, 30%, 50%, and 75% of sampled species), with 10 replicates for each sampling treatment (40 trees in total, see tree files in "tree_40" folder).

      • Cucurbitaceae_random_sampling_addin
        Diversification results from the simulation randomly missing species that are added in via backbone taxonomies via randomly dropping extant species at four sampling levels (as introduced above) and then adding them back to the phylogeny by attaching them to the most recent common ancestor (MRCA) of the genus, with the tip branch length extended to the present, similar to Smith and Brown (2018). These steps were done in 10 replicates with OpenTree PY Toys (40 trees in total, see tree files in "tree_40/resulted_tree" folder).

      • Cucurbitaceae_representive_sampling
        Diversification results from the simulation representative sampling. We pruned 528-tip Cucurtbitaceae tree from 20k-tip rosid phylogeny to a genus-level phylogeny by randomly selecting one species in each genus in 10 replicates. Across these scenarios, we repeated the diversification methods for the 10 replicated trees (10 trees in total, see tree files in "tree_10" folder)).

        For these 10 replicate genus-level trees, we also explored the impact of a global sampling probability (one missing species proportion imposed as the parameter for the entire tree) and species-specific sampling probabilities (missing species parameters for arbitrarily defined clades, often named taxa) on diversification rates implemented in BAMM (see "Cucurbitaceae_representive_sampling/genus_10BAMM/global_sampling" and "Cucurbitaceae_representive_sampling/genus_10BAMM/species-specific_sampling").

Scripts

Note: All the scripts here are categoried under the name of their directories (excuting diversification analyses, summarizing results and plotting figures in the main text, and supporting information). Please modify and confirm the input and output path before executing these scripts. All the file names here descriptively denote the specific purpose and follow the order of the main text (which see); further explanatory notes are selectively given below.

  • DR

    • DR_statistic.R
      This script computes the DR statistic for the entire rosid tree and each ordinal tree. The method was described by Jetz et al. (2012), and the script was derived from Harvey et al. (2016).

    • Extract_DR_subclade.sh
      This bash script is used to extract DR rates for each order based on the species (tip) belong to which order. It'll loop through all 17 rosid order based on list with names of all rosid orders. All see comments inside the script.

    • Summ_mean_DR_each_order.R
      This R script will go through each of order folder created by step above, then summrized mean DR rates for each order.

  • RPANDA

    • fitbd.R
      This script fits 9 time-dependent likelihood diversification birth-death RPANDA models to rosid 9k-, 20k-, and 100k-tip trees, and 17 subclades, and saving the results of fitted models as R objects.
      outputs a summary table of parameters and values estimated from each model for each rosid subclade.

    • fitbd_summ.R
      This script outputs summary table of parameters and values estimated from each model for each rosid subclade.

    • RPANDA_PB_model.R
      This scipt will only extract pure-birth models from the 9 models mentioned above.

    • RPANDA_summary_AW.R
      This scipt reads in the summary table generated above, and then selects the model with the smallest Akaike Information Criterion (AIC; Akaike, 1974) value and largest Akaike weight as the best diversification model for each rosid tree, or subclade, then outputs these values into a table (see Appendix S2b).

    • RPANDA_model_weighted_mean_speciation_rate_summ.R
      As explianed in the script title. Also see Appendix S3a.

  • BAMM

    Given different sampling sclaes (9k-, 20k-, and 100k-tip), it is challenging for BAMM to reach MCMC convergence from a single run; this was true for the global tree and even for the larger subclades. Hence multiple runs were conducted for each large-size clade of each dataset. In the end, we concatenated event data from each run for each order and only kept one header using the scripts below. Most of the scripts can be both applied to post-run analyses for BAMM under both "BD" and "PB" models, with the excpetion of configuration files for BAMM runs.

    • Run_priors.sh
      This bash script is used for “setBAMMpriors” in BAMM analyses; combining information from rosid_17_order_sampling_fraction.csv and write_prior.R

    • write_prior.R
      This dummy script is used by Run_priors.sh and will output parameters for each order to feed BAMM_diversification.config

    • BAMM_BD_diversification.config and BAMM_PB_diversification.config
      BAMM control file for "BD" and "PB" diversification analyses, containing a replaceable parameter template, which would be modified by Run_BAMM_config_setup.sh script below for each specific order.

      If not converged, this file should be modified again to load event data from previous run with additional generations; for more details see BAMM website.

    • Run_BAMM_config_setup.sh
      This bash script will replace parameter templates (above; denoted XXX) with specific values (rosid_17_order_sampling_fraction.csv) corresponding to each rosid order, as well as parameters produced by Run_priors.sh script. After this step, the BAMM configure file is ready to run

    • BAMM.sbatch
      Slurm job script from running BAMM including information of computational resources request.

    • BAMM_converge_checker.R
      This script will check the combined "mcmc" file to ensure MCMC convergence (>200 for both the number of shifts and log likelihoods).

    • BAMM_post-run_data_collector.sh
      This bash script will combine mcmc data and event data for each order from all the BAMM runs, respectively. It will prepare the required data files for BAMM_postrun_analyses_Order_Batch.R step below.
      combine_BAMM_eventdata.sh has similar function, but only works for event data.

    • config_nrun_maker.sh
      As noted above, it required multiple runs to each convergent, so if n-th run failed to converge, then (n+1)-th will launched automatically by run this bash script, with features of modifying corresponding parameters in the configure file.

    • BAMM_postrun_analyses_Order_Batch.R
      This script evaluate MCMC convergence of BAMM runs for each order (Order), and also extracts summaries of "tip rates"", "mean lambd", "rate-through-time matrices", etc. for downstream analyses. It also saves event data as an .rds file for read-in efficiency.

      BAMM_analysis_clade.R has similar function, just working on single clade.

    • rosid_9k-20k_mean_rates_summ.R and rosid_100k_mean_rates_summ.R
      These two scripts are used to summarized mean median tree-wide speciation rates and tips rate from all three rosid trees.

    • BAMM_BD_vs_PB_sum_rate_plot.R
      This script will plot and compare the rates through time summarized from rosid 20k-tip tree under BD and PB models (also see Appendix S3d).

  • Cucurbitaceae_Test_Case

    This folder contains four sub-folders: "Cucurbitaceae_original_subclade_tree", "Cucurbitaceae_random_sampling", "Cucurbitaceae_random_sampling_addin", and "Cucurbitaceae_representive_sampling". All the scripts under each subfolder are corresponding with data files in the four sub-folders in Datasets section.

    Generally, names for these scripts are self-explanatory, including random tree generation, BAMM configure file, slurm job submission files, some bash scripts for automation, and Rscripts for three diversification analyses and their post-run summary.

  • Figs

    The scripts correspond to Figs.2-8 in the main text. Most are barplots and diversification rate curves from BAMM.

  • misc

    • Appendix_S3d.R and Appendix_S4.R
      These scripts are used to generate plots in Online Supplemental files --- Appendix_S3d and Appendix_S4.

    • Cucurbitaceae_genus10_AW_rate_summ.R
      This script is used to summarized rates from three diversification methods, and conducting "TukeyHSD" test and some exploratory plotting (some may not used in paper) for different rates dataset from different sampling treaments (see Material and Methods section in the main text).

    • get_crown_age.R
      This script is used to extract ages of major rosid clade (see Fig. 2).

    • Run_config_single_clade.sh
      This script will generate and setup BAMM configure file for one single order (clade).

    • make_ultra.R and make_ultra.sh
      All the trees are dated, and ultrametric, but different OSs have different convents of round up maximum digits for the branch length info. Hence these two scripts will work together to address this issue, making trees ultrametric to facilitate downstream diversification analyses.

    • subclade_extract_V2.sh
      By providing a larger tree and a list of two tips used as most recent comment ancester (MRCA) to defined a clade, this bash script will extract the clade using phyx (make sure Phyx is installed).

    • summary_rate.sh
      This script will summarize "Speciation Rate" and "Tip Rate" of each rosid order into one table (also see "./Scripts/BAMM/rosid_100k_mean_rates_summ.R" and "./Scripts/BAMM/rosid_9k-20k_mean_rates_summ.R").

Requirements

  • R V.3.5.3

  • Python3

  • Bash

  • Phyx

  • Newick Utilities
    The link with installation and mannual

    The data analyses in this study were conducted either on a MacBook Pro laptop (OS-X) or on a Linux cluster system (HiPerGator).

If you found this repository useful, please cite our work and/or this repo: DOI

Cite as:
Sun et al. (2020, March 24). Cactusolo/Rosids_AJB-D-19-00298 v1.0.3 (Version v1.0.3). Zenodo. http://doi.org/10.5281/zenodo.3725025