BDMM-Prime: Linear birth-death-sampling phylodynamics

Source code: https://github.com/tgvaughan/BDMM-Prime

1. Introduction

BDMM-Prime is a BEAST 2 package for performing phylodynamic inference under a variety of linear birth-death-sampling models with/without types.

1.1. Comparison with BDMM

The BDMM-Prime project is a hard fork of the original BDMM project. The intention is to extend the functionality of the original package, while improving its flexibility and ease of use. It incorporates the following enhancements:

  • an improved BEAUti interface that allows a much more diverse range of analyses to be configured,
  • automatic fall-back to analytical solutions for unstructured (single type) analyses, meaning BDMM-Prime can in principle be used in place of BDSKY,
  • use of stochastic mapping for efficient sampling of ancestral states,
  • a particle filtering algorithm allowing joint sampling of population trajectories,
  • a heavily refactored code base intended to make the package easier to use, extend and maintain.

As a result of the many changes that were required in making this transition, BDMM-Prime is completely incompatible with BDMM itself. Thus the original package will be maintained separately to ensure that BEAST 2 XMLs and packages that depend on it remain usable.

1.2. Acknowledgements and Citations

As this is a fork of BDMM, BDMM-Prime owes a huge debt to the authors and contributors of that project, in particular Denise Kühnert and Jérémie Scire.

If you use this package as part of your research, please cite these papers:

  • Vaughan and Stadler, Bayesian phylodynamic inference of multi-type population trajectories using genomic data, doi:10.1101/2024.11.26.625381 (preprint)
  • Scire et al., Robust Phylodynamic Analysis of Genetic Sequencing Data from Structured Populations, Viruses, 14:8, 1648 (2022), https://doi.org/10.3390/v14081648
  • Kühnert et al., Phylodynamics with Migration: A Computational Framework to Quantify Population Structure from Genomic Data, MBE, 33:8, 2102-2116 (2016), https://doi.org/10.1093/molbev/msw064.

2. Getting Started

2.1. Installation

BDMM-Prime requires a working installation of BEAST 2.7 which can be obtained from https://www.beast2.org/. The package itself can then be installed via the built-in package manager in the following way:

  1. Open BEAUti.
  2. From the File menu select Manage Packages.
  3. Click the Package repositories button at the bottom of the dialog box.
  4. Click Add URL and enter the following repository URL:
    https://tgvaughan.github.io/BDMM-Prime/package.xml.
    The list of repositories should now look like this: package_repo.png
  5. Close the repository manager.
  6. The latest version of BDMM-Prime should now appear in the list of packages: package_manager.png Select it, then click the Install/Upgrade button.
  7. Close the package manager.

BDMM-Prime should now be available to use on your system. (You will need to restart BEAUti before you can set up any analyses.)

2.2. Setting up your first analysis

Here we will step through the process of setting up, running and interpreting a simple BDMM-Prime analysis. For the purpose of demonstration, we will focus on applying a multi-type model to a simple two-type epidemiological data set, but the same general approach can be taken to set up single-type analyses of other data sets too.

BDMM-Prime analyses can be set up in BEAUti in much the same way as analyses using other BEAST 2 models such as BDSKY or the Bayesian Skyline Plot. The only part unique to BDMM-Prime is the "tree prior" configuration. Nonetheless, below we describe the complete process of setting up a small BDMM-Prime analysis.

The following should be read like a short tutorial. Steps to perform on your computer are highlighted like this:

Start BEAUti.

Note: the following assumes some limited familiarity with BEAST 2. If you've never used this software before, please consider working through the Introduction to BEAST 2 tutorial.

2.2.1. Loading sequence data into BEAUti

For this tutorial, we'll be using the example influenza data which is installed along with BDMM-Prime. This data, assembled from publicly-available data downloaded from NCBI GenBank, is the same set used in this MultiTypeTree tutorial. It consists of an aligned set of 60 H3N2 HA sequences sampled from New Zealand and Hong Kong. To make finding this data easy, we first select the package working directory.

Select "Set working dir" from the File menu, then choose "BDMM-Prime" from the sub-menu.

To load the data, simply choose "Import alignment" from the File menu and select one or more FASTA files.

Select "Import alignment" from the File menu to display a file selection dialog box. Using this, open the examples subdirectory and select the file h3n2_2deme.fna.

The BEAUti window should now look something like the following: tutorial_seqloaded.png

2.2.2. Setting up tip dates

The sequences in this data set are sampled at different times, so use the following instructions to import this information into the analysis. (This step is unnecessary if sequences were collected at the same time, or near enough given the anticipated scale of the tree.)

Select the Tip Dates panel and check the "Use tip dates" option. Then press the "Auto-configure" button to set the times based on values extracted from the sequence headers. Finally, use the radio button and dropdown menu to interpret everything after the last "_" character as a numerical tip time.

tutorial_tipdates.png

2.2.3. Setting the site model

For this example, we will use the basic HKY substitution model, with equilibrium nucleotide frequencies fixed to the empirical frequencies of characters in our alignment.

Select the "Site model" panel. Select the HKY model with "empirical" nucleotide frequencies.

tutorial_sitemodel.png

This choice to fix the equilibrium frequencies is made here only to reduce the computational complexity of the tutorial analysis. For production analyses, these should probably be estimated, and Gamma-distributed site-to-site rate heterogeneity should also be used.

2.2.4. Setting the clock model

Since we have serially-sampled data, our analysis requires some kind of molecular clock. For the sake of simplicity, we use a strict clock, setting the initial value of the clock rate to $5\times 10^{-3}$ substitutions per site per year, which is close to the true value for influenza.

Select the "Clock model" panel. Set the "mean clock rate" value to 0.005.

tutorial_clock.png

2.2.5. Selecting the tree prior

The tree prior (also known as the phylodynamic likelihood) is the component of the analysis which BDMM-Prime provides. Here we configure the particular birth-death model we will use to relate the tree to the various population-level parameters we'd like to learn about.

Select the "Priors" panel. To choose the BDMM-Prime tree prior, find the drop-down menu next to Tree.t:h3n2_2deme and select "BDMMPrime".

tutorial_treeprior.png

2.2.6. Setting up the sample types

The next thing we want to do is to specify the locations associated with each of the samples included in our analysis.

Scroll down to the end of the tree prior section and find the table associating tree leaves with locations. Notice that the sample names have the form ID_Location_Date. To extract the locations from the sample names, click the Auto-configure button, select split on character, ensure "_" is specified as the delimiter, and select "2" from the take group(s) drop-down menu. Once this is complete, press "Ok".

tutorial_tiptypes.png

The table should now be populated with types extracted from the sample names.

tutorial_tiptypes2.png

2.2.7. Configuring the multi-type model

Now we get to the main part of the analysis setup, where we decide exactly how to model the generation of our data. In our case, we want to model a small part of the global transmission dynamics of H3N2.

The first thing to do is to select our desired parameterization of the birth-death model. In our case, as we're analysing pathogen sequences in an attempt to reconstruct transmission dynamics, we'll use the epidemiological parameterization.

Expand the tree prior section by clicking the arrow next to Tree.t:h3n2_2deme on the left-hand side of the screen. Then select "Epi Parameterization" from the drop-down menu at the top of the expanded section.

tutorial_parameterization.png

With this done, the next task is to define the dimensionality and initial values of the multi-type birth-death parameters. In BDMM-Prime, each of these parameters is considered a "skyline parameter" which can change in a piecewise-constant fashion through time.

The first skyline parameter to consider is the Re parameter, representing the effective reproductive number. By default, parameter values are assumed to be the same across all types.

For the Re skyline parameter, uncheck the "Scalar values" check box to allow this parameter to take type-dependent values.

tutorial_ReSetup.png

The next parameter is the "become uninfectious rate" parameter, representing the rate at which infected individuals become uninfectious. As this parameter is predominantly a function of the pathogen rather than location, we will assume it takes a single value which is shared among types. Furthermore, as influenza infections often take a relatively short time to pass, we will initialise this value to correspond to an infectious period of 1 week.

Find the Become Uninfectious Rate parameter, double-click the starting value and change it to 52 (per year). Be sure to press "Enter" to cause BEAUti to accept the new value.

tutorial_BUSetup.png

Now let's consider the Sampling Proportion parameter. This parameter requires special care, as the sampling assumptions of the model can dramatically influence inference results. In our case we want (a) to allow for type-dependent sampling proportions and (b) to fix the sampling proportion to zero for all times earlier than the first sample.

In BDMM-Prime, this configuration is reasonably simple to set up.

Find the Sampling Proportion parameter, then follow these steps:

  1. Uncheck the "Scalar values" checkbox to allow for type-dependent values.
  2. Check "Display visualization" to see the distribution of sample times relative to the most recent sample.
  3. Set the "Number of change times" value to 1. (Notice the appearance of the epoch boundary marker on the visualisation.)
  4. Set the change time for the boundary between epoch 1 and 2 to 5.7. (Notice that this value ensures the oldest sample remains in epoch 1.)
  5. In the "Values" table, double-click the entries corresponding to the sampling proportion values for Epoch 2 and change each to 0.0, remembering to press "enter" after each change.

Note: skyline parameter values of 0 are always fixed in the analysis, regardless of whether or not the "Estimate values" option is checked.

tutorial_SampSetup.png

Finally, we need to define how lineages change type in our multi-type model. Like BDMM, BDMM-Prime allows both direct "migration" as well as "birth to a new type" styles of multi-type models. Given we are modelling the movement of infected influenza hosts, migration suits our situation best. We also acknowledge that migration rates between different pairs of types/locations may be different from one another.

To incorporate these decisions, follow these instructions:

Find the Migration Rate parameter, then:

  1. Double-click on the value and change it from its default to the smaller rate of 0.1 (any infected individual has a 10% chance of moving between the two locations in any given year).
  2. Uncheck the "Scalar values" checkbox to allow rates to vary between pairs of locations.
  3. Check the "Estimate values" checkbox to ensure these migration rates are estimated.
tutorial_migSetup.png

2.2.8. Setting the remaining parameter priors

With the multi-type model configured, what remains is to define priors for each of the parameters we're estimating. As in any other BEAST analysis, this includes the parameters of each of the various site, clock and phylodynamic models we've configured.

Rather than describe in detail how to set these priors, here we will simply list a set of simple priors which can be used for the purpose of demonstrating the model. (Detailed information on setting up priors in BEAUti can be found elsewhere; see for instance the Prior Selection tutorial on the Taming the BEAST website.)

Set the following parameter priors:

Parameter                                Prior
ReEpi.t:h3n2_2deme                       LogNormal(0,0.5)
becomeUninfectiousRateEpi.t:h3n2_2deme   1/X
migrationRateEpi.t:h3n2_2deme            Exp(0.1)
samplingProportionEpi.t:h3n2_2deme       Unif(0,1)

2.2.9. MCMC and Logging setup

Finally, we need to configure the details of the MCMC chain to run and the trees and parameters to be logged.

Switch to the MCMC panel and change the Chain Length to 1000000 ($10^6$).

tutorial_MCMC.png

For this demonstration we will leave everything else as-is. Notice however that beyond the regular trace, screen and tree logs, there are three additional loggers, described briefly below and in more detail in the section on sampling latent variables:

typedTreeLogger
A logger which uses stochastic mapping to generate samples from the marginal posterior over edge-typed trees, which amount to hypotheses about when and where each type change occurred on the tree.
nodeTypedTreeLogger
Similar to typedTreeLogger, but the sampled trees include type information only at internal nodes. Used to build summary trees with BEAST 2's TreeAnnotator utility.
trajLogger
A logger which uses particle filtering to log samples from the marginal posterior of multi-type population trajectories.

Each of these loggers is optional, and can be enabled/disabled by clicking the arrow buttons to their left and checking/unchecking the "Enable logger" box. By default edge-typed tree logging is enabled, but trajectory logging is disabled.

2.3. Saving and Running the analysis

Now we can save and run the analysis:

  1. Open the "File" menu and select "Save". Choose a sensible location in your filesystem in which to save the analysis file, which should be named h3n2.xml. (By default the location will be the examples/ folder from which you loaded the sequences: you'll want to choose somewhere else, perhaps a folder on your desktop.)
  2. Launch BEAST, select the h3n2.xml as the input file, then run the analysis.

2.4. Processing the results

Once the analysis is complete, we can have a look at the results. (If you're impatient, note that BEAST 2 regularly appends to output files and that these files can also be studied in the following way while BEAST is still running.) The output files which should have been produced are

h3n2.log
The parameter log file,
h3n2-h3n2_2deme.trees
A tree log file with no types marked,
h3n2.h3n2_2deme.typed.trees
A tree log file with all ancestral types and type changes marked,
h3n2.h3n2_2deme.typed.node.trees
A tree log file with only types at ancestral nodes marked.

2.4.1. The parameter log file

Firstly, let's open the parameter log file using the Tracer app (distributed with BEAST 2).

Open Tracer, select "Import trace file…" from the File menu. Use the interface to explore the marginal distributions as estimated from the MCMC samples.

We see immediately that the ESS for many of the parameters is low, which is unsurprising for such a short run. However, the focus here is on understanding how BDMM-Prime analysis results are presented in the log file rather than trying to seriously interpret the results.

With this in mind, notice first that the estimated parameters Re, becomeUninfectiousRate and migrationRate are represented in the log file by ReSPEpi, becomeUninfectiousRateSPEpi and migrationRateSPEpi, respectively. The "SP" indicates that the parameter is a "skyline parameter" (as opposed to BEAST 2's regular RealParameter).

Also notice that when we have allowed type-specific (not scalar) values, as in the case of Re, the parameter name appears multiple times with a suffix indicating the type associated with the specific value.

Finally, notice that when we have allowed the parameter to change at a particular time, each epoch/interval receives its own set of parameter values. In this case an additional ".endtime" value is also recorded, which identifies the end time of a specific interval, measured relative to the start time of the process.

tracer.png

2.4.2. The tree log files

The untyped tree log file can be dealt with in the same way as all BEAST 2 tree files. However we also have tree log files containing sampled ancestral type information which we can use to learn about the type change process as it occurred on the tree. In this influenza case, this contains potentially interesting information about global influenza spatial dynamics.

While in principle the .typed.trees file contains the most detailed information, it is easier to jointly summarize these ancestral dynamics with the tree itself by applying the TreeAnnotator program to the .typed.node.trees file.

  1. Open TreeAnnotator (distributed with BEAST 2)
  2. For the node heights option, select "mean heights". (This produces less biased estimates of divergence times than the default.)
  3. Select h3n2.h3n2_2deme.typed.node.trees as the input file.
  4. Choose an appropriate output file name (e.g. summary.tree)
  5. Click "Run".
treeannotator.png

Once complete, you can visualise the summary tree using any BEAST-compatible tree viewer, such as FigTree or IcyTree.

  1. Navigate to https://icytree.org in a javascript-enabled web browser.
  2. Select "Load from file" from the File menu and select the summary.tree file produced by TreeAnnotator.
  3. From the Style menu choose "colour nodes by" and then "type".
  4. Also from the Style menu choose "Node height error bars" and then "height_95%_HPD".

This should produce something quite like the figure below:

summary_tree.png

Notice that we've chosen to colour the nodes themselves rather than the edges. This is done precisely because in the case of these node-typed summary trees, it is only the types at the nodes that are being represented. Please resist the urge to colour the edges based on these values! While tree viewers make this easy, and it is frequently done, it is a misleading representation.

3. Model specification using BEAUti

In the following section we'll delve into the BDMM-Prime interface in more detail. Unlike the tutorial, this section is written in a more general way rather than considering a specific data set. However, we will occasionally refer to the example in the tutorial, so it would be helpful if you'd worked through that first before proceeding to the sections below.

3.1. The BDMM-Prime Tree Prior

Using BDMM-Prime amounts to selecting the BDMM-Prime tree prior when choosing priors for a particular analysis. This is a very flexible tree prior, encompassing a wide variety of single- and multi-type birth-death-sampling models, allowing for fixed-rate and time-dependent-rate analyses.

In earlier packages such as BDSKY and BDMM, the inputs to the relevant tree priors were organised in a very flat manner, with the tree prior directly taking inputs for birth and death rates, change times, and similar values for alternative parameterisations.

In contrast, inputs to the BDMM-Prime tree prior are organised in a hierarchical way which seeks to avoid repetition and improve flexibility. At the top level, the BDMM-Prime tree prior takes the following four main elements:

  1. The tree,
  2. tip type information,
  3. starting type prior probabilities, and
  4. a "parameterization" object defining how the model is to be parameterized.

The tree itself is automatically defined and provided to the tree prior by BEAUti. It therefore remains only to define the remaining three, which are detailed in the following sections.

Important: Be sure to define all model linkages between partitions before configuring the BDMM-Prime tree prior. Modifying linkages later on can result in broken BEAST 2 models.

3.2. Defining Tip Types

A "Type Trait Set" is used to associate types with each of the leaf nodes of the tree (i.e. each of the samples in the sequence alignment). It is completely analogous to the trait sets used associate times with leaf nodes.

To set the tip types, open the Priors panel and ensure the BDMM-Prime tree prior is expanded by clicking the arrow button on the left-hand side of the window. The relevant input editor is labeled "Type Trait Set" and can be found by scrolling down past the "Parameterization" section.

The input editor functions very similarly to the tip dates input editor. All tips are initially given the same type, UNSPECIFIED. (There is nothing special about this value, only that it is shared among all tips, causing the model to include only one type.) Individual types can be specified using any string, although problems may occur if strings contain spaces or other whitespace characters.

To set the types, one can double-click on a specific value in the type column and enter a new value. This approach quickly becomes tedious, however, so I highly recommend that you instead click the Auto-configure button. This presents a dialog box which allows you to assign types by either extracting type values from a portion of the taxon/sequence name, or by loading them in from an external file. (The required format of this file is described in the built-in help which can be accessed by pressing the button labeled ?.)

3.3. Start type prior probabilities

Start type prior probabilities are a little-discussed yet crucial aspect of multi-type birth-death model specification. For a model with $M$ demes, the start type prior probabilities are $M$ positive numbers which sum to 1 and represent the (categorical) prior distribution over the possible types associated with the original individual in the multi-type birth-death process.

To set these prior probabilities, find the relevant input editor within the tree prior input editor. (It is usually located between the Parameterization and Type Trait Set input sections.) For a model with two types, it appears as follows in its initial configuration: startTypePriorProbs.png Probabilities are entered as numerical values with spaces between them. The number of values must match the number of demes in your model. The $i^{\text{th}}$ number is the prior probability for the $i^{\text{th}}$ type when the types are sorted in lexicographical (generalized alphabetical) order.

By default the prior probabilities are chosen to be uniform. In many cases this is a sensible choice, however it may be the case that you are certain that the birth-death process began (or did not begin) in some particular state. For instance, you might know - or be willing to assume - that an outbreak was seeded by travel cases from other locations. In such cases you can modify the probability values to reflect these prior assumptions.
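
For example, in the two-type tutorial analysis the types sort such that Hong Kong is type 0 and New Zealand is type 1, so entering the values "0.9 0.1" (rather than the default uniform "0.5 0.5") would encode a 90% prior probability that the process began in Hong Kong.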

It is also possible to "estimate" these prior probabilities. Clicking the estimate button causes BEAUti to assign a flat Dirichlet "hyper-prior" to the probabilities (i.e. a uniform distribution subject only to the constraint that the probabilities must be positive and sum to 1). We advise against doing this, as doing so is completely equivalent to simply setting a uniform prior in the first place.

3.4. Parameterizations

BDMM-Prime provides a simple plug-in mechanism for handling different parameterizations. With only one minor exception for the FBD parameterization (described below), all of these parameterizations provide perfectly equivalent specifications of the birth-death model.

Given this equivalence, one might wonder why multiple parameterizations are necessary at all. There are two main reasons these are important:

  1. Context-specific parameterizations allow for easier interpretation of inference results, and
  2. they allow appropriate field-specific prior information to be included.

This second point is arguably the more important as, for instance, a paleontologist might conceivably have prior information on past speciation or diversification rates, while epidemiologists might be able to more confidently put bounds on $R_e$.

Thus it is often a good idea to select the parameterization which best reflects your biological context.

The following sections briefly describe the currently-available parameterizations and their relationships to one another.

3.4.1. Canonical

The "canonical" parameterization is in a sense the native mathematical description of the birth-death model; the its parameters reflect the rates of the basic reactions which make up the process.

Birth Rate ($\lambda_{ii}$)
This is the (per individual per unit time) rate at which each individual in population $i$ produces new individuals of its own type.
Death Rate ($\mu_i$)
Similarly, this is the rate at which individuals in population $i$ are removed from the population.
Sampling Rate ($\psi_i$)
This is the rate at which individuals in population $i$ are "sampled". A non-zero rate here corresponds to a process that will produce samples that are distributed through time. (Sometimes referred to as $\psi$-sampling.)
Removal probability ($r_i$)
This is the probability with which sampled individuals in population $i$ are also removed from the population.
Rho sampling ($\rho_i$)
This parameterizes binomial sampling at a particular time (usually the present). If enabled, each individual in population $i$ is considered in turn, and sampled with this probability.

In addition, the following rates parameterize reactions which allow type changes to occur. In general these are specified as matrices, allowing for a unique rate for every ordered pair of distinct types.

Migration Rate ($m_{ij}$)
This is the rate at which an individual with some type becomes an individual of another type.
Birth Among Demes ($\lambda_{ij}$)
This is the rate at which an individual with some type produces an individual of another type.

Finally, the following parameter expresses the total duration of the birth-death-sampling process.

ProcessLength
The time between the start of the process and the most recent sample. (Here equivalent to the "origin" of the process.)

3.4.2. Epidemiological

The assumption behind the "epidemiological" parameterization is that individuals in the model represent infected individuals in a population. In place of the canonical birth, death and sampling parameters we have

Re ($R_{ii}$)
This is the type-specific effective reproductive number, defined by $R_{ii}=\lambda_{ii}/(\mu_i+r_i\psi_i)$. Values above 1 correspond to growing (in expectation) numbers of infected individuals.
Become uninfectious rate ($\delta_i$)
This is the rate at which infected individuals are removed from the infectious pool, either with or without sampling. In terms of the canonical parameters, this is $\mu_i+r_i\psi_i$.
Sampling proportion ($s_i$)
This is the probability with which infected individuals are sampled through time, defined as $\psi_i/(\mu_i+r_i\psi_i)$.

Besides the migration matrix, epidemiological compartment changes are parameterized by:

Re Among Demes ($R_{ij}$)
This is a generalisation of the effective reproductive number to account for transmissions between epidemiological compartments. It is defined as $R_{ij}=\lambda_{ij}/(\mu_i+r_i\psi_i)$.

(The rho sampling, removal probability, migration and process length parameters are identical to those described for the canonical parameterization.)
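
For reference, these definitions can be inverted to recover the canonical rates from the epidemiological parameters: $\lambda_{ii}=R_{ii}\,\delta_i$, $\psi_i=s_i\,\delta_i$, $\mu_i=\delta_i-r_i\psi_i=\delta_i(1-r_i s_i)$ and $\lambda_{ij}=R_{ij}\,\delta_i$.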

3.4.3. Fossilized birth-death process

The FBD parameterization is most appropriate when individuals in the model are intended to represent whole species, and birth and death events correspond to speciation and extinction. In place of the canonical birth, death and sampling rate parameters we have

Diversification ($d_{ii}=\lambda_{ii}-\mu_i$)
This is the rate at which species richness increases through time.
Turnover ($t_{i}=\mu_i/\lambda_{ii}$)
For $d_{ii}>0$, this parameter indicates the relative frequency with which species go extinct before speciation, and satisfies $0\leq t_i < 1$.
Sampling proportion ($s_i=\psi_i/(\mu_i+r_i\psi_i)$)
Analogous to the epidemiological parameterization, this indicates the probability with which species leave fossils before going extinct.

Besides the migration rate, species type changes are parameterized by:

Diversification between types ($d_{ij}=\lambda_{ij}-\mu_i$)
This is a generalisation of diversification rate, but allowing for one of the daughter species to carry a new type.

(The rho sampling, removal probability, migration and process length parameters are identical to those described for the canonical parameterization.)
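
As in the epidemiological case, these definitions can be inverted to recover the canonical rates: $\lambda_{ii}=d_{ii}/(1-t_i)$, $\mu_i=t_i\,d_{ii}/(1-t_i)$, $\psi_i=s_i\,\mu_i/(1-r_i s_i)$ and $\lambda_{ij}=d_{ij}+\mu_i$.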

Note: Unlike the Canonical and Epidemiological parameterizations, the FBD parameterization on its own does not permit exploration of the whole parameter space. In particular, it makes the assumption that diversification is always positive ($d_{ii}>0$ and $t_{i}<1$). If you also want to allow negative diversification, use the Canonical parameterization instead.

3.5. Skyline Parameters

Besides parameterizations, an important difference between BDMM-Prime and earlier packages like BDMM and BDSKY is the use of special-purpose "skyline parameters" to describe piecewise-constant variation in rates and other parameters. Each piecewise-constant function is specified using a combination of "change times" and "epoch values", the latter determining the value the function takes in each "epoch" between the change times.

The relationship between epochs, change times and epoch values is illustrated in the following schematic:

epochsAndChangeTimes.png

Each skyline parameter has its own set of change times, which can be specified as times relative to the start of the process or ages relative to the most recent sample in the tree. These times or ages can be specified as absolute values or as fractions of the total process length. Keeping these decisions close to the parameter they affect and independent of those affecting other parameters provides maximum flexibility and minimal confusion. The BDMM-Prime BEAUti interface provides each of these options for every relevant rate and probability parameter.
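
For example (values purely illustrative), a skyline parameter with change times $t_1=1$ and $t_2=3$, specified as times relative to the start of the process, and epoch values $v_1$, $v_2$ and $v_3$ defines a piecewise-constant function taking the value $v_1$ on $[0,1)$, $v_2$ on $[1,3)$, and $v_3$ from time 3 until the end of the process.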

Skyline parameters come in two main variants: Vector and Matrix skyline parameters. These differ only in that they are applied to vector-valued and matrix-valued parameters, respectively.

3.5.1. Overview of the skyline parameter input editor

In this section we delve into the details of the BDMM-Prime skyline parameter user interface. Although we focus here on vector skyline parameters, everything is also applicable to matrix skyline parameters.

The input editor BDMM-Prime uses for specifying skyline parameters is quite complex, reflecting the large number of ways in which these parameters can be configured. The general layout is as follows:

skylineLayout.png

Here you can see that the configuration editor consists of three main parts: the epoch configuration, the value configuration and an epoch visualiser for helping with the configuration. The following sections detail each of these parts in turn.

3.5.1.1. Epoch configuration

The epoch configuration portion of the input editor is responsible for defining the number of times a skyline parameter will change, and the times at which these changes will happen.

The "Number of change times" input specifies how many changes will take place. By default this is 0, indicating that the parameter will take a constant value at all times. In this case, none of the user interface elements for defining the change times are displayed. To see those, you need to specify at least 1 change time.

If this condition is satisfied, other elements are displayed. The first of these requests the (initial) values of the change times. By default these represent the "ages" of the change points prior to the most recent sample. Alternatively, by unchecking the "Times specified as ages" checkbox, the numbers are interpreted as times (increasing into the future) relative to the starting point of the process. In any case, the values must increase left to right: when specified as an "age", the first change time represents the most recent change time looking into the past. Otherwise, the first change time represents the oldest change time looking from the start of the process into the future.

By default, the change times are not estimated: they retain their chosen values through the whole analysis. If you choose, however, you can check the "Estimate change times" checkbox to tell BEAST to update these change times during the MCMC. (Remember to configure a prior for the change times if you do this!)

Finally, by checking the "Relative to process length" checkbox, one can cause the change times/ages to be interpreted as fractions of the process duration. (I.e. the time between the start of the process and the most recent genetic sample.) In this case, the numeric values provided for the change times must fall between 0 and 1. Note that the value of the "Times specified as ages" checkbox still determines whether these values increase into the past (default) or into the future. Clicking the "Distribute evenly" button causes these relative times or ages to be distributed evenly between 0 and 1.

3.5.1.2. Value configuration

The initial parameter values corresponding to each epoch are configured using an input table. By default, the "Scalar values" option is checked, causing these values to be applied to individuals of all types even when multiple types exist in the dataset. If you're performing a multi-type analysis and want different types to have their own value of a given parameter, uncheck this option.

Beware: When entering values using this table, always remember to press "Enter" (or "Return") after entering the value in order for BEAUti to make the requested modification.

When estimating values, the "Link identical values" option causes BDMM-Prime to consider identical values as representing the same degree of freedom. With this enabled, any set of values which are initially identical will continue to be identical throughout the analysis. This option is useful for setting up multi-type skyline analyses in which some of the type-specific values extend across epoch boundaries.

3.5.1.3. Epoch visualisation

The option "Display epoch visualisation" is useful when analysing samples distributed through time, as enabling it displays a graphical depiction of the sample distribution alongside the specified epochs and change times. Sample times are depicted using red and orange circles, with the coloration changing between every pair of epochs, allowing one to determine at a glance which epoch a given sample is currently assigned to.

This visualisation is particularly helpful when setting up sampling rate or proportion change times, which are often chosen based on the time of the oldest sample. (For instance, in the tutorial we configured the sampling rate of H3N2 such that it was set to zero prior to the oldest sample included in the analysis.)

3.6. Miscellaneous inputs

Besides the main settings for the model described above, there are several minor options and settings that can also be configured. Here they are in their default configuration:

miscParams.png

Below we describe each of them in turn.

Condition on Survival
This is a modelling option which, when enabled, causes the tree prior to include a term which conditions the tree probability on having at least one sampled individual in the dataset.
Use Analytical Single Type Solution
When there is only a single type in the dataset, this option tells BDMM-Prime to use analytical (BDSKY) expressions for computing the tree probability rather than to rely on numerical integration. For multi-type analyses this option has no effect. (Unless you are testing the numerical integration code, there is really no reason to ever disable this option.)
Rel Tolerance and Abs Tolerance
These parameters determine the tolerance thresholds given to the numerical integration algorithm used to compute the tree probability when multiple types are in play. You should usually leave these as-is.
Parallelize
This option determines whether or not BDMM-Prime will use multiple execution threads in an attempt to speed up the tree prior calculation on computers with multiple CPU cores available.
Parallelization Factor
When parallelization is enabled, this determines the fraction of all of the tree nodes a subtree must contain for its tree prior calculation to be assigned to a different thread. The default value of 0.1 should be fine in most cases.

4. Sampling Latent Variables

One important distinguishing feature of BDMM-Prime over the earlier BDMM package is its ability to produce samples of ancestral lineage types and multi-type population trajectories from the joint posterior distribution, even though these variables are not sampled using the MCMC algorithm. Instead, BDMM-Prime employs stochastic mapping and particle filtering to sample these latent variables during the logging procedure. By applying these algorithms to only a small subset of the states visited by the MCMC algorithm, we obtain these latent variable samples for almost no computational cost at all.

In BEAUti, both of these algorithms are configured in the MCMC pane (where the chain length and loggers are configured).

4.1. Tree edge types and type transitions

Tree edge types and type transitions are sampled using a stochastic mapping algorithm based on the work of Höhna et al., 2019. The stochastic mapping procedure and the corresponding logging of edge-typed trees is enabled by default. To enable/disable stochastic mapping, simply expand the typedTreeLogger and nodeTypedTreeLogger input editors and check/uncheck the "Enable Logger" checkbox.

mappingEnable.png

These input editors can also be used to adjust important inputs such as the name of the output files and the number of MCMC steps between each call to the mapping algorithm.

When enabled, these loggers produce two additional tree log files beyond the standard untyped tree log file produced by BEAST:

  1. a log file with the extension ".typed.trees" containing fully edge-typed trees, and
  2. a log file with the extension ".typed.node.trees" containing trees annotated only with the types at their internal nodes.

These files are equivalent to those produced by packages such as MultiTypeTree and BDMM. Both of these files can be examined using tree viewers such as FigTree or IcyTree, however only the typed node tree file can be summarized using TreeAnnotator. (See the tutorial section for an example of using TreeAnnotator to summarize typed node trees.)

Note that the most sophisticated use of these trees is in answering questions about the timing and type of events on the tree. For instance, one can ask, "When did the first migration on the tree occur?" To answer such a question, one could load the full .typed.trees file into R (using, for instance, the treeio package), then compute this time for every tree. This would yield a set of samples from the posterior distribution for this time, which could then be characterised.

4.2. Multi-type trajectories

Multi-type trajectories are sampled using a phylogenetic particle filtering algorithm based on the method described by Vaughan et al., 2019. For a given sampled tree and set of model parameters, this algorithm simulates an ensemble of possible histories, members of which are weighted by their agreement with the tree and parameters. One particular history is then sampled from this weighted distribution. By applying this process to a subsample of the states visited by the MCMC chain, we are left with a set of trajectories drawn from the posterior distribution of the population history, given the genetic data.

Because the output files can be rather large, trajectory sampling is disabled by default. Like the stochastic mapping algorithm, it can be enabled by expanding the relevant logger input editor - "trajLogger" in this case - and ensuring "Enable Logger" is checked.

trajectoryEnable.png

Once enabled, you can adjust the logging frequency by changing the "Log Every" value. The default is 1000, meaning a trajectory will be sampled and recorded every 1000 steps. This is probably too frequent, as trajectory files are large and you probably don't need more than a few hundred samples to characterise the posterior. Thus I suggest you increase this value to at least "10000" for most analyses.

Further options to do with the particle filtering algorithm itself can be found by clicking the unlabeled button next to the "typedTraj.t:" label, which brings up a dialog box as shown below:

trajectoryOptions.png

Here we can set the "N Particles" option, which determines the size of the weighted trajectory ensemble used to sample the final trajectory. The default is 1000. Increasing this number will improve the accuracy of the calculation, but requires more computation. Ideally, one should always compare results generated using different trajectory ensemble sizes in order to test for convergence.

The "Resamp Thresh" value indicates the effective size (relative to the total number of particles) of the particle ensemble that will trigger a resample, during the sequential monte carlo algorithm. The default value of 0.5 is appropriate for most situations.

The "Use Tau Leaping" checkbox specifies whether to use exact Gillespie simulation or the approximate finite time step tau-leaping algorithm to simulate trajectories. The approximation is significantly faster and should generally be used, but if you run into numerical stability issues you might consider unchecking this option.

The "Min Leap Count" and "Epsilon" values affect the tau leaping algorithm, when used. "Min Leap Count" specifies the minimum number of stochastic integration steps which will be used to simulate a given trajectory, while "Epsilon" determines how conservative the algorithm should be when selecting step sizes, with smaller values yielding shorter step sizes. It's almost always okay to leave these parameters alone.

4.2.1. Plotting sampled trajectories

For our example here, we use the trajectory file generated by the analysis from the tutorial section with trajectory logging enabled as above.

Trajectory data generated by BDMM-Prime is stored as a simple table in a plain-text file with the extension ".traj". The format of this file is described in detail in the appendix.

The easiest way to plot trajectories sampled using BDMM-Prime is using the tidyverse suite of R packages. To install these in R (if you haven't already), simply issue the following command from the R command line:

install.packages("tidyverse")

(You'll only have to do that once.) With the packages installed you'll need to make them available in your current R session by issuing the command:

library(tidyverse)

The first step is to load the trajectory file using the read_tsv command.

traj <- read_tsv("h3n2.h3n2_2deme.traj")

When doing this it is a good idea to omit some of the early trajectories to account for burn-in. (Examining the parameter log file is one way to get an idea of how much to remove.) As described in the trajectory file format section, the table has several columns, one of which is "Sample", indicating the MCMC iteration to which a particular trajectory belongs. We can use this column to remove the early trajectories.

For example, the following removes trajectories belonging to the first 10% of samples, assuming $10^6$ samples overall:

traj <- traj %>% filter(Sample>=100000)

Another column in the table is named "variable", and is used to indicate the type of information a given row provides about the trajectory. To plot population sizes, we are only interested in rows where the variable has the value "N", which means that the row contains information about a population size. Thus, we can further filter our table to include only these rows:

traj <- traj %>% filter(variable=="N")

To actually plot the distribution, we use the fact that each row now contains an "age" relative to the end of the process, a "type" indicating the type of the population, and a "value" indicating the size of the population immediately before that age.

ggplot(traj, aes(age, value, col=factor(type), group=interaction(type, Sample))) +
  geom_line(alpha=0.2) + xlim(0,10) +
  ylab("Population size")

Here we've chosen to simply plot all of the sampled trajectories together, with each trajectory coloured by its type. Note the group aesthetic, which tells ggplot that rows sharing the same Sample and type values belong to the same trajectory line. The use of factor() is important, as without it ggplot would treat the type index as a continuous variable rather than a discrete one. The xlim(0,10) command is included to focus on the time domain covered by the majority of the trajectories.

trajectories.png

Notice that the types are indicated using an integer rather than the original label. As described in the section on start type prior probabilities, these integers are simply the indices of the type names when sorted lexicographically (generalized alphabetically). Thus in this case, 0 stands for Hong Kong and 1 stands for New Zealand.

While plotting all of the trajectories in this way can be interesting, it's often more useful to display precise credible intervals for the population sizes through time, similar to the approach often used to visualize the results of Bayesian skyline plot analyses.

To do this, we first define a vector of ages which will span the time domain of our plot. In our case, we'll use 1001 values spread evenly between ages 0 and 10:

ages <- seq(0,10,length.out=1001)

Then, we need to determine the population size in each trajectory at every one of these ages. We can do this by making use of the approx() function, which is a base R procedure for interpolating a function sampled at discrete time points.

gridded_traj <- traj %>%
  group_by(Sample, type) %>%
  reframe(N=approx(age, value, ages, method="constant", f=0, yright=0)$y,
          age=ages)

We can then summarize the distribution at each time point by computing the 50%, 2.5%, and 97.5% quantiles, corresponding respectively to the median and the lower and upper bounds of the 95% central posterior density (CPD).

gridded_traj <- gridded_traj %>%
  group_by(type,age) %>%
  summarize(low=quantile(N,0.025),
            high=quantile(N,0.975),
            med=quantile(N,0.5))

These values can be nicely plotted using geom_ribbon() to display the CPD boundaries and a geom_line() to display the medians.

ggplot(gridded_traj,
       aes(age, med, col=factor(type), fill=factor(type), ymin=low, ymax=high)) +
  geom_ribbon(alpha=0.5) +
  geom_line() +
  ylab("Population size")

This yields the following graphic:

trajectories_CI.png

This can be further improved by including actual calendar times on the time axis. To do this we use our knowledge that the final included sample is from September 1, 2005, meaning that this is the date relative to which our ages are computed. (Our ages are the number of years prior to this date.)

ggplot(gridded_traj %>%
         mutate(Date=ymd("2005/09/01")-age*365) %>%
         mutate(Location=recode(factor(type),
                                "0"="Hong Kong",
                                "1"="New Zealand")),
       aes(Date, med, col=Location, fill=Location, ymin=low, ymax=high)) +
  geom_ribbon(alpha=0.5) +
  geom_line() + ylab("Population size") +
  scale_x_date(date_breaks="1 year", date_labels="%Y")

trajectories_CI_years.png

As an aside, you may have noticed that the population sizes inferred in these example analyses are extremely small, given that these are intended to reflect global Influenza H3N2 dynamics with sampled sequences drawn from different countries across several years. Recall however that both $R_e$ and the become uninfectious rate $\delta$ were assumed constant in the tutorial. By fitting such a simple constant-rate birth-death model to this dataset, our parameters are forced to represent "effective" rates and "effective" population sizes which best fit the birth-death dynamics on this scale.

While these effects are made very obvious by the population trajectory posteriors, they are no less present when performing birth-death analyses without trajectory sampling. It is therefore always important to keep the possibility of these kinds of model-fitting effects in mind when interpreting birth-death results and phylodynamic inference results in general, even when applied at smaller scales.

5. Appendices

5.1. XML reference

While the BDMM-Prime BEAUti interface is quite flexible, there are vastly more possibilities for model configurations than those which are accessible via the graphical interface. In this section we detail the important aspects of the BEAST 2 XML model description language as it pertains to BDMM-Prime.

5.1.1. Model definition

5.1.1.1. The Birth-Death-Migration Distribution

The principal object contained in every BDMM-Prime analysis is the bdmmprime.distribution.BirthDeathMigrationDistribution object representing the probability of a tree under a given birth-death model.

In the XML, the element defining this object usually appears as the value of a distribution input to a higher-level CompoundDistribution object, in the following form:

<distribution spec="bdmmprime.distribution.BirthDeathMigrationDistribution">
  <!-- Inputs -->
</distribution>

The main inputs defined for the BirthDeathMigrationDistribution are as follows:

parameterization (Parameterization, required)
This input expects an object of the abstract type Parameterization to specify the birth-death model parameters. Existing Parameterization objects are given in the next section.
startTypePriorProbs (Function, required for multiple types)
This is an object of type Function (e.g. a RealParameter) whose dimension is the number of types and whose values represent the prior probability of the initial individual in the process belonging to each possible type.
typeTraitSet (TraitSet)
This is a TraitSet specifying the type associated with each Taxon.
typeLabel (String)
If provided, this is a String specifying the node metadata key from which the type information should be read.
conditionOnSurvival (Boolean, default true)
This indicates whether the tree prior should be conditioned on the presence of at least one sample.
conditionOnRoot (Boolean, default false)
This indicates whether the tree prior should be conditioned on the time and presence of the root event. Note: If enabled, all processLength inputs must also be provided with the tree instead of an origin parameter.
finalSampleOffset (Function, default 0)
If provided, this determines the time between the final sample and the end of the birth-death-sampling process. This input is important if you want the model to be able to include the absence of samples in this period as information for inferring the various rates. It is also important when jointly estimating tip dates. (See the tip date operators described in the operator section below.)

Besides these, the following inputs are also defined. These are primarily used for debugging and/or tuning the calculation performance:

useAnalyticalSingleTypeSolution (Boolean, default true)
This indicates whether to use the fast analytical solution for the tree prior in the case of single-type analyses. Ideally this should always be left on.
relTolerance (Double, default 1e-7)
Relative tolerance for numerical integration algorithm.
absTolerance (Double, default 1e-100)
Absolute tolerance for numerical integration algorithm.
parallelize (Boolean, default true)
Indicates whether to spawn threads to parallelize the calculation of subtree likelihoods.
parallelizationFactor (Double, default 0.1)
The minimum relative size (number of nodes) the smallest of a pair of subtrees must have in order for their calculations to be carried out in separate threads.
savePartialLikelihoodsTofile (String)
If provided, the name of a file to which a tree annotated with partial likelihoods will be written. This is useful for debugging the cause of chain initialization failures.
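
Putting these inputs together, a tree prior element has roughly the following shape. This is only a sketch: the ids are illustrative, the tree is assumed to be supplied via a tree attribute (BEAUti wires this up automatically), and the parameterization contents are elided (see the following sections).

<distribution id="BDMMPrime" spec="bdmmprime.distribution.BirthDeathMigrationDistribution"
              tree="@Tree.t:example" conditionOnSurvival="true">
  <parameterization spec="bdmmprime.parameterization.EpiParameterization">
    <!-- Parameterization inputs: skyline parameters, rho sampling, process length -->
  </parameterization>
  <!-- Two-type model: equal prior probability for each possible start type -->
  <startTypePriorProbs spec="beast.base.inference.parameter.RealParameter" value="0.5 0.5"/>
  <typeTraitSet idref="typeTraitSet"/>
</distribution>
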
5.1.1.2. Parameterizations

The Parameterization is an abstract class of objects tasked with connecting parameter values to the internal representation of birth-death models used by BirthDeathMigrationDistribution. There are several concrete classes of this type, as detailed below.

Note: Besides the parameterization-specific inputs, each parameterization object takes the following inputs:

processLength (Function or Tree, required)
This should be the duration of the process, either from the origin or from the tree root, depending on whether conditionOnRoot is set to true. If the former, supplying an "origin" parameter is usual. If the latter, the tree object itself should be supplied.
typeSet (TypeSet)
A TypeSet defining the types present in the model.

5.1.1.2.1. CanonicalParameterization

<parameterization spec="bdmmprime.parameterization.CanonicalParameterization">
<!-- Inputs -->
</parameterization>

The CanonicalParameterization parameterization type defines a parameterization in terms of the basic rates for birth, death, sampling and migration events. These are defined by the following inputs:

birthRate (SkylineVectorParameter, required)
The birth rate skyline.
deathRate (SkylineVectorParameter, required)
The death rate skyline.
samplingRate (SkylineVectorParameter, required)
The sampling rate skyline.
removalProb (SkylineVectorParameter, required)
A skyline parameter representing the probability of removal on sampling.
rhoSampling (TimedParameter)
Specifies the times and probabilities of contemporaneous sampling events.
migrationRate (SkylineMatrixParameter)
The migration rate matrix skyline.
birthRateAmongDemes (SkylineMatrixParameter)
The rate matrix skyline describing the rates of births producing children with types differing from the parent.

5.1.1.2.2. EpiParameterization

<parameterization spec="bdmmprime.parameterization.EpiParameterization">
<!-- Inputs -->
</parameterization>

This defines a parameterization in terms of rates useful for epidemiological applications:

Re (SkylineVectorParameter, required)
The effective reproduction number skyline.
becomeUninfectiousRate (SkylineVectorParameter, required)
The become uninfectious rate skyline.
samplingProportion (SkylineVectorParameter, required)
The sampling proportion skyline.
removalProb (SkylineVectorParameter, required)
A skyline parameter representing the probability of removal on sampling.
rhoSampling (TimedParameter)
Specifies the times and probabilities of contemporaneous sampling events.
migrationRate (SkylineMatrixParameter)
The migration rate matrix skyline.
ReAmongDemes (SkylineMatrixParameter)
A matrix skyline representing a generalized effective reproduction number, parameterizing the average number of birth events into a new compartment over the infectious period of an individual in the source compartment.
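
As a concrete illustration, the following sketch shows a two-type epidemiological parameterization in which none of the parameters change over time. The ids (typeSet, origin) and the numerical values are purely illustrative, and in a real analysis most of the nested parameters would be estimated:

<parameterization spec="bdmmprime.parameterization.EpiParameterization" typeSet="@typeSet">
  <processLength idref="origin"/>
  <Re spec="bdmmprime.parameterization.SkylineVectorParameter" typeSet="@typeSet">
    <!-- One value per type (two types), constant through time -->
    <skylineValues spec="beast.base.inference.parameter.RealParameter" value="1.2 0.9"/>
  </Re>
  <becomeUninfectiousRate spec="bdmmprime.parameterization.SkylineVectorParameter" typeSet="@typeSet">
    <!-- A single "scalar" value shared by both types -->
    <skylineValues spec="beast.base.inference.parameter.RealParameter" value="52.0"/>
  </becomeUninfectiousRate>
  <samplingProportion spec="bdmmprime.parameterization.SkylineVectorParameter" typeSet="@typeSet">
    <skylineValues spec="beast.base.inference.parameter.RealParameter" value="0.1"/>
  </samplingProportion>
  <removalProb spec="bdmmprime.parameterization.SkylineVectorParameter" typeSet="@typeSet">
    <skylineValues spec="beast.base.inference.parameter.RealParameter" value="1.0"/>
  </removalProb>
  <migrationRate spec="bdmmprime.parameterization.SkylineMatrixParameter" typeSet="@typeSet">
    <!-- A single value applied uniformly to both migration directions -->
    <skylineValues spec="beast.base.inference.parameter.RealParameter" value="0.1"/>
  </migrationRate>
</parameterization>
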
5.1.1.2.3. FBDParameterization

<parameterization spec="bdmmprime.parameterization.FBDParameterization">
<!-- Inputs -->
</parameterization>

This defines a parameterization in terms of rates useful for macroevolutionary applications. It uses the Fossilized Birth-Death process (Heath et al., 2014) parameterization.

diversificationRate (SkylineVectorParameter, required)
The diversification rate (difference of speciation and extinction rates) skyline.
turnover (SkylineVectorParameter, required)
The turnover (ratio of extinction rate to birth rate) skyline.
samplingProportion (SkylineVectorParameter, required)
The sampling proportion skyline.
removalProb (SkylineVectorParameter, required)
A skyline parameter representing the probability of removal on sampling.
rhoSampling (TimedParameter)
Specifies the times and probabilities of contemporaneous sampling events.
migrationRate (SkylineMatrixParameter)
The migration rate matrix skyline.
diversificationRateAmongDemes (SkylineMatrixParameter)
A matrix skyline representing the difference between the speciation rate into a new compartment and the extinction rate in that same compartment.

Warning: The FBD parameterization as usually set up in BDMM-Prime produces MCMC chains that do not cross the $\lambda=\mu$ boundary. That is, one always has either $\lambda>\mu$ (positive diversification rate) or $\lambda<\mu$ (negative diversification rate), as determined by the initial parameter values. The reason for this is that proposing a sign change in the diversification rate requires a coincident inversion of the turnover, which is unsupported by the default operators. Thus, if you don't want to make this strong a priori assumption about the sign of the diversification rate, it's best to use the CanonicalParameterization instead.

5.1.1.3. Skyline parameters

For the most part, parameters supplied to the parameterizations listed above are "skyline parameters", which define matrix- or vector-valued piecewise constant functions of time, where the vector values associate rates with specific types, and the matrix values associate rates with specific type pairs. Additionally, the rhoSampling input requires a TimedParameter which specifies a vector of values together with a set of individual times associated with these values.

The following sections describe the complete set of inputs to each of these parameter types.

5.1.1.3.1. SkylineVectorParameter
<parameter spec="bdmm.prime.parameterization.SkylineVectorParameter">
<!-- Inputs -->
</parameter>

The skyline vector parameter specifies a piecewise-constant function with, in general, type-specific values. It takes the following inputs:

skylineValues (Function, required)
changeTimes (Function)

Each of this pair of inputs takes a Function (such as a RealParameter). Together they specify the epoch values and change times of the skyline parameter, and how those values are distributed across types. The dimension $N$ of the skylineValues Function, in combination with the dimension $K$ of the changeTimes Function (0 if this input is not set), determines how the values are distributed among types and times. (A complete example is given at the end of this section.)

Defining $D$ to be the number of distinct types in the model, there are two main possibilities. Firstly, if $N=K+1$ (i.e. the number of epochs implied by changeTimes), the elements $v_1,v_2,\ldots,v_N$ are assumed to apply uniformly to all types in each epoch. That is:

Epoch   Type 0   Type 1   Type 2
0       v1       v1       v1
1       v2       v2       v2
2       v3       v3       v3

Alternatively, if $N=(K+1)\times D$, the first $D$ skylineValues elements represent the values associated with types in the first epoch, the next $D$ represent the values for types in the second epoch, and so on:

Epoch   Type 0   Type 1   Type 2
0       v1       v2       v3
1       v4       v5       v6
2       v7       v8       v9
typeSet (TypeSet)
A reference to the TypeSet for the model. While optional, it's worthwhile supplying this input if multiple types are in play to allow for additional checking and more easily-interpretable logging.
timesAreAges (Boolean, default false)
When set to true, this input causes the change times supplied to changeTimes to be interpreted as ages relative to the end of the birth-death process. (I.e. the time of the final sample + the value of finalSampleOffset.) This also affects the ordering of the assignment to epochs. (E.g. value $v_1$ will be associated with the most recent epoch when timesAreAges is true, but the oldest epoch when it is false.)
timesAreRelative (Boolean, default false)
When set to true, times or ages are treated as relative fractions of the process length.
processLength (Function)
This is the object specifying the process length. It is only necessary when one or both of timesAreAges and timesAreRelative are set to true.
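
For example, the fragment below (with purely illustrative ids and values) defines a two-type skyline with one change time, so that $N=(K+1)\times D=4$: the first pair of values applies per type to the oldest epoch, and the second pair to the epoch following the change time.

<parameter spec="bdmmprime.parameterization.SkylineVectorParameter"
           id="samplingProportionSV" typeSet="@typeSet">
  <changeTimes spec="beast.base.inference.parameter.RealParameter" value="1.0"/>
  <skylineValues spec="beast.base.inference.parameter.RealParameter" value="0.0 0.0 0.2 0.1"/>
</parameter>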
5.1.1.3.2. SkylineMatrixParameter
<parameter spec="bdmm.prime.parameterization.SkylineMatrixParameter">
<!-- Inputs -->
</parameter>

The skyline matrix parameter specifies a piecewise-constant function with, in general, type pair-specific values. It takes the following inputs:

skylineValues (Function, required)
changeTimes (Function)

Just as for SkylineVectorParameter, each of this pair of inputs takes a Function (such as a RealParameter). Together they specify the epoch values and change times of the skyline matrix parameter, and how those values are distributed across source/destination type pairs. The dimension $N$ of the skylineValues Function, in combination with the dimension $K$ of the changeTimes Function (0 if this input is not set), determines how the values are distributed among type pairs and times.

Defining $D$ to be the number of distinct types in the model, there are two main possibilities. Firstly, if $N=K+1$ (i.e. the number of epochs implied by changeTimes), the elements $v_1,v_2,\ldots,v_N$ are assumed to apply uniformly to all type pairs in each epoch. That is:

Epoch                 To Type 0   To Type 1   To Type 2
0     From Type 0     .           v1          v1
      From Type 1     v1          .           v1
      From Type 2     v1          v1          .
1     From Type 0     .           v2          v2
      From Type 1     v2          .           v2
      From Type 2     v2          v2          .

Alternatively, if $N=(K+1)\times D(D-1)$, the first $D(D-1)$ skylineValues elements represent the values associated with type pairs in the first epoch, the next $D(D-1)$ represent the values for type pairs in the second epoch, and so on:

Epoch                 To Type 0   To Type 1   To Type 2
0     From Type 0     .           v1          v2
      From Type 1     v3          .           v4
      From Type 2     v5          v6          .
1     From Type 0     .           v7          v8
      From Type 1     v9          .           v10
      From Type 2     v11         v12         .
typeSet (TypeSet)
A reference to the TypeSet for the model. While optional, it's worthwhile supplying this input to allow for additional checking and more easily-interpretable logging.
timesAreAges (Boolean, default false)
When set to true, this input causes the change times supplied to changeTimes to be interpreted as ages relative to the end of the birth-death process. (I.e. the time of the final sample + the value of finalSampleOffset.) This also affects the ordering of the assignment to epochs. (E.g. value $v_1$ will be associated with the most recent epoch when timesAreAges is true, but the oldest epoch when it is false.)
timesAreRelative (Boolean, default false)
When set to true, times or ages are treated as relative fractions of the process length.
processLength (Function)
This is the object specifying the process length. It is only necessary when one or both of timesAreAges and timesAreRelative are set to true.
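
As an illustration (ids and values are hypothetical), the following defines a constant migration rate matrix for a three-type model: with no change times, $N=D(D-1)=6$, and the values are read row by row with the diagonal skipped, as in the tables above.

<parameter spec="bdmmprime.parameterization.SkylineMatrixParameter"
           id="migrationRateSM" typeSet="@typeSet">
  <skylineValues spec="beast.base.inference.parameter.RealParameter"
                 value="0.1 0.2 0.1 0.3 0.2 0.3"/>
</parameter>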
5.1.1.3.3. TimedParameter
<parameter spec="bdmm.prime.parameterization.TimedParameter">
<!-- Inputs -->
</parameter>

The timed parameter specifies a sequence of values, in general type-dependent, associated with one or more times. It is used to define the timing and probabilities of "rho sampling" events. These are the possible inputs:

values (Function)
times (Function)

Each of this pair of inputs takes a Function (such as a RealParameter). Together they specify the timing and sampling probabilities associated with zero or more rho sampling events. The dimension $N$ of the values Function, in combination with the dimension $K$ of the times Function, determines how the probabilities provided are distributed among types and times.

Defining $D$ to be the number of distinct types in the model, there are two main possibilities. Firstly, if $N=K$ (i.e. the number of events implied by times), the elements $v_1,v_2,\ldots,v_N$ are assumed to apply uniformly to all types at each event. That is:

Event   Type 0   Type 1   Type 2
0       v1       v1       v1
1       v2       v2       v2
2       v3       v3       v3

Alternatively, if $N=K\times D$, the first $D$ elements represent the values associated with types for the first event, the next $D$ represent the values for types for the second event, and so on:

Event   Type 0   Type 1   Type 2
0       v1       v2       v3
1       v4       v5       v6
2       v7       v8       v9
typeSet (TypeSet)
A reference to the TypeSet for the model. While optional, it's worthwhile supplying this input if multiple types are in play to allow for additional checking and more easily-interpretable logging.
timesAreAges (Boolean, default false)
When set to true, this input causes the times supplied to times to be interpreted as ages relative to the end of the birth-death process. (I.e. the time of the final sample + the value of finalSampleOffset.) This also affects the ordering of the assignment to events. (E.g. value $v_1$ will be associated with the most recent event when timesAreAges is true, but the oldest event when it is false.)
timesAreRelative (Boolean, default false)
When set to true, times or ages are treated as relative fractions of the process length.
processLength (Function)
This is the object specifying the process length. It is only necessary when one or both of timesAreAges and timesAreRelative are set to true.
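
For instance, a single contemporaneous sampling event at the end of the process, applying the same sampling probability to all types ($N=K=1$), might be specified as in the following sketch. The ids and values are illustrative, and @processLength is assumed to refer to the object defining the process length.

<parameter spec="bdmmprime.parameterization.TimedParameter"
           id="rhoSamplingTP" typeSet="@typeSet"
           timesAreAges="true" processLength="@processLength">
  <times spec="beast.base.inference.parameter.RealParameter" value="0.0"/>
  <values spec="beast.base.inference.parameter.RealParameter" value="0.5"/>
</parameter>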
5.1.1.4. The TypeSet
<typeSet spec="bdmm.prime.parameterization.TypeSet">
  <!-- Inputs -->
</typeSet>

This object defines a set of types for a multi-type birth-death model.

value (String)
If provided, specifies a comma-delimited list of types. For instance, the string "New Zealand, Australia, Samoa" specifies three types with those names.
typeTraitSet (TraitSet)
If provided, specifies a trait set whose unique values will be used to assemble the possible types.
unknownTypeIdentifier (String, default "?")
If provided, taxa whose type trait value matches this string will be regarded as having an "unknown" type.

It is possible to specify values for both the value and typeTraitSet inputs. In this case, all types present in value will be included in the type set regardless of whether or not they are present in the trait set. (This can be useful for including unsampled demes in an analysis.)
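
For example, the following sketch (the trait set id is hypothetical) assembles types from a location trait defined elsewhere while explicitly adding an unsampled deme:

<typeSet spec="bdmmprime.parameterization.TypeSet" id="typeSet"
         value="Samoa" typeTraitSet="@typeTraitSet"/>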

5.1.2. Start type posterior logging

<log spec="bdmmprime.util.StartTypePosteriorProbsLogger">
  <!-- Inputs -->
</log>

This logger logs the posterior distribution of the start type conditional on the current parameter values. This posterior is computed by the BirthDeathMigrationDistribution while evaluating the tree prior. When averaged over the entire MCMC chain, these probabilities yield the complete marginal posterior distribution over the starting state conditional on the data.

This object takes the following input:

bdmmTreePrior (BirthDeathMigrationDistribution)
Reference to the BirthDeathMigrationDistribution used in the analysis.
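
A minimal usage sketch, assuming the BirthDeathMigrationDistribution tree prior has the id "BDMMPrior", is to add the following element to the trace logger:

<log spec="bdmmprime.util.StartTypePosteriorProbsLogger" bdmmTreePrior="@BDMMPrior"/>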

5.1.3. Stochastic mapping

<log spec="bdmmprime.mapping.TypeMappedTree">
  <!-- Inputs -->
</log>

Stochastic mapping is handled in BDMM-Prime by a TypeMappedTree object, which is intended to be logged. The stochastic mapping calculation is performed on the input state only when the output has to be appended to the log file.

The important inputs are:

parameterization (Parameterization)
bdmmDistrib (BirthDeathMigrationDistribution)
One of these two inputs must be provided, as these provide the parameters needed to sample the ancestral type distribution.
untypedTree (Tree, required)
The tree on which the type changes are to be stochastically mapped.
typeTraitSet (TraitSet, required)
Trait set specifying the taxon types.
startTypePriorProbs (RealParameter, required)
The prior probabilities for the initial individual type.
typeLabel (String, required)
The label to be used for traits in the generated metadata. In other words, this is the key used in the key-value pairs which will be used to annotate the tree with the ancestral type information.
finalSampleOffset (Function, default 0)
If provided, the difference in time between the final sample and the end of the BD process.

Because the TypeMappedTree class extends the Tree class it accepts many inputs that are not particularly relevant. These have been omitted.
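
The following sketch shows a typical configuration, with the TypeMappedTree written to its own tree log. The ids referenced here (BDMMPrior, tree, typeTraitSet, typePriorProbs, finalSampleOffset) are assumed to be defined elsewhere in the XML, and the log file name and logging interval are illustrative.

<logger fileName="output.typed.trees" logEvery="10000">
  <log spec="bdmmprime.mapping.TypeMappedTree" id="typeMappedTree"
       bdmmDistrib="@BDMMPrior"
       untypedTree="@tree"
       typeTraitSet="@typeTraitSet"
       startTypePriorProbs="@typePriorProbs"
       typeLabel="type"
       finalSampleOffset="@finalSampleOffset"/>
</logger>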

5.1.3.1. Type mapped tree loggers
<log spec="bdmmprime.mapping.TypeNodeTreeLogger">
  <!-- Inputs -->
</log>

This logger takes a type-mapped tree as input and logs a filtered view of that tree which omits type changes along edges but retains the type at each internal node. Such "node-typed" trees are important because distributions of them can be summarized by the TreeAnnotator software.

The main inputs for the TypedNodeTreeLogger are:

typedTree (Tree, required)
Typed tree whose node types to log.
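
A minimal sketch, assuming the TypeMappedTree above has the id "typeMappedTree", is to wrap this logger in its own tree log:

<logger fileName="output.typed.node.trees" logEvery="10000">
  <log spec="bdmmprime.mapping.TypedNodeTreeLogger" typedTree="@typeMappedTree"/>
</logger>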
<log spec="bdmmprime.mapping.TypeTreeStatsLogger">
  <!-- Inputs -->
</log>

This logger is intended for inclusion in the main parameter ("trace") log, as it computes a number of potentially-interesting summary statistics from each type-mapped tree, including

  • the total edge length associated with each type, and
  • the total number of changes between each pair of types.

Its primary inputs are:

typedTree (Tree, required)
Typed tree whose statistics to log.
typeSet (TypeSet, required)
Type set specifying model types. (This is used to label statistics with meaningful type labels rather than simply type indices.)
typeLabel (String, required)
The type label which was used to store type information on the tree. (Use the same label supplied to TypeMappedTree.)
includeRootEdge (Boolean, default false)
If true, include root edge in summary statistics calculations.
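
For instance, to add these statistics to the trace log one might include an element like the following, again assuming the ids "typeMappedTree" and "typeSet" are defined elsewhere:

<log spec="bdmmprime.mapping.TypeTreeStatsLogger"
     typedTree="@typeMappedTree" typeSet="@typeSet" typeLabel="type"/>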

5.1.4. Trajectory sampling

<logger fileName="output.traj" logEvery="10000">
  <log spec="bdmmprime.trajectories.SampledTrajctory">
  </log>
</logger>

Trajectory sampling is handled in BDMM-Prime by a SampledTrajectory object, which is intended to be logged. This implements a particle filtering calculation which is performed only when the output has to be appended to the log file.

Important: The SampledTrajectory must be the only input passed to its parent logger. Broken output files will result if this rule is violated!

Objects of type SampledTrajectory accept the following inputs:

parameterization (Parameterization)
bdmmDistrib (BirthDeathMigrationDistribution)
One of these two inputs must be provided, as these provide the parameters needed to sample the population dynamics.
typeMappedTree (Tree, required)
An edge-typed tree. Ideally a reference to a TypeMappedTree.
typeLabel (String, required)
The type label which was used to store type information on the tree.
finalSampleOffset (Function, default 0)
If provided, the difference in time between the final sample and the end of the BD process.
nParticles (Integer, default 1000)
Size of the particle ensemble used in particle filtering. (Larger values give more accurate results, but make the calculation more computationally expensive.)
resampThresh (Double, default 0.5)
Relative effective size of the weighted particle ensemble at which to trigger a resampling step.
useTauLeaping (Boolean, default false)
If true, use the faster but approximate tau-leaping algorithm to simulate population histories. Otherwise, use the slower but statistically correct Gillespie algorithm.
minLeapCount (Integer, default 100)
When tau leaping is enabled, at least this many steps are used to simulate each population trajectory.
epsilon (Double, default 0.03)
Controls the adaptive time step used in the tau leaping algorithm, with lower values causing the step size proposals to be more conservative (smaller steps). For details, refer to Cao et al., 2006.
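
Putting this together, a typical trajectory logger might look like the following sketch, where the ids BDMMPrior, typeMappedTree and finalSampleOffset are assumed to be defined elsewhere and the particle count is illustrative:

<logger fileName="output.traj" logEvery="10000">
  <log spec="bdmmprime.trajectories.SampledTrajectory"
       bdmmDistrib="@BDMMPrior"
       typeMappedTree="@typeMappedTree"
       typeLabel="type"
       finalSampleOffset="@finalSampleOffset"
       nParticles="2000"/>
</logger>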

5.1.5. Special priors and operators

5.1.5.1. Easy parameter linking

When dealing with multi-type birth-death skyline models, one frequently needs to operate on, and place priors on, skyline values in a way that preserves the identity of some elements but not others. For instance, when a single skyline parameter is used to define sampling rates for several different types, one might also want to add change times in order to disable sampling in each type at a different time. Since change times affect all types in the skyline simultaneously, this introduces more elements in the parameter supplied to the skylineValues input than there are degrees of freedom.

The two objects below are intended to address exactly this situation. Firstly, we have the following prior distribution:

<distribution spec="bdmmprime.util.priors.SmartZeroExcludingPrior">
  <!-- Inputs -->
</distribution>

This prior is intended as a drop-in replacement for the standard "beast.base.inference.distribution.Prior" used to define parametric priors on RealParameters. Indeed, it takes almost the same inputs:

x (Function, required)
Parameter to which prior is applied.
distr (ParametricDistribution, required)
Specifies which parametric distribution to apply, e.g. LogNormal, Exponential, etc.
classesToExclude (Function)
Elements possessing any of these values will be excluded from the prior calculation.

Unlike the standard Prior distribution though, SmartZeroExcludingPrior does the following:

  1. It excludes any elements having the value 0 from the prior calculation. For rates and probabilities, zero-valued parameters are almost never estimated in BEAST, and the presence of fixed zeros in parameters otherwise causes issues when applying prior distributions (e.g. LogNormal) which have no support at zero.
  2. It counts sets of identical element values (e.g. 1.2 and 1.2) only once, treating them as a single degree of freedom in the prior calculation.
<distribution spec="bdmmprime.util.operators.SmartScaleOperator">
  <!-- Inputs -->
</distribution>

This operator provides the other half of the mechanism for linking elements. It behaves as a drop-in replacement for the "beast.base.evolution.operator.ScaleOperator", at least as it applies to a RealParameter. It takes the following inputs:

parameter ([RealParameter], at least one required)
One or more parameters to operate on.
scaleFactor (Double, default 0.75)
The scaling operation will scale some parameter elements by a value chosen uniformly at random from the interval [scaleFactor,1/scaleFactor].
optimize (Boolean, default true)
Flag to indicate that the scaleFactor is to be adapted during the MCMC to achieve a good acceptance rate.
classesToExclude (Function)
Elements possessing any of these values will not be operated on.
weight (Double, required)
Operator weight. (Unnormalized probability with which this operator is chosen.)

The principal difference between this operator and the standard ScaleOperator is that, like the SmartZeroExcludingPrior, the SmartScaleOperator groups identical values together and treats them as a single degree of freedom. That is, if two elements of the input parameter possess the identical value 1.237, those two elements will be operated on as if they were a single value, ensuring that this equivalence is preserved.

When relying on these classes to provide linkage, take extreme care that you do not include any additional operators on the target parameter which might break the equivalence between linked elements, or indeed introduce equivalence between initially distinct elements. (Due to the finite precision involved in floating point arithmetic, it is technically possible for such equivalence to arise by chance, although it is extremely unlikely.)
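
The sketch below makes the sampling-proportion scenario above concrete: a two-type, two-epoch skyline parameter in which sampling is switched off (fixed at zero) in the oldest epoch and shares a single estimated value across both types in the more recent epoch. All ids, values and the choice of Beta prior are illustrative, and in a real analysis the parameter would be declared inside the state element.

<!-- Oldest epoch: (0, 0) = sampling disabled; recent epoch: (0.1, 0.1) = one shared proportion -->
<parameter spec="beast.base.inference.parameter.RealParameter"
           id="samplingProportion" value="0.0 0.0 0.1 0.1"/>

<!-- Zeros are excluded automatically; the two identical 0.1 entries count once -->
<distribution spec="bdmmprime.util.priors.SmartZeroExcludingPrior" x="@samplingProportion">
  <distr spec="beast.base.inference.distribution.Beta">
    <alpha spec="beast.base.inference.parameter.RealParameter" value="1.0"/>
    <beta spec="beast.base.inference.parameter.RealParameter" value="4.0"/>
  </distr>
</distribution>

<!-- Scale the linked non-zero entries together; leave the fixed zeros untouched -->
<operator spec="bdmmprime.util.operators.SmartScaleOperator"
          parameter="@samplingProportion" weight="1.0">
  <classesToExclude spec="beast.base.inference.parameter.RealParameter" value="0.0"/>
</operator>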

5.1.5.2. Consistent tip date estimation

BDMM-Prime's inclusion of the finalSampleOffset input to the BirthDeathMigrationDistribution allows all tip dates on a given tree to be estimated jointly.

This is ordinarily problematic because node times, and the times of other events such as rate shifts, are in BEAST 2 always expressed as ages relative to the most recent sample. Thus, if the time of the most recent sample is unknown (i.e. estimated), or if an earlier tip has the potential to become the most recent tip, any point in time measured relative to the most recent tip becomes uncertain.

The inclusion of the finalSampleOffset has the potential to solve this problem, since it allows us to measure all event times relative to an external fixed calendar time (coinciding with the end of the birth-death-sampling process). We are then free to operate on all tip ages, provided the operators we use also correctly update this offset.

5.1.5.2.1. Operators

BDMMPrime provides a pair of operators which act on both tip dates and the finalSampleOffset:

<operator spec="bdmmprime.util.operators.TipDateOperator">
  <!-- Inputs -->
</operator>

This takes the following inputs:

tree (Tree, required)
The tree on whose tips to operate.
finalSampleOffset (RealParameter, required)
The corresponding finalSampleOffset parameter to jointly update as required.
taxonSet (TaxonSet)
Taxon set containing the taxa specifying the leaves on which to operate. (Default is to operate on all leaves.)
windowSize (Double, default 1.0)
The width of the time window from which new node times are drawn uniformly.
weight (Double, required)
Operator weight. (Unnormalized probability with which this operator is chosen.)
<operator spec="bdmmprime.util.operators.TipEdgeScaleOperator">
  <!-- Inputs -->
</operator>

This takes the following inputs:

tree (Tree, required)
The tree on whose tips to operate.
finalSampleOffset (RealParameter, required)
The corresponding finalSampleOffset parameter to jointly update as required.
taxonSet (TaxonSet)
Taxon set containing the taxa specifying the leaves on which to operate. (Default is to operate on all leaves.)
scaleFactor (Double, default 0.8)
Edge length above tip is scaled by a value chosen uniformly at random from the interval [scaleFactor, 1/scaleFactor].
weight (Double, required)
Operator weight. (Unnormalized probability with which this operator is chosen.)
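
A minimal sketch of the pair, assuming the tree and finalSampleOffset parameter have the ids "tree" and "finalSampleOffset" (the window size and operator weights are illustrative):

<operator spec="bdmmprime.util.operators.TipDateOperator"
          tree="@tree" finalSampleOffset="@finalSampleOffset"
          windowSize="2.0" weight="3.0"/>

<operator spec="bdmmprime.util.operators.TipEdgeScaleOperator"
          tree="@tree" finalSampleOffset="@finalSampleOffset"
          weight="3.0"/>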
5.1.5.2.2. Prior

In addition to the operators, the following is useful for specifying a large number of tip date priors simultaneously using a pair of TraitSets:

<distribution spec="bdmmprime.util.priors.TipDatePrior">
  <!-- Inputs -->
</distribution>

It takes the following inputs:

tree (Tree, required)
Tree whose tip dates we wish to place a prior on.
finalSampleOffset (Function, required)
Final sample offset parameter.
earlierBound (TraitSet, required)
Trait set specifying earliest possible time for each tip in ages relative to the end of the birth-death-sampling process.
laterBound (TraitSet, required)
Trait set specifying latest possible time for each tip in ages relative to the end of the birth-death-sampling process.
endOfSamplingTime (Function)
Time of the point when sampling ends. (Necessary only when upper and lower bounds are given forwards in time.)
reportBoundsViolations (Boolean, default false)
Causes the distribution to report which taxon exceeded its bounds in each case. Useful for diagnosing initialization problems, but should be disabled during the analysis itself.
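
The following sketch assumes that two TraitSets with ids "earlierBoundTrait" and "laterBoundTrait", specifying the per-taxon age bounds, are defined elsewhere, along with the "tree" and "finalSampleOffset" objects:

<distribution spec="bdmmprime.util.priors.TipDatePrior"
              tree="@tree"
              finalSampleOffset="@finalSampleOffset"
              earlierBound="@earlierBoundTrait"
              laterBound="@laterBoundTrait"/>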
5.1.5.2.3. Initialization

Finally, BDMM-Prime also provides the following class for initializing tip dates (useful when there are many constraints):

<init spec="bdmmprime.util.operators.TipDateInitializer">
  <!-- Inputs -->
</init>

It takes the following inputs:

tree (Tree, required)
Tree whose tip dates to initialize.
finalSampleOffset (RealParameter, required)
Final sample offset parameter.
tipDatesTrait (TraitSet, required)
Trait set specifying initial tip dates.
endOfSamplingTime (Function)
Time of the point when sampling ends. (Necessary only when upper and lower bounds are given forwards in time.)
adjustInternalNodes (Boolean, default false)
If true, original divergence times of input tree are preserved.
internalNodeSpacing (Function, default 1.0)
Default spacing used when adjusting internal node times.

5.2. Trajectory file format

Trajectory files are tab-separated value (TSV) files containing a table with 7 columns and a large number of rows. The first row is a header row. The file can be read into R using the readr::read_tsv function.

The following is the start of a table generated by enabling trajectory sampling in the H3N2 example used in the tutorial:

Sample   t        age    variable   type   type2   value
10000    0        8.67   N          0      NA      1
10000    0        8.67   N          1      NA      0
10000    0        8.67   O          NA     NA      NA
10000    0.0865   8.58   N          0      NA      2
10000    0.0865   8.58   N          1      NA      0
10000    0.0865   8.58   B          0      0       1

The sample column on the left indicates the MCMC iteration the trajectory belongs to. (All rows with the same sample number correspond to the same MCMC iteration, i.e. the same population trajectory.) Each row reports either a population size or an event at a particular time, as indicated by the entry in the "variable" column.

Population sizes are reported by rows with the "variable" column set to "N". Each such row reports the population size associated with the type index given in the "type" column, with the size itself in the "value" column. The size applies at and immediately after the time given in the "t" column (equivalently, immediately before the age given in the "age" column).

Events are reported by rows with the following characters in the "variable" column:

B
This indicates that the event at the given time/age was a birth event. The type column indicates the type of the individual that gave birth. The child type is given by the value in type2.
D
This indicates that the event at the given time/age was a death event. The type column indicates the type of the individual that died. The value of type2 is always NA in this case.
S
This indicates that the event at the given time/age was a sampling event. The type column indicates the type of the sampled individual. The value of type2 is always NA in this case.
M
This indicates that the event at the given time/age was a migration event. The type column indicates the original type of the migrating individual. The value of type2 is the destination type.
O
This is a dummy event that indicates the start of the stochastic process. The type and value columns for this event are meaningless. There should be exactly one such event per trajectory.

For events, the "value" column indicates the multiplicity of the event. I.e., how many times it occurred. (The approximate tau leaping simulation algorithm allows more than one event to occur in a single simulation step.)

6. License

BDMM-Prime is free software. It is made available under the terms of the GNU General Public Licence version 3. Unlike many software licenses, this is designed to give you, the user, the freedom to use the software as you wish - including making and publishing modifications - provided you also extend the same freedom to other users.

Author: Tim Vaughan

Created: 2025-04-01 Tue 11:29