Understanding BEAST 2 XML

Tim Vaughan

cEvo (Stadler Group), D-BSSE, ETH Zurich

Taming the BEAST 2018

What is BEAST 2 XML?

Precise description of the:
1. Data (alignment, sampling times, ...)
2. Model (substitution model, tree prior, ...)
3. Parameter priors
which form the basis of a BEAST analysis.
Usually produced by BEAUti.
Read by BEAST when the analysis is run. (Nothing else matters!)
Important component of the BEAST 2 strategy for making results reproducible.

Why should I learn about BEAST 2 XML?

Every BEAST 2 analysis that can be executed can be described using BEAST 2 XML.
- Many of these analyses cannot be set up using BEAUti, despite the best efforts of model developers.
- Being able to even slightly modify the XML that BEAUti produces dramatically increases the number of analyses one can do.
Several analysis types are most easily achieved by modifying the XML:
- User-defined starting trees (can now be achieved via BEAUti)
- Fixing parameters or the trees in the analysis.
- Linking certain models (beyond the usual substitution, clock, or tree priors)

BEAST 2 XML: A first look


    <beast version='2.0' namespace='...'>
      <run spec="MCMC" id="mcmc" chainLength="1000000000">
        <state>
          <stateNode spec='RealParameter' id="hky.kappa">1.0</stateNode>
          <stateNode spec='RealParameter' id="popSize">1.0</stateNode>
          <stateNode spec='ClusterTree' id='tree' clusterType='upgma'>
            <taxa idref='alignment'/>
          </stateNode>
        </state>

        <distribution spec="CompoundDistribution" id="posterior">
          <distribution id="coalescent" spec="Coalescent">
            <treeIntervals spec='TreeIntervals' id='TreeIntervals'>
              <tree idref="tree"/>
            </treeIntervals>
            <populationModel spec="ConstantPopulation" id='ConstantPopulation'>
              <popSize idref="popSize"/>
            </populationModel>
          </distribution>

          <distribution spec='TreeLikelihood' id="treeLikelihood">
            <data id="alignment" dataType="nucleotide">
              <sequence taxon="human"  value="AGAAAT..."/>
              <sequence taxon="chimp"  value="AGAAAT..."/>
              <sequence taxon="bonobo" value="AGAAAT..."/>
            </data>

            <tree idref="tree"/>
            <siteModel spec='SiteModel' id="siteModel">
              <substModel spec='HKY' id="hky">
                <kappa idref='hky.kappa'/>
                <frequencies id='freqs' spec='Frequencies'>
                  <data idref='alignment'/>
                </frequencies>
              </substModel>
            </siteModel>
          </distribution>
        </distribution>

        <operator id='kappaScaler' spec='ScaleOperator' scaleFactor="0.5" weight="1">
          <parameter idref="hky.kappa"/>
        </operator>
        <operator id='popSizeScaler' spec='ScaleOperator' scaleFactor="0.5" weight="1">
          <parameter idref="popSize"/>
        </operator>
        <operator spec='SubtreeSlide' weight="5" gaussian="true" size="1.0">
          <tree idref="tree"/>
        </operator>

        <logger logEvery="10000" fileName="$(filebase).log">
          <log idref="hky.kappa"/>
        </logger>
        <logger logEvery="20000" fileName="$(filebase).trees">
          <log idref="tree"/>
        </logger>
        <logger logEvery="10000">
          ...
        </logger>
      </run>

    </beast>

What is XML?

XML is a standard way of representing hierarchically structured data.
XML files are plain text files containing XML-formatted data.

XML file components:


      <tag attributeOne="Attribute value"
           attributeTwo="Another attribute value">
        <childTag childAttribute="10"> </childTag>
        <childTag childAttribute="20"/>
        <!-- This is a "comment" -->
      </tag>

There is a lot that one can say about XML, but this is all we need!

Editing XML

XML files are plain text (i.e. a string of printable characters).
Word processing tools such as MS Word are not suitable for these files.
You can edit them using Notepad (Windows) or TextEdit (macOS), but only use these in an emergency.
Ideally one should use a programmers' text editor that supports syntax highlighting and checking:

- Atom (atom.io)
- Sublime Text (sublimetext.com)
- GNU Emacs (gnu.org/s/emacs)
- Vim (vim.org)

A simple BEAST 2 model

The BEAST 2 Object Model

A BEAST 2 Object

Object class (type) and input names are usually written in "CamelCase".
Input names usually have lower case first letter.
Object class names almost always have upper case first letter.
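
As a small illustration of these conventions, consider the following fragment taken from the first-look XML. The tag names (populationModel, popSize) are input names and start with a lower case letter, while the value of the spec attribute (ConstantPopulation) is an object class name and starts with an upper case letter.

      <!-- input name: populationModel; object class: ConstantPopulation -->
      <populationModel spec="ConstantPopulation" id='ConstantPopulation'>
        <!-- input name: popSize (refers to an object defined elsewhere) -->
        <popSize idref="popSize"/>
      </populationModel>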

Class (type) Hierarchy

XML Object representation


      <parentInput spec="BEASTObject">
        <input1 ...> </input1>
        <input2 ...> </input2>
        ...
      </parentInput>

Inputs with simple types

Some BEASTObjects take inputs with primitive types such as strings (i.e. some text), boolean values (true/false) or numbers.
These values are specified using attributes.


      <mcmc spec="MCMC"
            chainLength="10000000"
            storeEvery="10000"
            sampleFromPrior="true">
        ...
      </mcmc>

Connecting BEAST Objects


      <parentInput spec="Normal">
        <mean spec="RealParameter" value="1.0" lower="0.0" upper="5.0"/>
        <sigma spec="RealParameter" value="0.5" lower="0.0" upper="5.0"/>
      </parentInput>

Object IDs

Referencing IDs using tags


      <state>
        <stateNode spec="RealParameter" value="1.0" id="clockRate"/>
      </state>

      ...
      <logger logEvery="1000" fileName="logfile.log">
        <log idref="clockRate"/>
      </logger>

... using attributes


      <state>
        <stateNode spec="RealParameter" value="1.0" id="clockRate"/>
      </state>

      ...
      <operator spec="ScaleOperator" parameter="@clockRate" weight="1"/>

Loose Ends

Alternative but equivalent forms of BEASTObject representation:


          <parentInput spec="RealParameter" value="1.0"/>


          <parameter name="parentInput" value="1.0"/>

BEASTObject class names are in general prefixed by their location in a hierarchy of Java packages, e.g. beast.core.parameter.RealParameter.
- The namespace attribute of the <beast> tag specifies a list of these locations, and the classes at these locations don't need the prefix.
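
A minimal sketch of what this means in practice, using the RealParameter class named above. The two elements are alternatives, not parts of one file, and the id and value are only illustrative. It assumes the namespace attribute lists package names separated by colons, which is how BEAUti-generated files usually look.

      <!-- No matching namespace entry: the full class name is required. -->
      <stateNode spec="beast.core.parameter.RealParameter" id="popSize" value="1.0"/>

      <!-- With beast.core.parameter among the namespace entries of <beast>,
           the short class name is enough. -->
      <stateNode spec="RealParameter" id="popSize" value="1.0"/>

If BEAST reports that it cannot find a class, writing out the full package prefix in spec is usually the quickest thing to try.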

The BEAST XML again


    <beast version='2.0' namespace='...'>
      <run spec="MCMC" id="mcmc" chainLength="1000000000">
        <state>
          <stateNode spec='RealParameter' id="hky.kappa">1.0</stateNode>
          <stateNode spec='RealParameter' id="popSize">1.0</stateNode>
          <stateNode spec='ClusterTree' id='tree' clusterType='upgma'>
            <taxa idref='alignment'/>
          </stateNode>
        </state>

        <distribution spec="CompoundDistribution" id="posterior">
          <distribution id="coalescent" spec="Coalescent">
            <treeIntervals spec='TreeIntervals' id='TreeIntervals'>
              <tree idref="tree"/>
            </treeIntervals>
            <populationModel spec="ConstantPopulation" id='ConstantPopulation'>
              <popSize idref="popSize"/>
            </populationModel>
          </distribution>

          <distribution spec='TreeLikelihood' id="treeLikelihood">
            <data id="alignment" dataType="nucleotide">
              <sequence taxon="human"  value="AGAAAT..."/>
              <sequence taxon="chimp"  value="AGAAAT..."/>
              <sequence taxon="bonobo" value="AGAAAT..."/>
            </data>

            <tree idref="tree"/>
            <siteModel spec='SiteModel' id="siteModel">
              <substModel spec='HKY' id="hky">
                <kappa idref='hky.kappa'/>
                <frequencies id='freqs' spec='Frequencies'>
                  <data idref='alignment'/>
                </frequencies>
              </substModel>
            </siteModel>
          </distribution>
        </distribution>

        <operator id='kappaScaler' spec='ScaleOperator' scaleFactor="0.5" weight="1">
          <parameter idref="hky.kappa"/>
        </operator>
        <operator id='popSizeScaler' spec='ScaleOperator' scaleFactor="0.5" weight="1">
          <parameter idref="popSize"/>
        </operator>
        <operator spec='SubtreeSlide' weight="5" gaussian="true" size="1.0">
          <tree idref="tree"/>
        </operator>

        <logger logEvery="10000" fileName="$(filebase).log">
          <log idref="hky.kappa"/>
        </logger>
        <logger logEvery="20000" fileName="$(filebase).trees">
          <log idref="tree"/>
        </logger>
        <logger logEvery="10000">
          ...
        </logger>
      </run>

    </beast>

Questions?

XML Hacking Tutorial

Structure of Tutorial

This tutorial covers a short series of small XML-hacking exercises:

1. Modifying MCMC parameters.
2. Setting an initial tree.
3. Fixing the tree in an analysis.
4. Linking a model in an analysis.

You will be given approximately 15 minutes for each exercise, after which I will present the solution.

Exercise 1:
Modifying MCMC parameters

Basic parameters of the MCMC and loggers are easy to modify directly in the XML.

Download the primate-mtDNA.nex alignment from tgvaughan.github.io/TTB_Lectures/XML/downloads/primate-mtDNA.nex
Load into BEAUti, link tree models and save.
Open the resulting XML in your text editor of choice.
Modify the chain length to be $10^6$ iterations.
Make the algorithm store the state every $10^4$ iterations.
Change the sampling frequency of the trace and tree logs to one sample per $10^4$ iterations.

Exercise 1 Solution
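
One possible shape of the edited file is sketched below. It is not the exact BEAUti output: the id values and elided parts (...) are placeholders, and only the chainLength, storeEvery and logEvery attributes matter here. In BEAUti-generated files storeEvery may appear on the <state> element rather than on <run>, so change it wherever your file has it.

      <run id="mcmc" spec="MCMC" chainLength="1000000">
        <!-- storeEvery: how often the state is written to disk -->
        <state id="state" storeEvery="10000">
          ...
        </state>
        ...
        <!-- trace log: one sample per 10000 iterations -->
        <logger id="tracelog" fileName="..." logEvery="10000">
          ...
        </logger>
        <!-- tree log: one sample per 10000 iterations -->
        <logger id="treelog" fileName="..." logEvery="10000">
          ...
        </logger>
      </run>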

Exercise 2:
Setting an initial tree

By default BEAST initializes the tree randomly in a way that is consistent with any topological constraints, but occasionally we need to provide a better starting tree by hand.

Download the primate tree file (Newick format) from tgvaughan.github.io/TTB_Lectures/XML/downloads/primate_tree.newick
Open the XML from the previous exercise.
Remove the <init>...</init> element.
Find and modify the <tree> element within the <state> according to the instructions at beast2.org/fix-starting-tree to set the starting tree to the tree found in the tree file.
Run the analysis and verify (using icytree.org or FigTree) that the initial tree matches the one we chose.

Exercise 2 Solution
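
A hedged sketch of what the replaced tree element might look like, assuming the TreeParser-based recipe described at beast2.org/fix-starting-tree. The id, the taxa reference and the Newick string are placeholders: keep the tree id that BEAUti generated, point taxa at the id of your alignment, and paste in the contents of the downloaded tree file. If this sketch and the linked instructions disagree, follow the linked instructions.

      <state id="state" storeEvery="10000">
        <!-- Replaces the original tree element; the id must stay the same
             so that existing idrefs to the tree still resolve. -->
        <stateNode spec="beast.util.TreeParser" id="Tree.t:tree"
                   IsLabelledNewick="true" adjustTipHeights="false"
                   taxa="@primate-mtDNA"
                   newick="((chimp:0.1,bonobo:0.1):0.1,human:0.2);"/>
        ...
      </state>

The Newick string above is made up for illustration and is not the tree in primate_tree.newick.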

Exercise 3:
Fixing the tree in an analysis

Occasionally we know (or think we know!) the tree topology perfectly. We can easily prevent the analysis from sampling distinct tree topologies:

Open the XML from the previous exercise.
Locate the section of the file defining operators.
Remove/"comment out" the SubtreeSlide, Exchange and WilsonBalding operators.
Run the analysis and verify (using icytree.org or FigTree) that the topology is now fixed during the analysis.

(The results of the previous exercise are available at http://tgvaughan.github.io/TTB_Lectures/XML/downloads/primates_ex2.xml)

Exercise 3 Solution
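
A sketch of the "commenting out" approach is shown below. The operator ids, weights and attribute details are illustrative guesses at what BEAUti generates and will differ in your file. The point is only that everything between the comment delimiters is ignored by BEAST.

      <!-- Operators that change the topology are disabled, so the topology stays fixed:
      <operator id="SubtreeSlide.t:tree" spec="SubtreeSlide" tree="@Tree.t:tree" weight="15.0"/>
      <operator id="narrow.t:tree" spec="Exchange" tree="@Tree.t:tree" weight="15.0"/>
      <operator id="wide.t:tree" spec="Exchange" isNarrow="false" tree="@Tree.t:tree" weight="3.0"/>
      <operator id="WilsonBalding.t:tree" spec="WilsonBalding" tree="@Tree.t:tree" weight="3.0"/>
      -->

XML comments cannot be nested, so if the operators already contain comments, delete the operators instead or comment them out one at a time.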

Exercise 4:
Linking models

Allowing different subsets of the data to share a model is often useful and sometimes necessary. Here we experiment with linking clock models, but the approach translates to other models (e.g. migration).

Open the XML from the previous exercise.
Remove the clockRate.c:2ndpos and clockRate.c:3rdpos parameters from <state>, along with the priors, operators and log entries that refer to them.
Replace the <branchRateModel> elements in treeLikelihood.2ndpos and treeLikelihood.3rdpos with idrefs pointing to the <branchRateModel> in treeLikelihood.1stpos.
Run the analysis and view the output in Tracer to ensure that there is now a single clock rate for the coding sites.
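
The replacement step can be sketched as follows. The clock model id (StrictClock.c:1stpos) and class name are assumptions about what BEAUti generates, and the elided parts (...) stand for whatever your file contains. Use the id that actually appears on the <branchRateModel> inside treeLikelihood.1stpos.

      <distribution id="treeLikelihood.1stpos" spec="TreeLikelihood" ...>
        ...
        <!-- The clock model is defined once, here. -->
        <branchRateModel id="StrictClock.c:1stpos" spec="StrictClockModel" ...>
          ...
        </branchRateModel>
      </distribution>

      <distribution id="treeLikelihood.2ndpos" spec="TreeLikelihood" ...>
        ...
        <!-- Was a separate clock model; now refers to the one above. -->
        <branchRateModel idref="StrictClock.c:1stpos"/>
      </distribution>

The same reference can also be written in attribute form on the treeLikelihood.2ndpos element, as branchRateModel="@StrictClock.c:1stpos". treeLikelihood.3rdpos is handled in the same way.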