Newick Utilities Tutorial
Contents
Introduction
1 General Remarks
1.1 Help
1.2 Input
1.2.1 Multiple Input Trees
1.3 Output
1.4 Options
2 Simple Tasks
2.1 Displaying Trees
2.1.1 As Text
2.1.2 As SVG
2.1.3 Ornaments
2.1.4 Options not Covered
2.2 Displaying Tree Properties
2.3 Rooting and Rerooting
2.3.1 Rerooting on the ingroup
2.3.2 Rooting without an (explicit) outgroup
2.3.3 Derooting
2.4 Extracting Subtrees
2.4.1 Monophyly
2.4.2 Context
2.4.3 Siblings
2.4.4 Limits
2.5 Computing Bootstrap Support
2.6 Retaining Topology
2.7 Extracting Distances
2.7.1 Selection
2.7.2 Methods
2.7.3 Alternative formats
2.8 Finding subtrees in other trees
2.9 Renaming nodes
2.9.1 Breaking the 10-character limit in PHYLIP alignments
2.9.2 Higher-rank trees
2.10 Condensing
2.11 Pruning
2.11.1 Keeping selected Nodes
2.12 Trimming trees
2.13 Indenting
2.14 Extracting Labels
2.14.1 Counting Leaves in a Tree
2.15 Ordering Nodes
2.15.1 Variants
2.16 Converting Node Ages to Durations
2.17 Generating Random Trees
2.18 Stream editing
2.18.1 Background
2.18.2 The General Idea
2.18.3 nw_luaed
2.18.4 nw_ed
2.18.5 nw_sched
3 Advanced Tasks
3.1 Checking Consistency with other Data
3.1.1 By condensing
3.2 Finding Recombination Breakpoints
3.3 Number of nodes vs. Tree Depth
4 Python Bindings
4.1 API Documentation
D Changes
D.1 Version 1.6
D.2 Version 1.5
Introduction
The Newick Utilities are a set of UNIX (including Mac OS X) and UNIX-like (Cygwin)
shell programs for working with phylogenetic trees. Their main features are:
• they require no user interaction¹
• they can work on any number of trees at a time
• they perform well with large trees
• they are implemented as filters²
• they read and write text
They are not tools for making phylogenies. Rather, they are for processing existing ones,
by which I mean manipulating the tree or extracting information from it: rerooting,
simplifying, extracting subtrees, printing branch lengths and distances, etc. See Table 1;
a glance at the table of contents should also give you an idea.
Each of the programs performs one task (with some variants). For example, here
is how you would reroot a series of phylograms contained in file mytrees.nw using
node Dmelano as the outgroup:
$ nw_reroot mytrees.nw Dmelano
Now, you might want to make cladograms from the rerooted trees. Program nw_topology
does the job, and since the utilities are filters, you can do it in a single command:
$ nw_reroot mytrees.nw Dmelano | nw_topology -
As you can see, it is straightforward to pipe Newick Utilities together, and of course
they can be mixed freely with any other shell program (see e.g. 2.14.1).
you already know when a command-line interface is better than an interactive interface.
2 In UNIX jargon, a filter is a program that reads input from standard input and writes output to standard
output.
Table 1: The programs of the Newick Utilities
Program       Function
nw_clade      Extracts subtrees specified by node labels
nw_condense   Condenses (simplifies) trees
nw_display    Shows trees as graphs (ASCII graphics or SVG)
nw_duration   Converts node ages into durations
nw_distance   Prints distances between nodes, in various ways
nw_ed         Stream editor (à la sed or awk); see also nw_luaed and nw_sched
nw_gen        Random tree generator
nw_indent     Shows Newick in indented form
nw_labels     Prints node labels
nw_luaed      Like nw_ed, but uses Lua
nw_match      Finds matches of a tree in another one
nw_order      Orders tree (preserving topology)
nw_prune      Removes branches based on labels
nw_rename     Changes node labels according to a mapping
nw_reroot     (Re)roots the tree
nw_sched      Like nw_luaed, but uses Scheme
nw_stats      Prints tree statistics and properties
nw_support    Computes bootstrap support of a tree given replicate trees
nw_topology   Alters branch properties, preserving topology
nw_trim       Trims a tree at a specified depth
Chapter 1
General Remarks
1.1 Help
All programs print a help message if passed option -h. Here are the first 20 lines of
nw_indent’s help:
$ nw_indent -h | head -20
Indents the Newick, making structure more clear.
Synopsis
--------
Input
-----
Argument is the name of a file that contains Newick trees, or '-' (in
which case trees are read from standard input).
Output
------
By default, prints the input tree, with each parenthesis and each leaf on a
line of its own, and indented a multiple of '  ' (two spaces) to reflect
structure. The default output is valid Newick.
The help page describes the program’s purpose, its input and output, and its options,
in a format reminiscent of UNIX manpages. It also shows a few examples. All
examples can be tried out using files in the data directory.
1.2 Input
Since the Newick Utilities are for working with trees, it should be no surprise that the
main input is a file containing trees. The trees must be in Newick format, which is
one of the most widely used tree formats. Its complete description can be found at
http://evolution.genetics.washington.edu/phylip/newicktree.html.
The input file is always the first argument to the program (after any options). It
may be a file stored in a filesystem, or standard input. In the latter case, the filename is
replaced by a '-' (dash):
$ nw_display mytrees.nw
is the same as
$ cat mytrees.nw | nw_display -
or
$ nw_display - < mytrees.nw
Of course the second (“dashed”) form is only really useful when chaining several
programs into pipelines.
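The dash convention is easy to mirror in your own scripts. Here is a minimal sketch in
Python, assuming a hypothetical helper named read_trees (it is not part of the Newick
Utilities) and assuming that no label contains a semicolon:

```python
import io
import sys

def read_trees(name, stdin=None):
    """Return the Newick trees in file `name`, or in standard input if `name` is '-'.

    `read_trees` is a hypothetical helper written for this illustration.
    """
    stream = (stdin or sys.stdin) if name == "-" else open(name)
    try:
        text = stream.read()
    finally:
        if name != "-":
            stream.close()
    # Every Newick tree ends with a semicolon; whitespace between trees is ignored.
    return [t.strip() + ";" for t in text.split(";") if t.strip()]

# Usage: feed two trees through a stand-in for standard input.
fake_stdin = io.StringIO("(A,B);\n((C,D),E);\n")
trees = read_trees("-", stdin=fake_stdin)
print(trees)
```

The `stdin` parameter only exists so that the example can be tried without a real pipe.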
1.3 Output
All output is printed on standard output (warnings and error messages go to standard
error). The output is either trees or information about trees. In the first case, the trees
are in Newick format, one per line. In the second case, the format depends on the
program, but it is always text (ASCII graphics, SVG, numeric data, textual data, etc.).
1.4 Options
Options change program behaviour and/or allow extra arguments to be passed. They
are all passed on the command line, before the mandatory argument(s), using a single
letter preceded by a dash, in the usual UNIX way. There are no mandatory control files,
although some tasks require additional files (e.g. 2.1.2). For example, we saw above
that nw_display produces graphs. By default the graph is ASCII graphics, but with
option -s, the program produces SVG:
$ nw_display -s sometree.nw
All options are described in the program’s help page (see 1.1).
1 well, attempted. . .
Chapter 2
Simple Tasks
The tasks shown in this chapter all involve a single Newick Utilities program (plus
possibly nw_display), so they can serve as an introduction to each individual program.
$ cat catarrhini
((((Gorilla:16,(Pan:10,Homo:10)Hominini:10)Homininae:15,Pongo:30)
Hominidae:15, Hylobates:20):10,(((Macaca:10,Papio:10):20,
Cercopithecus:10) Cercopithecinae:25,(Simias:10,Colobus:7)
Colobinae:5)Cercopithecidae:10);
So we want to make a graphical representation from it. This is the purpose of the
nw_display program.
2.1.1 As Text
At its simplest, nw_display just outputs a text graph. Here is the primates tree, shown
with nw_display:
$ nw_display catarrhini
+---------------+ Gorilla
|
+-------------+ Homininae---------+ Pan
| +---------+ Hominini
+--------------+ Hominidae +---------+ Homo
| |
+---------+ +----------------------------+ Pongo
| |
| +-------------------+ Hylobates
|
=| +---------+ Macaca
| +-------------------+
| +-----------------------+ Cercopithecinae +---------+ Papio
| | |
| | +---------+ Cercopithecus
+---------+ Cercopithecidae
| +---------+ Simias
+----+ Colobinae
+------+ Colobus
|-------------------|------------------|-------------------|-----
0 20 40 60
substitutions/site
That’s pretty low-tech compared to interactive, colorful graphical displays, but if you
use the shell a lot (like I do), you may find it useful.
You can use option -w to set the number of columns available for display (the default
is 80):
$ nw_display -w 60 catarrhini
+----------+ Gorilla
|
+---------+ Homininae---+ Pan
| +------+ Hominini
+---------+ Hominidae +------+ Homo
| |
+------+ +-------------------+ Pongo
| |
| +------------+ Hylobates
|
=| +------+ Macaca
| +------------+
| +----------------+ Cercopithecinae---+ Papio
| | |
| | +-----+ Cercopithecus
+------+ Cercopithecidae
| +------+ Simias
+--+ Colobinae
+----+ Colobus
|-------------|------------|-------------|---
0 20 40 60
substitutions/site
Scale Bar
If the tree is a phylogram, nw_display prints a scale bar. Its units can be specified
with option -u; the default is substitutions per site. To suppress the scale bar, pass the
-S switch. The scale bar can also “go backwards” (option -t), i.e. the scale bar’s zero
is aligned with the leaves and units increase towards the root. This is handy when the
units are ages, e.g. in millions of years ago, but it only makes much sense if the leaves
themselves are aligned. See 2.16 for an example.
+----------+ Gorilla
|
+-Homininae +------+ Pan
| +-Hominini
+-Hominidae +------+ Homo
| |
+------+ +-------------------+ Pongo
| |
| +------------+ Hylobates
|
=| +------+ Macaca
| +------------+
| +-Cercopithecinae+ +------+ Papio
| | |
| | +-----+ Cercopithecus
+-Cercopithecidae
| +------+ Simias
+-Colobinae
+----+ Colobus
|-------------|------------|-------------|---
0 20 40 60
substitutions/site
Now you can visualize the result using any SVG-enabled tool (all good Web browsers
can do it), or convert it to another format with, say, rsvg or Inkscape
(http://www.inkscape.org). The SVG produced by nw_display is designed to be easy to edit in
an interactive editor (Inkscape, Adobe Illustrator, etc.): for example, the tree edges are
in one group, and the text in another, so it is easy to change the line width of the edges,
or the font family of the text (you can also do this from nw_display using a CSS map,
see 2.1.2).
The following PDF image was produced like this:
will not name here - either it was painfully slow, or it simply crashed, or else the output was unreadable,
incomplete, or otherwise unsuitable.
[Figure: the catarrhini tree rendered as SVG, with branch lengths, inner node labels, and a scale bar in substitutions/site.]
All SVG images shown in this tutorial were processed in the same way. In the rest of the
document we will usually skip the redirection into an SVG file and omit the SVG-to-PDF
conversion step.
Text-mode options
Options for ASCII trees also work for SVG: -S suppresses the scale bar², and -u specifies
its units; -w governs the tree’s width, except that for SVG the value is in pixels instead
of columns; -I controls the placement of inner node labels.
Radial trees
You can make radial trees by passing the -r switch:
$ nw_display -sr -S -w 450 catarrhini
2 The positioning of the scale bar is a bit crude in SVG mode, especially for radial trees. This is mainly
because of the “SVG string length curse”, that is, the impossibility of finding out the length of a text string in
SVG. This means it is hard to ensure the scale bar will not overlap with a node label, unless one places it far
away in a corner, which is what I do for now. An improvement to this is on my TODO list.
[Figure: radial SVG rendering of the catarrhini tree.]
Using CSS
You can modify node style using CSS. This is done by specifying a CSS map, which is
just a text file that says which style should be applied to which node. If file css.map
contains the following
# Cercopithecinae in red
stroke:red Clade Macaca Cercopithecus
# Apes (Hominoidea) in pinkish
stroke:#fa7 C Homo Hylobates
# Colobus and Cercopithecus (individually) in green
stroke:green Individual Colobus Cercopithecus
# Hominines in thick blue
"stroke-width:2; stroke:blue" Clade Homo Pan
we can apply the style map to the tree above by passing -c, which takes the name of
the CSS file as argument:
[Figure: radial SVG rendering of the catarrhini tree, styled according to css.map.]
• The first element of the line is the style, and it is a snippet of CSS code.
• The second element states whether the following nodes are to be treated individually
or as a clade. It is either Clade or Individual (which can be abbreviated
to C or I, respectively).
• The remaining element(s) are node labels and specify the nodes to which the
style must be applied: if the second element was Clade, the program finds the
last common ancestor of the nodes and applies the style to that node and all
its descendants. If the second element was Individual, then the style is only
applied to the nodes themselves.
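To make the three-element structure concrete, here is a small sketch that reads such lines
with Python's shlex module. It is only an illustration of the syntax described above, not
nw_display's own parser; the function name parse_css_map is made up for the example.

```python
import shlex

def parse_css_map(text):
    """Parse CSS-map lines into (style, kind, labels) tuples.

    An illustrative reader, not nw_display's own parser.
    """
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # blank line or comment
            continue
        # shlex honours the double quotes around styles that contain spaces.
        style, kind, *labels = shlex.split(line)
        kind = {"C": "Clade", "I": "Individual"}.get(kind, kind)
        rules.append((style, kind, labels))
    return rules

css_map = '''\
# Apes (Hominoidea) in pinkish
stroke:#fa7 C Homo Hylobates
stroke:green Individual Colobus Cercopithecus
"stroke-width:2; stroke:blue" Clade Homo Pan
'''
rules = parse_css_map(css_map)
for rule in rules:
    print(rule)
```

Note that the quotes disappear during parsing, so the style that comes out is the bare CSS
snippet.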
In our example, css.map:
• The first line states that the style stroke:red must be applied to the Clade defined
by Macaca and Cercopithecus, which consists of these two nodes, their last
common ancestor Cercopithecinae, and all its other descendants (here, Papio and
the unlabeled ancestor of Macaca and Papio).
• Line 2 prescribes that style stroke:#fa7 (an SVG hexadecimal color specification)
must be applied to the clade defined by Homo and Hylobates, which consists
of these two nodes, their last common ancestor (unlabeled), and all its descendants
(that is, Homo, Pan, Gorilla, Pongo, and Hylobates, as well as the
inner nodes Hominini, Homininae and Hominidae).
• Line 3 instructs that the style stroke:green be applied individually to nodes
Colobus and Cercopithecus, and only these nodes - not to the clade that they
define.
• Line 4 says that style stroke-width:2; stroke:blue should be applied to
the clade defined by Homo and Pan - note that the quotes have been removed:
they are not part of the style, rather they allow us to improve readability by
adding some whitespace.
The style of an inner clade overrides that of an outer clade, e.g., although the Homo -
Pan clade is nested inside the Homo - Hylobates clade, it has its own style (blue, wide
lines) which overrides the containing clade’s style (pinkish with normal width). Likewise,
Individual overrides Clade, which is why Cercopithecus is green even
though it belongs to a “red” clade.
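The two precedence rules can be sketched as follows. This is my own simplified model, not
nw_display's code: it only looks at leaves, it replaces a style wholesale instead of merging
CSS properties, and it paints larger clades first so that nested clades overwrite them
(which works because two clades are either nested or disjoint).

```python
# The catarrhini tree as nested (label, children) tuples; "" marks unlabeled nodes.
TREE = ("", [
    ("", [
        ("Hominidae", [
            ("Homininae", [
                ("Gorilla", []),
                ("Hominini", [("Pan", []), ("Homo", [])]),
            ]),
            ("Pongo", []),
        ]),
        ("Hylobates", []),
    ]),
    ("Cercopithecidae", [
        ("Cercopithecinae", [
            ("", [("Macaca", []), ("Papio", [])]),
            ("Cercopithecus", []),
        ]),
        ("Colobinae", [("Simias", []), ("Colobus", [])]),
    ]),
])

RULES = [
    ("stroke:red", "Clade", ["Macaca", "Cercopithecus"]),
    ("stroke:#fa7", "Clade", ["Homo", "Hylobates"]),
    ("stroke:green", "Individual", ["Colobus", "Cercopithecus"]),
    ("stroke-width:2; stroke:blue", "Clade", ["Homo", "Pan"]),
]

def leaves(node):
    """Set of leaf labels under `node`."""
    label, children = node
    if not children:
        return {label}
    return set().union(*(leaves(child) for child in children))

def clade(node, wanted):
    """Leaves of the smallest subtree that contains all `wanted` leaves."""
    for child in node[1]:
        if wanted <= leaves(child):
            return clade(child, wanted)
    return leaves(node)

def leaf_styles(tree, rules):
    styles = {}
    clades = [(clade(tree, set(labels)), style)
              for style, kind, labels in rules if kind == "Clade"]
    # Largest clades first, so that inner (smaller) clades overwrite outer ones.
    for members, style in sorted(clades, key=lambda c: len(c[0]), reverse=True):
        for leaf in members:
            styles[leaf] = style
    # Individual rules come last, so they win over any clade style.
    for style, kind, labels in rules:
        if kind == "Individual":
            for leaf in labels:
                styles[leaf] = style
    return styles

styles = leaf_styles(TREE, RULES)
for leaf in sorted(styles):
    print(leaf, "->", styles[leaf])
```

Simias gets no entry at all, since no rule applies to it.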
Styles can also be applied to labels. Option -l (lowercase l) specifies the leaf label
style, option -i the inner node label style, and option -b the branch length style. For
example, the following tree, which was produced using defaults, could be improved a
bit:
[Figure: a vertebrate tree (Mesocricetus to Oncorhynchus) rendered with default styles, showing branch lengths and bootstrap values.]
Let’s remove the branch length labels, reduce the vertical spacing, reduce the size of
inner node labels (bootstrap values), and write the leaf labels in italics, using a font
with serifs:
[Figure: the same vertebrate tree with hidden branch lengths, tighter vertical spacing, smaller bootstrap values, and italic serif leaf labels.]
Still not perfect, but much better. Option -v specifies the vertical spacing, in pixels,
between two successive leaves (the default is 40). Option -b sets the style of branch labels,
option -l sets the style of leaf labels, and option -i sets the style of inner node labels.
Note that we did not discard the branch lengths (we could do this with nw_topology),
because doing so would reduce the tree to a cladogram. Instead, we set their CSS style
to opacity:0 (visibility:hidden also works).
What if we want to change the default style? Say we want the branches in blue, and
two pixels wide? That’s option -d:
[Figure: the same vertebrate tree, with the default branch style changed by option -d.]
2.1.3 Ornaments
Ornaments are arbitrary snippets of SVG code that are displayed at specified node
positions. Like CSS, this is done with a map. The ornament map has the same syntax
as the CSS map, except that you specify SVG elements rather than CSS styles. The
Individual keyword means that all nodes named on a given line sport the corresponding
ornament, while Clade means that only the clade’s LCA must be adorned.
The ornament is translated in such a way that its (0,0) coordinate corresponds to the
position of the node. In radial graphs, text ornaments are rotated like node labels.
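As a rough sketch of what this amounts to, the snippet can be wrapped in an SVG group that
is translated to the node's position. Everything below is assumed for the illustration: the
function adorn, the node coordinates, and the lca_of callback are hypothetical, and the
markup that nw_display actually emits may differ.

```python
def adorn(ornament, kind, labels, positions, lca_of):
    """Return one translated <g> element per node that receives the ornament.

    `positions` maps node labels to (x, y) coordinates and `lca_of` returns the
    label of the last common ancestor of a list of labels; both are supplied by
    the caller and are hypothetical here.
    """
    # Individual: every named node; Clade: only the clade's LCA.
    targets = labels if kind == "Individual" else [lca_of(labels)]
    elements = []
    for node in targets:
        x, y = positions[node]
        # The group's origin (0,0) now coincides with the node's position.
        elements.append(f"<g transform='translate({x:g},{y:g})'>{ornament}</g>")
    return elements

# Made-up coordinates, for illustration only.
positions = {"Homo": (120, 40), "Pan": (120, 80), "Hominini": (90, 60)}
individual = adorn("<circle r='5'/>", "Individual", ["Homo", "Pan"],
                   positions, lambda labels: "Hominini")
as_clade = adorn("<circle r='5'/>", "Clade", ["Homo", "Pan"],
                 positions, lambda labels: "Hominini")
print(individual)
print(as_clade)
```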
The following file, ornament.map, instructs nw_display to draw a red circle with a black
border on Homo and Pan, and a cyan circle with a blue border on the root of the Homo
- Hylobates clade. The Gorilla node will be annotated with the word “plains”, and
Pongo with “Borneo” in italics³. The SVG is enclosed in double quotes because it
contains spaces - note that single quotes are used for the values of XML attributes. The
ornament map is specified with option -o:
[Figure: radial SVG rendering of the catarrhini tree with the ornaments from ornament.map.]
libxml
If libxml is being used (see Appendix C), the handling of ornaments is more elaborate,
in that some kinds of elements undergo special treatment. Besides positioning the
ornament at the node’s location and orienting it along the parent edge, which occur for
all elements, the following occurs:
• <text> elements are nudged a few pixels from the parent edge, to make the
text more readable. They are also transformed so that the text is aligned with the
node’s position, on both sides of the tree (this involves an additional 180° rotation
on the left side of the tree).
• <image> elements are centered so that instead of having their top left corner at
the node’s position, they have the middle of the left side (this corresponds to
vertical centering on an orthogonal tree). On the left side of the tree, they are also
rotated and shifted so that they don’t show upside-down.
If applicable, these transforms must be applied to each element separately. This
means that the SVG snippet must be parsed (instead of just wrapped in a <g> element,
as is the case when libxml is not being used), and we use libxml’s XML parser.
In the following file, the orang-utan (Pongo) and hominines have several ornaments,
which are spaced out along the radial axis so that they don’t overlap. This is
done simply by using the x attribute of texts and rectangles, as well as the cx attribute
of circles and ellipses. Again, the node to be adorned lies at (0,0), x values lie on the
radial axis, and y values are perpendicular to the x axis.
"<circle style='fill:red;stroke:black' r='5'/>" I Homo Pan
"<circle style='fill:cyan;stroke:blue' r='5'/>" C Homo Hylobates
<text>plains</text> I Gorilla
<text>plains</text> I Macaca
"<text style='font-style:italic' x='-25'>Borneo</text><circle r='4' style='fill:blue;stroke:cyan'/><circle cx='-10' r='4' fill='green' stroke='lime'/><rect x='-25' y='-3' width='8' height='6' stroke='orange' fill='blue'/>" I Pongo
"<text style='font-face:italic' x='-12'>Africa</text><circle r='4' style='stroke:grey;fill:white'/><ellipse cx='-8' rx='4' ry='2' style='fill:magenta;stroke:purple'/>" I Homininae
This gives the following:
$ nw_display -sr -w 500 -o orn_xml.map catarrhini
[Figure: radial SVG rendering of the catarrhini tree with the ornaments from orn_xml.map; scale bar in substitutions/site.]
As hinted above, libxml also allows handling of images:
This gives the following (credits: all images are from Wikipedia):
[Figure: radial tree of the Ruminantia (Tragulidae, Antilocapridae, Giraffidae, Cervidae, Moschidae, Bovidae), with images as ornaments.]
$ head -5 b2r.map
in which the fill values are hexadecimal color codes along the gradient. Then:
[Figure: radial tree with leaves labeled HRV, HEV, COXA, COXB, ECHO and POLIO, annotated using b2r.map.]
As we can see, the high-GC sequences are all found in the same main clade.
But forest_svg isn’t valid SVG – it is a concatenation of many SVG documents. You
can just extract them into individual files with csplit:
$ csplit -sz -f tree_ -b '%02d.svg' forest_svg '/<?xml/' {*}
This will produce one SVG file per tree in forest.nw, named tree_01.svg, tree_02.svg,
etc.
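If csplit is not at hand, the same idea (cut the stream in front of every XML declaration)
can be sketched in a few lines of Python. The sample text below is made up, and the sketch
only prints the file names it would use instead of writing files.

```python
import re

def split_svg_documents(text):
    """Split a concatenation of SVG documents at each XML declaration."""
    # The zero-width lookahead keeps the '<?xml' marker at the start of each piece.
    pieces = re.split(r"(?=<\?xml)", text)
    return [piece for piece in pieces if piece.strip()]

# A made-up, two-document stand-in for forest_svg.
forest_svg = (
    "<?xml version='1.0'?>\n<svg><!-- tree 1 --></svg>\n"
    "<?xml version='1.0'?>\n<svg><!-- tree 2 --></svg>\n"
)
documents = split_svg_documents(forest_svg)
for number, document in enumerate(documents, start=1):
    filename = f"tree_{number:02d}.svg"
    print(filename, len(document), "characters")
```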
nw_display -s stores its arguments
When run in SVG mode, nw_display “remembers” its arguments, that is, it puts them
in an XML comment with the keyword arguments. It is then trivial to retrieve them:
This is handy when one wants to re-use a set of options on another tree, especially after
a while when one doesn’t remember the exact values of the parameters, or which was
the input tree, etc.
Type: Phylogram
#nodes: 19
#leaves: 10
#dichotomies: 9
#leaf labels: 10
#inner labels: 6
$ nw_stats -f l catarrhini
Phylogram 19 10 9 10 6
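To see where these numbers come from, here is a small sketch that recomputes the counts
from the Newick string itself. It is not nw_stats' implementation: the parser is a
minimal one that ignores quoted labels and comments, and I take a dichotomy to be a node
with exactly two children.

```python
import re

CATARRHINI = """((((Gorilla:16,(Pan:10,Homo:10)Hominini:10)Homininae:15,Pongo:30)
Hominidae:15, Hylobates:20):10,(((Macaca:10,Papio:10):20,
Cercopithecus:10) Cercopithecinae:25,(Simias:10,Colobus:7)
Colobinae:5)Cercopithecidae:10);"""

PUNCTUATION = {"(", ")", ",", ";"}

def parse(newick):
    """Parse one Newick string into nested (label, children) tuples; lengths are dropped."""
    tokens = re.findall(r"[(),;]|[^(),;]+", re.sub(r"\s+", "", newick))
    pos = 0

    def node():
        nonlocal pos
        children = []
        if tokens[pos] == "(":
            pos += 1                          # consume '('
            children.append(node())
            while tokens[pos] == ",":
                pos += 1                      # consume ','
                children.append(node())
            pos += 1                          # consume ')'
        label = ""
        if tokens[pos] not in PUNCTUATION:
            label = tokens[pos].split(":")[0]  # drop the ':length' part
            pos += 1
        return (label, children)

    return node()

def stats(tree):
    counts = {"nodes": 0, "leaves": 0, "dichotomies": 0,
              "leaf labels": 0, "inner labels": 0}
    stack = [tree]
    while stack:
        label, children = stack.pop()
        counts["nodes"] += 1
        if children:
            counts["dichotomies"] += len(children) == 2
            counts["inner labels"] += bool(label)
        else:
            counts["leaves"] += 1
            counts["leaf labels"] += bool(label)
        stack.extend(children)
    return counts

counts = stats(parse(CATARRHINI))
for name, value in counts.items():
    print(f"#{name}: {value}")
```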
[Figure: the catarrhini tree with the outgroup Cebus, before rerooting; scale bar in substitutions/site.]
[Figure: the same tree after rerooting on Cebus; scale bar in substitutions/site.]
Now the tree is correct. Note that the root is placed in the middle of the ingroup-
outgroup edge, and that the other branch lengths are conserved.
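The arithmetic behind this can be sketched on a toy tree. The code below is an illustration
under my own assumptions, not nw_reroot's algorithm: the tree is held as an undirected graph,
the outgroup must be a single leaf, and the node names (R, I1, I2, A, B, C, Out) are made up.

```python
def make_adj(edges):
    """Build {node: {neighbour: branch length}} from (a, b, length) triples."""
    adj = {}
    for a, b, length in edges:
        adj.setdefault(a, {})[b] = length
        adj.setdefault(b, {})[a] = length
    return adj

def reroot_on_leaf(adj, old_root, outgroup, new_root="root"):
    """Return a copy of the tree, rerooted in the middle of the edge above leaf `outgroup`."""
    adj = {node: dict(nbrs) for node, nbrs in adj.items()}
    # 1. An old root with two children stops being a branching point:
    #    splice it out and merge its two edges.
    if len(adj[old_root]) == 2:
        (a, len_a), (b, len_b) = adj.pop(old_root).items()
        del adj[a][old_root], adj[b][old_root]
        adj[a][b] = adj[b][a] = len_a + len_b
    # 2. Cut the edge between the outgroup leaf and its only neighbour...
    (neighbour, length), = adj[outgroup].items()
    del adj[outgroup][neighbour], adj[neighbour][outgroup]
    # 3. ...and insert the new root halfway along it.
    half = length / 2
    adj[new_root] = {outgroup: half, neighbour: half}
    adj[outgroup][new_root] = adj[neighbour][new_root] = half
    return adj

def distance(adj, start, goal, came_from=None):
    """Path length between two nodes (depth-first search on the tree)."""
    if start == goal:
        return 0
    for nbr, length in adj[start].items():
        if nbr != came_from:
            rest = distance(adj, nbr, goal, start)
            if rest is not None:
                return length + rest
    return None

before = make_adj([("R", "I1", 1), ("R", "I2", 3), ("I1", "A", 1),
                   ("I1", "B", 1), ("I2", "C", 2), ("I2", "Out", 6)])
after = reroot_on_leaf(before, "R", "Out")
print(after["root"])
print(distance(before, "A", "Out"), distance(after, "A", "Out"))
```

The outgroup's branch of length 6 is split into two halves of 3 on either side of the new
root, while the distance between any two leaves stays the same.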
The outgroup does not need to be a single leaf. The following tree is wrong for
the same reason as the one before, except that it has three New World monkey species
instead of one, and they appear as a clade (Platyrrhini) in the wrong place:
[Figure: the catarrhini tree with the Platyrrhini clade (Cebus, Saimiri, Allouatta) in the wrong place; scale bar in substitutions/site.]
We can correct this by specifying the New World monkey clade as outgroup:
[Figure: the same tree rerooted on the Platyrrhini clade; scale bar in substitutions/site.]
Note that I did not include all three New World monkeys, only Cebus and Allouatta.
This is because it is always possible to define a clade using only two leaves. The result
would be the same if I had included all three, though. You can use inner labels too, if
there are any:
[Figure: a vertebrate tree with bootstrap values, in which Danio, Tetraodon and Fugu do not form a clade.]
It is wrong because Danio (a ray-finned fish) is shown closer to tetrapods than to other
ray-finned fishes (Fugu and Tetraodon). So we should reroot it, specifying that the fishes
should form the outgroup. We could try this:
This fails because the last common ancestor of the two pufferfish is the root itself. The
workaround in this case is to try the ingroup. This is done by passing option -l (“lax”),
along with all species in the outgroup (this is because nw_reroot finds the ingroup by
complementing the outgroup):
[Figure: the same vertebrate tree rerooted with option -l; Danio, Tetraodon and Fugu now form the outgroup.]
To repeat: all outgroup labels were passed, not just the two normally needed to find
the last common ancestor – since, precisely, we can’t use the LCA.
[Figure: a vertebrate phylogram with bootstrap values; scale bar in substitutions/site, from 0 to 0.4.]
[Figure: a vertebrate phylogram with bootstrap values; scale bar in substitutions/site, from 0 to 0.2.]
2.3.3 Derooting
Some programs insist on being passed an unrooted tree, e.g. if you want to supply
your own tree to PhyML, it has to be “unrooted”. Strictly speaking, Newick trees are
always rooted, but there is a convention that if the root has three (or more) children,
the tree is considered unrooted. You can deroot a tree (in this limited sense) by passing
option -d to nw_reroot. Here is a rooted tree, fagales.nw:
[Figure: the rooted tree fagales.nw, with leaves Nothofagaceae, Fagaceae, Myricaceae, Juglandaceae, Rhoipteleaceae, Ticodendraceae, Betulaceae and Casuarinaceae.]
We can deroot it thus:
[Figure: the derooted tree, with the same leaves.]
This works as follows. The program finds which of the root’s two children (it is assumed
to have two, otherwise the tree is already considered unrooted in the above
sense) has more children than the other. This is considered the ingroup, and the LCA
of the ingroup is spliced out from the tree, attaching its children directly to the root.
In this example, the ingroup is the Fagaceae - Casuarinaceae clade, and the derooting
results in Fagaceae being directly attached to the root, as is its sister clade (Myricaceae
- Casuarinaceae).
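The procedure just described can be sketched as follows. This is an illustration, not
nw_reroot's code: branch lengths are ignored, and since the inner structure of the
Myricaceae - Casuarinaceae clade is not given here, it is flattened into a single node.

```python
def deroot(tree):
    """Splice out the root's ingroup child; trees are (label, children) tuples."""
    label, children = tree
    if len(children) != 2:
        return tree                 # already "unrooted" in the above sense
    # The child with more children is taken to be the ingroup.
    ingroup = max(children, key=lambda child: len(child[1]))
    if not ingroup[1]:
        return tree                 # both children are leaves: nothing to splice
    new_children = []
    for child in children:
        if child is ingroup:
            new_children.extend(child[1])   # attach the ingroup's children to the root
        else:
            new_children.append(child)
    return (label, new_children)

# The structure inside the last clade is a placeholder for this example.
fagales = ("", [
    ("Nothofagaceae", []),
    ("", [
        ("Fagaceae", []),
        ("", [("Myricaceae", []), ("Juglandaceae", []), ("Rhoipteleaceae", []),
              ("Ticodendraceae", []), ("Betulaceae", []), ("Casuarinaceae", [])]),
    ]),
])
unrooted = deroot(fagales)
print([child[0] or "(unlabeled clade)" for child in unrooted[1]])
```

The root now has three children, so the tree counts as unrooted, and derooting it a second
time changes nothing.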
[Figure: the catarrhini tree, with branch lengths and inner node labels.]
In the simplest case, the clade you want to extract has its own, unique label. This
is the case of Cercopithecidae, so you can extract the whole cercopithecid subtree
(Old World monkeys) using just that label:
[Figure: the Cercopithecidae subtree (Macaca, Papio, Cercopithecus, Simias, Colobus).]
Now suppose I want to extract the apes subtree. These are the Hominidae (“great
apes”) plus the gibbons (Hylobates). But the corresponding node is unlabeled in our
tree (it would be Hominoidea), so we need to specify (at least) two descendants:
[Figure: the apes subtree (Gorilla, Pan, Homo, Pongo, Hylobates).]
The descendants do not have to be leaves: here I use Hominidae, an inner node, and
the result is the same.
$ nw_clade catarrhini Hominidae Hylobates | nw_display -sS -
[Figure: the apes subtree (Gorilla, Pan, Homo, Pongo, Hylobates), identical to the previous one.]
2.4.1 Monophyly
You can check if a set of leaves⁵ form a monophyletic group by passing option -m:
nw_clade will report the subtree only if the LCA has no descendant leaf other than
those specified. For example, we can ask if the African apes (humans, chimp, gorilla)
form a monophyletic group:
$ nw_clade -m catarrhini Homo Gorilla Pan | nw_display -sS -v 30 -
[Figure: the Homininae subtree (Gorilla, Pan, Homo).]
Yes, they do – it’s subfamily Homininae. On the other hand, the Asian apes (orangutan
and gibbon) do not:
5 In future versions I may extend this to inner nodes.
$ nw_clade -m catarrhini Hylobates Pongo
[no output]
Maybe early hominines split from orangs in South Asia before moving to Africa.
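Both the LCA lookup and the monophyly test are easy to sketch with a child-to-parent map of
the catarrhini tree. This is my own illustration rather than nw_clade's code, and the names
"apes", "mp" and "root" are invented here for the three unlabeled nodes.

```python
# Child -> parent map of the catarrhini tree. "apes", "mp" and "root" are
# made-up names for nodes that carry no label in the Newick file.
PARENT = {
    "Gorilla": "Homininae", "Pan": "Hominini", "Homo": "Hominini",
    "Hominini": "Homininae", "Homininae": "Hominidae", "Pongo": "Hominidae",
    "Hominidae": "apes", "Hylobates": "apes", "apes": "root",
    "Macaca": "mp", "Papio": "mp", "mp": "Cercopithecinae",
    "Cercopithecus": "Cercopithecinae", "Cercopithecinae": "Cercopithecidae",
    "Simias": "Colobinae", "Colobus": "Colobinae",
    "Colobinae": "Cercopithecidae", "Cercopithecidae": "root",
}

def ancestors(node):
    """The node itself followed by its ancestors, up to the root."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def lca(labels):
    """Last common ancestor of the given nodes (leaves or inner nodes)."""
    first, *others = labels
    shared = [set(ancestors(node)) for node in others]
    # Walking up from the first node, the first ancestor shared by all is the LCA.
    return next(a for a in ancestors(first) if all(a in s for s in shared))

def leaves_under(node):
    """Set of leaves that descend from `node`."""
    children = [child for child, parent in PARENT.items() if parent == node]
    if not children:
        return {node}
    return set().union(*(leaves_under(child) for child in children))

def is_monophyletic(leaves):
    """True if the LCA has no descendant leaf other than those given."""
    return leaves_under(lca(leaves)) == set(leaves)

print(lca(["Homo", "Hylobates"]), sorted(leaves_under(lca(["Homo", "Hylobates"]))))
print(is_monophyletic(["Homo", "Gorilla", "Pan"]))
print(is_monophyletic(["Hylobates", "Pongo"]))
```

The LCA of Hylobates and Pongo is the apes node, which also has Gorilla, Pan and Homo
among its leaves, hence the negative answer.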
2.4.2 Context
You can ask for n levels above the clade by passing option -c:
[Figure: the apes subtree (Gorilla, Pan, Homo, Pongo, Hylobates).]
In this case, nw_clade computed the LCA of Gorilla and Homo, “climbed up” two
levels, and output the subtree at that point. This is useful when you want to extract a
clade with its nearest neighbor(s). I use this when I have several trees in a file and my
clade’s nearest neighbors aren’t always the same.
2.4.3 Siblings
You can also ask for the siblings of the specified clade. What, for example, is the sister
clade of the cercopithecids? Ask for Cercopithecidae and pass option -s:
[Figure: the ape subtree (Gorilla, Pan, Homo, Pongo and Hylobates)]
Why, it’s the good old apes, of course. I use this a lot when I want to get rid of the
outgroup: specify the outgroup and pass -s, and behold, you have the ingroup.
Finally, although we are usually dealing with bifurcating trees, -s also applies to
multifurcations: if a node has more than one sibling, nw_clade reports them all, in
Newick order.
2.4.4 Limits
nw_clade assumes that node labels are unique. This should change in the future.
[Figure: radial tree of virus isolates (HRV, HEV, POLIO, COXA, ECHO, ...) with support values shown at the inner nodes]
In this case I have colored the support values red. Option -p uses percentages instead
of absolute counts.
Notes
There are many tree-building programs that compute bootstrap support. For example,
PhyML can do it, but for large tasks I typically have to distribute the replicates
over several jobs (say, 100 jobs of 10 replicates each). I then collect all replicate files,
concatenate them, and use nw_support to attribute the values to the target tree.
nw_support assumes rooted trees (it may as well, since Newick is implicitly rooted),
and the target tree and replicates should be rooted the same way. Use nw_reroot to
ensure this.
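To make the idea of “attributing support values” concrete, here is a small Python sketch of one way to do it on rooted trees: each inner node is identified by the set of leaves below it, and its support is the number of replicates that contain the same set. This is an illustration under that assumption, not a description of nw_support's internals; the toy trees and the nested-tuple representation are made up for the example.

```python
from collections import Counter

def clade_sets(tree):
    """Leaf sets of all inner nodes (root included) of a rooted tree.
    A tree is a nested tuple; a leaf is a string."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        found.add(leaves)
        return leaves

    visit(tree)
    return found

def support(target, replicates):
    """Map each clade of the target tree to the number of replicates
    in which the same clade (same leaf set) occurs."""
    counts = Counter()
    for replicate in replicates:
        counts.update(clade_sets(replicate))
    return {clade: counts[clade] for clade in clade_sets(target)}

target = ((("A", "B"), "C"), "D")
replicates = [
    ((("A", "B"), "C"), "D"),
    ((("A", "C"), "B"), "D"),
    ((("A", "B"), "C"), "D"),
]
result = support(target, replicates)
for clade in sorted(result, key=len):
    count = result[clade]
    print(sorted(clade), count, f"{100 * count / len(replicates):.0f}%")
```

The last column shows the same values as percentages, which is the distinction option -p makes. Because clades are compared as leaf sets below a node, the target and the replicates must be rooted the same way for the counts to be meaningful.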
2.6 Retaining Topology
There are cases when one is more interested in the tree’s structure than in the branch
lengths, maybe because lengths are irrelevant or just because they are so short that they
obscure the branching order. Consider the following tree, vrt1.nw:
++ Mesocricetus
| 56
+++37amias
||
|++ Procavia
| 10
| | Papio
| |
|-5 63mo
| | 73
| | Hylobates
|
++ 10rex
||
||+-----+ Bombina
|+9 51
| +-+ Didelphis
|
|-+ Lepus
| 24
|------------+ Tetrao
|
+-------+-75Bradypus
| |
| | +-+ Vulpes
+-----------------------+ 100 +-+ 53
| | +---------+ Orcinus
| |
=| ++ Danio
|
+--+ Tetraodon
|
+-----+ Fugu
|---------|---------|---------|--------|------
0 0.2 0.4 0.6 0.8
substitutions/site
Its structure is not evident, particularly in the upper half. This is because many branches
are short in relation to the depth of the tree, so they are not well resolved. A better-
resolved tree can be obtained by discarding branch lengths altogether:
+---+ Mesocricetus
+----+ 56
+----+ 37 +---+ Tamias
| |
| +--------+ Procavia
+---+ 10
| | +--------+ Papio
| | |
+---+ 5 +----+ 63 +---+ Homo
| | +----+ 73
| | +---+ Hylobates
| |
+----+ 10+-----------------+ Sorex
| |
| | +-----------------+ Bombina
+----+ 9 +---+ 51
| | +-----------------+ Didelphis
| |
| | +---------------------+ Lepus
+---+ 24 +----+ 4
| | +---------------------+ Tetrao
| |
+---+ 75+-------------------------------+ Bradypus
| |
| | +-------------------------------+ Vulpes
+----+ 100---+ 53
| | +-------------------------------+ Orcinus
| |
=| +---------------------------------------+ Danio
|
+--------------------------------------------+ Tetraodon
|
+--------------------------------------------+ Fugu
This effectively produces a cladogram, that is, a tree that represents ancestry relation-
ships but not amounts of evolutionary change. The inner nodes are evenly spaced
over the depth of the tree, and the leaves are aligned, so the branching order is more
apparent.
Of course, ASCII trees have low resolution in the first place, so I’ll show how both trees
look in SVG. First the original:
[Figure: vrt1.nw displayed as SVG with its branch lengths; scale bar from 0 to 0.8 substitutions/site]
[Figure: vrt1.nw displayed as an SVG cladogram, without branch lengths]
As you can see, even with SVG’s much better resolution, it can be useful to display the
tree as a cladogram.
nw_topology has the following options: -b keeps the branch lengths (obviously,
using this option alone has no effect); -I discards inner node labels, and -L discards
leaf labels. An extreme example is the following, which discards everything but topology:
$ nw_topology -IL vrt1.nw
This produces the following tree, which is still valid Newick:
((((((((((,),),(,(,))),),(,)),(,)),),(,)),),,);
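For simple trees, the same kind of stripping can be imitated with two regular expressions. The sketch below is only that, a lexical approximation in Python: it assumes there are no quoted labels or comments in the Newick string, and the function names are made up. The example tree is the falconiformes tree used in section 2.13.

```python
import re

def strip_lengths(newick):
    """Drop every ':length' field, keeping the labels."""
    return re.sub(r":[^,();]+", "", newick)

def bare_topology(newick):
    """Keep only the structural characters of Newick: ( ) , ;"""
    return re.sub(r"[^(),;]", "", newick)

tree = (
    "(Pandion:7,(((Accipiter:1,Buteo:1):1,(Aquila:1,Haliaeetus:2):1):2,"
    "(Milvus:2,Elanus:3):2):3,Sagittarius:5,((Micrastur:1,Falco:1):3,"
    "(Polyborus:2,Milvago:1):2):2);"
)

print(strip_lengths(tree))
print(bare_topology(tree))  # (,(((,),(,)),(,)),,((,),(,)));
```

The second result is still valid Newick, since both labels and lengths are optional.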
2.7 Extracting Distances
nw_distance prints distances between nodes, in various ways. By default, it prints
the distance from the root of the tree to each labeled leaf, in Newick order. Let’s look at
distances in the catarrhinian tree:
[Figure: the catarrhini tree with branch lengths; scale bar from 0 to 60 substitutions/site]
$ nw_distance catarrhini
56
60
60
55
30
65
65
45
25
22
This means that the distance from the root to Gorilla is 56, etc. The distances are in
the same units as the tree’s branch lengths – usually substitutions per site, but this is
not specified in the tree itself. If the tree is a cladogram, the distances are expressed in
numbers of ancestors. Option -n shows the labels:
$ nw_distance -n catarrhini
Gorilla 56
Pan 60
Homo 60
Pongo 55
Hylobates 30
Macaca 65
Papio 65
Cercopithecus 45
Simias 25
Colobus 22
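A root-to-leaf distance is just the sum of the branch lengths on the path from the root. The Python sketch below checks this against the listing above. The Newick string is my reconstruction of the catarrhini tree from the figures and outputs in this tutorial (an assumption: the actual file is not reproduced in the text), and the small parser handles neither quoted labels nor comments.

```python
import re

# Reconstructed from the tutorial's figure and outputs (an assumption).
CATARRHINI = (
    "((((Gorilla:16,(Pan:10,Homo:10)Hominini:10)Homininae:15,Pongo:30)"
    "Hominidae:15,Hylobates:20):10,(((Macaca:10,Papio:10):20,"
    "Cercopithecus:10)Cercopithecinae:25,(Simias:10,Colobus:7)Colobinae:5)"
    "Cercopithecidae:10);"
)

def parse(newick):
    """Tiny recursive-descent Newick parser (no quoted labels, no comments)."""
    tokens = re.findall(r"[(),;]|[^(),;]+", newick)
    pos = 0

    def node():
        nonlocal pos
        children = []
        if tokens[pos] == "(":
            while tokens[pos] in ("(", ","):  # "(" opens, "," continues
                pos += 1
                children.append(node())
            pos += 1                          # skip the closing ")"
        label, length = "", None
        if tokens[pos] not in ("(", ")", ",", ";"):
            label, _, text = tokens[pos].partition(":")
            label = label.strip()
            length = float(text) if text.strip() else None
            pos += 1
        return {"label": label, "length": length, "children": children}

    return node()

def leaf_distances(node, depth=0.0):
    """Yield (label, distance from root) for each leaf, in Newick order."""
    depth += node["length"] or 0.0
    if not node["children"]:
        yield node["label"], depth
    for child in node["children"]:
        yield from leaf_distances(child, depth)

for label, distance in leaf_distances(parse(CATARRHINI)):
    print(f"{label}\t{distance:g}")
```

For Gorilla, for instance, the path is 10 + 15 + 15 + 16 = 56, which is the first value in the listing.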
There are two main parameters to nw_distance: the method and the selection. The
method determines how to compute the distance (from what node to what node), and
the selection determines for which nodes the program is to compute distances. Let’s
look at examples.
2.7.1 Selection
In this section we will show the different selection types, using the default distance
method (i.e., from the tree’s root – see below). The selection type is the argument to
option -s. The nodes appear in the same order as in the Newick tree, except when
they are specified on the command line (see below).
To illustrate the selection types, we need a tree that has both labeled and unlabeled
leaves and inner nodes. Here it is:
$ nw_display -s dist_sel_xpl.nw
[Figure: a small tree with labeled and unlabeled nodes, including leaves B and A and inner node C; scale bar from 0 to 6 substitutions/site]
B 3
A 6
All labeled nodes
Option -s l. This takes all labeled nodes into account, whether they are leaves or
inner nodes.
$ nw_distance -n -s l dist_sel_xpl.nw
B 3
A 6
C 4
All leaves
Option -s f. Selects all leaves, whether they are labeled or not.
$ nw_distance -n -s f dist_sel_xpl.nw
B 3
4
A 6
7
2
C 4
All nodes
Option -s a. All nodes are selected.
$ nw_distance -n -s a dist_sel_xpl.nw
B 3
4
2
A 6
7
C 4
0
$ nw_distance -n dist_sel_xpl.nw A C
A 6
C 4
2.7.2 Methods
In this section we will take the default selection and vary the method. The method is
passed as argument to option -m. I will also use an ad hoc tree to illustrate the methods:
$ nw_display -s dist_meth_xpl.nw
[Figure: tree with root r, inner nodes d and e, and leaves A, B and C; scale bar from 0 to 6 substitutions/site]
As explained above, the default selection consists of all labeled leaves – in our case,
nodes A, B and C.
$ nw_distance -n -m l dist_meth_xpl.nw
A 5
B 4
C 1
A 2
B 1
C 1
Matrix
Option -m m. Computes the pairwise distances between all nodes in the selection, and
prints them out as a matrix.
$ nw_distance -n -m m dist_meth_xpl.nw
A B C
A 0 3 6
B 3 0 5
C 6 5 0
3
6 5
A 0
B 3 0
C 6 5 0
For all other formats, the values are printed in a line, separated by TABs.
$ nw_distance -n -t -m p dist_meth_xpl.nw
A B C
2 1 1
1. leaves with labels found in both trees are kept, the other ones are pruned
2. inner labels are discarded
3. both trees are ordered (as done by nw_order, see 2.15)
Example: finding trees with a specified subtree topology
File hominoidea.nw contains seven trees corresponding to successive theories about
the phylogeny of apes (these were taken from http://en.wikipedia.org/wiki/Hominoidea).
Let us see which of them group humans and chimpanzees as a sister
clade of gorillas (which is the current hypothesis).
Here are small images of each of the trees in hominoidea.nw:
[Figure: small images of the seven trees in hominoidea.nw. Tree #7 (split of Hylobates) divides Hominoidea into Hominidae (Homo and Pan in Hominini, Gorilla in Gorillini, Pongo in Ponginae) and Hylobatidae (Hylobates, Hoolock, Symphalangus, Nomascus)]
Trees #6 and #7 match our criterion, the rest do not. To look for matching trees in
hominoidea.nw, we pass the pattern on the command line:
+-----------+ Homo
+-----------+ Hominini
+-----------+ Homininae +-----------+ Pan
| |
+-----------+ Hominidae +-------------Gorillini-+ Gorilla
| |
=| Hominoidea+-------------Ponginae--------------+ Pongo
|
+-------------Hylobatidae-----------------------+ Hylobates
+----------+ Homo
+----------+ Hominini
+-----------+ Homininae+----------+ Pan
| |
+----------+ Hominidae +------------Gorillini+ Gorilla
| |
| +-------------Ponginae------------+ Pongo
|
=| Hominoidea---------------------------------+ Hylobates
| |
| +---------------------------------+ Hoolock
+----------+ Hylobatidae
+---------------------------------+ Symphalangus
|
+---------------------------------+ Nomascus
Note that only the pattern tree’s topology matters: we would get the same results with
pattern ((Homo,Pan),Gorilla);, ((Pan,Homo),Gorilla);, etc., but not with
((Gorilla,Pan),Homo); (which would select trees #1, 2, 3, and 5). In future versions
I might add an option for strict matching.
The behaviour of nw_match can be reversed by passing option -v (like grep -v):
it will print trees that do not match the pattern. Finally, note that nw_match only works
on leaf labels (for now), and assumes that labels are unique in both the pattern and the
target tree.
or sequence IDs are longer than that. One solution is to rename the sequences, before
constructing the tree, using a numerical scheme, e.g., Strongylocentrotus purpuratus →
ID_01, etc. This means we have an alignment of the following form:
154 259
ID_01 PTTSNSAPAL DAAETGHTSG ...
ID_02 SVSSHSVPAL DAAETGHTSS ...
...
together with a renaming map, id2longname.map:
ID_01 Strongylocentrotus_purpuratus
ID_02 Harpagofututor_volsellorhinus
...
The alignment’s IDs are now sufficiently short, and we can use it to make a tree. It will
look something like this:
$ nw_display -s short_IDs.nw -v 30
[Figure: tree whose leaves are labeled ID_09, ID_07, ID_04, ID_05, ID_01, ID_02, ID_06, ID_03 and ID_08]
Not very informative, huh? But we can put back the original, long names
(option -W specifies the mean width of label characters, in pixels – use it when the
default is wrong, as in this case with very long labels and small characters):
[Figure: the same tree with the long names restored: Anaerobiospirillium succiniciproducens, Notiocryptorrhynchus punctatocarinulatus, Parastratiosphecomyia stratiosphecomyioides, Gammaracanthuskytodermogammarus loricatobaicalensis, Strongylocentrotus purpuratus, Harpagofututor volsellorhinus, Tahuantinsuyoa macantzatza, Ephippiorhynchus senegalensis, Ia io]
Now that’s better... although exactly what these critters are might not be evident. Not
to worry, I’ve made another map and I can rename the tree a second time on the fly:
$ nw_rename short_IDs.nw id2longname.map \
| nw_rename - longname2english.map \
| nw_display -s -v 30 -W 10 -
[Figure: the same tree with English names: bacterium, weevil, soldier fly, amphipod crustacean, sea urchin, fossil shark, cichlid fish, saddle-billed stork, bat]
$ nw_topology HRV_FMDV.nw | nw_display -sr -w 400 -
[Figure: radial cladogram of HRV_FMDV.nw; the leaves are virus isolates (ECHO1, COXB2, HRV16, HRV93, FMDV-C, ...) and the inner nodes carry support values]
I want to see if the tree correctly groups isolates of the same species together. So I use a
renaming map that maps an isolate name to its species (note by the way that the map
file can have comment, whitespace-only and empty lines, which are all ignored, just
like CSS maps, see 2.1.2):
# These species belong to HRV-A
HRV16 HRV-A
HRV1B HRV-A
...
# HRV-B
HRV37 HRV-B
HRV14 HRV-B
...
# Enterovirus
POLIO1A HEV
COXA17 HEV
[Figure: the same radial cladogram after renaming; the leaves are now labeled by species (HEV, HRV-A, HRV-B, FMDV-C)]
As we can see, it does. This would be even better if we could somehow simplify the
tree so that clades of the same species were reduced to a single leaf. And that’s exactly
what nw_condense does (see below).
2.10 Condensing
Condensing a tree means reducing its size in a systematic, meaningful way (compare
this to pruning (2.11) which arbitrarily removes branches, and to trimming (2.12) which
cuts a tree at a specified depth). Currently the only condensing method available is
simplifying clades in which all leaves have the same label - for example because they
belong to the same taxon, etc. Consider this tree:
[Figure: a tree with a clade made only of A leaves, a clade made only of C leaves, and a single B leaf]
it has a clade that consists only of A, another of only C, plus a B leaf. Condensing will
replace those clades by an A and a C leaf, respectively:
Now the A and C leaves stand for whole clades. The tree is simpler, but the information
about the relationships of A, B and C is conserved, while the details of the A and C
clades are not. A typical use of this is producing genus trees from species trees (or any
higher-level tree from a lower-level one), or checking consistency with other data. For
example, condensing the virus tree of section 2.9.2 gives this:
The relationships between the species are now evident – as is the fact that the various
isolates do cluster within species in the first place. This need not be the case, and
renaming-then-condensing is a useful technique for checking this kind of consistency
in a tree (see 3.1 for more examples).
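The condensing rule described in this section is easy to state recursively, as in the Python sketch below. It is an illustration of the rule, not nw_condense itself; the nested-list representation and both toy topologies are assumptions, while the isolate-to-species pairs come from the renaming map of section 2.9.2.

```python
def condense(node):
    """Collapse every clade whose leaves all carry the same label.
    A leaf is a string; an inner node is a list of children."""
    if isinstance(node, str):
        return node
    children = [condense(child) for child in node]
    first = children[0]
    if isinstance(first, str) and all(child == first for child in children):
        return first  # the whole clade becomes a single leaf
    return children

# A clade of A's, a lone B, and a clade of C's (hypothetical topology).
print(condense([["A", ["A", "A"]], "B", ["C", "C"]]))  # ['A', 'B', 'C']

# Renaming-then-condensing: map isolates to species, then condense.
species = {"HRV16": "HRV-A", "HRV1B": "HRV-A", "HRV37": "HRV-B", "HRV14": "HRV-B"}
isolates = [["HRV16", "HRV1B"], ["HRV37", "HRV14"]]  # hypothetical topology
renamed = [[species[leaf] for leaf in clade] for clade in isolates]
print(condense(renamed))  # ['HRV-A', 'HRV-B']
```

If the isolates of one species did not form a clade, the species label would show up more than once in the condensed tree, which is exactly the inconsistency this technique is meant to reveal.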
2.11 Pruning
Pruning is simply removing arbitrary nodes. Say you have the following tree (as it
happens, it contains a glaring error since the sister clade of mammals is the amphibian
rather than the bird):
[Figure: vertebrate tree with leaves Procavia, Vulpes, Orcinus, Bradypus, Mesocricetus, Tamias, Sorex, Homo, Papio, Hylobates, Lepus, Didelphis, Bombina, Tetrao, Danio, Tetraodon and Fugu, and an inner node labeled Mammalia]
and say you only need a subset of the species, perhaps because you want to compare
this tree to another tree with fewer species. Specifically, let’s say you don’t need to
show Tetraodon, Danio, Bombina, and Didelphis. You just pass those labels to nw_prune:
[Figure: the pruned tree, with leaves Procavia, Vulpes, Orcinus, Bradypus, Mesocricetus, Tamias, Sorex, Homo, Papio, Hylobates, Lepus, Tetrao and Fugu]
Note that each label is removed individually. The discarding of Didelphis is the cause
of the disappearance of the node labeled Mammalia. And the embarrassing error is
hidden by the removal of Bombina.
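Why does removing Didelphis make the Mammalia label disappear? One plausible reading, sketched below in Python, is that a node left with a single child after pruning is spliced out and replaced by that child, so its label goes with it. This is my interpretation of the behaviour described above, not nw_prune's source; the (label, children) tuples and the toy topology are assumptions.

```python
def prune(node, unwanted):
    """Remove every node whose label is in `unwanted`.
    A node is (label, children); a leaf has an empty children list.
    An inner node left with one child is replaced by that child."""
    label, children = node
    if label in unwanted:
        return None
    if not children:
        return node
    kept = [p for p in (prune(child, unwanted) for child in children)
            if p is not None]
    if not kept:
        return None      # nothing left below this node
    if len(kept) == 1:
        return kept[0]   # splice out the now-redundant node
    return (label, kept)

# Hypothetical miniature of the tree: Mammalia = (placentals, Didelphis).
tree = ("", [
    ("Mammalia", [
        ("", [("Homo", []), ("Papio", [])]),
        ("Didelphis", []),
    ]),
    ("Tetrao", []),
])

print(prune(tree, {"Didelphis"}))
print(prune(tree, {"Mammalia"}))
```

In the first call the node labeled Mammalia is left with one child and vanishes; in the second the whole mammalian clade is discarded through its inner label.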
You can also discard internal nodes, if they are labeled (in future versions it will be
possible to discard a clade by specifying descendants, just like nw_clade). For example,
you can discard the whole mammalian clade like this:
$ nw_prune vrt2_top.nw Mammalia | nw_display -s -
[Figure: the remaining tree, with leaves Bombina, Tetrao, Danio, Tetraodon and Fugu]
By the way, Tetrao and Tetraodon are not the same thing, the first is a bird (grouse),
the second is a pufferfish.
[Figure: tree with leaves Sorex, Bombina, Tetrao and Fugu]
Note that I also passed Mammalia, for the reason discussed above: the node with this
label would go away if I did not, resulting in a different tree (try it out).
6 In future versions there will be an option for finer control of this behaviour
2.12 Trimming trees
Trimming a tree means cutting the nodes whose depth is larger than a specified thresh-
old. Here is what will happen if I cut the catarrhini tree at depth 30:
[Figure: the catarrhini tree with branch lengths and a red vertical line at depth 30; scale bar from 0 to 60 substitutions/site]
The tree will be “cut” on the red line, and everything right of it will be discarded:
[Figure: the catarrhini tree trimmed at depth 30; the remaining leaves are Homininae, Pongo, Hylobates, Cercopithecinae, Simias and Colobus]
[Figure: a large trimmed tree; its leaves are either original leaves (ID_1, ID_4, ...) or former inner nodes labeled with absolute bootstrap support values (972, 882, 1000, ...); scale bar from 0 to 0.4 substitutions/site]
The leaves with labels of the form ID_* are also leaves in the original tree, the other
leaves are former inner nodes whose children got trimmed. Their labels are the (absolute)
bootstrap support values of those nodes. Note that the branch lengths are conserved.
It is apparent that the ingroup’s lower half has very poor support. This would
be harder to see without trimming the tree, due to its huge size.
Trimming cladograms
By definition, cladograms do not have branch lengths, so you need to express depth in
numbers of ancestors, and thus you want to pass -a.
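Trimming by depth can be sketched as follows in Python: any branch that reaches the threshold is shortened so that it ends exactly there, and everything below it is dropped. This is only an illustration of the idea. The Newick string is my reconstruction of the catarrhini tree from the figures and outputs in this tutorial (an assumption), the parser ignores quoted labels and comments, and treating a node that ends exactly at the threshold as cut is a choice of this sketch.

```python
import re

# Reconstructed from the tutorial's figure and outputs (an assumption).
CATARRHINI = (
    "((((Gorilla:16,(Pan:10,Homo:10)Hominini:10)Homininae:15,Pongo:30)"
    "Hominidae:15,Hylobates:20):10,(((Macaca:10,Papio:10):20,"
    "Cercopithecus:10)Cercopithecinae:25,(Simias:10,Colobus:7)Colobinae:5)"
    "Cercopithecidae:10);"
)

def parse(newick):
    """Tiny recursive-descent Newick parser (no quoted labels, no comments)."""
    tokens = re.findall(r"[(),;]|[^(),;]+", newick)
    pos = 0

    def node():
        nonlocal pos
        children = []
        if tokens[pos] == "(":
            while tokens[pos] in ("(", ","):
                pos += 1
                children.append(node())
            pos += 1  # skip the closing ")"
        label, length = "", None
        if tokens[pos] not in ("(", ")", ",", ";"):
            label, _, text = tokens[pos].partition(":")
            length = float(text) if text else None
            pos += 1
        return {"label": label, "length": length, "children": children}

    return node()

def trim(node, threshold, start=0.0):
    """Cut the tree at depth `threshold`; `start` is the parent's depth."""
    end = start + (node["length"] or 0.0)
    if end >= threshold:
        # The branch reaches the cut: shorten it, drop what lies below.
        return {"label": node["label"], "length": threshold - start,
                "children": []}
    children = [trim(child, threshold, end) for child in node["children"]]
    return {"label": node["label"], "length": node["length"],
            "children": children}

def to_newick(node):
    """Serialize a node (without the final semicolon)."""
    text = ""
    if node["children"]:
        text = "(" + ",".join(to_newick(c) for c in node["children"]) + ")"
    text += node["label"]
    if node["length"] is not None:
        text += f":{node['length']:g}"
    return text

print(to_newick(trim(parse(CATARRHINI), 30)) + ";")
```

The printed tree has Homininae, Pongo and Cercopithecinae as shortened leaves (lengths 5, 5 and 20), in line with the trimmed catarrhini figure above.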
2.13 Indenting
nw_indent reformats Newick on several lines, with one node per line, nodes of the
same depth in the same column, and children nodes to the right of their parent. This
shows the structure more clearly than the compact form, but since whitespace is ignored
in the Newick format (see footnote 7), the indented form is still valid. For example, this is a tree
in compact form, in file falconiformes:
(Pandion:7,(((Accipiter:1,Buteo:1):1,(Aquila:1,Haliaeetus:2):1):2,
(Milvus:2,Elanus:3):2):3,Sagittarius:5,((Micrastur:1,Falco:1):3,
(Polyborus:2,Milvago:1):2):2);
7 except between quotes
And this is the same tree, indented:
$ nw_indent falconiformes
(
  Pandion:7,
  (
    (
      (
        Accipiter:1,
        Buteo:1
      ):1,
      (
        Aquila:1,
        Haliaeetus:2
      ):1
    ):2,
    (
      Milvus:2,
      Elanus:3
    ):2
  ):3,
  Sagittarius:5,
  (
    (
      Micrastur:1,
      Falco:1
    ):3,
    (
      Polyborus:2,
      Milvago:1
    ):2
  ):2
);
The structure is much clearer, and it is also relatively easy to edit manually in a text
editor, while still being valid Newick.
Another advantage of indenting is that it is resistant to certain errors which would
cause nw_display to fail (see footnote 8). For example, there is an error in this tree:
(Pandion:7,((Buteo:1,Aquila:1,Haliaeetus:2):2,(Milvus:2,
Elanus:3):2):3,Sagittarius:5((Micrastur:1,Falco:1):3,
(Polyborus:2,Milvago:1):2):2);
8 This is because indenting is a purely lexical process, hence it does not need a syntactically correct tree.
yet it is hard to spot, and trying nw_display won’t help as it will abort with a parse
error. With nw_indent, however, you can at least look at the tree:
(
  Pandion:7,
  (
    (
      Buteo:1,
      Aquila:1,
      Haliaeetus:2
    ):2,
    (
      Milvus:2,
      Elanus:3
    ):2
  ):3,
  Sagittarius:5
  (
    (
      Micrastur:1,
      Falco:1
    ):3,
    (
      Polyborus:2,
      Milvago:1
    ):2
  ):2
);
While the error is not exactly obvious, you can at least view the Newick. It turns out
there is a comma missing after Sagittarius:5.
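To see why a purely lexical indenter copes with broken trees, here is a minimal one in Python. It never builds a tree: it only reacts to parentheses, commas and semicolons as they come. It is a sketch of the approach, not nw_indent's implementation, and it simply drops all whitespace, so it is not safe for quoted labels that contain spaces.

```python
def indent(newick, unit="  "):
    """Indent Newick lexically, one node per line, without parsing it."""
    lines, depth, buf = [], 0, ""

    def flush():
        nonlocal buf
        if buf:
            lines.append(buf)
            buf = ""

    for ch in newick:
        if ch.isspace():
            continue                      # whitespace is not significant
        if ch == "(":
            flush()                       # e.g. a label with no comma after it
            lines.append(unit * depth + "(")
            depth += 1
        elif ch == ")":
            flush()
            depth -= 1
            buf = unit * depth + ")"      # label/length may follow
        elif ch in ",;":
            lines.append((buf or unit * depth) + ch)
            buf = ""
        else:
            buf = (buf or unit * depth) + ch
    flush()
    return "\n".join(lines)

# A well-formed tree, indented by four spaces.
print(indent("((Buteo:1,Aquila:1,Haliaeetus:2):2,(Milvus:2,Elanus:3):2):3;",
             unit="    "))

# A malformed tree (comma missing after B:2) is still indented.
print(indent("(A:1,B:2(C:1,D:1):2);"))
```

In the second output, B:2 sits on a line without a trailing comma right above an opening parenthesis, which is the same visual clue as for Sagittarius:5 above.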
The indentation can be varied by supplying a string (option -t) that will be used
instead of the default (which is two spaces). If you want to indent by four spaces
instead of two, you could say this:
$ nw_indent -t '    ' accipitridae
(
    (
        Buteo:1,
        Aquila:1,
        Haliaeetus:2
    ):2,
    (
        Milvus:2,
        Elanus:3
    ):2
):3;
(
| (
| | Buteo:1,
| | Aquila:1,
| | Haliaeetus:2
| ):2,
| (
| | Milvus:2,
| | Elanus:3
| ):2
):3;
Now the indentation levels are easier to see, but at the expense of the tree no longer
being valid Newick.
Finally, option -c (“compact”) does the reverse: it removes all indentation and produces
a compact tree. You can use this when you want to produce a compact Newick
file after editing. For example, using Vim, after loading a Newick tree I do
gg!}nw_indent -
2.14 Extracting Labels
To get a list of all labels in a tree, use nw_labels:
$ nw_labels catarrhini
Gorilla
Pan
Homo
Hominini
Homininae
Pongo
Hominidae
Hylobates
Macaca
Papio
Cercopithecus
Cercopithecinae
Simias
Colobus
Colobinae
Cercopithecidae
The labels are printed out in Newick order. To get rid of internal labels, use -I:
$ nw_labels -I catarrhini
Gorilla
Pan
Homo
Pongo
Hylobates
Macaca
Papio
Cercopithecus
Simias
Colobus
Likewise, you can use -L to get rid of leaf labels, and with -t the labels are printed on
a single line, separated by tabs (here the line is folded due to lack of space).
If you just want the root’s label, pass -r. In conjunction with nw_clade (see 2.4), this
is handy to get support values of nodes defined by their descendants. For example, the
following shows the support value of the clade defined by HRV39 and HRV85 in a virus
tree similar to that of 2.1.3:
100
2.14.1 Counting Leaves in a Tree
A simple application of nw_labels is a leaf count (assuming each leaf is labeled;
Newick does not require labels):
$ nw_labels -I catarrhini | wc -l
10
[Figure: two trees with the same eleven leaves (Pandion, Aquila, Buteo, Haliaeetus, Milvus, Elanus, Sagittarius, Micrastur, Falco, Polyborus, Milvago) drawn in different orders]
But do they represent different phylogenies? In other words, do they differ by more
than just the ordering of nodes? To check this, we pass them to nw_order and use
diff to compare the results (see footnote 9):
So, after ordering, the trees are the same: they tell the same biological story. Note
that these trees are cladograms. If you have trees with branch lengths, this approach
will only work if the lengths are identical, which may or may not be what you want.
You can get rid of the branch lengths using nw_topology (see 2.6).
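The order-then-compare trick works because ordering turns every tree into a canonical form. The Python sketch below shows the principle on cladograms written as nested tuples. The ordering criterion used here (each clade is ranked by its alphabetically first leaf) is my own choice for the illustration and may differ from nw_order's default; the toy topologies are also assumptions that merely reuse genus names from the text.

```python
def sort_key(tree):
    """Rank an already ordered clade by its alphabetically first leaf."""
    return tree if isinstance(tree, str) else sort_key(tree[0])

def order(tree):
    """Recursively sort children so that trees differing only in the
    order of their nodes become identical."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((order(child) for child in tree), key=sort_key))

t1 = ("Pandion", (("Aquila", "Buteo", "Haliaeetus"), ("Milvus", "Elanus")),
      "Sagittarius")
t2 = ((("Elanus", "Milvus"), ("Buteo", "Aquila", "Haliaeetus")),
      "Sagittarius", "Pandion")
t3 = ("Pandion", (("Aquila", "Milvus", "Haliaeetus"), ("Buteo", "Elanus")),
      "Sagittarius")

print(order(t1) == order(t2))  # True: same phylogeny, different drawing
print(order(t1) == order(t3))  # False: Milvus and Buteo have swapped clades
```

Comparing the ordered forms plays the role that diff plays on the output of nw_order.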
2.15.1 Variants
Other ordering criteria are available through option -c. To order a tree by number of
descendants (i.e., “light” nodes before “heavy” nodes), pass -c n. This has the effect
of “ladderizing” trees which are heavily imbalanced. Consider this tree:
9 One could also compute a checksum using md5sum, etc
[Figure: radial cladogram of HRV_cg.nw (virus isolates with support values), heavily imbalanced]
Here is the same tree, reordered by number of descendants: light nodes appear before
(clockwise) heavy nodes:
$ nw_order -c n HRV_cg.nw \
| nw_display -sSr -b 'visibility:hidden' -v 30 -w 450 -
[Figure: the same radial cladogram after reordering by number of descendants]
De-ladderizing
Incidentally, “ladderizing” a tree may not be a good idea, because it lends itself to
misinterpretations. For example, the following tree leads some people (including professional
biologists, apparently [1]) to the following mistakes:
[Figure: ladderized vertebrate tree with leaves Homo, Equus, Columba, Xenopus, Carcharodon and Petromyzon, and inner nodes Mammalia, Amniota, Tetrapoda, Gnathostomata and Vertebrata]
• there is a “chain of being” with “higher” and “lower” organisms, with (surprise!)
humans at the top; “higher” can be interpreted in various ways, including “more
perfect”, or “more evolved” or even morally superior. This is known as the scala
naturæ fallacy.
• there is a “main line” that progressively leads to (surprise!) humans, with “offshoots”
along the way – lowly lampreys branching out first, then sharks, etc.
• early-branching species (this is itself an error) are “primitive”: in our case, it
would mean that the last common ancestor of lampreys and humans was a lamprey
(or very like one); that the LCA of humans and sharks was very much like a
modern shark, etc.
$ nw_order -c d scala.nw \
| nw_display -s -v 30 -l 'font-style:italic' -
[Figure: the same tree after de-ladderizing, with leaves Petromyzon, Xenopus, Homo, Equus, Columba and Carcharodon]
It is less easy now to construe that there is a chain of being, or that evolution is progressive,
etc. Unfortunately, some folks take the new tree to mean that humans are more
closely related to amphibians (Xenopus) than to birds (Columba). There is no substitute
for actually learning how to interpret trees, I’m afraid.
[Figure: tree of Chordata, with Vertebrata (Gnathostomata and Conodonta) and Urochordata]
Suppose we have the following information about the age of certain events (not that
it matters, but I found it in Wikipedia and on the Palaeos website, www.palaeos.com):
event                                                   age (million years ago)
split of vertebrates into gnathostomes and conodonts    530
extinction of conodonts                                 200
split of chordates into vertebrates and urochordates    540
We can use the “branch length” field of Newick to specify ages, like this:
$ cat age.nw
(
(
Gnathostomata,
Conodonta:200
)Vertebrata:530,
Urochordata
)Chordata:540;
The “branch length” of Vertebrata becomes 530, because the vertebrate lineage split
into conodonts and gnathostomes at that date (footnote 10). Note that the Gnathostomata leaf
has no age: this means that there are still living gnathostomes (such as you and I, footnote 11); the
same goes for urochordates. In other words, a leaf with no age has an implicit age of
zero. This also ensures that the leaves of the extant taxa are aligned. The Conodonta leaf,
on the other hand, has an age although it is a leaf: this is because the conodonts went
extinct, around 200 Mya.
Now, if we were to display this tree without further ado, it would be nonsense. We
have to convert the ages into durations, and this is the function of nw_duration:
[Figure: the tree after conversion, with branch lengths Gnathostomata 530, Vertebrata 10, Conodonta 330 and Urochordata 540; scale bar from 0 to 500, labeled substitutions/site]
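The conversion from ages to durations boils down to one subtraction per node: the duration of a branch is the age of the parent minus the age of the node, with a missing age counting as zero. The Python sketch below applies this to age.nw, written as nested tuples. It illustrates the arithmetic only; how nw_duration itself treats the root is not stated in the text, so the root is simply given no duration here.

```python
# age.nw as nested tuples: (label, age, children).
# An age of None means no age was given, i.e. the taxon is extant (age 0).
AGE_TREE = ("Chordata", 540, [
    ("Vertebrata", 530, [
        ("Gnathostomata", None, []),
        ("Conodonta", 200, []),
    ]),
    ("Urochordata", None, []),
])

def durations(node, parent_age=None, out=None):
    """Map each label to the duration of its parent branch:
    the parent's age minus the node's own age (None for the root)."""
    if out is None:
        out = {}
    label, age, children = node
    age = age or 0  # missing age: implicit age of zero
    out[label] = None if parent_age is None else parent_age - age
    for child in children:
        durations(child, age, out)
    return out

for label, duration in durations(AGE_TREE).items():
    print(label, duration)
```

The values 10, 530, 330 and 540 are the branch lengths shown in the converted tree: the Vertebrata branch lasted from 540 to 530 Mya, the Conodonta branch from 530 to 200 Mya, and so on.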
We can improve the graph by supplying option -t to nw_display: this aligns the
origin of the scale bar with the leaves and counts backwards. To top it off, we’ll specify
the units as million years ago:
[Figure: the same tree with the origin of the scale bar aligned with the leaves, counting backwards in million years ago]
Since you’re curious, here is what the age.nw tree looks like if we “forget” to run it
through nw_duration:
[Figure: age.nw displayed as is, with the ages (Vertebrata 530, Conodonta 200) drawn as branch lengths]
Now it looks as though only the conodonts are still alive, while the gnathostomes and
urochordates each had brief flashes of existence 200 and 730 million years ago, respec-
tively. Don’t show this to a palaeontologist.
[Figure: radial display of a random tree whose nodes are labeled n0, n1, n2, ...]
Here I pass option -s, whose argument is the pseudo-random number generator’s
seed, so that I get the same tree every time I produce this document. Normally, you
will not use it if you want a different tree every time. Other options are -d, which
specifies the tree’s depth, and -l, which sets the average branch length.
I use random trees to test the other applications, and also as a kind of null model to
test to what extent a real tree departs from the null hypothesis.
• a grep command
Perl is a general-purpose language, it just happens to be rather good at processing
text (footnote 12). Sed is specialized for editing text streams, and grep is designed for precisely
the line-finding task in question (footnote 13). We should expect grep to be the most efficient,
but we should not expect it to be able to perform any significantly different task. By
contrast, Perl may be (I haven’t checked!) less efficient than grep, but it can handle
pretty much any task. Sed will lie in between. The programs we have seen so far are
grep-like: they are specialized for one task (and hopefully, they are efficient).
12 Ok, Perl was initially designed for processing text – it’s the Practical Extraction and Report Language,
after all – but it has long grown out of this initial specialization.
13 The name “grep” comes from the ed command g/re/p, where “re” stands for “regular expression”.
The programs described in this section are more sed-like: they are less specialized,
usually less efficient, but more flexible than the ones shown up to now. They were in
fact inspired by sed and awk, which perform an action on the parts of input (usually
lines) that meet some condition. Rather than lines in a file, the programs presented here
work with nodes in a tree: each node is visited in turn, and if it meets a user-specified
condition, a user-specified action is performed. In other words, they are node-oriented
stream editors for trees.
As a word of warning, I should say that these programs are among the more experimental
in the Newick Utilities package. This is why there are three programs that do
basically the same thing, although differently and with different capabilities: nw_ed is
the simplest (and first), it is written entirely in C and it is fairly limited. nw_sched was
developed to address nw_ed’s limitations: by embedding a Scheme (http://www.r6rs.org)
interpreter (GNU Guile, http://www.gnu.org/software/guile/), its
flexibility is, for practical purposes, limitless. Of course, this comes at the price of
linking to an external library, which may not be available. Therefore nw_ed, for all its
limitations, will stay in the package as it has no external dependency. Finally, I understand
that Scheme is not the only solution as an embedded language, and that many
people (myself included) find learning it a bit of a challenge. Therefore, I tried the same
approach with Lua (http://www.lua.org; footnote 14), which is designed as an embeddable
language, is even smaller than Guile, and by most accounts easier to learn (footnote 15). The result,
nw_luaed, is probably the best so far: as powerful as nw_sched, while smaller,
faster and easier to use. For this reason, I will probably not develop nw_sched much
more, but I won’t drop it altogether either, not soon at any rate.
order.
15 And, in my experience, easier to embed in a C program, but your experience may differ. In particular, I
could provide all of nw_luaed’s functionality without writing a single line of Lua code, whereas nw_sched
relies on a few dozen lines of embedded Scheme code that have to be parsed and interpreted on each run.
But that may very possibly just reflect my poor Scheme/Guile skills. Furthermore, I can apparently run
nw_luaed through Valgrind’s (http://www.valgrind.org) memcheck utility without problems (I do
this with all the programs in the utils), but with nw_sched I get tons of error messages. But it may be that I
don’t get how to manage memory with Guile.
                          nw_ed     nw_sched        nw_luaed
language                  own       Scheme          Lua
programming constructs    no        Scheme’s        Lua’s
functions                 fixed     arbitrary (a)   arbitrary (a)
depends on                nothing   GNU Guile       Lua library
pre- & post-tree code     no        yes             yes
pre- & post-run code      no        yes             yes
(a) i.e., user can define their own
2.18.3 nw_luaed
Although nw_luaed is the most recent of the three, we’ll cover it first because if this
one does what you need it’s unlikely you’ll need the others. Let’s look at an example
before we jump into the details. Here is a tree of vertebrate genera, showing support
values:
$ nw_display -s -v 25 vrt2cg.nw
[Figure: vrt2cg.nw, a cladogram of vertebrate genera (Procavia, Vulpes, Orcinus, ..., Tetraodon, Fugu) with support values at the inner nodes]
Let’s extract all well-supported clades, using a support value of 95% or more as the
criterion for being well-supported. In our jargon, the condition would be that a node i)
have a support value in the first place (some nodes don’t, e.g. the root and the LCA of
(Fugu,Tetraodon)), and ii) that this value be no less than 95. The action would simply
be to print out the tree rooted at the current node.
$ nw_luaed -n vrt2cg.nw 'b ~= nil and b >= 95' 's()' \
| nw_display -w 65 -
+----------------------------------------------------+ Homo
|
=| 99 +-------------------------+ Papio
+--------------------------+ 42
+-------------------------+ Hylobates
+------------------+ Procavia
|
+-----+ 42 +------------+ Vulpes
| +-----+ 84
+-----+ 16 +------------+ Orcinus
| |
| +------------------------+ Bradypus
|
| +------------+ Mesocricetus
+-----+ 78 +-----+ 88
| | +-----+ 32 +------------+ Tamias
| | | |
| | | +------------------+ Sorex
| | |
| +-----+ 26 +------------+ Homo
+------+ 71 | |
| | | +-----+ 99 +-----+ Papio
| | | | +------+ 42
| | +-----+ 67 +-----+ Hylobates
| | |
+-----+ 30 | +------------------+ Lepus
| | |
| | +------------------------------------+ Didelphis
=| 100 |
| +-------------------------------------------+ Bombina
|
+-------------------------------------------------+ Tetrao
+----------------------------------------------------+ Danio
|
=| 97 +-------------------------+ Tetraodon
+--------------------------+
+-------------------------+ Fugu
name (Lua)  type     meaning (refers to the current node)
a           integer  number of ancestors
b           number   support value (or nil)
c           integer  number of children (direct descendants)
D           integer  total number of descendants (includes children)
d           number   depth (distance to root)
i           Boolean  true iff node is strictly internal (i.e., not root!)
lbl         string   label
l (ell)     Boolean  true iff node is a leaf
L           number   parent edge length
N           node     the current node itself
r           Boolean  true iff node is the root
Table 2.1: Predefined variables in nw_luaed. Variables b and lbl are both derived from
the label, but b is interpreted as a number, and is undefined if the conversion to a number
fails, or if the node is a leaf. Edge length and depth (L and d) are undefined (not zero!) if
not specified in the Newick tree, as in cladograms.
checks that the support value is no less than 95. Note that the checks occur in that order,
and that if b isn’t defined, the second check isn’t even performed, as it is meaningless.
The third argument, s(), is the action: it specifies what to do when a node meets the
condition – in this case, call function s, which just prints the tree rooted at the current
node.
Conditions
Conditions are Boolean expressions usually involving node properties which are available
as predefined variables. As the program "visits" each node in turn, the variables
are set to the current node’s properties. These predefined variables have short names,
to keep expressions concise. They are shown in table 2.1.
The condition being just a Boolean expression written in Lua, all the logical operators
of the Lua language can be used (indeed, any valid Lua snippet can be used,
provided it evaluates to a Boolean), and you can use parentheses to override operator
precedence or for clarity.
Here are some examples of nw_luaed conditions:
expression                      selects:
l (lowercase ell)               all leaves
l and a <= 3                    leaves with 3 ancestors or fewer
i and (b ~= nil) and (b >= 95)  internal nodes with support ≥ 95%
i and (b ~= nil) and (b < 50)   unsupported nodes (less than 50%)
not r                           all nodes except the root
c > 2                           multifurcating nodes
Notes:
• If it is certain that all nodes do have support, checks such as b ~= nil can be
omitted.
• If an action must be performed on every node, just pass true as the condition.
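To make the idea of a condition concrete, here is a minimal Python sketch. It is not how nw_luaed is implemented: the Node class, the select helper and the sample values are assumptions made for illustration. Each node carries a few of the properties from table 2.1, None plays the role of Lua's nil, and a condition is simply a function that returns a Boolean.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a tree node; names mirror table 2.1."""
    lbl: str                   # label
    a: int                     # number of ancestors
    c: int                     # number of children
    b: Optional[float] = None  # support value; None stands for Lua's nil

    @property
    def l(self) -> bool:       # leaf: no children
        return self.c == 0

    @property
    def r(self) -> bool:       # root: no ancestors
        return self.a == 0

    @property
    def i(self) -> bool:       # strictly internal: neither leaf nor root
        return not self.l and not self.r


def select(nodes, condition):
    """Return the labels of the nodes for which the condition holds."""
    return [n.lbl for n in nodes if condition(n)]


# A made-up handful of nodes, loosely modelled on the (Homo,(Papio,Hylobates)) clade.
nodes = [
    Node("root", a=0, c=2),
    Node("99", a=1, c=2, b=99.0),
    Node("42", a=2, c=2, b=42.0),
    Node("Homo", a=2, c=0),
    Node("Papio", a=3, c=0),
]

# Same logic as: i and (b ~= nil) and (b >= 95)
well_supported = select(nodes, lambda n: n.i and n.b is not None and n.b >= 95)
print(well_supported)  # ['99']
```

As in the Lua version, the test for a missing support value comes first, so the numeric comparison is never attempted on a node without support.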
code  effect                                            modifies tree?
o     splice out the node                               yes
s     print the subtree rooted at the node              no
u     delete ("unlink") the node (and all descendants)  yes
Table 2.2: Predefined actions in nw_luaed. The names are one letter long for convenience
when passing the action on the command line. When called without an argument, these
functions operate on the current node, i.e., s() is the same as s(N), where N means the
current node (see table 2.1).
Table 2.3: Node properties accessible from Lua. rw: read-write, ro: read only. Some
fields have both a short and a long name: the former is intended for use on the command
line (where space is at a premium), the latter for use in scripts (but both can be used
anywhere). Note that when referring to the current node, the predefined variables (see
table 2.1) are even more concise, e.g. N.len or N.L can be written just L, but they are
read-only.
Actions
Actions are arbitrary Lua expressions. These will typically involve printing out data or
altering node properties or even tree structure. nw_luaed predefines a few functions
for such purposes (table 2.2), and you can of course write your own (unless the function
is very short, this is easier if you pass the Lua code in a file, see 2.18.3).
nw_luaed defines a "node" type, and the current node is always accessible as variable
N (other nodes can be obtained through node properties, see below). Node properties
can be accessed as fields in a Lua table. Table 2.3 lists the available node fields.
So for example the parent of the current node is expressed by N.par; doubling its
length could be N.par.len = N.par.len * 2.
Examples
Opening Poorly-supported Nodes When a node has low support, it may be better to
splice it out from the tree, reflecting uncertainty about the true topology. Consider the
following tree, HRV_cg.nw:
hook name   called...
start_run   before processing any tree
start_tree  for each tree, before processing
node        for each node
stop_tree   for each tree, after processing
stop_run    after processing all trees
Table 2.4: Hooks defined by nw_luaed. If a function named start_tree is defined, it
will be called once per tree, before the tree is processed; etc. If a hook is not defined, no
action is performed on the corresponding occasion. Strictly speaking, start_run is not
really necessary, as the file is evaluated before the run anyway, but it seems cleaner to
provide a start-of-the-run hook as well.
[Figure: radial SVG rendering of HRV_cg.nw, with support values shown at the inner nodes]
[Figure: radial SVG rendering of the same tree after the poorly supported nodes have been spliced out]
Now COXB2 and ECHO6 are siblings of ECHO1, forming a node with 90% support. What
this means is that the original tree strongly supports that these three species form a
clade, but is much less clear about the relationships within the clade. Opening the
nodes makes this fact clear by creating multifurcations. Likewise, the lower right of the
figure is now occupied by a highly multifurcating (8 children) but perfectly supported
(100%) node, none of whose descendants has less than 80% support.
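What splicing out does to the topology can be pictured with a short Python sketch. This is only an illustration under stated assumptions: the Node class and the splice_out and to_newick helpers are hypothetical, branch lengths are ignored, and the tree fragment (including the support value 35 and the leaf called outgroup) is made up for the example.

```python
class Node:
    """Hypothetical minimal tree node (no branch lengths)."""

    def __init__(self, label="", children=None):
        self.label = label
        self.children = list(children or [])
        self.parent = None
        for child in self.children:
            child.parent = self


def splice_out(node):
    """Remove node and attach its children to its parent, in the same position."""
    parent = node.parent
    if parent is None:
        raise ValueError("cannot splice out the root")
    pos = parent.children.index(node)
    parent.children[pos:pos + 1] = node.children
    for child in node.children:
        child.parent = parent
    node.parent = None
    node.children = []


def to_newick(node):
    """Serialize the subtree rooted at node (labels only)."""
    if not node.children:
        return node.label
    return "(" + ",".join(to_newick(c) for c in node.children) + ")" + node.label


# Made-up fragment: a weak node (35) nested inside a well-supported one (90).
weak = Node("35", [Node("COXB2"), Node("ECHO6")])
clade = Node("90", [weak, Node("ECHO1")])
root = Node("", [clade, Node("outgroup")])

before = to_newick(root) + ";"
splice_out(weak)
after = to_newick(root) + ";"
print(before)  # (((COXB2,ECHO6)35,ECHO1)90,outgroup);
print(after)   # ((COXB2,ECHO6,ECHO1)90,outgroup);
```

After the call, the three leaves are siblings under the node with 90% support, which is the kind of multifurcation described above.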
Formatting Lengths Some phylogeny programs return Newick trees with an unreal-
istic number of decimal places. For example, the HRV.nw tree has six:
):0.936634
):0.770246
):0.051896
):0.438878
):1.235120,
COXA14_1:0.121281
):0.544944,
COXA6_1:0.675458,
COXA2_1:0.557975
);
Here I use nw_indent to show each node on a line for clarity, and show only the last
ten.16 To format17 the lengths to two decimal places, do the following:
):0.94
):0.77
):0.05
):0.44
):1.24,
COXA14_1:0.12
):0.54,
COXA6_1:0.68,
COXA2_1:0.56
);
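The same rounding can be sketched outside nw_luaed. The following Python snippet is an illustration under stated assumptions, not part of the Newick Utilities: it treats every ':number' in the Newick text as a branch length (which would be wrong for colons inside quoted labels), and the sample tree is a made-up arrangement of a few of the labels and lengths shown above.

```python
import re

# A colon followed by a decimal number, optionally with an exponent.
LENGTH = re.compile(r":(\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def format_lengths(newick, places=2):
    """Rewrite every branch length in a Newick string with a fixed number of decimals."""
    def repl(match):
        return ":" + format(float(match.group(1)), ".%df" % places)
    return LENGTH.sub(repl, newick)


# Hypothetical tree built from values that appear in the listing above.
tree = "(COXA14_1:0.121281,(COXA6_1:0.675458,COXA2_1:0.557975):0.544944);"
formatted = format_lengths(tree)
print(formatted)  # (COXA14_1:0.12,(COXA6_1:0.68,COXA2_1:0.56):0.54);
```

Note that the digits in a label such as COXA14_1 are left alone, because only numbers that directly follow a colon are touched.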
Multiplying lengths by a constant It may be necessary to have two trees which only
differ by a constant multiple of the branch lengths. This can be used, for example, to test
competing hypotheses about evolution rates. Here is our good friend the Catarrhinine
tree again:
$ nw_display -s catarrhini
16 The first ten lines contain only opening parentheses.
17 nw_sched automatically loads the format module so that the full-fledged format function is available.
[Figure: SVG rendering of the catarrhini tree with branch lengths (Gorilla 16, Pongo 30, Colobus 7, etc.); scale bar in substitutions/site, 0 to 60]
[Figure: SVG rendering of the same tree after multiplying the branch lengths (Gorilla 56, Pongo 105, Colobus 24.5, etc.); scale bar in substitutions/site, 0 to 200]
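The multiplier itself is not stated in the surviving text, but it can be worked out from the branch lengths printed in the two figures. The short Python check below pairs the distinct lengths of the original tree with their counterparts in the scaled tree (the pairing is read off the figures by position and is therefore an assumption) and confirms that a single factor explains all of them.

```python
# Distinct branch lengths in the original figure and their scaled counterparts.
original = [16, 15, 10, 30, 20, 25, 5, 7]
scaled = [56, 52.5, 35, 105, 70, 87.5, 17.5, 24.5]

# If the trees differ only by a constant multiple, all ratios must be equal.
factors = {s / o for o, s in zip(original, scaled)}
print(factors)  # {3.5}

factor = factors.pop()
rescaled = [o * factor for o in original]
print(rescaled == scaled)  # True
```

So, under that pairing, the second tree is the first one with every branch length multiplied by 3.5.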
Implementing other Newick Utilities nw_luaed can emulate other programs in the
package, when these iterate on every node and perform some action. There is no real
reason to use nw_luaed rather than the original, since nw_luaed will be slower (after
all, it has to start the Lua interpreter, parse the Lua expressions, etc.). But these
"equivalents" can serve as illustration.
The lbl ~= "" condition in the nw_labels replacements is checked because the
original nw_labels does not print empty labels. In the nw_topology replacement,
the check for node type (l) is done in the action rather than the condition, because
there is some code that is performed for every node and some additional code only for
non-leaves.
A tree counter As you know by now, the Newick Utilities are able to process files
that contain any number of trees. But just how many trees are there in a file? If you’re
certain that there is exactly one tree per line, you just use wc -l. But the Newick
format allows trees to span more than one line, or conversely there may be more than
one tree per line; moreover there may be blank lines. All these conspire to yield wrong
tree counts. To solve this, we write a tree counter in Lua, and pass it to nw_luaed. Here
is the counter:
$ cat count_trees.lua
function start_run()
    count = 0
end
function stop_tree()
    count = count + 1
end
function stop_run()
    print(count)
end
As you can see, I’ve defined three of the five possible hooks. Before any tree is
processed, start_run is called, which defines variable count and initializes it to zero.
After each tree is processed (actually, no processing is done, since the node hook is not
defined), function stop_tree is called, which increments the counter. And after the
last tree has been processed, the stop_run hook is called, which just prints out the
count.
Here it is in action. First, the simple case of one tree per line:
$ wc -l forest
4 forest
$ nw_luaed -n -f count_trees.lua forest
4
Right. Now how about this one: these are the same trees as in forest, but all on a
single line:
$ wc -l jungle
1 jungle
$ nw_luaed -n -f count_trees.lua jungle
4
nw_luaed is not fooled! And this is the opposite case – an indented tree, which has one
node per line:
$ nw_indent catarrhini | wc -l
31
$ nw_indent catarrhini | nw_luaed -n -f count_trees.lua -
1
There’s no confusing our tree counter, it seems. Note that in future versions I might
well make this unnecessary by supplying a predefined variable which counts the input
trees, akin to Awk’s NR.
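To see why counting lines is unreliable while counting tree terminators is not, here is a small Python sketch. It is not how nw_luaed counts trees: it simply counts the semicolons that end a tree, skipping any that occur inside quoted labels or bracketed comments. The function name and the sample strings are assumptions made for the example.

```python
def count_trees(text):
    """Count Newick trees by their terminating ';', ignoring quotes and [comments]."""
    count = 0
    in_quote = False
    comment_depth = 0
    for ch in text:
        if in_quote:
            # A doubled quote ('') closes and immediately reopens, which is harmless.
            if ch == "'":
                in_quote = False
        elif comment_depth:
            if ch == "[":
                comment_depth += 1
            elif ch == "]":
                comment_depth -= 1
        elif ch == "'":
            in_quote = True
        elif ch == "[":
            comment_depth = 1
        elif ch == ";":
            count += 1
    return count


forest = "(A,B);\n(C,D);\n(E,F);\n(G,H);\n"   # one tree per line
jungle = "(A,B);(C,D);(E,F);(G,H);\n"         # same trees on a single line
indented = "(\n  A,\n  (\n    B,\n    C\n  )\n);\n"  # one tree over several lines

print(count_trees(forest), count_trees(jungle), count_trees(indented))  # 4 4 1
```

The three inputs have 4, 1 and 7 lines respectively, yet the counts are 4, 4 and 1, as in the examples above.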
Numbering inner nodes I was once handed a tree with the task of numbering the
inner nodes, starting close to the leaves and ending at the root.18 Here is a tree with
unlabeled inner nodes (I hide the branch lengths lest they obscure the inner node labels,
which will also be numeric):
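[Figure: SVG rendering of the tree of raptor genera (Pandion, Accipiter, Buteo, Aquila, Haliaeetus, Milvus, Elanus, Sagittarius, Micrastur, Falco, Polyborus, Milvago) with unlabeled inner nodes; scale bar in substitutions/site, 0 to 7.5]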
$ cat number_inodes.lua
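-- reset the counter for each tree (n must be a number before the first increment)
function start_tree()
    n = 0
end
-- label each non-leaf node with the next number, in visiting order
function node()
    if not l then n = n + 1; N.lbl = n end
end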
[Figure: SVG rendering of the same tree, with the inner nodes now labeled 1 to 9, the root being number 9; scale bar in substitutions/site, 0 to 7.5]
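The numbering scheme can also be sketched in a few lines of Python. This is an illustration, not a replacement for the Lua hook: trees are represented as nested tuples of leaf names (an assumption made for brevity), and inner nodes are numbered as soon as all their children have been visited, so that numbers grow from the leaves towards the root.

```python
def number_inner_nodes(tree, counter=None):
    """Return Newick text in which inner nodes are labeled 1, 2, ... in Newick order.

    tree is either a leaf name (str) or a tuple of subtrees.
    """
    if counter is None:
        counter = [0]          # mutable cell shared by the recursive calls
    if isinstance(tree, str):
        return tree            # leaves keep their labels
    parts = [number_inner_nodes(child, counter) for child in tree]
    counter[0] += 1            # children are done, so this node gets the next number
    return "(" + ",".join(parts) + ")" + str(counter[0])


# Made-up five-leaf tree.
tree = (("A", "B"), ("C", ("D", "E")))
labeled = number_inner_nodes(tree) + ";"
print(labeled)  # ((A,B)1,(C,(D,E)2)3)4;
```

Every inner node receives a number larger than those of its descendants, and the root gets the largest one.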
Extracting deep, well-supported clades In the first example of this section (2.18.3),
we extracted well-supported clades, but there was an overlap because one well-supported
clade was a subclade of another. We may want to extract only the "deepest" clades that
meet the condition; in other words, once a node has been found to match, its descendants
should not be processed. This is the purpose of option -o. For this option to
be useful, though, the tree must be processed from the root to the leaves, which is the
opposite of the default (namely, Newick order). To override this, we pass option -r
("reverse"):
+----------------------------------------------------+ Danio
|
=| 97 +-------------------------+ Tetraodon
+--------------------------+
+-------------------------+ Fugu
+------------------+ Procavia
|
+-----+ 42 +------------+ Vulpes
| +-----+ 84
+-----+ 16 +------------+ Orcinus
| |
| +------------------------+ Bradypus
|
| +------------+ Mesocricetus
+-----+ 78 +-----+ 88
| | +-----+ 32 +------------+ Tamias
| | | |
| | | +------------------+ Sorex
| | |
| +-----+ 26 +------------+ Homo
+------+ 71 | |
| | | +-----+ 99 +-----+ Papio
| | | | +------+ 42
| | +-----+ 67 +-----+ Hylobates
| | |
+-----+ 30 | +------------------+ Lepus
| | |
| | +------------------------------------+ Didelphis
=| 100 |
| +-------------------------------------------+ Bombina
|
+-------------------------------------------------+ Tetrao
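The effect of visiting from the root and not descending below a match can be sketched in Python as follows. This is not nw_luaed's code: the dictionary-based nodes, the mk helper and the simplified tree (a small subset of the vertebrate tree, with inner nodes named after their support values) are assumptions for illustration.

```python
def mk(name, b=None, children=()):
    """Build a hypothetical node: a name, a support value (or None) and children."""
    return {"name": name, "b": b, "children": list(children)}


def well_supported(node):
    return node["b"] is not None and node["b"] >= 95


def all_matches(node, condition):
    """Every matching node, including matches nested inside other matches."""
    found = [node["name"]] if condition(node) else []
    for child in node["children"]:
        found.extend(all_matches(child, condition))
    return found


def deepest_matches(node, condition):
    """Visit from the root; once a node matches, do not look at its descendants."""
    if condition(node):
        return [node["name"]]
    found = []
    for child in node["children"]:
        found.extend(deepest_matches(child, condition))
    return found


tree = mk("root", children=[
    mk("97", 97, [mk("Danio"), mk("", None, [mk("Tetraodon"), mk("Fugu")])]),
    mk("100", 100, [
        mk("99", 99, [mk("Homo"), mk("42", 42, [mk("Papio"), mk("Hylobates")])]),
        mk("Tetrao"),
    ]),
])

print(all_matches(tree, well_supported))      # ['97', '100', '99']
print(deepest_matches(tree, well_supported))  # ['97', '100']
```

The clade with 99% support is reported by the first function but not by the second, because it lies inside the clade with 100% support.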
Future
I intend to develop nw_luaed further. Among the items in my TODO list are a few
new predefined variables (number of records, root of the tree), more powerful
structure-altering functions, etc.
2.18.4 nw_ed
Note: it is likely that nw_luaed (2.18.3) will be more useful than nw_ed. See also section
2.18 for a general intro to the stream editing programs. This section gives a minimal
description of nw_ed, without
The two parameters of nw_ed (besides the input file) are the condition and the action.
Conditions
Conditions are logical expressions involving node properties. They are composed of
numbers, logical operators, and node functions. The functions have one-letter names,
to keep expressions short (after all, they are passed on the command line). There are
two types, numeric and Boolean.
The logical and relational operators work as expected. Here is the list, in order of
precedence, from tightest to loosest-binding. In any case, you can use parentheses to
override precedence, so don’t worry.
symbol  operator
!       logical negation
==      equality
!=      inequality
<       less than
>       greater than
>=      greater than or equal to
<=      less than or equal to
&       logical and
|       logical or
expression     selects:
l              all leaves
l & a <= 3     leaves with 3 ancestors or fewer
i & (b >= 95)  internal nodes with support of 95% or more
i & (b < 50)   unsupported nodes (less than 50%)
!r             all nodes except the root
c > 2          multifurcating nodes
Actions
The actions are also coded by a single letter, for the same reason. The following are
implemented:
I have no plans to implement any other actions, as this can be done easily with
nw_luaed (or nw_sched).
2.18.5 nw_sched
Note: it is likely that nw_luaed (2.18.3) will be more convenient than nw_sched. See
also section 2.18 for a general intro to the stream editing programs. This section gives
a minimal description of nw_sched, with no motivation and only a few examples (see
2.18.3 for more).
As mentioned above, nw_sched works like nw_luaed, but uses Scheme instead of Lua.
Accordingly, the condition and action are passed as a Scheme expression. The Scheme
language has a simple syntax, but it can be slightly surprising at first. To understand
the following examples, you just need to know that operators precede their arguments,
as do function names, so that the sum of 2 and 2 is written (+ 2 2), the sine of x is
(sin x), (< 3 2) is false, etc.
As a first example, let’s again extract all well-supported clades from the tree of vertebrate
genera, as we did with nw_luaed.
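For readers who have never met prefix notation, the rule "operator first, then its arguments" can be mimicked in Python with nested tuples. This toy evaluator is only an illustration of the reading order; it is not Scheme, it knows only the four operators listed in OPS (an arbitrary choice), and unlike Scheme's + it accepts exactly two operands.

```python
import math
import operator

# Hypothetical, deliberately tiny operator table.
OPS = {
    "+": operator.add,
    "<": operator.lt,
    ">=": operator.ge,
    "sin": math.sin,
}


def evaluate(expr):
    """Evaluate a prefix expression written as (operator, arg1, arg2, ...)."""
    if isinstance(expr, tuple):
        op, *args = expr
        return OPS[op](*[evaluate(arg) for arg in args])
    return expr  # a plain number evaluates to itself


print(evaluate(("+", 2, 2)))               # 4      like (+ 2 2)
print(evaluate(("<", 3, 2)))               # False  like (< 3 2)
print(evaluate((">=", ("+", 90, 9), 95)))  # True   like (>= (+ 90 9) 95)
```

The last line shows that the first operand is the one being compared: (>= b 95) asks whether b is at least 95.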
+----------------------------------------------------+ Homo
|
=| 99 +-------------------------+ Papio
+--------------------------+ 42
+-------------------------+ Hylobates
+------------------+ Procavia
|
+-----+ 42 +------------+ Vulpes
| +-----+ 84
+-----+ 16 +------------+ Orcinus
| |
| +------------------------+ Bradypus
|
| +------------+ Mesocricetus
+-----+ 78 +-----+ 88
| | +-----+ 32 +------------+ Tamias
| | | |
| | | +------------------+ Sorex
| | |
| +-----+ 26 +------------+ Homo
+------+ 71 | |
| | | +-----+ 99 +-----+ Papio
| | | | +------+ 42
| | +-----+ 67 +-----+ Hylobates
| | |
+-----+ 30 | +------------------+ Lepus
| | |
| | +------------------------------------+ Didelphis
=| 100 |
| +-------------------------------------------+ Bombina
|
+-------------------------------------------------+ Tetrao
+----------------------------------------------------+ Danio
|
=| 97 +-------------------------+ Tetraodon
+--------------------------+
+-------------------------+ Fugu
The expression ((& (def? 'b) (>= b 95)) (s)) parses as follows:
• the first element (or car, in Scheme parlance), (& (def? 'b) (>= b 95)),
is the selector. It is a Boolean expression, namely a conjunction (&)19 of the expressions
(def? 'b) and (>= b 95). The former checks that variable b (bootstrap
support) is defined20 , and the latter is true iff b is not smaller than 95.
19 & is a short name for the Scheme form and, which is defined by nw_sched to allow for shorter expressions [...]
[...] zero, which is why one has to check that b is defined before using it. def? is just a shorter name for
defined?.
• the second element (cadr in Scheme jargon), (s), is the action – in this case, a
call to function s, which has the same meaning as action s in nw_ed, namely to
print out the subclade rooted at the current node.
Selectors
Like nw_ed conditions, nw_sched selectors are Boolean expressions normally involving
node properties which are available as predefined variables. As the program "visits"
each node in turn, the variables are set to reflect the current node’s properties. As in
nw_ed, the variables have short names, to keep expressions concise. The predefined
variables are shown in the table below.
name     type      meaning
a        integer   number of ancestors
b        rational  support value
c        integer   number of children (direct descendants)
D        integer   total number of descendants (includes children)
d        numeric   depth (distance to root)
i        Boolean   true iff node is strictly internal (i.e., not root!)
lbl      string    label
l (ell)  Boolean   true iff node is a leaf
L        rational  parent edge length
r        Boolean   true iff node is the root
Variables b and lbl are both derived from the label, but b is interpreted as a number,
and is undefined if the conversion to a number fails, or if the node is a leaf. Edge length
and depth (L and d) are undefined (not zero!) if not specified in the Newick tree, as in
cladograms.
Whereas nw_ed defines logical and relational operators, nw_sched just uses those
of the Scheme language. It just defines a few shorter names to help keep command
lines compact:
expression                  selects:
l (lowercase ell)           all leaves
(& l (<= a 3))              leaves with 3 ancestors or fewer
(& i (def? 'b) (>= b 95))   internal nodes with support of 95% or more
(& i (def? 'b) (< b 50))    unsupported nodes (less than 50%)
(! r)                       all nodes except the root
(> c 2)                     multifurcating nodes
When it is clear that all inner nodes will have a defined support value, one can leave
out the (def? 'b) clause.
Actions
Actions are arbitrary Scheme expressions, so they are much more flexible than the fixed
actions defined by nw_ed. nw_sched defines most of them, as well as a few new ones,
as Scheme functions21 :
code        effect                                            modifies tree?
L! <len>    sets the node’s parent-edge length to len         yes
lbl! <lbl>  sets the node’s label to lbl                      yes
o           splice out the node                               yes
p <arg>     print arg, then a newline                         no
s           print the subtree rooted at the node              no
u           delete ("unlink") the node (and all descendants)  yes
The l action of nw_ed, which prints the current node’s label, can be achieved in nw_sched
with the more general p function: (p lbl).
The L! function sets the current node’s parent-edge length. It accepts a string or
a number. If the argument is a string, it attempts to convert it to a number. If this
fails, the edge length is undefined. The lbl! function sets the current node’s label. Its
argument is a string.
Future
I do not plan to develop nw_sched any more, because in my opinion nw_luaed is
better. I will probably drop it eventually, but not immediately.
21 Note that you must use Scheme’s function call syntax to call the function, i.e., (function [args...]).
Chapter 3
Advanced Tasks
The tasks presented in this chapter are more complex than those of chapter 2, and generally
involve many Newick Utilities as well as other programs.
[Figure: radial SVG rendering of the tree of raptor genera (Polyborus, Micrastur, Sagittarius, Pandion, Elanus, Accipiter, Milvus, Buteo, Aquila, Haliaeetus, etc.)]
Now I also have the following information about the family to which each genus belongs:
Genus Family
Accipiter Accipitridae
Aquila Accipitridae
Buteo Accipitridae
Elanus Accipitridae
Falco Falconidae
Haliaeetus Accipitridae
Micrastur Falconidae
Milvago Falconidae
Milvus Accipitridae
Pandion Pandionidae
Polyborus Falconidae
Sagittarius Sagittariidae
Let’s see if the tree is consistent with this information. If it is, all families should
form clades. To check this, I will rename each leaf by replacing the genus name by
the family name, then condense the tree. If the original tree is consistent, the final tree
should have one leaf per family.
First, I create a renaming map (see 2.9) based on the above information (here are the
first three lines):
$ head -3 falc_map
Accipiter Accipitridae
Buteo Accipitridae
Aquila Accipitridae
[Figure: the renamed and condensed tree, with four leaves: Pandionidae, Accipitridae, Sagittariidae, Falconidae]
As we can see, there is one leaf per family, so the above information is consistent with
the tree.
Let’s see if common English names are also consistent with the tree. Here is one
possible table of vernacular names of the raptor genera:
Genus English name
Accipiter hawk (sparrowhawk, goshawk, etc)
Aquila eagle
Buteo hawk
Elanus kite
Falco falcon
Haliaeetus eagle (sea eagle)
Micrastur falcon (forest falcon)
Milvago caracara
Milvus kite
Pandion osprey
Polyborus caracara
Sagittarius secretary bird
[Figure: the condensed tree of common names, with seven leaves: osprey, hawk, eagle, kite, secretary bird, falcon, caracara]
So the above common names are consistent with the tree. However, some species
have many common names. For example, the Buteo hawks are often called "buzzards"
(in Europe), and two species of falcons have been called "hawks" (in North America):
the peregrine falcon (Falco peregrinus) was called the "duck hawk", and the American
kestrel (Falco sparverius) was called the "sparrow hawk".1 If we map these common
names to the tree and condense, we get this:
Accipiter nisus. To add to the confusion, the specific name sparverius looks like the English word "sparrow",
and also resembles the common name of Accipiter nisus in many other languages: épervier (fr), Sperber (de),
sparviere (it). Oh well. Let’s not drop scientific names just yet!
[Figure: the condensed tree of alternative common names, with nine leaves: osprey, hawk, buzzard, eagle, kite, secretary bird, falcon, hawk, caracara]
Distinguishing buzzards from other hawks fits well with the tree. On the other hand,
calling a falcon a hawk does not, hence the name "hawk" appears in two different
places.
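The rename-then-condense test can be sketched in Python. This is an illustration of the reasoning only, not the Newick Utilities' code: trees are nested tuples of leaf names, the three helper functions are hypothetical, and the seven-genus topology is a made-up simplification (the real tree has twelve genera). A clade is condensed when all of its children have become leaves with the same label, and a set of names is consistent with the tree if, after condensing, no name occurs more than once.

```python
def rename(tree, mapping):
    """Replace each leaf name according to mapping (unknown names are kept)."""
    if isinstance(tree, str):
        return mapping.get(tree, tree)
    return tuple(rename(child, mapping) for child in tree)


def condense(tree):
    """Collapse every clade whose (condensed) children are leaves with one common label."""
    if isinstance(tree, str):
        return tree
    children = tuple(condense(child) for child in tree)
    first = children[0]
    if isinstance(first, str) and all(child == first for child in children):
        return first
    return children


def leaves(tree):
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


# Hypothetical simplified topology.
tree = ("Pandion", ((("Accipiter", "Buteo"), "Aquila"), ("Sagittarius", ("Falco", "Micrastur"))))

families = {"Pandion": "Pandionidae", "Accipiter": "Accipitridae", "Buteo": "Accipitridae",
            "Aquila": "Accipitridae", "Sagittarius": "Sagittariidae",
            "Falco": "Falconidae", "Micrastur": "Falconidae"}
# Common names in which a falcon is called a hawk.
names = {"Pandion": "osprey", "Accipiter": "hawk", "Buteo": "hawk", "Aquila": "eagle",
         "Sagittarius": "secretary bird", "Falco": "hawk", "Micrastur": "falcon"}

by_family = leaves(condense(rename(tree, families)))
by_name = leaves(condense(rename(tree, names)))
print(by_family)  # ['Pandionidae', 'Accipitridae', 'Sagittariidae', 'Falconidae']
print(by_name)    # ['osprey', 'hawk', 'eagle', 'secretary bird', 'hawk', 'falcon']
```

Each family appears once, so the families are consistent with this tree, whereas "hawk" appears twice, which is the inconsistency discussed above.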
[Figure: Bootscanning of HRV_3UTR.dna WRT CL073908, slice size 300 nt. x axis: position of slice centre in alignment [nt], 100 to 800; y axis: distance to reference [subst./site], 0 to 0.35; one curve each for HRV-58, HRV-88, HRV-7, HRV-89, HRV-36, HRV-9, HRV-32 and HRV-67]
until position 450 or so, the query sequence’s nearest relatives (in terms of substitutions/site)
are HRV-36 and HRV-89. After that point, it is HRV-67. This suggests that there
is a recombination breakpoint near position 450.
The script uses nw_reroot to reroot the trees on the outgroup, nw_clade and
nw_labels to get the labels of the ingroup, nw_distance to extract the distance between
the query and the other sequences, as well as the usual sed, grep, etc. The plot
is done with gnuplot.
[Figure: three trees, star, balanced and short_leaves, each with six leaves labeled A to F]
they have the same depth and the same number of leaves. But their shapes are very
different, and they tell different biological stories. If we assume that they are clocklike
(i.e., that the mutation rate is constant over the whole tree), star shows an early
radiation, short_leaves shows two stable lineages ending in recent branching, while
balanced shows branching spread over time.
The nodes-vs-depth graphs for these trees are as follows:
[Figure: Number of Nodes as a function of Depth in star. x axis: Tree Depth, 0 to 8; y axis: # Nodes]
[Figure: second nodes-vs-depth graph (title not recoverable). x axis: Tree Depth, 0 to 8; y axis: # Nodes]
[Figure: Number of Nodes as a function of Depth in short_leaves. x axis: Tree Depth, 0 to 8; y axis: # Nodes]
The graphs show the (normalized) area under the curve: it is close to 1 for star-like
trees, close to 0 for trees with very short leaves, and intermediary for more balanced
trees.
The images were made with the nodes_vs_clades.sh script (in directory src), in
the following way:
$ nodes_vs_clades.sh star 40
where 40 is just the sampling density (how many points to take on the x axis). The
script uses nw_distance to get the tree’s depth, nw_ed to sample the number of nodes
at a given depth, and nw_indent to count the leaves, plus the usual awk and friends.
The plot is done with gnuplot.
Chapter 4
Python Bindings
Although the Newick Utilities are primarily designed for shell use, it is also possible
to use their functions from Python programs: all the core functionality of the utilities
is bundled in a C library, libnw, which can be accessed through Python’s ctypes
module. The distribution contains a file, newick_utils.py, that provides the Python
to C mappings; it also builds an object-oriented interface over it.
Let’s say we want to add a utility that prints simple statistics about trees, like the
number of nodes, the depth, whether it is a cladogram or a phylogram, etc. (in other
words, a Python version of nw_stats). We will call it nw_info.py, and we’ll pass it a
Newick file on standard input, so the usage will be something like:
The overall structure of this program is simple: iteratively read all the input trees, and
do something with each of them:
1 from newick_utils import *
2
3 for tree in Tree.parse_newick_input():
4     pass # process tree here!
Line 1 imports definitions from the newick_utils.py module. Line 3 is the main
loop: the Tree.parse_newick_input method reads standard input and yields an instance of
class Tree for each Newick string. We can now work with it, using methods of class
Tree or adding our own:
#!/usr/bin/env python

from newick_utils import *

def count_polytomies(tree):
    """Count the nodes that have more than two children."""
    count = 0
    for node in tree.get_nodes():
        if node.children_count() > 2:
            count += 1
    return count

for tree in Tree.parse_newick_input():
    tree_type = tree.get_type()  # avoid shadowing the built-in type()
    if tree_type == 'Phylogram':
        # also sets nodes' depths
        depth = tree.get_depth()
    else:
        depth = None
    # %-formatting gives the same output under Python 2 and Python 3
    print('Type: %s' % tree_type)
    print('#Nodes: %s' % len(list(tree.get_nodes())))
    print(' #leaves: %s' % tree.get_leaf_count())
    print('#Polytomies: %s' % count_polytomies(tree))
    print('Depth: %s' % depth)
When we run the program, we get:
As you can see, most of the work is done by methods called on the tree object,
such as get_leaf_count which (surprise!) returns the number of leaves of a tree.
But since there is no method for counting polytomies, we added our own function,
count_polytomies, which takes a Tree object as argument.
As another example, a simple implementation of nw_reroot is found in src/nw_reroot.py.
It demonstrates two approaches: a heavily object-oriented one, in which the user mainly
calls methods on Python objects, and a "thin" one, in which the calls are essentially to
C functions through libnw. While not as fast as nw_reroot, its performance is still
quite acceptable, especially in "thin" mode.
Appendix A
When you need to specify a clade using the Newick Utilities, you either give the label
of the clade’s root, or the labels of (some of) its descendants. Since inner nodes rarely
have labels (or worse, have unusable labels like bootstrap support values), you will
often need to specify clades by their descendants. Consider the following tree:
[Figure: SVG rendering of the catarrhini tree with branch lengths and inner node labels (Homininae, Hominini, Hominidae, Cercopithecinae, Cercopithecidae, Colobinae)]
Suppose we want to specify the Hominoidea clade - the apes. It is the clade that contains
Homo, Pan (chimps), Gorilla, Pongo (orangutan), and Hylobates (gibbons).
The clade is not labeled in this tree, but this list of labels defines it without ambiguity. In
fact, we can define it unambiguously using just Hylobates and Homo - or Hylobates
and any other label. The point is that you never need more than two labels to unambiguously
define a clade.
You cannot choose any two nodes, however: the condition is that the last common
ancestor of the two nodes be the root of the desired clade. For instance, if you used
Pongo instead of Hylobates, you would define the Hominidae clade, leaving out the
gibbons.
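The rule can be checked with a few lines of Python. The sketch below is an illustration under stated assumptions: the parent table encodes the ape half of the tree as it can be read from the figure, "(apes)" and "(root)" are hypothetical names for the unlabeled Hominoidea node and the root, and lca is a helper written for this example, not a Newick Utilities function.

```python
# child -> parent, for the ape half of the catarrhini tree (topology assumed from the figure)
PARENT = {
    "Pan": "Hominini", "Homo": "Hominini",
    "Hominini": "Homininae", "Gorilla": "Homininae",
    "Homininae": "Hominidae", "Pongo": "Hominidae",
    "Hominidae": "(apes)", "Hylobates": "(apes)",
    "(apes)": "(root)",
}


def path_to_root(node):
    """The node itself followed by all its ancestors, ending at the root."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def lca(a, b):
    """Last common ancestor: the first node on b's path that also lies on a's path."""
    on_a_path = set(path_to_root(a))
    for node in path_to_root(b):
        if node in on_a_path:
            return node
    raise ValueError("nodes are not in the same tree")


print(lca("Hylobates", "Homo"))  # (apes)
print(lca("Hylobates", "Pan"))   # (apes)
print(lca("Pongo", "Homo"))      # Hominidae
```

Any pair that includes Hylobates yields the unlabeled ape node, while replacing Hylobates with Pongo yields Hominidae, in line with the text.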
Appendix B
Newick order
There are many ways of visiting a tree. One can start from the root and proceed to
the leaves, or the other way around. One can visit a node before its children (if any),
or after them. Unless specified otherwise, the Newick Utilities process trees as they
appear in the Newick data. That is, for tree (A,(B,C)d)e; the order will be A, B, C,
d, e.
This means that a child always comes before its parent, and in particular, that the
root comes last. This is known as post-order traversal, but we’ll just call it
"Newick order".
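The order is easy to reproduce. In this minimal Python sketch (the tuple-based tree representation and the function name are assumptions for illustration), a node is a pair made of a label and a list of children, and each node is visited only after all of its children.

```python
def newick_order(node):
    """Yield labels so that every child comes before its parent."""
    label, children = node
    for child in children:
        yield from newick_order(child)
    yield label


# The tree (A,(B,C)d)e; from the text.
tree = ("e", [("A", []), ("d", [("B", []), ("C", [])])])
order = list(newick_order(tree))
print(order)  # ['A', 'B', 'C', 'd', 'e']
```

The labels come out in exactly the order in which they appear in the Newick string, with the root last.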
Appendix C
autotools – that’s what stable releases are for, among other things.
C.2.3 Build Procedure
The package uses the GNU autotools, like many other open source software packages.3
So all you need to do is the usual
$ tar xzf newick-utils-x.y.z.tar.gz
$ cd newick-utils-x.y.z
$ ./configure
$ make
$ make check
# make install
The make check is optional, but you should try it anyway. Note that the nw_gen test
may fail - this is due to differences in pseudo-random number generators, as far as I
can tell.
With non-stable releases, it may be necessary to reconfigure (this generally does not
happen when using the tarball generated by the build system). So if you get weird
error messages, try the following (you’ll need the GNU autotools):
$ autoreconf -i
or even
$ autoreconf -fi
Variants
To prevent the use of libxml, pass --without-libxml to ./configure. Likewise,
pass --without-guile or --without-lua to prevent the use of Guile or Lua, respectively.
If you have headers (such as Guile or LibXML’s) in a non-standard location, pass
that location via the CPPFLAGS environment variable when running ./configure.
Likewise, if you have libraries in a non-standard location, use LDFLAGS. The syntax is
that of the -I and -L options to gcc, respectively. For example,
LDFLAGS='-L/opt/lib' CPPFLAGS='-I/opt/include' ./configure
would cause /opt/lib and /opt/include to be searched for libraries and headers,
respectively. Note that there is no space between the -I and the /opt/..., etc.
C.3 As binaries
Since version 1.1, there are also binaries for some platforms. The name of the archive
matches newick-utils-<version>-<platform>-<enabled|disabled>-extra.tar.gz.
"enabled-extra" means that the binary depends on optional software (see C.2.2) and
will expect to find it (as shared libs) on your system. "disabled-extra" means that the
binary will not depend on those libraries, but of course the corresponding functionality
(see C.2.2) won’t be available. Simply do:
The binaries are in src. You can copy/move the binaries wherever it suits you.4 You
can check the binaries by running test_binaries.sh (though this is not as strict as
running the whole test suite after compiling).
C.4 Versions
Here are the versions I use (as reported by passing --version to the program listed
in column 2):
tool program version required for
4 Ideally, this should be done with make install, but for some reason this doesn’t seem to work with
Appendix D
Changes
The base version was 1.3.5, the first one that was published. Changes are relative to
the previous one, or to v. 1.3.5. Minor changes (that don’t affect the user at all) are not
listed.
Bibliography