Lesson on phylogenetics in R- October 12 2017 by Liz Miller: ecmiller@email.arizona.edu

In this script, I will go over basics of reading trees in R, what the tree object looks like, some common functions for manipulating trees, and simple plotting. At the end, we will do a PGLS analysis. All of the libraries here have tons of additional functionality that I couldn’t cover today! The goal is to get you started with using phylogenies in R.

Places to find help:

Essential packages! Install if you haven’t already:

library(ape)   #the "original" one. Needed to read and plot trees in R
## Warning: package 'ape' was built under R version 3.4.3
# ape stands for "analysis of phylogenetics and evolution"

library(geiger)   #expands capabiltiies of ape, some macroevolutionary analyses available like fitting Brownian motion and related models

library(phytools)   #mostly for tree manipulation
## Warning: package 'phytools' was built under R version 3.4.3
## Loading required package: maps
library(caper)   #includes even more analyses like PGLS
## Warning: package 'caper' was built under R version 3.4.3
## Loading required package: MASS
## Loading required package: mvtnorm
# caper stands for "comparative analysis of phylogenetics and evolution in R"

Tree Structure in R

You can write out a simple tree longhand. This is the “Newick” format, where parentheses and commas indicate relationships. Then, we will look at what tree objects in R are made of. Finally we will plot the tree.

text.string<-
  "(((((((cow, pig),whale),(bat,(lemur,human))),(robin,iguana)),coelacanth),gold_fish),shark);"

vert.tree<-read.tree(text=text.string)

### Try printing the vert.tree object to the screen to get a sense for what the tree object looks like in R.

vert.tree
## 
## Phylogenetic tree with 11 tips and 10 internal nodes.
## 
## Tip labels:
##  cow, pig, whale, bat, lemur, human, ...
## 
## Rooted; no branch lengths.
str(vert.tree)
## List of 3
##  $ edge     : int [1:20, 1:2] 12 13 14 15 16 17 18 18 17 16 ...
##  $ Nnode    : int 10
##  $ tip.label: chr [1:11] "cow" "pig" "whale" "bat" ...
##  - attr(*, "class")= chr "phylo"
##  - attr(*, "order")= chr "cladewise"
## The tree object contains a set of lists: $edge, $tip.label, $Nnode

vert.tree$tip.label  # Is a list of tip labels in the tree.
##  [1] "cow"        "pig"        "whale"      "bat"        "lemur"     
##  [6] "human"      "robin"      "iguana"     "coelacanth" "gold_fish" 
## [11] "shark"
## Now plot the tree. The simplest way is just using plot()

plot(vert.tree)

## You can make some changes to the appearance easily:

plot(vert.tree, edge.width = 2)

plot(vert.tree, edge.width = 2, type="fan")

## Something more complicated: coloring the mammal clade

wh <- which.edge(vert.tree, c("cow", "pig", "whale", "bat", "lemur", "human") )

colo<- rep("black", Nedge(vert.tree)) # define a default color. The colo object is going to have a color identifyer for every edge (=branch) in our tree

colo[wh] <- "red"

plot(vert.tree, edge.color=colo, no.margin=TRUE, edge.width=2, show.tip.label=TRUE)

The Analyses of Phylogenetics and Evolution with R book has tons of plotting examples. I can usually figure out what I want to plot from that book alone.

Note that tips and nodes in the tree are numbered in a systematic way. This is really important to know for any tree manipulation you want to do. The tips are numbered from 1 to N tips (in the same order as the $tip.labels object), and the nodes are numbered from N tips+1 to N nodes. The root is always numbered first (equal to N tips+1). The number of nodes will always equal the number of tips minus 1.

We can visualize the node numbers:

plot(vert.tree) 
nodelabels() #visualize the node nabels. So this tree has 11 tips. So the root node is #12, and nodes are counted out from there. The number of nodes will always be equal to #tips minus 1. 

Basic tree manipulation

Say we only want a tree of mammals. There are two ways to get this from the tree we have: cut out anything that isn’t a mammal, or extract the mammal clade by itself. I’ll demonstrate both ways.

new.tree <- drop.tip(vert.tree, c("shark", "gold_fish", "coelacanth", "iguana", "robin"))

plot(new.tree)

Notice that we added the underscore for gold fish. If you print vert.tree$tip.label, you notice it is written as gold_fish in the list object. When you plot it, the name has a space instead. If you forget the underscore, the function won’t be able to find “gold fish”, so it will be left in your tree.

plot(vert.tree) 
nodelabels()  # So the mammal clade is node #16

new.tree2 <- extract.clade(vert.tree, 16)
plot(new.tree2) 

The objects new.tree and new.tree2 are identical. Whichever option you use will depend on what makes sense for your tree. If you only want to remove a handful of taxa (perhaps not monophyletic), use drop.tip. If you want a monophyletic clade to be cut out of a bigger tree, use extract.clade.

** Advanced tree manipulation with a huge tree**

Now let’s work with a gigantic tree. This will force us to use some tree manipulation functions.

This is a tree of all ray-finned fishes from Rabosky et al. 2013 (over 7,000 species!!!!!!) To download, go to this link, use Select All, then copy and paste into a text editor like Text Wrangler. Save in your working directory with the file extension “.tree”. Avoid “.txt” for tree files.

You’ll notice this tree looks like the short one we wrote out by hand, except there are numbers. These are the branch lengths. Because this tree is time-calibrated, the branch lengths are in units of Millions of Years. You can also have branch lengths in units of substitutions, for example. If you have no branch lengths, like our tree before, then you have a cladogram, which only shows relationships of taxa.

DO NOT plot this tree in the R console! It is very big, so it will take a long time to plot!

Let’s cut this tree down to something more managable.

# Read the tree from your file:
fish.tree <- read.tree("data/04_fish_tree.tree")
fish.tree  #This tree has 7822 tips and 7821 nodes.
## 
## Phylogenetic tree with 7822 tips and 7821 internal nodes.
## 
## Tip labels:
##  Polyodon_spathula, Psephurus_gladius, Scaphirhynchus_albus, Scaphirhynchus_platorynchus, Scaphirhynchus_suttkusi, Huso_huso, ...
## 
## Rooted; includes branch lengths.

Imagine we want to study butterflyfishes. How would we get the butterflyfish clade out of this big tree? It would be cumbersome to drop some 7,000 tips, so it’s easiest to use extract.clade. But what node belongs to the butterflyfish family?

To find out, use the function findMRCA in phytools!! This requires you to know what tips correspond to the clade you want, so you need to know something about your clade in advance.

findMRCA(fish.tree, tips=c("Chaetodon_melapterus", "Johnrandallia_nigrirostris"))  
## [1] 9164
## that is the node you want to use in extract.clade.

tree <- extract.clade(fish.tree, 9164) #or you can just use findMRCA directly in the extract.clade function

tree # this tree has 94 species. 
## 
## Phylogenetic tree with 94 tips and 93 internal nodes.
## 
## Tip labels:
##  Prognathodes_marcellae, Prognathodes_aculeatus, Prognathodes_falcifer, Prognathodes_aya, Chaetodon_adiergastos, Chaetodon_fasciatus, ...
## 
## Rooted; includes branch lengths.
### Now this is something more easily plotted.
plot(tree)

plot(tree, show.tip.label=FALSE)
axisPhylo() # Shows the scale of the branch lengths in millions of years.

So we know that the butterflyfish family has a crown age of about 50 million years.

Let’s say we wanted that crown age to be more precise for downstream analyses. You can get the age of any node using the branching.times() function.

branching.times(tree) #remember the counting rules for nodes (above) So the node numbered #tips + 1 is the root node.
##        95        96        97        98        99       100       101 
## 50.747990 40.229290 21.008190 14.281050 11.688930 30.570760 28.850140 
##       102       103       104       105       106       107       108 
## 23.503140 20.960250 17.922160  7.393060  2.698550 11.653040 10.322290 
##       109       110       111       112       113       114       115 
##  9.009130  5.328620  4.182830  8.580596  5.610946  3.580506  3.278075 
##       116       117       118       119       120       121       122 
##  1.718075  6.182156  4.640996 11.159830 22.698693 20.316203 15.535773 
##       123       124       125       126       127       128       129 
## 14.743803 18.116373  6.191473 26.774430 23.823120 22.768630 15.666900 
##       130       131       132       133       134       135       136 
## 21.072950 20.132118  9.239218  1.851388 10.894488  8.996468  5.742818 
##       137       138       139       140       141       142       143 
## 12.897520 23.893650 23.073284  2.199584 12.759684  7.025550  4.019000 
##       144       145       146       147       148       149       150 
## 25.011510  9.292010  1.502750  1.658610 22.502610 19.479140 17.976680 
##       151       152       153       154       155       156       157 
## 16.392080  4.013880  1.938670  0.873290 12.668390 17.318832 11.626962 
##       158       159       160       161       162       163       164 
##  4.066392  3.381917  2.824812  2.047991  2.247182  1.931081  1.649207 
##       165       166       167       168       169       170       171 
##  8.129002 41.824120 37.762770 24.443170 22.005530  5.667930  1.906410 
##       172       173       174       175       176       177       178 
##  7.460430  6.924619  2.250370 20.188370 30.385720 15.877220 26.850620 
##       179       180       181       182       183       184       185 
##  8.602520  0.211310 22.290580 19.972370  4.434870 11.138460  0.482660 
##       186       187 
##  0.111360 20.894600
node_ages=as.data.frame(branching.times(tree))

node_ages[row.names(node_ages) == findMRCA(tree, c(tree$tip.label[1],tree$tip.label[5])),]
## [1] 40.22929

Now we want to save this tree for future analyses. Use the “write.tree” function, and give the new tree a name:

write.tree(tree, file="04_butterflyfishes.tree")

PGLS: Phylogenetic generalized least squares

Finally, we will do a PGLS. PGLS is a method used to account for phylogenetic relatedness of species when you’re interested in correlations between numeric variables. This is probably the most common type of comparative analysis. It is very flexible, and you can compare multiple factors, or even categorical factors (using an extension of the method not shown here).

These data for barbets (a king of songbird) include various factors like wing length and song frequency. To download this data, go to the links, and copy+paste into Text Wrangler. Save the trait data as a “.csv” file, and the tree as a “.nex” file.

Trait data Phylogeny

You’ll notice this tree is in Nexus instead of Newick format. Nexus allows for more detailed information, and includes “blocks” of information like the taxa block. We need to read these in differently, but otherwise there shouldn’t be much difference to working with Newick vs. Nexus.

data<- read.csv("data/04_Barbetdata.csv")
tree<- read.nexus("data/04_BarbetTree.nexus")

Note we used “read.nexus” instead of “read.tree”.

Before using PGLS or many other comparative methods, we need to check that the species in the tree match the species in the data and vice-versa. There is a handy function to do that:

check <- name.check(tree, data, data.names=data$Species)
check
## $tree_not_data
## [1] "M_asiatica_chersonesus"    "M_australis_australis"    
## [3] "M_haemacephala_celestinoi" "M_haemacephala_delica"    
## [5] "M_haemacephala_intermedia" "M_haemacephala_rosea"     
## [7] "M_lineata_lineata"         "M_oorti_faber"            
## [9] "M_oorti_sini"             
## 
## $data_not_tree
## factor(0)
## 33 Levels: Calorhamphus_fuliginosus_fuliginosus ...
## There are 9 species in the phylogeny for which we don't have data. Let's just cut these out.

new.tree <- drop.tip(tree, check$tree_not_data)
name.check(new.tree, data, data.names=data$Species) # Now we should be good
## [1] "OK"

We are going to perform PGLS using the caper package. This package works by fusing the trait data and tree into a “comparative data” object.

comp.data<-comparative.data(new.tree, data, 
                            names.col="Species", vcv.dim=2, 
                            warn.dropped=TRUE)

comp.data ## look at the comp.data object
## Comparative dataset of 33 taxa:
## Phylogeny: new.tree 
##    33 tips, 32 internal nodes
##    chr [1:33] "M_asiatica_asiatica" "M_lineata_hodgsoni" "M_eximia" ...
## Data: data 
##    $ wing     : num [1:33] 4.61 4.86 4.36 4.59 4.6 ...
##    $ Lnalt    : num [1:33] 6.8 5.99 6.62 6.21 7.31 ...
##    $ patch    : num [1:33] 7.33 3.67 7.67 5.67 7.67 ...
##    $ colour   : num [1:33] 5 2.67 5 3 5 ...
##    $ Frequency: num [1:33] -10.15 -12.75 -2.31 -8.78 -8.89 ...
##    $ Length   : num [1:33] 0.3671 0.3288 3.2467 -0.0737 1.7412 ...
##    $ Lnote    : num [1:33] 0.025 0.07 0.03 0.04 0.135 0.5 0.05 0.26 0.04 0.045 ...

Note that if there are species in the data not in the tree, or vise-versa, it will automatically drop these. This may or may not be what you want, so it’s best to check first.

We will first test if a species’ altitude is related to the frequency of its song, while accounting for non-independence due to phylogenetic relatedness.

model1 <- pgls(Lnote~Lnalt, lambda="ML", data=comp.data)
summary(model1)
## 
## Call:
## pgls(formula = Lnote ~ Lnalt, data = comp.data, lambda = "ML")
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.04816 -0.01198  0.01042  0.02527  0.04437 
## 
## Branch length transformations:
## 
## kappa  [Fix]  : 1.000
## lambda [ ML]  : 1.000
##    lower bound : 0.000, p = 1.3578e-08
##    upper bound : 1.000, p = 1    
##    95.0% CI   : (0.952, NA)
## delta  [Fix]  : 1.000
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -0.179755   0.149409 -1.2031  0.23804  
## Lnalt        0.054754   0.021624  2.5321  0.01662 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02551 on 31 degrees of freedom
## Multiple R-squared: 0.1714,  Adjusted R-squared: 0.1446 
## F-statistic: 6.412 on 1 and 31 DF,  p-value: 0.01662

Look at the lambda parameter. Lambda is a measure of phylogenetic signal of the residual error of the variables. It ranges from 0-1. When lambda=0, this means there is no effect of phylogeny. When it is =1, the structure of the residual error follows Brownian motion along the tree.

You can also plot the liklihood surface of lambda:

lm.lk<-pgls.profile(model1, which="lambda")
plot(lm.lk)

Next, we will add an additional variable that potentially interacts with altitude: wing length:

model2 <- pgls(Lnote~Lnalt+wing, lambda="ML", data=comp.data)
summary(model2)
## 
## Call:
## pgls(formula = Lnote ~ Lnalt + wing, data = comp.data, lambda = "ML")
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.03564 -0.01143  0.00644  0.01945  0.04477 
## 
## Branch length transformations:
## 
## kappa  [Fix]  : 1.000
## lambda [ ML]  : 1.000
##    lower bound : 0.000, p = 1.5e-08
##    upper bound : 1.000, p = 1    
##    95.0% CI   : (0.940, NA)
## delta  [Fix]  : 1.000
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -1.338053   0.547467 -2.4441 0.020617 * 
## Lnalt        0.067619   0.021240  3.1836 0.003378 **
## wing         0.240871   0.110005  2.1896 0.036464 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02408 on 30 degrees of freedom
## Multiple R-squared: 0.2856,  Adjusted R-squared: 0.2379 
## F-statistic: 5.995 on 2 and 30 DF,  p-value: 0.006449

Some additional packages that I didn’t cover here:

The four packages loaded here can collectively do tons of analyses including ancestral state reconstruction, fitting models of evolution like Brownian motion, simulate phylogenies, make lineage-through-time plots, etc etc. If there is a specific thing you want to do, you can usually find a tutorial for it online.

Here are some more packages you might want to use: