Bob and I have a new paper just out in Molecular Ecology, along with a perspective piece by Lucie Zinger and Hervé Philippe highlighting it in the same issue. This is the first manuscript based on the Beckman Postdoctoral Fellowship that I received this past year. The main idea of the proposal is to examine how well the statistical methods used to identify species boundaries from genetic data actually work in real datasets.
Estimating global biodiversity is a major challenge for biologists over the next several decades. We know that there are far more species on Earth than have been described (though we don't know the exact number), and that many of them are increasingly threatened with extinction due to various anthropogenic impacts. In addition, characterizing species diversity in many poorly studied biological communities remains challenging. Approaches such as DNA barcoding have been developed to help address these challenges.
However, the reliability of biodiversity estimates derived from these methods depends on the model we use adequately describing the biological processes that shaped the genetic datasets we collect. To determine whether this is the case, we need to perform an assessment of model adequacy where we ask: Does our model fit our particular dataset?
In our manuscript, we assess model adequacy in DNA barcoding across the tree of life using a technique called posterior predictive simulation. The approach consists of three steps: 1) estimate the model parameters (for example, the number of species or the mutation rate in our genetic dataset) by performing inference on a dataset (in this case, Bayesian inference, so we are thinking probabilistically to estimate what the true values are), 2) simulate new datasets based on those parameter estimates from the analyses of the real data, and 3) compare the simulated datasets to the original data using a test statistic. If our model describes our data well, our simulated datasets should be similar to the original data. Pretty simple, right?
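The three steps above can be sketched as a short loop. This is a minimal illustration of the general posterior predictive idea, not code from our paper; the `simulate_alignment` and `test_statistic` callables and the list of posterior samples are all hypothetical placeholders supplied by the caller.

```python
import random

def posterior_predictive_check(observed, posterior_samples,
                               simulate_alignment, test_statistic,
                               n_sims=1000):
    """Compare a test statistic on the observed data against its
    distribution over datasets simulated from the posterior.

    posterior_samples : parameter draws from step 1 (inference)
    simulate_alignment : callable that generates a dataset from one draw
    test_statistic : callable that summarizes a dataset as a number
    """
    observed_stat = test_statistic(observed)
    simulated_stats = []
    for _ in range(n_sims):
        params = random.choice(posterior_samples)   # draw from the posterior
        simulated = simulate_alignment(params)      # step 2: simulate new data
        simulated_stats.append(test_statistic(simulated))  # step 3: compare
    # Posterior predictive p-value: fraction of simulated datasets whose
    # statistic is at least as extreme as the observed one. Values near
    # 0 or 1 suggest the model does not describe the data well.
    p = sum(s >= observed_stat for s in simulated_stats) / n_sims
    return observed_stat, simulated_stats, p
```

In practice the simulator would be a sequence-evolution model and the test statistic something like multinomial likelihood or a diversity summary, but the logic of "simulate, summarize, compare" is exactly this.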
The primary statistical models used in DNA barcoding are substitution models that are used to calculate genetic distances between the DNA sequences from different organisms that we have sampled. These distances are then used with clustering algorithms to group the individuals into operational taxonomic units (OTUs), which you can think of as a proxy for species. The field of DNA barcoding has traditionally relied on the simple Kimura 2-parameter (K2P) substitution model for calculating genetic distances. We compared the performance of this model to the performance of several more complex ones. You can see a graphical outline of the approach in the figure below. For the study, we utilized the large amount of data available from the Barcode of Life Database.
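To make the distance-then-cluster pipeline concrete, here is a small sketch of the K2P distance (which corrects separately for transitions and transversions) followed by a naive single-linkage clustering step. This is an illustration only, not the pipeline from the paper: real barcoding tools use more careful clustering, and the 2% threshold below is just a commonly cited heuristic, not a value we endorse.

```python
import math

def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned DNA sequences:
    d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q),
    where P and Q are the proportions of sites showing transitions
    and transversions, respectively."""
    purines, pyrimidines = {"A", "G"}, {"C", "T"}
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous sites
        compared += 1
        if a == b:
            continue
        same_class = ({a, b} <= purines) or ({a, b} <= pyrimidines)
        if same_class:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    p, q = transitions / compared, transversions / compared
    w1, w2 = 1 - 2 * p - q, 1 - 2 * q
    if w1 <= 0 or w2 <= 0:
        return float("inf")  # divergence saturated; formula undefined
    return -0.5 * math.log(w1) - 0.25 * math.log(w2)

def cluster_otus(seqs, threshold=0.02):
    """Greedy single-linkage clustering: a sequence joins the first
    cluster containing a member within `threshold` K2P distance."""
    clusters = []
    for s in seqs:
        for c in clusters:
            if any(k2p_distance(s, m) <= threshold for m in c):
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters
```

The point of the paper is that swapping `k2p_distance` for a distance derived from a better-fitting substitution model can change how many clusters fall out of exactly this kind of procedure.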
The results of our study demonstrated two major points: 1) The choice of substitution model can substantially impact biodiversity estimates in barcode datasets, and 2) more complex models perform better than the simple K2P model in barcoding datasets.
One of the most interesting parts of the study was seeing the large impact that model choice had on the number of OTUs that we identified. Depending on the clustering algorithm we used, the difference in the number of OTUs identified between the K2P and selected model across all datasets was as high as 31%. If you extrapolate that across a conservative estimate of eukaryotic species diversity (not including things like bacteria) on earth (say ~10 million species), this would result in a difference of over 3 million species! This was a pretty striking illustration of the large impact that statistical model adequacy can have on our understanding of biodiversity.
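The back-of-the-envelope arithmetic behind that number (using only the figures quoted above; the ~10 million is an assumed conservative estimate, not a result of the paper):

```python
eukaryote_species = 10_000_000  # conservative global estimate assumed in the text
max_otu_discrepancy = 0.31      # largest K2P-vs-selected-model difference observed
print(round(eukaryote_species * max_otu_discrepancy))  # 3,100,000 species
```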
The figure on the left is from our paper, the one on the right is from the ‘Perspective’ paper in the same journal issue.