Newsletter of the Biological Survey of Canada (Terrestrial Arthropods)

Volume 23 No. 1, 2004


 

Opinion Page

—The Opinion Page is a forum for views and ideas of potential interest to readers—

Contributions should be sent to the editor.

 

Bioinformatics and misinformatics: the missing links between taxonomic data and taxonomic databases

Terry A. Wheeler

Department of Natural Resource Sciences, McGill University, Macdonald Campus, Ste-Anne-de-Bellevue, QC H9X 3V9 (wheeler@nrs.mcgill.ca)


The Importance of Systematic Data
The systematist has two fundamental responsibilities to the scientific community: assemble, analyse and synthesize the data necessary to describe the diversity of life on earth; and ensure that other scientists and users have access to the data. For 250 years, the primary tool used to disseminate systematic information has been the published revision, containing species descriptions, keys, distribution maps, ecological notes, etc. The rise of the World Wide Web and more powerful desktop computers in the last decade has led to a tremendous increase in the range of media and products that may be used to disseminate the results of systematic research. The opportunity to connect researchers and data almost instantaneously led to the realization that massive amounts of data could now be synthesized, organized and made accessible to a community of users limited only by access to a computer network. I contend that organized Canadian efforts to synthesize biodiversity data have not taken advantage of this opportunity; in fact, ongoing initiatives have obscured an increasing gap between basic research in systematics and dissemination of "products".

It has long been recognized that Canada is losing specialists trained in systematics and that our collective research output in the field has suffered as a result. A variety of solutions has been proposed to address this lack of basic research. The systematics community has repeatedly proposed a bold plan based on training and hiring more systematists. This innovative solution has unleashed a storm of apathy in most circles, with a few notable exceptions; the most encouraging has been the recent hiring of three systematic entomologists by Agriculture and Agri-Food Canada. Other scientists (Canadian and otherwise) have proposed grand but naïve plans to replace primary taxonomic research with so-called "DNA barcodes" based on sequencing a minute portion of the genome (e.g., Hebert et al. 2003, Tautz et al. 2003) or doing away with the tedious (at least to non-systematists) necessity of observing rules of nomenclatural priority and hard copy publications and transferring taxonomy in toto to the Web (e.g., Godfray 2002). Such sweeping and technology-driven proposals have been effective in getting systematics into the pages of Nature, Science and other prominent journals, and I suppose that is a good thing. However, these are, in the end, simplistic and flawed "solutions" that ignore the need for trained systematists to actually recognize new species in the first place, to describe taxa accurately in such a way that they are recognizable to other scientists, and to propose and test hypotheses on phylogenetic relationships. Technological advances have obviously revolutionized the way we conduct and disseminate systematic research, but such advances should be tools, not crutches, that serve as an adjunct to good work done by well-trained systematists (Mallet and Willmott 2003, Scotland et al. 2003, Wilson 2003).

Some countries have adopted a balanced view of the importance of research at all levels of the systematic process and have responded accordingly. In the United States, for example, the National Science Foundation has multiple discrete funding programs for systematic research (see www.nsf.gov/bio/deb/start.htm). There are programs to discover and describe new taxa (Partnerships for Enhancing Expertise in Taxonomy), to conduct large-scale faunal inventories (Biodiversity Surveys and Inventories), to reconstruct phylogenetic history and place species within an evolutionary context (Assembling the Tree of Life), to support curation and access to collections (Biological Research Collections), and to establish bioinformatics frameworks (Biological Databases and Informatics). This is a logical and scientifically valid approach that increases the likelihood that the taxonomic databases will be built on accurate data. The current picture is very different in Canada; other than the traditional sources of (limited) support to individual researchers from agencies like the Natural Sciences and Engineering Research Council and the employment of an ever-shrinking cohort of government systematists, Canadian government agencies have largely ignored the need for research on the identity and relationships of species. Instead, they have opted for presentation over content. No new funds have been allocated across the systematics community and support for systematic research continues to erode. In contrast, bioinformatics is a current hot topic and agencies involved in disseminating and using biodiversity information have embraced packaging and marketing with the zeal of a new recruit at an advertising agency. The result is that Canada can now hold her own with any other scientific power in the production and proliferation of acronyms and websites.

"Initial" Efforts
The Federal Biosystematics Group (FBG), a consortium made up of the five federal Natural Resources departments with a stake in biodiversity knowledge, released a report (Federal Biosystematics Group 1995) on the state of systematics in Canada. The report identified two important areas most in need of financial support: new scientists (namely, 15 systematists, additional support staff, and support for students); and better facilities (namely, a National Collections Strategy, cost sharing to support collections, and computerised access to holdings). Subsequent actions on this front by FBG included changing their name to the Federal Biosystematics Partnership (FBP) followed by the launch in early 2003 of the Federal Biodiversity Information Partnership (FBIP) (www.cbif.gc.ca/fbip/fbip_e.php). No funding programs for systematic research have been established other than a single three-year postgraduate fellowship in systematics, which was awarded only once. Current FBIP projects include scattered "proof of concept" (to commandeer a reprehensible management cliché) projects such as databasing the mosquito collections in a small subset of selected museums across the country, which, it is hoped, will provide more accurate distributional data for monitoring West Nile virus. There is no indication on the FBIP website as to whether species identifications will be confirmed by one of our very few qualified specialists prior to the taxonomic information appearing on the Web.

While FBG/FBP/FBIP coordinated activities within the federal government departments, a broader initiative led to the formation of the Canadian Biodiversity Information Initiative (CANBII), based on the American NBII program. CANBII quickly became CBIN (The Canadian Biodiversity Information Network), which, in turn, became BCIN (Biota of Canada Information Network), which, in the fullness of time, became BKIN (Biodiversity Knowledge Information Network). Some workshops were held and optimistic plans were made. The major "proof of concept" project resulting from the CANBII/CBIN/BCIN exercise is the Butterflies of Canada (www.cbif.gc.ca/spp_pages/butterflies/index_e.php). That project assembled specimen data from many (but not all) major insect collections across Canada on a single, well-known group of insects for which taxonomic data and curation are in good shape and which has, thus, been used repeatedly in databasing and analysis projects. The butterflies represent a small group of insects in Canada (293 species) and are unusual in that they have been so well studied by systematists that available identification tools like field guides and regional catalogs make specimen identification a simple process for competent entomologists. There have been, apparently, no other concrete products combining data from a large number of collections arising from CBIN/BCIN/BKIN, though there has been limited distribution of the reports of the workshops and identification of some vague objectives. It appears that BKIN has been subsumed within CBIF (see below).

The Global Biodiversity Information Facility (GBIF, www.gbif.org) is an international program that will coordinate national and regional efforts to compile interconnected databases of biodiversity information. Canada, one of the member countries of GBIF, has responded to its commitment to GBIF by establishing CBIF, the Canadian Biodiversity Information Facility (www.cbif.gc.ca), coordinated by FBP (or perhaps FBIP), which has assumed responsibility for the objectives previously held by CBIN, BCIN and BKIN.

Under the Canadian programs, the databases of taxonomic names will be built upon, and linked to, the framework of the Integrated Taxonomic Information System (ITIS) (about which more below).

One of the main weaknesses in this whole system, aside from the necessity to learn new acronyms every few months, is that the FBIP/BKIN/CBIF initiatives in Canada are overwhelmingly top-down, with federal agencies driving all decisions, meetings, workshops and consultations, as well as dispensing all budgets, much of which seems to be allocated to the aforementioned meetings, workshops and consultations. Information transfer to members of the systematics community is sporadic at best. The university community is notably absent from any substantive input into the programs. On the other hand, the actual data collection and verification is primarily bottom-up, built on the efforts of individual systematists, frequently in the university system. Between the top-down "planning" and the bottom-up execution, there is a broad no-man’s land, and the working systematists grow increasingly disillusioned and cynical with the glowing visions of a computerized utopia coming from above.

Most databasing that has been done at the level of natural history collections has involved individual researchers finding small sums of money for support staff or setting aside some of their own valuable time to organize data on a portion of their own collection, often as part of a larger systematic study. The FBIP/BKIN/CBIF vision of a community of data generators and data users sitting at the computer peering virtually into the drawers of other museums is certainly an appealing vision, but it is, at best, a little farther in the future than we are led to believe, and, at worst, an indication of how out of touch these initiatives are with the current state of raw biodiversity information for nearly all groups of arthropods.

The quality and quantity of information
There seems to be an assumption at some levels that compiling taxonomic databases is a management problem, not a science problem (i.e., we don’t need particular expertise, just some bodies to connect the data). This assumption has led to the generation and proliferation of errors in the few existing databases. I searched ITIS (our "flagship" for taxonomic names) for my own family of expertise, the fly family Chloropidae (for those who wonder if this may be a particularly obscure or arcane choice, the family contains over 2000 described species worldwide, including major pests of wheat, oats and rice on most continents and vectors of conjunctivitis, yaws and Brazilian Purpuric Fever). The ITIS search generated a list of 368 chloropid names (subfamilies, genera and species); most are Nearctic, but there is a strange and woefully incomplete smattering of names from other biogeographic realms. One entire subfamily is missing (although an older, preoccupied name for one of the genera in that subfamily is included, albeit in the wrong subfamily). Some names are recent and valid, described in 1980; others are synonyms that have not been used in 40-50 years. Clearly, the anonymous person who did the data entry did not know anything about these organisms. I say "anonymous" not to spare their reputation, but simply because the database does not identify individuals responsible for the data. I fared no better with a search on Sphaeroceridae, one of the few acalyptrate Diptera families in North America with an authoritative and recent set of revisions, keys and species lists. Some genera were completely omitted; some species turned up in multiple genera. Admittedly, I had better luck when I searched for some major agricultural crops and pests.

The ITIS website identifies the source of its taxonomic data as the NODC Taxonomic Code, database version 8.0. The acronym "NODC" is not, unfortunately, defined on the ITIS website. However, a Google search revealed that NODC is the National Oceanographic Data Center (www.nodc.noaa.gov), which, in turn, gives no indication as to the source of its information on terrestrial organisms. These data, evidently used as the basis for launching the initiative, have simply been incorporated wholesale, with all their errors, into ITIS. Some may assume that bad data ("unverified" sensu ITIS) are better than no data (this is an erroneous assumption); some may assume that seeing such errors would encourage the appropriate specialists to volunteer their time and effort to fix them (this is also an erroneous assumption). Perhaps there was simply a desire to get as many records as possible incorporated into ITIS during the early "proof of concept" stages.

There are multiple problems here. First, given the small number of specialists and the current nature of our workloads, it is unlikely that we (the working systematists) will be lining up to clean up ITIS anytime soon. Second, and in the meantime, the error-filled lists are available on the Web in databases like ITIS or Nomina Insecta Nearctica (www.neartica.com/nomina/main.htm), another widely used, incompletely verified compilation that is rife with errors, at least in Diptera. People who are unaware of the errors incorporated in those resources use them, in turn, as their source for taxonomic names. And so the misinformation radiates out across the Web. So too does the mistaken assumption that as long as we have a name we don’t really need a systematist just to confirm what we already "know".

The dangers of misinformatics
The term bioinformatics has become entrenched in the biological lexicon (Sugden and Pennisi 2000). Some people restrict its use to the compilation and large-scale analysis of genetic data in centralized databases such as Genbank; others, including most of the agencies and initiatives discussed here, use a broader concept encompassing genetic, taxonomic, nomenclatural, phylogenetic and ecological data on organisms. Unfortunately, in their present incarnations, ITIS and similar entities are dispensing as much misinformatics as bioinformatics and until the focus changes from "proof of concept" to "ground-truthing" (to commandeer another reprehensible management cliché) they will continue to do so. There is a fine line between bioinformatics and misinformatics and unless the systematics community is encouraged to become a major player in these initiatives and there is tangible (i.e., financial) support for generating and verifying the data that these databases are built upon, Canada will be standing on the wrong side of that fine line. The proliferation of misinformatics websites has the potential to do more damage to the study of biodiversity than having no databases at all.

Where do we click now?
Progress on two fronts is necessary: the accumulation and verification of accurate primary data to build the databases; and the construction and coordination of databases to build a synthesis of information across collections. Data generation and data organization are tied together as securely as two people in a three-legged race. If we do not move forward together, neither of us will get where we want to go.

I have been assured, on more than one occasion and by more than one web database promoter, that increased funding for their products cannot help but generate additional support for basic systematic research. After many years, and especially since the adoption of the Rio Convention more than a decade ago, I have seen no evidence for this assertion whatsoever, at least in Canada. Too much money from limited departmental budgets has already been spent on ineffectual workshops and consultation reports, all of which state, repeatedly, the painfully obvious. There is little money left over to support meaningful progress toward the long-term goals.

Current federal initiatives in biodiversity databasing must acknowledge the weaknesses in their existing data and organizational structure and increase support for, and involvement of, the working systematics community. The continuing absence of involvement from academia and even of many systematists within the government system is a critical oversight that seriously weakens Canadian initiatives compared to ongoing American, European and Latin American programs. If this unification of purpose does not happen soon, future developments are obvious. Systematists will continue to lose valuable time trying to convince the bioinformatics committees and working groups of the value of our expertise and research, and of the necessity to train and employ more systematists to build the foundation that our database administrators seem to think is already in place. The systematists will also waste too much time trying to correct the damage done by the growing body of taxonomic misinformation that litters the information superhighway. Meanwhile, the biodiversity database designers will surround themselves in pretty paper and ribbons as they gift-wrap the same empty boxes, over and over again.

References

Federal Biosystematics Group. 1995. Systematics: an impending crisis. A statement at the time of the Federal Science and Technology Review. Publishing Division, Canadian Museum of Nature, Ottawa. 18 pp.

Godfray, H.C.J. 2002. Challenges for taxonomy. Nature 417: 17-19.

Hebert, P.D.N., A. Cywinska, S.L. Ball and J.R. deWaard. 2003. Biological identifications through DNA barcodes. Proceedings of the Royal Society of London, Series B 270: 313-322.

Mallet, J. and K. Willmott. 2003. Taxonomy: renaissance or Tower of Babel? Trends in Ecology and Evolution 18: 57-59.

Scotland, R., C. Hughes, D. Bailey and A. Wortley. 2003. The Big Machine and the much-maligned taxonomist. Systematics and Biodiversity 1: 139-143.

Sugden, A. and E. Pennisi. 2000. Diversity digitized. Science 289: 2305.

Tautz, D., P. Arctander, A. Minelli, R.H. Thomas and A.P. Vogler. 2003. A plea for DNA taxonomy. Trends in Ecology and Evolution 18: 70-74.

Wilson, E.O. 2003. The encyclopedia of life. Trends in Ecology and Evolution 18: 77-80.

 
