The history of the biological or life sciences as an independent field is not very long: it dates to the 19th century, although earlier records, including ancient medicine, were prevalent in India, Greece, Rome, and Egypt. Botany and zoology emerged as disciplines in the 18th and 19th centuries, when scientists such as Alexander von Humboldt began studying the interaction between organisms and their environment. It was much later that the life sciences fascinated the world and extensive data started to emerge.
From Warren Weaver in 1938 to Hershey and Chase in 1952, a series of transitions led to the discovery of the genetic code and the basics of genetic information between 1961 and 1966, and the 1970s brought the idea of information processing into the biological sciences. This is how bioinformatics came into the picture. Dayhoff created the first database in 1971, which was based on information on proteins.
Interestingly, the first biological databases were not on DNA or RNA but on proteins. The Human Genome Project later added a huge body of information that needed to be processed and stored. This brought computer science into use, and biological databases came into the picture. Bioinformatics has many definitions and many aspects. It cannot be studied by a biologist alone or by a computer scientist alone, because it needs both, which makes it a comparatively nebulous term. Biological databases are storehouses of biological data.
Computers came into use in the life sciences because the data could no longer be stored manually by a human workforce. Later, they brought additional advantages for sequence analysis, so biological databases were created to provide precise methods of storage and analysis.
BIOLOGICAL DATABASES AND NCBI
Biological databases are used to store biological information such as nucleotides, nucleosides, DNA, RNA, proteins, metabolic pathways, and information regarding enzymes. The first database was made when the insulin sequence became known in 1956. Databases are created to store large amounts of information, and biological databases are no exception. They store information on genomics, proteomics, metabolomics, and other fields of biology that have developed in recent times.
Information technology has aided the development of the biological sciences to a considerable extent. Nowadays, traditional knowledge is also being stored in these databases. A biological database can be defined in many ways, including as a storehouse of information in which data can be stored, retrieved, and developed.
In simple terms, the data stored in a database consist of records, or entries. Each record has fields, which specify the type of information stored. A simple database consists of a single file with many entries in one or more categories, whereas a complex database system consists of many files that are linked with each other through hyperlinks. Information can be traced by crossing from one file to another.
A typical biological database entry generally contains the information that was searched for, a description of it, the source from which it was taken along with the founder (if any), and a unique ID, in the form of a numeric or alphanumeric code, assigned by the particular database. The extent of the information provided depends on the objectives for which the database was created. The complexity of a database increases with the amount and variety of information stored.
Complex databases require highly specialized systems, and creating and maintaining them is a cumbersome process that not everyone can undertake. An ideal database should carry as much relevant detail as possible about the information it stores. It should hold a wide diversity of information, irrespective of field, but in a synchronized manner.
Data should be stored so that a person without prerequisite knowledge can also understand them. The stored data must be authentic and cross-verified by various means. Data that are similar but differ in minor details must each be stored. The database must be accessible to all, but editing must be done or approved by a person with expertise. A typical database also has algorithms and alignment tools (Mount, 2001).
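The record structure described above can be sketched in a few lines of Python. This is only an illustration: the class names, field names, and the accessions "TOY0001" and "TOY0002" are made up for the example and do not correspond to any real database.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One entry of a toy biological database (all field names are illustrative)."""
    accession: str    # unique ID assigned by the database
    name: str         # the term a user would search for
    description: str  # short description of the entry
    source: str       # where the information was taken from
    cross_refs: list = field(default_factory=list)  # accessions of linked entries


class ToyDatabase:
    """Keeps records in a dictionary keyed by accession."""

    def __init__(self):
        self._records = {}

    def add(self, record):
        # The accession must be unique within the database.
        if record.accession in self._records:
            raise ValueError(f"duplicate accession: {record.accession}")
        self._records[record.accession] = record

    def get(self, accession):
        return self._records[accession]

    def linked(self, accession):
        # Follow cross-references from one entry to the entries it points to.
        refs = self.get(accession).cross_refs
        return [self._records[a] for a in refs if a in self._records]


db = ToyDatabase()
db.add(Record("TOY0001", "insulin", "peptide hormone", "hypothetical submitter",
              cross_refs=["TOY0002"]))
db.add(Record("TOY0002", "INS gene", "gene encoding insulin", "hypothetical submitter"))

print([r.name for r in db.linked("TOY0001")])
```

Keeping the cross-references as accessions rather than as copies of the linked records mirrors the idea of tracing information from one file to another.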
There are different kinds of databases; in broad terms, they are divided into structural and sequence databases. Structural databases hold information regarding structure, and sequence databases hold sequence information. One major provider of such databases is NCBI.
NCBI stands for National Center for Biotechnology Information, which was established in 1988. It is a database of databases and collaborates with EMBL (European Molecular Biology Laboratory) and DDBJ (DNA Data Bank of Japan). It hosts various databases such as GenBank, PubChem, PubMed, and others. It operates under the National Institutes of Health. Of these databases, the NCBI taxonomic database is discussed here (NCBI Resource Coordinators, 2017).
NCBI TAXONOMIC DATABASE
It is well known that taxonomy involves a huge amount of data, yet for a long time there was no system to store it. Although NCBI was created on Nov 4, 1988, the taxonomy project at NCBI and its collaborating bodies started only in 1991. The data shared until then were inconsistent, and entries were updated on an irregular basis. Entrez was later used to link information, but there was no common platform for taxonomy until three scientists, David Hillis, John Taylor, and Gary Olsen, came together to propose an alternative solution.
For the first time, NCBI created a separate domain for a database containing taxonomic information on the public sequence entries. The taxonomic database provides a nomenclature and naming system for organisms. It provides the structure of classification as well as a phylogenetic tree describing the related families, and it includes information regarding evolution. However, it does not include groupings of organisms that are not closely related. It also contains specialty databases for different kinds of organisms. Entrez provides a specific index number for each entry, along with crosslinks to other entries.
According to 2011 data, the NCBI taxonomic database contains 234991 entries with proper scientific names and 405546 entries without them. Formal scientific names follow the relevant code of nomenclature: ICBN for plants, ICZN for animals, and ICNB for prokaryotic organisms. Entries with scientific names carry the author citation; if the citation is in parentheses, the organism was first described under a different name, and the name shown is the current one (Federhen, 2012).
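The classification structure can be pictured as a table in which every entry stores the identifier of its parent, so that a lineage is read by walking upward to the root. The sketch below assumes a small hand-typed table; it is heavily simplified (intermediate ranks between the family and the root are omitted), and the names `NODES` and `lineage` are illustrative, not part of any NCBI tool.

```python
# Toy node table: taxid -> (parent taxid, rank, name).
# Simplified on purpose: the real hierarchy has many more ranks
# between Hominidae and the root.
NODES = {
    1: (1, "no rank", "root"),  # the root is its own parent
    9604: (1, "family", "Hominidae"),
    9605: (9604, "genus", "Homo"),
    9606: (9605, "species", "Homo sapiens"),
}


def lineage(taxid, nodes=NODES):
    """Return the path from the root down to `taxid` as (taxid, rank, name) tuples."""
    path = []
    current = taxid
    while True:
        parent, rank, name = nodes[current]
        path.append((current, rank, name))
        if parent == current:  # reached the root
            break
        current = parent
    return list(reversed(path))


for tid, rank, name in lineage(9606):
    print(tid, rank, name)
```

Storing only the parent link keeps each entry small while still allowing the whole tree to be reconstructed.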
Many bacterial species found in nature have still not been grown in vitro. Species that cannot be linked to any formally identified species are recorded in the database as candidates. The database also contains information on extinct species (around 95). The NCBI taxonomic database does not have a verification system of its own. Naming brings further problems. The same common name may be used for different organisms. Duplicate names exist at the genus and family levels and cannot always be avoided at the species level. NCBI takes this into account and uses a specific binomial name in such cases.
The NCBI taxonomy therefore holds several kinds of names: the scientific name, which can be formal or informal; the common name, for easy accessibility; and various other names for better understanding. The name used in a submission can change when the paper is officially published, which can create confusion; that is why informal names are used. Misspellings are also recorded in the NCBI taxonomic database, as it does not change or alter the data submitted to it.
GenBank and BLAST names are also provided: the GenBank name reflects how the information is stored in the flat-file format, and the BLAST name is used for a comparatively larger group of organisms. Anamorph and teleomorph names are given for fungal species. All these names are stored for easy retrieval of information from the database, so that even newcomers and young scientists can access it easily. For example, the species name is often not recognizable; in that case, one can quickly identify and find information through the BLAST name stored in the database (Federhen, 2015).
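The way several kinds of names point to the same entry can be sketched as a lookup index. The rows below are a small hand-typed illustration: the misspelling "Homo sapeins" is hypothetical, and the function names `build_index` and `resolve` are made up for this example.

```python
from collections import defaultdict

# (taxid, name text, name class). Illustrative rows only.
NAMES = [
    (9606, "Homo sapiens", "scientific name"),
    (9606, "human", "genbank common name"),
    (9606, "Homo sapeins", "misspelling"),  # hypothetical misspelling
    (10090, "Mus musculus", "scientific name"),
    (10090, "house mouse", "genbank common name"),
    (10090, "mouse", "common name"),
]


def build_index(names):
    """Map each lower-cased name to the set of (taxid, name class) pairs using it."""
    index = defaultdict(set)
    for taxid, text, name_class in names:
        index[text.lower()].add((taxid, name_class))
    return index


def resolve(query, index):
    """Return every (taxid, name class) pair matching `query`, ignoring case."""
    return sorted(index.get(query.strip().lower(), set()))


INDEX = build_index(NAMES)
print(resolve("Human", INDEX))
print(resolve("homo sapeins", INDEX))
```

Because misspellings and common names are indexed alongside scientific names, a query in any of these forms leads to the same identifier.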
Access to the taxonomic database is provided through the taxonomy browser or through Entrez, both of which rely on the taxid. Each taxonomic element is given a unique identifier, the taxid, which belongs to a single taxon only. It is one of the first databases to use the Entrez system, and it stores information in a hierarchy. The rank of the searched item is shown, and date-related information, such as the date on which the information was stored, is also provided. Data can be retrieved with a Perl script through the utility service provided. Another feature is saving searched information; for this, one has to create a My NCBI account and store the information there.
One can also get regular updates on the searched information, and queries can be sent by email to the address given on the web page. Information is shown in two ways: in the first, the database displays the requested item on a hierarchy page; in the second, complete information is shown for the requested item. Both kinds of information can be retrieved through two types of links provided by the Entrez system.
One of the taxonomic database's exclusive features is that it allows a wildcard search mode. A table dump feature is also provided for adding, removing, or editing information. In this way, the taxonomic database provides a platform for analyzing taxonomy at a much more manageable level; even a newcomer can search for the desired information quickly. The taxonomic database is one of the exclusive databases that NCBI provides. For a simple search, one types the name of the species, and the information is shown in the browser (Pafilis et al., 2013).
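Programmatic retrieval is usually done through NCBI's E-utilities, and the same requests that a script would send can be written in Python. The sketch below only builds the request URLs and does not contact the server; the function names are illustrative, and the choice of `retmode` and `retmax` values is an assumption made for the example.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=20):
    """Build an ESearch URL that looks up `term` in the taxonomy database."""
    params = {"db": "taxonomy", "term": term, "retmode": "json", "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(taxid):
    """Build an EFetch URL that retrieves the full record for one taxid."""
    params = {"db": "taxonomy", "id": taxid, "retmode": "xml"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


print(esearch_url("Homo sapiens"))
print(efetch_url(9606))
```

A search typically returns one or more taxids, which are then passed to the fetch step to obtain the rank, lineage, and names of each entry.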
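Wildcard searching over a table dump can be sketched as follows. The sample text imitates the tab-and-pipe layout of the names file in the taxonomy dump, but it is a hand-typed excerpt, and Python's `fnmatch` is used here as a stand-in for wildcard matching; it is not claimed to reproduce the matching rules of the Entrez system. The function names are illustrative.

```python
from fnmatch import fnmatchcase

# Hand-typed sample in the dump layout: taxid | name | unique name | name class |
SAMPLE_DUMP = (
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n"
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n"
    "9605\t|\tHomo\t|\t\t|\tscientific name\t|\n"
    "10090\t|\tMus musculus\t|\t\t|\tscientific name\t|\n"
)


def parse_names(text):
    """Parse dump lines into (taxid, name, name class) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Drop the trailing "\t|" and split on the field separator.
        fields = line.rstrip("\t|\n").split("\t|\t")
        rows.append((int(fields[0]), fields[1], fields[3]))
    return rows


def wildcard_search(pattern, rows):
    """Return (taxid, name) pairs whose name matches `pattern`, ignoring case."""
    pattern = pattern.lower()
    return [(taxid, name) for taxid, name, _ in rows
            if fnmatchcase(name.lower(), pattern)]


ROWS = parse_names(SAMPLE_DUMP)
print(wildcard_search("homo*", ROWS))
```

Lower-casing both the pattern and the names makes the search case-insensitive regardless of platform.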
Stevens H. Life out of sequence: a data-driven history of bioinformatics. University of Chicago Press; 2013 Nov 4.
Gauthier J, Vincent AT, Charette SJ, Derome N. A brief history of bioinformatics. Briefings in bioinformatics. 2019 Nov;20(6):1981-96.
Mount DW. Bioinformatics: sequence and genome analysis. New York: Cold Spring Harbor Laboratory Press; 2001 Mar 15.
NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic acids research. 2017 Jan 4;45(Database issue): D12.
Federhen S. The NCBI taxonomy database. Nucleic acids research. 2012 Jan 1;40(D1): D136-43.
Federhen S. Type material in the NCBI Taxonomy Database. Nucleic acids research. 2015 Jan 28;43(D1): D1086-98.
Pafilis E, Frankild SP, Fanini L, Faulwetter S, Pavloudi C, Vasileiadou A, Arvanitidis C, Jensen LJ. The SPECIES and ORGANISMS Resources for fast and accurate identification of taxonomic names in the text. PloS one. 2013;8(6).
Thomas C, Essafi H, inventors; Commissariat al Energie Atomique et aux Energies Alternatives, assignee. Process for the automatic creation of a database of images accessible by semantic features. United States patent US 7,043,094. 2006 May 9.