$Id: schema-overview.txt 199 2005-01-31 08:15:22Z lapp $

Version: 1.0
Author : Aaron Mackey
===================================================================

BioSQL v. 1.0 Schema Overview

This document describes some of the tables and fields in the BioSQL
schema.  It also aims to demonstrate functional capabilities using
example SQL. Design philosophies and expectations are presented with 
reasoning.

The BioSQL schema is described in four sections: Bioentry, Sequence 
Features, Ontology Terms, and Annotation.  A fifth section demonstrates
possible extensions to the schema that may be useful under some
circumstances.


I. Bioentry with Taxon and Namespace

BIOENTRY

This is the core entity of the BioSQL schema; a bioentry is any single
entry or record in a biological database.  The bioentry contains
information about the record's public name, public
accession and version, its description and an identifier field.
Finally, for working convenience with GenBank records, the division of 
GenBank can be specified in a 3 character field.

For example, this truncated GenBank record:

LOCUS       S63169S6                  22 bp    DNA     linear   PRI 25-AUG-1993
DEFINITION  NDP=Norrie disease {first three exons, microdeletion regions}
ACCESSION   S63178
VERSION     S63178.1  GI:386456
...
//

Would be stored in bioentry as:

name:		S63169S6
accession:	S63178
identifier:     386456
division:       PRI
description:	NDP=Norrie disease {first three exons, microdeletion regions}
version:	1


bioentries need not come from a public database; a bioentry from a
private lab database might look like this:

name:		MyFavGene1
accession:	MFD12345
identifier:     902772
division:       ion_ch
description:	Gene prediction from my secret organism
version:	10

In this case, the identifier 902772 is not an NCBI GI number, but is a
key to lookup this entry in the private database, "My Favorite
Database" (MFD).

BIODATABASE

A biodatabase is simply a collection of bioentries; one bioentry may
only belong to one biodatabase, but one biodatabase may contain many
bioentries.  biodatabase entities can be identified by their name:
"GenBank", "trembl", "MyFavoriteGenes", etc.  Databases may also be
further identified by an authority, the organization under which this
database name is officially mandated.

SQL example - fetch the accessions of all sequences from SwissProt:

SELECT DISTINCT bioentry.accession
FROM   bioentry JOIN biodatabase USING (biodatabase_id)
WHERE  biodatabase.name = 'swiss'

SQL example: Find the database, 'GenBank' or 'GenPept', that contains the
GI number 386456:

SELECT biodatabase.name
FROM   bioentry JOIN biodatabase USING (biodatabase_id)
WHERE  bioentry.identifier = '386456'
  AND  biodatabase.name IN ('genbank', 'genpept')

SQL example - how many unique entries are there in GenBank:

SELECT COUNT(DISTINCT bioentry.accession)
FROM   bioentry JOIN biodatabase USING (biodatabase_id)
WHERE  biodatabase.name = 'genbank'

SQL example - fetch the locus names for the latest versions of all
entries where the biodatabase name is 'swiss' (Mysql syntax):

SELECT MID(MAX(CONCAT(RPAD(LPAD(bioentry.version,5,'?'),10,'?'),
       bioentry.name)),11)
       FROM bioentry JOIN biodatabase USING (biodatabase_id)
WHERE  biodatabase.name = 'swiss'
GROUP BY bioentry.name

BIOSEQUENCE

In BioSQL, all databases have bioentries, but not all bioentries need
have raw sequence data associated with the entry.  The biosequence
table contains the raw sequence information associated with a
bioentry, and alphabet information ('protein', 'dna', 'rna').  
One bioentry may have only one biosequence associated with it, and 
vice versa: a given biosequence applies to only one bioentry.
Sequences may have their own version number, independent of its 
bioentry version information.  The length of the sequence is also 
stored for pre-calculated convenience.

Note: while the schema's basic structure might imply that bioentries
could be associated with multiple biosequences in a one-to-many
relationship between bioentry and biosequence, this is not the case:
the bioentry_id foreign key present in the biosequence table is
constrained to be unique, thus enforcing the one-to-one relationship
between the two tables.

Example SQL - what is the description of the longest sequence in GenBank?

SELECT   bioentry.description
FROM     bioentry
         JOIN biodatabase using (biodatabase_id)
         JOIN biosequence using (bioentry_id)
WHERE    biodatabase.name = 'genbank'
ORDER BY biosequence.length DESC
LIMIT 1

Example SQL - find all bioentries with protein sequences containing "ELVIS":

SELECT  bioentry.*
FROM    bioentry JOIN biosequence USING (bioentry_id)
WHERE   biosequence.seq LIKE "%ELVIS%"
  AND   biosequence.alphabet = 'protein'

BIOENTRY_RELATIONSHIP

Bioentries may themselves be related to one another (e.g., a PDB
record may be composed of multiple subrecords for separate chains, or
multiple SwissProt records may be associated with a given PFAM domain
entry).  These relationships are "typed" via links to ontology terms
using the term_id field.

TAXON, TAXON_NAME

These are tables to store basic taxonomic information about
the organism to which a given bioentry refers, and they reflect the 
structure of NCBI's taxonomy database. Each bioentry can be
associated with only one taxon, but many bioentries can be associated
with the same taxon. In order to get the most value from these tables
it's recommended that you use the BioSQL script load_taxonomy.pl
to populate them.

The taxon_name.taxon_id field is meant to store an NCBI
taxon id. The name_class field stores tags to describe taxonomic names
(e.g. "scientific name") and the name field stores the value (e.g 
"Homo sapien"). This flexibility allows us to store such things as 
synonyms and common names as well as the expected binomial.

The taxon table is designed to store the taxonomic relationship
between taxons found in the taxon_name table. The node_rank field
stores the class of the taxon (e.g. "species", "kingdom"). The 
parent_taxon_id contains the taxon id of the parent taxon, since there
should only be one parent in the taxonomic tree. The right_value and
left_value fields store values that are calculated and entered by the 
load_taxonomy.pl script. These arbitrary values are the upper and
lower bounds of "nested sets", one set for each taxa, where the set 
of the child taxa is contained within the larger set of the parent 
taxon. An example would be the set for the species Procyon lotor,  
365816 to 365823, contained within the set for the genus Procyon, 
365815 to 365828.

Note: Taxon is optional for bioentries. This is because certain
bioentries may not have a clearly identified taxon, or because the
concept of taxon may not be meaningful for the bioentry.

Example SQL - find the taxon id of the parent taxon for 'Homo sapiens'
using a self-join.

SELECT parent.ncbi_taxon_id
FROM   taxon AS parent
       JOIN taxon AS child
       ON child.parent_taxon_id = parent.ncbi_taxon_id
       JOIN taxon_name
       ON taxon_name.taxon_id = child.ncbi_taxon_id
WHERE  taxon_name.name = 'Homo sapiens';

Example SQL - find all human sequences:

SELECT * FROM biosequence
       JOIN bioentry USING (bioentry_id)
       JOIN taxon_name USING (taxon_id)
WHERE  taxon_name.name = 'Homo sapiens'

Example SQL -find the taxon id's of all the parental taxa in the 
Primate lineage using a self-join:

SELECT b.taxon_id FROM taxon as a
       JOIN taxon as b
       ON (a.left_value < b.right_value AND a.left_value > b.left_value)
       JOIN taxon_name
       ON a.taxon_id = taxon_name.taxon_id
WHERE taxon_name.name = 'Primate'


II. Sequence Features with Location and Annotation

SEQFEATURE

More information pertaining to a bioentry is stored as a
generic "feature" of the sequence, the semantics of which are defined
by associations with a specific "source" term and optional qualifiers
(see below under TERM).

LOCATION

The location of each seqfeature (or sub-seqfeature) is defined by a
location entity, describing the stop and start coordinates
and strand.  A seqfeature may have multiple locations (i.e. split
locations are handled).  Start and stop coordinates may be left NULL
to accomodate some forms of "fuzzy" locations.  Additionally, a
location may refer to a "remote" sequence, i.e. not the sequence
associated with the bioentry, this is accomplished by a dbxref_id link.

SEQFEATURE_RELATIONSHIP

Sequence features may also have associated sub-seqfeatures (with potentially
many-to-many parent-child relationships).  These relationships are
also "typed" via links to ontology terms using the term_id fields.


III. Ontology Terms and Relationships

TERM

An ontology (in the current usage) is essentially a dictionary of
terms in a somewhat-controlled vocabulary.  An ontology_term is used
to "label" a seqfeature's name ("exon", "CDS", "5' UTR", etc), as well
as its source ("GeneWise", "Glimmer", etc), and to define the types of
relationships between seqfeatures and their sub-seqfeatures (e.g. "is
composed of", "gives rise to", "transmembrane segments of" - see
seqfeature_relationship below).  While a seqfeature may have only one
term to describe its type and source, relationships between
seqfeatures and sub-seqfeatures may have multiple terms associated
with them.

TERM_RELATIONSHIP, TERM_DBXREF

However, the powerful utility of ontology terms is that they can be
associated with each other in hierarchies; e.g. a "sequence similarity
search" is a general term that includes more specific terms like
"BLAST result" or "HMMER PFAM result".  "BLAST result" may also be a
more specific term for "pairwise sequence alignment".  One might like
to further qualify these relationships by putting names on them: a
"BLAST result" is a "result from" a "sequence similarity search" and a
"example of" a "pairwise sequence alignment".  We refer to these as
"subject", "predicate", "object" (or, "parent", "relationship type",
"child").  The term_relationship table performs this mapping
between terms, using the subject/predicate/object terminology.  This
mapping (or "rule set") must itself be given an ontological namespace
(ontology_id), as the mapping may relate terms between separate
ontologies; we must keep track of where each "rule" comes from.

Finally, ontology terms themselves can be linked to external databases
via a many-to-many relationship using the term_dbxref table.

TERM_PATH, BIOENTRY_PATH, SEQFEATURE_PATH

All three of these tables are meant to store the "transitive closure"
of the respective *_relationship data; that is, if A is related to B,
and B is related to C, then A is related to C, and will have a row in
the table.  The definition of the type of relationship between A and C
depends greatly on the semantics of the individual relationships
between A and B and B and C (including the possibility that A and C
aren't actually related by any meaningful type, and should therefore
not appear in the table).  We leave it to individual implementors to
define the policy for building these paths.

A very generic policy would be to use the ontology of relationship types
involved between A and B and B and C, and choose the greatest common
denominator between the two relationship types (e.g. when the two
relationship types are the same, then A and C are related by the same
type; when the two relationship types differ, then A and C are related
by the first "supertype" that includes both relationship types).


IV. Annotation Bundle

Annotations are similar to Sequence Features in that they describe a
sequence, but unlike Sequence Features they have no locations on the
sequence, they are associated with the entire sequence. Annotations
may come with references and database ids.

BIOENTRY_REFERENCE

A given literature reference may be associated with many bioentries,
and a given bioentry may be associated with multiple references (thus
calling for the linking "bioentry_reference" table to map the
associations between each).  Furthermore, the rank field
may be used to define the order of the references for each associated
bioentry.  Lastly, start_pos and end_pos fields may be used
to associate references with specific locations on the bioentry.

COMMENT

Each bioentry can have one or more simple textual comments associated
with it, and the order of the comments may be specified by the
rank field.

BIOENTRY_QUALIFIER_VALUE, DBXREF_QUALIFIER_VALUE,
LOCATION_QUALIFIER_VALUE, SEQFEATURE_QUALIFIER_VALUE

Furthermore, ontology terms may be used to qualify dbxrefs,
bioentries, seqfeatures, and locations.  Multiple qualifier values can be
associated with each entity.  Together, this allows one to put
meaningful "labelled" data on these otherwise generic objects.  For
example, a SwissProt dbxref might have an additional qualifier value
that was the SwissProt name (GTM1_HUMAN) of the dbxref.

An alternative design philosophy is to break out overtly common
entities into their own entity table with explicitly named fields (as
has been done for literature references and taxa).  While an
ontology-driven seqfeature "meta-table" is theoretically capable of
storing any information about a bioentry, it is sometimes more useful
to have extraordinarily common entities represented by their own
tables. In a "view-capable" relational database, pre-computed SQL
SELECT statements may be used to generate a read-only view to obtain
entity-specific tables.  For example, a "SwissProt_dbxref" view could
be made that had all the dbxref fields plus a "SwissProt_name" field
containing the qualifier value discussed previously.

REFERENCE

Entries in a database may have cross-references to the literature.
The reference table stores each journal article, book chapter,
etc. that may be associated with a bioentry (or multiple bioentries).
A reference's location refers to the journal (including volume, index,
and possibly pages) or book in which the reference is found.  Neither
the location nor the authors fields have any canonical format, they are as
found in the bioentry record.  To help ensure uniqueness, a calculated
checksum (crc) is kept over the author, location, title
fields. Also, if provided by the data source, dbxref_id
will contain the MEDLINE number, or any other identifier if the
reference is indexed in another resource than MEDLINE.

DBXREF, BIOENTRY_DBXREF

Database cross references are links to records in other databases
(whether they be sequence databases or not).  The relationship between
bioentries and dbxrefs is many-to-many: one bioentry may have multiple
associated dbxrefs, and one dbxref may be associated with many
bioentries.  


V. Possible add-ons to the core BioSQL schema

Bioentry date stamping:

CREATE TABLE bioentry_history (
  bioentry_history_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  bioentry_id INT UNSIGNED NOT NULL FOREIGN KEY REFERENCES 
  (bioentry.bioentry.id)
  startdate DATE NOT NULL DEFAULT CURRENT_DATE,
  enddate DATE NULL
);

Example SQL - give me all the entries from GenBank as they existed on
Jan 1, 2002

SELECT bioentry.*
FROM   bioentry
       JOIN biodatabase USING (biodatabase_id)
       JOIN bioentry_history USING (bioentry_id)
WHERE  biodatabase.name = 'GenBank'
  AND  bioentry_history.startdate <= 2002-01-01 
  AND  (bioentry_history.enddate IS NULL
        OR bioentry_history.enddate > 2002-01-01)

  Alternative, if you want db history info as well:

CREATE TABLE biodatabase_history (
  biodatabase_history_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  biodatabase_id INT UNSIGNED NOT NULL FOREIGN KEY REFERENCES 
  (biodatabase.biodatabase_id)
  entrydate DATE NOT NULL DEFAULT CURRENT_DATE   -- date db was updated
  comment TEXT -- optional; maybe you want versioning here, I dunno
);

CREATE TABLE bioentry_history (
  bioentry_history_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  bioentry_id INT UNSIGNED NOT NULL FOREIGN KEY REFERENCES 
  (bioentry.bioentry_id)
  entered INT UNSIGNED NOT NULL FOREIGN KEY REFERENCES 
  (biodatabase_history.biodatabase_history_id)
  removed INT UNSIGNED NULL FOREIGN KEY REFERENCES 
  (biodatabase_history.biodatabase_history_id)
);

ALTER TABLE bioentry_history ADD CONSTRAINT check_entered_removed (
  bioentry_history.removed IS NULL OR bioentry_history.entered 
 <> bioentry_history.removed
);

Example SQL - same as before, retrieve all bioentries from GenBank as
they were on Jan 1, 2002

SELECT bioentry.*
FROM   bioentry
       JOIN biodatabase USING (biodatabase_id)
       JOIN bioentry_history USING (bioentry_id)
       LEFT JOIN biodatabase_history AS enter
	    ON (bioentry_history.entered = entered.bioentry_history_id)
       LEFT JOIN biodatabase_history AS exit
	    ON (bioentry_history.removed = exit.bioentry_history_id)
WHERE  biodatabase.name = 'GenBank'
  AND  entered.entrydate <= 2002-01-01
  AND  (exit.entrydate IS NULL OR exit.entrydate > 2002-01-01)

Advantage: you don't need to store N*M rows for every M database
updates, only N rows of date ranges (or database version refs).

Disadvantage: all historical bioentries remain in the database, even
ones that are no longer "current" - simple SELECTs must specify
enddate IS NULL (optionally, a "is_current" flag can be added to
bioentry) - again, a bioentry_current VIEW may be the best solution.


Sequence redundancy:

There is no utility here for handling biosequence redundancy; i.e. the
biosequence table cannot be easily used to generate non-redundant
sequence views.  An accessory table "biosequence_redundancy" could be
used to store redundant pairs (including a self-self pair) for those
who need it.

CREATE TABLE biosequence_redundancy (
  biosequence_redundancy_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  biosequence_a INT UNSIGNED NOT NULL FOREIGN KEY REFERENCES 
  (biosequence.biosequence_id),
  biosequence_b INT UNSIGNED NOT NULL FOREIGN KEY REFERENCES 
  (biosequence.biosequence_id),
  UNIQUE(biosequence_a, biosequence_b)
);

This redundancy table could be further "typed" by ontology terms.
Alternatively, all redundancies could be stored as pairwise
seqfeatures.

Example SQL: Give me all nonredundant biosequences from GenPept
(i.e. the NCBI "nr" database)

[ Nasty cross-joining SQL to be delivered ... ]

Feature set versioning:

Could already be accomplished with "dated" source ontology terms, but
that seems like a bastardization.  Suggestions welcome.
