Note to the Reader
Society for Neuroscience, Neuroscience 2002. Presented as a symposium in
Bioinformatics 2002: A Neuroscientist's Guide to Tools and Techniques for
Mining and Refining Massive Data Sets. Organized by Robert W. Williams, PhD,
and Dan Goldowitz, PhD.
Everyday bioinformatics for neuroscientists: From maps to
microarrays
Robert W. Williams
Center for Neuroscience, Department of Anatomy and Neurobiology,
University of Tennessee, Memphis, Tennessee 38163
Email questions and comments to
rwilliam@nb.utmem.edu
Introduction
This chapter covers two topics:
First, I would like to introduce (or reintroduce) you to some of the key
features of maps of the mouse and human genomes. These maps have become an
important structural substrate around which many types of biological
information are now being assembled. Those of you who have linked recently to
the NCBI, ENSEMBL, and UCSC Genome Browser sites for the human and mouse genomes
will have encountered the complex graphic conventions and acronyms that are
used to display different types of genetic information. The progress in the
last year has been astonishing and will have an impact on research that is
carried out in most laboratories that have a molecular or genetic research
angle.
My second aim is to summarize some of the basic informatics and
computational tools and tricks used to manage large and small data sets. The
scale of this work can range from a modest stereological analysis of cell
populations in a few dozen cases, to large microarray databases, through to
huge image data sets (see the chapters by P. Thompson and M. Martone). Most
of us now use spreadsheets such as Excel in some capacity to manage lab
data. I hope to show you a few useful tricks for managing large Excel
spreadsheets. But I mainly hope to convince/encourage you that it is easy
and worthwhile to extend beyond disjointed sets of Excel spreadsheets and to
become comfortable, even proficient, using a simple relational database such
as FileMaker or Microsoft Access. Over the past few years our group has
become completely dependent on relational databases. Databases have replaced
notebooks and spreadsheets for most lab work and even for some primary data
analysis. The improvement in lab data handling has been amazing, and
initially unrelated files and data sets can often be merged easily. Best of
all, our lab data are now accessible using an Internet connection from any
computer in the lab or across the world. Internet databases are obviously
far easier to replicate, archive, and distribute than raw data stuck in a
notebook.
Topic 1: Physical and Genetic Maps
Maps come in two major flavors: physical and genetic. Physical is an odd
word in this context, but signifies that the map is based on sequence data
and on an assemblage of YACs, BACs, and other clonable pieces of chromosomes
that have been ordered into a contiguous stretch of DNA, preferably without
any interruptions or ambiguities. The NCBI site at
www.ncbi.nlm.nih.gov/genome/seq/NCBIContigInfo.html has a fine description
of how the mouse physical maps (genome sequence) are being assembled.
Quality of these physical maps is now vastly improved over the situation
even one year ago and the progress will continue for several more years.
The unit of measure of a physical map is generally a base pair or nucleotide
(bp or nt). In humans, by convention the 0 bp position is at the telomeric
tip of the short P arm of each chromosome (usually illustrated at the tops
of most figures) and the end is at the tip of the long Q arm. Murine
chromosomes have extremely short P arms (all chromosomes are acrocentric),
and the 0 bp position is within a few megabases (Mb) of the centromere. A
typical chromosome in human or mouse is between 60 and 200 Mb long.
The great majority of genes have now been physically mapped in several key
species (although sometimes they go unrecognized for a while), and the
phrase gene mapping is beginning to lose its original meaning. The focus
now is turning away from mapping genes to mapping phenotypes across sets of
chromosomes and genes. By mapping a phenotype, I actually mean finding the
set of polymorphic genes (genes with multiple alleles) that modulate some
trait, for instance the number of tyrosine hydroxylase-positive neurons in the
substantia nigra or risk of developing Alzheimer disease. I’ll come back to
this topic.
In contrast, genetic maps are based on a somewhat more abstract analysis of
the frequency of recombination events that occur along paired homologous
chromosomes during meiosis. The greater the distance in base pairs or
centimorgans between two points on a single chromosome, the more likely it is
that a recombination will occur between those two points and break up the
original arrangement of genes on the parental chromosome. That original order
is called the parental haplotype, and the new order produced by recombination
is called the recombinant haplotype.
Until a few years ago, all genetic maps were constructed by computing the
frequency of recombination between genes and markers on chromosomes. The use
of the term genetic in this context seems inappropriate or superfluous, but
the idea was to minimize confusion: genetic maps are sometimes referred to
as meiotic maps, linkage maps, recombination maps, or haplotype maps (that
really helps to minimize confusion!), and the standard unit of measure of a
genetic map is the centimorgan (cM): two loci are one centimorgan apart when
a recombination occurs between them in 1% of meioses. Chromosomes usually
measure from 50 to 300 cM.
Box 1: Markers, SNPs, Microsatellites, QTLs, etc.
In dealing with maps of various types you will need to know
how some key vocabulary is used by most geneticists: What is a marker, a
microsatellite, a SNP, a polygene, a locus, and a QTL? A marker is often a
non-functional but polymorphic stretch of DNA, for example a short
microsatellite or a single nucleotide polymorphism (a SNP).
A microsatellite (a term that derives from hybridization characteristics)
is a highly repetitive DNA sequence that tends to be highly polymorphic
because polymerase has a very hard time replicating this boring DNA
accurately. Consider the sequence CAGCAGCAGCAGCAG: CAG is a
tri-nucleotide microsatellite repeat that in the right reading frame
will translate into a string of glutamine residues. If a microsatellite
is in an exon, and if the number of repeats is abnormally large, then
bad things can happen to neurons: Huntington disease is an extreme
example. Markers, whether SNPs or microsatellites, will always be useful
for efficiently screening the structure of genomes and the inheritance
of blocks of DNA and blocks of genes. Markers that happen to be within
genes, such as the key microsatellite in the huntingtin gene are of
course interesting in their own right, especially when they correspond
to resistance or susceptibility to a disease.
A polygene is an odd term that refers to the set of polymorphic genes
that collectively control the variation in a trait. For example, the
BRCA1 and BRCA2 genes are part of a cancer susceptibility polygene.
Usually, we do not know the membership of a polygene; we just know that
a small or large number of scattered genes modulate some trait. Finally,
what is a locus? This is a term used to hedge bets. We would like to
call everything a gene, but many times we only know the approximate
chromosomal position that appears to contain factors that modulate
variation in a trait. This chromosomal region may contain a single
causative gene or it may contain a cluster of genes that collectively
modulate a trait. The safe term is locus. A QTL is a so-called
quantitative trait locus. That translates as follows: a chromosomal
region that harbors one or more polymorphic genes that influence the
variation in a trait in a graded (quantitative) manner. QTLs are
relatives of the modifier loci that one sometimes hears about in the
context of major disease genes and knockouts. A modifier locus is
usually a QTL that modulates the severity of a phenotype.
The frequency of
recombination is variable and depends on the chromosome, the species or
strains, and the sex. Genetic maps are elastic. Genes and markers on genetic
maps have the same order, but distances vary among experiments and
populations. A useful metaphor: Genetic maps are similar to maps that
measure the separation between cities in terms of the standard driving times
required to get from one to another. Those times will be very contingent. In
contrast, a physical map is structural and not subject to much change.
Let’s look at the types of maps that are now available on-line at NCBI.
Reading from the left side of Figure 1, we first see a cytogenetic
ideogram of the smallest of the mouse chromosomes, Chr 19. The Zoom is at its
lowest setting and the lines and columns to the right side cover most of Chr
19. The left-most line is the approximate distance in millions of base pairs
(M in the figure) from the tip of the chromosome; the right-most line is the
genetic map measured in centimorgans. Again, there is no single genetic map,
but many alternatives, and the alternative that is displayed by NCBI is the
Mouse Genome Informatics group’s consensus map. There are lots of acronyms
for these and other maps (e.g., FPC for fingerprint clone map), and you can find
out what each line or trace means by clicking on the column headers.
Will genetic maps fade out into the history of science as physical maps get
better and better? Absolutely not. The simple reason is that when we try to
discover the genetic basis of differences in phenotype, we almost invariably
rely on recombination events to test the likelihood that a sequence variant
is associated with a variant phenotype. (Cytogenetic abnormalities are an
important exception.) Most of the discoveries of genes and gene
polymorphisms (alleles) associated with diseases rely on probabilistic
recombination events—either the historical recombinations between
populations or the more recent recombinations that are unique to large and
small families. Even if we could snap our fingers and sequence the entire
genome of everyone on earth, we would still end up tracking the sites of
recombinations and their relations with variation in phenotypes.
Mapping Genes
Almost any sequence of nucleotides from Drosophila, C. elegans, human, and
mouse can now be physically mapped using BLAT or BLAST to the nearest base
pair in a matter of seconds (see Effective Mining of Information in Sequence
Databases, by David Deitcher in this Short Course). Jim Kent’s BLAT program
illustrated below is a remarkable web tool that works well for mouse and
human sequence. Paste any nucleotide or peptide sequence into BLAT at
genome.ucsc.edu/cgi-bin/hgGateway?db=hg12 and within 1–2 seconds you will be
rewarded with a list of hits. This new resource made it possible, in
collaboration with John Hogenesch and colleagues at the Genome Institute of the Novartis Foundation,
to locate the base pair position of almost all the GenBank entries used to
make the Affymetrix U74Av2 GeneChip (Fig. 5).
If you simply need to explore a genome location to view sequence or
intron-exon structure, or to fish out promoter motifs, then just enter a key
word and the BLAT search will deliver you to a particular part of the genome.
In Figure 2, I entered a search for HOXB8 and then zoomed out to get a
view of the entire human HOXB complex on Chr 17.
Figure 1. Physical and genetic maps of mouse chromosome 19 from the
National Center for Biotechnology Information (NCBI). This figure can be expanded to reveal
fine details and sequence from almost any region.
If you visit this impressive web site you can get complete descriptions of
the various traces that are essentially graphical annotations and summaries
of the human and mouse genomes. You can zoom into the level of the
nucleotide sequence.
Mapping Brain and Behavioral Traits
Mapping phenotypes is a much more difficult task these days than locating an
arbitrary gene sequence. When we talk about mapping a gene that influences
circadian rhythm, neuron number, anxiety, susceptibility to Parkinson
disease, alcoholism, or schizophrenia, we are really talking about matching
differences in structure or function to one or more chromosomal regions;
so-called gene loci (see BOX 1). We would love to map genes for Parkinson
disease, but what that usually means is that we would like to identify
statistically significant associations between variability in susceptibility
to Parkinson disease and genetic polymorphisms (variants) that may be
distributed widely across the genome. In other words, we are mapping a
phenotype to multiple regions of the genome. This is the crux of forward
genetics. If the process of mapping these traits intrigues you, then link to
a previous Short Course tutorial on forward genetic methods at
www.nervenet.org/papers/shortsourse98.html.
This Short Course contains much additional information on procedures for
mapping traits and genes.
Chromosome maps have a complex and heterogeneous structure. This is visible
at the cytogenetic level as differences in banding patterns and at a finer
grain as large fluctuations in mean gene density. The haphazard way in which
chromosomes differ between even fairly closely related species demonstrates
abundant chromosomal plasticity. However, gene location, order, and
orientation can also be important as highlighted by the conservation of the
HOXB gene families illustrated in Figure 2 from the extremely useful Genome
Browser web site at the University of California Santa Cruz.
Figure 2. Detailed view of the human HOXB complex on chromosome 17
taken from the University of California Santa Cruz Genome Browser.
Topic 2: Excel and relational databases
Bioinformatics is closely associated with genomics and the analysis of
sequence data and maps (see Box 2), but in this section I would like to
broaden that definition to include handling information that is typically
generated and processed in laboratories every day. In the biotech and
pharmaceutical industry, work of this type is handled by a LIMS (a
laboratory information management system). This type of everyday
“bioinformatics” often starts with simple decisions about unique case IDs
and identifiers to be used in experiments, extends through to the
organization, use, and security of lab notebooks, and often ends with the
extraction, analysis and archiving of data and experiments with spreadsheets
and statistical programs. Most of this type of information handling is taken
for granted and many of us (and especially our mentors) assume that there is
not much room for modification or improvement in the daily cycle of data
generation, analysis, and publication.
In fact, the efficiency and sophistication of the day-to-day aspects of data
acquisition and handling can be substantially improved. It is becoming more
important to have a lab database and a web site for more than just a
curriculum vitae and a set of pdf files. Lab web sites are becoming one of
the most effective ways to communicate results. www.nervenet.org provides a
good example of how our lab publishes data on-line.
In this section I will make some suggestions about how to move in the
direction of using relational databases to improve lab informatics. The
expense of entering this new sphere is modest and the gain in scientific
efficiency can be substantial. Best of all, these new tools make
collaborative research across cities and continents much more practical.
Excel: uses and abuses
Exposure to practical lab bioinformatics often starts with Excel. Excel has
become a pervasive (almost obligatory) vehicle for exchanging data by email. It
is also a very powerful tool for analysis. Not many of us have read or
reread the Excel manual: we usually learn on the fly. Let me summarize some
of the key features:
1. File size. Excel has a limit of 65,536 rows and 256 columns. That
is usually not a problem. Our lab still uses Excel for some aspects of
microarray analysis. You can easily pack 12,500 rows and 240 columns worth
of data into an Excel file and you can have multiple spreadsheets in a
single file. We have several Excel files that are about 120 MB in size, and
the program runs reliably if given 300 MB of memory. However, for all but
the smallest projects, it is not a good idea to store data files long term
in Excel. You will hear more about this in the next section. In brief, Excel
is for analysis—not for archiving and databasing. Running up against the
table size limits of Excel is not hard these days. If you begin to work with
even a single Affymetrix GeneChip at the cell level (about 500,000
cells/chip) then you will have to use another software tool (SAS, Systat,
SPSS, S-Plus, Matlab, DataDesk, FileMaker, MySQL, PostgreSQL, etc.). More on
this later in the section on Relational Databases.
2. Transposing data. It is easy to transpose a data set in Excel
(that is, switch rows to columns and columns to rows). Select the region of
interest and copy it. Then select the upper left cell of the destination for
the transposed data and use the Paste Special command. There is a check box
labeled “transpose.” Paste Special is a very useful feature that we use
extensively to convert equations to values. This can reduce RAM requirements
and speed execution. Keep equations if you need them permanently for
updating. But if you just want the values, convert equations to values using
Paste Special. You can also transpose values and leave formats alone.
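If a table has already outgrown Excel, the same transpose is a one-liner in
Python with the pandas library. This is only a sketch; the file names below
are hypothetical examples, not files referred to elsewhere in this chapter.

    # Sketch: transposing a table outside Excel (file names are hypothetical).
    import pandas as pd

    table = pd.read_excel("counts.xls", index_col=0)   # first column holds the row labels
    table.T.to_excel("counts_transposed.xls")          # rows become columns and vice versa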
Figure 3. Using lookup functions in Excel to exchange selected
data between files. Details of this method are described in the text.
Note the equation at the top of this figure. This equation places new data
from the Neocortex database (left side) into the Neocortex column of the
Caudate database (right side).
3. Merging complex tables that share a unique field is easy to do in
Excel. Let’s say you have an Excel table consisting of 6,000 gene
transcripts expressed in the caudate nucleus. You have another list of
12,000 gene transcripts with data on neocortex expression sent to you by a
colleague. You want to extract the neocortical values and align them with
the set of 6000 caudate transcript values. The problem is that the tables do
not overlap perfectly. The solution is simple. If the two tables share a
common field type, for example an Affymetrix ID number or a GenBank
accession number, you are in business. Just use the vertical lookup command
as shown in Fig. 3. Excel will help explain the use, but here is my version
of help: open both files, then add a new column in the Caudate Table labeled
Neocortex. Type in a variant of the equation that is listed toward the top
of the next figure. These equations have the form: =VLOOKUP(CellID,
LookupTable, Offset, FALSE). The CellID (A2 in the example below) is the
spreadsheet cell that contains the unique ID that both tables share (the
Probe Set ID 92996_at in this example). The LookupTable is the region of the
Neocortex table that Excel interrogates to find the single matching row (row
2773 in the figure); the Offset is an integer that instructs Excel to copy data from
the Nth column to the right of the ID column. In this case, the offset is 3.
FALSE is a flag that instructs Excel to use only perfect matches. Make sure
that this equation works for the first few cells in your new column and then
copy the formula down the whole column. You may need to put dollar signs in
front of some cell references to lock the reference in place so that the
definition of the table does not change as you copy down the column.
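If the tables eventually outgrow VLOOKUP, the same left-join can be sketched
in a few lines of Python with pandas. The file and column names used here
(caudate.xls, ProbeSetID, Expression) are hypothetical stand-ins for your own
tables.

    # Sketch of the VLOOKUP-style merge in pandas (names are hypothetical).
    import pandas as pd

    caudate = pd.read_excel("caudate.xls")        # ~6,000 transcripts
    neocortex = pd.read_excel("neocortex.xls")    # ~12,000 transcripts

    # A left join on the shared ID keeps every caudate row and pulls in the
    # matching neocortex value, the equivalent of VLOOKUP with the FALSE flag.
    merged = caudate.merge(
        neocortex[["ProbeSetID", "Expression"]].rename(columns={"Expression": "Neocortex"}),
        on="ProbeSetID",
        how="left",
    )
    merged.to_excel("caudate_with_neocortex.xls", index=False)

Caudate rows with no match simply receive an empty Neocortex cell, much as
VLOOKUP returns #N/A for IDs it cannot find.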
4. Excel as a statistical analysis program. Simple statistics (means,
medians, standard errors) can be computed quickly for thousands of rows or
columns of data in Excel. This is an ideal use of Excel. It is also possible
to perform tens of thousands of t tests in Excel in less than a minute. If
you have ten arrays worth of data (5 wildtype and 5 knockout array data
sets), then you can perform a quick t test for every transcript using the
formula:
=TTEST(WT1:WT5, KO1:KO5, 2, 3)
WT1:WT5 is the range of the wildtype data in a single row (five columns
worth of values). KO1:KO5 is the same range for the knockout samples. The
parameter 2 instructs Excel to compute the 2-tailed probability. The final
parameter 3 instructs Excel to assume that the variance of the two groups is
not equal. Excel will return the probability of the t test rather than the t
value. If you have done any array work you will already be familiar with the
multiple tests problem (see the chapter in this Short Course by Dan
Geschwind and colleagues). Even if no transcripts truly differ between the
groups, an array consisting of 10,000 transcripts should generate about 500
false positive results at P < 0.05, 100 at P < 0.01, and 10 at P < 0.001. If you plot the P
values against their rank order (rank on the x axis from lowest to highest P
values, and the actual P value or log of the P value on the Y axis), then
you will end up with an interesting plot that can be helpful to estimate how
many false discoveries you are making at any given P value. (For more on the
Benjamini and Hochberg method of false discovery rates see
www.math.tau.ac.il/~roee/index.htm ).
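For readers who prefer to script this step, here is a sketch of the same
per-transcript t tests plus a discovery count at a 5% false discovery rate
using the Benjamini and Hochberg step-up rule, written in Python with NumPy
and SciPy. The data matrices below are random placeholders standing in for
real wildtype and knockout arrays.

    # Sketch: Welch t tests for every transcript and a Benjamini-Hochberg
    # count of discoveries at a 5% FDR. Data below are random placeholders.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    wt = rng.random((10000, 5))    # 5 wildtype arrays x 10,000 transcripts
    ko = rng.random((10000, 5))    # 5 knockout arrays

    # equal_var=False mirrors the "3" flag in Excel's TTEST (unequal variances);
    # the returned P values are two-tailed, like the "2" flag.
    t, p = stats.ttest_ind(wt, ko, axis=1, equal_var=False)

    # Sort the P values, compare each with its Benjamini-Hochberg threshold,
    # and count how many transcripts survive a 5% false discovery rate.
    p_sorted = np.sort(p)
    ranks = np.arange(1, p_sorted.size + 1)
    below = np.nonzero(p_sorted <= 0.05 * ranks / p_sorted.size)[0]
    n_discoveries = below.max() + 1 if below.size else 0
    print("transcripts passing a 5% FDR:", n_discoveries)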
It is not a good idea to use Excel in place of sophisticated statistics
programs. If you are gearing up for regression analysis, ANOVA,
non-parametric statistics, factor analysis, principal component analysis,
then buy one of the many good statistics packages. SAS, SPSS, StatView,
Matlab, and DataDesk are powerful tools. DataDesk in particular is an
amazing program that makes working with very large data sets more like a
game than a chore. We routinely review all of our array data with DataDesk
and use this program to generate draft figures for papers. If you buy this
inexpensive program, be sure to work through the excellent manual. The
rewards are ample.
5. Excel to normalize array data sets. This is a good use for Excel.
Excel can compute rank orders: =RANK(TEST_CELL, ALL_CELLS); compute the
logarithm base 2: =LOG(VALUE, 2); and compute the Z-score for a cell:
=STANDARDIZE(VALUE, AVERAGE, STDEV). In many of these formulas you will need to lock one cell reference
so that values do not change when you copy or fill. Use the dollar sign to
lock a reference in a formula, for example if the cell that contains the
average is C12450, then enter it as C$12450. If you copy down the column
then the reference to the average will not change. If you copy to the right
however, then the reference may change to D$12450, since the column letter
was not locked. To lock both use the format $C$12450.
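The same three transforms are also easy to script. Below is a minimal sketch
in Python; the column name "signal" is a hypothetical placeholder for one
array's worth of expression values.

    # Sketch: rank, log2, and Z-score transforms outside Excel.
    import numpy as np
    import pandas as pd

    # Hypothetical placeholder data for one array of 12,422 probe sets.
    chip = pd.DataFrame({"signal": np.random.default_rng(0).random(12422) * 1000})

    chip["rank"] = chip["signal"].rank(ascending=False)   # like =RANK(cell, all_cells)
    chip["log2"] = np.log2(chip["signal"])                 # like =LOG(value, 2)
    # Like =STANDARDIZE(value, average, stdev):
    chip["z"] = (chip["signal"] - chip["signal"].mean()) / chip["signal"].std()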
6. Using Excel as a database program. Don’t bother. Excel is great,
but it is definitely not a database program. If you have played with the
database functions that are built into Excel then you have all of the
experience and motivation that you need to graduate to one of several much
better, more powerful, and easier to use database programs.
Figure 4.
Internet access to over 50 laboratory databases hosted on
an inexpensive but robust lab computer: a Macintosh G4 running OS X and
FileMaker Server. The top panel is a partial listing of some of the
related databases, including CageDB (animal colony), CelloidinDB
(histology), EyeDB (eye phenotypes), DNADB (sample preparation), F2DB
(genotypes), etc.
FileMaker Pro and Access are programs with which you can get
comfortable in a few days. Read the next section for details on the
migration to relational databases.
Moving beyond Excel: Relational databases
A bioinformatic imbalance. We often do a great job handling the hard
problems in neuroscience and bioinformatics but neglect to take care
of the simple housekeeping. This imbalance can lead to serious problems.
Imagine a sophisticated research lab performing hundreds of microarray
experiments and generating and processing megabytes of data every day. Such
a lab will almost invariably have expensive bioinformatics tools (GeneSpring,
SpotFire, etc.) and computer systems for handling array data. But the same
lab may not have a simple database to track the large number of tissue and
RNA samples that are stored in several freezers. In order to confirm the sex
and age of all of the cases in the array database, they may have to rummage
through a set of lab notebooks and Excel spreadsheets. To determine the
size of the litter to which a particular animal belonged may involve the
laborious analysis of animal cage cards kept in a shoebox in the animal
colony. It may not be practical to determine even after an interval of a few
months which of several investigators, students, or technicians extracted
the RNA; did they use Trizol or RNAStat?
These examples highlight a problem in the typical application of
bioinformatics. We tend to think of bioinformatics as high level analysis
that is applied at the final stages of preparing papers for publication. The
bioinformatic tools enter ex machina to the rescue. Most of us run
microarrays and then learn how to apply sophisticated statistical methods to
parse and interpret patterns of gene expression change. Bioinformatics
should actually be built into a laboratory from the ground up. Data should
ideally flow from one stage and level to the next without the need to
transcribe or reformat. Below is one example that describes how to
accomplish this transformation in your laboratory information management.
The limits of spreadsheets. In 1994 we began a series of experiments
with the aim of estimating the population of retinal ganglion cell axons in
the optic nerves of several hundred (now over a thousand) mice. For each
optic nerve we typically counted 25 electron micrographs and entered the
counts per micrograph and per case in a single row of an Excel spreadsheet.
We calculated means and standard errors for each nerve and row of data.
There seemed to be no significant downside to this simple system.
There were a few minor problems that in aggregate became serious and that
illustrated the inadequacy of using Excel as a research database. How does
one handle right and left optic nerves when both sides are counted? That
seems simple; just enter the two sides in separate rows. The consequence was
that some animals were represented on two rows, whereas the majority were
represented on one row.
A second problem was that every time we added data for a particular strain
we had to rewrite some of the Excel formulas used to compute strain
averages. It became awkward to maintain both individual data and strain
averages in a single spreadsheet.
A third problem was keeping track of the latest version of the spreadsheet.
As many as three investigators were working on the spreadsheet each day, and
it was difficult to track versions and to make sure that information was
accumulated and collated correctly. This was a pain to do especially after
the Excel file grew to a large size.
A fourth problem involved the integration of other data types into the
spreadsheet. When we were writing up our results it became obvious that we
would need to consider variables such as brain weight, age, sex, body
weight, and litter size as potential modulators of retinal ganglion cell
axon number. Unfortunately, these data types were scattered in several other
databases. We diligently transcribed data from cage cards and other small
Excel databases and lab notebooks into our optic nerve spreadsheet. This
process introduced many transcription errors, and every new case that we
added required us to copy data from 2–4 other notebooks.
Box 2: Good reading and reference.
Biological sequence analysis: probabilistic models of proteins and
nucleic acids (1998) by R Durbin, SR Eddy, A Krogh, G Mitchison. $35.
The standard text on sequence analysis; the core topic of
bioinformatics. You can take a tour of the first 23 pages of this book
at Amazon.com.
Bioinformatics, a practical guide to the analysis of genes and proteins
2nd ed. (2001) edited by AD Baxevanis, BF Francis Ouellette. $70.
Provides an overview of common resources and an introduction to Perl.
The main drawback is that practical web-based bioinformatics is moving
so quickly that revisions are needed quarterly. A careful reading of
NCBI on-line documentation will cover much of the same ground. But if
you need hardcopy for bedtime reading...
Bioinformatics, the machine learning approach 2nd ed. (2001) by P Baldi,
S Brunak. $50. A more conceptual companion to the Practical Guide. Most
of the Amazon.com reviews are favorable, but I have to agree that the
coverage of topics I know best (array analysis) is of uneven quality.
Chapter 13 includes an armada of web resources for molecular
bioinformatics that is still useful.
Biometry, 3rd ed. (2001) by RR Sokal, FJ Rohlf. $96. This is one of the
best first courses you can take in statistics. Full of fine examples.
Were you aware that the standard deviation is a biased estimate and is
usually too low (p. 53)? This book does not have statistical tables.
Data reduction and error analysis for the physical sciences. 2nd ed.
(1992) by PR Bevington, DK Robinson. $50. Predates bioinformatics, but if
you want an absolutely lucid presentation of the foundations of data
analysis, with lots of practical advice and code snippets, this is the
right book. Includes some of the statistical tables missing from
Biometry.
Applied Multivariate Statistical Analysis, 5th ed (1998) RA Johnson, DW
Wichern. $105. This volume is a classic but rigorous treatment (“more
equations than words”) of the mind-bending world of multivariate
analysis. SK Kachigan wrote a much more accessible and shorter text:
($30, Multivariate Statistical Analysis: A Conceptual Introduction). LG
Grimm and R Yarnold assembled a collection of solid and accessible
chapters in Reading and Understanding Multivariate Statistics ($21) that
gets strong reviews on Amazon.
Fundamentals of database systems 3rd ed (2000) by R Elmasri, SB Navathe.
$70. A thorough textbook that will introduce you to the theory and
practice of implementing database systems.
The solution. The solution was obvious but seemed both risky and
impractical: convert our entire laboratory to a relational database
management system and begin to enter and reenter all data into a set of
interconnected database files or tables. The idea was to eliminate
laboratory notebooks and spreadsheets as much as possible. The process began
in the animal colony and extended through to post publication databases that
are now on-line.
What is a relation? The key feature of a relational database is that
it consists of an often large number of small tables of data that are linked
using key ID fields (for example the Probe_set_ID field in the previous
example).
Instead of trying to cram all data types into a single unwieldy table (the
Excel model), the idea is to parse data into more manageable and logical
pieces. The structure or schema of a whole lab database system is then
defined in large part by how information flows between and among the various
tables. In the context of an animal colony, rather than having a single
complex ColonyDB table, it is more effective to break up the data types into
four smaller related tables: a CageDB, a RackDB, an AnimalDB, and a LitterDB.
These four tables would all be linked by relations and key fields. For
example, each cage in the CageDB has a Rack_ID. The relation provides a
conduit for information flow and display. A very important idea in relational
database design is to minimize redundant data among the related tables.
Ideally, each item of data is entered only once, into the single most
appropriate table. You do not
want to have to enter the sex and age of an animal more than once. A perhaps
counterintuitive example: birth data would typically be entered into the
LitterDB, not the AnimalDB. The AnimalDB would inherit the date of birth
data by following the relational trail between a specific animal and the
litter to which it belongs. Minimizing data redundancy actually improves the
data integrity of the system. You won’t end up with animals that have two or
more different dates of birth. The organization of your database and how you
view and work with the data are two separate issues. Don’t confuse the
underlying database structure with the database interface. For example, the
form illustrated in Fig. 5 actually displays data from four different tables
and makes use of relations that rely on the Probe Set ID, the Gene Symbol,
the Locus Link ID, and the GenBank accession number. The layout of the form
can be changed in a matter of seconds to simplify data entry or viewing.
Once the right relations have been made it is also simple to compute new
values and new field types based on data in a multitude of different tables.
You can export and print data from any and all of the tables, and you can
compute new data types across the tables.
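To make the schema idea concrete, here is a minimal sketch of a four-table
colony database written against SQLite through Python. The table and field
names are hypothetical illustrations of the structure described above, not
our actual FileMaker layout.

    # Minimal sketch of a four-table colony schema (all names hypothetical).
    import sqlite3

    con = sqlite3.connect("colony.db")
    con.executescript("""
        CREATE TABLE RackDB   (rack_id   INTEGER PRIMARY KEY, room TEXT);
        CREATE TABLE CageDB   (cage_id   INTEGER PRIMARY KEY,
                               rack_id   INTEGER REFERENCES RackDB(rack_id));
        CREATE TABLE LitterDB (litter_id INTEGER PRIMARY KEY,
                               cage_id   INTEGER REFERENCES CageDB(cage_id),
                               birth_date TEXT,
                               litter_size INTEGER);
        CREATE TABLE AnimalDB (animal_id INTEGER PRIMARY KEY,
                               litter_id INTEGER REFERENCES LitterDB(litter_id),
                               sex TEXT,
                               strain TEXT);
    """)

    # An animal "inherits" its date of birth by following the relational trail
    # to its litter; the birth date itself is stored only once, in LitterDB.
    rows = con.execute("""
        SELECT a.animal_id, a.sex, a.strain, l.birth_date, l.litter_size
        FROM AnimalDB AS a JOIN LitterDB AS l ON a.litter_id = l.litter_id
    """).fetchall()

Because the birth date lives in a single table, it can never disagree with
itself, which is exactly the data-integrity point made above.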
Figure 5. Example of a one-to-many relation being used to track
and analyze microarray data. The primary database table contains 12,422
records, each of which corresponds to a unique Probe Set ID (94733_at in
this case). Each probe set, in turn, relates to 16 perfect matches held as
individual records in a second, lower-level database table: the probe
sequences shown in the lower panel. Selecting the Link to Ensembl
button (right side) opens a window on the www.Ensembl.org mouse sequence web
site. Apparently complex databases of this type are simple to make using
FileMaker.
Choosing a database is an important issue since you will probably
have to live with, manage, and pay for occasional upgrades of software for a
long time. The choice is not irrevocable, but migrating from one database to
another can take months. Even a “simple” upgrade can take weeks.
We considered and experimented with a few alternative relational database
programs, including Microsoft’s Access, FileMaker Pro, Helix, Acius’s 4D,
and Panorama. FileMaker was our final choice because of the ease of
implementing complex and visually self-explanatory tables and relations. It
lacks many sophisticated features expected on enterprise products like
Oracle 9, but that is not what we needed. FileMaker now has strong support
for Macintosh, Windows, and Linux platforms. Upgrades have kept pace with
technology without sacrificing ease of implementation. The interface with
Excel is also smooth, making FileMaker an easy upgrade to a relational
database system.
FileMaker vs. MySQL. We have compared the efficiency of implementing
database systems in FileMaker and a free and powerful relational database
called MySQL (Fig. 6). The Mouse Brain Library (www.mbl.org) was originally
implemented as a FileMaker database in just under two weeks by a high school
senior with strong programming skills. This web-accessible database has
performed admirably for several years with almost no unintentional downtime
and now accommodates a wide variety of images for approximately 3000
histological slides and over 200 strains of mice. The Internet interface was
not difficult to implement in FileMaker and allows rapid searches by
genotype for acquisition of images.
Figure 6. Internet implementations of the Mouse Brain Library (www.mbl.org)
using FileMaker or MySQL. The MBL, in concert with the iScope and a
collection of C++ and CGI-like web interface programs, delivers images
that range in resolution from whole slides (top) down to ~0.2 microns
per pixel per slide. The iScope is an Internet-driven microscope that
can deliver Z-axis image stacks in color and at sizes up to 1280x960
pixels. These stacks are suitable for on-line high-resolution stereology.
Once we had built and fully tested the FileMaker version, we then decided to
replicate the entire system using a free and powerful relational database
called MySQL on a Linux platform. This free implementation took a skilled
database programmer just over 3 months. That is not atypical for MySQL.
However, replicating the MySQL implementation from one site to another site
took less than a week. The moral is that if you want to maximize efficiency
of time and ease of implementation then use a database system that has a
strong and logical interface and high-level graphical interface tools. In
contrast, if you want to provide a free system for use by a broader
community then either convert to MySQL or PostgreSQL (both open source
databases that run on most major operating systems: see www.mysql.com and
www.us.postgresql.org). If speed is a major consideration (lots of array
files), then MySQL is now a faster database management system than
PostgreSQL or FileMaker. For a cogent comparison of these DBMS see
www.webtechniques.com/archives/2001/09/jepson/.
A precaution: There is a certain macho urge to use the most robust
heavy-iron commercial program you can get your hands on as part of a
laboratory database system. Oracle, Sybase, and similar high-end systems are
intended primarily for mission-critical 24/7 activity (student records,
payroll, etc.). Experts on databases generally know these systems well, and
they genuinely think they are being helpful by recommending Oracle with its
sophisticated transactional processing. But Oracle and Sybase are a mismatch
for a typical laboratory. Research and lab databases need to change on a
weekly basis. The layout of fields for data entry may change on a daily
basis. Local control, flexibility, and mobility are far more important than
raw processing speed or high-end feature sets. Don’t go hunting with a tank.
You need to know how to change the structure of your tables and the
layout of your entry forms, and how to efficiently export data for
downstream statistical analysis. A strong point in favor of Excel is its
transparency, and you don’t want to lose that advantage when moving to a
relational database. You need to retain full control of your own data.
Security. Backing up and making weekly permanent archives are both
critical. The difference between a backup and an archive is that the backup
is volatile on a daily, weekly, or monthly basis and will be overwritten at
some point. In contrast, archives are intended to be as permanent as
possible. Even simple systems such as FileMaker Server will
back up on any schedule you would like. Archiving to CD or DVD at the end of
the week is a new obligation that needs to be taken seriously, but that
would be true no matter what system you use.
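The weekly archive itself can be scripted in a few lines. The sketch below
simply copies a database file into a date-stamped archive folder that can
then be burned to CD or DVD; the file names and paths are hypothetical.

    # Sketch: a date-stamped archive copy of a database file (paths hypothetical).
    import datetime
    import pathlib
    import shutil

    src = pathlib.Path("colony.db")
    archive_dir = pathlib.Path("archives")
    archive_dir.mkdir(exist_ok=True)

    stamp = datetime.date.today().isoformat()                # e.g., 2002-10-18
    shutil.copy2(src, archive_dir / f"colony_{stamp}.db")    # then burn archives/ to disc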
Figure 7. Gene to protein synopsis taken from the Google image
archive (source: Rockefeller Univ.)
Acknowledgments
This work was supported in part by grants from the Human Brain Project (MH
62009). I thank my colleagues Drs. Lu Lu, David Airey, Glenn Rosen, Mel
Park, Guomin Zhou, Elissa Chesler, Siming Shou, Ken Manly, and Jonathan
Nissanov. Special thanks to an extraordinary group of programmers: Tony
Capra, Michael Connolly, Alex Williams, Nathan Laporte, Arthur Centeno, and
Yanhua Qu. Thanks to Emily Carps for help editing.
Williams RW (2002) Everyday bioinformatics for neuroscientists: from maps to
microarrays. In: Bioinformatics 2002: a neuroscientist’s guide to tools and
techniques for mining and refining massive data sets. (Williams RW,
Goldowitz D, eds) pp. XX–XX. Washington: Society for Neuroscience.
Copyright © 2002 by R.W. Williams