Well, thank you very much. I'd like to thank David Landsman and David Lipman and everyone else
involved in organizing this symposium, for inviting me to participate. I felt a little bit horrified at
seeing the order of speakers and wondering whether I'd have any factoids left at all to share with you
about the early days of getting the DNA sequence databases off the ground, but I'll sort of skim through
and look for what might be different or novel relative to the others. I won't translate the title of my talk.
I also quickly realized that you would have seen the same title four or five times in a row,
so I'll explain where I got this in a slide or two. So we've been hearing a lot about the collaboration
and US/Japan and US/Europe and since I'm going to be nationalistic about my newfound home in Canada
over the past few years - and just to make the simple point that around the year 2000, Canada and
Canadian scientists got quite concerned about the race to do big-science in the context of
genomics and proteomics, and set up something called Genome Canada, and a number of
affiliated gene centers of which I'm affiliated with one, Ontario Genomics Institute, and have
pumped about one and a half billion dollars into genomics and proteomics projects over the last eight years.
So I thought I'd try to find something as a new starting point, and I'll come back to this towards
the end of the talk. In 1731, about 251 years before GenBank got started something called the
Library Company got started by Ben Franklin and a group of 49 or 50 friends, at which point
he had drawn up instruments of association, and it was the first lending library in North America.
It so happened it was private, but it was the model for public lending libraries that he went on
to establish, and I took the title of my talk from the Latin, the motto that they have inscribed
on their seal - this company still exists today - and translates briefly as "To pour forth benefits
for the common good is divine." Which I think is a good precedent and model and really
resonates with all of the themes you heard in Bruno's talk and the other talks so far. So I'm not
going to really dwell on all this, but just sort of setting a little bit of timeline. I believe the first
nucleic acid sequence was published around 1965. It was alanyl tRNA put together with a lot of
hard biochemistry and stitchery. Just again small on today's standards, but to show the sense in which
the world is racing along. The first so-called complete genome was a phage sequence phiX174 in 1977,
and that was a bit over 5,000 bases and, of course, there were many other sequences coming out
at that point in time. Excuse me, because of the new technologies that Sanger and Maxam and Gilbert
had established, and it really gave rise to a growing interest and enthusiasm and concern in the
community about being able to get all these sequence data on a database. So I won't talk about
the Rockefeller workshop. It's been mentioned several times. I will note that there was a group
at Los Alamos that was quite interested not only in the sequence data but also in the analysis.
This included Stan Ulam, Bill Beyer, Temple Smith, Mike Waterman, Walter Goad, and George Bell
and Michael and Temple both attended the Rockefeller workshop as did Howard Bilofsky, who
of course was part of GenBank later on from BBN. So sort of passing into the more intense period
of sorting out what should happen next - a really key point for the effort at Los Alamos was the arrival
of Minoru Kanehisa from the east coast of the US who began working with Walter's group,
and just everything, A to Z, from entering sequences to developing software tools for organizing the data
and sorting out how it should be organized and what processes will be involved, and I noticed
a letter just recently that Walter had sent to Minoru about ten years ago, saying that in looking
back to those early days that he thought Minoru had as much as anyone to do with the start of
the Los Alamos sequence library. So really, really tremendous impact on getting the effort going there.
Shortly after the Rockefeller workshop, the folks at Los Alamos were quite excited about the idea
of building such a resource and began collecting data and organized a small workshop where they
invited scientists, I think for the most part from the US, to visit around August of '79 to
discuss the possibility of building up an analysis and database resource. This resulted in or
was meant to feed into an unsolicited proposal that Los Alamos submitted to the NIH towards
the end of 1979, where they proposed to build such a resource. That was ahead of the curve
in terms of NIH deciding how it was going to do, what it was going to do. There was also a lot
of back and forth in the community and at the NIH about whether to fund and what resources
would be involved in the funding, and an analysis group as well as a database group, but I think
there was a nice note of this from Bruno that they reached their hundred KB point in mid-1980
or late 1980 and had a celebratory party. One of the ongoing challenges in working the databases
was trying to figure out how far apart and how many zeros to add on in order for the celebrations
not to be so fast upon themselves that they were no longer exciting. But there was a workshop
that you heard earlier that Europe was off and running. That and the fact that the community in
North America was still quite concerned that something get going led to a workshop in mid-1980
in July, organized by Elke Jordan and Marvin Cassman and chaired by John Abelson and I came
across a letter that Larry Kedes wrote to Walter Goad and perhaps to all participants in the workshop
on the second and last day of the workshop that sort of underlined their sense that they wanted to
stop having meetings and get on with establishing the database, where he says "The meeting adjourned
with the optimistic expectation that there would never have to be another one." At sort of the end
of the summer of 1980, Jim Fickett joined the group at Los Alamos. He was actually on the
faculty as a mathematician in Texas but got interested in what was going on in T-10
and gave up his faculty job to come back and take on a post-doctoral role as an entrée into the
T-10 group, and he very quickly settled into and got involved in a lot of the software tools that
were involved in building up the entry and analysis of the data that was going on in the database then.
So I think sort of ending this period of pre-GenBank was the formal publication as a lab
technical report in early '82, March '82, of the Los Alamos Sequence Library database report
that Jim, Walter, and Minoru put out, and that contained just under half a million bases at the time.
But I just thought I'd sort of give a quick survey of some of the characters I've been talking
about at this stage. Upper left corner is George Bell, who with Walter Goad established the
Theoretical Biology Group early on, and Charles DeLisi was part of that at the beginning also.
In the right hand side you see Temple Smith and Mike Waterman who were at various times
affiliated with Los Alamos and involved in that Rockefeller meeting as I mentioned, and very
interested in the sequence alignments, which of course the publications they put out later attest to.
I believe this picture was actually snapped by David Lipman on a trip from Arizona heading back
towards the East Coast in Los Alamos. Then in the lower left you see Minoru Kanehisa and
Jim Fickett who were both mentioned earlier. I think that's Jim and in situ at GenBank and
Minoru some years later. So getting into Phase 1 of the contract from just very early -
January of '82, I joined the T-10 group, and I think my entrée into the group was - though as
best I could tell as I arrived they were all brighter than me. None of them were card carrying
biologists, and they were eager in lining themselves up for the proposals they would be writing
in response to the NIH RFP to have biologists on hand. So that made for a post-doctoral
opportunity for me, which I was delighted to take and greatly enjoyed. So I joined Walter's
group and Jim and Minoru at that point in time. About a month later the proposal was sent off
from Los Alamos, and there was a quirk that we got caught up in. And I don't think this had
been realized too far ahead of time, but it played out nicely in the end or at least we worked to
make it play out well, which is that since the NIH decided to - NIGMS decided to do GenBank
as a contract, the Department of Energy in its wisdom thought it would be inappropriate for one
government agency to contract with another government agency, and so we were encouraged to
find collaborators who would actually in working with us be the primary contractors for the NIGMS.
And so Los Alamos participated in two competing proposals, one through IntelliGenetics and the
other through BBN Labs, and of course as you know from an earlier comment, Margaret Dayhoff
and the folks at Georgetown University also submitted a proposal. So you've heard a bit about
the deliberation process. The contract started in mid-1982. The proposal with BBN Labs was
picked as the successful one, and that effort early on was led by Howard Bilofsky with
Wayne Rindone and Fran Lewitter still leading the charge. Fran is now at the Whitehead Institute
in charge of some of the genome informatics effort going on there. Walter Goad and Greg Hamm
had been, I think, in touch and intending to and beginning to collaborate from well before the
contract was awarded. In fact, as was indicated, Greg was involved in helping to write the
RFP on which the contract was based, so there was a close connection there, but shortly after the
contract was awarded, Walter had organized a workshop at the Aspen Center for Physics, which
involved a lot of the people who were interested in the sequence databases and analysis,
and from Aspen, Walter and Greg wrote a letter to five or six different academicians in Japan
asking them and inviting them to participate in a collaboration with the EMBL and GenBank
and asking if there are ways in which they could be supportive of or assist in making that happen.
It actually took a while as you heard a few moments ago for that effort in Japan to take off,
but once it did, it was a strong and third member of that collaboration. So just seeing milestones
go by - five million bases towards the end of '84. A really significant aspect of the effort
and a sort of play on whether or not the databases had time to really curate and annotate the
sequences they were bringing in was the spinoff of initially an HIV sequence and analysis database
that Gerry Myers undertook. He had spent a sabbatical at Los Alamos to work in and around
the GenBank effort but got very interested in HIV sequences, and he told me in an email a couple
of weeks ago that his first jump into that was actually prompted by his daughter writing a report
in high school and bugging him about pulling some sequences out of GenBank that she could
include to make her report more interesting, and he didn't turn back. He stuck with the HIV sequences
and went forward and eventually secured his own funding separately from the NIH to really do
a very careful collection and annotation and characterization of all the HIV sequences.
That's actually gone forward. Gerry is retired now, but Betty Korber, Cathy Macken, and several
other coworkers have carried that general trend forward now for a number of other sequences
including STDs, oral pathogens, human papilloma virus, and flu. So that ended up being a very
strong branch off of the initial effort. So I thought I'd just close this section by mentioning that
we became aware towards the end of '86 that DDBJ was getting off the ground, and I came across
a letter recently that I had sent to the folks at DDBJ, Tateno and [indiscernible], I believe, basically
repeating the proposition that Walter and Greg had offered several years earlier and with a lot
of details about how they could begin interacting with us and the folks at BBN Labs also.
So early '87, which was the end of the first GenBank contract let out by NIGMS. We're up to
about 10 million bases. There was another competition again. And there's a key point here,
which I think in part addresses some of the questions that come up, sort of along the lines of
there are all these great ideas early on about what it would take and how one should keep up
with the data, and the project really was resource-bound in the first few years. We had put in
a proposal that was a match to the amount of money that we understood to be available, not a
match to what we thought it would take to get the job done. We worked within those bounds
for the first few years. We had a number of stormy meetings with our advisory board where
they expressed their concern about our not keeping up with the literature. Don Lindberg showed
one quote from me remembering some of those moments, and I think the real turning point,
which was a significant turning point as we went into the second contract period was that
Jim Cassatt, who was the new program manager from NIGMS, worked with us and with the
NIH system and eventually led to Jim Wyngaarden putting in a really significant up-ramp
I think in the vicinity of $4,000,000 a year to go towards supporting the GenBank contract.
And that represented at least a five-fold increase over the resources that had been available for
the first five years and at the end of the day made a tremendous difference in our ability to catch
up with the literature and with making the data available to the community. So in this go-round
we were in three competitive proposals, one with IntelliGenetics, one with BBN Labs and a third
with DNAstar including Fred Blattner and Temple Smith in that go-round. In this case
IntelliGenetics was the successful proposal paired with Los Alamos. That was led by IG's CEO
Mike Kelly with initially David Kristofferson, but very quickly David Benton coming in as the
program manager at IntelliGenetics. Significantly, at this point for us Paul Gilna and Tom Marr
were recruited into T-10. Tom was already at the lab in another group and Paul Gilna at the
University of Chicago. And Paul and Tom more or less took on the roles that I and Jim had
supplied under Walter Goad's leadership. I was sort of the biology guy. Jim was sort of the
computer guy, and those were the hats that Paul and Tom took in going forward. Paul Gilna
played a very critical role in working with the journal editors similar to what Graham Cameron
was describing for both himself and Patricia Kahn at EMBL and working with our advisory board
to track down leads and any means of persuasion that would help with getting journals onboard
with direct submission of database stuff. Soon after that a joint international advisory board was
established overseeing all three databases. I'm skipping over a lot of work and a lot of flurry, but
really significantly over this middle period of this contract was that we really did turn around
how far behind the literature we were, and really switched over to the majority of the data coming
in through direct and electronic submission rather than through manual data entry. Towards the
end of that contract, I took on a role as group leader for the theoretical biology group and turned
over the leadership of the project to Paul Gilna and Mike Cinkosky. Mike had been with the
database all along, working with initially Jim Fickett and then Tom Marr, but took on the role
of overseeing the computational work. And Mike really played a significant role in a lot of the
changes we were going through in computer hardware, going from customized code dealing with
text databases or files to working with the Sybase relational system during that time period.
So towards the end of that period we're up to 90,000,000 bases and, of course, as you heard earlier
the database went over to NCBI at that point in time. I thought I'd just throw out a few more
pictures here. Upper left-hand corner is a picture in the beginning of the first contract with
Walter, myself, a guy named Randy Linder who was a summer student who is now on the
faculty at the University of Texas as a member of the molecular evolution department. Lower
left-hand corner is Howard Bilofsky who led the effort at BBN. Lower right is Tom Marr
who was at Los Alamos as part of the second contract in particular, went on to Cold Spring Harbor,
then to found a company called Genomica, then on to the University of Alaska to take on the bioinformatics
chair, and recently is back in Colorado with another startup under his belt. The upper right-hand
corner is the GenBank team and I won't name everyone, but this is the one picture I could grab
easily that you can see Michael Cinkosky and Paul Gilna. Those who know them have found them
already in there, I'm sure. And then finally, NIH actually had a wonderful symposium in '91
to honor Walter Goad and his effort in getting the GenBank database off the ground, and these
are the database groups from IntelliGenetics and T-10 in Los Alamos who on the occasion sent
a message to Walter - hello, Walter - as you can see here. And again a nice selection, although
quite a subset of all the people that were involved in getting it off the ground. So I just thought
I'd spend the last two or three slides on just some of the transitions that we went through.
In addition to just the notion of keeping up with the data and all of the interesting biology
associated with it, there was quite an upheaval in technology during the ten years that we were
involved with the project or ten-plus years including the sequence library. We went through
big changes in server hardware. We went through big changes in operating systems both on the
hardware and the operating system front. One of the ironic things at the end of the day was that
in order to do what we had to do and do it well, we figured out that we should get off of the
massive mainframe system at Los Alamos that had been part of the reason that one would want
to be at Los Alamos in the first place and move on to computers that we had more direct control
over and that were more comfortable for the aspects of editing and managing text files and the database
systems associated with them than the heavy-duty number-crunching codes that were more
the reason to be on a big mainframe. And then finally, data management ended up moving over
to a relational database management system around 1990. It's interesting to note that one of the
first debates between Minoru Kanehisa and Walter Goad when Minoru arrived at Los Alamos
was that Walter was aware that there was a test relational database management system up
on the Livermore timesharing system at Los Alamos called FRAMIS, and for all the good
reasons that one would want to be in a relational database, Walter thought it would be great to put
the sequence library into that system. Minoru in looking through it thought that the system
wasn't at all adequate for dealing with the kind of data that were going to be in the sequence database
and really pushed hard to stay out of FRAMIS, although Minoru did go ahead to publish a
paper about the experience and appropriateness of relational database management systems
around that time for sequence data. So it took the practicality of relational systems about six
years to catch up with the concept, which was very appealing. Community interface also went
through a lot of transitions. Again, I'm not going to read through all of these, but the way we
distributed the data went through a lot of different cycles. Some of these rose and fell in short order.
Some rose and stayed on for quite some time as it made sense, and the same point is to be made
about data collection, how we got the data starting with manual entry, which was talked about
earlier, and I think the annotations that one saw on instructing data entry folks on how to read
through a sequence really is to the point of - quite complicated at times, but also quite time
consuming and as Graham pointed out the fact that by this point in time most of these sequences
were starting in computers. We switched as rapidly as possible and with the help and the
leadership of people like Rich Roberts in particular in encouraging us but also encouraging the
journal editors of the world including the journal he was affiliated with to move in this direction.
So these are quite significant transitions, and I think at the end of the day in spite of all the
technical things, what I would think of as the social change or the cultural change in science
of separating the DNA sequences from being in the same direct pathway of journal publication
with this pseudo-parallel pathway of being submitted in advance of publication in electronic form
to the databases was very significant and, I think in terms of a paradigm shift was maybe the
most significant for us. So drinking from the fire hose, which is how we felt on most days
in terms of the rate at which sequence data were coming out. There were a number of paradigm shifts.
I just mentioned the electronic data publication as one. Database quality assurance, this was hotly
debated on our advisory board. It was hotly debated by database staff, and it was certainly hotly
debated in the community. Early on we had a very strong editorial role, and we shifted as we had
to, and after getting the go-ahead and resolution from our advisory boards to really focus on just
making sure the sequence data we were getting in - and we were keeping up with the literature
and sacrificing the editorial and curatorial components of going over the data and making sure
that one data entry was consistent with the other and that they all fit together in a nice way.
And then finally, data integration, revisions and updates. That's shifted quite significantly over
the last few years. Certainly in the last few years of the database from again being a role where
we tried to impose order on the entries and perhaps in the spirit of Rich Roberts' suggestion
really end up shifting more to an archival role where the community was free to and encouraged to
let us know when something needed to be changed or combined or fixed or corrected, but we
didn't go looking for those things ourselves except for when they could be detected by
automated means. So I guess the last couple of slides - one, at various times I thought it would
be interesting to go back and look when and if we ever made predictions about how much data
there would be. The first two instances we were pretty close within the confines that we had.
Jim Fickett and Minoru and Walter made the statement that there could be as many as ten to the seventh bases
within three years, and they were quite close to that mark. Jim and I made a statement in a later
publication about there being close to ten to the tenth bases in 10 to 20 years and again ended up
being pretty close to the mark. Both of those were offered, and I think based on more than a quick
conversation sort of there could be as many as thought. In the third example here with the benefit
by then of early discussions about genome sequences and the possibility of doing the genome, I
actually went through and tried to build in real estimates based on real notions of how much data
would have to be collected in order to do the genome in the timeframe that's being suggested.
So the somewhat realistic modeling ended up being the furthest off the mark at the point in time
when that was made. So I thought just as another sort of numbers game, say not so much making
a prediction of how much sequencing we'll be doing, but beginning to wonder what are the
upper bounds or how much will be enough in some directions, and this isn't meant to be exhaustive
but including the WGS section we're up at about 2 times 10 to the 11th bases in one of the
recent releases. I thought it would be interesting to at least conceptually ask how much DNA -
how many genomes are there on the planet and how many bases in those genomes if one said
you were going to sequence all the DNA, and I'm ignoring generations in time here and just
taking a current snapshot of the world. If you backed off from that picture, which I don't supply
a number for because there are a fair number of unknowns there, but if you said in a more limiting
way you wanted at least one genome sequence for all described species, there's still a fair amount
of back-of-the-envelope guesstimation there, but one comes up with a number of a little over
3 times 10 to the 15th bases. If you said, well let's not limit ourselves to described species
but make a guess at how many species there are that haven't been described yet, one goes up
about an order of magnitude in the number of bases that would be included in that. If you then
on top of that said, well, what we're interested in is population variation within a species,
then you go up another two orders of magnitude to 3 times 10 to the 18th. A project that we're
funding now and been quite interested in is DNA bar coding, and that takes a different stance
for those of you who are familiar with it, sequencing 650 bases from the mitochondrial genome
of any [indiscernible] species in order to be able to uniquely determine the species in the future,
and by comparison the reference database, their general approach is based on doing ten specimens
per species, so if you said how much sequencing would it take to bar code every species or all
of the described species or all of the estimated species, you end up in the 10 to the 11th or 10 to the 10th domain
which is actually on a par with how much data there are today, so not too strange to estimate.
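The back-of-the-envelope numbers above can be reproduced with a short sketch. The genome totals, the barcode length, and the ten specimens per species are the figures quoted in the talk; the count of described species (`DESCRIBED_SPECIES`) is not given in the talk and is an assumed round number used only for illustration, as are the function names.

```python
import math

# Figures quoted in the talk.
GENOME_BASES_DESCRIBED = 3e15  # one genome per described species
UNDESCRIBED_FACTOR = 10        # "about an order of magnitude" for undescribed species
POPULATION_FACTOR = 100        # "another two orders of magnitude" for population variation
BARCODE_LENGTH = 650           # mitochondrial bases sequenced per specimen
SPECIMENS_PER_SPECIES = 10     # specimens per species in the reference database

# ASSUMPTION (not from the talk): a round count of described species.
DESCRIBED_SPECIES = 1.8e6
ESTIMATED_SPECIES = DESCRIBED_SPECIES * UNDESCRIBED_FACTOR


def barcode_bases(n_species: float) -> float:
    """Total bases needed to barcode n_species at ten specimens each."""
    return BARCODE_LENGTH * SPECIMENS_PER_SPECIES * n_species


def order_of_magnitude(x: float) -> int:
    """Power of ten of x, e.g. 3e15 -> 15."""
    return math.floor(math.log10(x))


# Whole-genome scenarios, each built on the previous one.
genomes_all_species = GENOME_BASES_DESCRIBED * UNDESCRIBED_FACTOR
genomes_with_variation = genomes_all_species * POPULATION_FACTOR

if __name__ == "__main__":
    scenarios = {
        "genomes, described species": GENOME_BASES_DESCRIBED,
        "genomes, estimated species": genomes_all_species,
        "genomes, with population variation": genomes_with_variation,
        "barcodes, described species": barcode_bases(DESCRIBED_SPECIES),
        "barcodes, estimated species": barcode_bases(ESTIMATED_SPECIES),
    }
    for label, bases in scenarios.items():
        print(f"{label}: about 10^{order_of_magnitude(bases)} bases")
```

Under that assumed species count, the barcoding totals land at the 10 to the 10th and 10 to the 11th powers, the same range as the roughly 2 times 10 to the 11th bases quoted for a recent release, while the whole-genome scenarios sit four to seven orders of magnitude higher.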
So my last slide here - Francis - is going back to the data - the Library Company that I mentioned
at the beginning. When they got started - actually about, I guess, 60 years after they got started
they built a building to house themselves at this location in the Independence district of Philadelphia.
You can't see it in the detail in this painting but there's a nice statue of Ben Franklin, recognizing
his role in getting it off the ground at the top there. At some point after that, the Library Company
needed to move off - about 100 years later move off to other headquarters, but another institution
that Ben Franklin founded called the American Philosophical Society, which you saw Bruno cite
earlier on today, decided to acquire the property and built a facsimile building on the same site in
which to house their library for the American Philosophical Society, and it's a nice point of
resonance that Walter Goad's papers are now resident in that building, and anyone here with due
notice and willingness to observe all the rules that they associate with getting access to the
papers can go and sift through all of Walter's papers and records in getting the Los Alamos
Sequence Library and GenBank off the ground originally. So I'll close there, but just as everyone
else would acknowledge, everyone has helped out on every front in the past with DDBJ, EMBL,
GenBank and its several partners. I counted that I've moved I think offices ten times and homes
six times since I first started with GenBank, so I was delighted to have this invitation, I was also
panic stricken because I'm down to maybe one file folder about a quarter inch thick of any
records whatsoever of the early days of GenBank, so I relied on sending emails and phone calls
out to a number of people who were very generous in sharing information back. Ryan Gregory
years ago helped with the genome size estimations and finally the APS Library staff and LANL
were helpful, and I just want to say both retrospectively but going forward it has really been
wonderful to see how GenBank has flowered under NCBI's stewardship over the last few years
in the continuing collaboration with their partners. Thanks. [Applause]