Wednesday, February 02, 2005

Idea for a side project which will never get off the ground

Just as mathematicians have The Mathematics Genealogy Project, neuroscientists should have a Neuroscience Genealogy Project. (For a sample of what the MGP can do, put in the name of one of your college math professors and start clicking the "Advisor" links. He or she is probably not that many generations removed from someone mentioned in your textbooks, such as Weierstrass, Hilbert, or Dirichlet. Erdős and von Neumann had the same thesis advisor.)

Neuroscience, being much younger than math, should be easier to catalog. As I've become more a part of the field, I've realized that even though SfN annual meeting attendance has grown from 1,396 in 1971 to over 31,000 in 2004, everyone within a subfield still seems to know (or at least know of) everyone else. Famous people of today were often trained by famous people of yesterday, and incest seems to run rampant, with former labmates helping out each other's students. Many subfields in neuroscience are still comprehensible communities.

I'd like to see graphs showing the intellectual fathers and mothers of the field and their descendants, so that we could see whose intellectual traditions have influenced the largest number of today's neuroscientists. Links would be made from PIs to their graduate students and postdocs, and perhaps to their undergraduates too. From this we could see where an individual's influences came from, and whom that individual influenced. Links could maybe also be added between individuals who have carried on a significant collaboration.

So how to gather this data? The reason I'm thinking about this is that I think it could be done in a semi-automated fashion. We already have big publication databases like PubMed, ISI, and Google Scholar. I think one could write a bot that crawls these databases and uses a few simple heuristics to infer, from bibliographic citations, the relationships that would make up the edges of a neuroscience genealogy digraph.

For each individual on the graph, the things to look at would be: (a) the order of authors on that individual's publications, (b) the frequency of coauthorship with another individual, and (c) where each frequent coauthor falls in the individual's career. For example, one's first papers are often written with one's graduate advisor. One may start out as a middle author on these papers, but a couple of papers should be published early on where the student is the first author and the advisor is the last author. So have the bot search PubMed for an individual's first publications, have it create a vertex for the most common last author on the individual's first few first-authored papers, and add a "graduate student" edge from the added vertex to this individual's vertex.
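That advisor heuristic can be sketched in a few lines. This is only an illustration: the record format and function name are invented here, and real PubMed results would need to be parsed into this shape first.

```python
from collections import Counter

def infer_graduate_advisor(person, papers, k=3):
    """Heuristic: the most common last author on a person's earliest
    first-authored papers is probably their graduate advisor.

    `papers` is a list of {"year": int, "authors": [str, ...]} dicts
    (a made-up stand-in for parsed PubMed records).
    """
    # The person's earliest first-authored papers, oldest first.
    first_authored = sorted(
        (p for p in papers if p["authors"] and p["authors"][0] == person),
        key=lambda p: p["year"],
    )
    # Collect last authors, skipping single-author papers.
    last_authors = [
        p["authors"][-1]
        for p in first_authored[:k]
        if len(p["authors"]) > 1
    ]
    if not last_authors:
        return None
    return Counter(last_authors).most_common(1)[0][0]
```

For example, a student whose first two papers both end with the same senior author would get an edge from that author's vertex.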

After doing this, take the next few first-authored publications and find the most common last author of these. Call this new last author our sample neuroscientist's postdoctoral advisor, make a new vertex for this advisor, and add an edge from it to the initial individual's vertex. Finally, when our neuroscientist begins a long string of last-authored publications, have the bot start looking at the first authors of these papers to determine our initial individual's students and postdocs. Do the same thing in the opposite direction, too: the bot could move further up the tree by inferring our initial neuroscientist's advisor's advisor in the same way that it inferred his or her advisor.
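The downward direction, from PI to trainees, might look something like this sketch (same invented record format as above; the repeated-coauthorship threshold is an assumption meant to filter out one-off collaborators, not a tuned value):

```python
from collections import Counter

def infer_trainees(person, papers, min_coauthorships=2):
    """Heuristic: once a person settles into last authorship, the
    recurring first authors of those papers are likely their students
    and postdocs. `papers` holds {"year": int, "authors": [...]} dicts.
    """
    # Papers where the person is senior (last) author.
    last_authored = [
        p for p in papers
        if len(p["authors"]) > 1 and p["authors"][-1] == person
    ]
    # Count how often each first author recurs; a single joint paper
    # is treated as a collaboration, not a training relationship.
    counts = Counter(p["authors"][0] for p in last_authored)
    return [name for name, n in counts.items() if n >= min_coauthorships]
```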

Clearly these heuristics would need refinement and would have to account for edge cases, but I think they would do a not-so-bad job for most stereotypical neuroscience careers. Because there would be multiple "Public JQ"s in PubMed, one could separate the publications of the neuroscientist from those of the non-neuroscientists with, say, ISI's Journal Citation Reports list of neuroscience journals.
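That disambiguation step could be as crude as a journal whitelist. The journal names below are a placeholder subset, not an actual JCR category list:

```python
# Placeholder stand-in for a real neuroscience journal list
# (e.g., the ISI Journal Citation Reports neuroscience category).
NEURO_JOURNALS = {"J Neurosci", "Neuron", "Nat Neurosci"}

def is_neuroscience_record(paper, journals=NEURO_JOURNALS):
    """Keep only papers published in whitelisted neuroscience journals,
    to separate our neuroscientist's papers from those of same-named
    authors in other fields."""
    return paper.get("journal") in journals
```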

Oh, and the bot could be seeded at the beginning with the lists of faculty that are available on the websites of neuroscience departments.

My question is whether these heuristics would work well enough, or whether it might be easier in the long run to spend the time up front creating a hand-labeled training set and then using statistical learning techniques to build a classifier from it. Each vector to be classified would represent the relationship between a pair of individuals. The dimensions of this vector would be features like the ones I cited above (relative positions in author lists, where the period of coauthorship falls in the timeline of an individual's career), and the output classes would be the possible edge types, e.g., "Individual A is the graduate advisor of Individual B." Now that I think about it, I suppose it would make sense to look at the entire publication record of both individuals when classifying an edge. That is, for Individual A to be the graduate advisor of Individual B, A should be last author and should already have gone through a period of first-authorship, and B should be first author and not yet have many publications.
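A feature vector along those lines might be built like this. Everything here, feature names and thresholds included, is an invented illustration of the idea, not a worked-out schema; a real classifier would be trained on hand-labeled pairs:

```python
def edge_features(a, b, papers):
    """Features describing the relationship between researchers a and b,
    for a classifier to label with an edge type. `papers` is the union
    of both publication records, as {"year": int, "authors": [...]} dicts.
    """
    joint = sorted(
        (p for p in papers if a in p["authors"] and b in p["authors"]),
        key=lambda p: p["year"],
    )
    if not joint:
        return None
    first_joint_year = joint[0]["year"]
    return {
        "n_joint": len(joint),
        # a last / b first on joint papers: the advisor-like signature.
        "frac_a_last_b_first": sum(
            p["authors"][0] == b and p["authors"][-1] == a for p in joint
        ) / len(joint),
        # b's publication count before the collaboration began:
        # small values suggest b was a trainee at the time.
        "b_papers_before": sum(
            b in p["authors"] and p["year"] < first_joint_year for p in papers
        ),
        # a had already gone through a first-authorship phase.
        "a_first_authored_before": any(
            p["authors"][0] == a and p["year"] < first_joint_year
            for p in papers
        ),
    }
```

An advisor-student pair would then show a high `frac_a_last_b_first`, a small `b_papers_before`, and a true `a_first_authored_before`, which is exactly the pattern described above.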

So clearly these ideas need fleshing out before they are remotely implementable. My feeling is that doing this with statistical learning is probably the better way to go, even though it would probably require one to go through the tedious process of building a fairly large training set up front. The genealogy of neuroscience does seem like the sort of problem a computer could be programmed to solve fairly reliably, though.

