Mind to Market

Wednesday, April 16, 2008

Mining the Text Mountain

I've had my doubts about text mining in the past, chiefly over why we spend so much energy wrapping facts into coherent sentences only to have machines struggle to unwrap them. There is no doubt that a huge amount of unmined text is out there, enough for decades of work for text miners. It's just that mining seems of questionable value when this knowledge could be tagged as it is entered, obviating the need to mine it later. But there is a vision that, with sophisticated text mining tools, some bigger picture can emerge from a mass of unstructured data.

In a recent article (subscription required) in Genome Technology, William Hayes, director of informatics at Biogen Idec, alludes to this vision, stating that writers cannot pre-tag the information while writing because the information can only be viewed within some larger context. The jury is still out on this, but regardless of how knowledge is gathered and transferred in the future, we’ve got years of literature to mine.

Hayes also claims that "almost all of our [biomedical] knowledge…is captured in the literature." I'm sure researchers do publish their "good" results in the literature, but this is far less than our total knowledge. Because writing and publishing are so difficult and expensive, usually only the most promising results are published, leaving the vast remainder in lab notebooks, hard drives and researchers’ heads, where it is difficult to distribute.

Larry Hunter, director of the Computational Bioscience Program at the University of Colorado, sees text mining as a way to improve current databases but remains skeptical that such systems can automatically keep a scientist up to date with the literature. Don't throw those reading glasses away just yet.



  • I recently checked out this system, XTRactor. It provides 100% manually annotated data for all drug discovery related areas. A technically qualified team manually curates all the abstracts from PubMed as and when they are published. It also acts as an alert service that tracks your areas of interest and sends them to your inbox regularly. Check it out for free at: www.xtractor.in
    What's more, there is no need to do topic tracking for your research needs: the data gets added to your profile once you provide your keywords of choice.

    It also has collaboration and community-building features, which makes it far more attractive.

    By Anonymous, at 7:35 AM

  • More context regarding my quote:

    Pre-tagging is useful but will never be complete, as what the reader is interested in may not be what the writer has in mind. Plus, the information density of the literature is much higher than can be captured by a bit of markup (now, if the markup and additional structured data are several times the size of the original text, maybe that will suffice, but good luck making that happen via the writing process).

    I also incorrectly stated the second quoted reference; it should have been: "Almost all of our accessible biomedical knowledge is captured in the literature which along with the available biomedical databases is everything that is 'mine-able'."

    I agree that negative results are not captured effectively and that there is a great deal of hidden information, but the lack of structure and contextualization means that it won't be that rich a resource to mine (writing a paper or developing a database requires a certain amount of rigor and the generation of metadata that a useful resource needs).

    By William, at 7:03 AM
