Mind to Market

Wednesday, April 16, 2008

Mining the Text Mountain

I've had my doubts about text mining in the past, such as why we spend so much energy wrapping facts into coherent sentences only to have machines struggle to unwrap them. Not that there is any doubt that a huge amount of unmined text is out there: decades of work for text miners. It's just that mining seems of questionable value when this knowledge could be tagged as it is entered, obviating the need to mine it later. But there is this vision, a vision that with sophisticated text mining tools some bigger picture can emerge from a mass of unstructured data.

In a recent article (subscription required) in Genome Technology, William Hayes, director of informatics at Biogen Idec, alludes to this vision, stating that writers cannot pre-tag information while writing because it can only be understood within some larger context. The jury is still out on this, but regardless of how knowledge is gathered and transferred in the future, we’ve got years of literature to mine.

Hayes also claims that "almost all of our [biomedical] knowledge…is captured in the literature." I'm sure we can claim that researchers do publish their "good" results in the literature, but this is far less than our total knowledge. Because writing and publishing are so difficult and expensive, usually only the most promising results are published, leaving the vast remainder in lab notebooks, hard drives and researchers’ heads, where it is difficult to distribute.

Larry Hunter, director of the Computational Bioscience Program at the University of Colorado, sees text mining as a way to improve current databases but remains skeptical that such systems can automatically keep a scientist up to date with the literature. Don't throw those reading glasses away just yet.
