Mind to Market

Wednesday, April 16, 2008

Mining the Text Mountain

I've had my doubts about text mining in the past, such as why we spend so much energy wrapping facts into coherent sentences only to have machines struggle to unwrap them. Not that there is any doubt that there is a huge amount of unmined text out there: decades of work for text miners. It just seems of questionable value when this knowledge could be tagged as it is entered, thus obviating the need to mine it later. But there is this vision, a vision that with sophisticated text mining tools some bigger picture can emerge from a mass of unstructured data.

In a recent article (subscription required) in Genome Technology, William Hayes, director of informatics at Biogen Idec, alludes to this vision, stating that writers cannot pre-tag information while writing because the information can only be viewed within some larger context. The jury is still out on this, but regardless of how knowledge is gathered and transferred in the future, we’ve got years of literature to mine.

Hayes also claims that "almost all of our [biomedical] knowledge…is captured in the literature." I'm sure we can claim that researchers do publish their "good" results in the literature, but this is far less than our total knowledge. Because the process of writing and publishing knowledge is so difficult and expensive, usually only the most promising results are published, leaving the vast remainder in lab notebooks, hard drives and researchers’ heads, where it is difficult to distribute.

Larry Hunter, director of the Computational Bioscience Program at the University of Colorado, sees text mining as a way to improve current databases but remains skeptical that such systems can automatically keep a scientist up to date with the literature. Don't throw those reading glasses away just yet.


Monday, March 24, 2008

Killer Semantic Apps

Is coming up with a definitive application for demonstrating the utility of semantic analysis really that difficult? TextWise, a software development company in Rochester, New York, apparently thinks a good idea for this technology is worth at least the $1 million they are offering the winner of their SemanticHacker $1M Innovators Challenge.

The rules of this contest require the contestant to use, or propose to use, the SemanticHacker API, based on TextWise's Semantic Signatures® technology, to develop a software application for a specific industry vertical. Although it is up to the contestant to propose a vertical, TextWise suggests industries such as "healthcare or pharmaceuticals might be good places to start." Wonder who tipped them off?

As explained in TechCrunch, Semantic Signatures® uses natural-language processing to extract relevant terms from text, then applies semantic analysis to automatically categorize Web pages. Not a bad idea, but the technology can be a bit flaky. Semantic Signatures® uses Wikipedia as a reference, matching the concepts extracted from the text to Wikipedia articles.
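To make the general idea concrete, here is a minimal sketch of "extract terms, then match them to reference concepts." It is not TextWise's method or the SemanticHacker API, whose internals are not described here. The stop-word list, the three reference concepts and the function names are all hypothetical stand-ins; a real system would match against something far larger, such as a full set of Wikipedia articles.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "that", "by", "are", "on"}

# Hypothetical reference concepts, standing in for a large reference
# collection. Each concept maps to words assumed to signal it.
REFERENCE_CONCEPTS = {
    "Protein": {"protein", "proteins", "amino", "folding"},
    "Drug discovery": {"drug", "drugs", "compound", "pipeline", "target"},
    "Baseball": {"inning", "pitcher", "batting", "baseball"},
}

def extract_terms(text, top_n=10):
    """Return the most frequent non-stop-words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [term for term, _ in counts.most_common(top_n)]

def categorize(text):
    """Pick the reference concept sharing the most terms with the text."""
    terms = set(extract_terms(text))
    scores = {concept: len(terms & vocab) for concept, vocab in REFERENCE_CONCEPTS.items()}
    best = max(scores, key=scores.get)
    # No overlap at all means no category, rather than a random guess.
    return best if scores[best] > 0 else None

page = "The drug binds its target protein, and the compound moved into the pipeline."
print(categorize(page))
```

Even this toy version hints at where the flakiness comes from: the category is only as good as the word lists behind it, and a page that mentions "pipeline" in the plumbing sense would be scored the same way.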

One of the cornerstones of the W3C specification for the Semantic Web is its use of the Web Ontology Language (OWL). OWL only specifies the format of ontologies, however, and there is the assumption that human domain experts will be required to develop an accurate ontology. TextWise claims that ontologies developed in this way "do not align with customer needs and…rapidly become obsolete." Perhaps, but without some agreement on the ontology all you have is a folksonomy, which reduces its value in collaborative efforts.

The Holy Grail that drives the concept of semantic analysis is the ability of software to do the "connecting the dots" process that is normally done by humans. We humans can juggle a few thousand "dots" in our heads, but connecting one to another, or maybe some complex combination of five to another twelve, gives most of us a headache. And when you start thinking about connecting a million or more dots, well, time to start thinking about a simpler project, like brain surgery.


Tuesday, October 09, 2007

Data Glut in Research

Chemical & Engineering News placed information management in pharmaceutical R&D on their cover for the October 1 issue. In addition to pointing out some of the information ills that plague the industry, C&EN has added a blurb on the "Internet Orphan" Semantic Web (SW). Although short on concrete examples of just what SW can do, the article points out that SW technologies could replace data mining as a way to derive knowledge from your data.

So what about this "Internet Orphan" label? Apparently, since SW technology has been around for several years and no one has really picked it up, it has earned the name. The life sciences, with their exploding stores of data and thinning drug pipelines, have a real need for technologies that can wring more knowledge from the databases. Add the fact that biology is a science, and is therefore smaller and more quantifiable than the Web as a whole, and applying the structure of SW to the biological knowledge domain makes much better sense than applying it to the entire Web.

And lest I forget the hype: SW is being compared to the early days of the Web. "What is happening now on the Semantic Web is similar to what was going on in the five years leading up to that explosion [of 1995 that kicked off the Web]," claims John Wilbanks, executive director of Science Commons.


Sunday, September 24, 2006

The Odd World of Information Extraction

Information extraction is the discipline of automatically extracting structured information from unstructured documents. In other words: teaching computers to read. Humans usually write with the idea that the writing will be read by other humans and thereby transfer some information/concept/idea from one mind to another or many others. Unfortunately, as you may have surmised from your exploding inbox, blogroll or stack of unread magazines there is way too much to read. In the life sciences field alone there are more than 300 scientific articles published every day. Seven days a week, 365 days a year. Even Evelyn Wood couldn't keep up.
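As a toy illustration of what "structured information from unstructured documents" means, the sketch below pulls simple subject-relation-object records out of free text with a regular expression. The pattern, the three verbs and the sample sentences are assumptions chosen for the example. Real information extraction systems rely on much richer language processing than a single pattern, which is exactly why the problem is hard.

```python
import re

# Hypothetical pattern: a word, one of three relation verbs, then another word.
# Hyphens and digits are allowed inside names so "COX-1" stays in one piece.
PATTERN = re.compile(
    r"\b([A-Za-z][\w-]*)\s+(inhibits|activates|binds)\s+([A-Za-z][\w-]*)"
)

def extract_relations(text):
    """Turn matching phrases into structured records (a list of dicts)."""
    return [
        {"subject": subj, "relation": rel, "object": obj}
        for subj, rel, obj in PATTERN.findall(text)
    ]

# Illustrative sentences, written for this example.
text = "Aspirin inhibits COX-1. In a later study, insulin activates AKT2 in muscle cells."
for record in extract_relations(text):
    print(record)
```

Rephrase the first sentence as "COX-1 is inhibited by aspirin" and the pattern finds nothing, even though any human reader sees the same fact.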

And each article, each email, each blog is written by a human using standard grammar and syntax so that most people with fluency in the writer's language can read it. Yet, despite the ease with which humans can read and comprehend, computers have a much, much harder time. The fact is we humans like writing and reading unstructured information. Sure, we read spreadsheets, balance sheets, income statements, and phone books, but we rarely stop there; these documents are only part of the story. The rest is contained in introductions, commentaries, discussions, and conclusions, which are written in full sentences. How many people would read the sports section of the newspaper if it only contained the statistics? How many people would read the business section if it only contained the stock quotes? We write in unstructured form because we think and read in unstructured form and, to a certain extent, because we've always communicated in this way. In fact, the more creative we are with our unstructured form, i.e. the more our way of structuring language differs from that of other writers, the more kudos we get for originality, and we avoid plagiarism.

But who's reading all this unstructured writing? With the tremendous influx of new information every day, it's getting more and more difficult to read everything required to keep up in one or two primary fields as well as several secondary ones. In a field such as biology, there are hundreds of journals covering an equal number of knowledge domains. The choice is then to very strategically select the articles you will read, rely on colleagues to tell you about others and go to conferences to hear people talk about still more. Even then you have just scratched the surface. If you are involved in an interdisciplinary field bridging multiple domains, you can no longer get by on specializing in a very narrow sub-domain; you must broaden your range to encompass several domains. At this point it may be prudent to look for technologies that can rapidly read and assimilate vast amounts of unstructured information. A technology such as information extraction.

A writer labors to produce a legible, coherent document full of complex, intricate syntax and word play only to have it read by a computer whose main objective is to strip out all originality and deconstruct the document down to its essential concepts. So what's the point? Is the primary purpose of a document to be read by machines or humans? "Both" may be the most plausible answer.
