Mind to Market

Sunday, September 24, 2006

The Odd World of Information Extraction

Information extraction is the discipline of automatically extracting structured information from unstructured documents. In other words: teaching computers to read. Humans usually write with the idea that the writing will be read by other humans and thereby transfer some information, concept or idea from one mind to another, or to many others. Unfortunately, as you may have surmised from your exploding inbox, blogroll or stack of unread magazines, there is way too much to read. In the life sciences field alone there are more than 300 scientific articles published every day. Seven days a week, 365 days a year. Even Evelyn Wood couldn't keep up.
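To make "structured information from unstructured documents" a little more concrete, here is a minimal sketch in Python. It assumes a very narrow, made-up task: pulling subject-verb-object relations out of sentences with a regular expression. The pattern, the verb list, the function name `extract_relations` and the sample sentences are all invented for illustration; real information extraction systems are far more sophisticated than this.

```python
import re

# Hypothetical pattern: "<Subject> <verb> <Object>" with a small, fixed verb list.
# Both names must start with a capital letter; this is an illustrative assumption.
RELATION = re.compile(
    r"\b(?P<subject>[A-Z][A-Za-z0-9]*)\s+"
    r"(?P<verb>activates|inhibits|binds)\s+"
    r"(?P<object>[A-Z][A-Za-z0-9]*)\b"
)


def extract_relations(text):
    """Return one dict per subject-verb-object match found in the text."""
    return [match.groupdict() for match in RELATION.finditer(text)]


# Invented example sentences, not taken from any real article.
sample = (
    "We found that GeneA activates GeneB in most samples. "
    "In contrast, ProteinX inhibits GeneA under stress."
)

for record in extract_relations(sample):
    print(record)
```

Each printed record is a small piece of structure (who did what to whom) recovered from free-flowing sentences. A toy like this breaks as soon as a writer rephrases the sentence, which is a large part of why computers find reading so much harder than we do.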

And each article, each email, each blog is written by a human using standard grammar and syntax so that most people with fluency in the writer's language can read it. Yet, despite the ease with which humans can read and comprehend, computers have a much, much harder time. The fact is we humans like writing and reading unstructured information. Sure, we read spreadsheets, balance sheets, income statements, and phone books, but we rarely stop there; these documents are only part of the story. The rest is contained in introductions, commentaries, discussions, and conclusions, which are written in full sentences. How many people would read the sports section of the newspaper if it only contained the statistics? How many people would read the business section if it only contained the stock quotes? We write in unstructured form because we think and read in unstructured form and, to a certain extent, because we've always communicated in this way. In fact, the more creative we are with our unstructured form, that is, the more our way of structuring language differs from that of other writers, the more kudos we get for originality and the further we stay from plagiarism.

But who's reading all this unstructured writing? With the tremendous influx of new information every day, it's getting more and more difficult to read everything required to keep up in one or two primary fields as well as several secondary ones. In a field such as biology, there are hundreds of journals covering an equal number of knowledge domains. The only option is to select very strategically the articles you will read, rely on colleagues to tell you about others, and go to conferences to hear people talk about still more. Even then you have just scratched the surface. If you are involved in an interdisciplinary field bridging multiple domains, you can no longer get by on specializing in a very narrow sub-domain; you must broaden your range to encompass several domains. At this point it may be prudent to look for technologies that can rapidly read and assimilate vast amounts of unstructured information. A technology such as information extraction.

A writer labors to produce a legible, coherent document full of complex, intricate syntax and word play, only to have it read by a computer whose main objective is to strip out all originality and deconstruct the document down to its essential concepts. So what's the point? Is the primary purpose of a document to be read by machines or humans? "Both" may be the most plausible answer.
