Mind to Market

Sunday, September 24, 2006

The Odd World of Information Extraction

Information extraction is the discipline of automatically extracting structured information from unstructured documents. In other words: teaching computers to read. Humans usually write with the idea that the writing will be read by other humans and thereby transfer some information, concept or idea from one mind to another, or to many others. Unfortunately, as you may have surmised from your exploding inbox, blogroll or stack of unread magazines, there is way too much to read. In the life sciences field alone there are more than 300 scientific articles published every day. Seven days a week, 365 days a year. Even Evelyn Wood couldn't keep up.

And each article, each email, each blog is written by a human using standard grammar and syntax so that most people with fluency in the writer's language can read it. Yet, despite the ease with which humans can read and comprehend, computers have a much, much harder time. The fact is we humans like writing and reading unstructured information. Sure, we read spreadsheets, balance sheets, income statements, and phone books, but we rarely stop there; these documents are only part of the story. The rest is contained in introductions, commentaries, discussions, and conclusions, which are written in full sentences. How many people would read the sports section of the newspaper if it only contained the statistics? How many people would read the business section if it only contained the stock quotes? We write in unstructured form because we think and read in unstructured form and, to a certain extent, because we've always communicated in this way. In fact, the more creative we are with our unstructured form, i.e. the more the way we structure our language differs from that of other writers, the more kudos we get for originality, and the further we stay from plagiarism.

But who's reading all this unstructured writing? With the tremendous influx of new information every day, it's getting more and more difficult to read everything required to keep up in one or two primary fields as well as several secondary ones. In a field such as biology, there are hundreds of journals covering an equal number of knowledge domains. The choice is then to very strategically select the articles you will read, rely on colleagues to tell you about others and go to conferences to hear people talk about still more. Even then you have just scratched the surface. If you are involved in an interdisciplinary field bridging multiple domains, you can no longer get by on specializing in a very narrow sub-domain; you must broaden your range to encompass several domains. At this point it may be prudent to look for technologies that can rapidly read and assimilate vast amounts of unstructured information. A technology such as information extraction.
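To make "structured information from unstructured documents" a little more concrete, here is a toy sketch in Python. It is only an illustration, not how a real information extraction system works: the company names, the sample sentences and the `extract_funding` helper are all made up, and a single regular expression stands in for what would normally be a much more elaborate pipeline.

```python
import re

# Hypothetical pattern for this example only: it matches phrases of the
# form "<Company> raised $<amount> million" and nothing more general.
FUNDING_PATTERN = re.compile(
    r"(?P<company>[A-Z][A-Za-z]+) raised \$(?P<amount>\d+(?:\.\d+)?) million"
)


def extract_funding(text):
    """Return one structured record per funding statement found in free text."""
    return [
        {"company": m.group("company"), "amount_musd": float(m.group("amount"))}
        for m in FUNDING_PATTERN.finditer(text)
    ]


# Made-up sentences written the way a human would write them.
text = (
    "After a strong quarter, Acme raised $12.5 million from investors. "
    "Analysts were surprised when Globex raised $40 million a week later."
)

records = extract_funding(text)
for record in records:
    print(record)
```

The full sentences go in, and what comes out is a list of rows that could sit in a spreadsheet. Everything else in the sentences (the strong quarter, the surprised analysts) is simply thrown away, which is exactly the trade-off discussed next.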

A writer labors to produce a legible, coherent document full of complex, intricate syntax and word play, only to have it read by a computer whose main objective is to strip out all originality and deconstruct the document down to its essential concepts. So what's the point? Is the primary purpose of a document to be read by machines or humans? "Both" may be the most plausible answer.


Monday, September 18, 2006

Linear Thinking

One (simple) definition of linear thinking that I can agree with is this: taking information from one situation and using it in another situation to draw a conclusion about the latter. We go to a new restaurant, look at a menu we've never seen before, and choose an entree based on a combination of what we like and are familiar with and what's described on the menu. If we order shrimp linguini in white sauce, we expect a few shrimp with linguini-shaped noodles in a cream-based sauce. If we were served a well-done T-bone steak instead, this would exceed the constraints of our linear thought process (and deprive the waiter of a tip!).

We could go on to say that linear thinking involves a sequential ordering of concepts, with no loops that bring outside elements into the sequence. Although linear thinking works in many situations, e.g. the restaurant example, it may not encompass sufficient complexity to deal with more complicated ones. Linear thinking is relatively safe and conservative; it would be easier to describe a linear thought process to others than to describe a non-linear one.

So what is the counterpart to linear thinking? If we consider linear thinking to be "inside the box" then non-linear thinking may be considered "outside the box." One problem with the linear thought process is that you are always defining new systems in terms of systems you are already familiar with. In some cases this is too great a limitation and it hinders the creative or developmental process. In The Innovator's Dilemma, Clayton Christensen argues that following a linear development process will trap a company into simply improving existing technologies instead of finding new disruptive technologies that will eventually outpace existing ones. For the most part, our management and accounting systems are set up as linear systems: next year's budget is dependent upon this year's, and our customers want to continue to use this year's products with only a few upgrades. The resistance to non-linear thinking is high, even to the point of supporting archaic processes.

But non-linear thinking in and of itself is not a solution. Nor is it even the framework of a solution; it is merely the lack of constraints imposed by linear thinking. It suggests that a solution to a problem may be beyond the types of solutions that have been used in the past. Ironically, what is considered non-linear thinking the first time through will be incorporated into the knowledge base and will be considered linear in subsequent usage.


Friday, September 01, 2006

Biz Plan

I've been heads down on a business plan lately and all that that entails. Mostly combing through hundreds of articles, documents and websites. And of course talking to people, mainly my partner Frank, who has first-hand knowledge of the industry from a technical point of view. Not that he didn't observe the business side and reach some understandings about it. The bioinformatics industry has gone through some very dramatic transformations, as has all of IT, over the last ten years. Just by coincidence I ran into the former CFO of Genomica at BioWest last week. They had raised a total of $120 million in a matter of months on the basis of a business plan, a management team and a whole lotta hype. Those were the days!!
