Teaming up with biophysicist Gregory E. Sims, statistical mathematician Se-Ran Jun and theoretical physicist Guohong A. Wu, Kim decided to try a simple variant of the word frequency technique. They eliminated all punctuation and spaces from a text, created a dictionary of all the two-letter, three-letter, and longer letter combinations in the text, and counted the frequency of each fixed-length "word" or feature. The features were not consecutive, non-overlapping blocks of letters, but overlapping sequences obtained by sliding a two-, three- or more-letter window along the text, advancing one letter at a time.
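The counting step described above can be sketched in a few lines of Python. This is only an illustration, not the researchers' code: the function name `feature_frequency_profile` and the sample sentence are made up for the example, and lowercasing the text and dropping digits are assumptions the article does not confirm.

```python
from collections import Counter


def feature_frequency_profile(text, k):
    """Count every overlapping k-letter feature in a text."""
    # Keep letters only, which drops punctuation and spaces.
    # Assumption: case is folded and digits are dropped as well.
    letters = "".join(ch.lower() for ch in text if ch.isalpha())
    # Slide a k-letter window along the text, advancing one letter at a time.
    # If the text is shorter than k, the range is empty and so is the profile.
    return Counter(letters[i:i + k] for i in range(len(letters) - k + 1))


profile = feature_frequency_profile("The cat sat on the mat.", 3)
print(profile.most_common(3))
```

Because the spaces are removed before the window slides, some features straddle word boundaries: "ats" in this example comes from the end of "cat" and the start of "sat".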
In a test on free online books obtained through Project Gutenberg, they found that this method, which they called the feature frequency profile (FFP) method, was more successful at identifying related books - books by the same author, books of the same genre, books from the same historical era - than word frequency profile analysis. In fact, a good tree of relationships among the books can be constructed by looking at a single "optimal" feature length, such as nine letters, where the "vocabulary" is very large, instead of looking at all possible lengths.
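Building a tree requires a measure of how far apart two profiles are, and the article does not say which measure was used. The sketch below assumes the Jensen-Shannon divergence, a standard way to compare two frequency distributions; the function name `js_divergence` and the toy profiles are hypothetical, and both profiles are assumed to be non-empty.

```python
import math
from collections import Counter


def js_divergence(counts_a, counts_b):
    """Jensen-Shannon divergence (in bits) between two count profiles."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    divergence = 0.0
    # Visit every feature that appears in either profile.
    for feature in set(counts_a) | set(counts_b):
        p = counts_a.get(feature, 0) / total_a  # relative frequency in A
        q = counts_b.get(feature, 0) / total_b  # relative frequency in B
        m = (p + q) / 2                         # frequency in the mixture
        if p > 0:
            divergence += 0.5 * p * math.log2(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log2(q / m)
    return divergence


# Toy profiles: identical texts score 0, texts sharing no features score 1.
same = js_divergence(Counter(["the", "cat"]), Counter(["the", "cat"]))
disjoint = js_divergence(Counter(["the"]), Counter(["dog"]))
print(same, disjoint)
```

Pairwise distances of this kind, computed at one chosen feature length, could then be passed to any standard tree-building routine.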
"I was just stunned when I saw this," Kim said. One of the reasons this method works better, he said, may be that, while word frequency analysis treats each word independently, feature frequency analysis picks up syntax.
"Here, if I take a 9-letter window and slide it along the text," he said, "I am actually picking up the relationship between the first and second words - the local syntax - which was impossible to pick up from the word frequency method. Apparently, that is very important."
Buoyed by this success, the researchers applied the technique to whole genomes of mammals, the group whose evolutionary relationships are the least controversial. "We treat the genome like a book without spaces," Kim said.
Since these genomes are very large, the researchers translated the genome sequences using a reduced, two-l
Contact: Robert Sanders
University of California - Berkeley