"We've found that when we add the spliced untranslated regions to our system, we not only get good predictions for UTRs but also improved predictions of the protein-coding region of the gene. By correctly identifying UTRs, we can avoid labeling them incorrectly as part of the protein-coding region," said Brent, who, with various colleagues, developed both TWINSCAN and N-SCAN. "It's important to know these two areas. Some of the signals that regulate transcription reside right near the transcription site. There is a huge amount of biology to be discovered there, and the appreciation of this area is growing daily."
While genomics researchers 15 years ago paid little attention to parts of the genome outside the coding regions, they have since discovered unexpected functions in UTRs that have provoked second and third thoughts.
For instance, it was recently discovered that huntingtin, a gene associated with Huntington's disease, has a second protein segment encoded upstream of the main one. This protein, encoded in the so-called untranslated region, is involved in regulating the gene.
Running the modified TWINSCAN on both the human and fruit fly genomes, Brent and colleagues predicted about 25,000 transcription-start sites, compared with a known 6,000.
"In the human genome, we found many extra exons on genes that were already known, or in some cases, spliced UTRs on genes that weren't even known to exist before," Brent said.
The system takes advantage of the scarcity of the CG sequence across most of the genome, finding so-called CpG "islands," stretches where the sequence is unusually frequent and which are known to be more common near the transcription-start site. It also has a knack for recognizing sequences that indicate splice sites. Over the past two years, TWINSCAN has been finding and predicting genes in numerous genomes that other gene prediction systems have missed. The addition of N-SCAN to
Source: Washington University in St. Louis