The NIST team has already applied its root-based vocabulary rules to the chemical structures in PubChem, a "monstrous database" of millions of compounds and chemical substances; to the Worldwide Protein Data Bank (PDB); and to specific NIST-based databases, said John Elliot, a biophysicist and another member of the team. While the scientific databases haven't reached a Facebook-like level of more than a billion users, they are actively used by many scientists in the NIST community and beyond.
Once the preliminary vocabulary was established, the NIST team also worked to categorize the descriptions of molecules and scientific experiments hierarchically, so that a search returns results that are comprehensive yet precise. A common problem with many search approaches is that they return too many results, said Elliot. He likened his team's approach to locating Doritos in a large Walmart store. "First you find the grocery market section, then the next level of hierarchy is snacks, after which you go to the chips section, and then you'll quickly know if they have Doritos or not," said Elliot. "So even if a store has a million products, you can find out if they have your product really quickly." The team said the hierarchical approach could also guide scientists who need to pick out keywords to index in their research papers.
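Elliot's store analogy can be illustrated with a small Python sketch. It is not the NIST team's implementation; the category tree, the product names other than Doritos, and the function name `has_product` are all hypothetical, chosen only to show how descending one level at a time narrows the search.

```python
# Hypothetical category tree modeled on the store analogy.
# Inner nodes are dicts (categories); leaves are sets of products.
STORE = {
    "grocery": {
        "snacks": {
            "chips": {"Doritos", "tortilla chips"},
            "crackers": {"saltines"},
        },
        "produce": {"fruit": {"apples"}},
    },
    "electronics": {"audio": {"headphones": {"earbuds"}}},
}


def has_product(tree, path, product):
    """Follow `path` one category per level, then check the leaf for `product`."""
    node = tree
    for category in path:
        # Stop as soon as a level is missing: nothing below it needs inspecting.
        if not isinstance(node, dict) or category not in node:
            return False
        node = node[category]
    # Only a leaf (a set of products) can contain the item.
    return isinstance(node, set) and product in node


print(has_product(STORE, ["grocery", "snacks", "chips"], "Doritos"))
```

The lookup touches one category per level and never scans unrelated branches, so its cost grows with the depth of the hierarchy and not with the total number of items stored.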
Organizing the huge amounts of data generated by science is a big challenge, the team said, but it has potentially huge payoffs.
Contact: Catherine Meyers
American Institute of Physics