New microscopes are capturing images of the brain faster, with better spatial resolution, and across wider regions of the brain than ever before. Yet all that detail comes encrypted in gigabytes or even terabytes of data. On a single workstation, simple calculations can take hours. "For a lot of these data sets, a single machine is just not going to cut it," Freeman says.
It's not just the sheer volume of data that exceeds the limits of a single computer, Freeman and Ahrens say, but also its complexity. "When you record information from the brain, you don't know the best way to get the information that you need out of it. Every data set is different. You have ideas, but whether or not they generate insights is an open question until you actually apply them," says Ahrens.
Neuroscientists rarely arrive at new insights about the brain the first time they consider their data, he explains. Instead, an initial analysis may hint at a more promising approach, and with a few adjustments and a new computational analysis, the data may begin to look more meaningful. "Being able to apply these analyses quickly, one after the other, is important. Speed gives a researcher more flexibility to explore and get new ideas."
That's why trying to analyze neuroscience data with slow computational tools can be so frustrating. "For some analyses, you can load the data, start it running, and then come back the next day," Freeman says. "But if you need to tweak the analysis and run it again, then you have to wait another night." For larger data sets, the lag time might be weeks or months.
Distributed computing was an obvious way to accelerate analysis while still exploring the full richness of a data set, but there are many distributed-computing platforms to choose from. Freeman chose to build on a new platform called Spark. Developed at the University of California, Berkeley's AMPLab, Spark is rapidly becoming a favored tool
Contact: Jim Keeley
Howard Hughes Medical Institute