by Greg Nawrocki

Why cancer researchers care about Grid computing

Nov 10, 2005

One of the characteristics of life science and biomedical science is the diversity of data types – the heterogeneity of data and the way it’s described. In cancer research in particular, this presents interesting challenges for collaboration between different scientists and exchange of data sets.

A really interesting project to watch is caBIG, which is focused on allowing better sharing of data and tools for cancer research. According to one of caBIG’s participants, Peter Covitz, Director for Core Infrastructure at the NCI Center for Bioinformatics:

“Even within a given type of data from, say, a measurement technology, or a theoretical description of biology – even within an area that is ‘the same’ from a conceptual standpoint, there is often a diversity of terminology or subtle differences of meaning. This is a problem that commonly confronts informaticists who want to integrate resources in life sciences.

Other scientists – such as those in high energy physics, for example – may have tremendous amounts of data and separate challenges with large computational loads, but they tend to deal with a relatively modest number of ‘well understood’ data types in their domains. They don’t have this diversity and heterogeneity problem that we face.

With caBIG, we’re taking the best possible technology for integrating and sharing resources – namely, the Grid technology that’s evolved over the years, driven by physics and astronomy’s cases – and we’ve extended it to a common base of needs for the life sciences community. The extensions that we’ve put in have been largely about better support for descriptions of data and diverse data types, and semantic control of those data types by binding them to structured ontologies.

We’ve integrated everything into the grid framework that Globus already provided and thus created a massive data Grid. The locations are the NCI, the Georgetown Lombardi Cancer Center, the Duke Cancer Center, and the University of Pittsburgh Medical Center. Some locations have more than one node, so there are 6 or 7 total nodes.

Given the distributed nature of the sites, the diversity of the data, its sheer volume and the way it is presented and manipulated, caGrid is probably one of the more sophisticated data Grid architectures out there today.”