by Greg Nawrocki

Troubleshooting SOAs and Grids Not Going to be Easy

Sep 26, 2005

With BEA World kicking off this week, there’s a ton of noise about J2EE environments and SOAs. An article today from Network World highlights the recent trend of Grid computing as the preferred application server for SOAs. In 2004 (and again in a Computerworld column last year), Ian Foster pointed out that SOAs and Grid computing were on converging paths (via web services standards and capabilities) … it’s interesting to see how quickly that prediction has been realized, with mainstream end users like Wachovia and Acxiom running SOAs on top of service-oriented Grid infrastructures. Another interesting point in the article: one challenge often cited as an inhibitor to Grid growth is that applications have to be rewritten for Grid environments … but Wachovia points out that Grid-enabling applications with Java actually isn’t that difficult.

Another discussion that I believe will start to get more air-time is what it means to troubleshoot SOA / Grid environments.

Even in the pre-SOA world, one of the problems that we see most consistently around application interconnections is the human labor factor. People set applications up and think they understand how they will behave, but later on, with hundreds of connections and things going wrong in real time, figuring out what’s where and what’s happening in relation to other subsystems … it’s extremely difficult.

“Typically, troubleshooting three tier system architectures (web server, application server, and database) like WebLogic consists of hours, or even days, manually sorting through various logs including web server requests, java exceptions, diag logs, and system configurations in dozens of different places each time something goes wrong,” says Michael Baum, CEO of a new datacenter search start-up called Splunk. “And this usually requires systems administrators, database administrators, and developers to cross administrative domains to determine various system interdependencies … and to identify where an exception or break-down may have occurred.”

Baum says that over the last few decades of IT, the systems management industry has grown up focused almost entirely on physical infrastructure management: performance, provisioning, CPU utilization, process utilization, etc. What’s really lacking are easier ways to understand the logical layer and the interdependencies between all of the different hardware resources and software services. Reverse engineering that logic at run-time to troubleshoot systems is today a very human-intensive task, typically done with very blunt instruments like grep, Perl, and awk. A medium-sized enterprise data center could be generating between 10 and 100 megabytes of logs a day, and that figure is up in the terabyte range for large enterprises. As Grid and SOA adoption continues to move forward, troubleshooting scenarios will get increasingly complex.
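To make the manual process concrete, here is a minimal sketch of the kind of cross-tier log correlation an administrator ends up scripting by hand. It is only an illustration, not Splunk's approach or any vendor's tooling: the sample log lines, the shared "ISO timestamp first" layout, and the helper names (`parse`, `merged_timeline`, `events_near`) are all assumptions made for the example. Real web server, application server, and database logs each have their own formats and would need their own parsers.

```python
from datetime import datetime, timedelta

# Hypothetical sample lines. Real tiers log in different formats; the
# common "ISO timestamp, space, message" layout is an assumption.
SAMPLE_LOGS = {
    "web": [
        "2005-09-26T10:00:01 GET /checkout 200",
        "2005-09-26T10:00:05 GET /checkout 500",
    ],
    "app": [
        "2005-09-26T10:00:04 ERROR java.sql.SQLException: connection refused",
    ],
    "db": [
        "2005-09-26T10:00:03 listener stopped",
        "2005-09-26T09:30:00 checkpoint complete",
    ],
}


def parse(tier, line):
    """Split one log line into (timestamp, tier, message)."""
    stamp, _, message = line.partition(" ")
    return datetime.fromisoformat(stamp), tier, message


def merged_timeline(logs):
    """Combine every tier's lines into one list ordered by timestamp."""
    events = [parse(tier, line) for tier, lines in logs.items() for line in lines]
    return sorted(events)


def events_near(timeline, keyword, window_seconds=5):
    """Return events from any tier within the window of a keyword match."""
    window = timedelta(seconds=window_seconds)
    hits = [when for when, _, message in timeline if keyword in message]
    return [e for e in timeline if any(abs(e[0] - hit) <= window for hit in hits)]


if __name__ == "__main__":
    timeline = merged_timeline(SAMPLE_LOGS)
    for when, tier, message in events_near(timeline, "ERROR"):
        print(when.isoformat(), tier, message)
```

Even in this toy case, the useful clue (the database listener stopping a second before the Java exception) only shows up once the three logs are lined up on one timeline. Doing that across dozens of machines, formats, and administrative domains is where the hours go.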