Digging into application performance characteristics can help you deliver solid service levels and fight unnecessary spending.

All but the smallest IT departments tend to compartmentalize skill sets. After all, hiring high-quality network, server, and storage admins is hard enough without attempting to recruit candidates who are strong performers across multiple IT tasks. Plus, “siloing” skills works as an organizing principle: It cleanly separates responsibilities among different members of the team and ensures that each portion of data center infrastructure gets its due.

Too bad technology is trending in a completely different direction. Data center technologies are becoming increasingly converged and virtualized, and as I’ve pointed out in the past, it’s getting harder and harder to survive as a network guy who knows nothing about storage, or vice versa. But that’s not the only drawback to the siloed approach. Another side effect of siloing is that the infrastructure team invests too little effort in understanding what makes applications tick from a technical perspective.

In most IT organizations, each mission-critical application has its own dedicated administrators. But these application-centric administrators seldom have a deep understanding of the infrastructure running underneath and depend on the infrastructure team for design, implementation, and support. In turn, the infrastructure team, which generally couldn’t care less about the application, depends upon the software vendor for guidance on how to deploy the application so that it gets the resources it needs.

This application delivery chain, from the end-user workstation, through the network, to the application stack and servers, and all the way down to the storage infrastructure, is only as strong as its weakest link. All too often the weakest link is between the applications team and the infrastructure team, especially when there isn’t a skilled DBA in the mix.

To understand the problem, consider a scenario that I’ve seen play out in many different forms over the years: It’s 2:15 p.m. on a Monday. Users have started reporting severe performance problems with a mission-critical application. The application administrators aren’t seeing any application-specific problems outside of sucky performance, so the problem is referred to the infrastructure team. The server admin jumps in and determines that the application servers are operating within bounds, but the database server is experiencing a large amount of storage latency. The storage admin then confirms that the SAN volumes attached to the database server are indeed maxing out, but there’s nothing actually wrong with them.

By this time, the problem is bad enough to be visible to several layers of management, so a troop of suits suddenly descends on the storage admin’s office to see what the problem is. Of course, when you have a hammer, everything looks like a nail, so the storage admin recommends adding more spindles or, in a moment of panic, upgrading the database volumes to SSDs.

In this scenario, nobody along the chain actually looked at what the application was trying to do or why it suddenly elevated disk load. The real problem, one I actually witnessed, was that two fairly intense database stored procedures were scheduled too closely together. If one ran just a bit longer than the application vendor assumed it would, the two would overlap and pummel the database volumes with I/O. Together, the two procedures took longer to complete and spilled over to overlap with other procedures, until the snowball gathered into a noticeable problem.
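Spotting a collision like that doesn’t require deep DBA skills, just a willingness to look at the job schedule. As a rough illustration, not a depiction of the environment in this story, here’s a minimal Python sketch that flags overlapping execution windows in a scheduler’s run history. The JobRun records and the sample job names and times are hypothetical stand-ins for whatever your scheduler actually logs.

from dataclasses import dataclass
from datetime import datetime
from itertools import combinations

# Hypothetical record type: one row per completed job execution,
# e.g. exported from your scheduler's run-history table.
@dataclass
class JobRun:
    name: str
    start: datetime
    end: datetime

def find_overlaps(runs):
    """Return pairs of job runs whose execution windows overlap."""
    overlaps = []
    for a, b in combinations(sorted(runs, key=lambda r: r.start), 2):
        # Two windows overlap if each starts before the other ends.
        if a.start < b.end and b.start < a.end:
            overlaps.append((a, b))
    return overlaps

if __name__ == "__main__":
    # Sample data only: in practice, pull these from your job history.
    runs = [
        JobRun("nightly_rollup", datetime(2011, 5, 2, 14, 0),
               datetime(2011, 5, 2, 14, 25)),
        JobRun("stats_rebuild", datetime(2011, 5, 2, 14, 15),
               datetime(2011, 5, 2, 14, 40)),
    ]
    for a, b in find_overlaps(runs):
        print(f"{a.name} overlaps {b.name} from "
              f"{max(a.start, b.start)} to {min(a.end, b.end)}")

Run against a few weeks of history, a report like this would have shown the two stored procedures colliding long before the I/O snowball reached the users.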
The application team knew the user-facing portion of the application inside and out, and the server folks knew the operating system and hardware, but nobody actually owned that extremely narrow slice of air where the rubber met the road. What inevitably results in these cases is unexpected performance brownouts like this one and reactionary overprovisioning of infrastructure resources to “fix” them. In today’s “do more with less” IT environment, this kind of outcome is all too common.

You might ask, for example, “Where was the DBA in all this?” Good question! There used to be one, but a tight budget and a massive influx of new applications shifted that person’s responsibilities to rolling out yet another new application. The “do more with less” mantra actually resulted in “doing less with more”: The cost of that additional storage gear could have gone toward the salary of someone who would have known that buying more metal was unnecessary.

The only way to avoid this kind of problem is to ensure that no gaps exist in the application support chain. In a perfect world, that’d mean having that DBA position back, but in this day and age, adding another FTE is rarely looked at as a solution to anything. Ultimately, the responsibility devolves to the storage, server, and network admins. They simply need to bootstrap it and school themselves in the nuts and bolts of the applications they support. After all, when things go pear-shaped, admins get the blame.

Unexplained spikes on a performance graph don’t happen by themselves, and most often they have little to do with malfunctioning hardware. Even transitory anomalies that escape notice by the suits are worth exploring to find out the cause. They could be signs of a problem that will bite you big time next Monday.

This article, “Calling all admins: Know thy applications,” originally appeared at InfoWorld.com.