Cisco's Unified Computing System is a more manageable, more scalable, and essentially superior blade server system, despite 1.0 warts.

Bottom Line: Cisco UCS 1.0 is like no other blade-based server infrastructure available today. Its reliance on 10Gb Ethernet grants it plenty of bandwidth, while Cisco’s model of treating chassis as simple extensions of the fabric allows for a new order of scalability and significant reliability. Cisco started from the ground up and really has built a new way to manage server resources.

Revolutionary. Cutting edge. State of the art. These are words and phrases that are bandied about so many products in the IT field that they become useless, bland, expected. The truth is that truly revolutionary products are few and far between. That said, Cisco’s Unified Computing System fits the bill. To fully understand what Cisco has done requires that you dispense with preconceived notions of blade servers and blade chassis. Rewire your concepts of KVM, console access, and network and storage interfaces. Reorganize how you think of your datacenter as islands of servers surrounded by storage arrays and networks. Cisco had the advantage of starting from scratch with a blade-based server platform, and it’s made the most of it. In short, UCS is built around a familiar concept — the blade chassis — but rearchitects it to enable both greater manageability and greater scalability. For the architectural background, read my summary, “How Cisco UCS reinvents the datacenter.” This article focuses on the nitty-gritty details of UCS and my experiences working with the system during a recent visit to Cisco’s San Jose test labs.
UCS building blocks

A Cisco UCS chassis provides eight slots for half-width blades, each equipped with two Intel Nehalem processors, up to 96GB of RAM with 8GB DIMMs, two SAS drive slots, an LSI Logic SAS RAID controller, and a connection to the blade backplane. In addition, each blade is outfitted with a Cisco Converged Network Adapter, or CNA. The CNA is essentially the heart of the system, the component that makes UCS unlike traditional blade systems. The CNA is a mezzanine board that fits a QLogic 4Gb Fibre Channel HBA and an Intel 10Gb Ethernet interface on a single board, connecting directly to the chassis network fabric. The presentation to the blade is two 10Gb NICs and two 4Gb FC ports, with two 10Gb connections to the backplane on the other side. The initial release does not support multiple CNAs per blade, or really even require one. But the CNA is integral to how the entire UCS platform operates, as it essentially decouples the blade from traditional I/O by pushing storage and network through two 10Gb pipes. This is accomplished through the use of FCoE (Fibre Channel over Ethernet). Everything leaving the blade is thus Ethernet, with the FC traffic broken out by the brains of the operation, the Fabric Interconnects (FI). So we have some number of CNA-equipped blades in a chassis. We also have two four-port 10Gb fiber interface cards in the same chassis and two FIs downstream that drive everything. It’s not technically accurate to call the FIs switches, since the chassis function more like remote line cards populated with blades. No switching occurs in the chassis themselves; they are simply backplanes for blades that have direct connections to the FIs. Physically, the FIs are identical in appearance to Cisco Nexus 5000 switches, but they have more horsepower and storage to handle the FCoE to FC breakout tasks. They offer 20 10Gb ports, and they support a single expansion card each.
The expansion cards come in a few different flavors, supporting either four 4Gb FC ports and four 10Gb Ethernet ports, or six 10Gb Ethernet ports, or eight 4Gb FC ports. This is in addition to the twenty 10Gb ports built into each FI. There are also three copper management and clustering ports, as well as the expected serial console port. The FI is wholly responsible for the management and orchestration of the UCS solution, running both the CLI and the GUI natively — no outside server-based component is required.

Test Center Scorecard: Cisco UCS 1.0 scores 9.2 overall (Excellent).

Connecting the dots

Perhaps a mental picture is in order. A baseline UCS configuration would have two FIs run in active/passive mode, with all network communication run in active/active mode across both FIs and each chassis. (Think of a Cisco Catalyst 6509 switch chassis with redundant supervisors — even if one supervisor is standby, the Ethernet ports on that supervisor are usable. The two FIs work basically the same way.) They are connected to each other with a pair of 1Gb Ethernet ports, and they have out-of-band management ports connected to the larger LAN. The blade chassis is connected by two or four 10Gb links from each FEX (Fabric Extender) in the chassis, a set to each FI. That’s it. A fully configured chassis with 80Gb of uplink bandwidth will have four power cords and eight SFP+ cables coming out of it — nothing more. Conceivably, an entire rack of seven UCS chassis running 56 blades could be driven with only 56 data cables, or 28 if only four 10Gb links are required on each chassis. From there, the pair of FIs are connected to the LAN with some number of 10Gb uplinks, and the remainder of the ports on the FIs are used to connect to the chassis. A pair of FIs can drive 18 chassis at 40Gb per chassis with two 10Gb uplinks to the datacenter LAN, allowing for eight 4Gb FC connections to a SAN from an eight-port FC expansion card.
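The cabling arithmetic above scales linearly with chassis count and is easy to sanity-check. Here is a minimal Python sketch (my own, not anything Cisco ships) that reproduces the article's numbers, assuming two FEXes per chassis and eight blades per chassis:

```python
def rack_cabling(chassis, links_per_fex=4, blades_per_chassis=8):
    """Count data cables and blades for a rack of UCS chassis.

    Each chassis carries two FEXes, and each FEX runs its own set of
    10Gb links (one set to each Fabric Interconnect).
    """
    cables = chassis * links_per_fex * 2
    blades = chassis * blades_per_chassis
    return cables, blades
```

A rack of seven chassis with four links per FEX works out to 56 cables for 56 blades; halving the links per FEX halves the cables, matching the 28-cable figure in the text.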
The basis of the UCS configuration is the DME (Data Management Engine), a memory-based relational database that controls all aspects of the solution. It is itself driven by an XML API that is wide open. Everything revolves around this API, and it’s quite simple to script interactions with the API to monitor or perform every function of UCS. In fact, the GUI and the CLI are basically shells around the XML configuration, so there’s no real disparity between what can and can’t be done with the CLI and GUI, or even external scripts. UCS is a surprisingly open and accessible system. Following that tenet, backing up the entirety of a UCS configuration is simple: The whole config can be sent to a server via SCP, FTP, SFTP, or TFTP, although this action cannot be scheduled through the GUI or CLI. The initial setup of a UCS installation takes about a minute. Through the console, an IP is assigned to the out-of-band management interface on the initial FI, and a cluster IP is assigned within the same subnet. A name is given to the cluster, admin passwords are set, and that’s about it. The secondary FI will detect the primary and require only an IP address to join the party. Following that, pointing a browser at the cluster will provide a link to the Java GUI, and the UCS installation is ready for configuration.

Build me up, Scotty

The first order of business is to define the ports on the FIs. They can either be uplink ports to the LAN or server ports that connect to a chassis. Configuring these ports is done by right-clicking on a visual representation of each FI and selecting the appropriate function. It’s simple, but also cumbersome because you cannot select a group of ports; you have to do them one by one. Granted, this isn’t a common task, but it’s annoying just the same. Once you’ve defined the ports, the chassis will automatically be detected, and after a few minutes, all the blades in the chassis will be visible and ready for assignment.
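Because the GUI and CLI are just shells around the XML configuration described above, scripting against the DME amounts to building and parsing small XML documents. The sketch below is a minimal Python illustration using the API's session-open (aaaLogin) and class-query (configResolveClass) calls; exact element and attribute names should be verified against Cisco's XML API reference, and the HTTP transport (a POST to the cluster's management address) is omitted here:

```python
import xml.etree.ElementTree as ET

def build_login_request(username, password):
    """Build an aaaLogin request, the XML API's session-open call."""
    el = ET.Element("aaaLogin", inName=username, inPassword=password)
    return ET.tostring(el, encoding="unicode")

def parse_login_response(xml_text):
    """Pull the session cookie (outCookie) out of an aaaLogin response."""
    return ET.fromstring(xml_text).get("outCookie")

def build_blade_query(cookie):
    """Ask the DME for every managed blade (class computeBlade)."""
    el = ET.Element("configResolveClass", cookie=cookie,
                    classId="computeBlade", inHierarchical="false")
    return ET.tostring(el, encoding="unicode")
```

Posting the login document to the cluster address yields a cookie that authenticates every subsequent call, so a monitoring script is little more than a loop of query documents.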
This is where it gets interesting. Before anything happens to the blades, various pools and global settings must be defined. These pools concern Fibre Channel WWNN (World Wide Node Name) and WWPN (World Wide Port Name) assignments, Ethernet MAC pool assignments, UUIDs (Universally Unique Identifiers), and management IP pools for the BMC (Baseboard Management Controller) interfaces of the blades. These are open for interpretation, as you can assign whatever range of addresses you like for the UUID, WWNN, WWPN, and MAC ranges. In fact, it’s so wide open that you can get yourself into trouble by inadvertently overlapping these addresses if you’re not careful. However, assigning pools is extremely simple, accomplished by specifying a starting address and the number of addresses to put into the pool. Make sure you get it right, however, because you cannot modify a pool later; you can only specify another pool using an adjacent range of addresses. You also need to worry about firmware revisions. You can load several different versions of firmware for all blade components into the FIs themselves and assign those versions to custom definitions, ensuring that certain blades will run only certain versions of firmware for every component, from the FC HBAs to the BIOS of the blades themselves. Because UCS is so new, there are only a few possible revisions to choose from, and loading them on the FIs can be accomplished through FTP, SFTP, TFTP, and SCP. Once present on the FIs, firmware can then be pushed to each blade as required. You also can set up predefined boot orders — say, CD-ROM, then local disk, followed by an FC LUN, and PXE (Pre-boot Execution Environment). These can also be assigned to each server instance as required and can include only one element if desired. You can also define VLANs to present to the blades and which VLAN should be native. 
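Pool expansion itself is trivial: a starting address plus a count. But as noted, nothing stops two pools from overlapping. The following is a hypothetical Python sketch of both behaviors (my own illustration, not Cisco code; the 00:25:B5 prefix is simply an example):

```python
def expand_mac_pool(start, size):
    """Expand a MAC pool from a starting address and a count,
    mirroring how UCS pools are specified in the GUI."""
    base = int(start.replace(":", ""), 16)
    return [
        ":".join(f"{base + i:012X}"[j:j + 2] for j in range(0, 12, 2))
        for i in range(size)
    ]

def pools_overlap(pool_a, pool_b):
    """UCS will happily accept overlapping pools; catch it yourself."""
    return not set(pool_a).isdisjoint(pool_b)
```

Running pools_overlap against an existing range before committing a new one catches exactly the inadvertent address collisions the text warns about, which matters since a pool cannot be modified after creation.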
It’s assumed that each server will trunk those 10Gb interfaces, but native VLAN assignment means that isn’t a hard and fast requirement. In production, it’s likely that each blade will trunk, so that assumption is valid. However, the FIs don’t play nice with VTP (VLAN Trunk Protocol), so VLAN definitions are manual, not derived from the rest of the switched LAN. If you have a pile of VLANs that you need to present to your servers, be ready for lots of clicking and typing. Cisco hopes to remedy this in an upcoming release. There are a few other odds and ends, such as scrub policies. These exist to determine what action to take when a service profile is pulled from a physical blade with local disk — in other words, whether the local disk should be erased or left alone. Unfortunately, this “scrub” really isn’t one — it merely destroys the partition table, without actually overwriting the disks. Once you’ve created your pools, you can start building your blades into actual servers. The options for building out servers are simple: Either a blade boots from the SAN or PXE, or it boots from local disk. Managing storage is outside the scope of UCS, so let’s assume you have a competent storage administrator and you need a bunch of LUNs assigned for your budding UCS installation. Through the UCS GUI, you can pull up a simple list of all WWNN and WWPN assignments and immediately export that list to CSV, making it extremely simple to pass that information off to the admin for the storage configuration. Talk about handy. But I digress — we haven’t even built a server yet.

Service profiles

Server builds are defined in service profiles, which are themselves derived from service profile templates. Service profile templates allow you to define specific server instances and automatically provision one or more servers. Once you’ve created one global profile, you can duplicate that profile to however many servers you may need to fulfill that task.
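The CSV handoff mentioned above is equally easy to reproduce from a script that pulls the same assignments out of the XML API. A hypothetical sketch of the formatting step, using Python's standard csv module:

```python
import csv

def export_wwpn_csv(assignments, fileobj):
    """Write (service profile, WWNN, WWPN) rows as CSV for handoff
    to the storage administrator.

    `assignments` is a list of 3-tuples, e.g. as gathered from the
    XML API or exported from the GUI.
    """
    writer = csv.writer(fileobj)
    writer.writerow(["Service Profile", "WWNN", "WWPN"])
    writer.writerows(assignments)
```

Fed a file object, this produces the same kind of list the GUI exports, ready to mail to whoever is zoning the SAN.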
The configuration profiles determine the firmware revision for each blade component; the WWNN, WWPN, and MAC pools to choose from; the boot orders you may have defined; and even the boot policy — boot from SAN, local, or what have you. All of this is surprisingly simple to organize. You can also call upon the Ethernet and FC port designations you created earlier — such as eth0, eth1, fc0, and fc1 — that correspond to each FI, thus providing redundancy across each blade. I did run across a few bugs here; for example, port assignments that were clearly defined as Fabric A and Fabric B somehow merged into Fabric A when that template was applied to a server and had to be manually corrected. I was assured that this bug was being actively addressed. In the grand scheme of things, it’s minor and highlights the fact that this is a 1.0 release. There are two forms of service profile templates: initial and updating. Each has specific pros and cons, and it’s unfortunately not possible to switch a template from one form to the other after it has been created; if you begin with an initial template, it cannot later be used to propagate updates. Initial profile templates are used to build service profiles once, with no attachment to the originating templates. Updating templates are bound to those service profiles, so changing settings on an updating template will cause those changes to be pushed out to all bound service profiles. This is a double-edged sword: while it does simplify the management of service profiles, making those changes results in a reboot of the affected servers — sometimes with little or no warning. Something as innocuous as changing the boot order on a template could cause 20 blades to reboot when you click Save. It would be nice to have an option to stagger the reboots, schedule them, or both. Cisco has acknowledged that problem and is working on a fix.
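To make the difference concrete, here is a toy Python model of the two template forms. It is purely illustrative (none of these classes exist in UCS), but it captures why one change to an updating template fans out as simultaneous reboots of every bound profile:

```python
class ServiceProfile:
    """A server instance built from a template."""
    def __init__(self, name, settings):
        self.name = name
        self.settings = dict(settings)
        self.reboots = 0

    def apply(self, settings):
        self.settings = dict(settings)
        self.reboots += 1          # pushed changes reboot the blade

class Template:
    """Toy model of initial vs. updating service profile templates."""
    def __init__(self, settings, updating=False):
        self.settings = dict(settings)
        self.updating = updating
        self.bound = []

    def instantiate(self, name):
        profile = ServiceProfile(name, self.settings)
        if self.updating:          # updating templates stay bound
            self.bound.append(profile)
        return profile

    def change(self, key, value):
        self.settings[key] = value
        for profile in self.bound:  # every bound profile reboots at once
            profile.apply(self.settings)
```

An initial template's profiles keep running untouched after a change; an updating template's change() hits every bound profile at once, which is exactly the behavior that makes an unscheduled click of Save risky.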
Initial profiles do not have this problem, but once built, they must be manually modified one by one, server by server, if changes are required. There is no best-of-both-worlds solution here, unfortunately. In any event, you can create a service profile that defines what firmware a blade should run on each component; what WWNN, WWPN, and MAC addresses to assign to the various ports on the blade; what management IP address to assign to the BMC; what order to boot the blade; and where the blade boots from — local disk or SAN LUN. You can then assign that profile to a specific blade, or you can put all the identical blades into a pool and assign the profile to the pool, letting UCS pick the blades. Here, a curious thing happens.

PXE this

Each blade is but an empty vessel before UCS gets its hands on it. With each server profile, a blade must conform to any number of specific requirements, from the firmware revision on up. Cisco accomplishes the transformation from blank slate to fully configured blade by PXE booting the blade with some 127.0.0.0 network PXE magic and pushing a Linux-based configuration agent. The agent then accesses all the various components, flashes the firmware, assigns the various addresses, and makes the blade conform to the service profile. This takes a minute or two, all told. Following that, the blade reboots and is ready to accept an operating system. This process presents a bit of a quandary: What if I want to PXE boot the OS? Through a bit of magic, the UCS configurator PXE framework will not interfere with normal PXE operations. It’s apparently smart enough to get out of the way once the blade has been imprinted with the service profile. From that point on, you can install an OS as normal — say, VMware ESX Server, RHEL 5.3, or what have you. You can also use the virtual media facilities present in the remote KVM feature.
This is somewhat old hat by now, but you can select an ISO image from your local system to present to the blade as a connected CD or DVD, and boot from that to install the OS. Here’s where another funny thing happens: Generally speaking, there are no drivers to install. Windows Server 2008, RHEL 5.3 and later, and VMware ESX 3.5 U4 already have all the required UCS drivers present in the default install. You might think that Cisco’s been planning this for some time. You might also think that Cisco has some significant pull with various OS vendors. You might be right.

Bouncing around the room

So you have your blades built with Windows Server 2008, VMware ESX, RHEL 5.3, or whatever. Each of them can play on however many VLANs you’ve defined and bind to whatever SAN LUNs you’ve presented, and all are basically fat, dumb, and happy. So what happens when a blade goes down? There isn’t a truly defined high-availability aspect to UCS, which is somewhat disappointing. However, if you assign the server instance to a pool of blades, and it boots from a SAN LUN, then the failure of the blade running that instance will result in the instance being booted from another, identical blade in that pool. This process takes several minutes, because UCS needs to prepare the target blade with all the specifics of the service profile, then reboot, but it does provide basic HA capabilities. It would be nice to see some form of “real” HA defined on UCS, though this poor-man’s HA is functional. Another significant facet of UCS is the concept of organizations. Cisco’s management framework for UCS is not unlike LDAP in that it leverages the concept of inheritance. Thus, it’s possible to create organizations that have their own policies, pools, and service profiles, while child organizations can draw from the parent organization’s pools and so forth, inheriting policies and pools from above.
This makes management simpler by allowing you to create global pools and policies that serve as catch-alls, while getting more granular with those applied to a specific organization. Further, administration can be delegated along organizational lines. Using another facility, dubbed Locales, administrative users can be granted rights to specific management duties within specific organizations, with those rights flowing downhill to sub-organizations.

The tale of the scale

As with all IT infrastructure initiatives, scalability is key. Surprisingly, this isn’t really an issue with UCS. Each UCS 6120XP FI can handle 144 blades with dual LAN uplinks, and the soon-to-be-released 6140s will handle up to 304 blades in the same fashion. This controller-to-blade ratio is off the charts, allowing UCS installations to scale dramatically while requiring only the relatively cheap chassis and blades rather than the pricier FIs. There are also significant provisions for multitenancy. For instance, perhaps you have separate working groups or even customers that need dedicated physical separation not only from each other but also onto completely separate LANs. This can be achieved through the use of Pin Groups, which essentially pin specific physical interfaces to groups of servers. These can be applied to either LAN or SAN connections, so you can pin specific SANs to specific service profiles — not specific blades. This permits situations such as the following: Say four blades are deployed from a single service profile created for a specific department with its own LAN and SAN. These service profiles would be pinned to specific uplink ports run to that LAN and SAN. Should a blade fail, the service profile that was assigned to that blade will be brought up on another blade — perhaps within another chassis — yet that server instance will still maintain the physical separation as part of the pin group.
This is a huge benefit for service providers and enterprises that have physically disparate network and storage segments. It places the UCS solution in the middle of any number of different network topologies while retaining physical separation, and it happens automatically. The real tale of scalability rests with the fact that the chassis themselves are just sheet metal, a backplane, and some fabric ports. There are no smarts in the chassis, which makes them cheap. That, coupled with the significant scaling of the FIs, means that the more chassis you add, the cheaper the solution becomes. If there’s a single lesson to take away from UCS, it’s that the chassis are nothing more than extensions of the FIs, and they have more than enough bandwidth to run whatever you need. That said, once you’ve filled up a pair of FIs, you have to start over with a new cluster; different UCS clusters cannot yet intermingle under a single management domain.

Caveat emptor

To be frank, the features, scope, and breadth of the UCS offering are quite impressive for a 1.0 release. That’s not to say there aren’t problems. For one thing, it’s not terribly clear when changes made to service profiles will cause a blade to reboot. In some instances, warnings are issued when configuration changes may cause a blade to reboot, but otherwise the state of a blade is somewhat opaque. I encountered a few minor GUI problems and one more significant glitch: During one service profile push, the PXE blade prep boot didn’t happen. A manual reboot of the blade through the KVM console got everything back on the right track, however. Throughout all the buildups and teardowns of the blades, this was the only time that happened. Of some concern are the fault-monitoring aspects of UCS. For instance, when a drive was pulled from a RAID 1 array on a running host, the event failed to throw a fault showing that the drive had failed.
However, it did produce a notification that the server was now in violation of the assigned profile because it only had one disk. Further, re-inserting the disk cleared the profile violation, but produced no indication of the RAID set rebuild status. Indeed, there doesn’t seem to be a way to get that information anywhere aside from a reboot and entry into the RAID controller BIOS, which is somewhat troubling. Cisco has filed a bug related to this problem and expects it to be fixed in an upcoming release. A minor consideration is that, while Cisco is agnostic as to the make of the FC SAN attached to UCS, it must support NPIV (N_Port ID Virtualization). Most modern FC SANs shouldn’t have a problem with this, but it is an absolute requirement. Finally, there’s the matter of cost. In keeping with all things Cisco, UCS isn’t terribly cheap. Unless you’re planning on deploying at least three chassis, it may not be worth it. The reason for this is that the chassis are relatively affordable, but the FIs and associated licenses are not. However, the scalability inherent in the UCS design means that you can fit a whole lot of blades on two FIs, so as you expand with chassis and blades, the investment comes back in spades. A well-equipped redundant UCS configuration with 32 dual-CPU Nehalem E5540-based blades with local SAS drives and 48GB of RAM each costs roughly $338,000. But adding another fully equipped chassis costs only $78,000, nearly half the price of a traditional blade chassis with similar specs. I certainly found some problems with UCS, but they float well above the foundation, which is equally impressive for its manageability, scalability, and relative simplicity. There’s a whole lot to like about UCS, and the statement it makes just might cause that revolution.