by Guerry Semones

Building cloud-ready, multicore-friendly applications, Part 2: Mechanics of the cloud

how-to
Apr 30, 2009 | 16 mins

Orient yourself and your applications in the cloud

In the first half of this article you learned the four attributes that your code must have to take advantage of multicore computers and cloud service platforms. But once deployed to the cloud, what makes your applications soar? Appistry’s Guerry Semones brings the cloud down to earth with this overview of the mechanics of scalability, reliability, load balancing, and more, in cloud computing’s distributed environments.

In the first half of this article you learned about four important attributes your code needs in order to run most effectively on multicore computers, or in multi-computer environments like the cloud:

  • Atomicity
  • Statelessness
  • Idempotence
  • Parallelism
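To make these four attributes concrete, here is a minimal Java sketch of a task that has all of them. The class and its pricing logic are hypothetical, invented for illustration rather than taken from any particular platform:

```java
import java.util.concurrent.Callable;

// A hypothetical unit of work with the four attributes from Part 1:
// - Atomic: one self-contained operation, no partial steps to track
// - Stateless: all inputs arrive at construction; no fields are mutated
// - Idempotent: running it twice with the same input yields the same result
// - Parallelizable: instances share nothing, so many can run at once
public final class PriceQuoteTask implements Callable<Double> {
    private final double basePrice;   // immutable input, set once
    private final double taxRate;

    public PriceQuoteTask(double basePrice, double taxRate) {
        this.basePrice = basePrice;
        this.taxRate = taxRate;
    }

    @Override
    public Double call() {
        // A pure function of the inputs: no I/O, no shared mutable state.
        return basePrice * (1.0 + taxRate);
    }
}
```

Because each instance is a pure function of its constructor arguments, an environment can run many copies concurrently, retry any of them, or discard duplicate results without coordination.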

But how exactly do these features help you take advantage of cloud platforms? Applications in the cloud inherit capabilities from the underlying cloud architecture — capabilities like scaling out horizontally, scaling up across multiple cores, availability, reliability, manageability, load balancing, and command and control. I touched on these benefits in my previous article; in this one, I’ll explain how cloud platforms deliver these benefits to your code.

First, let’s make sure we have a shared understanding of what exactly is meant by a cloud platform; then we’ll be able to talk about the benefits of cloud computing to architects and developers.

What is a cloud platform?

First, you need to orient yourself in the cloud. Figure 1 categorizes different cloud technologies into simple architectural layers. The breakdown is not perfect, as some products may touch more than one layer, but it’s a fine starting point.

Layers of a cloud platform
Figure 1. Layers of a cloud platform

The infrastructure-as-a-service cloud

Infrastructure-oriented cloud architectures, including infrastructure-as-a-service (IAAS) offerings, provide access to virtualized, on-demand computing resources. Amazon EC2 is a well-known example of this approach. The user can request that Linux and Windows virtual machine instances be created on the fly and billed based on actual usage. The cloud infrastructure allows the user to manage virtual machines (and associated resources, like IP addresses) and their configurations. With EC2, clients do not know where the machines are physically located or what kind of hardware is being used. This is what makes the service cloud-like.

Cloud platforms vs. platform-as-a-service (PAAS)

Platform-oriented approaches to the cloud, including platform-as-a-service (PAAS) and cloud application platforms, run atop an underlying cloud infrastructure. Cloud platforms abstract applications away from the cloud infrastructure and provide supporting services and functionality to those applications. The distinction between cloud infrastructure and cloud platforms is a critical one for architects and developers to understand.

Salesforce’s Force.com and Google’s App Engine (GAE) both typify the PAAS approach. Google App Engine users are solely concerned with the application they are creating to run on the platform. To deliver an application, they simply package it and deploy it to GAE. The deployment happens in a single step, and the end user does not know whether the application is being run on one virtual machine or 10 at any particular moment. In addition, the application can take advantage of special services provided by the GAE platform, such as authentication or data access.

Cloud application platforms, like their PAAS cousins, allow the developer to focus solely on the application deployed on the platform. Likewise, cloud application platforms offer the same or similar benefits described briefly for GAE above, such as virtualizing your application across the infrastructure, simplifying deployment, or providing special services. A key difference between some cloud application platforms and their PAAS cousins is portability across cloud infrastructures. For example, you can only deploy GAE applications on Google’s services, whereas cloud application platforms like Appistry CloudIQ Platform allow for in-house private cloud deployment, as well as deployment on public cloud infrastructures. Among other differences, PAAS solutions often restrict tool choices, whereas typical cloud application platforms allow you flexibility in the choice of implementation languages, IDEs, and tools.

Ideally, you should not have to care about the underlying cloud infrastructure that runs your code. Likewise, you should not be concerned with writing application code to implement scalability, reliability, and other cloud and distributed computing features that a cloud platform could provide. Your focus should be on the business logic that brings your added value, while the cloud virtualizes your application, manages its lifecycle, and leverages your application over the underlying cloud infrastructure. Cloud platforms take your code — which is ideally atomic, stateless (where possible), idempotent, and parallelizable — and do the heavy distributed computing and multicore lifting, giving you benefits that are otherwise hard to achieve on your own.

Scaling out, scaling up, and scaling down gracefully

Cloud platforms horizontally scale out your application by running it across many servers, or workers. When transaction loads are high or you anticipate the need for more throughput, you can add more workers. When loads drop, workers can be shut down (offering green dividends by reducing power use) or shunted over to another application that needs the workers now.

Why should you care if you’re a developer? If you have provided the cloud platform with a well-designed application, the cloud platform should be able to scale your application for you. Therefore, you don’t have to write the scalability code. In most cloud platforms, your code doesn’t know it’s in the cloud, much less being scaled out.

What about scaling up across multiple cores to utilize all the available processing power? The same principles apply. If your code follows the principles outlined in Part 1, then the cloud platform can automatically scale the execution of your code across whatever cores are available without you having to use any special language primitives or tools. The ability to do this varies by the cloud platform.
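As a rough illustration of what such a platform does under the hood (this is a plain-Java sketch, not any vendor's actual API), stateless, independent tasks can be fanned out across however many cores the current machine reports:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// A sketch of scaling up: because the tasks are stateless and
// independent, they can be spread across however many cores the
// current machine happens to have, with no special primitives in
// the tasks themselves.
public final class ScaleUpSketch {
    public static List<Integer> runAll(List<Callable<Integer>> tasks)
            throws InterruptedException, ExecutionException {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Integer> results = new ArrayList<>();
            // invokeAll blocks until every task has completed,
            // returning futures in the same order as the input list
            for (Future<Integer> f : pool.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

On a one-core machine this degrades to sequential execution; on an eight-core machine the same tasks run eight at a time, which is exactly the "your code doesn't know it's being scaled" property described above.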

If you run stateless, atomic code on a cloud platform, your application should gain resilience and the ability to scale up and down gracefully. If you need more resources, you can add more nodes and scale out horizontally; if your cloud platform utilizes multicore hardware efficiently, you also get to scale up across cores. If one or more nodes die, availability ensures that new work will get done, and reliability ensures that in-flight work has a chance to complete. Either way, you can scale down with a degree of grace, even in the face of hardware failures.

Availability

Cloud platforms distribute your code across the cloud in different ways. Some platforms put all of your code on every worker and can execute your code on any of those workers at any given time. Other platforms specify workers for given tasks or roles. Sometimes all of a transaction will occur on one worker. Other platforms may optionally distribute even the execution of a single transaction. Regardless of the model, cloud platforms make your application code highly available by distributing and managing it across multiple workers.

When your code is atomic and stateless in nature, it can then reside wherever the cloud platform puts it in the cloud. In an ideal setup, the code can execute anywhere without you or the code having to think about it. At its root, this means that you automatically have high availability. If a given compute node dies, who cares? The other nodes have the code and can fulfill transactions.

Reliability

What do I mean by reliability? Say you request code to execute, and something bad happens. If your code is reliable, the requested work still gets done; at the very least, the environment does its best to complete it instead of just giving up — or, worse, losing the work entirely.

There are a number of models for attaining reliable execution in cloud platform environments. If the cloud platform is designed to provide reliability to your code, then you’ll likely be allowed to configure declaratively (outside your code) how you want reliability to behave at runtime. Without a cloud platform that virtualizes and watches over your application, writing reliable, distributed applications from the ground up is a lot of work to do yourself.

Figure 2 illustrates one reliability model that directly shows the benefits of atomic, stateless, and idempotent code. Say you’ve requested that your code execute in the cloud, and a failure occurs. Perhaps the worker doing the work suffers a power supply failure. The cloud platform detects the loss of work, and, depending on packaging-time configuration, retries that work on a different worker instead of returning the failure immediately to the requester. The cloud platform then retries that work until success is achieved, or until some configured threshold is met and failure is returned.

Figure 2. Cloud platform retries failed work reliably. (Click to enlarge.)

If your code takes advantage of the attributes of atomicity, statelessness, and idempotence, then you have the flexibility needed for reliability, especially if the environment leverages these attributes for you. Without them, your options are narrowed. For example, consider atomicity in the reliability model just discussed. If the executed code encapsulates multiple non-atomic steps, then the complexity of retrying those steps goes way up. Likewise, if the code is a long-running series of steps, rather than stand-alone atomic steps, then a retry must rerun the entire series when failure happens, instead of just picking up at the step that failed.
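A minimal sketch of this retry model follows. The class names are hypothetical, and a real platform would read the retry threshold from your packaging-time configuration rather than a method argument:

```java
import java.util.List;
import java.util.concurrent.Callable;

// A sketch of the retry model in Figure 2: the platform resubmits a
// failed task to other workers until it succeeds or a configured
// retry threshold is reached, and only then surfaces the failure.
// This is only safe because the task is atomic and idempotent.
public final class RetrySketch {
    public static <T> T executeReliably(List<Callable<T>> workers,
                                        int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Pick a different worker for each attempt (round-robin here).
            Callable<T> worker = workers.get(attempt % workers.size());
            try {
                return worker.call();   // success: return the result
            } catch (Exception e) {
                last = e;               // failure: note it and retry elsewhere
            }
        }
        // Threshold met: return the failure to the requester.
        throw last;
    }
}
```

Note that the retry logic lives entirely outside the task; the task itself is oblivious to being retried, which is the point.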

Another approach to reliability in the cloud (besides retries and other approaches already discussed) is to execute duplicates of the same task in parallel. The task that completes first is accepted by the client, or the results of both are analyzed and one is chosen. This is illustrated in Figure 3.

Figure 3. A cloud platform executes same task twice in parallel; the first to complete wins. (Click to enlarge.)
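This duplicate-execution model can be sketched with the standard Java executor API, which already has a first-successful-result primitive, `invokeAny`. The class name is invented for illustration:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// A sketch of the model in Figure 3: submit the same task to two
// workers and accept whichever finishes first. invokeAny returns the
// result of the first task to complete successfully and cancels the
// rest. Running the task twice is only safe because it is idempotent.
public final class DuplicateExecutionSketch {
    public static <T> T firstToFinish(Callable<T> task) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            return pool.invokeAny(List.of(task, task));
        } finally {
            pool.shutdownNow();   // cancel the losing duplicate
        }
    }
}
```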

Of course, not all code is idempotent and repeatable, often because it affects state of some sort. In such a case, the cloud platform needs to be able to deal with that, preferably in an application-configurable manner. You’ll see some possible solutions later in this article.

Manageability

Even as developers, we are affected by how difficult or easy it is to deploy and manage code in the runtime environment. When the runtime environment, even in development and testing, is distributed across multiple servers, the complexity and time involved in managing the application go up dramatically. Cloud platforms take this into account — more often than not because the developers who create and maintain the cloud platform are affected by the same complexities!

Some cloud platforms allow you to code and test your application on one box rather than many, and some cloud application platforms allow you to develop most or all of your applications outside the cloud platform with your normal development and testing tools. (This is not true for many platform-as-a-service environments.)

Beyond this point, there are varying levels of difficulty in deploying and managing your application on the various cloud platforms. The worst-case scenario arises all too often, where you must manually deploy to each server or virtual machine directly, as illustrated in Figure 4.

Figure 4. Manually managing individual servers or virtual machines in the cloud. (Click to enlarge.)

(In the discussion that follows, I’ll be focusing on those feature sets that I consider easiest to deal with. Your mileage will vary based on the cloud platform you choose.)

Imagine that you have some code ready to run. Typically, you will package the application in some way, bundling with it configuration information that tells the cloud platform how you want the application managed. Next, you will deploy that application into the cloud platform with a single command. Some (but not all) cloud platforms will automatically distribute your application to all of its workers (or some workers, depending on the platform’s model), and get your application up and running, as shown in Figure 5. You’re done — now use your client and access your cloud application.

Figure 5. Managing a cloud of servers or virtual machines as a single entity, and with a fine degree of granularity. (Click to enlarge.)
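The configuration bundled with the application at packaging time might look something like the following descriptor. The element names here are purely hypothetical (each platform defines its own schema); the point is that management behavior is declared alongside the code, not written into it:

```xml
<!-- Hypothetical packaging descriptor; names are illustrative only -->
<application name="price-quoter" version="1.2">
  <artifact>price-quoter.jar</artifact>
  <!-- How the platform should manage the application at runtime -->
  <scaling minWorkers="2" maxWorkers="20"/>
  <reliability retries="3" onExhausted="return-failure"/>
</application>
```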

Subsequent versions of your code are handled the same way. You will usually repackage the code and redeploy it, probably with some mechanism for package versioning. The cloud platform will update the code for you.

Another difficulty arises from managing things at the granularity of a virtual machine image. In typical infrastructure-as-a-service cloud environments, you create a virtual machine image and populate that image with services, applications, and configurations. If something changes in your application’s configuration, you revise the virtual machine image — a time-consuming and potentially error-prone process. The term image sprawl aptly describes the growing pool of images that results from this model.

Some cloud application platforms provide a finer degree of granularity. With such platforms, you address the virtual machine image as what it should be: an operating system layer plus a cloud application platform agent. The agent oversees the virtual machine, and allows services, applications, and configurations to be deployed, updated, and versioned, and their lifecycles maintained, without changing the underlying virtual machine image. The difference between these two image-management styles is illustrated in Figure 6.

Figure 6. Coarse-grained vs. fine-grained image management. (Click to enlarge.)

Why do I even bring this up in an article primarily focused on developers? Deployment is not a production- and testing-only concern. Anything that affects time usage during development needs to get the hairy eyeball — at least, my pragmatic-programmer roots think so.

Of course, this level of manageability goes way beyond the developer. I’m aware of one company with over three hundred workers (and over five hundred cores) that manages its private production cloud running multiple cloud applications with less than one-third of an administrator’s time.

Load balancing

Cloud platforms use various types of load balancing. It may be as simple as using software- or hardware-based load balancers between the cloud application and its client — or it could be as sophisticated as the cloud platform utilizing its own built-in software-based load balancing. Load balancing affects both scalability and availability. When your application’s work is distributed across many workers, you want to make sure that the resources of each worker in the cloud are fully utilized when needed. It would not help you if some workers were maxed out and others ignored or underutilized while your application is under a spike of heavy load.

The situation gets more complex when you introduce servers with varying capacities and speeds into your cloud. If your cloud is made up of workers that range from older single-core processors up to machines with four or more cores, then you have workers with very different capability footprints. As demand ramps up, the CPU cores, memory, and other resources should be utilized fairly across the distributed workers. Slower machines should carry the load they are capable of, and faster machines should be utilized more. The old naval adage that the armada is only as fast as its slowest ship should not apply here.
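One simple way to honor those different capability footprints is capacity-weighted scheduling. Here is a sketch (hypothetical, not any platform's implementation) that hands out work in proportion to each worker's core count:

```java
import java.util.List;

// A sketch of capacity-aware load balancing: workers with more cores
// receive proportionally more work, so a four-core box is not held
// to the pace of an old single-core one.
public final class WeightedBalancer {
    private final List<Integer> coreCounts; // capacity weight per worker
    private long counter = 0;

    public WeightedBalancer(List<Integer> coreCounts) {
        this.coreCounts = coreCounts;
    }

    // Returns the index of the worker that should take the next task.
    public synchronized int nextWorker() {
        int totalCores = coreCounts.stream().mapToInt(Integer::intValue).sum();
        long slot = counter++ % totalCores;  // position within one full cycle
        for (int i = 0; i < coreCounts.size(); i++) {
            slot -= coreCounts.get(i);
            if (slot < 0) return i;          // worker i owns this slot
        }
        throw new IllegalStateException("unreachable");
    }
}
```

With weights of 1 and 4, for example, the four-core worker receives four of every five tasks. A production balancer would also watch live metrics such as queue depth and memory, but the proportional idea is the same.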

Either way, wouldn’t it be great if you as a developer did not have to worry about this? Some cloud platforms take care of this for you, so you don’t have to sweat the application infrastructure or architecture to make sure the application load balances across the cloud.

Command and control

Using atomic, cohesive code opens up the possibility of using declarative state machines. Declarative state machines have been around for a while; they allow you to design flows of steps in a declarative way, often in XML or some other domain-specific language (DSL). They are often used in middleware and in defining business logic and workflows. Spring Web Flow is based on this concept, as are Microsoft’s Windows Workflow Foundation and Appistry’s own Process Flow technology. There are many other examples.

Typically, the model runs something like this: a state machine of different steps is defined. Each step or state is tied to a task or unit of executable logic. The state to which the machine branches is determined by the task execution results of the prior step. If a step succeeds, then the next step takes some happy path. If a step fails, then the next step may execute a compensating task to deal with the failure, or request help, or return failure. Success or failure is usually defined by conditional logic, rules, data values, thrown exceptions, and other conditions.
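A toy version of such a state machine can be written in a few lines of Java. The API here is invented for illustration (real engines such as Spring Web Flow define flows declaratively in XML), but it shows the branch-on-result model just described:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// A sketch of a declarative state machine: each state names an atomic
// task plus the next state on success and on failure. The flow is
// data, not code, so failure handling lives outside the tasks.
public final class FlowSketch {
    record Step(Supplier<Boolean> task, String onSuccess, String onFailure) {}

    private final Map<String, Step> steps = new HashMap<>();

    public FlowSketch define(String name, Supplier<Boolean> task,
                             String onSuccess, String onFailure) {
        steps.put(name, new Step(task, onSuccess, onFailure));
        return this;
    }

    // Runs from the start state until reaching a terminal state,
    // i.e., any name with no step defined for it (e.g. "DONE").
    public String run(String start) {
        String state = start;
        while (steps.containsKey(state)) {
            Step step = steps.get(state);
            // Branch on the task's result, as defined declaratively above
            state = step.task().get() ? step.onSuccess() : step.onFailure();
        }
        return state;
    }
}
```

A failed "charge" step, for instance, can declaratively route to a "compensate" step without the charging code knowing anything about compensation.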

By using declarative state machines to orchestrate atomic, stateless, and perhaps idempotent code in the context of distributed environments like cloud platforms, you can get surprising levels of robustness, reliability, and flexibility. Additionally, if your code must be stateful, or cannot be safely re-executed because its operations are not idempotent, a declarative state machine makes that code more reliable. The declarative nature of the state machine allows your design to accommodate failures in these conditions, without putting the failure handling inside your code. Also, some state machines allow for snapshotting progress through the state machine’s steps, so that a process interrupted by failure can be resumed and completed. Again, this would not be possible without making sure the code breaks down nicely into atomic steps or tasks.

When this technology is seen in cloud platforms, it allows the orchestration of your code across many workers in a reliable, scalable, available, and load-balanced way without your code knowing about it.

Where to from here

Do cloud platforms exist in the real world? There are differences from platform to platform, both in features and focus, but typically each cloud platform hides the cloud infrastructure from your application, virtualizes your application to manage and leverage it in a cloud-like manner, and provides essential services so that you as a developer do not need to re-invent the wheel.

You should now have a good sense of how cloud computing will change the way you design code. Like any design principles, the ones described in this series have to be applied with some common sense. There are no magic or silver bullets, and hammers aren’t the right tool for every job. However, considering these principles will help you leverage your code on cloud platforms now and in the onrushing future.

See the Resources section below to learn more about cloud computing.

Guerry A. Semones is a founding senior engineer and product manager at Appistry, a pioneer and leading provider of next-generation cloud application platforms. Guerry also serves as liaison to the Appistry Peer2Peer developer community. Find out more by reading his blog.