Note: This document is part of a collection of documents that comprise the Reference Architecture for Private Cloud document set. The Reference Architecture for Private Cloud documentation is a community collaboration project. Please feel free to edit this document to improve its quality. If you would like to be recognized for your work on improving this article, please include your name and any contact information you wish to share at the bottom of this page.
Consumers of IT platform services have the business on their minds and have decomposed needs and problem areas into processes, tasks and timelines for addressing those needs. In short, they have a plan that includes IT capabilities. Part of that plan will include execution steps for development, test, staging and deployment of business applications hosted on IT resources. The consumer wants to select from a set of IT resources and services, acquire them through self-service, and compose them into a higher level of service that meets their needs. Over time, consumers will need to expand or collapse the resources that make up services in response to changes in business need. This elastic capability also implies that these responses are performed in a rapid or agile manner. The consumer also expects that resources will be managed and optimized on their behalf in the most efficient and secure manner. That is, the user is not concerned with the details of how a resource presents a capability. Their concern is that the capability is reliably there and that they are charged appropriately for the amount of the resource or service they use. To compose services holistically, the platform must provide the capability to stitch together the resources and services it offers into user-defined services. Each user-defined service becomes an automated operation available to the user through self-service.
The provider view takes the form of internal IT delivering the Private Cloud Infrastructure as a Service platform within their organization. In addition to the above cloud computing characteristics the remaining characteristics also apply:
The cloud computing Deployment Models defined by NIST also drive automation requirements and capabilities, since resources and services may be deployed privately on-premises, hosted, or in a combined hybrid cloud deployment model. Providers require automation to drive predictable results in every layer of the Private Cloud Reference Architecture, starting with the foundation of Infrastructure as a Service and continuing through the higher service models, Platform as a Service and Software as a Service. At the lowest layers of the infrastructure, providers use automation to perform bare-metal provisioning and configuration of hardware resources. These resources are composed into pools from which provisioning jobs allocate capacity. IT uses automation to create and update the enterprise service catalog and configuration management data stores dynamically as resources are built. Automation then draws on the service and configuration stores to make downstream decisions such as host selection and intelligent placement. Automation continues throughout the service management lifecycle to include business processes such as the onboarding of new applications or users to the platform and routine management and updates of the platform.
As mentioned earlier in this article, the automation capability is key to enabling the platform to deliver on cloud computing characteristics. In this section we’ll look at each layer of the reference architecture and its components and define where automation influences component operation.
The Service Delivery layer is the interface between business and IT. It serves as the conduit for translating business requirements into IT services and is responsible for managing ongoing delivery of those services. Service Delivery contains several components that are directly integrated into, have a dependency upon or influence private cloud platform automation. Each component is listed here. For more information about the component refer to the Private Cloud Reference Model.
Financial Management incorporates the functions and processes used to meet a service provider’s budgeting, accounting, metering, and charging requirements. While the primary concern of Financial Management is providing cost transparency, the overall cost of operating infrastructure components is directly tied to how efficiently these components are operated throughout the lifecycle of the infrastructure. Financial Management must provide data artifacts about facility operating costs, including power, cooling and other environmental costs that affect the total operating cost of the infrastructure. This information may also identify the peak periods when costs are at their highest, so that automation can take this data into account when determining task start times and which areas of the infrastructure can be taken offline and powered down.
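As an illustration of how automation might take peak-period cost data into account when choosing a task start time, the following is a minimal sketch. The `CostWindow` type, the `cheapest_start_hour` function and the cost figures are hypothetical and are not part of any Financial Management tooling; the sketch assumes cost data is supplied as hour ranges covering a full day.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostWindow:
    """A range of hours in the day and its relative energy cost (hypothetical unit)."""
    start_hour: int      # inclusive, 0-23
    end_hour: int        # exclusive, 1-24
    cost_per_hour: float


def cheapest_start_hour(windows, duration_hours):
    """Return the start hour (0-23) that minimizes total cost for the task."""
    # Expand the windows into a per-hour cost table.
    hourly = [None] * 24
    for window in windows:
        for hour in range(window.start_hour, window.end_hour):
            hourly[hour] = window.cost_per_hour
    if any(cost is None for cost in hourly):
        raise ValueError("cost windows must cover all 24 hours")

    def total_cost(start):
        # Tasks may wrap past midnight, so index modulo 24.
        return sum(hourly[(start + i) % 24] for i in range(duration_hours))

    # min() returns the earliest hour when several hours tie.
    return min(range(24), key=total_cost)


# Example cost data: cheap overnight, expensive during the business day.
windows = [
    CostWindow(0, 6, 0.08),
    CostWindow(6, 18, 0.20),
    CostWindow(18, 24, 0.12),
]
start = cheapest_start_hour(windows, duration_hours=4)
```

The same cost table could equally drive decisions about which areas of the infrastructure to power down during peak periods.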
Demand Management involves understanding and influencing customer demands for services, plus the provision of capacity to meet these demands. The process of understanding service demand is a business intelligence function that allows the organization to model the characteristics of service needs and project them onto the Infrastructure as a Service platform, identifying new areas of automation that must be built to meet demand. Demand Management initiates automation to respond to business demand for resources while ensuring that the necessary resiliency exists to meet demand in the presence of resource decay.
Service Catalog Management is influenced by Business Relationship Management and takes the end-to-end view of services offered by the platform. Automation includes activities that continually update the status or state of the service catalog and its attributes. Service Life Cycle Management includes continued improvement of a service, which involves the review and evolution of the automation capabilities included in the service.
In delivering a platform capability, the provider designs and builds capability that allows it to meet its published Service Level Agreement (SLA). A deciding factor in determining an SLA is how predictably the service can be established and operated, both in the steady state and during periods of degraded state. Automation is used not only to establish a service but also to react to failure conditions during periods of diminished capability. This automated response to failure is referred to as remediation, and it is always triggered as the result of an incident.
Availability Management defines the processes necessary to achieve the perception of continuous availability. Those processes will always include the use of automated procedures throughout infrastructure fabric management to create redundancy where needed and a resilient set of resources that maintains the published SLA availability.
Capacity Management defines the processes necessary to achieve the perception of infinite capacity. Capacity must be managed to meet existing and future peak demand while controlling under-utilization. Capacity Management is closely related to Demand Management, and the same resource automation is leveraged to provision and collapse capacity as demand changes.
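The decision to provision or collapse capacity as demand changes can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the notion of abstract "demand units" per host and the 25% reserve are hypothetical, not values taken from the reference architecture.

```python
import math


def hosts_needed(peak_demand_units, units_per_host, reserve_fraction=0.25):
    """Hosts required to serve peak demand plus a resiliency reserve."""
    if units_per_host <= 0:
        raise ValueError("units_per_host must be positive")
    required = peak_demand_units * (1 + reserve_fraction)
    # Always keep at least one host so the service can be reached.
    return max(1, math.ceil(required / units_per_host))


def scaling_action(current_hosts, peak_demand_units, units_per_host,
                   reserve_fraction=0.25):
    """Return ("provision" | "collapse" | "hold", host_count) for the pool."""
    target = hosts_needed(peak_demand_units, units_per_host, reserve_fraction)
    delta = target - current_hosts
    if delta > 0:
        return ("provision", delta)   # demand exceeds capacity: add hosts
    if delta < 0:
        return ("collapse", -delta)   # under-utilized: release hosts
    return ("hold", 0)
```

For example, `scaling_action(2, 100, 32)` asks for two more hosts, because 100 units plus a 25% reserve needs four hosts of 32 units each.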
Information Security Management strives to make sure that all requirements are met for confidentiality, integrity, and availability of the organization’s assets, information, data, and services. To meet multi-tenancy requirements, infrastructure automation must take into account the security attributes assigned to resources during task operations, so that information security issues do not occur during the management of pooled resources.
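One way automation might honor security attributes on pooled resources is to filter candidate hosts before placement. The sketch below assumes two hypothetical attributes, a `security_zone` label and an optional `dedicated_tenant`; real platforms will carry different attributes and richer policy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Host:
    name: str
    security_zone: str                       # hypothetical zone label
    dedicated_tenant: Optional[str] = None   # None means the host is shared


def eligible_hosts(hosts, tenant, required_zone):
    """Return names of hosts a tenant's workload may be placed on."""
    return [
        host.name
        for host in hosts
        # The host must be in the required zone and either shared
        # or dedicated to this same tenant.
        if host.security_zone == required_zone
        and host.dedicated_tenant in (None, tenant)
    ]


pool = [
    Host("h1", "pci", "tenant-a"),
    Host("h2", "pci"),
    Host("h3", "general"),
    Host("h4", "pci", "tenant-b"),
]
```

With this pool, a workload for `tenant-a` in the `pci` zone may land on `h1` or `h2`, but never on the host dedicated to `tenant-b`.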
The Infrastructure Layer provides hypervisor services (VM resources) to the Platform and Software Layers. It defines the capabilities necessary for these VMs to execute; it includes hypervisor, physical servers, network devices, storage systems, and facilities (which include space, power, cooling, and physical interconnects). The Infrastructure Layer includes the physical hardware from many vendors. This creates the opportunity for many different types of automation technologies that may need to interact with others present in this layer. The automation that exists in the layer will directly affect facility and hardware configuration while updating state in the service and configuration data stores.
The facilities component is quite broad and contains many industrial control interfaces for monitoring and operation of power, cooling, airflow and other environmental concerns. The component also includes the interfaces that operate and monitor hardware racks in the datacenter. Facilities also include the core communication capabilities and interconnects between devices in the datacenter. Automation in facility equipment may be specific to a component or hardware vendor and may communicate only the resulting state of the component. For example, the failure of an air handling unit may cause the unit to shut down and trigger an alarm that must be acted upon by a different set of automation.
Compute components include the physical servers used to host physical application workloads or virtual machine hosts. They include all the device components within the server that can be operated and monitored externally. Compute component automation includes the bare-metal provisioning of server hardware up to the point where the server is configured into the private cloud fabric management to assume the role of physical application host or virtual machine host.
Storage components represent physical storage devices that present units of storage consistent with the architecture of the component. They expose proprietary management interfaces, industry-standard (SMI-S) management interfaces, or both, allowing the discovery and provisioning of the storage capability provided by the component. Automation at the component level will include:
Network services provide addressing and packet delivery for the provider’s physical infrastructure and the consumer’s VMs. Network capability includes physical and virtual network switches, routers, firewalls, and Virtual Local Area Networks (VLANs). Network automation is provided by proprietary management interfaces, although there are emerging industry-wide standards efforts that the architect should monitor.
The Operations Layer defines the operational processes and procedures necessary to deliver IT as a Service. The main focus of the Service Operations Layer is to define the business requirements of the organization. Cloud-like service attributes cannot be achieved through technology alone; mature IT service management is also required.
Change Management is responsible for controlling the life cycle of all changes. Its primary objective is to implement beneficial changes with minimum disruption to the perception of continuous availability. Changes are developed in a non-production environment that mirrors production, so that development and testing occur in the environment most likely to show the same results as production. Changes receive final testing in a staging environment. Changes are then implemented through the fabric management automation, which makes automation a contributing technology that raises the overall maturity of the IT service management capability. Automation is software development and must come under the organization’s Software Development Lifecycle (SDL) practices. The automation itself is therefore under change control and subject to the same change control processes as the rest of the IT organization.
Automation is a consumer of configuration management and influences service assets. The automation components of a private cloud are designed and authored to consume declarative data held in a configuration store. Avoiding the hardcoding of configuration data allows automation reuse in private cloud fabric management scenarios simply by updating configuration held in the configuration data stores. Reuse results in less development and testing effort of the fabric management automation. Over time this has the effect of increased predictability and agility of fabric management operations in the private cloud.
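The idea of automation that consumes declarative data rather than hardcoded values can be sketched as follows. The JSON document stands in for a configuration store, and the role names, fields and `build_provisioning_plan` function are hypothetical, chosen only for illustration.

```python
import json

# Stand-in for data held in a configuration store (hypothetical schema).
CONFIG_STORE_JSON = """
{
  "web-tier": {"vm_count": 3, "cpu_cores": 2, "memory_gb": 8,  "network": "vlan-110"},
  "db-tier":  {"vm_count": 2, "cpu_cores": 8, "memory_gb": 64, "network": "vlan-120"}
}
"""


def build_provisioning_plan(config_json, role):
    """Expand a declarative role definition into one request per VM."""
    config = json.loads(config_json)
    if role not in config:
        raise KeyError(f"no configuration found for role {role!r}")
    spec = config[role]
    # The automation holds no sizes or network names of its own:
    # every value comes from the configuration data.
    return [
        {
            "name": f"{role}-{index:02d}",
            "cpu_cores": spec["cpu_cores"],
            "memory_gb": spec["memory_gb"],
            "network": spec["network"],
        }
        for index in range(1, spec["vm_count"] + 1)
    ]


web_plan = build_provisioning_plan(CONFIG_STORE_JSON, "web-tier")
```

Because the function contains no role-specific values, supporting a new scenario means adding an entry to the configuration data, not changing and retesting the automation.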
Instantiation or upgrade of a release is accomplished by transitioning the release from staging to production through the use of automation that has been previously tested on the staging environment. Automation is used by fabric management to perform updates of the private cloud infrastructure by defining a resource upgrade domain and performing the update in a predictable and repeatable manner. Updates continue for each upgrade domain until completed and the appropriate configuration and service management records are updated to reflect the change.
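Updating one upgrade domain at a time, as described above, can be sketched like this. The function names and the callback-based design are assumptions made for the example; `apply_update` and `record_change` stand in for whatever the fabric management tooling and configuration stores actually provide.

```python
from collections import defaultdict


def group_by_upgrade_domain(resources):
    """Group (resource_name, domain_id) pairs by domain, in domain order."""
    domains = defaultdict(list)
    for name, domain_id in resources:
        domains[domain_id].append(name)
    return dict(sorted(domains.items()))


def rolling_update(resources, apply_update, record_change):
    """Update one upgrade domain at a time.

    Returns (completed_domains, failed_domain). failed_domain is None
    when every domain was updated successfully.
    """
    completed = []
    for domain_id, members in group_by_upgrade_domain(resources).items():
        for name in members:
            if not apply_update(name):
                # Stop here so the untouched domains keep serving workloads.
                return completed, domain_id
        # Only record the change once the whole domain is updated.
        record_change(domain_id, members)
        completed.append(domain_id)
    return completed, None
```

Halting at the first failed domain is one possible policy; it limits the impact of a bad update to a single domain while the remaining domains continue to run the previous release.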
Knowledge Management is responsible for sharing and storing information in the enterprise. Automation plays both a direct and an indirect role in knowledge management. The notification processes used within an enterprise are driven by configuration data and automation. This automation usually implements a decision tree to select notification levels based on the severity or type of event being raised. The fabric management automation is responsible for updating the service and configuration data stores when management operations are performed and this has the indirect effect of sharing information in the organization.
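A notification decision tree driven by configuration data might look like the following minimal sketch. The severity names, event types and notification levels are hypothetical examples, not values defined by the reference architecture.

```python
# Hypothetical rules: first branch on severity, then on event type.
NOTIFICATION_RULES = {
    "critical": {"default": "page-on-call", "security": "page-security-team"},
    "warning": {"default": "email-operations"},
    "info": {"default": "log-only"},
}


def select_notification(severity, event_type, rules=NOTIFICATION_RULES):
    """Walk the decision tree: severity first, then event type."""
    branch = rules.get(severity)
    if branch is None:
        # Unknown severities are recorded but do not notify anyone.
        return "log-only"
    # Fall back to the severity's default when the type has no rule.
    return branch.get(event_type, branch["default"])
```

Because the tree is data, changing who is notified for a given severity is a configuration update that does not alter the automation itself.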
Incident management benefits from automation through well-defined incident management processes that staff use to record and process an incident. Fabric Management uses automation to initially triage or remediate issues systematically. When automated remediation is not possible, the data collection artifacts gathered by automation aid problem resolution.
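The triage flow described above, attempting remediation first and falling back to data collection, can be sketched as follows. The incident fields, the `remediations` mapping and the `collect_diagnostics` callback are hypothetical stand-ins for platform-specific tooling.

```python
def handle_incident(incident, remediations, collect_diagnostics):
    """Try automated remediation; otherwise escalate with diagnostics."""
    action = remediations.get(incident["type"])
    if action is not None and action(incident):
        return {"status": "remediated", "incident": incident["id"]}
    # No remediation exists, or it failed: gather data for the staff
    # who will work the incident.
    return {
        "status": "escalated",
        "incident": incident["id"],
        "diagnostics": collect_diagnostics(incident),
    }


# Hypothetical remediation table: incident type -> callable returning success.
remediations = {
    "service-stopped": lambda incident: True,   # e.g. restart succeeded
    "disk-full": lambda incident: False,        # e.g. cleanup freed too little
}


def collect_diagnostics(incident):
    """Hypothetical collector; a real one would gather logs and metrics."""
    return {"resource": incident["resource"], "type": incident["type"]}
```

An incident of an unknown type follows the same escalation path as a failed remediation, so staff always receive the collected artifacts.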
Requests for fulfillment of IT operations are the result of users making requests of the platform or through systematic activities requiring a resource change. These requests trigger automation to allocate, provision or change resources on behalf of the user or process making the request.
Automation is used by fabric management to configure access control on resources that have been provisioned on behalf of a user or process. A service hosted on the private cloud infrastructure contains many access control boundaries that must be configured for appropriate access before the service is made available to the consumer. Automation of access control requests increases service predictability.
Systems management involves encoding common tasks that are performed often or on a scheduled basis. IT staff have long encoded common tasks into scripts that are reused as needed. This is a form of automation that carries forward to the private cloud. This same automation may be integrated into fabric management to perform common or remedial tasks on the infrastructure. In the case of the private cloud this system management automation is subject to change control.
The Management Layer contains the tooling capabilities required to execute and implement the Service Operations and Service Delivery processes and procedures that support IaaS, PaaS, and SaaS. These capabilities are incremental moving up through the Infrastructure, Platform and Software Layers.
Service Reporting in a private cloud can be a complex operation, since the number of instances reported on may be quite large. Automation may be used to gather event and performance information from instances and to correlate that data to a service and tenant. This level of automation is generally provided by the private cloud platform monitoring tools, which handle the details of collecting and correlating resource instance data. Once data has been captured and correlated, the architect must define the appropriate thresholds for triggering Service Management operations.
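Correlating per-instance data to a service and tenant, and then applying a threshold, can be sketched as below. In practice the platform monitoring tools perform this work; the function names, the instance index and the sample values here are hypothetical.

```python
from collections import defaultdict


def correlate(samples, instance_index):
    """Average a metric per (service, tenant) from per-instance samples."""
    buckets = defaultdict(list)
    for instance_id, value in samples:
        owner = instance_index.get(instance_id)
        if owner is None:
            continue  # instance is not known to the configuration data
        buckets[owner].append(value)
    return {owner: sum(values) / len(values) for owner, values in buckets.items()}


def breaches(averages, threshold):
    """Return the (service, tenant) pairs whose average exceeds the threshold."""
    return sorted(owner for owner, average in averages.items() if average > threshold)


# Hypothetical mapping of instances to the service and tenant they belong to.
instance_index = {
    "vm1": ("billing", "tenant-a"),
    "vm2": ("billing", "tenant-a"),
    "vm3": ("crm", "tenant-b"),
}
samples = [("vm1", 90), ("vm2", 70), ("vm3", 40), ("vm9", 99)]
averages = correlate(samples, instance_index)
```

Here `breaches(averages, 75)` reports only the billing service for `tenant-a`, which is the kind of result that could trigger a Service Management operation.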
The configuration management system is a critical component of the private cloud, responsible for collecting, maintaining and exposing configuration data to all layers of the private cloud. Automation of management operations in each respective area of the private cloud causes configuration items for a resource to be created or updated. Automation continually uses configuration management data to define, create or update services on the platform. Over time, a resource’s configuration may drift from the desired state; this triggers remediation automation to correct the condition and update the appropriate data stores.
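Detecting and correcting a configuration that has fallen out of the desired state can be sketched as follows. The setting names and the `apply_setting` callback are hypothetical; a real implementation would read the desired state from the configuration management system and write corrections through the platform's own interfaces.

```python
def detect_drift(desired, actual):
    """Return {setting: (desired_value, actual_value)} for each mismatch."""
    return {
        key: (want, actual.get(key))
        for key, want in desired.items()
        if actual.get(key) != want
    }


def remediate(desired, actual, apply_setting):
    """Re-apply every drifted setting and return the corrected setting names."""
    drift = detect_drift(desired, actual)
    for key, (want, _have) in drift.items():
        apply_setting(key, want)
    return sorted(drift)


desired = {"ntp_server": "10.0.0.1", "firewall": "enabled", "patch_level": 42}
actual = {"ntp_server": "10.0.0.1", "firewall": "disabled"}
drift = detect_drift(desired, actual)
```

In this example the firewall setting has changed and the patch level is missing, so both are reported as drift and would be re-applied.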
Fabric Management is the toolset responsible for managing workloads of virtual hosts, virtual networks, and storage. Fabric Management provides the automation necessary to manage the life cycle of a consumer’s workload.