Mission-critical apps in the cloud: Ensuring reliability

According to a recent study, 94 percent of enterprises use cloud services, and 67 percent of enterprise infrastructure is now cloud-based. In the race to realize the benefits of cloud computing, mission-critical applications pose unique complications: many involve large on-premises footprints and legacy infrastructure that is difficult to map to cloud architectures. In addition, the growing popularity of microservices has increased the complexity of deployment and management, particularly for critical applications.

I’ll begin this discussion with a quick review of the “gold standard” for building reliable cloud-based systems: the AWS Well-Architected Framework. I’ll then go over key considerations for mission-critical applications in the cloud, followed by eight vital elements that make up the “how” behind the “what” of managing these apps securely, reliably, and efficiently.

Using the AWS Well-Architected Framework as a starting point

Developed by Amazon Web Services (AWS), the Well-Architected Framework is a set of guidelines that helps teams design and operate reliable, secure, efficient, and cost-effective systems in the cloud. The framework revolves around six “pillars”: operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability.

In the Well-Architected Framework, architects, developers, and operations teams find a structured approach and best practices to guide them through the process of creating a robust cloud infrastructure. By addressing key architectural principles, it helps organizations build a firm foundation for cloud applications while supporting identification of potential risks and weaknesses early in the design process.

I recommend using the AWS Well-Architected Tool to aid in designing your workloads and adhering to best practices. Choose an architecture that aligns with your specific requirements—for example, a cell-based architecture that confines failures to a limited number of components.
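
To make the cell-based idea concrete, here is a minimal sketch of how requests might be pinned to cells so that a failure in one cell touches only the customers assigned to it. The cell names and the `cell_for` and `affected_customers` helpers are hypothetical, and hashing the customer ID is only one possible assignment strategy, not something the framework prescribes.

```python
import hashlib

# Hypothetical cell names; a real system would load these from configuration.
CELLS = ["cell-a", "cell-b", "cell-c", "cell-d"]


def cell_for(customer_id: str, cells: list[str] = CELLS) -> str:
    """Map a customer to exactly one cell using a stable hash of the ID."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


def affected_customers(failed_cell: str, customer_ids: list[str]) -> list[str]:
    """Return only the customers pinned to the failed cell."""
    return [c for c in customer_ids if cell_for(c) == failed_cell]


if __name__ == "__main__":
    customers = [f"customer-{n}" for n in range(1000)]
    impacted = affected_customers("cell-a", customers)
    print(f"{len(impacted)} of {len(customers)} customers sit in cell-a")
```

Because the mapping is deterministic, the same customer always lands in the same cell, which is what keeps an outage contained. Note that changing the number of cells in this simple modulo scheme would reassign many customers.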

While the Well-Architected Framework is a crucial step, it is not an endpoint in itself. It provides an effective foundation, but to ensure the long-term efficiency, security, and cost-effectiveness of critical apps in the cloud, organizations must complement it with consistent monitoring, proactive management, and tactical measures.

Key considerations for migrating and managing critical apps in the cloud

As your team moves toward migrating your mission-critical application to the cloud, keep the following considerations in mind, both today and in the future:

Zero downtime: The application must maintain uninterrupted availability and accessibility, with a focus on resilience and minimizing downtime, regardless of cost considerations. Employ deployment strategies that do not impact users and that enhance the application’s uptime and availability once it’s deployed in the cloud.
Manage performance of the application: Make sure the cloud-based application has the scalability needed to meet evolving business demands and the desired SLAs.
Choose the right cloud service to match your SLA: Confirm that the chosen cloud service delivers the uptime, latency, bandwidth, and throughput guarantees needed to match your SLA.
Have the right deployment and testing strategies in place: For example, have an effective rollback mechanism in place to handle failed deployments reliably.
Develop and implement cost-management policies: Cloud costs can quickly spin out of control if policies are not in place to offer guidance to the application’s users.
Expect the unexpected: Have a disaster recovery plan and test it on a regular basis.
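
As an illustration of the rollback consideration above, the following sketch activates a new version, runs a health check, and reverts if the check fails. The `deploy_with_rollback` function and its `activate` and `is_healthy` callbacks are hypothetical stand-ins for whatever your deployment tooling provides.

```python
from typing import Callable


def deploy_with_rollback(
    current_version: str,
    new_version: str,
    activate: Callable[[str], None],
    is_healthy: Callable[[], bool],
) -> str:
    """Activate new_version and revert to current_version if it is unhealthy.

    Returns the version that is live when the function finishes.
    """
    activate(new_version)
    try:
        healthy = is_healthy()
    except Exception:
        # Treat a crashing health check the same as a failed one.
        healthy = False
    if healthy:
        return new_version
    activate(current_version)
    return current_version


if __name__ == "__main__":
    history: list[str] = []
    live = deploy_with_rollback("1.0", "1.1", history.append, lambda: False)
    print(f"live version: {live}, activations: {history}")
```

In practice the health check would probe real endpoints over a warm-up period, and the activation step would shift traffic gradually rather than all at once.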

The “how” behind the “what”: 8 tactical elements for success

In my work with clients on migrating and running mission-critical applications in the cloud, I’ve identified eight essential elements that go beyond the Well-Architected Framework to help ensure both short- and long-term reliability, efficiency, scalability, and cost-effectiveness.

1. Cloud-agnostic infrastructure as code

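Using cloud-agnostic infrastructure as code (IaC) tools such as Terraform enables teams to use a single language and workflow to target any supported cloud provider, offering the flexibility to spin up in AWS, Azure, Google Cloud Platform, or another environment. Keep in mind that resource definitions remain provider-specific, so the portability comes from shared tooling and structure rather than from one script that runs unchanged everywhere.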

Cloud-agnostic IaC allows diversification of the infrastructure across multiple providers, reducing the risk of a single point of failure. Development teams can also adapt more easily to changing requirements, making adjustments to the infrastructure code and deploying it to the cloud environments that best suit their needs.

2. Build automation

Tools such as Apache Maven and Gradle automate the creation of software builds and the associated processes, including:

Compiling computer source code into binary code
Packaging binary code
Running unit tests

Build automation lets teams define a build process in code and execute it consistently and repeatably across multiple environments, which makes migration of mission-critical applications to the cloud faster and more efficient. Teams can also reduce the risk of errors in the migration process by automating the tasks that are most prone to human mistakes.

3. CI/CD platforms

Continuous integration and continuous deployment (CI/CD) platforms automate the building, testing, and deployment of application code and infrastructure configurations. CI/CD pipelines typically include the following steps:

Code commit: Developers commit their code changes to a shared repository.
Build: The code is automatically built and tested.
Deployment: The code is automatically deployed to a production environment.

Teams can use CI/CD platforms to deploy code changes to production in a safe and controlled manner, reducing the chance of downtime and disruptions, which is especially important when the application is critical to business operations. CI/CD enables faster delivery of new features and bug fixes, and it can automate security checks and tests so that vulnerabilities are more likely to be caught before code reaches production. Popular tools in this area include Jenkins and AWS CodeBuild/CodePipeline.
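
The pipeline steps listed earlier can be sketched as an ordered list of stages in which a failure stops everything that follows, so a broken build never reaches deployment. This is a simplified illustration, not how Jenkins or CodePipeline is implemented; `run_pipeline` and the stage names are hypothetical.

```python
from typing import Callable

# A stage is a name plus a step that returns True on success.
Stage = tuple[str, Callable[[], bool]]


def run_pipeline(stages: list[Stage]) -> list[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed: list[str] = []
    for name, step in stages:
        if not step():
            print(f"stage '{name}' failed; later stages skipped")
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    stages: list[Stage] = [
        ("build", lambda: True),
        ("test", lambda: False),  # simulate a failing test suite
        ("deploy", lambda: True),
    ]
    print(run_pipeline(stages))
```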

4. Code quality tools

Ensuring code quality is essential for any application. Tools such as SonarQube enable development teams to set a threshold for code quality and will automatically reject a code merge if it fails to meet the threshold. Teams can identify and address code defects early in the development process, before they can lead to security breaches, compliance violations, and other adverse outcomes. Code quality tools also help teams collaborate more effectively by providing a shared view of code quality.
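
To show what a quality threshold might look like in practice, here is a small sketch of a gate that compares a report against configured limits and blocks the merge on any violation. The `QualityReport` and `QualityGate` types and the threshold values are illustrative assumptions; they are not SonarQube's API or its defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityReport:
    coverage_percent: float
    critical_issues: int
    duplicated_lines_percent: float


@dataclass(frozen=True)
class QualityGate:
    # Illustrative thresholds only; tune these to your team's standards.
    min_coverage_percent: float = 80.0
    max_critical_issues: int = 0
    max_duplicated_lines_percent: float = 3.0

    def violations(self, report: QualityReport) -> list[str]:
        """List every threshold the report fails to meet."""
        problems: list[str] = []
        if report.coverage_percent < self.min_coverage_percent:
            problems.append("coverage below minimum")
        if report.critical_issues > self.max_critical_issues:
            problems.append("too many critical issues")
        if report.duplicated_lines_percent > self.max_duplicated_lines_percent:
            problems.append("too much duplicated code")
        return problems


def merge_allowed(report: QualityReport, gate: QualityGate = QualityGate()) -> bool:
    """A merge is allowed only when the gate reports no violations."""
    return not gate.violations(report)
```

Returning the full list of violations, rather than a bare pass or fail, gives developers the shared view of code quality mentioned above.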

5. Configuration management

We recently worked with a telecom client on a self-service capability for customers to add an extra phone line. This feature involved 18 or more different APIs, each with multiple versions. Configuration management enabled us to track which version of which API was deployed in which location, simplifying our task considerably.

Ansible, Chef, Puppet, and other configuration management tools help businesses ensure that all servers in a cloud environment are configured consistently and make it easy to repeat deployments and configuration changes across multiple servers. Teams can also track changes to server configurations over time, facilitating the task of auditing configuration changes and identifying potential problems.
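
The core idea of tracking which version is deployed where can be sketched as a comparison between a desired state and what each location reports. The `find_drift` function, the API names, and the locations below are hypothetical; tools such as Ansible, Chef, and Puppet do far more than this.

```python
from typing import Optional

# (location, api, expected_version, actual_version or None if missing)
Drift = tuple[str, str, str, Optional[str]]


def find_drift(
    desired: dict[str, str],
    deployed: dict[str, dict[str, str]],
) -> list[Drift]:
    """Compare the desired API versions against each location's actual state."""
    drift: list[Drift] = []
    for location in sorted(deployed):
        actual_versions = deployed[location]
        for api in sorted(desired):
            actual = actual_versions.get(api)
            if actual != desired[api]:
                drift.append((location, api, desired[api], actual))
    return drift


if __name__ == "__main__":
    desired = {"billing": "2.1", "orders": "1.4"}
    deployed = {
        "us-east": {"billing": "2.1", "orders": "1.4"},
        "eu-west": {"billing": "2.0"},  # outdated billing, missing orders
    }
    for entry in find_drift(desired, deployed):
        print(entry)
```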

6. Logging, monitoring, and alert platforms

Tools such as Splunk and Sumo Logic enable continuous monitoring of critical application performance and infrastructure health, delivering alerts when anomalies are detected. These capabilities help teams identify and resolve problems, along with their root causes, before they cause outages or disrupt services. Organizations can also provide evidence that they are monitoring cloud environments and addressing problems, which can facilitate compliance with security regulations and other reporting requirements.
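
As a rough illustration of anomaly-based alerting, the sketch below flags a metric reading that sits too many standard deviations away from recent history. The `is_anomalous` function and the threshold of 3.0 are assumptions made for the example; commercial platforms use considerably more sophisticated detection.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag the latest reading if it is more than `threshold` standard
    deviations away from the mean of the recent history."""
    if len(history) < 2:
        # Not enough data to estimate a spread.
        return False
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is unusual.
        return latest != history[0]
    return abs(latest - mean(history)) / spread > threshold


if __name__ == "__main__":
    response_times_ms = [100.0, 102.0, 98.0, 101.0, 99.0]
    for reading in (101.0, 150.0):
        status = "ALERT" if is_anomalous(response_times_ms, reading) else "ok"
        print(f"{reading} ms: {status}")
```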

7. Observability platforms

Observability platforms collect and analyze data from a broad range of sources, including metrics, logs, and traces, to help businesses measure and understand the state of a critical application. They can also analyze logs to detect anomalies.

These capabilities provide teams with a comprehensive view of their cloud environment, including the performance, health, and behavior of applications, infrastructure, and services. Businesses also gain real-time insights into the cloud environment, allowing them to respond to problems quickly, minimize the impact on their customers, and prevent errors from recurring. Commonly used platforms for this purpose include Grafana, Dynatrace, and AppDynamics.

8. Performance and disaster recovery testing

As their names imply, performance testing gauges the performance of a critical application under a variety of load conditions, while disaster recovery testing assesses how quickly the business can bring the application back online in the event of a disaster. Businesses can rely on testing platforms such as SOASTA CloudTest, LoadStorm, and BlazeMeter to verify that critical applications continue to meet requirements for performance and recoverability. The result is improved reliability and a reduced risk of outages, which benefits the business in areas such as customer satisfaction and compliance.
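
One common way to turn load-test output into a pass or fail decision is to compare a latency percentile against a target. The sketch below uses the nearest-rank method; the `percentile` and `meets_latency_target` helpers and the 95th-percentile target are illustrative assumptions, not features of the platforms named above.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def meets_latency_target(latencies_ms: list[float], p95_target_ms: float) -> bool:
    """Check whether 95 percent of requests finished within the target."""
    return percentile(latencies_ms, 95) <= p95_target_ms


if __name__ == "__main__":
    latencies = [float(n) for n in range(1, 101)]  # 1 ms .. 100 ms
    print("p95:", percentile(latencies, 95))
    print("meets 120 ms target:", meets_latency_target(latencies, 120.0))
```

Judging by a percentile rather than the average keeps a handful of very slow requests from hiding behind many fast ones.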

Unlocking success for critical apps in the cloud

As cloud services become more stable, reliable, and cost-effective, keeping mission-critical applications in on-premises architectures is becoming less and less viable. By using the pillars of the Well-Architected Framework as an overall guide and incorporating the eight tactical elements outlined above, organizations can enjoy the benefits of cloud computing for critical apps while maintaining control over data quality, reliability, and security.

Vinod Perla is an Architect in Logic20/20’s Advanced Analytics practice.