Online consumers have a new standard experience when it comes to content: personalized, fast, continuous, and unlimited. Competitive companies in today’s market cater to this standard and invest in the tools required to maintain it. Unfortunately, like any system at scale, the infrastructure behind this experience is susceptible to outages and failure. Cloud computing best practices can reduce the likelihood and severity of these outages, increasing reliability and customer trust.
In this article, I’ll walk through best practices for making your cloud workloads more reliable.
Cloud reliability best practices
The AWS Well-Architected Review enables cloud architects and development leads to evaluate and redesign systems intentionally, with awareness of the framework’s key areas, including operational excellence, security, reliability, performance efficiency, and cost optimization.
According to AWS, you can use a few design strategies to improve cloud reliability. These best practices fall into three main categories: foundation, change management, and failure management. Automation is a key theme across these practices. Let’s take a closer look.
Foundation
1. Increase and diversify your resources. The impact of an outage can be severe. When your system unexpectedly goes down, you can lose revenue or suffer damage to your company’s reputation. Increasing and diversifying your cloud resources is a good way to combat this, since it allows you to keep some parts of the system running even if others break. Microservices are a great example of this concept: because each service runs independently, a failure in one does not have to take down the rest. The same applies to the data layer, where Amazon offers managed database services such as RDS, Aurora, and DynamoDB that can replicate data across Availability Zones.
2. Know your capacity. If your system is asked to do too much at once, there’s a higher likelihood it will break. With cloud services, you have more control over resource allocation. Whether your solution is entirely cloud-based or a hybrid, it’s important to know how much it can handle. AWS has service limits, so you’ll have to pay attention to those and request increases as needed. You can use AWS Config to track how your resources are configured and how they change over time. You can also automatically scale compute and storage, or use EC2 Spot Instances to minimize cost.
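To make the idea of watching service limits concrete, here is a minimal Python sketch of a headroom check. It does not call any AWS API: the QuotaUsage type, the resource names, the numbers, and the 80% threshold are all hypothetical placeholders for data you would pull from your own account.

```python
from dataclasses import dataclass


@dataclass
class QuotaUsage:
    """Usage of one resource against its service limit (hypothetical data)."""
    name: str
    used: float
    limit: float

    @property
    def utilization(self) -> float:
        # Fraction of the limit already consumed.
        return self.used / self.limit


def near_limit(quotas, threshold=0.8):
    """Return the quotas whose utilization meets or exceeds the threshold."""
    return [q for q in quotas if q.utilization >= threshold]


# Illustrative numbers only; real values would come from your own account.
quotas = [
    QuotaUsage("running-instances", used=45, limit=50),
    QuotaUsage("elastic-ips", used=2, limit=5),
    QuotaUsage("load-balancers", used=16, limit=20),
]

for q in near_limit(quotas):
    print(f"{q.name}: {q.utilization:.0%} of limit, consider requesting an increase")
```

Running a check like this on a schedule is one way to notice that you are approaching a limit before a scaling event hits it.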
Change management
3. Manage change well. Everyday variables, such as traffic spikes, new deployments, and configuration updates, can affect your systems, and it’s a good idea to think about possible changes before they happen. Having a plan in place is essential to being flexible and dynamic when it’s most important. This is where automation can be very helpful.
With AWS, you can automate change management using tools such as Auto Scaling and Elastic Load Balancing. Services like AWS CodePipeline let you automate the phases of code deployment, freeing up your time and attention for more important tasks. You may also want to consider a serverless architecture using a service like AWS Lambda.
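As a rough illustration of how automated scaling responds to changing demand, the Python sketch below applies a simple proportional rule: grow or shrink the fleet so the average metric moves toward a target, then clamp the result to a minimum and maximum size. This is a simplified model written for this article, not the exact algorithm any AWS service uses, and the function name and parameters are assumptions.

```python
import math


def desired_capacity(current, metric_value, target_value, min_size, max_size):
    """Proportional scaling rule: size the fleet so the average metric
    moves toward the target, then clamp to the allowed range."""
    if current <= 0 or target_value <= 0:
        raise ValueError("current and target_value must be positive")
    raw = math.ceil(current * metric_value / target_value)
    return max(min_size, min(max_size, raw))


# Four instances at 90% average CPU with a 60% target: scale out.
print(desired_capacity(4, 90, 60, min_size=2, max_size=10))  # prints 6

# Four instances at 20% average CPU: scale in, but not below the minimum.
print(desired_capacity(4, 20, 60, min_size=2, max_size=10))  # prints 2

# Eight instances at 95% with a 50% target: capped at the maximum.
print(desired_capacity(8, 95, 50, min_size=2, max_size=10))  # prints 10
```

The minimum and maximum bounds matter as much as the rule itself: they keep a misbehaving metric from scaling the fleet to zero or running up an unbounded bill.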
Failure management
4. Know your recovery procedure. What will you do when faced with a failure? It’s important to know your anticipated recovery time for different events. You should have a strategy for each type of event and test these strategies often. Use automated services where possible, and give each event an automated response so that action is taken even without human involvement.
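One way to picture "a strategy for each type of event" is a small runbook that maps each event type to an automated response and an anticipated recovery time. The Python sketch below is purely illustrative: the event names, handler functions, and recovery targets are hypothetical, and the handlers only return a description rather than touching real infrastructure.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class RecoveryPlan:
    action: Callable[[str], str]  # automated response for the affected resource
    target_minutes: int           # anticipated recovery time for this event type


def restart_service(resource: str) -> str:
    return f"restarted {resource}"


def fail_over_database(resource: str) -> str:
    return f"promoted standby for {resource}"


def page_on_call(resource: str) -> str:
    return f"paged on-call engineer about {resource}"


# Hypothetical event types and targets; yours will differ.
RUNBOOK: Dict[str, RecoveryPlan] = {
    "service_crash": RecoveryPlan(restart_service, target_minutes=5),
    "database_down": RecoveryPlan(fail_over_database, target_minutes=15),
}

# Events with no specific plan still trigger an automated response.
FALLBACK = RecoveryPlan(page_on_call, target_minutes=60)


def respond(event_type: str, resource: str) -> str:
    """Look up the plan for an event and run its automated action."""
    plan = RUNBOOK.get(event_type, FALLBACK)
    outcome = plan.action(resource)
    return f"{outcome} (target: {plan.target_minutes} min)"


print(respond("service_crash", "checkout-api"))
print(respond("disk_full", "log-server"))
```

Keeping the plans in one table also makes them easy to test: a recovery drill can walk through every entry and confirm that each action still works within its target.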
As you continue to improve your cloud workload reliability, you increase your ability to prepare for and respond to the changing needs of your customers. Maintaining a scalable and dynamic collection of services is the surest way to stay ahead. Finally, architect your solution around your availability requirements, using either multi-AZ or multi-region deployments.
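To see why redundant deployments help with availability requirements, it can be useful to run the arithmetic. The Python sketch below assumes each copy of the workload fails independently and that the workload stays up as long as at least one copy is up. Real zones and regions are not perfectly independent, and the 99% figure is an illustrative input rather than a published AWS number.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def combined_availability(single, copies):
    """Availability of `copies` redundant deployments, assuming each fails
    independently and the workload is up if at least one copy is up."""
    return 1 - (1 - single) ** copies


def downtime_minutes_per_year(availability):
    """Expected minutes of downtime per year at a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


# Illustrative input: each copy is available 99% of the time.
for copies in (1, 2, 3):
    a = combined_availability(0.99, copies)
    print(f"copies={copies}: availability {a:.6f}, "
          f"about {downtime_minutes_per_year(a):.1f} minutes of downtime per year")
```

Under these assumptions, a second copy cuts expected downtime from roughly 5,256 minutes a year to about 53, which is why the number of independent deployments you need follows directly from your availability target.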
Ilya Tsapin is the Architecture Practice Area Lead at Logic20/20. He has experience in project management, architecture, Agile methodologies, and more.