Description
The reliability pillar encompasses the ability of a workload to perform its intended function correctly and consistently when it’s expected to. This includes the ability to operate and test the workload throughout its total lifecycle.
Resiliency is the ability of a workload to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions, such as misconfigurations or transient network issues.
What – Reliability is a measure of stability or consistency: the quality of being trusted because the workload works or behaves well.
Why – To focus on workloads performing their intended functions.
When – Whenever systems must recover automatically from infrastructure or service disruptions, i.e., recover quickly from failure so that they continue to meet demand.
Shared Responsibility Model for Resiliency
AWS responsibility – Resiliency of the cloud
Customer responsibility – Resiliency in the cloud
As a customer, you are responsible for the management of the following aspects of your system to achieve resilience in the cloud.
Networking, quotas, and constraints
– Plan your architecture with adequate room to scale, and understand the service quotas and constraints of the services you use.
– Design your network topology to be highly available, redundant, and scalable.
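As an illustration of planning "adequate room to scale", the check below is a minimal sketch that compares projected peak usage against a service quota. The function name `quota_headroom`, the growth factor, and the numbers in the usage note are hypothetical; real quota values would come from your own account, not from this example.

```python
import math


def quota_headroom(current_usage: int, quota: int, peak_growth_factor: float) -> dict:
    """Estimate how much of a quota remains at projected peak usage.

    All inputs are illustrative assumptions: `peak_growth_factor` is the
    multiple of current usage you expect at peak (for example 2.0 for double).
    """
    if quota <= 0:
        raise ValueError("quota must be positive")
    if current_usage < 0 or peak_growth_factor < 0:
        raise ValueError("usage and growth factor must be non-negative")

    # Round up so that fractional projections never understate demand.
    projected = math.ceil(current_usage * peak_growth_factor)
    return {
        "projected_usage": projected,
        "remaining": quota - projected,      # negative means the quota is exceeded
        "utilization": projected / quota,    # fraction of the quota consumed at peak
        "within_quota": projected <= quota,
    }
```

For example, with 40 units in use, a quota of 100, and an expected doubling at peak, the projection is 80 units, which leaves 20 units of headroom. A result with `within_quota` set to `False` is a signal to request a quota increase or redesign before the peak arrives.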
Change management and operational resilience
– Change management includes how to introduce and manage change in your environment.
– Workloads in the cloud must adapt to changes in demand, scaling in reaction to impairments or fluctuations in usage.
– A resilient strategy for monitoring workload resources considers all components, including both technical and business metrics, notifications, automation, and analysis.
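Scaling in reaction to fluctuations in usage can be sketched with a simple proportional rule: size the fleet so that the observed metric moves back toward a target value. This is an illustrative rule only, not the algorithm used by any particular AWS service, and the name `desired_capacity` and its parameters are assumptions made for the example.

```python
import math


def desired_capacity(
    current_capacity: int,
    observed_metric: float,
    target_metric: float,
    min_capacity: int,
    max_capacity: int,
) -> int:
    """Return an instance count that would bring the metric back to target.

    Assumes the metric (for example average CPU utilization) scales
    inversely with the number of instances, which is a simplification.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    if current_capacity < 1 or min_capacity > max_capacity:
        raise ValueError("invalid capacity bounds")

    # Proportional estimate, rounded up so we never under-provision.
    raw = math.ceil(current_capacity * observed_metric / target_metric)

    # Clamp to the configured floor and ceiling.
    return max(min_capacity, min(max_capacity, raw))
```

With 4 instances at 90% utilization and a 60% target, the rule asks for 6 instances. The floor and ceiling keep the fleet from shrinking below a safe minimum or growing past a budget or quota limit.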
Observability and failure management
– Observing failures through monitoring is required to automate healing so that your workloads can withstand component failures.
– Failure management requires backing up data, applying best practices to allow your workload to withstand component failures, and planning for disaster recovery.
Workload architecture
– Your workload architecture includes how you design services around business domains, apply service-oriented architecture (SOA) and distributed system design to prevent failures, and build in capabilities like throttling, retries, queue management, timeouts, and emergency levers.
– Rely on proven AWS solutions to align with best practices.
– Use continuous improvement to decompose your system into distributed services to scale and innovate faster.
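Of the capabilities listed above, retries are the easiest to show in a few lines. The sketch below retries an operation a bounded number of times, using exponential backoff with full jitter so that many clients do not retry in lockstep. The names `TransientError` and `call_with_retries`, and the default delays, are hypothetical choices for illustration rather than part of any AWS SDK.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on TransientError with backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            # Give up after the final attempt and surface the error.
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random time between 0 and the ceiling.
            sleep(random.uniform(0, ceiling))

    raise AssertionError("unreachable")
```

Bounding the number of attempts matters as much as the backoff: unbounded retries can amplify load on a dependency that is already struggling. The `sleep` parameter is injectable only so the function can be exercised in tests without real waiting.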
Continuous testing of critical infrastructure
– Testing reliability means testing at the functional, performance, and chaos levels, as well as adopting incident analysis and game day practices to build expertise in resolving issues that are not well understood.
Five Design Principles of Reliability
- Automatically recover from failure
- Test recovery procedures
- Scale horizontally (adding instances) to increase aggregate workload availability
- Stop guessing capacity
- Manage change through automation
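The claim that horizontal scaling increases aggregate availability can be made concrete with a small calculation. The sketch below assumes that instances fail independently and that the workload stays available as long as at least one instance is up; both are simplifying assumptions, and the function name `aggregate_availability` is hypothetical.

```python
def aggregate_availability(instance_availability: float, instances: int) -> float:
    """Availability of a group of redundant instances.

    Assumes independent failures and that one healthy instance is enough
    to serve the workload. The group is unavailable only when every
    instance is down at the same time.
    """
    if not 0.0 <= instance_availability <= 1.0:
        raise ValueError("instance_availability must be between 0 and 1")
    if instances < 1:
        raise ValueError("instances must be at least 1")

    # Probability that all instances are down simultaneously.
    all_down = (1.0 - instance_availability) ** instances
    return 1.0 - all_down
```

Under these assumptions, two instances that are each 99% available give roughly 99.99% aggregate availability. Correlated failures, such as a shared dependency or a single Availability Zone outage, break the independence assumption and reduce the real figure.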