Reliability Pillar

Description – The reliability pillar encompasses the ability of a workload to perform its intended function correctly and consistently when it’s expected to. This includes the ability to operate and test the workload through its total lifecycle.

Resiliency is the ability of a workload to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions, such as misconfigurations or transient network issues.

What – Reliability is a measure of stability or consistency: the quality of being trusted or believed because something works or behaves well.

Why – To focus on workloads performing their intended functions.

When – To recover systems automatically from infrastructure or service disruptions, that is, to recover quickly from failures so the workload keeps meeting demand.

Shared Responsibility Model for Resiliency

AWS responsibility – Resiliency of the cloud
Customer responsibility – Resiliency in the cloud

As a customer, you are responsible for the management of the following aspects of your system to achieve resilience in the cloud.

Networking, quotas, and constraints
– Plan your architecture with adequate room to scale, and understand the quotas and constraints of the services you use.
– Design your network topology to be highly available, redundant, and scalable.

Change management and operational resilience
– Change management includes how to introduce and manage change in your environment.
– Workloads in the cloud must adapt to changes in demand, scaling in reaction to impairments or fluctuations in usage.
– A resilient strategy for monitoring workload resources considers all components, including both technical and business metrics, notifications, automation, and analysis.

Observability and failure management
– Observing failures through monitoring is required to automate healing so that your workloads can withstand component failures.
– Failure management requires backing up data, applying best practices to allow your workload to withstand component failures, and planning for disaster recovery.
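
The link between observing failures and automated healing can be sketched in a few lines of Python. This is only an illustration, not an AWS API: the `Instance` class, the `heal` function, and the `launch_replacement` callback are hypothetical names, and a real workload would more likely rely on managed health checks and automatic instance replacement.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instance:
    """Hypothetical record of one workload component and its last health-check result."""
    name: str
    healthy: bool = True


def heal(fleet: List[Instance],
         launch_replacement: Callable[[Instance], Instance]) -> List[str]:
    """Replace every unhealthy instance in place; return the names that were replaced."""
    replaced = []
    for index, instance in enumerate(fleet):
        if not instance.healthy:
            # Monitoring flagged this component, so swap it out instead of repairing it by hand.
            fleet[index] = launch_replacement(instance)
            replaced.append(instance.name)
    return replaced
```

Running `heal` on a schedule, with `launch_replacement` wired to whatever provisions a fresh component, lets the fleet recover from a component failure without an operator stepping in.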

Workload architecture
– Your workload architecture includes how you design services around business domains, apply service-oriented architecture (SOA) and distributed system design to prevent failures, and build in capabilities like throttling, retries, queue management, timeouts, and emergency levers.
– Rely on proven AWS solutions to align with best practices.
– Use continuous improvement to decompose your system into distributed services to scale and innovate faster.
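
Of the capabilities listed above, retries are the easiest to show in code. The following is a minimal Python sketch of retrying a transient failure with capped exponential backoff and jitter. The function name `call_with_retries`, its parameters, and the default values are assumptions chosen for illustration, not settings taken from any AWS SDK.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      retryable=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(), retrying transient errors with capped exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts:
                raise  # retry budget exhausted; surface the failure to the caller
            # The backoff ceiling doubles on each attempt and is capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random time in [0, ceiling] so clients do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
```

Bounding the number of attempts and randomizing the wait keeps a brief disruption from turning into a wave of synchronized retries. The `sleep` parameter is injectable so the behavior can be exercised in tests without real delays.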

Continuous testing of critical infrastructure
– Testing reliability means testing at the functional, performance, and chaos levels, as well as adopting incident analysis and game day practices to build expertise in resolving issues that are not well understood.

Five Design Principles of Reliability

– Automatically recover from failure
– Test recovery procedures
– Scale horizontally (adding instances) to increase aggregate workload availability
– Stop guessing capacity
– Manage change through automation
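
The horizontal-scaling principle can be illustrated with a little arithmetic. The Python sketch below assumes that instances fail independently and that any single surviving instance can serve the workload. Both are simplifying assumptions, and the 99% per-instance figure is an illustrative input rather than a published number.

```python
def parallel_availability(instance_availability: float, instances: int) -> float:
    """Availability of a group that is up as long as at least one independent replica is up."""
    if not 0.0 <= instance_availability <= 1.0:
        raise ValueError("instance_availability must be between 0 and 1")
    if instances < 1:
        raise ValueError("instances must be at least 1")
    # The group is down only when every replica is down at the same time.
    return 1.0 - (1.0 - instance_availability) ** instances


for n in (1, 2, 3):
    print(n, f"{parallel_availability(0.99, n):.6f}")
```

Under these assumptions, going from one instance to two at 99% each raises the aggregate availability from 0.99 to 0.9999. Correlated failures, such as a shared dependency or a single Availability Zone, reduce that benefit in practice.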
