Description – The ability to support development and run workloads effectively, gain insight into their operations, and continuously improve supporting processes and procedures to deliver business value. The operational excellence pillar focuses on running and monitoring systems and on continually improving processes and procedures.
What – Operational excellence is a business philosophy that focuses on continuous improvement and the optimization of business processes and systems to achieve better results.
Why – To get new features and bug fixes into customers' hands quickly and reliably, and to improve the ongoing activities of an organization.
When – To increase efficiency, reduce costs, improve quality and customer satisfaction.
Design Principles & Tools of Operational Excellence
- Perform operations as code.
- Define your entire workload (applications, infrastructure, etc.) as code to limit human error and produce consistent responses to events.
- IaC (Infrastructure as Code) tools include AWS CloudFormation and HashiCorp Terraform.
- SCaC (Software Configuration as Code) tools include Chef, Puppet, and Ansible.
- Make frequent, small, reversible changes.
- Design workloads to allow components to be updated regularly to increase the flow of beneficial changes into your workload.
- Make changes in small increments that can be reversed if they fail. Small changes make it easier to identify and resolve issues introduced to your environment (without affecting customers when possible).
- Keep environments separate according to your requirements, for example Development, Quality Assurance, User Acceptance Testing, and Production.
- Relevant AWS services include AWS Systems Manager, AWS Elastic Beanstalk blue/green deployments, and Amazon EC2 Auto Scaling.
- Refine operations procedures frequently.
- Continuously monitor and remove unwanted resources and improve operations procedures to reduce manual intervention and increase efficiency.
- Review resources for associations with active workloads and delete those that are no longer required. Caution – Many services offer “Deletion Protection”; enable it for active resources wherever possible to avoid accidental deletion.
- Anticipate failure.
- Perform risk analysis (pre-mortem) exercises to identify potential sources of failure so that they can be removed or mitigated.
- Test your failure scenarios and validate your understanding of their impact.
- Set up regular game days to test workload and team responses to simulated events.
- Learn from all operational failures.
- Drive improvement through lessons learned from all operational events and failures.
- Share what is learned across teams and through the entire organization.
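To make the "perform operations as code" principle above more concrete, here is a minimal Python sketch that generates one AWS CloudFormation template per environment. It is an illustration under stated assumptions, not a recommended layout: the environment names come from the list above, while the `LogBucket` logical ID, the choice of an S3 bucket, the tag key, and the helper names `build_template` and `render_all` are hypothetical. The sketch only builds JSON text and does not call AWS. Note that the `DeletionPolicy` attribute used here is a CloudFormation feature and is separate from the per-service "Deletion Protection" settings mentioned above.

```python
import json

# Environment names taken from the text; everything else here is illustrative.
ENVIRONMENTS = ("development", "qa", "uat", "production")


def build_template(environment: str) -> dict:
    """Return a minimal CloudFormation template (as a dict) for one environment."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")

    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Example log bucket for the {environment} environment",
        "Resources": {
            # "LogBucket" is a hypothetical logical ID chosen for this sketch.
            "LogBucket": {
                "Type": "AWS::S3::Bucket",
                # Retain the bucket if the production stack is deleted by mistake.
                "DeletionPolicy": "Retain" if environment == "production" else "Delete",
                "Properties": {
                    "VersioningConfiguration": {"Status": "Enabled"},
                    "Tags": [{"Key": "Environment", "Value": environment}],
                },
            }
        },
    }


def render_all() -> dict:
    """Render one JSON template per environment, keyed by environment name."""
    return {env: json.dumps(build_template(env), indent=2) for env in ENVIRONMENTS}


if __name__ == "__main__":
    for env, body in render_all().items():
        print(f"--- {env} ---")
        print(body)
```

Because every environment is produced by the same function, differences between Development and Production are explicit in code and can be reviewed like any other change. Each rendered template could then be saved to a file and deployed with a tool such as `aws cloudformation deploy`, which is one possible workflow among several.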
Four Best Practice Areas of Operational Excellence
- Organization (Resource Management): This area focuses on organizing and managing resources, roles, and responsibilities to optimize operations. It includes evaluating requirements, governance, compliance, and the threat landscape.
- Prepare (Proof-of-concept): This area involves preparing for production changes before they occur.
- Operate (Monitor Systems): This area focuses on running and monitoring systems to deliver business value.
- Evolve (Review & Refine): This area involves regularly reviewing and refining operational processes to optimize operations.
Takeaway –
- Operational excellence is an ongoing and iterative effort.
- Every operational event and failure should be treated as an opportunity to improve the operations of your architecture.
- Focus on incremental improvement based on priorities as they change, and lessons learned from event response.
Next Topic – Security