AWS Well-Architected Framework Applied: Operational Excellence

August 7, 2019

According to the TechJury 2019 report, AWS is still the most popular cloud provider, which makes the question of how to build an AWS infrastructure more relevant than ever. Unfortunately, the AWS Well-Architected Framework (AWS WAFR) and its related materials offer far more theory than real-world experience and examples.

So, in this article series, we’ll explain how AWS WAFR applies to real-world web applications, showing how each pillar is used in a specific application we’ve developed recently.

This article focuses on the first pillar, defined by AWS as “Operational Excellence”.

The Application

First of all, we need to give you some information about the application.

All the examples shown in this series are taken from the architecture and code of the collaborative platform currently being developed at Quantum.

The purpose of this project is to create a platform where the owners of unique and specific equipment, services and expertise could communicate with startups, thereby engaging the necessary resources to implement innovative technological ideas.

The web application contains plenty of features that help platform users communicate with each other: personal and project chats, project forums, social media-like connections, and publishing of documents and media files. The platform has a comprehensive notification system (internal and by email), which is the basis for the core collaboration workflows: connections between people, invitations to project managers and security managers, invitations for people to join various social and business initiatives, requests for necessary resources, notifications about new posts and messages, etc. The platform lets you own your data and keeps it secure. You can even assign a Security Manager to review the content you put on the site, which adds another layer of protection for your digital intellectual property.

The tech stack backing all these features is pretty simple: Python/Django on the server side with PostgreSQL as the database, React.js/Redux on the front end, and Elasticsearch for speeding up search across different types of data.

Knowing all the requirements, both business and technical, we can start applying WAFR to our application, so let’s talk about the first pillar: Operational Excellence.

Operational Excellence: Definition and Design Principles

In its whitepaper, Amazon defines operational excellence as “the ability to run systems and gain insight into their operations to deliver business value and to continually improve supporting processes and procedures”. Simply put, an application that’s built with operational excellence in mind should be able to update, scale, collect business insights and tech metrics, and notify the team about anything unusual (servers down, application exceptions, intrusion attempts) in a well-documented, automated manner, with all the needed setup defined in code. Changes to that code should be made in small increments, with each step and each change documented. That seems like a lot of requirements, but with AWS, it can be implemented pretty easily.

Our Application of Operational Excellence

Quantum uses the following main services for maintaining Operational Excellence: CloudFormation, CloudWatch/CloudWatch Agent/CloudTrail, and AWS Lambda. Let’s discuss each one in detail.

CloudFormation (CF): all the infrastructure is managed and updated via CloudFormation templates stored in a Git repository. This helps us track and document each change and allows us to perform code reviews for each change. All of the resources, such as EC2, RDS, S3, and ELB, are managed solely via CloudFormation templates, with no manual setup. The OS on EC2 instances is configured by passing user data through CF templates.
CloudWatch/CloudWatch Agent and CloudTrail: CloudWatch is used for monitoring EC2-level metrics (CPU load, I/O load, network load, etc.) and application-level metrics (request rate, error codes, etc.). Notifications about drastic changes in service load or about incidents are sent via SNS: to the e-mail addresses of the responsible engineers in case of an incident, and to an AWS Lambda function when we need to scale up or down. We also use CloudTrail, which helps us track access and actions in the AWS console and the AWS API. This information is analyzed to detect and prevent suspicious access attempts and uncontrolled scaling of resources.
AWS Lambda: we use AWS Lambda for automated scaling of the infrastructure. Changes are triggered via SNS. In response, the Lambda function launches a new CloudFormation stack from a template or deletes stacks that are redundant for the current load, updates Route 53, and sends a notification about the results to the responsible engineer.
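To make the scaling flow above more concrete, here is a minimal sketch of what such a Lambda handler might look like in Python. It is not our production code: the alarm names, the stack name, and the template URL are hypothetical placeholders, and the Route 53 update and the result notification are left out for brevity. It assumes the SNS message is a standard CloudWatch alarm notification, which carries the AlarmName and NewStateValue fields.

```python
import json

# Hypothetical names used only for illustration; adjust to your own setup.
SCALE_UP_ALARM = "app-high-load"
SCALE_DOWN_ALARM = "app-low-load"
EXTRA_STACK = "app-extra-node"
TEMPLATE_URL = "https://example-bucket.s3.amazonaws.com/app-node.yaml"


def decide_action(alarm):
    """Map a CloudWatch alarm notification to 'scale_up', 'scale_down', or None."""
    if alarm.get("NewStateValue") != "ALARM":
        return None  # ignore OK and INSUFFICIENT_DATA transitions
    name = alarm.get("AlarmName")
    if name == SCALE_UP_ALARM:
        return "scale_up"
    if name == SCALE_DOWN_ALARM:
        return "scale_down"
    return None


def handler(event, context, cloudformation=None):
    """Lambda entry point for CloudWatch alarms delivered through SNS."""
    if cloudformation is None:
        # Imported lazily so the decision logic can be tested without AWS access.
        import boto3
        cloudformation = boto3.client("cloudformation")

    actions = []
    for record in event.get("Records", []):
        # SNS delivers the alarm as a JSON string in the Message field.
        alarm = json.loads(record["Sns"]["Message"])
        action = decide_action(alarm)
        if action == "scale_up":
            cloudformation.create_stack(StackName=EXTRA_STACK, TemplateURL=TEMPLATE_URL)
        elif action == "scale_down":
            cloudformation.delete_stack(StackName=EXTRA_STACK)
        actions.append(action)
    return actions
```

Keeping the decision logic in a separate function and passing the CloudFormation client as an optional argument is a design choice of this sketch: it lets the handler be exercised locally with a stub client before it is deployed.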

Possible Improvements

Applying all the best practices contained in WAFR to your application is an ongoing task that can take an indefinite amount of time, so everything described above is only what we’ve implemented from the start. Right now, we are also considering using AWS Trusted Advisor to see what else can be improved in the infrastructure, and we have an idea to move from Bitbucket and Jenkins to CodeCommit and CodePipeline to keep all of our services on Amazon.

This is the end of our first AWS WAFR article. Thanks for your time and interest!

The next part is coming soon, and it’ll be about the Security pillar of our application.

P.S.: Share your ideas and insights in the comments, and tell us what else you would like to read about.

All the best, your Quantum team.