Building a house on weak foundations is a disaster waiting to happen. Although writing software is a lot like building a home, we do have some room to improve the foundations we built on. In most situations, though, our ability to do that is constrained, and if significant changes are needed, a rewrite may unfortunately be the only option. The AWS Well-Architected Framework gives architects a well-defined set of rules they can follow from a project's inception, confident that these will lead to excellence right from the start. In addition to the existing five pillars, a new one was recently introduced: the Sustainability Pillar. The best practices from this pillar can essentially be split into two groups: infrastructural improvements and software/architectural improvements. Towards the end, we will take a look at a possible path to production. Let's dive in and try to reduce our carbon footprint!
Infrastructural Improvements
AWS is an IaaS provider at its core, and all of its managed services have state-of-the-art infrastructure as their foundation. However, once we decide to use a PaaS or SaaS offering, most of the time we lose fine-grained control over its inner workings; AWS RDS is a good example. Still, we can always control the big three of infrastructure: compute, storage, and networking. So there is always something we can do to strengthen the sustainability of our workloads in these areas.
In the networking area, the most apparent improvement we can make at the start is the choice of region: it should be the one closest to our application's customers. Another step to reduce the networking footprint is to take advantage of AWS PrivateLink and VPC endpoints. With these, workloads running on AWS connect to services like S3 or DynamoDB directly over the AWS backbone network; otherwise, that traffic would typically travel over the public internet.
Some of the most apparent compute improvements that enhance the sustainability of our workloads are auto-scaling and scheduled scaling. From the inception of the idea, the cloud's most significant selling point has been that your workloads can always have the ideal amount of computing power: your customers are not affected by performance issues when traffic rises, and you are not overpaying for unused resources when it falls. We also cannot forget about Spot Instances for development environments, intermittent workloads, and batch applications that are resilient to interruptions. There is a lot to be gained in terms of sustainability from these core techniques.
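The essence of scheduled scaling can be sketched as a pure function that maps the current time to a desired capacity. This is a minimal illustration only; the capacity numbers and the business-hours window are hypothetical, and in practice you would express the schedule as scheduled actions on an Auto Scaling group rather than compute it yourself.

```python
from datetime import datetime

# Hypothetical capacity settings; these numbers are illustrative, not AWS defaults.
PEAK_CAPACITY = 10      # instances during weekday business hours
OFF_PEAK_CAPACITY = 2   # instances at night and on weekends

def desired_capacity(now: datetime) -> int:
    """Return the desired instance count for a simple business-hours schedule."""
    is_weekday = now.weekday() < 5           # Monday=0 .. Friday=4
    is_business_hours = 8 <= now.hour < 18   # 08:00-18:00
    return PEAK_CAPACITY if (is_weekday and is_business_hours) else OFF_PEAK_CAPACITY
```

The point is simply that capacity should follow known traffic patterns instead of staying flat at its peak value around the clock.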
Finally, in the storage domain, we should always take advantage of native lifecycle policies as much as possible. If you have an S3 bucket for storing logs and know they can be archived after a month and removed after two years, create automated policies that move the objects to Glacier and then have S3 delete them for you. Treat container images on ECR in precisely the same manner. Additionally, consider using compression wherever it makes sense; this will decrease your network footprint as well.
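The log-bucket example above can be expressed as an S3 lifecycle configuration. The bucket name and `logs/` prefix are hypothetical; the actual boto3 call is shown commented out so the snippet stays self-contained.

```python
# Sketch of an S3 lifecycle configuration for a log bucket: transition objects
# to Glacier after 30 days, expire them after roughly two years.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-then-expire-logs",
            "Filter": {"Prefix": "logs/"},   # hypothetical key prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 730},     # ~2 years
        }
    ]
}

# Applied with boto3 (not executed here):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-log-bucket",  # hypothetical bucket name
#     LifecycleConfiguration=lifecycle_configuration,
# )
```

Once the rule is in place, S3 performs the transition and the deletion automatically, with no scheduled jobs to run or maintain.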
In general, if you follow most of the existing rules from the original five Well-Architected pillars, you should be fine in terms of sustainability in these three main infrastructural areas.
Software / Architectural Improvements
A couple of improvements in this area are relatively quick to implement; one of them is presented later in this article. Others are more challenging and could require modifications across all layers of your application. Nevertheless, they can definitely contribute to reducing your carbon footprint, make your workload more performant, and simply save you some money in the long run.
Process Requests in Batches
Processing requests ad hoc and reacting immediately to every incoming one without a justifiable business need will most likely result in unnecessary startup overhead in many areas: environment spin-up, establishing database connections, resource lookups, and other checks. Imagine you are developing an application for managing a fleet of trucks. One of the features is a visualization of each truck's route, based on GPS data sent every five seconds by special devices installed on the trucks. We could build a system that receives these requests and invokes a separate Lambda for each of them to update the map. While this would be close to a real-time system, we have to ask ourselves whether we really need the data to be updated that many times a minute. After a couple of meetings with the business analysts, it is determined that once a minute is sufficient for a good overview of the situation. So let the devices send their signals to an Amazon Kinesis stream and allow a single Lambda to process the records in a batch. We possibly do not even need to process the data independently for each truck: if we can wait a bit longer than a minute, that Lambda can serve records from all the trucks at once. Either way, we could save hundreds of new environment spin-ups and significantly reduce our carbon footprint. Keep that in mind when designing the architecture for your next application.
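A minimal sketch of what that batch-processing Lambda could look like, assuming each Kinesis record carries a JSON payload with hypothetical `truck_id`, `lat`, and `lon` fields. One invocation handles the whole batch and keeps only the latest position per truck, instead of one invocation per GPS ping.

```python
import base64
import json

def handler(event, context=None):
    """Process a batch of GPS readings delivered by a Kinesis event source
    in a single invocation, keeping only the latest position per truck."""
    latest_by_truck = {}
    for record in event["Records"]:
        # Kinesis delivers the payload base64-encoded under record["kinesis"]["data"].
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Later records in the batch overwrite earlier ones for the same truck.
        latest_by_truck[payload["truck_id"]] = (payload["lat"], payload["lon"])
    # In a real system we would now write latest_by_truck to the map's data store.
    return latest_by_truck
```

With a five-second reporting interval and a one-minute batch window, each truck contributes about twelve records per invocation, so a fleet of a hundred trucks needs one cold-startable environment instead of over a thousand per minute.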
Move Compute to Front-End as Much as Possible
This is my personal favourite, though it took me some time to fully comprehend this way of thinking about managing data across the layers of an application. The strategy mainly applies to web-based applications, but in theory any application with a clearly separated view layer that is not server-side rendered can use it; in essence, 90% of the software currently being developed should fall into this category. Most of us have worked on legacy applications with heavily normalized database schemas that required very complex queries or massive stored procedures to obtain a certain aggregate of data. A couple of these queries would then be pushed to the service layer, where a few thousand lines of code took care of the processing, all so that the presentation layer could receive a made-to-measure data structure it just needed to render. If such an application receives hundreds of requests per second, a lot of computing power is required in the persistence and service layers to process all this data. Without a doubt, a large footprint is there.
Back in the early 2000s, this was most likely the strategy you had to implement, as web browsers (and client workstations, for that matter) were nowhere near as powerful as they are today. So why not take advantage of the power that now lies on the client side? For the application I mentioned earlier, we can identify the access patterns and, based on these, work out how to store the data in a non-relational database like DynamoDB. DynamoDB consistently delivers single-digit-millisecond response times, and given that the data is already aggregated, our back end can literally serve as a hose that moves it to the front end as is. From then on, let the front end take care of any required transformations and processing in addition to rendering. With the speed of current browsers and client devices, this architectural change should be transparent to the user, or only slightly noticeable. For our company, though, it would mean a substantial down-scaling opportunity, allowing for huge savings on compute and a significant increase in sustainability.
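To make "identify the access patterns first" concrete, here is a sketch of a possible key design for the truck-fleet application. Everything here is an assumption for illustration: two hypothetical access patterns ("latest positions of all trucks in a fleet" and "one truck's route for a given day") and hypothetical attribute names.

```python
def latest_position_key(fleet_id: str, truck_id: str) -> dict:
    """Key for the 'latest positions of a whole fleet' pattern.
    One item per truck, overwritten on every update; a single Query on PK
    returns the current position of every truck in the fleet."""
    return {"PK": f"FLEET#{fleet_id}", "SK": f"TRUCK#{truck_id}"}

def route_point_key(truck_id: str, date: str, timestamp: str) -> dict:
    """Key for the 'route history of one truck for a day' pattern.
    Items sort by timestamp, so one Query on PK returns the whole day's
    route already ordered, with no server-side processing needed."""
    return {"PK": f"TRUCK#{truck_id}#DATE#{date}", "SK": f"TS#{timestamp}"}
```

Because each query returns data already shaped for the view that asked for it, the back end has nothing left to compute; any remaining transformation happens in the browser.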
Simply Use Managed Services
"In the future, all the code you ever write will be business logic."
Werner Vogels, CTO, Amazon.com
Whether we like it or not, we are all heading in that direction. To be honest, I think we should all embrace it and actively strive to reach this kind of state. Which team lead is willing to sacrifice weeks of their engineers' work on a custom implementation of a request/response sanitization library just so that the team has complete control over its features, only to realize it actually took months, a third-party auditor keeps finding holes in it, and it requires constant maintenance? Would it not be wiser to simply reuse a ready-made library maintained and updated daily by a software house specializing purely in the security domain? Besides, such a company has most likely mastered performance tuning, so the computing power required to serve each request and response is always minimal. Yes, we won't have absolutely complete control over its features, but how often would you critically need to deviate from a well-established standard? Not very often, in my opinion. And even then, you should question whether the engineering decisions that eventually required those customizations were valid in the first place.
Most AWS managed services originated from the real-life needs and scenarios of AWS customers: a requirement pattern became so common that a solution in the form of a managed service naturally emerged. AWS currently offers more than two hundred services, ranging from IaaS to SaaS. Are you sure you really want to implement that in-house face-recognition application? Just use Amazon Rekognition. Maybe at some point you won't need to recognize faces but voices instead? Would you rewrite your (very inefficient) original program and waste even more money? Just switch to Amazon Transcribe and use it on demand. I don't think we even have to discuss scaling compute in the cloud versus scaling on-premises. You can be confident that AWS has all the operational and performance excellence behind its managed services, which implicitly puts the sustainability of your workloads at the highest possible level.
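To show how little code "just use the managed service" actually means, here is a sketch of a Rekognition face-detection request via boto3. The bucket and object key are hypothetical, and the network call itself is left commented out so the snippet does not require AWS credentials.

```python
# Request parameters for Amazon Rekognition's DetectFaces API. The entire
# "in-house face-recognition application" collapses into this one call.
detect_faces_request = {
    "Image": {
        "S3Object": {
            "Bucket": "my-media-bucket",     # hypothetical bucket
            "Name": "photos/driver.jpg",     # hypothetical object key
        }
    },
    "Attributes": ["DEFAULT"],  # bounding box, landmarks, pose, quality
}

# Executed against AWS it would look like this (not run here):
# import boto3
# rekognition = boto3.client("rekognition")
# response = rekognition.detect_faces(**detect_faces_request)
# faces = response["FaceDetails"]
```

You pay per image processed and provision nothing, so idle GPU capacity for a rarely used in-house model simply disappears from your footprint.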
Path to Production
When it comes to measuring the results of your improvements in the Sustainability Pillar, the metrics mostly revolve around reducing provisioned resources and the resources consumed per unit of work. Before going to production, make sure you first test for these metrics in a dedicated Dev/Test environment. Ideally, you should have automated test suites that can run end-to-end tests in parallel, simulating many users using the system simultaneously. Some manual testing is advised, just to make sure the usability of our application has not degraded.
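The "many simultaneous users" part of such a suite can be reduced to a small helper that fans the same request out across a thread pool. This is a deliberately minimal sketch; `call` stands in for whatever end-to-end request your suite issues, and a real load test would also record latencies and error rates.

```python
from concurrent.futures import ThreadPoolExecutor

def simulate_users(call, n_users: int) -> list:
    """Invoke the same request callable once per simulated user, in parallel.
    Returns one result per user, which the test suite can then assert on
    or aggregate into per-unit-of-work metrics."""
    with ThreadPoolExecutor(max_workers=n_users) as pool:
        return list(pool.map(lambda _: call(), range(n_users)))
```

Running this against the Dev/Test environment before and after a change gives you comparable numbers for resources consumed under the same simulated load.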
Once you are ready to deploy to production, try not to do it in the typical rolling way; a Blue/Green deployment is best suited for this kind of situation. Although it is one of the more expensive options, you have the safety net of being able to switch back to your legacy environment in near real time. A canary deployment might also be a good idea: we could start by cautiously redirecting 5% of traffic to the new environment and gradually increase that share if we do not detect any significant problems.
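One possible ramp-up schedule for such a canary, sketched as a pure function: start at 5% and double the share routed to the new environment at each healthy checkpoint until it serves all traffic. The starting percentage and doubling factor are illustrative choices, not a prescription.

```python
def canary_steps(start_pct: int = 5, factor: int = 2, cap: int = 100) -> list:
    """Return the successive traffic percentages routed to the new environment,
    doubling at each step until the new environment serves all traffic."""
    steps, pct = [], start_pct
    while pct < cap:
        steps.append(pct)
        pct = min(pct * factor, cap)
    steps.append(cap)
    return steps
```

With the defaults this yields 5, 10, 20, 40, 80, and finally 100 percent; at any step, detecting a regression means shifting the small redirected share back rather than rolling back a full deployment.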
In the long run, analysis of the AWS Cost and Usage Reports can give you a lot of insight into your new deployment, mainly to verify whether there are still some hot spots and when they occur. Amazon Athena, combined with a few useful queries over those reports, can help you greatly in this analysis.