We started discussing the meaning of Cloud Native within Codurance some time ago with a view to forming some kind of consensus around our understanding. As we looked into the subject it became clear that there was actually far more to it than one simple answer or one article. In fact, we very quickly realised that there were a series of articles to write. The result is that this article is very much the taster article which will point the way to the other subjects that we feel are an essential part of what Cloud Native means and why we should care.
In the early 2000s we heard a lot about SOLID principles. As the world of software moved away from Windows applications to web-based functionality, we started to hear a lot about 12 factor applications, a concept first put forward by Heroku around 2011. Ten years on and the web has matured. In particular, most organisations have moved their IT estate out of their premises or the physical data centre and into the cloud. This has opened up a wealth of potential to even the smallest of businesses. But with that extra potential have come new design challenges.
How do you fully leverage the power of the cloud to unlock new value for your business? Designing your systems and solutions to take full advantage of the cloud, removing the shackles imposed by the constraints of traditional hardware, and understanding when and how to leverage existing commoditised solutions, has come to be known as Cloud Native architecture. In this article we discuss what is really meant by “Cloud Native Architecture”, how we can take advantage of it and how it might apply to your business.
A brief history of Cloud
AWS launched S3, SQS and EC2, initially in America only, during 2006. Google launched App Engine, the first GCP service, in April 2008, and Microsoft Azure was initially released in February 2010. For many years the main functionality provided by these new cloud computing services was abstractions in the cloud that mimicked the functionality of the hardware being maintained in private server rooms or in private or public data centres. Some immediate advantages of this new hosting paradigm were:
- As all machines were now virtual, bringing new machines online or taking machines away from a cluster, became quick and easy. In May 2009 Amazon launched AWS Auto Scaling Groups, enabling a business to automatically and rapidly scale up and down in response to changing demand.
- Servers were no longer physical assets requiring ongoing investment in maintenance, so running costs fell.
- Servers were no longer a single point of failure.
The brave new world of cloud computing was not without its difficulties. In the early days of cloud infrastructure the abstractions offered by AWS and Azure exactly mirrored the physical infrastructure that they were trying to replace. From 2010 Azure offered mainly Windows-based servers to host applications, while AWS offered EC2 (Elastic Compute Cloud) servers from 2006 running a variety of operating systems.
In addition to producing an abstraction of a server, both AWS and Azure created abstractions for common hardware devices. So in AWS we had ELB (Load balancer), NAT Gateway, VPC and Subnet, among other things. Whilst these abstractions no doubt made sense to infrastructure engineers they offered no new help to software delivery teams who still relied on the infrastructure engineers to deploy and run their applications.
So whilst the new flexible nature of the cloud services placed more power in the hands of development teams, it was difficult to immediately harness this power and move to a DevOps mindset because the traditional split in skill set and knowledge between developers and operations was not addressed by the first wave of cloud abstractions.
Lift and Shift Cloud Migration
Many organisations consider moving their servers to the cloud as an end in itself. Often they will plan a hurried migration to the Cloud, possibly because their contract at the data centre is expiring, and they will certainly be anxious to ensure that everything works in the cloud as it does in the data centre. This is a perfectly reasonable anxiety, and the logical way to make sure that the business continues smoothly is to copy all of the servers and network infrastructure into the Cloud exactly as they are. This is the “Lift and Shift Cloud Migration” (anti) pattern.
There are two obvious problems with Lift and Shift:
- Any architectural shortcomings that may have accumulated over the lifetime of the system will be replicated in the Cloud architecture because it is exactly the same.
- None of the cloud specific services that could meet your business needs more appropriately and cheaply will be leveraged.
The Cloud Promises: Scalability / Isolation / Maintainability / Extensibility / Decoupling Teams / Spend Control / Resilience
Load balancing is not a new, or Cloud, concept. The idea is that incoming traffic is received by a special type of web server called a Load Balancer. This component then passes the traffic on to one of a group of servers (generally called a cluster), and there is some mechanism by which the load balancer rotates the traffic between the servers in its cluster. In pre-Cloud days the general setup would be to have a fixed number of servers in your cluster, sized to handle traffic at the highest expected level of demand.
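The rotation mechanism described above is, in its simplest form, just round-robin selection. A minimal sketch (the server addresses are made up for illustration):

```python
from itertools import cycle

class RoundRobinBalancer:
    """Minimal round-robin load balancer: rotates requests across a fixed cluster."""

    def __init__(self, servers):
        # cycle() endlessly repeats the server list in order.
        self._servers = cycle(servers)

    def next_server(self):
        return next(self._servers)

lb = RoundRobinBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
# Four requests: the fourth wraps back around to the first server.
targets = [lb.next_server() for _ in range(4)]
```

Real load balancers add health checks, weighting and session affinity on top of this basic rotation, but the core idea is no more than this.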
All cloud providers offer some kind of automatic scaling of servers. AWS introduced the concept of Auto Scaling Groups back in May 2009. In an auto scaling group the load balancer forwards traffic in the same way as a traditional load balancer but the difference is that there is a mechanism that monitors the traffic levels and either adds new nodes or removes nodes from the cluster according to the current level of demand.
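The monitoring mechanism behind an auto scaling group can be sketched as a simple capacity calculation. The following is an illustrative simplification, loosely modelled on target-tracking scaling policies; the parameter names and thresholds are invented for the example:

```python
import math

def desired_capacity(current_nodes, avg_cpu, target_cpu=50.0,
                     min_nodes=2, max_nodes=10):
    """Size the cluster so that average CPU utilisation approaches the target.

    E.g. 4 nodes at 90% CPU against a 50% target suggests 4 * 90 / 50 = 7.2,
    rounded up to 8 nodes; the result is clamped to the configured bounds.
    """
    if avg_cpu <= 0:
        return min_nodes
    ideal = math.ceil(current_nodes * avg_cpu / target_cpu)
    return max(min_nodes, min(max_nodes, ideal))

scale_up = desired_capacity(current_nodes=4, avg_cpu=90.0)    # demand spike
scale_down = desired_capacity(current_nodes=4, avg_cpu=20.0)  # quiet period
```

The cloud provider evaluates something like this continuously against live metrics and adds or removes nodes to close the gap, which is exactly the elasticity that a fixed pre-Cloud cluster could not offer.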
A key promise of the Cloud is to isolate the components of your solution from one another. In the old days of hosted or on-prem servers, failures tended to be global (with respect to a single organisation) or at least very wide ranging in their scope. If a single server went down, this tended to take down a lot of services, both external and internal facing. Networking problems tended to lead to all sorts of problems, both inside and outside the organisation. The Cloud offers the tantalising prospect of limiting the blast radius of failures by isolating failures to single services, rather than on the level of servers or networks. It should be quite clear at this point that in order to take advantage of isolation of components you must already have separately deployable components. If your solution is, say, a monolithic Ruby on Rails application then it will only be possible to deploy it as a single unit.
The Cloud offers the promise of simplifying your operational support burden. You no longer have to carefully curate physical servers, behaving like a zoo keeper, carefully ensuring the health of each (individually named) server under your charge. Instead your operational mindset can become more akin to a sheep farmer with a large number of unnamed, largely self supporting and fungible assets which together ensure the health of the wider organisation.
The cloud should allow us to roll out new capability quickly through the use of commodity managed infrastructure. But there is a cost to this too. When we had a monolith we could afford to come up with an authentication system that was unique to our monolith, such as using Active Directory users in the database and controlling permissions through the database engine. It isn’t practical to roll your own security out to every new component. So we need to start to understand what a generalised service looks like. We can use service templates or similar, but essentially we are talking about recognising those parts of the software stack that are commodity versus those parts that are differentiating factors for your business over your competitors.
The cloud offers ways to decouple your teams. This is important because it means that each team can own its own value stream and deploy new value when it is ready. This removes coordination overhead, freeing up people to add real value to the business, as well as ensuring that time to market is drastically reduced. However, in order to decouple the teams you need to decouple the software. This introduces overhead in organisational maturity, tooling maturity and engineering maturity.
There is also a more subtle barrier that has to be overcome. In order to fully decouple your teams you need to trust them to deliver their outcomes. This means that the business as a whole and in particular any management layers, need to stop being managers and start being enablers and facilitators. They need to understand how to trust teams to deliver on outcomes. This is a whole article in itself…
In the old metal server, server farm based infrastructure model there was a high premium on physical servers, both in terms of absolute cost and the time taken to provision any new servers. This caused difficulty in optimising cost in two different ways:
- The time lag associated with bringing new servers online forced most organisations to pay for more capacity than they needed at any given point in order to avoid customers getting a bad experience at times of high load. This was a particular difficulty for e-commerce companies based in a narrow band of time zones, such as Western Europe, where demand would be high for only a few, predictable, hours each day.
- The high cost of provisioning a new server led infrastructure teams to install many different services on each server. This made it impossible to scale by component or service, as the unit of scalability was the server itself. So a cluster could require additional instances because just one of its many installed services experienced regular spikes in traffic, meaning extra resources were provisioned for services that did not need them.
A key promise of cloud computing is that by segregating your services into smaller deployable units you can optimise the cost of hosting your services.
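The cost argument above can be made concrete with some back-of-the-envelope arithmetic. The demand figures below are invented for illustration: one spiky service and one flat service, compared as a single co-located cluster scaled for peak combined demand versus two independently scaled deployments:

```python
# Hypothetical instances needed per hour over six hours.
web_demand = [2, 2, 8, 8, 2, 2]    # spiky customer-facing service
batch_demand = [3, 3, 3, 3, 3, 3]  # flat background processing

# Co-located on one cluster with no per-service scaling: the cluster must be
# sized for the peak combined demand for the whole period.
combined = [w + b for w, b in zip(web_demand, batch_demand)]
monolith_instance_hours = max(combined) * len(combined)

# Separately deployable services, each scaled to its own demand hour by hour.
split_instance_hours = sum(web_demand) + sum(batch_demand)
```

Here the co-located cluster pays for 66 instance-hours against 42 for the segregated services, and the gap widens as spikes get sharper or more services share the cluster.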
The cloud promises that your components can still function even if functionality that they depend upon is unavailable for any reason. This is crucial to a large system. For example, it could be that you are unable to accept orders for a certain part of the country because the API of the courier you need to fulfil those orders is unavailable. It would not make sense to reject transactions that don’t need that courier, and it would make even less sense to reject incoming traffic to your website. The cloud promises that we can maintain service that is appropriate for the dependencies that are available. This enables us to maintain something close to 100% uptime, even when reliability of downstream dependencies is far from total.
The Cloud Native Premium
In 2015 Martin Fowler wrote about the Microservices Premium. Much of what was written about Microservices in the past could be equally applied to Cloud Native today. The observant reader will have noticed that many of the cloud promises above depend on your overall system having a degree of separation of components. Whilst it isn’t valuable to go into the precise definition of a Microservices architecture here, it is safe to say that if you have a monolithic application at the core of your business, you will need to consider splitting that monolith in order to consider yourself Cloud Native.
Splitting the Monolith
Many of the cloud promises mandate that you have split your monolithic application into smaller deployable units. You cannot scale individual components, deploy units separately or decouple your teams effectively if you have a monolithic solution. Note that this need not mean implementing a “true” microservices architecture. For example, you may have a large monolithic database that underlies all of your functionality and serves as a large coupling point. You could achieve much of the promise of the cloud by accepting, at least for the time being, that this coupling point will remain while you split the codebase into its logical vertical parts. There are various strategies for splitting a monolithic application which we won’t cover in detail here.
The cloud promises that we can maintain our service even when dependencies are unavailable. Maintaining the user experience is vital as is allowing your users to do as much as they want to do while they engage with your company. In order to make this happen it is important that your solution is designed to fail gracefully and without degradation of customer experience. The solution to making this happen is a combination of strategies around ensuring continuity of service, such as caching, redundancy and replication combined with a developer culture of graceful failure. There are many ways to make sure that your solution fails in a way that doesn’t degrade customer experience but the key point is that we have to expect failure of dependencies (both internal and external), assume they will happen and make sure we deal with them appropriately. This will be the subject of an upcoming article.
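The graceful-failure culture described above can be illustrated with the courier example from earlier. This is a hedged sketch, not a production pattern: all the function names are hypothetical, and a real system would use a proper retry queue and circuit breaker rather than an in-memory list:

```python
retry_queue = []  # stand-in for a durable message queue

def ship_order(order, courier_api, fallback_courier=None):
    """Attempt the preferred courier; degrade gracefully instead of rejecting the order."""
    try:
        return courier_api(order)
    except ConnectionError:
        if fallback_courier is not None:
            return fallback_courier(order)
        # No alternative available: accept the order anyway and retry shipping later,
        # rather than failing the customer's transaction.
        retry_queue.append(order)
        return {"status": "accepted", "shipping": "pending"}

def unavailable_courier(order):
    raise ConnectionError("courier API down")

def backup_courier(order):
    return {"status": "accepted", "shipping": "backup"}

result = ship_order({"id": 1}, unavailable_courier, backup_courier)
deferred = ship_order({"id": 2}, unavailable_courier)
```

The key design choice is that every failure path still returns an acceptable outcome to the customer; the dependency failure is absorbed, not propagated.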
If we assume that our architecture has become cloud native and we assume that we have evolved to a place where we are able to take advantage of the full potential offered by our chosen cloud provider then it is fair to assume that we are managing a large number of services or applications that are deployed across a large number of different environments. As we’ve seen, failures are not only likely but inevitable and when we accept that failures will happen we have to focus our attention on restoring service as quickly as possible and then designing out the cause of the failure.
In order to diagnose faults quickly and in order to analyse the sequence of events that led to a fault it is vital to have effective instrumentation around the whole of our solution. It will not be possible to manage the sheer number of constantly evolving assets without sophisticated and targeted monitoring, recording and visualisations of all of the relevant data. This means that an essential prerequisite for Cloud Native architecture is the existence of a solid framework to collate and visualise any relevant monitoring information and an understanding on the part of delivery teams around their responsibilities to consistently and effectively instrument their application and services.
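The instrumentation responsibility on delivery teams often starts with something as simple as emitting structured, machine-readable log events rather than free text, so a central collector can index and query them. A minimal sketch using Python’s standard logging module (the service and field names are invented; the stream here stands in for a log shipper):

```python
import io
import json
import logging

# In production this handler would ship to a central collector; a StringIO
# stream stands in for it here.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
logger = logging.getLogger("orders-service")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

def log_event(event, **fields):
    """Emit one structured (JSON) log line per event, with consistent fields."""
    logger.info(json.dumps({"service": "orders-service", "event": event, **fields}))

log_event("order_created", order_id=42, latency_ms=17)
record = json.loads(stream.getvalue())
```

Because every line carries the same core fields, dashboards and alerts can be built per service, per event and per latency band without teams having to agree on anything beyond the event schema.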
In addition to the logging and observability solutions offered by the major cloud providers, there are several vendors, such as logz.io, that provide such tooling. Monitoring and alerting strategies are a subject in themselves which will be discussed in a future article.
Breaking the Dev-Infrastructure Walls
It will be very difficult to take full advantage of the Cloud, even if you meet all of the recommended design choices, if you maintain a separation between developer concerns and operations concerns. We need to make sure that it is possible for delivery teams to build and run things without reliance on a separate team. This means owning the code, the deployment pipeline and the infrastructure. Note that it doesn’t mean owning the infrastructure accounts, just the piece of the infrastructure that is needed to host the deployable units that your team owns. It certainly doesn’t, for example, mean development teams should own cross cutting concerns, such as SSO infrastructure.
There is still a case to have an operations team that owns functionality used by all, such as the monitoring infrastructure. There is also a good case, if the size of the organisation is appropriate, for an “infrastructure team” that acts as expert internal consultants and enablers of value to the pure development teams. There are many ways to ensure that your infrastructure professionals add value to your business and we don’t cover those strategies here. The key point is that each delivery team must be able to deliver value into its value stream without direct input from another team.
We will be writing a forthcoming article on how to approach the adoption of DevOps engineering practices within your organisation.
Our latest podcast, where we talk about the challenges and opportunities of platform engineering and DevOps, is now available to listen to.
Cloud Native Vendor Frameworks
The large vendors have their own ideas of what Cloud Native means but in reality, this usually means optimising your architecture to take advantage of the tools that are specific to that cloud provider. There is value in this approach and it may well help you get the most out of your chosen provider, but by the very nature of optimising your architecture for a specific provider, you risk being locked into that vendor.
So whilst Cloud Native may mean something slightly different on AWS than it does on, say, GCP, there are only a limited number of use cases where the high level considerations of being cloud native would differ enough to change what Cloud Native means for your solution. Differences between cloud providers are a subject that will be covered in a future article.
Cloud Native means a lot of different things to a lot of different people. I believe it can be summarised succinctly as:
Cloud Native is the practice of designing and evolving a solution to take full advantage of commoditised capabilities of cloud computing, thus enabling your delivery focused teams to concentrate only on those aspects of the solution that differentiate your business.
So whilst different cloud providers, consultants and product companies will have their own definition of what differentiates their business which will lead to their own tailored definition of “cloud native”, this more general definition of Cloud Native allows for the nuances in definition and should help you determine your own definition.
So Why Should I Care?
The answer is very simple. If you want to get the full benefits of the cloud, if you want your developers to concentrate on differentiating your business, and if you want to be confident that your business is well set up to deal with the inevitable and unforeseeable change ahead, you NEED to care about Cloud Native.