It’s quick and easy to deploy an application to Microsoft Azure and make it available for anyone in the world to use. It’s even quicker if you use Platform as a Service (PaaS) offerings such as Azure App Service (Web Apps, API Apps, Logic Apps, etc.), Azure SQL Database, and Azure Cosmos DB. However, it can be trickier to make that application resilient to failure, specifically regional failure. How do you design an application to be truly globally resilient? What if a specific data center or region goes down? Will your application stay up and keep your users productive?
You can add high availability by increasing the number of instances, but that only applies to a single region. You could implement failover, but does that offer the best experience for your users? This article goes through many of the tips and techniques that can be used within Microsoft Azure to build truly globally resilient applications.
Deploying to Azure App Service
Azure App Service makes it easy to deploy and host your applications using Platform as a Service (PaaS) services built on fully managed Virtual Machines (VMs). This means that you no longer need to worry about managing Operating System (OS) updates and patches, or even installing and updating the framework runtimes for .NET, Java, PHP, etc. Azure App Service also simplifies deployment to Dev, Test, and Production environments. It even includes manual and autoscaling features to help you handle the scalability of your applications.
Azure App Service is a really great PaaS offering, but it can’t stand alone when it comes to Global Availability and Resiliency.
On its own, Azure App Service still falls short of true high availability and global resiliency. You achieve global scale, resiliency, and very high availability by combining Azure App Service with the Azure Traffic Manager load balancer and with data services that supply the rest of the global resiliency stack.
Achieving Application Global Availability
Achieving Global Availability of your applications starts with any user anywhere in the world being able to access your application. That alone can be done with a single Azure App Service Web App instance. However, being reachable from anywhere in the world is not enough. Latency, loading time, download speed, and disaster recovery / failover, to name just a few concerns, also need to be addressed.
Two main services allow an application to achieve a much higher level of global availability. Both are truly global services within Azure, and each offers a different capability; used together, they enable highly available applications to be built at a global scale. These services are:
- Azure Traffic Manager
- Azure CDN
Using Azure Traffic Manager with Azure App Service
Azure Traffic Manager is a DNS-based load balancer. It directs client traffic to a specific application instance in an Azure Region by answering the DNS lookup of the domain name (like opsgility.com) so that it resolves to the IP Address of the specific app instance that should handle the request. Because it works at the DNS level, Azure Traffic Manager is not a proxy server, so application traffic does not pass through it and it adds no real performance overhead. It can actually help you greatly improve the performance of your globally distributed, globally available applications.
When configured, your domain name (like opsgility.com) would be set up to point to the DNS domain name of the Azure Traffic Manager profile. Then Azure Traffic Manager would be configured to load balance instances of your application across multiple Azure Regions around the world. You would place application instances in App Service as close to your employees or users as possible, spread across 2 or more Azure Regions as necessary.
For example, with the Performance routing method in Azure Traffic Manager, requests from users located in North America would get directed to your application instance hosted in the Azure East US Region, and requests from users located in Europe would get directed to your application instance in the Azure North Europe Region.
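To make the routing decision concrete, here is a minimal sketch of the idea in Python. It is only an illustration, not how Traffic Manager is implemented: the hostnames and the latency figures are made-up assumptions, and the real service relies on its own internet latency measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str    # hypothetical hostname of an App Service instance
    region: str  # Azure Region hosting that instance


# Hypothetical latency table (milliseconds) from a client geography to each region.
# These numbers are invented for illustration only.
LATENCY_MS = {
    "North America": {"East US": 30, "North Europe": 110},
    "Europe": {"East US": 105, "North Europe": 25},
}

ENDPOINTS = [
    Endpoint("myapp-eastus.azurewebsites.net", "East US"),
    Endpoint("myapp-northeurope.azurewebsites.net", "North Europe"),
]


def resolve(client_geo: str, endpoints: list[Endpoint]) -> Endpoint:
    """Return the endpoint with the lowest latency for the client's geography."""
    return min(endpoints, key=lambda e: LATENCY_MS[client_geo][e.region])


if __name__ == "__main__":
    for geo in LATENCY_MS:
        print(geo, "->", resolve(geo, ENDPOINTS).name)
```

With this table, a North American client is answered with the East US instance and a European client with the North Europe instance, which mirrors the example above.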
In addition to spreading the traffic out across the application instances in different Azure Regions, Azure Traffic Manager will also monitor the health of the instances. This allows Traffic Manager to automatically remove unhealthy instances from the pool and stop directing traffic to them until they become healthy again. As a result, when the Azure East US Region is down, the application does not become unavailable; the traffic from those users is simply directed to the Azure North Europe Region automatically instead.
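The health-check behavior can be sketched the same way. In this hypothetical Python example the probe is a plain function that simulates an outage; a real probe would be a request against a monitoring path on each instance. The region preference lists are assumptions made for the illustration.

```python
from typing import Callable, Optional

# Hypothetical preference order per client geography (closest region first).
PREFERRED = {
    "North America": ["East US", "North Europe"],
    "Europe": ["North Europe", "East US"],
}


def healthy_regions(regions: list[str], probe: Callable[[str], bool]) -> list[str]:
    """Run the health probe against each region and keep the ones that pass."""
    return [r for r in regions if probe(r)]


def route(client_geo: str, probe: Callable[[str], bool]) -> Optional[str]:
    """Pick the closest region that currently passes its health probe."""
    candidates = healthy_regions(PREFERRED[client_geo], probe)
    return candidates[0] if candidates else None


if __name__ == "__main__":
    down = {"East US"}                       # simulate a regional outage
    probe = lambda region: region not in down
    print(route("North America", probe))     # North Europe
    down.clear()                             # the region recovers
    print(route("North America", probe))     # East US
```

Once the simulated region recovers, it passes the probe again and traffic returns to it without any manual step, which is the behavior described above.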
Using Azure CDN with Azure App Service
A Content Delivery Network (CDN) offers the capability to host cached copies of static content at multiple locations around the world, then serve that content with lower network latency from the location closest to each user. The Azure CDN service is a Platform as a Service (PaaS) service that offers these CDN capabilities within the Microsoft Azure cloud.
The Azure CDN service is a global service that uses many CDN edge locations around the world to serve static content with the lowest latency possible. In fact, the Azure CDN locations are NOT simply all the Azure Regions; there are actually more CDN edge locations around the world than there are Azure Regions (at the time of writing this). As a result, there is likely a CDN edge location closer to your users or employees than the primary Azure Region where your application is hosted.
When your application’s users and clients are distributed globally, a CDN edge location will very likely be closer to any given user than your primary Azure Region, or even your secondary Azure Regions when you’re using Azure Traffic Manager.
An additional benefit of using Azure CDN to serve up static content for your applications is that it will offload the serving of that content from your application instances to the Azure CDN service. This will mean a decrease in the amount of load your application instances will need to handle in order to service requests. In many cases this can mean an increase in performance and overall capacity of those application instances to handle requests.
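The offloading effect is easy to see in a small simulation. The classes below are hypothetical stand-ins written for this article, not part of any Azure SDK, and they ignore cache expiry (TTL) for simplicity.

```python
class Origin:
    """Stands in for the App Service instance that owns the static files."""

    def __init__(self, files: dict[str, bytes]):
        self.files = files
        self.requests_served = 0

    def get(self, path: str) -> bytes:
        self.requests_served += 1
        return self.files[path]


class EdgeCache:
    """Stands in for one CDN edge location: serves from cache, falls back to origin."""

    def __init__(self, origin: Origin):
        self.origin = origin
        self.cache: dict[str, bytes] = {}

    def get(self, path: str) -> bytes:
        if path not in self.cache:           # cache miss: one trip to the origin
            self.cache[path] = self.origin.get(path)
        return self.cache[path]              # cache hit: origin is not contacted


if __name__ == "__main__":
    origin = Origin({"/css/site.css": b"body { margin: 0; }"})
    edge = EdgeCache(origin)
    for _ in range(1000):
        edge.get("/css/site.css")
    print(origin.requests_served)  # 1
```

A thousand client requests reach the edge, but the origin serves the file only once; every later request is answered from the edge cache.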
Azure App Service + Traffic Manager + Azure CDN
The benefits of Azure Traffic Manager and Azure CDN listed above sound great, but what does it all look like put together? To better visualize how these services fit into the overall architecture of an application, here’s a simple diagram that lays out how they can be used together.
Achieving Data Global Availability
Designing the globally resilient hosting infrastructure as outlined previously is a great start; however, it still doesn’t address the Data needs of the system. Achieving data global availability and resiliency isn’t quite as straightforward as the front-end application piece. How exactly can you achieve the same global availability and resiliency on the Database level?
Traditionally, you will have a single database server host your database. This could be SQL Server or Oracle on-premises for example, or even Azure SQL Database in the Microsoft Azure cloud. Scaling this single database instance generally involves just adding additional capacity to the server in the form of CPU / RAM / HDD on-premises, or adding additional DTUs in Azure SQL Database. However, this vertical scaling by just “adding more power” has a finite limit of scalability. It also doesn’t solve any redundancy or global availability needs. A single database instance is a single point of failure and a huge liability.
In the Microsoft Azure cloud, the best database options are to use PaaS services. IaaS can be used, but then you have a huge array of responsibilities to manage yourself, from the VM, to the Operating System, including updates and patches, and the database software too! With Azure PaaS services, the platform manages all that underlying infrastructure work for you. This enables you to focus solely on managing your data, access, and backup / geo-redundancy configurations.
The 2 database services within Azure that offer the best global availability support are:
- Azure SQL Database
- Azure Cosmos DB (formerly DocumentDB)
Global Availability with SQL Database
With Azure SQL Database, you can host your database using a “relational database as a service”. This offers a fully managed VM, with additional scaling capabilities and other features built into the platform. Compared to an on-premises SQL Server or SQL Server hosted within a Virtual Machine (VM), Azure SQL Database is the best database option to choose.
FYI, Microsoft recently released MySQL and PostgreSQL as a service database offerings within Azure. However, only time will tell whether those services will be as robust and feature-rich as Azure SQL Database has become.
With Azure SQL Database, you have the option to configure geo-replication or geo-redundancy of your database. You can do this for up to 4 additional copies. These 4 additional copies will be read-only, while your primary database instance will be writable.
This helps with implementing a proper failover strategy. Basically, if the primary database goes down for some reason (regional outage, service disruption, etc.) then you can fail over to one of the secondaries to make that the new primary. However, this process is NOT automatic. You need to manually fail over your Azure SQL Database when necessary.
While you can implement automatic failover of your applications, as shown previously with App Service using Traffic Manager, your Azure SQL Database failover needs to be performed by you. It’s not an automatic feature within the Azure SQL Database platform.
The reasons Azure SQL Database failover is not automatic include the dependency on you reconfiguring your application(s) to connect to the new Primary database. Each Primary and Secondary database has its own unique endpoint, login, and connection string information. This needs to be reconfigured, either manually or with automated scripts, in order to perform a failover.
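The application-side part of that reconfiguration could be scripted along these lines. This Python sketch only tracks which server the application should write to; the server names are hypothetical, credentials are left out, and the promotion of the secondary itself happens in Azure and is not shown. It also assumes the old primary rejoins as a secondary.

```python
from dataclasses import dataclass, field


@dataclass
class DatabaseTopology:
    database: str
    primary: str                                   # logical server name of the writable copy
    secondaries: list[str] = field(default_factory=list)

    def connection_string(self, server: str) -> str:
        # Credentials are omitted on purpose; each server has its own login details.
        return f"Server=tcp:{server}.database.windows.net,1433;Database={self.database};"

    def write_connection(self) -> str:
        """Connection string the application should use for writes."""
        return self.connection_string(self.primary)

    def promote(self, new_primary: str) -> None:
        """Record that new_primary was promoted; the old primary becomes a secondary."""
        if new_primary not in self.secondaries:
            raise ValueError(f"{new_primary} is not a configured secondary")
        self.secondaries.remove(new_primary)
        self.secondaries.append(self.primary)
        self.primary = new_primary


if __name__ == "__main__":
    topology = DatabaseTopology("appdb", "myapp-sql-eastus", ["myapp-sql-northeurope"])
    topology.promote("myapp-sql-northeurope")
    print(topology.write_connection())
```

Keeping the topology in one place means every application instance can pick up the new write connection from the same source after a failover, rather than having connection strings edited by hand in several places.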
Fun Fact: Azure SQL Database is not the same SQL Server engine you run on-premises or in a VM. Azure SQL Database is a different SQL Database engine built for the cloud from the ground up. It is also hosted using Azure Service Fabric for the underlying managed infrastructure.
Even though the Azure SQL Database replicas are read-only, you can still use them with your various application instances across the globe. You basically just need to set up / code your system to use the nearest Secondary SQL Database for read operations (queries, lookups, etc.), then connect to the Primary for all SQL Database writes. This way your application instances will mostly remain functional if the Primary database goes down, and will degrade their functionality gracefully. Once you have performed a failover to promote a Secondary to be the new Primary and reconfigured your application instances accordingly, your application will be back at 100% capacity from a functionality perspective.
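One way to picture this read/write split is a thin data-access layer per application instance. The Python sketch below is hypothetical: plain dictionaries stand in for the Primary and the nearest Secondary, and replication is treated as instantaneous to keep the example short.

```python
class ReadOnlyModeError(Exception):
    """Raised when a write is attempted while the primary database is unavailable."""


class RegionalDataAccess:
    """Routes reads to the nearest replica and writes to the primary."""

    def __init__(self, primary: dict, nearest_replica: dict):
        self.primary = primary                    # stands in for the writable Primary
        self.nearest_replica = nearest_replica    # stands in for the local read-only Secondary
        self.primary_available = True

    def read(self, key: str):
        # Reads stay in the local region, so they keep working during a Primary outage.
        return self.nearest_replica.get(key)

    def write(self, key: str, value) -> None:
        if not self.primary_available:
            raise ReadOnlyModeError("primary is down; the application is read-only")
        self.primary[key] = value
        # Simplification: replication to the Secondary is treated as instantaneous.
        self.nearest_replica[key] = value


if __name__ == "__main__":
    data = RegionalDataAccess(primary={}, nearest_replica={})
    data.write("order-1", "shipped")
    data.primary_available = False          # simulate the Primary going down
    print(data.read("order-1"))             # reads still succeed
```

While the Primary is unavailable, the application can catch `ReadOnlyModeError` and show a read-only experience instead of failing outright, which is the graceful degradation described above.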
Global Availability with Cosmos DB
Azure Cosmos DB (formerly named DocumentDB) is a truly globally available NoSQL database as a service. When you first provision Azure Cosmos DB, your data is stored in a single Azure Region with no redundancy. However, you can easily configure multi-region geo-replication. Additionally, Cosmos DB geo-replication is implemented differently from Azure SQL Database: you have a single Cosmos DB endpoint URL / Domain Name to connect to, and the platform handles automatic redirection of reads and writes to the appropriate region without the need to manually fail over.
Azure Cosmos DB is a globally distributed NoSQL database as a service built to natively run in the cloud.
With Azure Cosmos DB you can configure any number of Secondary regions and the service will automatically handle replicating your data out to each of those locations. The Cosmos DB service will also handle automatic failover in the event that your Primary region goes down.
The way that Cosmos DB handles the Primary and Secondary regions is that the “Primary” region is the only Writable region, and the Secondaries are Read-Only. This works basically the same as geo-replication with Azure SQL Database. One of the big differences though is the fact that Cosmos DB will automatically handle failover for you.
The automatic failover of Azure Cosmos DB is really enabled by the fact that the data replication between the instances works on an Eventually Consistent model. Once the data is written to the Writable / Primary region, it will then be replicated asynchronously to the Secondary / Read-only regions. In terms of consistency, the previously written data will be eventually consistent across all of the configured regions.
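The eventually consistent behavior described here can be illustrated with a small simulation. This Python sketch is not the Cosmos DB replication protocol; it is a hypothetical model in which writes are acknowledged by the Writable region immediately and copied to the Read-only regions later, when `replicate()` runs.

```python
from collections import deque


class ReplicatedStore:
    """One writable region plus read-only regions fed by an asynchronous queue."""

    def __init__(self, write_region: str, read_regions: list[str]):
        self.write_region = write_region
        self.data = {region: {} for region in [write_region, *read_regions]}
        self.pending = deque()   # writes accepted by the primary but not yet replicated

    def write(self, key: str, value) -> None:
        self.data[self.write_region][key] = value   # acknowledged immediately
        self.pending.append((key, value))

    def read(self, region: str, key: str):
        return self.data[region].get(key)

    def replicate(self) -> None:
        """Drain the queue, applying each write to every read-only region in order."""
        while self.pending:
            key, value = self.pending.popleft()
            for region, store in self.data.items():
                if region != self.write_region:
                    store[key] = value


if __name__ == "__main__":
    store = ReplicatedStore("East US", ["North Europe"])
    store.write("profile", "v1")
    print(store.read("North Europe", "profile"))  # None: not replicated yet
    store.replicate()
    print(store.read("North Europe", "profile"))  # v1
```

Between the write and the replication step, a reader in a Secondary region sees older data; after replication, every region returns the same value.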
With Azure Cosmos DB, an application that is geo-distributed across multiple Azure Regions and load balanced with Azure Traffic Manager only needs to be configured with a single database connection string pointing to the Azure Cosmos DB endpoint. The Cosmos DB service then handles routing requests across the Writable and Read-only instances of Cosmos DB spread across the chosen regions.
Fun Fact: Azure Cosmos DB is built with a micro-services architecture using Azure Service Fabric in the underlying infrastructure that is managed for you as the PaaS service offering.
Achieving Full Stack Global Availability and Resiliency
Truly globally available and resilient systems can be built by combining the previously mentioned methods and techniques. High availability and resiliency can be achieved from the front-end and API tiers of an application, all the way down to the database level. The system will then be protected adequately from isolated and regional service disruptions or outages.
Designing appropriately for the cloud means designing applications and systems on a global scale. The Microsoft Azure cloud also makes this easier and far less cost prohibitive than ever before. Budgets will go much further, and even small teams or organizations can achieve much higher levels of overall service than was possible only a few short years ago. This is all thanks to the Microsoft Azure cloud and all the PaaS services and global scale that it offers.
To finish this article off, here’s a diagram that shows many of the components above put together into a single system that is truly globally available, highly available, and globally resilient against failure, service outages, and even regional outages.