An application-consistent backup is a must in data-driven environments where data integrity, application consistency and uptime are all essential for the business. Microsoft Azure now adds Linux to the list of operating systems supporting this feature, which is especially interesting for relational databases such as MySQL.
This new feature makes it much easier to automate critical tasks such as deploying a new slave node, restoring databases and tables from dumps and raw files, and even rolling back to a precise replication position by recovering specific binary logs. Furthermore, the good performance of Azure networking reduces the time required to deal with replication delay issues.
All this empowers database administrators to take back control, makes their backups more reliable and represents a significant advantage over simple disk snapshots.
This positive experience was one more opportunity to collaborate with Microsoft and learn first-hand how Azure works behind the scenes. It is already proving fruitful, since this knowledge is now part of our procedures and thus of the value we add to our services to help our customers constantly grow with us.
Application-Consistent Backup in Azure Linux VMs
A few days ago I read in one of our *-hacker instant messaging channels:
I have explained just 30 seconds ago a limitation in this feature and now I see they announced this has already been fixed… today!
This happens all the time!
I couldn’t help smiling and thinking “I know that feeling!” because that’s part of our day-to-day here at CAPSiDE.
The Cloud (note the capital letters) is an exciting ecosystem where evolution drives the industry at an incredible pace, resulting in an ever-increasing repertoire of capabilities. These competitive advantages can enable you to react faster, use your resources more efficiently, be more resilient, easily overcome obsolete barriers and go from local to global in minutes. With all these capabilities at your disposal, you just need to keep track of the changes and constantly revamp yourself to take advantage of them. This is continuous improvement at its best.
Sounds challenging? Well, it really is. Luckily, that’s hard-coded in CAPSiDE’s DNA and we have an excellent relationship with the main actors. And among them, undoubtedly, Microsoft Azure stands out.
As a Microsoft Gold Cloud Platform partner and Azure Infrastructure & Datacenter Transformation Partner of the Year – for the second year in a row – we collaborate with Microsoft in different ways. A very special one is testing new features in their preview stage: a great opportunity for us to better understand how things work behind the scenes and, most importantly, to help our customers benefit from new features in record time.
When we were offered the opportunity to test a new feature to perform application-consistent backups in Azure Linux VMs, we thought it could be fun and jumped in with both feet. It’s no secret we love Linux (and we’re not the only ones!).
First things first: what’s an application-consistent backup in Azure? In this case, to put it simply, it can be defined as a restore point that is created without having to shut down the VM, that preserves data integrity (there’s no data corruption or loss) and that specifically enables the application using the data to resume from a consistent state.
If the latter premise is not satisfied, the backup is downgraded to mere file-system consistency.
Application-consistent backups are particularly important for applications that temporarily keep the data in memory or that require transaction or I/O operation consistency.
This was already available for Windows VMs, which follow the VSS workflow when the backup extension is invoked.
With the new feature for Linux VMs, the backup is also triggered by an extension, but in this case it can be configured to execute some actions before (pre actions) and after (post actions) the snapshot actually takes place. Additional parameters are available, such as whether the backup should continue if the pre actions fail, or the timeout and number of retries for each action. Going a bit deeper, an fsfreeze is executed just before the snapshot to prevent any write operations on the disk.
Final details of this feature are provided in Microsoft’s official announcement.
Once we were granted access to the preview stage of this feature, we decided which pieces of software to use under the following conditions:
- Linux distro supported by Azure
- good example where application consistency is a must
- the broadest possible audience
- software versions within their stable branches
- testing limited to standard functionality
Given this, we chose a MySQL 5.7 relational database server running on an Ubuntu 16.04 LTS server and evaluated some potential scenarios for master-slave replication.
We considered some implications of backups in master and slave nodes like:
- consistency vs uptime
- switching to full read-only vs using transaction isolation
- stopping all slave threads vs just stopping the SQL thread while the I/O thread keeps running to minimise catch-up
- cases where significant replication delay could take place due to network latency
- table disk usage optimisations
- log flushing
Finally, we resolved to focus on reproducing a common and well-documented scenario consisting of a single master with one or more slave servers. These slave servers could either be dedicated to backups or be part of a read-only pool.
Once settled, we can move on to the tests.
I’m now going to explain one of the tests performed, trying to keep it as simple and as easy to reproduce as possible.
The goal of this particular test is to demonstrate that using the application-consistent backup from a slave node we can deploy another one that is fully operative: up-to-date and resuming replication.
Note that for clarity’s sake some secondary parts of the process are omitted.
First, we need to set up 3 VMs:
- Deploy 3 Ubuntu 16.04 LTS VMs in the same subnet, let’s call them: node-a, node-b and node-c
- Perform the basic setup on all the nodes: first make sure /bin/sh points to /bin/bash
- And then install MySQL with the basic setup; a sketch of both steps follows this list
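A minimal sketch of that basic setup, assuming a fresh Ubuntu 16.04 image (the non-interactive debconf route is just one way to switch /bin/sh):

```bash
# Point /bin/sh to bash instead of dash (Ubuntu's default shell for /bin/sh)
echo "dash dash/sh boolean false" | sudo debconf-set-selections
sudo dpkg-reconfigure --frontend=noninteractive dash

# Install MySQL 5.7 (the default mysql-server package on Ubuntu 16.04)
sudo apt-get update
sudo apt-get install -y mysql-server
sudo mysql_secure_installation   # optional hardening, interactive
```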
From Azure Portal:
- Enable the backup for all the VMs and set a daily backup policy
- Trigger a backup for all the nodes to make sure the backup extension works properly
Then, we need to establish master-slave replication between node-a and node-b (a command-level sketch of these steps follows the list):
- Create in node-a a test database (sampledb) and a test table (sampletable)
- Configure the node-a as the master including the creation of the replication user:
- Create a backup user to avoid using root and add its credentials to the file _mysql-app-consistent-backup-env.sh_
- Configure node-b as the slave:
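The following is only a sketch of those steps: server-id values, user names, passwords and the binlog coordinates are illustrative assumptions, and the grants are kept broad for brevity.

```bash
# --- node-a (master): illustrative /etc/mysql/mysql.conf.d/mysqld.cnf settings ---
#   server-id    = 1
#   log_bin      = /var/log/mysql/mysql-bin.log
#   bind-address = <node-a private IP>
# (restart MySQL after editing the configuration)

# Create the test schema plus the replication and backup users (names are assumptions)
mysql -u root -p <<'SQL'
CREATE DATABASE sampledb;
CREATE TABLE sampledb.sampletable (id INT AUTO_INCREMENT PRIMARY KEY, word VARCHAR(64));
CREATE USER 'repl'@'%' IDENTIFIED BY 'ReplPassword';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
CREATE USER 'backup'@'localhost' IDENTIFIED BY 'BackupPassword';
GRANT ALL PRIVILEGES ON *.* TO 'backup'@'localhost';  -- tighten in production
FLUSH PRIVILEGES;
SHOW MASTER STATUS;  -- note File and Position for the slave
SQL

# --- node-b (slave): set server-id = 2 in mysqld.cnf, restart, then point it at the master ---
mysql -u root -p <<'SQL'
CHANGE MASTER TO
  MASTER_HOST='node-a',
  MASTER_USER='repl',
  MASTER_PASSWORD='ReplPassword',
  MASTER_LOG_FILE='mysql-bin.000001',  -- from SHOW MASTER STATUS
  MASTER_LOG_POS=154;                  -- from SHOW MASTER STATUS
START SLAVE;
SQL
```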
Once we have the replication up and running, we need to be able to test it.
We don’t need a complex dataset here, but we do need it quickly, so here’s a straightforward and fun proposal: let’s check the dictionary 😉
Install an Ubuntu system dictionary package:
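For instance (wamerican is just one choice; any of Ubuntu’s word-list packages will do):

```bash
sudo apt-get install -y wamerican   # provides /usr/share/dict/american-english
```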
and write two scripts: one to insert random words from the dictionary in a given table and another one to print the latest words inserted with their automatically assigned IDs. This will be useful to easily grow the size of the database and check the replication progress.
The example in pseudo code:
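A minimal sketch of both scripts, assuming the sampledb.sampletable schema above and MySQL client credentials available via ~/.my.cnf (both assumptions):

```bash
#!/bin/bash
# insert-words.sh - keep inserting random dictionary words into sampledb.sampletable
DICT=/usr/share/dict/american-english
while true; do
    word=$(shuf -n 1 "$DICT")
    mysql sampledb -e "INSERT INTO sampletable (word) VALUES ('${word//\'/}');"
    sleep 1
done
```

```bash
#!/bin/bash
# latest-words.sh - print the most recently inserted words with their auto-assigned IDs
mysql sampledb -e "SELECT id, word FROM sampletable ORDER BY id DESC LIMIT 10;"
```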
Anyway, this is only a proposal and you can use any dataset that suits this purpose. The bigger, the better.
Leave the script running for a while; the replication can then be checked from node-b with SHOW SLAVE STATUS. Values such as Slave_IO_Running, Slave_SQL_Running and Seconds_Behind_Master deserve special attention.
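For instance (again assuming the MySQL client can authenticate, e.g. via ~/.my.cnf):

```bash
mysql -e "SHOW SLAVE STATUS\G" | grep -E \
  'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Relay_Log_File|Exec_Master_Log_Pos'
```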
The command’s documentation has plenty of details.
The way this feature has been designed grants you a great amount of freedom. The upside is that you can define a workflow perfectly tailored for your needs. The downside is that you are responsible for making it work out. Situations like these are great opportunities to automate your workflow.
As a multinational company, capsidians speak different languages (yes, memes are a language too!) but our lingua franca is Perl. Nevertheless, at the moment this extension supports Python and Bash scripting. We opted for good old classic Bash; sorry, Python.
We created one wrapper script for each of the two actions. Their only purpose is to call the main script with the pre or post parameter. This helps keep the plugin configuration clean.
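The wrappers can be as small as this (file names and the /opt/capside/azure path are used here only for illustration):

```bash
#!/bin/bash
# mysql-app-consistent-backup-pre.sh (illustrative name)
/opt/capside/azure/mysql-app-consistent-backup.sh pre
```

```bash
#!/bin/bash
# mysql-app-consistent-backup-post.sh (illustrative name)
/opt/capside/azure/mysql-app-consistent-backup.sh post
```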
Also, two helper files were created: one with the variables and another one with the functions. They only come to life when invoked by the main script.
Finally, “the main script” (TM). Its pith resides in how the workflow varies depending on the action (pre or post) and the server role (master or slave).
Example in pseudo code:
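This is a structural sketch only: the helper function names are assumptions, and the real, complete scripts live in the repository linked further down.

```bash
#!/bin/bash
# mysql-app-consistent-backup.sh (illustrative sketch, not the published script)
# Usage: mysql-app-consistent-backup.sh pre|post
set -euo pipefail

source /opt/capside/azure/mysql-app-consistent-backup-env.sh   # variables: credentials, paths
source /opt/capside/azure/mysql-app-consistent-backup-lib.sh   # helper functions (assumed name)

ACTION="$1"                 # pre or post, passed by the wrapper scripts
ROLE=$(get_server_role)     # helper deciding whether this node is a master or a slave

case "${ROLE}:${ACTION}" in
  slave:pre)
    stop_slave_threads
    set_read_only
    dump_all_databases
    flush_relay_logs
    stop_mysql_service
    ;;
  slave:post)
    start_mysql_service
    set_read_write
    start_slave_threads
    ;;
  master:pre)
    flush_logs              # the master workflow depends on the uptime trade-offs discussed above
    ;;
  master:post)
    : # nothing to undo on the master in this sketch
    ;;
esac
```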
Here’s where the work we did considering the different scenarios has to be put to the test.
For example, if you have a huge archive table that will be periodically rotated or truncated, you might want to schedule some OPTIMIZE TABLE commands on the master node to save disk space… but how would this impact performance? And uptime? Do you want to rely on master backups? On slave backups? On both? Considering different flush operations can also be an option.
As with most things in technology, things can get complicated.
We focused on a scenario where replication delay won’t be significant, keeping the I/O threads alive won’t be worth it to speed up the catch-up, and slave nodes are either dedicated to backups or used as a pool for read-only requests. The master node must aim for maximum uptime, but slave nodes can tolerate brief service downtimes to obtain raw backups.
An example of the sequence for the backup of a slave node could be:
- Stop both slave threads
- Set as read-only
- Dump all tables from all databases
- Optional flush operations such as FLUSH RELAY LOGS
- Stop MySQL service
- Azure Backup creates the snapshot
- Start MySQL service
- Set as read-write
- Start both slave threads
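The sequence above could translate into commands along these lines (illustrative only: the dump path is an assumption and the real credentials would come from the env file):

```bash
# pre action on the slave
mysql -e "STOP SLAVE;"
mysql -e "SET GLOBAL read_only = ON;"
mysqldump --all-databases --single-transaction > /var/backups/all-databases.sql
mysql -e "FLUSH RELAY LOGS;"
sudo systemctl stop mysql

# ... the Azure backup extension takes the snapshot here ...

# post action on the slave
sudo systemctl start mysql
mysql -e "SET GLOBAL read_only = OFF;"
mysql -e "START SLAVE;"
```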
You can take a look at the code of the scripts in Microsoft’s MicrosoftAzureBackup/MySQL GitHub repository. Please bear in mind that all the linked scripts were written and published for testing purposes; you are solely responsible for determining the appropriateness of using or redistributing them, and you assume any associated risks.
The scripts should be placed under a new /opt/capside/azure folder. Carefully review them and apply any customisation, especially to the variables defined in mysql-app-consistent-backup-env.sh.
Make sure only root has access to this folder:
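For example:

```bash
sudo chown -R root:root /opt/capside/azure
sudo chmod -R 700 /opt/capside/azure
```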
Edit the plugin configuration to use the right scripts and, again, make sure only root has access to the file to prevent failure in the execution:
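As an illustration only: the file name and keys below are taken from the documentation of the released version of this feature, and the preview-stage format we tested may have differed. The script paths are the ones used above.

```bash
# Illustrative plugin configuration; keep it readable by root only.
sudo tee /etc/azure/VMSnapshotScriptPluginConfig.json > /dev/null <<'EOF'
{
    "pluginName" : "ScriptRunner",
    "preScriptLocation" : "/opt/capside/azure/mysql-app-consistent-backup-pre.sh",
    "postScriptLocation" : "/opt/capside/azure/mysql-app-consistent-backup-post.sh",
    "preScriptParams" : ["", ""],
    "postScriptParams" : ["", ""],
    "preScriptNoOfRetries" : 0,
    "postScriptNoOfRetries" : 0,
    "timeoutInSeconds" : 30,
    "continueBackupOnFailure" : false,
    "fsFreezeEnabled" : true
}
EOF
sudo chown root:root /etc/azure/VMSnapshotScriptPluginConfig.json
sudo chmod 600 /etc/azure/VMSnapshotScriptPluginConfig.json
```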
As previously stated, final details of this feature, and specifically of the plugin configuration, will be provided in Microsoft’s official announcement.
You can either wait for the daily backup policy to trigger a backup of slave node-b or force it from the Azure Portal.
Then check the output of the log created by the script. As you can see there, it takes only a few seconds for the post actions to start once the pre actions have finished.
Deployment of a new slave node
At this point, we have a two-node MySQL master-slave replication system working properly and with backups being performed on a daily basis. Let’s add a new slave node.
From the Azure Portal, start the File Recovery of VM node-b, copy the recovery bash script to VM node-c via scp, and execute it there to mount the new disk.
Make sure the MySQL service is stopped on VM node-c, because we are going to mangle its configuration and data files:
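```bash
sudo systemctl stop mysql
sudo systemctl status mysql --no-pager   # confirm it is inactive
```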
Synchronize MySQL configuration and data folders:
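Something like the following, where the recovery mount point is an assumption (use whatever path the Azure file-recovery script reports):

```bash
RECOVERY_ROOT=/mnt/azure-recovery   # assumption: where the recovered volume is mounted

sudo rsync -a --delete "${RECOVERY_ROOT}/etc/mysql/"     /etc/mysql/
sudo rsync -a --delete "${RECOVERY_ROOT}/var/lib/mysql/" /var/lib/mysql/
```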
Now, adjust the mysqld.cnf file parameters. This can easily be automated by increasing the value of server-id and retrieving the private IP address from the network interface with commands like ifconfig:
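A sketch of that adjustment, assuming node-b used server-id = 2 (so node-c takes 3) and the default Ubuntu location of mysqld.cnf:

```bash
CNF=/etc/mysql/mysql.conf.d/mysqld.cnf
PRIVATE_IP=$(hostname -I | awk '{print $1}')   # or parse ifconfig, as mentioned above

sudo sed -i 's/^server-id.*/server-id = 3/'                   "$CNF"
sudo sed -i "s/^bind-address.*/bind-address = ${PRIVATE_IP}/" "$CNF"
```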
Link the relay files to avoid hostname conflicts (see the value of Relay_Log_File in the output of the SHOW SLAVE STATUS command in case of doubts):
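A sketch, assuming the default relay log naming of <hostname>-relay-bin.* (check Relay_Log_File if yours differs):

```bash
# Restored relay logs carry node-b's hostname; link them under node-c's hostname
sudo bash -c 'cd /var/lib/mysql && for f in node-b-relay-bin.*; do ln -s "$f" "${f/node-b/node-c}"; done'
```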
In order to generate a new UUID we must delete the file auto.cnf:
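```bash
sudo rm /var/lib/mysql/auto.cnf   # MySQL regenerates it with a fresh server_uuid on start
```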
Otherwise, the replication of node-b will be stopped due to a duplicated server_uuid.
Start the MySQL service again on VM node-c:
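```bash
sudo systemctl start mysql
mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'
```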
And verify it resumes replication and catches up, as expected. Voilà!
The default location for the binary logs is /var/log/mysql. Properly defining a policy for archiving and rotating (or purging) the master binlogs can be very helpful in order to enable a more granular recovery with the mysqlbinlog utility.
You can also check our training courses schedule to catalyse your transformation into an Azure pro.
We focus on designing, automating and operating 24/7 critical digital platforms for a wide range of world-class customers. Don’t hesitate to contact us to learn about some of the ways we can help your business harness the possibilities of the Cloud, or even to join our team!