An Application-consistent backup is a must when working in certain data-driven environments where both data integrity, application consistency and uptime are essential for the business. Microsoft Azure adds Linux to the list of OS supporting this feature that is especially interesting for relational databases such as MySQL.

This new feature makes much easier to automate critical tasks like deploying a new slave node, restoring databases and tables from dumps and raw files and even go back to a precise replication position recovering specific binary logs. Furthermore, the good performance of Azure networking reduces the time required to deal with replication delay issues.

All this empowers database administrators to take back control, makes more reliable their backups and means a significant advantage over simple disk snapshots.

This positive experience was one more opportunity to collaborate with Microsoft, learn first-hand more about how Azure works offstage and is proving itself to be already fruitful since this knowledge is now part of our procedures and thus of the value we add to our services to help our customers constantly grow with us.

Application-Consistent backup in Azure Linux VMS

A few days ago I read in one of our *-hacker instant messaging channels:

I have explained just 30 seconds ago a limitation in this feature and now I see they announced this has already been fixed… today!
This happens all the time!

I couldn’t help smiling and thinking “I know that feeling!” because that’s part our day-to-day here at CAPSiDE.

The Cloud (note the capital letters) is an exciting ecosystem where evolution drives the industry at an incredible pace resulting in an ever-increasing repertoire of capabilities. These competitive advantages can enable you to react faster, to use your resources more efficiently, to be more resilient and to easily overcome obsolete barriers and go from local to global in minutes. All these capabilities at your disposal, you just need to be able to keep track of the changes and constantly revamp yourself to take advantage of them. This is a continuous improvement at its best.

Sounds challenging? Well, it really is. Luckily, that’s hard-coded in CAPSiDE’s DNA and we have an excellent relationship with the main actors. And among them, undoubtedly, Microsoft Azure stands out.

As a Microsoft Gold Cloud Platform partner and Azure Infrastructure & Datacenter Transformation Partner of the Year  – for the second year in a row – we collaborate with Microsoft in different ways and a very special one is testing new features in their preview stage, this is a great opportunity for us to better understand how things work behind the scenes and, most important, to help our customers to benefit from new features in a record time.

When we were offered the opportunity of testing a new feature to perform application consistent backups in Azure Linux VMs we thought it could be fun and we jumped in with both feet. It’s no secret we love Linux (and we’re not the only ones!).
application-consistent backup

The feature

First things first: what’s an application-consistent backup in Azure? In this case, to put it simply, it can be defined as a restore point that is created without having to shut down the VM, that preserves data integrity (there’s no data corruption or loss) and that specifically enables the application that uses the data to resume an state of data consistency.

If the latter premise is not satisfied then the backup would be downgraded to just having file-system consistency.

Application-consistent backups are particularly important for applications that temporarily keep the data in memory or that require transaction or I/O operation consistency.

This was already available for Windows VMs following the VSS workflow upon the backup extension’s invocation.

With the new feature for Linux VMs, the backup is also triggered by an extension but in this case, it can be configured to execute some actions before (pre actions) and after (post actions) the snapshot actually takes place. Additional parameters are available, such as if the backup should continue in case of failure in the pre actions or as the timeout and the number of retries for each action. Going a bit deeper, a fsfreeze is executed just before the backup to prevent any write operations in the disk.

/application-consistentbackup

Final details of this feature are provided in Microsoft’s official announcement.

Case scenario

Once we were granted access to the preview stage of this feature, we decided which pieces of software to use, having the following conditions:

Given this, we chose to stick to MySQL 5.7 relational database server running on an Ubuntu 16.04 LTS server and evaluate some potential scenarios for master-slave replication.

We considered some implications of backups in master and slave nodes like:

and finally, we resolved to focus on reproducing a common and well-documented scenario consisting of a single master with one or more slave servers. These slave servers could either be dedicated to backups or be part of a read-only pool.

Once settled, we can move on to the tests.

Tests

I’m now going to explain one of the tests performed, trying to keep it as simple and as easy to reproduce as possible.

The goal of this particular test is to demonstrate that using the application-consistent backup from a slave node we can deploy another one that is fully operative: up-to-date and resuming replication.

Note that for clarity’s sake some secondary parts of the process are omitted.

application-consistent-backup

Requirements

First, we need to setup 3 VMs:

$ sudo update-alternatives --install /bin/sh sh /bin/bash 100

$ sudo update-alternatives --install /bin/sh sh /bin/dash 50

$ sudo update-alternatives --config sh
There are 2 choices for the alternative sh (providing /bin/sh).

Selection Path Priority Status
------------------------------------------------------------
* 0 /bin/bash 100 auto mode
1 /bin/bash 100 manual mode
2 /bin/dash 50 manual mode

Press <enter> to keep the current choice[*], or type selection number:
$ sudo apt-get update && sudo apt-get upgrade
$ sudo apt-get install mysql-server
$ mysql_secure_installation

From Azure Portal:

Then, we need to establish master-slave replication between node-a and node-b:

$ sudo vim /etc/mysql/mysql.conf.d/mysqld.cnf
$ egrep "^\s*(bind-address|server-id|log_bin|binlog_do_db)\s*=" /etc/mysql/mysql.conf.d/mysqld.cnf
bind-address            = <node-a-private-address>
server-id               = 1
log_bin                 = /var/log/mysql/mysql-bin.log
binlog_do_db            = sampledb
$ sudo service mysql restart
$ mysql <dbuser-and-password> -h localhost
mysql> CREATE DATABASE sampledb;
mysql> CREATE USER 'repl'@'%' IDENTIFIED BY '<hello-I-am-a-nice-password>';
mysql> GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
mysql> SHOW MASTER STATUS;
+------------------+----------+--------------+------------------+-------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+--------------+------------------+-------------------+
| mysql-bin.000001 |     1189 |              |                  |                   |
+------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)
mysql> EXIT
$ sudo vim /etc/mysql/mysql.conf.d/mysqld.cnf
$ egrep "^\s*(bind-address|server-id|log_bin|binlog_do_db)\s*=" /etc/mysql/mysql.conf.d/mysqld.cnf
bind-address            = <node-b-private-address>
server-id               = 2
$ sudo service mysql restart
$ mysql <dbuser-and-password> -h localhost
mysql> CHANGE MASTER TO MASTER_HOST='<node-a-private-address>', MASTER_USER='repl', MASTER_PASSWORD='<hello-I-am-a-nice-password>', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=1189;
mysql> START SLAVE;
mysql> EXIT

Once we have the replication up and running, we need to be able to test it.

We don’t need a complex dataset here but we do need it quickly so here’s a straightforward and fun proposal: let’s check the dictionary 😉

Install an Ubuntu system dictionary package:

$ sudo apt-get install wamerican

and write two scripts: one to insert random words from the dictionary in a given table and another one to print the latest words inserted with their automatically assigned IDs. This will be useful to easily grow the size of the database and check the replication progress.

The example in pseudo code:

dictionary_file_lines = count lines of dictionary_file
repeat
  line_number = random % dictionary_file_lines
  line_content = get line line_number from dictionary_file
  insert line_content in table

Anyway, this is only a proposal and you can use any dataset that suits this purpose. The bigger, the better.

Leave the script running for a while, the replication can be checked from node-b:

mysql> SHOW SLAVE STATUS \G;

The following values deserve special attention:

The command’s documentation has plenty of details.

Custom scripts

The way this feature has been designed grants you a great amount of freedom. The upside is that you can define a workflow perfectly tailored for your needs. The downside is that you are responsible for making it work out. Situations like these are great opportunities to automate your workflow.

As a multinational company, capsidians speak different languages (yes, memes are a language too!) but our lingua franca is Perl. Nevertheless, at this moment this extension supports Python and Bash scripting. We opted for the good classic Bash, sorry Python.

We created one wrapper script for each of the two actions. Their only purpose is to call the main script with pre or post parameter. This helps to keep the plugin configuration clean.

Also, two helper files were created: one with the variables and another one with the functions. They only come to life when invoked by the main script.

Finally, “the main script” (TM). Its pith resides in how the workflow varies depending on the action (pre or post) and the server role (master or slave).

Example in pseudo code:

get server_information
load variables and functions
if requirements_are_met
  if action == pre
    if role == master
      run pre_master_actions
    else if role == slave
      run pre_slave_actions
  else if action == post
    if role == master
      run post_master_actions
    else if role == slave
      run post_slave_actions

Here’s where the work we did considering the different scenarios has to be put to the test.

For example, if you have a huge archive table that will be periodically rotated or truncated you might want to schedule some optimize table commands in the master node to save disk space… but, how would this impact the performance? and the uptime? Do you want to rely on master backups? On slave backups? On both? Also considering different flush operations can be an option.

As with most things in technology, things can get complicated.

We focused on a scenario where replication delay won’t be significant, keeping alive the I/O threads won’t be worth to speed up catch-up and slave nodes are either dedicated to backups or are used as a pool for read-only requests. The master node must aim at maximum uptime but slave nodes can tolerate brief service downtimes to have raw backups.

An example of the sequence for the backup of a slave node could be:

You can take a look at the code of the scripts in Microsoft’s MicrosoftAzureBackup/MySQL GitHub repository. Please, bear in mind that all the linked scripts were written and published for testing purposes; you are solely responsible for determining the appropriateness of using or redistributing it and you assume any risks associated.

The scripts should be placed under a new /op/capside/azure folder. Carefully review them and apply any customization specially to the variables defined in mysql-app-consistent-backup-env.sh.

Make sure only root has access to this folder:

$ sudo chmod -R 700 /op/capside
$ sudo chown -R root:root /op/capside

Plugin configuration

Edit the plugin configuration to use the right scripts and, again, make sure only root has access to the file to prevent failure in the execution:

$ sudo vim /etc/azure/PluginConfig.json
$ sudo egrep "\"(pre|post)ScriptLocation\"" /etc/azure/PluginConfig.json
        "preScriptLocation" : "/opt/capside/azure/pre-mysql-backup.sh",
        "postScriptLocation" : "/opt/capside/azure/post-mysql-backup.sh",
$ sudo chmod 700 /etc/azure/PluginConfig.json
$ sudo chown -R root:root /etc/azure/PluginConfig.json

As previously stated, final details of this feature, and specifically of the plugin configuration, will be provided in Microsoft’s official announcement.

Backup tests

You can either wait for the daily backup policy to trigger a backup for a slave node-b or force it from Azure Portal.

Then check the output of the log created by the script, it should be similar to this:

[2017/03/02 09:07:24 UTC] INFO: Starting PRE actions for role slave
[2017/03/02 09:07:24 UTC] INFO: Stopping MySQL slave...
[2017/03/02 09:07:30 UTC] INFO: MySQL slave stopped.
[2017/03/02 09:07:30 UTC] INFO: Setting read-only status
[2017/03/02 09:07:30 UTC] INFO: Starting dump...
[2017/03/02 09:07:30 UTC] INFO: Dumping database sampledb...
[2017/03/02 09:07:30 UTC] INFO: Dumping table sampledb.sampletable...
[2017/03/02 09:07:30 UTC] INFO: 88      /opt/capside/azure/dumps/mysqldump-sampledb-sampletable-20170302-0907.sql.gz
[2017/03/02 09:07:30 UTC] INFO: Dump finished
[2017/03/02 09:07:30 UTC] INFO: Stopping MySQL service...
[2017/03/02 09:07:32 UTC] INFO: MySQL service stopped.
[2017/03/02 09:07:32 UTC] INFO: PRE for role slave finished
[2017/03/02 09:07:35 UTC] INFO: Starting POST actions for role slave
[2017/03/02 09:07:35 UTC] INFO: Starting MySQL service...
[2017/03/02 09:07:36 UTC] INFO: MySQL service started.
[2017/03/02 09:07:36 UTC] INFO: Unsetting read-only status
[2017/03/02 09:07:36 UTC] INFO: POST for role slave finished

As you can see, it takes only a few seconds to start the post options once the pre ones have finished.

application-consistent backup

Deployment of a new slave node

At this point, we have a two-node MySQL master-slave replication system working properly and with backups being performed on a daily basis. Let’s add a new slave node.

From Azure Portal, start the File Recovery of the VM node-b, push via scp the recovery bash script to the VM node-c and execute it to mount the new disk.

Make sure MySQL service is stopped in VM node-c because we are going to mangle its configuration and data files:

$ sudo service mysql stop

Synchronize MySQL configuration and data folders:

# rsync -avz --stats /path/to/the/recovery-mountpoint/Volume1/etc/mysql/* /etc/mysql/.

# rsync -avz --stats /path/to/the/recovery-mountpoint/Volume1/var/lib/mysql/* /var/lib/mysql/.

Now, adjust the mysqld.cnf file parameters. This could be easily automated by increasing the value of server-id and retrieving from the network interface the private IP address with commands like ifconfig:

$ sudo egrep "^(bind-address|server-id)" /etc/mysql/mysql.conf.d/mysqld.cnf
bind-address            = <node-b-private-address>
server-id               = 2

# sudo sed -i 's/^\(\s*server-id\s*=\s*\)[0-9]*\s*$/\13/' /etc/mysql/mysql.conf.d/mysqld.cnf
# sudo sed -i 's/^\(\s*bind-address\s*=\s*\)[0-9\.]*\s*$/\1<node-c-private-address>/' /etc/mysql/mysql.conf.d/mysqld.cnf

$ sudo egrep "^(bind-address|server-id)" /etc/mysql/mysql.conf.d/mysqld.cnf
bind-address            = <node-c-private-address>
server-id               = 3

Link the relay files to avoid hostname conflicts (see the value of Relay_Log_File in the output of the SHOW SLAVE STATUS command in case of doubts):

# ln -s <node-b-hostname>-relay-bin.index <node-c-hostname>-relay-bin.index
# ln -s <node-b-hostname>-relay-bin.000103 <node-c-hostname>-relay-bin.000103

As an alternative you can also use MySQL –relay-log and –relay-log-index options.

In order to generate a new UUID we must delete the file auto.cnf:

$ sudo rm -f /var/lib/mysql/auto.cnf

Otherwise, the replication of node-b will be stopped due to a duplicated server_uuid.

Start again MySQL service in VM node-c:

$ sudo service mysql start

And verify it resumes replication and catches up, as expected. Voilà!

Bonus track

The default location for the binary logs is /var/log/mysql. Properly defining a policy for archiving and rotating (or purging) the master binlogs can be very helpful in order to enable a more granular recovery with mysqbinlog utility.

Further learning

You can learn more about the application-consistent backup of Linux VMs on Microsoft blog and Azure updates) feed makes it easier to keep track of new features as they are available.

Microsoft’s online documentation is a great reference that should be in your browser bookmarks and the plan your VM backup infrastructure in Azure article is a useful reading.

You can also check our training courses schedule to catalyse your transformation into an Azure pro.

Application-consistent backup

What’s next?

We focus on designing, automating and operating 24/7 critical digital platforms for a wide range of world-class customers, don’t hesitate contacting us to know some of the ways we can help your business to harness the possibilities of the Cloud or even to join our team!

TAGS: application-consistent-backup, azure, azure lab, linux, microsoft, vm

speech-bubble-13-icon Created with Sketch.
Comments

Leave a Reply

Your email address will not be published. Required fields are marked *

*
*