
Linux: Brackup – Split huge Sources with ignore directives (without rebuilding the digest_db)

November 29, 2012

Brackup is a Perl backup tool that de-duplicates data and, on subsequent backup runs, avoids spending bandwidth re-transferring files that have merely moved around within the Source tree. It can be very effective for file servers. Everything here was done against Brackup 1.10 using Perl 5.16.0.

One of my Brackup Sources is getting unwieldy, having recently exceeded 5TB spread over ~5.5 million files. The host it runs on has 16GB of RAM and the job still runs into swap space before completing, so for performance reasons I currently can't run it during the workday. I've decided to split the Source, and since it took some figuring out how to do this without spending weeks waiting for the digest_db to rebuild, I thought I'd post the procedure I found. The idea is to use the ignore directive to split the Source in a way that is reasonably future-proof.

I didn't want to just hard-code ignore entries for half of the directories into each of the two new Sources created by the split. That would split the Source, but any new directories would not be excluded from either Source and would be backed up twice; not exactly an elegant solution. It gets easier to solve once you remember that the ignore entries are Perl 5 regexes (I initially forgot). The second part of the problem is how to avoid spending the next week or two waiting for the digest_dbs to rebuild.

Deciding what ignore directives to employ requires some profiling of your files. The top level of my Brackup Source consists almost entirely of numerical directory names; only a few have names with actual words. The ending digits are evenly distributed and the overall pattern will not change as directories are added, so I’m using ignore entries that will split the Source into two sets based on whether each directory ends with an odd or even number. I’ve arbitrarily grouped alpha-ending dirs with the Even directories for now, but you could get fancy and do something like Even with A-M and Odd with N-Z. This isn’t perfect but it will work for many years, and as each half grows it allows for a re-split later using the same process used here.
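Before trusting the patterns, it's easy to sanity-check how sample directory names will be divided between the two Sources. A quick one-liner (the directory names here are made up; the regex is the odd-matching pattern from the config below):

# perl -e 'for my $d ("20481/", "20482/", "archive/") { print $d, " -> ", ($d =~ m{^[^/]*[13579]/} ? "big_Odd" : "big_Even_az"), "\n" }'
20481/ -> big_Odd
20482/ -> big_Even_az
archive/ -> big_Even_az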

Here is what I ended up with in the config file for the new Sources, assuming the original was simply named 'big':

[SOURCE:big_Even_az]
path = /media/disk/big/
chunk_size = 1MB

ignore = ^[^/]*[13579]/

noatime = 1
merge_files_under = 0
digestdb_file = /root/.brackup-source-big_Even_az-digest.db

[SOURCE:big_Odd]
path = /media/disk/big/
chunk_size = 1MB

ignore = ^[^/]*[24680a-zA-Z]/

noatime = 1
merge_files_under = 0
digestdb_file = /root/.brackup-source-big_Odd-digest.db
Install sqlite3 and create the new digest databases as copies of the original:
# apt-get install sqlite3
# cp -a .brackup-source-big-digest.db .brackup-source-big_Even_az-digest.db
# cp -a .brackup-source-big-digest.db .brackup-source-big_Odd-digest.db

Open the digests in sqlite and remove the 'Odds' from the Even digest and vice-versa. If a few of your directories are much larger than the others, you can remove them first and then run VACUUM to shrink the database and speed up the remaining deletes. The deletes below assume 23 and 05 were known to be large.
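If you aren't sure which directories dominate, the digest itself can tell you. A sketch of a per-directory count, assuming every key begins with the 5-character '[big]' prefix followed by the top-level directory name and a slash:

sqlite> select substr(key, 6, instr(key, '/') - 6) as dir, count(*) as entries
   ...> from digest_cache group by dir order by entries desc limit 10;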

# sqlite3 .brackup-source-big_Even_az-digest.db

sqlite> delete from digest_cache where key LIKE '[big]23/%';
sqlite> delete from digest_cache where key LIKE '[big]05/%';
sqlite> VACUUM;

sqlite> delete from digest_cache where key LIKE '[big]19/%';
sqlite> delete from digest_cache where key LIKE '[big]01/%';

…then work through the remaining odd-ending directories the same way, not forgetting one last VACUUM at the end:

sqlite> VACUUM;
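Before moving on, it's worth confirming that no odd-ending directories slipped through. A sanity check along the same lines as the count above (it should return 0 on the Even digest):

sqlite> select count(*) from digest_cache where substr(key, 6, instr(key, '/') - 6) GLOB '*[13579]';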

Repeat the process for the Odd digest. In my case, before the last VACUUM of the Odd digest I also needed to remove the alpha directories. This is what I found to do it (in GLOB patterns, square brackets define character classes, and a literal bracket can only be matched by putting it inside a single-character class of its own; hence the wreck of brackets before and after the word 'big'):

sqlite> delete from digest_cache where key GLOB '[[]big[]][a-zA-Z]*';

Finally, run these two statements to rewrite the Source name recorded in each key. Be sure the replacement strings match the new Source names in the brackup config:
Run on the Odd digest:

sqlite> update digest_cache set key = replace(key, '[big]', '[big_Odd]') where key like '[big]%';

Run on the Even digest:

sqlite> update digest_cache set key = replace(key, '[big]', '[big_Even_az]') where key like '[big]%';
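A quick way to confirm the rename covered everything is to count what's left under the old name. Run on each digest; it should return 0:

sqlite> select count(*) from digest_cache where key like '[big]%';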

A few notes on this last step:

  • If you interrupt these replace operations, I believe sqlite rolls back what was done before returning you to the prompt. If that becomes a problem, try running the update in subsets, such as (note the added '01' at the end) "update digest_cache set key = replace(key, '[big]', '[big_Odd]') where key like '[big]01%';" A shell loop for walking through such subsets follows this list.
  • Keep in mind that the original Source name was set in the database by Brackup, and that we're intervening and setting it to the new name. I haven't looked in depth at the source code, but there may be characters that get special treatment on their way into the database, such as escaping or outright substitution. So far, upper- and lower-case alphanumeric chars along with '_' and '-' seem to work without any special treatment. If you need to check a char (like ^ or %), just name a new Source with the character in question and run a small Brackup of data that is already in your Target, then look at what ended up in the key field to see if that character requires special treatment. Don't forget to prune and gc that Brackup from the Target afterwards.
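For the subset approach in the first note, a small shell loop can walk the numeric prefixes one transaction at a time. A sketch for the Odd digest (once the alpha directories are gone, every remaining key starts with a digit; adjust the digest filename and Source names to match your own config):

for d in 0 1 2 3 4 5 6 7 8 9; do
  sqlite3 .brackup-source-big_Odd-digest.db \
    "update digest_cache set key = replace(key, '[big]', '[big_Odd]') where key like '[big]${d}%';"
done

Each sqlite3 invocation is its own transaction, so an interruption only loses the subset in flight rather than the whole rename.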