
Linux: mdadm – Modify a degraded mirror so that it has only one member/device

August 25, 2013

I recently virtualized a physical server that was running a soft-RAID1 managed by mdadm. It had /dev/md1 and /dev/md2 for root and swap, respectively. In our environment, depending on the VM, the storage that backs the disk images is either a SAS RAID5 or a SAN, so I didn’t want to waste space by having two virtual disks just to keep this VM dumb and happy. After the P2V, it of course began complaining at the usual interval that it was missing a disk.

I knew it wasn’t uncommon for this particular “hosting-made-easy-distro” to be installed as a single-disk soft-raid1; its automated installer does this by default when there is one disk to make it easier to switch to a two disk mirror later if the user ever desires it. So I figured I could reverse the process and things would be okay again. Be sure you have good backups before attempting something like this, especially on systems that have their own custom mdadm magic as part of the distro. In this instance it worked though…

tldr…

I found this (http://board.issociate.de/thread/505938/How-to-remove-non-existant-device.html) and tried it…

(In quoting this, I fixed what I think was a typo. The page actually says “-r faileded” [which gave an error]):

>> Try `mdadm /dev/md0 -r missing`.

Close. “missing” is only meaningful with --re-add.
You really want “-r failed” or “-r detached”

NeilBrown

…both -r failed and -r detached gave no errors, but they also had no effect. I think both are more appropriate when your host is already running and loses a device. If you run `cat /proc/mdstat` in that case (especially if you haven’t rebooted, but maybe even if you have), I believe it not only tells you that a device has failed but also which one(s).

In my situation, running `cat /proc/mdstat` showed only one underlying disk, even though it also showed the array itself was degraded and expected two physical disks, and /etc/mdadm.conf likewise said two devices were expected. So it wasn’t upset/degraded because it was looking for any particular device; it was just mad that it had a 2 stored somewhere and that isn’t the same number as 1. These commands got it:

mdadm /dev/md1 --grow --force -n1
mdadm /dev/md2 --grow --force -n1
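
To confirm the change took, and to keep the config in sync, something like the following should work (just a sketch; the mdadm.conf location and the exact ARRAY lines vary by distro, so review before replacing anything):

# Each array should now report 1 raid device and a clean state
cat /proc/mdstat
mdadm --detail /dev/md1 | grep -E 'Raid Devices|State'

# Regenerate ARRAY lines so the config stops expecting two devices;
# merge these into /etc/mdadm.conf (or /etc/mdadm/mdadm.conf) by hand
mdadm --detail --scan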

Finally, the relevant excerpts from `man mdadm`:

MODES
mdadm has several major modes of operation:

Grow   Grow (or shrink) an array, or otherwise reshape it in some way.
       Currently supported growth options include changing the active
       size of component devices and changing the number of active
       devices in RAID levels 1/4/5/6, as well as adding or removing a
       write-intent bitmap.

For create, build, or grow:
-n, --raid-devices=
       Specify the number of active devices in the array. This, plus
       the number of spare devices (see below) must equal the number of
       component-devices (including "missing" devices) that are listed
       on the command line for --create. Setting a value of 1 is
       probably a mistake and so requires that --force be specified
       first. A value of 1 will then be allowed for linear, multipath,
       raid0 and raid1. It is never allowed for raid4 or raid5.
       This number can only be changed using --grow for RAID1, RAID5
       and RAID6 arrays, and only on kernels which provide necessary
       support.


Linux: Foilhat – A semi-paranoid bash cronjob wrapper that watches its back

March 8, 2013

I’m happy to announce/post a project I’ve been tinkering with for some time. Foilhat (foilhat.sh) is a “semi-paranoid” cronjob wrapper that currently features job locking via PID file, improved output control (cleaner inbox!), and verification of required mount points on local and remote systems (via ssh) before starting work on the job. I’ve put it on GitHub along with a list of known issues (there are one or two) and planned improvements, like job grouping/run order.

I’ve been using it for several years now to great effect. Even though I’ve reached version 2, I may still make major changes, such as to the config variables and how they are fed into the program. I’m hoping to get that kind of thing stabilized shortly.

It has full documentation (though the README is in need of newline re-formatting), the code has inline comments, and there are two included test jobs; I’m not going to waste a lot of time retyping things here. 🙂

https://github.com/mark-casey/foilhat/blob/master/README
https://github.com/mark-casey/foilhat/blob/master/foilhat.sh
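
For reference, invoking it is just foilhat followed by the job command; any arguments after the job path are passed through to it. A hypothetical root crontab entry (paths are made up; foilhat itself insists on running as root) might look like:

0 2 * * * /usr/local/bin/foilhat.sh /usr/local/bin/nightly_backup.sh --full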


#!/bin/bash

###################################################################################
# Copyright (c) 2012, Mark Casey
# All rights reserved.
# 
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met: 
# 
# 1. Redistributions of source code must retain the above copyright notice, this
#    list of conditions and the following disclaimer. 
# 2. Redistributions in binary form must reproduce the above copyright notice,
#    this list of conditions and the following disclaimer in the documentation
#    and/or other materials provided with the distribution. 
# 
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
# ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
###################################################################################


# Foilhat v2 - A semi-paranoid cron job wrapper that handles job locking/PID, improved output control, and verification of required mount points.
# More documentation is available at the end of the script.


set -e
set -u
set -o pipefail

# Job that will be wrapped
JOB="$1"
FULLCMD="$@"

# Temp storage of job's output
FH_OUT="/tmp/foilhat.out.$$"
FH_ERR="/tmp/foilhat.err.$$"

LCK_FILE="/tmp/$(basename "${JOB}").foilhat.lck"

JOB_STARTED_AT=$(date)



##############################
#  -Define functions-
##############################

function write_lock_file {
# Writes and verifies the job's lock file

	set +e; echo $$ 2>/dev/null > "${LCK_FILE}"; R=$?; set -e
	#R=1  # Debug
	if [ "${R}" != 0 ]
	then
		echo -e "\n--foilhat.sh-- Exiting on error: Could not write to lock file ["${LCK_FILE}"]." >&2
		exit 1
	fi
	
	# Block race conditions and verify lock obtained
	chmod -w "${LCK_FILE}"
	chattr +i "${LCK_FILE}"
	read -r PID_AS_READ < "${LCK_FILE}"
	#PID_AS_READ=-999  # Debug
	if [ "${PID_AS_READ}" -ne $$ ]
	then
		echo -e "\n--foilhat.sh-- Exiting on error: PID in (supposedly) new lock file does not match the PID of the job." >&2
		echo "--foilhat.sh-- The full command that WOULD have been run is: ${FULLCMD}" >&2
		echo "--foilhat.sh-- Current PID [$$] != PID in lock file '"${LCK_FILE}"' ["${PID_AS_READ}"]." >&2
		exit 1
	fi

}



function obtain_lock {
# Verifies no prior job instance is still active, then calls 'write_lock_file'

	if [ -f "${LCK_FILE}" ]
	then
		echo -e "\n--foilhat.sh-- Warning: A prior lock file exists for the job (\""${JOB}"\"); reading last known PID..." >&2
		
		LCK_FILE_LINE_COUNT=$(wc -l <"${LCK_FILE}")
		#LCK_FILE_LINE_COUNT=0  # Debug
		if [ "${LCK_FILE_LINE_COUNT}" -eq 0 -o "${LCK_FILE_LINE_COUNT}" -gt 1 -o $(stat -c%s "${LCK_FILE}") -gt 20 ]
		then
			echo "--foilhat.sh-- Exiting on error: Prior lock file exists but is either empty or contains too much data." >&2
			exit 1
		fi
		
		read -r LAST_PID < "${LCK_FILE}"
		set +e; CHECK_WITH_PS=$(ps -fp "${LAST_PID}"); R=$?; set -e
		#R=0  # Debug
		if [ "${R}" != 0 ]
		then
			echo "--foilhat.sh-- OK!: The last instance of foilhat that ran the job (\""${JOB}"\") is not running under its last known PID ["${LAST_PID}"]" >&2
			echo "--foilhat.sh-- Continuing as normal..." >&2
			
			# The next two lines are the ONLY ones in this script that can clear the lock file
			# during an error state that actually *involves* the lock file.
			chattr -i "${LCK_FILE}"
			rm -f "${LCK_FILE}"
			
			write_lock_file
			
		else
			echo "--foilhat.sh-- Exiting on error: A prior foilhat instance for the job (\""${JOB}"\") is already/still running as ["${LAST_PID}"] based on:" >&2
			echo "${CHECK_WITH_PS}" >&2
			exit 1
		fi
	else
		write_lock_file
	fi

}

function mount_check {
# NOT called in foilhat.sh; exported to job's environment. Used by job to verify required mount points on this host or on remote hosts
# See sample job scripts and documentation for use examples (should be callable from binaries too).
# Use is recommended, but not required

INPUT=("${@}")
ALL_FOUND='true'

# Set where errors come from (some non-bash jobs will preset this, so their own name will show)
if [ -z "${FH_MOUNT_CHECK_CALLER:-}" ]
then
	FH_MOUNT_CHECK_CALLER=$(basename "${0}")
fi

for LINE in "${INPUT[@]}"
do
	MOUNT_NEEDED=''
	HOST=''
	PORT='22'
	KEY=''
	VERBOSE=''
	
	OPTIND=1
	LINE_AS_ARR=(${LINE})
	
	# Get arguments
	while getopts ":m:h:p:k:v" opt "${LINE_AS_ARR[@]}"
	do
		case ${opt} in
		m)
			MOUNT_NEEDED="${OPTARG}"
			;;
		h)
			HOST="${OPTARG}"
			;;
		p)
			PORT="${OPTARG}"
			;;
		k)
			KEY="${OPTARG}"
			;;
		v)
			VERBOSE='true'
			;;
		\?)
			echo "--mount_check()-- Invalid option: -$OPTARG" >&2
			#exit 1
			;;
		: )
			echo "--mount_check()-- Option -$OPTARG requires an argument." >&2
			#exit 1
			;;
		esac
	done
	
	if [ "${VERBOSE:-}" == 'true' -o -z "${MOUNT_NEEDED:-}" ]
	then
		echo >&2
		echo "Checking mount string: ${LINE}" >&2
		if [ -n "${MOUNT_NEEDED:-}" ]; then echo "    MOUNT: ${MOUNT_NEEDED}" >&2; fi
		if [ -n "${HOST:-}" ]; then echo "    HOST: ${HOST}" >&2; fi
		echo "    PORT: ${PORT} (Will always equal 22 if not set otherwise)" >&2
		if [ -n "${KEY:-}" ]; then echo "    KEY: ${KEY}" >&2; fi
		if [ -n "${VERBOSE:-}" ]; then echo "    VERBOSE: ${VERBOSE}" >&2; fi
		
		if [ -z "${MOUNT_NEEDED:-}" ]
		then
			echo >&2
			echo "--mount_check()-- Exiting on error: mountcheck() requires at least a mount point parameter." >&2
			exit 1
		fi
	fi
	
	# Check for mount
	FOUND='false'
	
	MOUNT_NEEDED="${MOUNT_NEEDED} type "  # This makes our parameter look like $(mount)'s output formatting; reduces false positives
	
	# If local
	if [ -z "${HOST:-}" -o "${HOST:-}" == 'local' ]
	then
		IFS=$'\n\b'  # Don't wordsplit on space
		for MOUNTS in $(mount)
		do
			if [[ "${MOUNTS}" == *"${MOUNT_NEEDED}"* ]]
			then
				FOUND=true
			fi
		done
		
		if [ "${FOUND}" == 'false' ]
		then
			ALL_FOUND='false'
		fi

	# Else remote
	else
		CMD_TMP='ssh'  # Begin prep of remote query
		
		if [ -n "${PORT:-}" ]
		then
			CMD_TMP="${CMD_TMP} -p ${PORT}"
		fi
		
		if [ -n "${KEY:-}" ]
		then
			CMD_TMP="${CMD_TMP} -i ${KEY}"
		fi
		
		if [ -n "${HOST:-}" ]
		then
			CMD_TMP="${CMD_TMP} ${HOST}"
		fi
		
		CMD_TMP="${CMD_TMP} 'mount'"  # Query ready
		
		CMD_TMP=( ${CMD_TMP} )  # Convert to array to avoid an 'evil use of eval' later
		
		IFS=$'\n\b'  # Don't wordsplit on space
		for MOUNTS in $("${CMD_TMP[@]}")
		do
			if [[ "${MOUNTS}" == *"${MOUNT_NEEDED}"* ]]
			then
				FOUND=true
			fi
		done
		
		if [ "${FOUND}" == false ]
		then
			ALL_FOUND='false'
		fi
	fi
	
	if [ "${VERBOSE:-}" == 'true' ]
	then
		echo "    FOUND?: ${FOUND}" >&2
		echo >&2
	fi
	
	# A mount came up missing; give up early
	if [ "${ALL_FOUND}" == 'false' ]
	then
		echo >&2
		echo "--${FH_MOUNT_CHECK_CALLER}-- Exiting on error: Failed to verify mount point. (failed to find line: ${LINE})" >&2
		echo >&2
		exit 1
	fi

	unset IFS	
done

}
export -f mount_check



##############################
#  -Begin main block-
##############################

# Don't run if we aren't root
if [ $(id -u) -ne 0 ]; then
	echo -e "\n--foilhat.sh-- Exiting on error: Sorry, you are not root." >&2
	exit 1
fi

obtain_lock

# Run job; capture outputs and exit status
set +e
eval ${FULLCMD} >"${FH_OUT}" 2>"${FH_ERR}"
RESULT=$?
set -e

JOB_ENDED_AT=$(date)

# Calculate job duration (get time in seconds, divide out number of days, mod off leftover seconds...ditto with hours/minutes)
# (Use SECS rather than SECONDS; SECONDS is special in bash and counts up in real time after assignment)
SECS=$(( $(date --date "${JOB_ENDED_AT}" +%s) - $(date --date "${JOB_STARTED_AT}" +%s) ))
DAYS=$((SECS / 86400)); SECS=$((SECS % 86400))
HOURS=$((SECS / 3600)); SECS=$((SECS % 3600))
MINUTES=$((SECS / 60))
SECS=$((SECS % 60))
DURATION="${DAYS} days, ${HOURS} hours, ${MINUTES} minutes, ${SECS} seconds"

# Disable STDOUT if no failures or errors
if [ $RESULT -eq 0 -a ! -s "$FH_ERR" ]
then
	exec > /dev/null
fi

# Check whether the job requested output to logfile
FH_OUTOPTS="/tmp/foilhat.outopts.$$"

if [ -r "${FH_OUTOPTS}" ]
then
	source "${FH_OUTOPTS}"

	if [ "${OUT_TO_LOG:-}" == 'true' -a -n "${LOGFILE:-}" ]
	then
		if [ "${APPEND_TO_LOG:-}" == 'false' -a "${OVERWRITE_LOG:-}" == 'true' ]
		then
			exec > >(tee "${LOGFILE}")
		else
			exec > >(tee -a "${LOGFILE}")
		fi
	fi
fi


# Write output
echo "Foilhat report for job:"
echo "${FULLCMD}"
echo
echo "Exit Status: ${RESULT}"
echo "Start time: ${JOB_STARTED_AT}"
echo "End time: ${JOB_ENDED_AT}"
echo "Duration: ${DURATION}"
echo
echo "STDERR:"
echo "-----------------"
cat "${FH_ERR}" | sed 's/^/   /'
echo "-----------------"
echo
echo
echo "STDOUT:"
echo "-----------------"
cat "${FH_OUT}" | sed 's/^/   /'
echo "-----------------"
echo
echo "===END FOILHAT REPORT $(date)==="
echo

rm -f "${FH_OUT}"
rm -f "${FH_ERR}"
rm -f "${FH_OUTOPTS}" || true  # Might not exist

chattr -i "${LCK_FILE}"
rm -f "${LCK_FILE}"


Windows Server: GPO logon batch script for ultravnc install/update

March 8, 2013

This script will install UltraVNC when called as a GPO logon script. It is fairly robust (x86 and x64, handles updates, and you can exclude hosts, with auto-removal upon detecting the ‘norun’ file), and the code is well commented inline. You’ll need to set the paths in the script to the UltraVNC program directory as well as to filever.exe (you want the one from the WinXP version of the Support Tools, as it is the most portable). The easiest way to get the winvnc program directory is to install the full server, configure it, and then copy the whole thing, config included, to your deployments share (you need x86 and x64 versions, but the config files can be the same for both).

If the script discovers that the version on the file server is newer it will update to it. I’m using it in production with ultravnc’s AD plugin (which also requires the acl text file configured; you can see it being imported in the script) and it works very well.

When there are updates to UltraVNC, sometimes there are full copies of the binaries with dependencies, and sometimes the only bins zip available is a changed file set. In either case watch out for your config files and other customized files, as blank versions of them are sometimes included in the update files. If the defaults overwrite yours, funny things happen, such as the normally silent install popping up a config window for the user.

::--------------------------------------------------::
::  UltraVNC installer. Called by GPO logon script  ::
::--------------------------------------------------::

::@echo off

:: Will we exitEarly before (re)installing? (possibly already installed?); exitEarly is initialized true, but may be changed during tests below
set exitEarly=true

:: Assume the architecture is x64 unless tests below show it is x86
set arch=x64

:: See if this is an x86 platform; if either test suggests it is x86, set arch appropriately
reg query HKLM\SOFTWARE\Wow6432Node || set arch=x86
if not exist "C:\Program Files (x86)" set arch=x86

:: Tests to see if the service executable exists in Program Files\UltraVNC and also whether the registry entries for the service are present
:: If either are missing, change the value of the exit variable to keep from exiting/trigger (re)install
if not exist "C:\Program Files\UltraVNC\winvnc.exe" set exitEarly=false
reg query HKLM\SYSTEM\CurrentControlSet\services\uvnc_service || set exitEarly=false

:: Find version of winvnc.exe on the deployments share and put the value in the 'serverVer' variable
for /f "tokens=5 delims= " %%x in ('\\server\deployments\suptools\filever.exe \\server\deployments\uvnc\%arch%\winvnc.exe') do @set serverVer=%%x

:: If exitEarly is still true here, then the service executable must exist; get its version and put the value in the 'localVer' variable
if "%exitEarly%" == "true" (for /f "tokens=5 delims= " %%y in ('\\server\deployments\suptools\filever.exe "C:\Program Files\UltraVNC\winvnc.exe"') do @set localVer=%%y)

:: If the version available on the server and the local version are not the same, an update is available (or something bad has happened); trigger (re)install
if not "%serverver%" == "%localver%" set exitEarly=false

:: If this file exists we exit no matter what and uninstall any existing installation
:: This single file (its contents irrelevant) can be added to sensitive hosts to prevent UltraVNC installation (granted, this GPO is not usually linked to such OUs)
if exist "C:\Program Files\UltraVNC\norun" (
	set exitEarly=true
	if exist "C:\Program Files\UltraVNC\winvnc.exe" (
		"C:\Program Files\UltraVNC\winvnc.exe" -uninstall
		ping -n 1 -w 2000 1.1.1.1 >nul :: delay
		rd /s /q "C:\Program Files\UltraVNC"
		ping -n 1 -w 2000 1.1.1.1 >nul :: delay
		md "C:\Program Files\UltraVNC"
		echo norun &gt; "C:\Program Files\UltraVNC\norun"
	)
)

if "%exitEarly%" == "true" goto :eof

:: Install (or reinstall)
:: -----------------------------
:: ...Start with uninstall in case the service is running, so we'll be able to remove the existing directory in Program Files
"C:\Program Files\UltraVNC\winvnc.exe" -uninstall
c:
cd \
cd "%ProgramFiles%"
rd /s /q UltraVNC

md UltraVNC
cd UltraVNC
xcopy /e \\server\deployments\uvnc\%arch%\*.* .

:: Add parent key in registry (not proven to be strictly required)
reg add HKLM\SOFTWARE\ORL\WinVNC3 /f

:: Import ACL for Active Directory integration
MSLogonACL.exe /i /o acl.txt

:: Redo uninstall for safe measure, and then install; this part does not affect copied files, only adds/removes (and starts/stops) the Windows service
winvnc -uninstall
winvnc -install

Linux: Brackup – Split huge Sources with ignore directives (without rebuilding the digest_db)

November 29, 2012

Brackup is a Perl library used to make backups that are de-duplicated, and that do not use bandwidth during subsequent backup runs to re-transfer files if those files move around in the Source tree. It can be very effective for file servers. Everything here was done against Brackup 1.10 using Perl 5.16.0.

One of my Brackup Sources is getting unwieldy, having recently exceeded 5TB spread over ~5.5 million files. The host it is on has 16GB of RAM and the job still runs into swap space before completing, so for performance reasons I’m currently prevented from running it during the workday. I’ve decided to split the Source, and since it took some figuring out how to do this without spending weeks waiting for the digest_db to rebuild, I thought I’d post the procedure I found. The idea is to use the ignore directive to split the Source in a way that is reasonably future proof.

I didn’t want to just hard-code ignore entries for half of the directories into each of the two new Sources created by the split. That would split the Source, but any new directories would not be excluded from either Source and would be backed up twice; not exactly an elegant solution. It gets easier to solve when you remember that the ‘ignore’ entries are Perl 5 regexes (I initially forgot). The second part of the problem is how to keep from spending the next week or two waiting for the digest_dbs to rebuild.

Deciding what ignore directives to employ requires some profiling of your files. The top level of my Brackup Source consists almost entirely of numerical directory names; only a few have names with actual words. The ending digits are evenly distributed and the overall pattern will not change as directories are added, so I’m using ignore entries that split the Source into two sets based on whether each directory name ends with an odd or even number. For example, a top-level directory named 123/ matches ^[^/]*[13579]/ and so is ignored by the Even source but picked up by the Odd one. I’ve arbitrarily grouped alpha-ending dirs with the Even directories for now, but you could get fancy and do something like Even with A-M and Odd with N-Z. This isn’t perfect, but it will work for many years, and as each half grows it allows for a re-split later using the same process used here.

Here is what I ended up with in the config file for the new Sources; assuming the original was simply named ‘big’:

[SOURCE:big_Even_az]
path = /media/disk/big/
chunk_size = 1MB

ignore = ^[^/]*[13579]/

noatime = 1
merge_files_under = 0
digestdb_file = /root/.brackup-source-big_Even_az-digest.db

[SOURCE:big_Odd]
path = /media/disk/big/
chunk_size = 1MB

ignore = ^[^/]*[24680a-zA-Z]/

noatime = 1
merge_files_under = 0
digestdb_file = /root/.brackup-source-big_Odd-digest.db

Install sqlite and create the new Sources’ digests as copies of the original:
# apt-get install sqlite3
# cp -a .brackup-source-big-digest.db .brackup-source-big_Even_az-digest.db
# cp -a .brackup-source-big-digest.db .brackup-source-big_Odd-digest.db

Open the digests in sqlite and remove the ‘Odds’ from the Even digest and vice versa. If a few of your directories are much larger than others, you can remove them first and then VACUUM to speed up the overall process. This is shown below, assuming 23 and 05 were known to be large:

# sqlite3 .brackup-source-big_Even_az-digest.db

sqlite> delete from digest_cache where key LIKE '[big]23/%';
sqlite> delete from digest_cache where key LIKE '[big]05/%';
sqlite> VACUUM;

sqlite> delete from digest_cache where key LIKE '[big]19/%';
sqlite> delete from digest_cache where key LIKE '[big]01/%';

…and then do the rest, but don’t forget to VACUUM at the end:

sqlite> VACUUM;
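
With this many top-level directories, typing each DELETE gets old. A quick way to generate the batch (a sketch, assuming the odd/even naming described above, the source path from the config, and the key format shown in the statements):

# Emit a DELETE for every odd-ending top-level dir, plus a final VACUUM,
# and pipe the whole batch into the Even digest
{
  for d in /media/disk/big/*[13579]/; do
    echo "delete from digest_cache where key LIKE '[big]$(basename "$d")/%';"
  done
  echo "VACUUM;"
} | sqlite3 .brackup-source-big_Even_az-digest.db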

Repeat the process for the Odd digest. In my case, before the last VACUUM of the Odd digest I also needed to remove the alpha directories. This is what I found to do it (in GLOB patterns the square brackets define char classes and also act as the escape mechanism; hence the wreck before and after the word ‘big’, where [[] matches a literal ‘[’ and []] a literal ‘]’):

delete from digest_cache where key GLOB '[[]big[]][a-zA-Z]*';

Finally, run these two statements to change what source the entries supposedly came from. Be sure the replacement strings match the new names in the brackup config:
Run on the Odd digest:

sqlite> update digest_cache set key = replace(key, '[big]', '[big_Odd]') where key like '[big]%';

Run on the Even digest:

sqlite> update digest_cache set key = replace(key, '[big]', '[big_Even_az]') where key like '[big]%';

Lastly, a few notes on the steps shown above:

  • If you try to interrupt these replace operations, I believe sqlite rolls back what was done before returning you to the prompt. If necessary, try running them in subsets, such as (noting the added '01' at the end) "update digest_cache set key = replace(key, '[big]', '[big_Odd]') where key like '[big]01%';"
  • Keep in mind that the original Source name was set in the database by Brackup and that we’re intervening and setting it to the new name. I haven’t looked in depth at the source code, but there may be characters that get special treatment when going into the database, like escaping or outright substitution. So far, upper and lower alphanumeric chars along with '_' and '-' seem to work without any special treatment. If you need to check a char (like ^ or %), just name a new Source with the character in question and run a small Brackup of data that is already in your Target. Then go look at what ended up in the key field to see if that character requires special treatment. Don’t forget to prune and gc that Brackup from the Target afterwards.

Linux: Brackup – Convert existing Target to not use colons in chunk names

October 18, 2012

I haven’t posted in a while. Going to try and fix that, maybe post shorter stuff more often.

I’ve been using Brackup lately, which seems to be a good but little known backup/dedupe tool. When it goes to back up a file it stores it in chunks. Some of its storage backends use colons in the chunk names and some don’t.

From: http://www.openfusion.net/net/brackup_tips_and_tricks

There are a few gotchas when doing this, though. One is that Target::Filesystem backups default to using colons in their chunk file names on Unix-like filesystems (for backwards-compatibility reasons), while Target::Ftp and Target::Sftp ones don’t. The safest thing to do is just to turn off colons altogether on Filesystem targets:

[TARGET:server_fs_home]
type = Filesystem
path = /export/brackup/nox/home
no_filename_colons = 1

I didn’t see that until after I’d done some backing up and I didn’t want to start over. I wanted to know if it is possible to just rename the chunks and change the config and have everything work right. Looks like you can. This worked for me but YMMV!

find /media/backups/brackup_targets/server_fs_home/ -type f | \
   grep : | sed -e 's/\(.*\)/\1 \1/' -e 's/\(\ .*\):/\1./' -e 's/^/mv /' | sh

The idea is to get a list of the chunks (e.g.: /media/backups/brackup_targets/sharedfile_backups/a0/00/sha1:SUM.chunk) and build a list of mv commands to pipe to sh that change the ‘:’ in each name to ‘.’ (i.e.: mv sha1:SUM.chunk sha1.SUM.chunk, but with the full paths).
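
If the sed approach feels fragile (it breaks on paths containing spaces and only rewrites the last colon), a plain bash loop does the same job; a sketch, assuming as above that colons only appear in the chunk filenames:

find /media/backups/brackup_targets/server_fs_home/ -type f -name '*:*' |
while IFS= read -r f; do
    mv "$f" "${f//:/.}"   # replace every colon with a dot
done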

Linux: Setting speed and duplex on NIC

March 17, 2012

TLDR? Skip down to “My solution” below.

At the office, one of our two Internet connections gives roughly 15mbps of symmetric throughput (it’s metro Ethernet). The other Internet connection (a cable line) only gives 7/2.5, so I’d like to use the faster option for site-to-site and road-warrior VPNs. Unfortunately that metro-E line is also our most reliable and lowest latency option, so we need to use it for some VoIP traffic too. We found these goals were mutually exclusive without some sort of rate limiting on the VPN, because when usage went up the call quality would suffer (the provider does highly prioritize RTP, but depending on what else you’re trying to do it isn’t always enough). Our LAN gateway has an interface for both WAN connections and the VPN server stays behind its firewall, but the voice server is not on the LAN and is only connected directly to the metro-E.

My co-worker had the idea that we could set our gateway interface for the metro-E connection to 10mbps instead of the auto-negotiated 100mbps, and it would be a ridiculously simple rate limiter that would leave plenty of capacity to the VoIP server. This is working well so far, but I did learn that there is a right and wrong way to do it and that most recommendations out there seem to advocate the wrong way.

Without stepping too far into one of the many IT ‘holy wars’, it should suffice to say that for quite a while after auto-negotiation came about, most mission-critical connections still had their speed and duplex manually configured on both ends. This was to help avoid cross-vendor issues with the still-changing auto-neg standards. There are still a number of old hands who could give countless examples from the “Old Days”(tm) in which auto-neg caused problems, and who would swear that manual is still the way to go. Then there are newer guys, like myself, who swear they haven’t seen that happen in over a decade (though, as most of us admit, some of us were just finishing up high school a decade ago).

The problem with this is that a lot of connections end up with one side on auto-neg and the other side on manual, whether because a different person configured each end or because one device or the other does not allow manual settings. The thing I previously did not know is that when only one side is manually configured, the other end will connect at the ‘forced’ speed but will consider the auto-negotiation to have failed and drop to half duplex. Again, I won’t delve too deep into the details, partially because I’m still reading about them, but if you force full duplex on one end and the other end falls back to half, the link status can change to ‘Suck’ real quick-like. Furthermore, gigE connections use the auto-negotiation process to work through many of the extra details needed to maintain 1000mbps, so auto-neg is required for that speed. I did read in one place that some vendor implementations still let you turn off auto-neg for gigE via one sort of voodoo or another, but that doesn’t make it a good idea.

My solution (for Linux hosts):
Tell ethtool to make the NIC only advertise the auto-negotiation rates that you want it to.

From man ethtool:

advertise N
Sets the speed and duplex advertised by autonegotiation. The argument is a hexadecimal value using one or a combination of the following values:
0x001 10 Half
0x002 10 Full
0x004 100 Half
0x008 100 Full
0x010 1000 Half(not supported by IEEE standards)
0x020 1000 Full
0x8000 2500 Full(not supported by IEEE standards)
0x1000 10000 Full
0x03F Auto

So we ran the command (0x003 = 0x001 + 0x002, i.e. advertise only 10 Half and 10 Full):
ethtool -s eth1 advertise 0x003

…and then tacked this onto /etc/sysconfig/network-scripts/ifcfg-eth1:
ETHTOOL_OPTS="advertise 0x003"
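
To sanity-check the result, ethtool’s status output should now show the link at 10Mb/s full duplex with auto-negotiation still reported as on; something like:

# Expect Speed: 10Mb/s, Duplex: Full, Auto-negotiation: on
ethtool eth1 | grep -E 'Speed|Duplex|Auto-negotiation'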

I was wondering for a while on a theoretical level whether 10mbps full duplex would mean the line could still pull 20 (making this attempt futile), but I’m pretty sure now that full duplex means 10mbps in each direction simultaneously, so any one direction is still capped at 10. In any case it has been a month now and there have been no issues with call quality, and the remote users whose capacity from us to them increased 5X have only recently stopped offering to name their children and pets after us. Ok, so that last thing isn’t true, but they were really happy.

Windows: The best way(s) to sleep or wait in a batch (cmd) script

October 26, 2011

UPDATE: This also works and is probably even better than the conclusion the article below comes to (the delay being the -w value, in milliseconds; note it relies on the pinged address not answering):

ping -n 1 -w 2000 1.1.1.1 >NUL 2>&1

Here is the original article and conclusions:

For the longest time, if you wanted to sleep in a batch script you either had to use a number of pings, call an outside program, or do some voodoo with a command like ‘choice’. Often the solution is not reliable and/or not portable between Windows versions.

However I was just revisiting the issue and found a pretty impressive option (shown below) that uses only a conditional goto loop with some timestamps to determine if the correct amount of time has passed. Presumably, the accuracy of this option is roughly that of the system clock with a resolution of one second. The only source I have for this is a link to the stackoverflow.com page I saw it on: http://stackoverflow.com/questions/166044/sleeping-in-a-batch-file. The user who posted it there was “Aacini” but I am not sure if he/she was the original author.

The only real problem with the solution that I can see is that it pegs the CPU. In a virtual machine with a single 2.0ghz core, it easily forces 100% use for the duration of the delay. This is really only a problem if there are other things on the host that are getting slowed down by the delay code (which is likely). So, I’ve mixed it with the old ping solution to get the best of both worlds. Each pass of the goto loop issues two pings to localhost, which blocks the script for around one second each time. By doing this, the loop is slowed considerably and CPU utilization stays below 10%.

There are a few downsides to the change. The first is that if for some reason all of the ping commands terminate immediately, the delay will work as before and will likely cause very high CPU usage. The other is that if something happens to make each ping take longer (perhaps a completely shot network stack), you could be delayed for longer than intended.

I’ve done some testing and although it is usually accurate to within one second for any delay length, trying to use short delays of <=5 seconds is possibly less accurate. However for delays of 10 seconds or longer, the delay has (so far) been within one second of the correct delay 100% of the time. I would estimate the absolute worst case for any delay over 15 seconds would show 4 additional seconds elapsing. So 104 instead of 100, 79 instead of 75, etc., etc. Note that the loop will continue as long as the target time has not been found, so it shouldn’t be possible to get a shorter than expected delay…only a slightly longer one.

The upside here is that it works on XP and Win7 (and probably others that I haven’t tried), it makes no calls to outside programs, and it doesn’t eat up the CPU too much. If the CPU load isn’t an issue for you, be sure and use the version without the ping line in there!

@ECHO OFF
REM DELAY seconds

REM GET ENDING SECOND
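REM (THE "1" PREFIX AND MOD 100 STRIP LEADING ZEROS SO "08"/"09" AREN'T PARSED AS OCTAL)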
FOR /F "TOKENS=1-3 DELIMS=:." %%A IN ("%TIME%") DO SET /A H=%%A, M=1%%B%%100, S=1%%C%%100, ENDING=(H*60+M)*60+S+%1

REM WAIT FOR SUCH A SECOND
:WAIT
ping 127.0.0.1 -n 2 > nul
FOR /F "TOKENS=1-3 DELIMS=:." %%A IN ("%TIME%") DO SET /A H=%%A, M=1%%B%%100, S=1%%C%%100, CURRENT=(H*60+M)*60+S
IF %CURRENT% LSS %ENDING% GOTO WAIT

P.S.: batch’s for loops are hard to read… Generally speaking, it is just getting the start time and converting it to seconds (since the start of the day), then running a loop of ping + checking the time (again, in seconds) until the target second is reached. The ‘%1’ on the end of the FOR line under “GET ENDING SECOND” is where the delay goes. Normally you’ll save this in a file such as ‘delay.bat’ and then call ‘delay.bat 15’ to sleep for 15 seconds. You can remove the %1 and put a number after the ‘+’ to always sleep for that length, at which point you could drop the whole thing into another script. Modifications may be needed to delay across midnight from one day to the next, which I believe is addressed on the stackoverflow page I linked.