Barbarous Dissonance

scp-resume for downloading multiple files

March 28, 2008 at 9:06 pm · Filed under Coding

You need to transfer a lot of files across a slightly temperamental ssh connection. You want something like a recursive scp command that supports resuming and will keep on trying until it gets the job done.

rsync is ideal for this purpose – however, I find it quite dodgy under cygwin, especially when transferring large files.

A sweet alternative is Unison, for synchronizing filesets over ssh.

However, I often find myself falling back on a nice script called scp-resume.sh designed for resuming the transfer of large files using dd over ssh. We can invoke this script inside a loop to transfer lots of files at a time.

One problem with the script is the use of the construct below to determine file sizes:

localsize=`ls -l "${localfile}" | awk '{ print $5 }'`

This will fail if there are spaces in the username of the file owner. Most likely you’ll get:

Resuming download of [file] at byte None
...
dd: invalid number `None'

where the group owning the file is reported by cygwin as ‘None’. The fix is to replace every instance of this ls -l construct with something like localsize=`ls -g "${localfile}" | awk '{ print $4 }'`. The -g option displays the file size but not the owner name, so you should be safe from spaces confusing awk. I don’t know if the -g option is POSIX, but it’s in GNU ls anyway.
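To see why the field number matters, here is a small sketch that feeds two listing lines through awk. The first line has the shape of the ls -l output shown in this post; the second is my assumption of what ls -g prints for the same file (the same line without the owner name).

```shell
# Listing lines for the same file. The ls -g line is an assumed example,
# built by dropping the owner name from the ls -l line.
long_line='-rw-r--r-- 1 Hugh Denman None 19 Mar 28 17:59 ./asd.txt'   # ls -l
short_line='-rw-r--r-- 1 None 19 Mar 28 17:59 ./asd.txt'              # ls -g

# awk splits on whitespace, so "Hugh Denman" occupies fields 3 and 4
# and pushes the size from field 5 to field 6.
size_from_l=$(printf '%s\n' "$long_line" | awk '{ print $5 }')
size_from_g=$(printf '%s\n' "$short_line" | awk '{ print $4 }')

echo "ls -l, field 5: $size_from_l"   # None
echo "ls -g, field 4: $size_from_g"   # 19
```

Note that a group name containing a space would presumably shift the fields in the same way, so ls -g only removes the owner name from the equation.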

You might be tempted to use ls -s, but this reports the amount of disk space used, rather than the actual length of the file (i.e. it will be a multiple of the allocation block size). You can see the difference using ls -ls:

Hugh Denman@gpplap3 ~
$ cat > asd.txt
fre
hschui
huernui

Hugh Denman@gpplap3 ~
$ ls -ls --block-size=1 ./asd.txt
1024 -rw-r--r-- 1 Hugh Denman None 19 Mar 28 17:59 ./asd.txt

Here my 19-byte text file is taking up 1024 bytes of disk space.

Two other possibilities, suggested by Erik Jan Taal, are perl -e "print -s '$filename'" and ls -l | sed -n 's/.* [^0-9]*\([0-9]\+\) .*/\1/ p'. These will work on FreeBSD, for example, which does not support ls -g.
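Another option that avoids parsing ls output altogether is wc -c, which counts bytes. This is only a sketch: file_size is a hypothetical helper name, not something in scp-resume, and on some systems wc may read the whole file to count it, which could be slow for very large files.

```shell
# file_size is a hypothetical helper, not part of scp-resume.
file_size() {
  # Redirecting the file into wc keeps the filename out of the output;
  # the arithmetic expansion strips the padding some wc versions add.
  echo $(( $(wc -c < "$1") ))
}

# Recreate the three-line text file from the ls -ls example above.
tmpfile=$(mktemp)
printf 'fre\nhschui\nhuernui\n' > "$tmpfile"
size=$(file_size "$tmpfile")
rm -f "$tmpfile"

echo "$size"    # 19
```

The result is the file length in bytes, the same figure ls -g reports, rather than the disk space that ls -s shows.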

To use the scp-resume script, we’ll need a text file containing the filenames to transfer from the remote machine. Here’s one way to generate this list.
$ ssh remote-user@remote.machine.ip.addr "/bin/find /cygdrive/d -type f" | grep -vi i386 > ./filelist.txt
In this example, the remote drive contains the OS installation files in /cygdrive/d/I386, which we don’t want to transfer.
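One thing to watch with grep -vi i386 is that it drops any path containing that string anywhere, not just files under the I386 directory. A rough sketch of the difference, using made-up paths and a pattern anchored to the start of the line:

```shell
# Sample paths; all of these are invented for illustration.
list='/cygdrive/d/I386/setup.exe
/cygdrive/d/docs/readme.txt
/cygdrive/d/drivers/i386-nic/driver.sys'

# Loose filter, as in the command above: removes every path mentioning i386.
loose=$(printf '%s\n' "$list" | grep -vi i386)

# Anchored filter: removes only paths under /cygdrive/d/I386/.
anchored=$(printf '%s\n' "$list" | grep -vi '^/cygdrive/d/i386/')

echo "loose keeps:"
echo "$loose"
echo "anchored keeps:"
echo "$anchored"
```

For the drive in this post the loose filter was good enough, but the anchored form is safer if other directories might have i386 in their names.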

With a fixed scp-resume script, and the list of files to transfer present, all that’s left to do is iterate over each file in the list and tell scp-resume to download it. We use the cat filelist.txt | while read FILE approach because it will preserve spaces in the filename (unlike for file in `cat filelist.txt`).

cat filelist.txt | while read FILE ; do
  DIR=`dirname "$FILE"`
  mkdir -p "./$DIR"
  ./scp-resume.sh -d "remote-user@remote.machine.ip.addr:$FILE" "./$FILE"
done

This very nearly works. The only trouble is that it transfers just the first file in the list and then inexplicably stops without an error! This difficulty arises whenever you use the cat [file] | while read VAR idea with a command inside the loop that reads standard input: ssh inherits the loop's STDIN, which is the pipe from cat, and swallows the rest of the file list, so read finds nothing left on the next iteration (I found that out in a Usenet post). So we have to modify scp-resume one last time, changing the download command so that ssh takes its input from /dev/null:

ssh -C -c arcfour "$userhost" "dd bs=1 skip=$localsize \"if=${remotefile}\"" >> $localfile < /dev/null
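The stalled loop can be reproduced without any network at all. In this sketch, cat > /dev/null stands in for ssh, on the assumption that what matters is a command inside the loop reading its standard input until end of file:

```shell
# 'cat > /dev/null' is a stand-in for ssh: it reads stdin until EOF.
greedy=$(printf 'a\nb\nc\n' | while read FILE; do
  echo "got $FILE"
  cat > /dev/null
done)

# Same loop, with the stand-in's stdin redirected from /dev/null.
fixed=$(printf 'a\nb\nc\n' | while read FILE; do
  echo "got $FILE"
  cat > /dev/null < /dev/null
done)

echo "without redirect:"
echo "$greedy"     # got a
echo "with redirect:"
echo "$fixed"      # got a, got b, got c (one per line)
```

ssh also has a -n option that redirects its stdin from /dev/null, which should have the same effect as the explicit redirect.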

With this change, you can’t enter the ssh password manually, but you’d really need automatic authentication set up anyway, as you don’t want to enter your password for every file. A simple way to set up automatic authentication is described here.

Lastly, you can wrap the whole command above in a for loop so that if the connection is dropped during some transfers, those files can be resumed in a subsequent pass:

for i in `seq 0 100`; do
  cat filelist.txt | while read FILE ; do
    DIR=`dirname "$FILE"`
    mkdir -p "./$DIR"
    ./scp-resume.sh -d "remote-user@remote.machine.ip.addr:$FILE" "./$FILE"
  done
done
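A fixed number of passes repeats work even when everything has already arrived. An alternative is to retry each command only until it succeeds. The sketch below uses a hypothetical retry helper and a fake flaky command. It assumes the command's exit status reflects success, which the scp-resume script as posted does not guarantee, since its download branch ends in a pipeline.

```shell
# retry is a hypothetical helper: run a command until it succeeds,
# giving up after max_tries attempts.
retry() {
  max_tries=$1
  shift
  try=1
  while [ "$try" -le "$max_tries" ]; do
    "$@" && return 0
    try=$((try + 1))
  done
  return 1
}

# Stand-in for a flaky transfer: fails twice, then succeeds.
counter=$(mktemp)
echo 0 > "$counter"
flaky() {
  n=$(( $(cat "$counter") + 1 ))
  echo "$n" > "$counter"
  [ "$n" -ge 3 ]
}

retry 5 flaky
status=$?
attempts=$(cat "$counter")
rm -f "$counter"

echo "exit status $status after $attempts attempts"   # exit status 0 after 3 attempts
```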

This whole process is hideously inefficient for large numbers of files, alas. But it seems to get the job done. Here’s my edited version of scp-resume, using the redirect from /dev/null for ssh and using ls -g instead of ls -l to query the file size. Note that I’ve only tested the downloading functionality, never the uploading bits.

#!/bin/sh
#
# scp-resume - by erik jan taal
# http://ejtaal.net/scripts-showcase/#scp-resume
# Speed improvements by using blocks by nitro.tm@gmail.com
# Fixed by Hugh Denman to use ls -g (safe with usernames containing spaces)
#   this version assumes that ssh is set up for automatic authentication rather than manual password entry
#
# This script assumes that you have access to the 'dd' utility
# on both the local and remote host.

# dd transfer blocksize (8192 by default)
blocksize=8192

usage()
{
  echo
  # The dollar signs are escaped so the placeholder names print literally
  # instead of expanding to (empty) shell variables.
  echo "Usage: `basename $0` -u(pload)   \$localfile  \$remotefile [\$sshargs]"
  echo "       `basename $0` -d(ownload) \$remotefile \$localfile  [\$sshargs]"
  echo
  echo "  \$remotefile should be in the scp format, i.e.: [user@]host:filename"
  echo "  \$sshargs are optional further ssh options such as a port specification"
  echo "     (-p 1234) or use of compression (-C)"
  echo
  echo "  -u:"
  echo "     \$remotefile may be [user@]host: for uploading to your remote home directory"
  echo "  -d:"
  echo "     \$localfile may be a period (.) when downloading a remote file to the"
  echo "       current working directory."
  echo
  exit 1
}

[ -z "$1" -o -z "$2" -o -z "$3" ] && usage

option=$1
case $option in
  -[uU]*)
    localfile=$2
    remote=$3
    shift 3
    sshargs="$*"

    userhost=${remote%:*}
    remotefile=${remote#*:}

    if [ ! -f "$localfile" ]; then
      echo "!! File not found: $localfile"
      usage
    fi
    if [ x"$userhost" = x"$remote" ]; then usage; fi
    if [ x"$remotefile" = x"$remote" -o -z "$remotefile" ]; then remotefile=`basename "$localfile"`; fi

    echo "==>> Getting size of remote file:"
    localsize=`ls -g "${localfile}" | awk '{ print $4 }'`
    # The /dev/null redirect belongs on ssh; on awk it would replace the pipe.
    remotesize=`ssh $sshargs "$userhost" "[ -f \"${remotefile}\" ] && ls -g \"${remotefile}\"" < /dev/null | awk '{ print $4 }'`

    [ -z "$remotesize" ] && remotesize=0
    echo "=> Remote filesize: $remotesize bytes"

    if [ $localsize -eq $remotesize ]; then
      echo "=> Local size equals remote size, nothing to transfer."
      exit 0;
    fi

    remainder=$((remotesize % blocksize))
    restartpoint=$((remotesize - remainder))
    blockstransferred=$((remotesize / blocksize))

    echo "=> Resuming upload of '$localfile'"
    echo "  at byte: $restartpoint ($blockstransferred blocks x $blocksize bytes/block),"
    echo "  will overwrite the trailing $remainder bytes."

    # ssh must read the data piped from dd here, so its stdin is not redirected.
    dd bs=$blocksize skip=$blockstransferred "if=${localfile}" |
      ssh $sshargs "$userhost" "dd bs=$blocksize seek=$blockstransferred of=\"$remotefile\""

    echo "done."
    ;;
  -[dD]*)
    localfile=$3
    remote=$2
    shift 3
    sshargs="$*"

    userhost=${remote%:*}
    remotefile=${remote#*:}

    if [ x"$localfile" = x"." ]; then localfile=`basename "$remotefile"`; fi
    if [ ! -f "$localfile" ]; then
      localsize=0;
    else
      localsize=`ls -g "${localfile}" | awk '{ print $4 }'`
    fi
    [ x"$remotefile" = x"$remote" ] && usage
    [ -z "$localsize" ] && localsize=0

    remainder=$((localsize % blocksize))
    restartpoint=$((localsize - remainder))
    blockstransferred=$((localsize / blocksize))

    echo "=> Resuming download of '$localfile'"
    echo "  at byte: $restartpoint ($blockstransferred blocks x $blocksize bytes/block)"
    echo "  filesize: $localsize; will overwrite the trailing $remainder bytes."
    ssh $sshargs "$userhost" "dd bs=$blocksize skip=$blockstransferred \"if=${remotefile}\"" < /dev/null |
      dd bs=$blocksize seek=$blockstransferred "of=$localfile"

    ;;
  *)
    usage
    ;;
esac
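The block arithmetic and the dd skip/seek pairing in the download branch can be tried out locally, with no ssh involved. This is a sketch only: the two temporary files stand in for the remote and local copies, and the sizes (20000 and 10000 bytes) are arbitrary.

```shell
blocksize=8192

# A 20000-byte "remote" file and a "local" copy cut off after 10000 bytes.
src=$(mktemp)
dst=$(mktemp)
head -c 20000 /dev/urandom > "$src"
head -c 10000 "$src" > "$dst"

# Same arithmetic as the script: restart at the last whole block.
localsize=$(( $(wc -c < "$dst") ))
remainder=$((localsize % blocksize))
restartpoint=$((localsize - remainder))
blockstransferred=$((localsize / blocksize))
echo "restart at byte $restartpoint, rewriting $remainder bytes"   # 8192 and 1808

# Read the source from the restart point and write at the same offset.
dd bs=$blocksize skip=$blockstransferred "if=$src" 2>/dev/null |
  dd bs=$blocksize seek=$blockstransferred "of=$dst" 2>/dev/null

if cmp -s "$src" "$dst"; then result=identical; else result=different; fi
rm -f "$src" "$dst"
echo "$result"
```

Restarting on a block boundary means up to one block of data is transferred again, which is the price of using a larger block size than bs=1.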

Second real post exactly one year after the first! Prolific.
