
Archive for Coding

scp-resume for downloading multiple files

You need to transfer a lot of files across a slightly temperamental ssh connection. You want something like a recursive scp that supports resuming and will keep trying until it gets the job done.

rsync is ideal for this purpose – however, I find it quite dodgy under cygwin, especially when transferring large files.

A sweet alternative is Unison, for synchronizing filesets over ssh.

However, I often find myself falling back on a nice script called scp-resume.sh designed for resuming the transfer of large files using dd over ssh. We can invoke this script inside a loop to transfer lots of files at a time.

One problem with the script is the use of the construct below to determine file sizes:

localsize=`ls -l "${localfile}" | awk '{ print $5 }'`

This will fail if the file owner’s username contains spaces, since the extra word shifts the fields that awk sees. Most likely you’ll get:

Resuming download of [file] at byte None
...
dd: invalid number `None'

where the group owning the file is reported by cygwin as ‘None’ – because of the space in the owner name, awk’s fifth field lands on the group rather than the size. The fix is to replace every instance of this ls -l construct with something like localsize=`ls -g "${localfile}" | awk '{ print $4 }'`. The -g option displays the file size but omits the owner name, so spaces there can no longer confuse awk. I don’t know if the -g option is POSIX, but it’s in GNU ls anyway.
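A quick sanity check of the ls -g approach, using a throwaway file in /tmp (the path is just for illustration):

```shell
#!/bin/sh
# With ls -g the size is field 4 (perms, links, group, size), so an owner
# name containing spaces can no longer shift the fields awk sees.
printf 'fre\nhschui\nhuernui\n' > /tmp/asd.txt
size=`ls -g /tmp/asd.txt | awk '{ print $4 }'`
echo "$size"   # 19 - three words plus three newlines
```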

You might be tempted to use ls -s, but this reports the amount of disk space used rather than the actual length of the file (i.e. it will be a multiple of the filesystem’s allocation block size). You can see the difference using ls -ls:

Hugh Denman@gpplap3 ~
$ cat > asd.txt
fre
hschui
huernui

Hugh Denman@gpplap3 ~
$ ls -ls --block-size=1 ./asd.txt
1024 -rw-r--r-- 1 Hugh Denman None 19 Mar 28 17:59 ./asd.txt

Here my 19-byte text file is taking up 1024 bytes of disk space.

Two other possibilities, suggested by Erik Jan Taal, are perl -e "print -s '$filename'" and ls -l | sed -n 's/.* [^0-9]*\([0-9]\+\) .*/\1/ p'. These will work on FreeBSD, for example, which does not support ls -g.

To use the scp-resume script, we’ll need a text file containing the filenames to transfer from the remote machine. Here’s one way to generate this list.
$ ssh remote-user@remote.machine.ip.addr "/bin/find /cygdrive/d -type f" | grep -vi i386 > ./filelist.txt
In this example, the remote drive contains the OS installation files in /cygdrive/d/I386, which we don’t want to transfer.

With a fixed scp-resume script, and the list of files to transfer present, all that’s left to do is iterate over each file in the list and tell scp-resume to download it. We use the cat filelist.txt | while read FILE approach because it will preserve spaces in the filename (unlike for file in `cat filelist.txt`).

cat filelist.txt | while read FILE ; do
DIR=`dirname "$FILE"`;
mkdir -p "./$DIR" ;
./scp-resume.sh -d "remote-user@remote.machine.ip.addr:$FILE" "./$FILE" ;
done

This very nearly works – the only trouble is that it transfers just the first file in the list and then inexplicably stops, without any error! This difficulty arises whenever you use the cat [file] | while read VAR idea with a shell invocation inside the while loop: the invoked command (here ssh) inherits STDIN and drains the pipe feeding the loop (I found that out in a Usenet post). So we have to modify scp-resume one last time, redirecting the download command’s input from /dev/null:

ssh -C -c arcfour "$userhost" "dd bs=1 skip=$localsize \"if=${remotefile}\"" >> "$localfile" < /dev/null
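The stdin-eating behaviour is easy to reproduce without ssh at all; in this sketch, cat stands in for any command inside the loop that reads its standard input:

```shell
#!/bin/sh
# Demonstrates why a stdin-reading command inside `while read` kills the loop.
printf 'a\nb\nc\n' > /tmp/list.txt

# Broken: cat inherits the loop's stdin and drains the remaining lines.
broken=0
while read LINE; do
  broken=$((broken + 1))
  cat > /dev/null                 # swallows 'b' and 'c'; read then hits EOF
done < /tmp/list.txt
echo "broken loop iterations: $broken"   # 1

# Fixed: redirect the inner command's stdin from /dev/null.
fixed=0
while read LINE; do
  fixed=$((fixed + 1))
  cat > /dev/null < /dev/null
done < /tmp/list.txt
echo "fixed loop iterations: $fixed"     # 3
```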

With this change, you can’t type the ssh password in manually – but you’d want automatic authentication set up anyway, as you don’t want to enter your password for every file. A simple way to set up automatic authentication is described here.
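For completeness, the usual recipe for key-based authentication looks something like the sketch below (the host is a placeholder; a passphrase-less key is convenient but weaker than an agent-backed one):

```shell
# Generate a passphrase-less RSA key pair, then append the public key
# to the remote account's authorized_keys so ssh stops prompting.
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
ssh-copy-id remote-user@remote.machine.ip.addr
```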

Lastly, you can wrap the whole command above in a for loop with a few iterations so that if the connection is dropped on a few transfers, the file can be resumed in a subsequent pass:

for i in `seq 0 100`; do
cat filelist.txt | while read FILE ; do
DIR=`dirname "$FILE"`;  mkdir -p "./$DIR" ;
./scp-resume.sh -d "remote-user@remote.machine.ip.addr:$FILE" "./$FILE" ;
done; done

This whole process is hideously inefficient for large numbers of files, alas. But it seems to get the job done. Here’s my edited version of scp-resume, using the redirect from /dev/null for ssh and using ls -g instead of ls -l to query the file size. Note that I’ve only tested the downloading functionality, never the uploading bits.

#!/bin/sh
#
# scp-resume - by erik jan taal
# http://ejtaal.net/scripts-showcase/#scp-resume
# Speed improvements by using blocks by nitro.tm@gmail.com
# Fixed by Hugh Denman to use ls -g (safe with usernames containing spaces)
#   this version assumes that ssh is set up for automatic authentication rather than manual password entry
#
# This script assumes that you have access to the 'dd' utility
# on both the local and remote host.

# dd transfer blocksize (8192 by default)
blocksize=8192

usage()
{
  echo
  echo "Usage: `basename $0` -u(pload)   \$localfile  \$remotefile [\$sshargs]"
  echo "       `basename $0` -d(ownload) \$remotefile \$localfile  [\$sshargs]"
  echo
  echo "  \$remotefile should be in the scp format, i.e.: [user@]host:filename"
  echo "  \$sshargs are optional further ssh options such as a port specification"
  echo "     (-p 1234) or use of compression (-C)"
  echo
  echo "  -u:"
  echo "     \$remotefile may be [user@]host: for uploading to your remote home directory"
  echo "  -d:"
  echo "     \$localfile may be a period (.) when downloading a remote file to the"
  echo "       current working directory."
  echo
  exit 1
}

[ -z "$1" -o -z "$2" -o -z "$3" ] && usage

option=$1
case $option in
  -[uU]*)
    localfile=$2
    remote=$3
    shift 3
    sshargs="$*"

    userhost=${remote%:*}
    remotefile=${remote#*:}

    if [ ! -f "$localfile" ]; then
      echo "!! File not found: $localfile"
      usage
    fi
    if [ x"$userhost" = x"$remote" ]; then usage; fi
    if [ x"$remotefile" = x"$remote" -o -z "$remotefile" ]; then remotefile=`basename "$localfile"`; fi

    echo "==>> Getting size of remote file:"
    localsize=`ls -g "${localfile}" | awk '{ print $4 }'`
    # redirect applies to ssh, not awk, so the pipe still carries the ls output
    remotesize=`ssh $sshargs "$userhost" "[ -f \"${remotefile}\" ] && ls -g \"${remotefile}\"" < /dev/null | awk '{ print $4 }'`

    [ -z "$remotesize" ] && remotesize=0
    echo "=> Remote filesize: $remotesize bytes"

    if [ $localsize -eq $remotesize ]; then
      echo "=> Local size equals remote size, nothing to transfer."
      exit 0;
    fi

    remainder=$((remotesize % blocksize))
    restartpoint=$((remotesize - remainder))
    blockstransferred=$((remotesize / blocksize))

    echo "=> Resuming upload of '$localfile'"
    echo "  at byte: $restartpoint ($blockstransferred blocks x $blocksize bytes/block),"
    echo "  will overwrite the trailing $remainder bytes."

    # no < /dev/null here: on upload, ssh must read the piped dd output
    dd bs=$blocksize skip=$blockstransferred "if=${localfile}" |
      ssh $sshargs "$userhost" "dd bs=$blocksize seek=$blockstransferred of=\"$remotefile\""

    echo "done."
    ;;
  -[dD]*)
    localfile=$3
    remote=$2
    shift 3
    sshargs="$*"

    userhost=${remote%:*}
    remotefile=${remote#*:}

    if [ x"$localfile" = x"." ]; then localfile=`basename "$remotefile"`; fi
    if [ ! -f "$localfile" ]; then
      localsize=0;
    else
      localsize=`ls -g "${localfile}" | awk '{ print $4 }'`
    fi
    [ x"$remotefile" = x"$remote" ] && usage
    [ -z "$localsize" ] && localsize=0

    remainder=$((localsize % blocksize))
    restartpoint=$((localsize - remainder))
    blockstransferred=$((localsize / blocksize))

    echo "=> Resuming download of '$localfile'"
    echo "  at byte: $restartpoint ($blockstransferred blocks x $blocksize bytes/block)"
    echo "  filesize: $localsize; will overwrite the trailing $remainder bytes."
    ssh $sshargs "$userhost" "dd bs=$blocksize skip=$blockstransferred \"if=${remotefile}\"" < /dev/null |
      dd bs=$blocksize seek=$blockstransferred "of=$localfile"

    ;;
  *)
    usage
    ;;
esac

Second real post exactly one year after the first! Prolific.


Debugging JSP pages on GoDaddy

I was recently hired to look into some problems with a fairly simple web-based software licensing application hosted on GoDaddy. The server application consists of a number of servlets in some JAR files, and some .jsp pages invoking those servlets.

One of the problems was that email messages originated by the server application weren’t being delivered. There were two issues affecting email delivery. Firstly, the SMTP server had been hard-coded as ‘localhost’; this was easily changed to GoDaddy’s mail server. After this change was made, the second problem appeared: the qmail server was complaining about raw linefeed characters (LF vs CRLF). The app uses an email servlet from CoolServlets.com (now apparently defunct), which uses PrintWriter.println to issue each SMTP command. The email messages constructed by the app used \n newline characters. Neither of these approaches is correct: PrintWriter.println emits the system-specific newline, and no Java IO function will ever do C-style translation of \n into \r\n. A succinct little summary of newlines in programming languages can be had from Wikipedia.
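Before blaming the mail server, it’s worth verifying which terminators a message actually contains; od -c makes the difference visible (file paths here are just for illustration):

```shell
#!/bin/sh
# Two sample SMTP lines: one terminated with a bare LF, one with the
# CRLF that SMTP requires. od -c prints the control characters explicitly.
printf 'HELO example.com\n'   > /tmp/lf.txt
printf 'HELO example.com\r\n' > /tmp/crlf.txt
od -c /tmp/lf.txt     # last characters: ... \n
od -c /tmp/crlf.txt   # last characters: ... \r \n
```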

The fix is to change every out.println(stuff) in com.coolServlets.email.Transport.java to out.print(stuff + "\r\n"), and similarly to construct email messages using \r\n instead of \n. If you need to make a similar fix, the source for the CoolServlets.com email servlet is, at the time of writing, available here.

So far, so straightforward. The only issue here is that GoDaddy’s Tomcat server only restarts once a day (1am Arizona time). And because of servlet caching, updates to a jar (or standalone servlet) only come into effect when the server is bounced. So if you’re trying to debug a server application that depends intimately on the host’s (GoDaddy’s) email server & MySQL databases, the naive MO is basically patch, compile, upload, wait til 1am AZ time, repeat until done. Nightmare. Other people have encountered similar GoDaddy Gotchas.

However, while an update to an existing servlet has to wait for the bounce, a new servlet (standalone or in a .jar, but not in a WAR) is available immediately on upload. So for a faster turnaround time, you can rename the class each time you upload it, and invoke it straight away. Of course, this gets cumbersome pretty quickly, as every reference to the updated class in every .java and .jsp file has to be updated as well.

So the fix is: automation by shell script. For the application I was working on, I had two servlet source trees: /WEB-INF/lib/com.consultant/ (consultant were the crowd that originally developed the application), and /WEB-INF/lib/com.coolservlets/. I also had a bunch of .jsp files in /jsp/. I figured the simplest thing was to update the package names, specifically changing com. to comTestNNN. Don’t use comN, as com1 and com2 are serial ports under Cygwin.

The shell script (update.sh) appears below. As it stands, you invoke it as update.sh com comTest1 after the first fix. If the problem’s not sorted, make another change and go update.sh comTest1 comTest2. And so on, with a new basename for the updated packages every time. You can test the updated application as soon as it’s uploaded.

The main flaw in the system is that you have to close all the source files in your editor before each build—even if the build fails—as the root directory name changes each time the script is invoked, so the file paths are invalidated. I suspect that this can easily be worked around using directory links, however, where the directory link is renamed each time while the code sits inside the ‘real’ directory. Also, it would be smarter to have the script generate the new name automatically, rather than have the user specify the old and new names.
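The symlink workaround could look something like this sketch (directory names are illustrative, and I haven’t verified that the win32 Java tools are happy following cygwin symlinks):

```shell
#!/bin/sh
# Sketch of the suggested workaround: sources stay in one fixed directory,
# and only a symlink carrying the current package name is renamed between
# builds, so file paths open in the editor remain valid.
mkdir -p /tmp/pkgdemo/realsrc
cd /tmp/pkgdemo
ln -s realsrc comTest1       # build against comTest1/...
mv comTest1 comTest2         # next iteration: rename only the link
readlink comTest2            # still points at realsrc
```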

The four stages are:

  • Rename the package root directory from $OLDNAME to $NEWNAME
  • Replace every occurrence of $OLDNAME with $NEWNAME in all .java and .jsp files, where it refers to one of the packages we’re updating
  • Package up the files in a .jar
  • Upload everything to the server

The script uses the Java tools (same version as GoDaddy, rather than the most recent), perl, and ncftp. It’s developed under cygwin, but should work on any bash-alike.

#!/bin/bash

set -o nounset
set -o errexit
set -o verbose

JAVAC="/cygdrive/c/Program Files/Java/jdk1.5.0_11/bin/javac.exe"
JAR="/cygdrive/c/Program Files/Java/jdk1.5.0_11/bin/jar.exe"

[ $# -lt 2 ] && echo "Need OLDNAME / NEWNAME arguments!" && exit 1

OLDNAME=$1
NEWNAME=$2

BASEDIR="e:/code/www.sampleserver.com" #can't use /cygdrive/e for win32 java tools
FTPUSER=someguy
FTPPASS=crazycrypticpassword

JAVADIR=${BASEDIR}/WEB-INF/lib/${NEWNAME}/
JSPDIR=${BASEDIR}/jsp

cd ${BASEDIR}/WEB-INF/lib

mv $OLDNAME $NEWNAME

perl -pi -w -e "s#${OLDNAME}.consultant#${NEWNAME}.consultant#g" `find $JAVADIR -name '*.java'`
perl -pi -w -e "s#${OLDNAME}.consultant#${NEWNAME}.consultant#g" `find $JSPDIR -name '*.jsp'`

perl -pi -w -e "s#${OLDNAME}.coolservlets#${NEWNAME}.coolservlets#g" `find $JAVADIR -name '*.java'`
perl -pi -w -e "s#${OLDNAME}.coolservlets#${NEWNAME}.coolservlets#g" `find $JSPDIR -name '*.jsp'`

"$JAVAC" -cp ".;servlet.jar" `find $JAVADIR -name '*.java'`
#Can't use $JAVADIR here; puts full path in the jar
"$JAR" -cf stuff.jar `find $NEWNAME -name '*.java' -o -name '*.class'` #-uvf: add files; verbose, see files as added

ncftpput -u $FTPUSER -p $FTPPASS www.sampleserver.com WEB-INF/lib stuff.jar
ncftpput -u $FTPUSER -p $FTPPASS www.sampleserver.com jsp `find $JSPDIR -name '*.jsp'`

For more on debugging servlets, check here.
