Archive for the 'shell' Category

testing file integrity across an archive of data

I'm currently cooking up a system, largely based on bash, to collect remote sensing data from the web. Since I'm using my personal ADSL connection, I can expect many corrupted downloads. I made a little script to check file integrity and trigger the download again when needed.

First, how do you know whether a file is corrupted or not? Two techniques: either you collect the error codes from your download software (ftp here) and log them somewhere to try again, or you assess the integrity of a file simply by scanning it. Let's consider the second case.

The central problem is that you may have various kinds of files, so there is no single check that works for all of them. For example, if we download MODIS files we get an xml and an hdf file. For a given file, the script must first guess the file type, then choose an ad-hoc function to check that file. We assume here that the file type can be deduced from the file extension.

To get the file extension from a full file path, simply remove the file path and keep what you find after the last ‘.’, which is a job for sed:

echo $fileFullPath | sed 's/^.*\///' | sed 's/^.*\.//' | tr '[:upper:]' '[:lower:]'

The command

sed 's/^.*\///'

removes the file path by substituting any run of characters (.*), starting from the string beginning (^) and ending at the last '/', with nothing. Then anything up to the last dot is removed (remember to escape the dot: \.).
Note that the regular expression flavour on your system may give different results: give it a few tries.
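As a side note, bash parameter expansion can do the same job without spawning sed; a minimal sketch:

filename=${fileFullPath##*/}    # drop everything up to the last '/'
extension=${filename##*.}       # drop everything up to the last '.'
extension=$(echo $extension | tr '[:upper:]' '[:lower:]')   # lowercase it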

Now we need to call the appropriate test routine as a function of the detected file extension: a simple case statement will do this job. Finally, let's wrap everything up in a single function (selectFunc): call it with a file name, and it prints the name of the test function to call.

function selectFunc(){
# receives a file name and prints the name of the matching integrity test function
    if [ $# -ne 1 ]; then
        echo "selectFunc expects a single parameter. Exit." >&2
        return 1
    fi
    local testFunc=''
    local selector=$(echo $1 | sed 's/^.*\///' | sed 's/^.*\.//' | tr '[:upper:]' '[:lower:]')
    case $selector in
        xml) testFunc='doTestXML';;
        hdf) testFunc='doTestImg';;
        tif | tiff ) testFunc='doTestImg';;
    esac
    # print the selected function name (empty if the extension is unknown)
    echo $testFunc
}

We can see that there is a pending problem with files lacking an extension, like the ENVI native file format (it does not require an extension, only a companion text file). To improve this situation, you can either force an extension on this kind of file (like .bil or .bsq for ENVI files), or handle the missing-extension case with additional tests. For example, one could imagine calling gdalinfo in this case.
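For instance, here is a sketch of such a fallback in the case statement (purely illustrative: it just hands any unrecognized file to the gdalinfo-based test):

selector=$(echo $1 | sed 's/^.*\///' | sed 's/^.*\.//' | tr '[:upper:]' '[:lower:]')
case $selector in
    xml) testFunc='doTestXML';;
    hdf | tif | tiff) testFunc='doTestImg';;
    # unknown or missing extension: let gdalinfo have a try
    *) testFunc='doTestImg';;
esac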

Now we just have to write some test functions.

xml files are rather easy to test: your OS should have a tool for that. For Linux, consider XMLStarlet, whose command line is

xml val $file

For images, you should be able to test most of them with gdalinfo.
Each test function returns 0 if everything was OK and non-zero otherwise. Actually, the test functions simply pass along the exit code of the underlying tool (gdalinfo, and xmlwf, Expat's well-formedness checker, which stands in for XMLStarlet in the code below). If you use other kinds of tests, you may need to translate their exit codes.
In the end, we've got something like:

function doTestXML(){
    if [ $# -ne 1 ]; then
        echo "doTestXML expects a single parameter. Exit." >&2
        exit 1
    fi
    xmlwf $1 >& /dev/null
    return $?
}

function doTestImg(){
    if [ $# -ne 1 ]; then
        echo "doTestImg expects a single parameter. Exit." >&2
        exit 1
    fi
    gdalinfo $1 >& /dev/null
    return $?
}

Note that we send any output of these functions to /dev/null and only care about the return code ($?).

Now, to use this code, first get the name of the test function to call:

myTest=$(selectFunc $file)

and call it on the file:

$myTest $file
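Putting the pieces together, the checking loop could look like this minimal sketch, where downloadAgain is a hypothetical function that re-triggers the download of a corrupted file:

for file in *.xml *.hdf; do
    myTest=$(selectFunc $file)
    if [ -n "$myTest" ]; then
        # re-download the file if its test fails (downloadAgain is hypothetical)
        $myTest $file || downloadAgain $file
    fi
done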

A functional copy of the code is found there.

Removing the first 10 lines of a text file

How do you remove the first 10 lines of a text file? An easy job with sed (Stream EDitor):

sed '1,10d' myFile

Example: removing the first 10 lines of all text files in a directory.

for file in *.txt
do
sed '1,10d' $file > output_dir/new_${file}
done
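If you would rather edit the files in place, sed can do that too with the -i option (GNU sed on Linux and cygwin; BSD sed on Mac OS X wants an explicit, possibly empty, backup suffix):

sed -i '1,10d' myFile       # GNU sed
sed -i '' '1,10d' myFile    # BSD sed (Mac OS X)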

Further reading:

http://sed.sourceforge.net/sedfaq.html

http://www.gnu.org/software/sed/manual/sed.html

install gdal on Mac OS X


Mac OS X is built on BSD Unix and thus offers a terminal and bash for good old scripting.

To add gdal and its executables (gdal_translate, gdal_merge.py, etc.), first download the framework prepared by KyngChaos

http://www.kyngchaos.com/software:frameworks

and download Gdal Complete.dmg.

The adaptation done for Mac OS X is just perfect! To make it run:

mount the dmg, then double-click the installation package. Once the installation is done, you must set the PATH variable for bash.

Launch Terminal, then type

export PATH=/Library/Frameworks/GDAL.framework/Programs:$PATH

This command adds /Library/Frameworks/GDAL.framework/Programs to the search PATH.

If you want to save this setting, edit the hidden file .bash_profile in your home directory, add the above command line, and save. For beginners: the dot '.' in front of bash_profile marks a hidden file. To set the path immediately, type

source .bash_profile

Anyway, .bash_profile will be read again next time you run a terminal.
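To check that the GDAL programs are now found, you can ask one of them for its version:

gdalinfo --version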

Enjoy!

Making an image layer stack for time series processing

One often needs a layer stack image, at least for displaying a time series of values for a pixel. Have you seen that ENVI's make_layerstack function provided with the GUI sorts the file names in reverse order? Quite irritating. If you've got gdal, simply make the layer stack in one command line.

gdal_merge.py -of gtiff -separate -o layerstackname.tif file1 file2 file3 file4

I often make layer stacks of hundreds of images for processing time series, which I get in two command lines:

list=($(ls *_ndvi.tif))

gdal_merge.py -of gtiff -separate -o timeSeries.tif ${list[@]}

The list variable is actually an array (hence the outer parentheses). You can be more selective in the array definition by playing around with the ls command or adding a grep.
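For example, assuming a hypothetical naming scheme with the year somewhere in the file name, a grep keeps only the 2009 scenes:

# keep only the 2009 NDVI scenes (the naming scheme is hypothetical)
list=($(ls *_ndvi.tif | grep 2009))
gdal_merge.py -of gtiff -separate -o ndvi_2009.tif ${list[@]}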

Is a script running?

On a *nix system, it is often desirable to ensure that a script runs as a singleton (I borrow the C++ term). For example, if a process is scheduled to run every 10 minutes and for some reason lasts for 11, you surely don't want it to be started again.

There are some tricks, like locking a socket and checking its status before running, but I guess that imposes quite a lot of programming, maybe a bit difficult for a shell script. A simpler way is to trigger the script from another one (let's call it singleton) whose job is to check whether the script (cunningly named myCommand) is already running or not, like:

singleton myCommand

What should singleton do? Well, there are two cases: either myCommand is an executable directly run by the system, or it is a script. If it is an executable, then pidof lets you check whether it is running:

pidof -s -x myCommand

Now, if myCommand is a script starting with a shebang (#!/bin/bash or #!/bin/env python, etc.), pidof won't find anything, since the program actually running is bash or python and myCommand is only a parameter. Check it with ps -eaf | grep myCommand and you'll see:

/bin/bash myPath/myCommand.sh

To check whether myCommand is executing, the simplest way is to list the processes and grep the lines containing myCommand.sh. Don't forget to add an inverse grep on grep itself (grep -v grep):

result=$(ps -eaf | grep -v grep | grep myCommand)
if [ -z "$result" ]; then
    myCommand
fi

Of course, this selection of processes from ps is rather loose: as written, singleton won't run anything while you are editing the script (e.g. vi myCommand.sh) or simply displaying it (e.g. cat myCommand.sh), since those command lines match the grep too. It is up to you to set the right restriction.
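As an alternative to scanning ps, a lock directory gives an atomic test-and-set, since mkdir either succeeds or fails as a whole. A minimal sketch (the lock path is arbitrary, and a stale lock must be removed by hand if the machine crashes):

#!/bin/bash
LOCKDIR=/tmp/myCommand.lock   # hypothetical lock location
if mkdir "$LOCKDIR" 2> /dev/null; then
    # we got the lock; make sure it is released when we exit
    trap 'rmdir "$LOCKDIR"' EXIT
    myPath/myCommand.sh
else
    echo "myCommand seems to be running already." >&2
fi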

DIY backup system

I've got tons of satellite images and GIS data that I don't want to lose. But hard drives are not eternal.

My home-made backup system is a simple external USB2 hard drive: PC hard drives are becoming cheap, and for a few bucks you can buy a USB2 external case to plug one into. Now, the question is only how to manage it.

On any *nix system (Unix, Linux, Mac OS X, …) you can use rsync to make a fast backup of your repositories.
My backup hard drive is mounted on /Volumes/Archive on my Mac.
I wrote a very simple bash script:

#!/bin/sh -l
rsync -E -a -x -S --delete --progress --exclude-from=/Users/bubuitalia/exclude_from_rsync.txt /Users/bubuitalia /Volumes/archive/save

This script was saved in my home directory (/Users/bubuitalia) as backup.sh (do not forget to run chmod u+x backup.sh to make it executable). You just have to change the paths for your own installation. To run it, type ./backup.sh

This rsync command synchronizes the data in /Users/bubuitalia/ with the archive directory (/Volumes/archive/save).
The rsync command line has the following options:

  • -E : copy extended file attributes.
  • -a : archive mode.
  • -x : don't cross file system boundaries (omits all mount-point directories from the copy).
  • -S : try to handle sparse files in an efficient way.
  • --delete : delete extraneous files on the receiving side: if you delete something in your original data set, it will be deleted from the archive at the next synchronization. Use this option if you want to maintain a mirror copy of your system; it is also worth using to keep your archive from growing too large over time.
  • --progress : show progress during transfer.
  • --exclude-from=FILE : read exclude patterns from FILE.

Don't forget: every time you run the script, any change in the original data is applied to the backup. So if you delete a file and want to restore it, do not run the script! First retrieve the data from the archive (any data deleted on the source will be deleted on the archive at the next synchronization).
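Before a real run, especially with --delete, you can preview what rsync would do by adding the --dry-run (or -n) option:

rsync -E -a -x -S --delete --dry-run --exclude-from=/Users/bubuitalia/exclude_from_rsync.txt /Users/bubuitalia /Volumes/archive/save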

I made another file, /Users/bubuitalia/exclude_from_rsync.txt, where I list (one entry per line) the directories I don't want to save:

Music/
Movies/
Library/Caches/
.Trash/

To adapt this example to your own system, simply change the source and target directories.
The rsync home page is at http://rsync.samba.org/
and you can find some other examples at http://rsync.samba.org/examples.html

Creating dummy (empty) files

For testing shell tips (linux and cygwin) it is often handy to

  • work in a test directory
  • make some (tons of!) files

Ok, I suppose you can create directories (mkdir dirname). Now, you can use touch to create (empty) files:

touch a b c d

will create files a, b, c and d.

To create 200 files whose names start with file_, followed by a number and ending with .img, do

mkdir source
cd source
for ((num=0;num<200;num+=1)); do touch file_${num}.img ; done

Now you can test the shell tips on them.

(Screenshot: 200 files made with the touch command and a bash loop.)
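As a side note, if your bash is recent enough (version 3.0 or later), brace expansion creates the same 200 file names in a single line:

touch file_{0..199}.img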