Archive for the 'scripts' Category

Testing file integrity across an archive of data

I am currently cooking up a system, largely based on bash, to collect remote sensing data from the web. Since I’m using my personal ADSL connection, I can expect many corrupted downloads. I made a little script to check file integrity and trigger the downloads again.

First, how do you know whether a file is corrupted or not? Two techniques: either you collect the error codes from your download software (ftp here) and log them somewhere to try again, or you assess the integrity of a file simply by scanning it. Let’s consider the second case.

The central problem is that you may have various kinds of files, so there is no single method that checks them all. For example, if we download MODIS files we get an xml and an hdf file. For a given file, the script must first guess the type of file, then choose an ad-hoc function for checking it. We assume here that the file type can be deduced from the file extension.

To get the file extension from a full file path, simply strip the directory part and keep what follows the last ‘.’, which is a job for sed:

echo $fileFullPath | sed 's/^.*\///' | sed 's/^.*\.//' | tr '[:upper:]' '[:lower:]'

The command

sed 's/^.*\///'

removes the directory part by substituting any run of characters (.*), anchored at the beginning of the string (^) and ending with a slash, with nothing. The second sed then removes everything up to the last dot (remember to escape the dot: \.), and tr lowercases the result.
Note that the regular expression flavor of your system may give different results: give it a few tries.
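If you prefer to avoid sed, bash parameter expansion does the same job without external commands (except tr for the case conversion). A minimal sketch; the path is a hypothetical example:

```shell
fileFullPath='/some/dir/MOD09A1.A2008001.HDF'    # hypothetical example path
name=${fileFullPath##*/}                         # strip the directory part
ext=${name##*.}                                  # keep what follows the last '.'
ext=$(echo "$ext" | tr '[:upper:]' '[:lower:]')  # lowercase it
echo "$ext"                                      # -> hdf
```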

Now we need to call the appropriate test routine as a function of the detected file extension: a simple case statement will do the job. Finally, let’s wrap everything up in a single function (selectFunc): call it with a file name, and it returns the name of the test function to call.

function selectFunc(){
# receives a file name and decides which integrity test function applies
    if [ $# -ne 1 ]; then
	echo "selectFunc expects a single parameter. Exit." >&2
	return 1
    fi
    selector=$(echo "$1" | sed 's/^.*\///' | sed 's/^.*\.//' | tr '[:upper:]' '[:lower:]')
    case $selector in
	xml) testFunc='doTestXML';;
	hdf) testFunc='doTestImg';;
	tif | tiff ) testFunc='doTestImg';;
	*) testFunc='';;
    esac
    # return the name of the selected function
    echo $testFunc
}

We can see that there is a pending problem with files without an extension, like the ENVI native file format (it does not require an extension, only a companion header file). To improve this situation, you can either force an extension on this kind of file (like .bil or .bsq for ENVI files), or handle the case of a missing extension with additional tests. For example, one could imagine calling gdalinfo in this case.

Now we just have to write some test functions.

xml files are rather easy to test: your system should have a tool for that. On Linux, consider XMLStarlet, whose command line is

xml val $file

(the code below uses xmlwf, expat’s well-formedness checker, which does the same job).

For images, you should be able to test most of them with gdalinfo.
The functions return 0 if everything was OK and 1 otherwise. Actually, the test functions simply pass on the exit code of the underlying tool (xmlwf and gdalinfo). If you use other kinds of tests, you may need to translate their exit codes.
At the end, we’ve got something like:

function doTestXML(){
    if [ $# -ne 1 ]; then
	echo "doTestXML expects a single parameter. Exit." >&2
	return 1
    fi
    xmlwf "$1" >& /dev/null
    return $?
}

function doTestImg(){
    if [ $# -ne 1 ]; then
	echo "doTestImg expects a single parameter. Exit." >&2
	return 1
    fi
    gdalinfo "$1" >& /dev/null
    return $?
}

Note that we send any output of the tools to /dev/null and keep only the return code ($?).
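If a checker uses more exotic exit codes (some tools exit with 2 for warnings, for instance), a small wrapper can translate them to the 0/1 convention used here. A hedged sketch; normalizeCode is a hypothetical name, not part of the script above:

```shell
function normalizeCode(){
# run the given command silently and translate any non-zero exit code to 1
    "$@" > /dev/null 2>&1
    if [ $? -ne 0 ]; then
        return 1
    fi
    return 0
}

normalizeCode sh -c 'exit 2'   # a checker exiting with code 2...
echo $?                        # -> 1
```

Call it as, e.g., normalizeCode gdalinfo "$file".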

Now, to use this code, first get the name of the test function to call:

myTest=$(selectFunc $file)

and call it on the file:

$myTest $file
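Putting it together, the re-download loop could look like the sketch below. The stub functions stand in for selectFunc and the doTest* functions defined above (so the example is self-contained), and the file list is hypothetical:

```shell
# stubs standing in for the real selectFunc / doTestXML / doTestImg above
selectFunc(){ case "${1##*.}" in xml) echo doTestXML;; hdf|tif|tiff) echo doTestImg;; esac; }
doTestXML(){ return 1; }   # pretend every xml file is corrupted
doTestImg(){ return 0; }   # pretend every image is fine

retryList=''
for file in a.xml b.hdf; do              # hypothetical file list
    myTest=$(selectFunc "$file")
    if [ -n "$myTest" ] && ! $myTest "$file"; then
        retryList="$retryList $file"     # schedule this file for a new download
    fi
done
echo "to download again:$retryList"      # -> to download again: a.xml
```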

A functional copy of the code is found there.

Building global gtopo30

GTOPO30, completed in late 1996, was developed through a collaborative effort led by the U.S. Geological Survey’s Center for Earth Resources Observation and Science (EROS). GTOPO30 is a global digital elevation model (DEM) with a horizontal grid spacing of 30 arc seconds (approximately 1 kilometer).

Gtopo30 is provided in tiles. Here is how to merge them into a single mosaic. First download from GTOPO30 the tiles you need. Then write a script like the one described below (save it and make it executable: chmod u+x make_gtopo30.sh).

First let’s define parameters.

inDir=/Volumes/TimeMachine/data/gtopo30/in
outDir=/Volumes/TimeMachine/data/gtopo30/out
outFile=$outDir/gtopo30_global.tif
tmpDir=/Volumes/TimeMachine/data/gtopo30/tmp
# ensure outDir and tmpDir are created
mkdir -p $outDir
mkdir -p $tmpDir

You must adapt these variables to the actual places where you saved the tiles (inDir), where you want the result saved (outDir, outFile), and where temporary files can go (tmpDir).

Then extract all the files with tar xvf (x=extract, v=verbose, f=file). The tar command is a bit dull or old-fashioned: to force it to extract into tmpDir while the tar files sit in inDir, I saved the current directory (orgDir=$(pwd)), moved to the target directory (tmpDir), and once everything was extracted, returned to the original directory:

orgDir=$(pwd)
cd $tmpDir
for tarFile in $inDir/*.tar
do
tar xvf $tarFile
done
cd $orgDir
Now we are ready to build the mosaic. First get the names of the files to process: DEM files have the .DEM extension. Save the list of these files into a variable and pass it to gdal_merge.py:

fileList=$(ls $tmpDir/*.DEM)
gdal_merge.py -o $outFile -of gtiff -co "compress=lzw" $fileList

Job done! Do not forget to delete the temporary directory:

\rm -r $tmpDir

For a global mosaic, I ended up with a 2.2 GB file (with internal compression). You can of course force the output resolution with the -ps option of gdal_merge.py.
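For instance, to force a 30 arc-second pixel size explicitly, the merge command becomes the following (the pixel-size values are illustrative; -ps takes the x then the y size in georeferenced units):

```shell
gdal_merge.py -o $outFile -of gtiff -co "compress=lzw" -ps 0.00833333 0.00833333 $fileList
```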

You can download the script from here.

Text file processing with IDL

A friend recently asked me how to process a text file. It was a collection of measurements of plant roots (size and mass) made at various depths in different pits, looking like this (after some formatting):

id   depth_name   mass   size
1    30-50        0.5    10
1    30-50        0.45   3
1    30-50        0.3    4.2
1    50-70        0.7    5
1    50-70        0.72   5.3
...  ...          ...    ...
2    30-50        0.8    4

This friend needed to sum up root lengths for a same pit and depth.
First you have to format your file properly: use a separator like a comma ‘,’ or a space (if unambiguous). Then read the file with the READ_ASCII command:

pro process_file
textfile='path/to_my/textfile.txt'
data=read_ascii(textfile, template=ascii_template(textfile))
end
read_ascii

expects a file template (you can tell it to skip some header lines, choose a column separator, and set the data type per column). In this case, I chose a string type for the second column.
Alright. The read_ascii command stores the text file values in the data structure. In this case, data has the following fields:
data.field1 stores pits ids
data.field2 stores depths labels
data.field3 stores the mass column
data.field4 stores the roots length column.
Note that the field names (‘field1’, ‘field2’, …) are set by IDL; you can’t change them.

Let’s say we want the list of pit ids:
listpits=data.field1[uniq(data.field1)]
The uniq function returns the position of the last element of each run of consecutive identical values in an array (so it assumes grouped or sorted data).
Let’s loop over the list of pit ids; for each pit id, get the list of unique depth labels (again!), and for the selected lines sum up the root lengths:

for ipid=0, n_elements(listpits)-1 do begin
  ; now get list of depths NAMES
  wherePits = where(data.field1 EQ listpits[ipid])
  allDepths = data.field2[wherePits]
  listDepths = allDepths[uniq(allDepths)]

  ; now for this pit id, sum for each type of depths
  for idepths=0, n_elements(listDepths)-1 do begin
    ; get position of data to sum
    wts = where( (data.field1 eq listpits[ipid]) AND $
           (strcmp(data.field2, listDepths[idepths])) )
    print, 'pits: ', strcompress(listpits[ipid]), ', at depth: ', $
           listDepths[idepths], ' cm total root length is ', $
           total(data.field4[wts])
  endfor
endfor

Easy, no?
The complete program is below:

pro process_file

textfile='E:\field_data.csv'

;myTemplate=ascii_template(textfile)
;save, myTemplate,filename='E:\myTemplate.sav'
restore, 'E:\myTemplate.sav'
data = read_ascii(textfile, template=myTemplate)

; now data are in data.field1, data.field2 etc.
; to see: help,data,/structure
data.field2=strcompress(data.field2)
; what is the list of id (i.e. of pits)?
; Caution: assume list already sorted, i.e. pits id are not mixed...

; get list of pits id
listpits=data.field1[uniq(data.field1)]

; loop over the list of pits
for ipid=0, n_elements(listpits)-1 do begin
  ; now get list of depths NAMES
  wherePits = where(data.field1 EQ listpits[ipid])
  allDepths=data.field2(wherePits)
  listDepths=allDepths[uniq(allDepths)]

  ; now for this pit id, sum for each type of depths
  for idepths=0, n_elements(listDepths)-1 do begin
    ; get position of data to sum
    wts = where( (data.field1 eq listpits[ipid]) AND $
          (strcmp(data.field2, listDepths[idepths])) )
    print, 'pits: ', strcompress(listpits[ipid]), ', at depth: ', $
         listDepths[idepths], ' cm total root length is ', $
         total(data.field4[wts])
  endfor
endfor
end

python-gdal: simple file processing

Here is a simple example of image processing with python and gdal.

Say we want to make a mask from an image: 0 where the data are below a threshold (10 here), 1 elsewhere.
The strategy is the following:

  • open the input file
  • create an output file, geotiff, byte
  • read the input file line by line
  • test values and write to the output

First, let’s import some libraries. I use the following snippet in all of my code, so it should work with most gdal/numpy setups:

try:
    from osgeo import gdal
    from osgeo.gdalconst import *
    gdal.TermProgress = gdal.TermProgress_nocb
except ImportError:
    import gdal
    from gdalconst import *
try:
    import numpy as N
    N.arrayrange = N.arange
except ImportError:
    import Numeric as N
try:
    from osgeo import gdal_array as gdalnumeric
except ImportError:
    import gdalnumeric

Now let’s define the file names and the threshold value. The input need not be a geotiff: gdal should handle most formats for you.

file='/data/demo/input.tif'
outfile='/data/demo/output.tif'
value=10

The lines below open the input file and get its width and height.

fid = gdal.Open(file, GA_ReadOnly)
ns = fid.RasterXSize
nl = fid.RasterYSize

The output file is created with the same dimensions, a single band of type Byte (GDT_Byte), the format forced to geotiff, and options left empty. Note that options is an array of strings; we’ll see in another post what can be done with it. The output file gets the same projection and geotransform as the input. The geotransform is an array of floats giving the upper-left corner coordinates and the pixel size.

outDrv = gdal.GetDriverByName('GTiff')
options=[]
outDs  = outDrv.Create(outfile, ns, nl, 1, GDT_Byte, options)
outDs.SetProjection(fid.GetProjection())
outDs.SetGeoTransform(fid.GetGeoTransform())

Preparation is done; it is now enough to process the input file line by line.
At each iteration the output line is reset to 0 (the default value), and the places where the data exceed the given threshold are set to 1.

Note that ReadAsArray returns a 2-D array. N.ravel turns it into a 1-D array, on which operations like dataOut[data>value]=1 can be applied. To write the array back to the file you MUST give it 2 dimensions again, which is achieved with dataOut.shape=(1,-1).

for il in range(nl):
    data = N.ravel(fid.GetRasterBand(1).ReadAsArray(0, il, ns, 1))
    dataOut = N.zeros(ns)  # a fresh zero line each iteration (reusing one array would keep old 1s)
    dataOut[data > value] = 1
    dataOut.shape = (1, -1)
    outDs.GetRasterBand(1).WriteArray(N.array(dataOut), 0, il)

Job done!

Is a script running?

On a *nix system, it is often desirable to ensure that a script runs as a singleton (I borrow the C++ term). For example, if a process is scheduled to run every 10 minutes but for some reason lasts 11, you surely don’t want it started again.

There are some tricks, like locking a socket and checking its status before running, but that imposes quite a lot of programming, maybe a bit difficult for a shell script. A simpler way is to trigger the script from another one (let’s call it singleton) that is in charge of checking whether the script (cunningly named myCommand) is already running, like:

singleton myCommand

What should singleton do? Well, there are two cases: either myCommand is an executable run directly by the system, or it is a script. If it is an executable, then pidof lets you check whether it is running:

pidof -s -x myCommand

Now, if myCommand is a script starting with a shebang (#!/bin/bash or #!/usr/bin/env python, etc.), pidof won’t find anything, since the program actually running is bash or python and myCommand is only a parameter. Check it with ps -eaf | grep myCommand and you’ll see:

/bin/bash myPath/myCommand.sh

To check whether myCommand is executing, the simplest is to list the processes and grep for myCommand.sh. Don’t forget an inverse grep to exclude grep itself (grep -v grep):

result=$(ps -eaf | grep -v grep | grep myCommand)

if [ -z "$result" ]; then
    myCommand
fi

Of course, this selection of processes from ps is rather loose: singleton won’t run anything if you are merely editing the script (e.g. vi myCommand.sh) or simply displaying it (e.g. cat myCommand.sh), since those also match. It is up to you to tighten the restriction.
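As an alternative to grepping ps, a classic shell trick is an atomic lock directory: mkdir fails if the directory already exists, so it can serve as the lock alluded to above. A minimal sketch; the lock path is an arbitrary choice:

```shell
lockDir=/tmp/myCommand.lock        # arbitrary lock location
if mkdir "$lockDir" 2> /dev/null; then
    trap 'rmdir "$lockDir"' EXIT   # release the lock when we exit
    echo "lock acquired, running myCommand"     # replace the echo with: myCommand
else
    echo "myCommand already running, nothing to do"
fi
```

Since mkdir is atomic, two instances started at the same moment cannot both acquire the lock.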