Guide to using CreateBatchFiles


This module creates a set of Matlab scripts (m-files) or mat-files that
can be submitted in parallel to a cluster for faster processing. This
module should be placed at the end of an image processing pipeline.

Before using this module, you should read Help -> Getting Started ->
Batch Processing, which explains how to set up your cluster for batch
processing.

Settings:
Scripts or Files: If your cluster has Matlab licenses for every node, you
can produce a script file (m-file) for each batch of images. If it does
not, you can produce mat-files, which will be read by the compiled
CPCluster program. For more information, please read Help -> Getting
Started -> Batch Processing.

Batch Size: This determines how many images will be analyzed in each
batch. If you do not want to split a job but still want to send it to the
cluster, so that it does not tie up a computer you might be using, set
the batch size to a very large number (more than the total number of
cycles). This creates one large job which can be submitted to the
cluster. In general, you do not want your batch size to be too large: if
one image fails, the whole batch will stop and will have to be restarted
from the beginning. With a smaller batch size, the job that failed will
not take as long to re-run.
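
As a rough illustration of how the batch size divides up a run, the
following Python sketch computes which cycles would fall in each batch.
The function name batch_ranges and its parameters are hypothetical and
are not part of CellProfiler. It also assumes that batches start at
cycle 2, because the first cycle is processed on the local computer;
the module's actual numbering may differ.

```python
def batch_ranges(total_cycles, batch_size, first_cycle=2):
    """Return inclusive (start, end) cycle pairs, one per batch.

    first_cycle=2 is an assumption: cycle 1 is taken to have been
    processed locally before the batch files are written.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ranges = []
    start = first_cycle
    while start <= total_cycles:
        # The last batch may be shorter than batch_size.
        end = min(start + batch_size - 1, total_cycles)
        ranges.append((start, end))
        start = end + 1
    return ranges


# Ten cycles in batches of four: the final batch holds one cycle.
print(batch_ranges(10, 4))
# A batch size larger than the number of cycles gives a single job.
print(batch_ranges(10, 1000))
```

With a batch size larger than the number of cycles, the sketch returns
a single range, which corresponds to the "one large job" case described
above.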

Batch Prefix: This determines the prefix for all the batch files.

CellProfiler Path: Here you must specify the exact location of
CellProfiler files as seen by the cluster computers.

Other Paths: You can either specify the exact paths as seen by the
cluster computers, or you can enter a period (.) to use the default image
and output folders. The last two parameters allow you to use the default
image and output folders but replace the beginning of the path. For
example, when starting on a PC and going to a Linux machine, the path may
be the same except for the first part:

PC: \remoteserver1\cluster\project
Linux: /remoteserver2/cluster/project

In this case, for the local machine you would type “\remoteserver1” and
for the remote machine you would type “/remoteserver2”. Currently, the
converted path is always written in Linux and Macintosh format, using
forward slashes (/).
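
The prefix swap described above can be sketched in Python as follows.
This is only an illustration of the idea, not the module's own code:
the function name to_cluster_path and its arguments are hypothetical,
and the sketch assumes the local path begins with exactly the local
prefix you typed.

```python
def to_cluster_path(local_path, local_prefix, cluster_prefix):
    """Swap the leading part of a path and use forward slashes.

    Hypothetical helper: replaces local_prefix at the start of
    local_path with cluster_prefix, then converts every backslash
    to a forward slash (Linux and Macintosh format).
    """
    if not local_path.startswith(local_prefix):
        raise ValueError("path does not start with the local prefix")
    remainder = local_path[len(local_prefix):]
    return (cluster_prefix + remainder).replace("\\", "/")


# The example from the text: PC path on the left, Linux path out.
print(to_cluster_path(r"\remoteserver1\cluster\project",
                      r"\remoteserver1",
                      "/remoteserver2"))
```

Raising an error when the prefix does not match is a design choice of
this sketch; it makes a mistyped local prefix visible instead of
silently producing an unchanged path.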

How it works:
After the first cycle is processed on your local computer, batch files
are created and saved at the pathname you specify. Each batch file is
named in the form Batch_X_to_Y.m, where X is the first cycle to be
processed by that batch file and Y is the last. (The Batch_ prefix can be
changed by the user.) There is also a Batch_data.mat file that each
script needs access to in order to initialize the processing.
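
If you want to check which cycles a set of batch files covers, the
naming scheme can be parsed with a short script. The Python sketch
below is a hypothetical helper, not something shipped with
CellProfiler; it assumes file names follow the Batch_X_to_Y.m pattern
exactly and treats anything else (such as Batch_data.mat) as not being
a batch script.

```python
import re


def parse_batch_name(filename, prefix="Batch_"):
    """Return (first_cycle, last_cycle) for a batch script name.

    Returns None when the name does not match <prefix>X_to_Y.m.
    The default prefix mirrors the module's default; pass your own
    if you changed the Batch Prefix setting.
    """
    pattern = re.escape(prefix) + r"(\d+)_to_(\d+)\.m"
    match = re.fullmatch(pattern, filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


names = ["Batch_2_to_11.m", "Batch_12_to_21.m", "Batch_data.mat"]
for name in names:
    print(name, parse_batch_name(name))
```

Sorting the parsed pairs and comparing each start against the previous
end plus one is a simple way to spot a missing batch file before
submitting jobs.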

After the batch files are created, they can be submitted individually to
the remote machines. Note that the batch files and the Batch_data.mat
file might have to be copied to the remote machines in order for them to
have access to the data. The output files will be written to the
directory from which the batch files are run, which may or may not be the
directory where the batch scripts are located. The details of how remote
jobs are started vary from site to site. Please consult your local
cluster experts.

After batch processing is complete, the output files can be merged by the
Merge Batch Output module. Merging is not recommended if your output
files are huge, because the merged file may be too large to open on your
computer. For the simplest behavior in merging, it is best to save output
files to a unique and initially empty directory.

If the batch processing fails for some reason, the handles structure in
the output file will have a field BatchError, and the error will also be
written to standard output. Check the output from the batch processes to
make sure all batches completed. Batches that fail for transient reasons
can be resubmitted.