SaveImages in CPCluster


#1

Hello, I am as excited about CP as I was the first day I used the program!
My question today: I am using a CellProfiler pipeline on a Linux cluster. I always like to save a few outlined images as visual controls of the image analysis.
The SaveImages module lets you define the path where the images will be saved. From the cluster, this path is obviously different from the one on the local machine. At least in the June 08 version of CellProfiler, the job then crashed because the pathnames did not make sense…

I have always worked around this by modifying the CreateBatchFiles module: I applied the same pathname modification (strrep…) to the SaveImages paths that is already applied to the other directories. This of course worked fine. I was wondering whether there is an “official” solution to the problem of saving images from the cluster… Today I downloaded the newest CellProfiler version; I probably missed something, but I did not find one… Of course, if there is no “official” solution, I can modify the CreateBatchFiles module again, but I think I am not the only user who wants to save images from the cluster.

Thank you very much, Benjamin


#2

Hi Benjamin,

Indeed a solution is in place for just this occasion. In SaveImages, there is an oft-overlooked setting toward the bottom, namely “Update file names within CellProfiler?”. Set this to Yes, and CP will save the FileList and Pathname variables to handles.Pipeline, and if you have set up CreateBatchFiles to translate the local<->cluster names, then SaveImages should work fine on the cluster too.
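
To make the local<->cluster translation concrete, here is a minimal sketch of a prefix substitution of the kind described above. It is written in Python purely for illustration and is not the CreateBatchFiles code; the function name `translate_path` and both root directories are hypothetical.

```python
def translate_path(path, local_root, cluster_root):
    """Swap a leading local_root for cluster_root; return other paths unchanged."""
    local = local_root.rstrip("/")
    # Match only whole path components, so "/Volumes/imaging2" is not
    # mistaken for something under "/Volumes/imaging".
    if path == local or path.startswith(local + "/"):
        return cluster_root.rstrip("/") + path[len(local):]
    return path


# Hypothetical roots, chosen only for illustration.
LOCAL_ROOT = "/Volumes/imaging"
CLUSTER_ROOT = "/cluster/imaging"

print(translate_path("/Volumes/imaging/plate1/outlines", LOCAL_ROOT, CLUSTER_ROOT))
print(translate_path("/Users/me/elsewhere", LOCAL_ROOT, CLUSTER_ROOT))
```

A path that does not start with the local root comes back unchanged, which is why a directory outside the translated roots still points at the local machine when the job runs on the cluster.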

An added benefit of setting this to Yes is that, if you use ExportToDatabase, the SaveImages file and pathnames will be saved as measurements in your database.

Let us know how it goes.
Best,
David


#3

Hi David,

Thank you for the quick reply! Before I can test, I have to work through some cluster issues. We are working on installing SciPy on the cluster. As a temporary workaround, I commented out the lines dealing with the Batch_data.mat variable and just entered the number_sets variable. Now the script finished; however, the “print” command just “prints” the line with the bsub command on the screen and no job is submitted. … It looks to me as though this is unlikely to be an issue with the missing SciPy library… I hope I am not missing something very obvious…
Best, Benjamin


#4

Hi Benjamin,

You’re almost there – it sounds like you just need to type “| sh” (i.e. pipe the output to shell) after the BatchRunner command.

e.g., running the BatchRunner command on its own will just display the bsub commands, whereas running it with “| sh” appended will actually submit the bsub commands to the shell.
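
The difference between displaying command lines and piping them to a shell can be sketched as follows. This is not the BatchRunner script and does not reproduce its arguments: `build_commands` is a hypothetical helper, and `echo` stands in for `bsub` so that nothing is actually submitted. It assumes a POSIX `sh` is available.

```python
import subprocess


def build_commands(batch_ranges):
    """Return one shell command line per batch (echo stands in for bsub)."""
    return [f"echo submitting image sets {start} to {end}"
            for start, end in batch_ranges]


commands = build_commands([(1, 10), (11, 20)])
script = "\n".join(commands) + "\n"

# Printing only displays the command lines; nothing is executed.
print(script, end="")

# Feeding the same text to sh on standard input is what "| sh" does:
# the shell reads each line and runs it.
result = subprocess.run(["sh"], input=script, capture_output=True,
                        text=True, check=True)
print(result.stdout, end="")
```

Running the script without the pipe first is a convenient dry run: you can read the generated command lines before committing to a submission.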

David


#5

Hi David,

The last solution worked! BatchRunner has now submitted the jobs. Just to be sure: all OUT.mat files will now be saved in a “status” folder, together with the DONE.mat files. This is a bit counterintuitive, since they are actually not “status” but output (of course, I can change the BatchRunner file; I just want to understand how this is supposed to work). One more question: no matter what the output file is called, there is always a “Batch_data.mat” file and all “…OUT.mat” files are called “Batch…something”. That way, the filenames carry no information about the analysis. I miss the option to enter the name of the batch files, which was available in the CreateBatchFiles module in CP 6 months ago (I am just trying to give friendly feedback).
Back to my first question: SaveImages with the “Update file names within CellProfiler?” option set to “Yes”. Many things worked:

  • in handles.Pipeline a new field (PathNameCellOutlineImage) is generated for the right images, and it contains the correct pathnames
  • when the CreateBatchFiles module runs, this pathname is correctly converted to one the cluster can understand
  • the first image is saved from the local machine

What did not work is the SaveImages module itself on the cluster. It crashed: pathnames not found.
I entered (later this will be done by a MATLAB script) the whole pathname (‘FileDirectory’) in the right field when starting CP. I checked the code for SaveImages: regardless of whether this field exists in handles.Pipeline, FileDirectory is always used, and not the converted path in .Pipeline. I realise I could have entered ‘.CellsOutline’ in the FileDirectory field and it would have worked, but I do not want to save my images in the output folder, but (stubbornly) in a completely different place (and the “Update file names within CellProfiler?” option also does not seem to be necessary for this). Of course, I am probably missing something really obvious again…
Thank you very much, Benjamin

#6

Hi Benjamin,

I believe you have come across a limitation in the SaveImages code: the local-to-cluster substitution in CreateBatchFiles is only performed for the default input and output directories. If you request a different directory (one that does not include a period or an ampersand), it will remain referenced to the local machine.

One way around this is to use relative paths, i.e., double periods to indicate the directory immediately above the current directory. So if you had (for example) “.../CellsOutline”, the first period indicates the default directory, and the next two will move you one directory above and then over to the CellsOutline directory.

You could use a combination of these periods to place you elsewhere on the file system, but you’ll always need to start with respect to the default input or output directory.
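
The reason a relative path survives the move to the cluster can be sketched with ordinary POSIX path handling. This is only an illustration: it does not reproduce CellProfiler's own parsing of the period and ampersand shortcuts, and the directory names and the helper `resolve_against_default` are hypothetical.

```python
import posixpath


def resolve_against_default(default_dir, relative):
    """Join a relative path onto the default directory and collapse '..'."""
    return posixpath.normpath(posixpath.join(default_dir, relative))


# Hypothetical default output directories for the same share on each machine.
local_default = "/Volumes/imaging/project/output"
cluster_default = "/cluster/imaging/project/output"

# The same relative path resolves correctly on both machines, because
# only the default directory differs between them.
for default_dir in (local_default, cluster_default):
    print(resolve_against_default(default_dir, "../CellsOutline"))
```

Since the default directory is the part that gets translated, anything expressed relative to it follows along without needing its own substitution.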

Regards,
-Mark


#7

Hi Benjamin

Following on Mark’s answer for your main question re: using relative paths (which does seem your best bet), if you’re on a Mac you can also be creative with symbolic links. If you create some links for directories on your local machine that mimic the cluster machine’s path, that may work too. We’ve not found a comparable solution on the PC.
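
As a rough sketch of the symbolic-link idea, the snippet below builds a cluster-style path inside a temporary sandbox and links it to a "local" directory. All directory names are hypothetical; in a real setup the link would sit at the actual cluster path on your local file system, which may require administrator rights.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as sandbox:
    # The directory that really holds the images on the local machine.
    real_dir = os.path.join(sandbox, "local_images")
    os.makedirs(real_dir)

    # A path that mimics the cluster's layout, ending in a symbolic link.
    mimic_parent = os.path.join(sandbox, "cluster", "imaging")
    os.makedirs(mimic_parent)
    link = os.path.join(mimic_parent, "project")
    os.symlink(real_dir, link)

    # Writing through the cluster-style path lands in the real directory.
    with open(os.path.join(link, "outline.txt"), "w") as handle:
        handle.write("saved via the cluster-style path")

    found = os.listdir(real_dir)
    print(found)
```

With such a link in place, the same pathname is valid on both machines, so no substitution is needed for that directory.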

[quote=“Benjamin”]

The last solution worked! BatchRunner has now submitted the jobs. Just to be sure: all OUT.mat files will now be saved in a “status” folder, together with the DONE.mat files. This is a bit counterintuitive, since they are actually not “status” but output (of course, I can change the BatchRunner file; I just want to understand how this is supposed to work). [/quote]

Glad it’s working. And yes, the OUT.mat files’ placement is counterintuitive, and perhaps we should make a separate output folder for them. In practice however, these OUT.mat files are only needed if you want to use, say, Excel or Matlab to inspect the output, and are not useful if you ExportToDatabase. So in most cases in which you need to utilize a computing cluster, and hence CreateBatchFiles, the output will be large enough that a database is necessary to handle the output, and you can opt to not output these files at all. Though I understand that there are some intermediate cases in which it is useful to use these OUT.mat files.

[quote=“Benjamin”]
One more question: no matter what the output file is called, there is always a “Batch_data.mat” file and all “…OUT.mat” files are called “Batch…something”. That way, the filenames carry no information about the analysis. I miss the option to enter the name of the batch files, which was available in the CreateBatchFiles module in CP 6 months ago (I am just trying to give friendly feedback).[/quote]

We do appreciate the feedback, but we felt that there were too many options in CreateBatchFiles. And around the same time, we added a new feature (see the DataTool “SubmitBatch”) which simplifies the whole batch submission process substantially anyway, using a GUI and webserver to submit and handle jobs. You (or your IT dept) would need to set up the webserver for your image files first however, so there is some overhead cost. But for new users, we have found that this process is much more straightforward and user-friendly than resorting to the command-line (as much as this is preferable in many cases).

Best,
David