ExportToSpreadsheet output location headless command line

exporttospreadsheet
createbatchfiles

#1

Hi,

I’m running CP 2.1.0 by setting up the pipeline on Windows and Mac, then running the batch file headless on a bunch of groups in parallel on CentOS 6. It generally works fine aside from a few issues I’ve worked around. One of those is now causing me some difficulty in eliminating the most egregious manual-intervention steps from our processing pipeline: I have yet to find a way to export CSV output in headless mode such that the files don’t overwrite each other. The input and output command-line location specifiers never seem to have any effect for me, and I can’t find any other way of getting the CSV files to save anywhere other than a single shared location, where they overwrite each other.

So far I’ve been reading results from the .h5 file, which has a number of benefits, including that it’s updated incrementally during the run and so is often usable even if a run is incomplete or terminated partway through. But the CSV output has a number of benefits that I can’t seem to get from the .h5 file, including the metadata for images being attached to the object-level data.

Here’s an example command that I run on centos:
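(The original command was lost from the post; based on the flags used later in this thread, a representative invocation might have looked like the dry-run sketch below. The paths and metadata values are placeholders, not the poster’s real ones, and the command is only printed, not executed.)

```shell
# Hypothetical reconstruction -- the actual command was lost from the post.
# Paths and metadata values are placeholders.
BATCH="/path/to/screens/exp/runX/Batch.h5"

# -p: batch/pipeline file, -c: run headless (no GUI), -r: run immediately,
# -g: restrict this job to a single metadata group
CMD="cellprofiler -p $BATCH -c -r -g Metadata_Plate=1,Metadata_Site=1"
echo "$CMD"   # dry run: print the command rather than executing it
```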

If I include an output specifier like this, it’s consistently ignored – the CSV files are still saved in whatever “default output folder” was specified in the GUI when saving the batch file (even if I’ve created this set of specified output folders before running the command):
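(This example was also lost from the post; a hypothetical version of the variant with the output specifier, again with placeholder paths and printed rather than executed, would be:)

```shell
# Hypothetical reconstruction of the variant with -o; paths are placeholders.
OUT="/path/to/output/pilot2_plate1_site1"

# -o is supposed to set the Default Output Folder for this run, but as
# discussed later in the thread, CP 2.1.0 ignores it when running from
# a batch file.
CMD="cellprofiler -p /path/to/Batch.h5 -c -r -o $OUT -g Metadata_Plate=1,Metadata_Site=1"
echo "$CMD"   # dry run
```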

If it’s helpful, I can put together a test case with the pipeline we’re trying to run, the batch file, and the image files to recreate the problem on your side; it will just take me a bit more time. We’re on the support plan, and I’m more than happy to use hours to resolve this – I just wanted to handle it in public on the forum, if that makes sense, to benefit future users who might encounter the same thing.

Cheers,
Blake


#2

Hi Blake,

Thanks for the good question, and for all the detail. I passed it on to Lee, our main software engineer, who is best positioned to answer.

Cheers,
David


#3

Hi Blake,
Two suggestions. First of all, if you choose to save your .csv files to the default output folder, you can specify the location of the default output folder on the command line using the -o switch. For instance:

for p in $(seq 2); do
    for s in $(seq 9); do
        echo $p $s
        mkdir ..../output/pilot2_plate${p}_site${s}
        cellprofiler -p ..../screens/exp/runX/Batch.h5 -c -r --jvm-heap-size=2g \
            -g "Metadata_Plate=$p,Metadata_Site=$s" \
            -o ..../output/pilot2_plate${p}_site${s} \
            ..../output/pilot2_plate${p}_site${s}.h5 &
    done
done

The second method would be to use custom names for the CSV files and include the plate and site metadata as part of the name. ExportToSpreadsheet has some help on how to do this; below is a screenshot of an ExportToSpreadsheet module that does it. If you use grouping (as you have) and include the grouping keys in the file name via metadata, each of your jobs will write its CSV to a different location.
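(As an illustration of the custom-name approach – assuming the grouping metadata keys are named Plate and Site, and using the metadata-tag syntax described in the module help – a custom per-object file name in ExportToSpreadsheet might look like:)

```text
Nuclei_\g<Plate>_\g<Site>.csv
```

With a pattern like this, each grouped job writes to a distinct file name, so parallel runs no longer collide.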


#4

Excellent, thanks.

The -o flag has never worked for me, as I mentioned in the post (the default output folder remains unchanged no matter what I pass with -o), and I have no idea why. But your suggestion of using metadata in the exported CSV filenames looks like a winner – I had just never explored selecting the “No” option for exporting all data. And I now see the instructions for this approach are very thorough in the online module guide.

I do wish I could get the -o flag working, since specifying the output name for each object type is a little cumbersome while our pipeline is under pretty heavy development, which means the exported objects change from time to time. My current approach is to run the pipeline on a test set, outputting all data as CSV, to see what the CSV files are, and then go back and set up the appropriate per-object exports with metadata names before running the pipeline on the full dataset. Have you run into other folks with issues changing the input or output directories on the command line, and if so, do you recall any steps they used to resolve them?

Thanks,
Blake


#5

I’ve confirmed that in 2.1.0 (the release), running from a batch file ignores the -o flag. However, you can save the pipeline so that it contains the file list (so you don’t need the -i flag), and running that will respect the -o flag. You can save the pipeline this way by selecting “CellProfiler pipeline with filelist” from the “Save as type” drop-down.

Note that this isn’t the same as a batch file, since it doesn’t include the path conversion done by CreateBatchFiles (so it may not run on the cluster), but it gets you a bit closer. I’ve filed a bug report on this here: github.com/CellProfiler/CellPro … ssues/1146
-Mark


#6

Hi all,

I am reviving this thread because I have a very similar question. I am trying to analyse 2268 images, grouped into 108 wells (21 images/sites per well), headless. I have set ExportToSpreadsheet to save the spreadsheets to a Default Output Folder sub-folder named by the Well metadata (to prevent output files being overwritten when run in parallel on a cluster). However, I am getting the output folders with no files in them.

The code I used was:
output="/home/kjf/projects_kjf/Output_Donor1/well$idx" && mkdir $output
cellprofiler -r -c -i /home/kjf/projects_kjf/Donor1 -o $output -g Metadata_Well=$idx -p /home/kjf/Cell_Painting/FastAnalysis_Donor1.cpproj $range

This gave me output folders 1–108, each with a folder for each well inside… but with no output files in them.

I am now trying:

cellprofiler -r -c -i /home/kjf/projects_kjf/Donor1 -o /home/kjf/projects_kjf/Output_Donor1 -g Metadata_Well=$w -p /home/kjf/Cell_Painting/FastAnalysis_Donor1.cpproj

This is run for set ranges of 21 images per node.
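(For reference, the per-well launch pattern from the first attempt can be sketched as a dry-run loop. The well IDs here are hypothetical A01-style placeholders, and the commands are echoed rather than executed:)

```shell
# Dry run: print one mkdir and one headless command per well instead of
# executing them. Well IDs are placeholders; substitute your real 108 wells.
PROJECT="/home/kjf/Cell_Painting/FastAnalysis_Donor1.cpproj"
INPUT="/home/kjf/projects_kjf/Donor1"
OUTBASE="/home/kjf/projects_kjf/Output_Donor1"

for well in A01 A02 A03; do
    out="$OUTBASE/well_$well"
    echo "mkdir -p $out"
    echo "cellprofiler -r -c -i $INPUT -o $out -g Metadata_Well=$well -p $PROJECT"
done
```

On a cluster, each iteration would typically become its own job submission rather than a loop on one node.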

Does anyone have suggestions on how I can specify my output folder so that the pipeline creates a folder for each well within my default output folder, Output_Donor1?

Thanks!!


#7

Did your CP crash? Were there any (other than the usual) errors? Have you run this pipeline successfully in the GUI before? Have you run CP successfully from the command line before? It’ll be easier to help troubleshoot this with more information. Thanks!


#8

No, it did not crash, and there were no unusual errors. The pipeline has worked successfully in the GUI and headless before. However, in batch it overwrites the output .csv spreadsheet – hence adding the grouping and naming the output sub-folder per well – but now we don’t get output files…


#9

Can you upload the pipeline? Thanks.


#10

FastAnalysis_Donor1.cpproj (7.8 MB)


#11

thanks 🙂


#12

Hmmm, I don’t see anything overtly wrong in the pipeline setup; Groups seems to be set up reasonably, and there’s no CreateBatchFiles or anything else that’s a common source of trouble.

These are therefore my best guesses for things to try and places where there might be issues:
- What happens if, rather than using the .cpproj file, you run the .cppipe instead?
- What happens if you omit the range, since you shouldn’t need it with the -g flag?
- What is the value of your $idx variable – an index or a well name? If you’re using it with Metadata_Well, it should be the latter.

Really, though, I’m not sure – the pipeline as you’ve sent it looks like it should create the well sub-folders as instructed, thanks to the ‘Default Output Folder sub-folder’ setting in ExportToSpreadsheet, even if you didn’t run it with the -g flag or any ranges. If none of that helps, you may want to create an issue on our GitHub so our software engineers can take a look. If some of it helps, please let us know which part, so that the next person can benefit! Thanks again, and sorry.


#13

Thanks for taking a look. The file type shouldn’t make a difference, since the pipeline runs, but I will try it.
I have tried removing the -g flag and just using the range, and that didn’t work.
The $idx variable is the well metadata, used to create those sub-folders (named by well metadata) in the default output dir, as specified in ExportToSpreadsheet. I think this is correct, then?

Do you know if I am perhaps using the wrong headless code?

Thanks again!


#14

Do you know if I am perhaps using the wrong headless code?

That should be the correct code, AFAICT.

I have tried removing the -g flag and just using the range, and that didn’t work.

Have you tried removing the range and keeping just the -g flag?

The idx variable is the metadata well

Good, just wanted to make sure.

I’m sorry, I wish I had a better idea of how to help, but I’m out of ideas.