Memory errors in CP 2.1.1


#1

Dear all,

I am trying to set up an environment for CP on our cluster [1]. After solving many problems and much mysterious behavior, our version of CP finally works in headless mode. However, when I send jobs to the scheduler, many of them finish with memory errors. I would be very happy if somebody could share their experience of setting up CP 2.X in a cluster environment.

My current workflow looks like this:

  1. I am interested in HCS data analysis. We use 384-well plates, with 9 images per well and 3 channels. A primary analysis usually requires processing ~50 plates. This means ~600k images (in ~200k image sets). Processing one set takes about 30 s, which means ~70 days of computation on a single machine (this is the reason we need a cluster).
  2. I have a CP pipeline which computes features, writes the results to a database, and writes mask files and some additional CSVs to disk for each site (each site is processed independently of the others, so this can happen in parallel).
  3. I created a Python script which reads the directories and creates the CSV file read by LoadData in CP (it sets up the correct metadata, etc.). This file contains ~200k lines.
  4. I tried many different ways of calling CP and have now settled on a rather old-fashioned one (without grouping by metadata; I specify the first and last image set to be processed). It looks something like:

python /cluster/apps/cellprofiler/2.1.1/x86_64/CellProfiler/CellProfiler.py --jvm-heap-size=1g --do-not-fetch -c -r -b -i /cluster/work/scr3/sstoma/analysis/TEST07/input/ -t /cluster/work/scr3/sstoma/tmp/ --project=/cluster/work/scr3/sstoma/analysis/TEST07/input/01.cpproj -o /cluster/work/scr3/sstoma/analysis/TEST07/output/ -f 61128 -l 61343
  5. I submit processes to our queue system with different -f/-l values so they are computed in parallel (roughly as in the sketch after this list).
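Conceptually, the submission script does something like the following (a simplified sketch, not our production code; it assumes an LSF-style bsub, 1-based inclusive -f/-l, and a chunk of 216 image sets per job, which matches the span in the example command above):

import subprocess

N_SETS = 200000      # total image sets (lines in the LoadData CSV)
CHUNK = 216          # image sets per job
CP = ("python /cluster/apps/cellprofiler/2.1.1/x86_64/CellProfiler/CellProfiler.py"
      " --jvm-heap-size=1g --do-not-fetch -c -r -b"
      " -i /cluster/work/scr3/sstoma/analysis/TEST07/input/"
      " -t /cluster/work/scr3/sstoma/tmp/"
      " --project=/cluster/work/scr3/sstoma/analysis/TEST07/input/01.cpproj"
      " -o /cluster/work/scr3/sstoma/analysis/TEST07/output/")

for first in range(1, N_SETS + 1, CHUNK):
    last = min(first + CHUNK - 1, N_SETS)
    # bsub flags (queue, memory request) are site-specific placeholders.
    subprocess.check_call(["bsub", "-R", "rusage[mem=2048]",
                           "%s -f %d -l %d" % (CP, first, last)])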

Now I have a few questions:

  1. To adapt CP to our cluster we had to configure Java to run with:

_JAVA_OPTIONS="-XX:ParallelGCThreads=1 -Xmx8000m -Xms4000m"

Why? When a Java virtual machine starts, it tries to reserve a huge chunk of memory, often exceeding physical memory; this is called memory over-commitment, and some argue it improves efficiency. On a workstation with 4 GB of physical memory this is not a problem. However, on a cluster, memory over-commitment is often not enabled (it is explicitly disabled on the cluster I use), so one can only allocate as much memory as there is physical memory. The settings above let us limit Java's appetite for RAM and its over-commitment behavior (and they should be consistent with the --jvm-heap-size=1g which we pass to CP).
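(For reference, on Linux you can check a node's over-commit policy yourself; a small sketch using the standard /proc interface:)

# Check the kernel's memory over-commit policy (Linux only).
# 0 = heuristic over-commit, 1 = always over-commit, 2 = never (strict accounting)
with open("/proc/sys/vm/overcommit_memory") as f:
    mode = int(f.read().strip())
print({0: "heuristic", 1: "always", 2: "never (strict)"}[mode])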
a) Do any of you do something similar? Does it work reliably? Are there any other options which can help in managing the RAM consumption of headless CP?
b) My pipeline does not need ImageJ. I guess the Bio-Formats file readers are the only beneficiaries of this memory. What is the rule of thumb for setting --jvm-heap-size? My images are 3 × ~10 MB TIFFs, and the number of objects is a few thousand at most.
c) I often get non-deterministic errors:

[code]Version: 2014-08-07T16:22:21 02e67c8 / 20140807162221
Picked up _JAVA_OPTIONS: -XX:ParallelGCThreads=1 -Xmx8000m -Xms4000m
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
...
from bioformats.omexml import OMEXML
...
  File "/cluster/apps/cellprofiler/2.1.1/x86_64/lib/python2.7/ctypes/__init__.py", line 353, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: Error occurred during initialization of VM
Could not reserve enough space for object heap
libjvm.so: cannot open shared object file: No such file or directory[/code]
I guess this is due to insufficient resources on the cluster for the job. Any ideas what I can change in memory management to get rid of these? This happens only at the beginning of job creation. It might be the reason that I need to request this gigantic heap and 16 GB of memory per job (which is a bit of a problem when I submit 1000 jobs).
d) When jobs are finished I see the following:

[code]Resource usage summary:

CPU time   :   8726.82 sec.
Max Memory :      1329 MB
Max Swap   :     13849 MB

Max Processes  :         3
Max Threads    :        63

The output (if any) follows:

Version: 2014-08-07T16:22:21 02e67c8 / 20140807162221
Picked up _JAVA_OPTIONS: -XX:ParallelGCThreads=1 -Xmx8000m -Xms4000m[/code]
The swap consumption scares me. Why did my jobs need ~14 GB of swap (again: I am processing sets of 3 images of ~10 MB each, and the number of objects is ~1000)?
e) In the output above: why does a job use 63 threads? I thought that CP run in headless mode uses only one thread. Am I wrong? If not, why do 63 threads seem to be initialized? (I see lines like these in the log:)

[code]stopping worker thread 43
stopping worker thread 44
stopping worker thread 45
stopping worker thread 46
stopping worker thread 47
stopping worker thread 48
Exiting the JVM monitor thread[/code]

  2. I tried to create small CSV files stored in different subdirectories (my CSV has ~200k lines…), one per process, and to use the -i parameter to specify the input dir (then I do not use the -f/-l params; all lines are processed). It works, but records in the database get overwritten (the image indices…). Is there any way to work around this problem? I also observe that processing the 200k-line CSV file takes a significant amount of time for each process.

  3. What is the advantage of using the .h5 file generated by CreateBatchFiles instead of a .cpproj? What is inside this file? My problem is that to create it I need to use the GUI to drag files into the CP file-import modules, which does not work very reliably when I have 300k images in the directories. Also, I use my script to rewrite the local paths to FTP ones. What are the recommendations?

[1] brutuswiki.ethz.ch/


#2

[quote=“szymon.stoma”]Dear all,

I am trying to set up an environment for CP on our cluster [1]. After solving many problems and much mysterious behavior, our version of CP finally works in headless mode. However, when I send jobs to the scheduler, many of them finish with memory errors. I would be very happy if somebody could share their experience of setting up CP 2.X in a cluster environment.

My current workflow looks like this:

  1. I am interested in HCS data analysis. We use 384-well plates, with 9 images per well and 3 channels. A primary analysis usually requires processing ~50 plates. This means ~600k images (in ~200k image sets). Processing one set takes about 30 s, which means ~70 days of computation on a single machine (this is the reason we need a cluster).
  2. I have a CP pipeline which computes features, writes the results to a database, and writes mask files and some additional CSVs to disk for each site (each site is processed independently of the others, so this can happen in parallel).
  3. I created a Python script which reads the directories and creates the CSV file read by LoadData in CP (it sets up the correct metadata, etc.). This file contains ~200k lines.
  4. I tried many different ways of calling CP and have now settled on a rather old-fashioned one (without grouping by metadata; I specify the first and last image set to be processed). It looks something like:

python /cluster/apps/cellprofiler/2.1.1/x86_64/CellProfiler/CellProfiler.py --jvm-heap-size=1g --do-not-fetch -c -r -b -i /cluster/work/scr3/sstoma/analysis/TEST07/input/ -t /cluster/work/scr3/sstoma/tmp/ --project=/cluster/work/scr3/sstoma/analysis/TEST07/input/01.cpproj -o /cluster/work/scr3/sstoma/analysis/TEST07/output/ -f 61128 -l 61343

  5. I submit processes to our queue system with different -f/-l values so they are computed in parallel.
[/quote]

All of the above is good practice - pretty much how we do it in our group and what we suggest.

We usually run the JVM with the CellProfiler default, which is 512m. Our typical images are on the order of 1000×1000. I think we could run the JVM with much less memory than that for pipelines without ImageJ modules in them, even for images an order of magnitude larger.

A rule of thumb might be 256m for overhead + 2× the size of your largest file. This is pretty conservative, and for you I am pretty sure the default value should always work.
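To make that concrete with your numbers: for ~10 MB TIFFs the rule gives roughly 256 MB + 2 × 10 MB ≈ 280 MB, comfortably under the 512m default.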

We use some memory for compiling the image set lists, but since you're using LoadData, that shouldn't apply to you.

If I read that right, you are requesting 8G of memory for Java. That’s far too much. I googled for _JAVA_OPTIONS; it looks like you may have _JAVA_OPTIONS defined as an environment variable and that’s overriding your request to CellProfiler. Could you check for that and remove the environment variable if it’s present?

My guess is that Python CP is asking for about 2g and your Java options are asking for 12g. Even 2g seems large, but possible, I guess.
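If you can't remove the variable globally, a wrapper along these lines should keep it out of CellProfiler's environment (just a sketch, reusing the paths and image-set range from your post):

# Launch headless CP without inheriting _JAVA_OPTIONS, so that
# --jvm-heap-size is what actually controls the JVM heap size.
import os
import subprocess

env = dict(os.environ)
env.pop("_JAVA_OPTIONS", None)  # drop the system-wide override, if present

subprocess.check_call(
    ["python",
     "/cluster/apps/cellprofiler/2.1.1/x86_64/CellProfiler/CellProfiler.py",
     "--jvm-heap-size=512m", "--do-not-fetch", "-c", "-r", "-b",
     "-i", "/cluster/work/scr3/sstoma/analysis/TEST07/input/",
     "-t", "/cluster/work/scr3/sstoma/tmp/",
     "--project=/cluster/work/scr3/sstoma/analysis/TEST07/input/01.cpproj",
     "-o", "/cluster/work/scr3/sstoma/analysis/TEST07/output/",
     "-f", "61128", "-l", "61343"],
    env=env)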

(Re: the 63 threads) That's the Ilastik classifier: it starts one thread per core. These threads should be inactive unless you're using Ilastik. It's a little heavy-handed, but if you know where your build put its site-packages directory, you can find the file "ilastik/core/jobMachine.py" and edit the function detectCPUs() (line 18 on my copy) so that it reads:

def detectCPUs():
    return 1
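Alternatively, if you'd rather not edit a shared site-packages, the same thing can probably be done from a wrapper script before CellProfiler spins up its workers (untested sketch; it assumes the module path above matches your install and that it imports cleanly):

# Make Ilastik's job machine report a single CPU, so it spawns
# one worker thread instead of one per core.
import ilastik.core.jobMachine as jobMachine
jobMachine.detectCPUs = lambda: 1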

We could do a better job in LoadData of skipping the lines. Some IT groups ask users to write their pipelines using ExportToSpreadsheet instead of ExportToDatabase; you can then write a script to concatenate the spreadsheet outputs (a sketch follows below). Also, ExportToDatabase has a mode where it outputs the measurements to a text file for later upload. Finally, you can use the CreateBatchFiles module to create a Batch_data.h5 file which you can use in place of the pipeline file on the command line. The Batch_data file contains the results of the preparatory phase of pipeline processing, so using it might speed things up. But unfortunately, there is room for improvement here.
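For the concatenation approach, a minimal sketch (assuming each job writes its ExportToSpreadsheet output, e.g. Image.csv, into its own subdirectory, and that all copies share the same header row; the layout here is hypothetical):

# Concatenate per-job ExportToSpreadsheet outputs into one CSV,
# keeping the header row from the first file only.
import glob

out_files = sorted(glob.glob("output/job_*/Image.csv"))
with open("Image_combined.csv", "w") as combined:
    for i, path in enumerate(out_files):
        with open(path) as f:
            header = f.readline()
            if i == 0:
                combined.write(header)
            for line in f:
                combined.write(line)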

(see above) You can use CreateBatchFiles in conjunction with LoadData. That works well, and you won't need to use the GUI to make the file list. If you have 300K images, you most certainly should use LoadData instead of the GUI.
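The headless invocation then points at the batch file instead of the project, along these lines (a sketch; it assumes CreateBatchFiles wrote Batch_data.h5 into your output directory):

python /cluster/apps/cellprofiler/2.1.1/x86_64/CellProfiler/CellProfiler.py -c -r -b -p /cluster/work/scr3/sstoma/analysis/TEST07/output/Batch_data.h5 -f 61128 -l 61343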

[quote]
[1] brutuswiki.ethz.ch/[/quote]