Porting CP to GPU question for CP developers


I just came across the Anaconda Accelerate compiler suite, which can port Python applications from CPU-only execution to mixed CPU+GPU execution paths, with potential for massive runtime improvements. Does anyone have a perspective on how feasible it would be to use Anaconda’s Intel MKL NumPy optimizations and CUDA GPU offloading to increase CP performance? It appears to me that GPUs might be a really fast alternative to CPU-based object segmentation, which seems to be the rate-limiting step in CP image processing pipelines.



At this point, the cost of adapting CellProfiler to GPUs outweighs the benefits, but we anticipate that within the next few years it may become reasonable to support newer technologies like these.

In the meantime, one of the big changes in our upcoming release is multiprocessing capability, so CellProfiler can take advantage of multiple cores when running a pipeline. Stay tuned…



Hi Mark, I saw this old post, and we are looking for a single-box solution for the current version of CellProfiler. I am fairly well versed in the software and use it frequently. Many thanks to all of you at Broad and the NIH for making CP and CPA happen! It was very much needed. First question: has your group moved any closer to optimizing CP code to run on GPU/CUDA-enabled platforms? These systems are becoming very affordable. It would be nice to be able to run large image stacks on a single HPC desktop.

Second thing: we have a small lab and we are looking at purchasing a multi-core Xeon system with a large amount of RAM to achieve greater throughput on a single desktop dedicated to image processing and data analysis. Could you tell me if CP will scale well across 20+ cores with hyperthreading enabled? If so, do you have a recommendation for the amount of RAM we should dedicate to a single core? In other words, how many GB/core should we consider for the system to work fluidly? We are running 4-5MB/image over tens of thousands of images. Currently we run smaller batches of images on a hyperthreading-enabled quad core with 8GB/core. The memory and processing speed become saturated when dealing with larger image stacks, but CP seems to distribute the workload efficiently over the 4 cores.

Thanks for your help,


I think our engineering team has worked on (or at least thought about working on) installing CP on a GPU in the recent past; I’ll check in with them and try to have someone get back to you.

In the meantime I know at least one other person who works at CP who was using a 32 core system on Amazon Web Services to run large CP datasets and was getting it to work well for him; I believe it was their r3.8xlarge, which has 32 cores and 244GB of RAM; I believe he assigned 4GB to each core and then had the rest as a temporary ramdisk.
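
To make the arithmetic in that setup explicit, here is a minimal sketch of the memory split. The function name `split_memory` is made up for illustration, and it ignores whatever the operating system itself needs, so treat the leftover figure as an upper bound for the ramdisk.

```python
def split_memory(total_gb, workers, gb_per_worker):
    """Return (GB reserved for workers, GB left over, e.g. for a ramdisk)."""
    worker_gb = workers * gb_per_worker
    if worker_gb > total_gb:
        raise ValueError("workers need more memory than the machine has")
    return worker_gb, total_gb - worker_gb


# Figures from the post: 32 cores, 244GB of RAM, 4GB assigned to each core.
worker_gb, ramdisk_gb = split_memory(total_gb=244, workers=32, gb_per_worker=4)
print(worker_gb, ramdisk_gb)  # 128 116
```

So roughly 128GB goes to the workers and up to about 116GB remains for the ramdisk.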

What computer specs is best to run Cell Profiler?

I just configured an HP Z840 workstation to work on large confocal image datasets. The system has a dual-socket 2x22 core configuration with 128GB of RAM, so with hyperthreading it comes to 88 max workers. You will just need to configure the CellProfiler GUI to use all the possible resources. The performance increase over a 4-core workstation comes in at roughly a 10-fold reduction in run time. Hope this information helps you with your workstation purchase.

Max number of processors and RAM

Thanks Derek, very helpful. Exactly what I was looking for!



CellProfiler uses scikit-image for computer vision operations (i.e. image processing and analysis), and scikit-image uses NumPy and SciPy for algebraic and optimization operations, so CellProfiler won’t support CUDA (or OpenCL) until there’s support in those upstream dependencies (i.e. NumPy and SciPy support them).

If you’re adventurous, NVIDIA recently released cuBLAS, so it might be possible to link NumPy and cuBLAS and run CellProfiler with your cuBLAS-linked NumPy and receive some performance advantage. However, this is completely theoretical.

Alternatively, you could fork CellProfiler and replace scikit-image calls with equivalent or similar calls from OpenCV’s CUDA library. It’d require some work, but you could potentially receive dramatic speedups (e.g. OpenCV claims CUDA functions have a 30x speedup for image processing functions).

Xeon Phi

It’s a good question, but it’s completely dependent on your assay. Ideally, CellProfiler will keep an image and a copy in memory. CellProfiler also needs a few hundred megabytes or so of constant overhead.
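
As a rough way to turn that into a number, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function name, the 300MB stand-in for "a few hundred megabytes", and the idea that each worker holds exactly one image plus one copy. Real pipelines keep intermediate images around, and an image is usually larger in memory than on disk, so read the result as a lower bound.

```python
def estimate_ram_gb(workers, image_mb, overhead_mb=300, copies=2):
    """Lower-bound RAM estimate in GB for a given number of workers.

    copies=2 models "an image and a copy"; overhead_mb is an assumed
    placeholder for the constant per-worker overhead.
    """
    per_worker_mb = copies * image_mb + overhead_mb
    return workers * per_worker_mb / 1024


# Example: 46 workers on 5MB images.
print(f"{estimate_ram_gb(workers=46, image_mb=5):.1f} GB")  # 13.9 GB
```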


Hi everyone!

I would like to revive this thread because I need to speed up our image analysis.

We have an HP Z840 with 2 CPUs (12 cores each), 64GB of RAM and 2x 512GB SSDs. When we run an analysis we see that there are 24 workers active, which is nice, but then I saw Derek’s post and realised that we could maybe get a better setup with hyperthreading (which is enabled on the computer).

The question is, how can I configure CellProfiler’s GUI to take full advantage of hyperthreading? Could I also get 88 workers like in Derek’s setup?

Otherwise, we are also considering buying a server. What server configuration would you advise?

Thanks in advance!




Hi puigvert,

Your Z840 workstation has a different CPU configuration from mine. It appears from your post that you are using half of the available CPU resources: you have 24 total cores, x2 for HT, which should yield ~48 total workers. Have you set the “maximum number of workers” field in the CP preferences menu to 46? I always leave at least 2 threads free for the system, or the computer becomes extremely unresponsive. My Z840 workstation has two Xeon E5-2699 v4 CPUs, which brings the total to 2 sockets x 22 cores x 2 (HT) = 88 threads. So, unfortunately, if you want to have 88 threads you will need to upgrade to CPUs with a larger number of cores. Hope this information helps.
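
The worker arithmetic above can be sketched like this. The helper `suggested_workers` is hypothetical and simply encodes the rule of thumb from this post (logical threads minus 2 kept free for the system); `os.cpu_count()` is the standard-library call that reports how many logical CPUs the operating system exposes, which is a quick way to check whether hyperthreading is actually visible.

```python
import os


def suggested_workers(sockets, cores_per_socket, threads_per_core=2, reserved=2):
    """Logical thread count minus a few threads reserved for the system."""
    logical = sockets * cores_per_socket * threads_per_core
    return max(1, logical - reserved)


print(suggested_workers(2, 22))  # 86 (out of 88 threads)
print(suggested_workers(2, 12))  # 46 (out of 48 threads)

# What the operating system reports on this machine (may be None).
print(os.cpu_count())
```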


Hi Derek!

Thanks a lot for your answer. Indeed, I guess that our Z840s are different: ours has the E5-2620 v3 CPUs, with 12 cores each, which should indeed yield 48 threads with hyperthreading.

The problem is that I only see 24 workers, although HT is enabled (I saw it in the BIOS).

Interesting question about setting the maximum number of cores in CP…I don’t see this option in my CP…I’m using the 64bit Windows version (latest), with GUI (not command). Where can I set this?

Thanks again!




Click on “File” in the menu bar at the top of the CP window, then select “Preferences” from the list of choices. The field named “maximum number of workers” is the one you need to set to ~46. Then just restart CP and all your CPU cores will be used for running image processing jobs.


I cannot believe I have been using CP for years and I never saw this option… :grin:

Thanks Derek!


Hi again,

I have now set the number of workers to 46. I see that there are 46 workers active, and I see much more RAM being used, but surprisingly the processing time is either the same or even slightly longer… Any idea? :confused:


Hi Jordi,

I have found through testing that CPU utilization scales with the size of the image batch and the grouping configuration. I usually batch 5000-20000 images at a time, so the CPU utilization stays around 90-95%. I would not be surprised if you are observing minimal performance scaling when processing small batches of under 100 images. For example, in a 96-well plate experiment, the images for each well must be processed sequentially in a pipeline, so you could potentially extract more parallelism by using the “groups” option to group by well, so that all wells in an experiment are executed in parallel on different worker threads. If you have multiple plates, then grouping by plates and wells should help you to fully utilize your CPU resources. Hope this info helps you out.
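
To illustrate the grouping idea (this is only a sketch of the concept, not how CellProfiler's grouping is implemented or configured): each (plate, well) group is an independent unit of work, so the number of groups limits how many workers can be kept busy. The `Plate_Well_Field.tif` file naming and the function name are assumptions made up for the example.

```python
from collections import defaultdict


def group_by_plate_and_well(filenames):
    """Group image file names by (plate, well).

    Assumes hypothetical names of the form '<plate>_<well>_<field>.tif'.
    """
    groups = defaultdict(list)
    for name in filenames:
        stem = name.rsplit(".", 1)[0]
        plate, well, _field = stem.split("_")
        groups[(plate, well)].append(name)
    return dict(groups)


# Toy data set: 1 plate, 4 wells, 2 fields per well.
files = [
    f"Plate1_{row}{col:02d}_f{field}.tif"
    for row in "AB"
    for col in (1, 2)
    for field in (1, 2)
]
groups = group_by_plate_and_well(files)
print(len(files), "images in", len(groups), "groups")  # 8 images in 4 groups
```

With only 4 groups, at most 4 workers have anything to do, which is why small batches may not show much scaling.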



Hi Derek,

This is really good advice. Our experiments usually have 384x4 images (4 fields per well). I have used the grouping function and this has reduced the analysis time from 2h45min to 2h15min, with 22 workers. I will try to increase the number of workers and test again.

Thanks a lot!
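
For reference, the improvement reported above works out as follows (plain arithmetic on the two run times quoted in the post; the helper name is just for illustration):

```python
def to_minutes(hours, minutes):
    """Convert an hours/minutes pair to total minutes."""
    return hours * 60 + minutes


before = to_minutes(2, 45)  # 165 minutes without grouping
after = to_minutes(2, 15)   # 135 minutes with grouping

speedup = before / after
reduction = (before - after) / before
print(f"{speedup:.2f}x faster, {reduction:.0%} less time")  # 1.22x faster, 18% less time
```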




Hi guys,
I am testing a new server which has all the characteristics of Derek’s machine (2 processors with 22 cores each for a total of 44, or 88 with hyperthreading) and 128GB of RAM. The first try was a success compared to my previous system, but I think the machine can do better. CP recognizes 44 processors, but I can correct this in preferences as Derek suggested. My question is about the Java max memory setting, because it seems to be capped at 64GB, whereas the suggestion here is to give AT LEAST 2GB of RAM to each processor… How do you get past the 64GB cap in preferences?