Loading large dataset



Hi, I have a very large set of images I would like to analyze (~35,000 images). Since I cannot use the Input modules, I understand that the LoadImages/LoadData module should be used. My problem starts with making the CSV file. I wanted to use the ExportToSpreadsheet module, but I can't load the data since the software crashes when I drag in my 35,000 images. What is your suggestion for making a CSV file to use with LoadData?

Regarding the LoadImages module, I don't quite understand the rationale of how to match the files. My files share the same well position/site/plate code, but differ in the wavelength collected.
I have selected the "Text-regular expression" method, but I am not sure how I should match the data.
Here is an example of two different wells from the same plate:

Y - 9(fld 4 wv HQ620_60x - HQ700_75m)-H1050270.tif
Y - 9(fld 4 wv HQ535_50x - HQ620_60m)-H1050270.tif
Y - 9(fld 4 wv HQ480_40x - HQ535_50m)-H1050270.tif
Y - 9(fld 4 wv D360_40x - HQ535_50m)-H1050270.tif

X - 43(fld 4 wv HQ620_60x - HQ700_75m)-H1050270.tif
X - 43(fld 4 wv HQ535_50x - HQ620_60m)-H1050270.tif
X - 43(fld 4 wv HQ480_40x - HQ535_50m)-H1050270.tif
X - 43(fld 4 wv D360_40x - HQ535_50m)-H1050270.tif

I have used this regular expression:
^(?P[A-Z]) - (?P[0-9])(fld (?P[0-9]).-(?P\w*)

Thank you,
I hope I was clear enough :slight_smile:


Does the software actually crash, or does it just hang for a long time? I’ve used CP to make image sets bigger than 35K, and it can take several hours (or even overnight) to finish loading them all, but it can in fact be done, at least on my machine.

I think what you should do is load a small subset of your data (maybe a couple of wells each from a couple of plates) into CP in a brand new pipeline, then play around with the Metadata and NamesAndTypes modules to figure out how you want to do metadata extraction of well name, plate name, channel name, etc, and how you want to match the channels together in NamesAndTypes. Once you’re happy with it and it works across all your images in your small subset, save that pipeline and export the CSV of the image set (File -> Export -> Image Set Listing). From this point, you have two options:

  1. See if you can now load the full image set into CellProfiler (it’ll probably take many hours!) and run the same export command to get the full image set listing.


  2. Use the CSV you have from your small subset experiment as a template of what these CSVs should look like; you can now write a script to read the image and/or folder names and spit out a CSV in the correct format in your favorite scripting language- almost any scripting language should be able to do this. This is more work up front, but if you'll be doing many large experiments like this it will almost certainly save you time in the long run.
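For option 2, here is a minimal Python sketch of such a script. The filename pattern follows the examples earlier in this thread; the column headers and the choice of the first "wv" token (e.g. `HQ620_60x`) as the channel name are assumptions, so compare the output against the CSV you exported from your small subset and adjust accordingly.

```python
import csv
import re

# Filename pattern based on the examples in this thread, e.g.
# "Y - 9(fld 4 wv HQ620_60x - HQ700_75m)-H1050270.tif".
# Using the first token after "wv" as the channel name is an assumption.
PATTERN = re.compile(
    r"^(?P<Row>[A-Z]*) - (?P<Column>[0-9]*)\(fld (?P<Site>[0-9]*) "
    r"wv (?P<Channel>\S+).*\)-(?P<Plate>\w*)\.tif$"
)

def build_rows(filenames, path):
    """Group filenames into one row per (Plate, Row, Column, Site) image set."""
    image_sets = {}
    for fname in filenames:
        m = PATTERN.match(fname)
        if not m:
            continue  # skip files that don't match the pattern
        key = (m["Plate"], m["Row"], m["Column"], m["Site"])
        image_sets.setdefault(key, {})[m["Channel"]] = fname
    channels = sorted({c for chans in image_sets.values() for c in chans})
    header = ["Metadata_Plate", "Metadata_Row", "Metadata_Column", "Metadata_Site"]
    for c in channels:
        header += ["FileName_" + c, "PathName_" + c]
    rows = [header]
    for key in sorted(image_sets):
        chans = image_sets[key]
        if len(chans) != len(channels):
            continue  # drop image sets with a missing channel
        row = list(key)
        for c in channels:
            row += [chans[c], path]
        rows.append(row)
    return rows

# Example usage (folder path is hypothetical):
# import os
# folder = r"C:\data\plate1"
# with open("load_data.csv", "w", newline="") as f:
#     csv.writer(f).writerows(build_rows(os.listdir(folder), folder))
```

Dropping incomplete image sets (the `len(chans) != len(channels)` check) keeps LoadData from choking on sites where one wavelength failed to acquire.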

Once you have done either 1) or 2), you'll have a CSV you can use with LoadData for your "main" pipeline, the one you'll use to actually measure your cells.

Good luck!


Thanks for your support.
I was able to create a CSV from a small representative set of example files from my data. Then I dragged the whole folder into the CellProfiler window and filtered for images only.

In the Metadata module I extracted metadata from the file/folder name based on a regular expression, which works well. But when I update the metadata fields, I receive the notification shown in the image below.

The second option will be much more elaborate for me since I have insufficient coding knowledge. The CPA learning algorithm works very well with my images (at small scale). I will ask for some help with this part, but it might take a while until I get it working. If you have an example script for a similar task, it might help clear up my confusion in the programming world (I know some R).

Thank you


Can you show/copy the details of the error? Just knowing that there is an error isn’t really sufficient.

Post edited because I realized your metadata extraction was mangled in the forum text above; I think what you meant was:

^(?P<Row>[A-Z]*) - (?P<Column>[0-9]*)\(fld (?P<Site>[0-9]*).*-(?P<Plate>\w*)
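As a quick sanity check, this corrected expression can be tested against the example filenames from the first post with plain Python, outside CellProfiler (Python and CellProfiler's Metadata module use the same named-group syntax):

```python
import re

# The corrected extraction regex from this thread
pattern = re.compile(
    r"^(?P<Row>[A-Z]*) - (?P<Column>[0-9]*)\(fld (?P<Site>[0-9]*).*-(?P<Plate>\w*)"
)

for name in [
    "Y - 9(fld 4 wv HQ620_60x - HQ700_75m)-H1050270.tif",
    "X - 43(fld 4 wv D360_40x - HQ535_50m)-H1050270.tif",
]:
    m = pattern.match(name)
    print(m.groupdict())
# {'Row': 'Y', 'Column': '9', 'Site': '4', 'Plate': 'H1050270'}
# {'Row': 'X', 'Column': '43', 'Site': '4', 'Plate': 'H1050270'}
```

Note that the greedy `.*` swallows the whole wavelength field, so `Plate` captures from the last hyphen; that is what makes all four channels of a site yield identical metadata and group together.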


Here it is


It’s hard to know without actually playing with your computer, but it does sound like a memory error. Can you check how much Java memory you have allocated in the Preferences section, and see whether increasing it allows you to add the full dataset? (You’ll want to restart after increasing it.) If not, can you do it in halves or quarters and then concatenate the CSVs at the end, if scripting really isn’t an option?
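Concatenating the per-chunk CSVs is one of the few places where a tiny script is hard to avoid; a minimal sketch (the part and output filenames are hypothetical):

```python
import csv

def concat_csvs(parts, out_path):
    """Merge several LoadData CSVs into one, keeping only the first header row."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for i, part in enumerate(parts):
            with open(part, newline="") as f:
                reader = csv.reader(f)
                header = next(reader)       # every part starts with a header
                if i == 0:
                    writer.writerow(header)  # write the header only once
                writer.writerows(reader)

# Example usage (file names are hypothetical):
# concat_csvs(["quarter1.csv", "quarter2.csv", "quarter3.csv", "quarter4.csv"],
#             "full_image_set.csv")
```

This assumes every chunk was exported with the same column order, which holds if they all came from the same pipeline.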


I again have some obstacles I cannot overcome.
I am using a PC with 192 GB RAM and working with CellProfiler 2.0 for Windows.
I would like to analyze a dataset containing ~240,000 images. I have tested a small subset of my data and successfully analyzed it using CPA. ImageAqusition_08092017.cppipe (10.7 KB)
loadImage_smallscale_08102017.cppipe (6.0 KB)
Images.zip (2.3 MB)

I have created an image table (attached: CSV of a cropped subset of the data), but I am not able to run the CP pipeline; it fails to load the data (screenshot attached). The pipeline for creating the data, grouping, metadata extraction from the file name, and a spreadsheet is attached to this post. I have also uploaded a set of images.
The Java memory used is 64000 (screenshot attached); the maximum I can analyse is 5,000 image sets (which is 20,000 images).

Thank you so much for your help, and for this excellent tool you developed



It looks like from your error message that you’re running out of disk space to create the temporary files needed to run CP. You should check how much disk space is available on the C:\ drive or set the CP preferences to use a different temporary directory.
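Before a long run, the free space on the drive holding the temporary directory can be checked from the standard library (the path here is just the system default temp dir; point it at whatever CP's temporary-directory preference is set to):

```python
import shutil
import tempfile

# Check free space where temporary files would be written.
# tempfile.gettempdir() is only a stand-in for CP's configured temp directory.
usage = shutil.disk_usage(tempfile.gettempdir())
print("free GB: %.1f" % (usage.free / 1e9))
```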


Hi all,

Back in the day, @LeeKamentsky had some custom (shell?) scripts that he wrote for scraping a directory and outputting a file readable by LoadData. It possibly even omitted fields that had missing images. Does anyone have these? I thought they may have been on his Broad personal directory, but that seems to be gone now unsurprisingly.

I usually do the workaround that @bcimini cited (drag lots of images into Images, set up metadata, etc., then export the CSV). Trying to find a more efficient method!