HCS workflow organization: how to handle terabytes of images?


#1

Hello CP team,
I have a few questions about handling massive numbers of images. As we start to ramp up screens we are running into all kinds of shortages, so we were wondering whether anyone has worked out workflow protocols/software, or is familiar with commercial tools, to systematically transfer images from the capturing workstation to a server, back them up, retrieve them for processing, automatically delete intermediate processing steps, permanently store the final high-quality processed images, etc.

We now generate:

  1. 80 images of 25 MB each per well, thus 25 MB x 80 x 96 ≈ 200 GB/plate.
  2. 96 stitched and compressed well images (~5 MB total per well), ~0.5 GB/plate. Otherwise CP chokes: is this ~5 MB/well a limitation of our processing workstation or a CP limitation?
  3. 96 x 2 processed (outlined, straightened) images, ~1 GB/plate.

The plan is to screen 240,000 wells (~2,500 plates), i.e. roughly 500 TB of raw images.

We are for now doing things very manually, which is time-consuming and risky.

  1. capture with Nikon Elements onto an 8 TB local SSD
  2. stitch using Elements. Do you know of fast stitching tools that can run on a high-performance computer (i.e. not running Windows or macOS)?
  3. ‘immediately after’ (depending on the student’s promptness) we transfer the tiles and stitched images to the institutional 1 PB server, but this has to be done with the backup function of Elements (one plate at a time) so that we can repopulate the Elements HCA visualization tools if needed. The server’s contents are backed up to the cloud daily.
  4. When we start processing with CP we normally need to transfer the images back to the processing computer; this generates CP databases whose paths are tied to that computer, so sharing the results of the analysis is challenging. We would prefer to analyze from any personal computer, without moving images back and forth, and to keep the results centralized so that every team member can open the database in CPA for training and scoring (since different people may be looking at different phenotypes).
  5. transfer all processed images back to the server.

Keeping track of all this processing and data transfer is difficult, and we fear that as the volume increases we will start losing data, or losing track of it.

So, do you have suggestions or tools on how to improve the workflow?

Thanks


#2

Otherwise CP chokes: is this ~5 MB/well a limitation of our processing workstation or a CP limitation?

5 MB per well shouldn’t max out CellProfiler itself; my guess is that the ceiling you are seeing is a limit of your processing computer rather than of CP. The best way to test, though, would be to try the bigger images on a more powerful computer.

Do you know of fast stitching tools that can run on a high-performance computer (i.e. not running Windows or macOS)?

I’m assuming you mean some sort of Linux cluster? If you need something that interfaces with a point-and-click program, FijiArchipelago apparently allows ImageJ (which has good, fast stitching algorithms) to be run on clusters, though I’ve not personally tried it; alternatively, if you’ve got someone code-savvy you could use something like scikit-image on a cluster installation of Python.
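For the code-savvy route, here is a minimal sketch of pairwise tile registration with scikit-image; the file names, tile layout and 200 px overlap are hypothetical placeholders, and a full stitcher would still need to solve for globally consistent tile positions and blend the seams:

# Minimal pairwise tile registration with scikit-image (placeholder file names/overlap).
import numpy as np
from skimage import io
from skimage.registration import phase_cross_correlation

left = io.imread("tile_r0_c0.tif")   # two horizontally adjacent tiles (placeholder names)
right = io.imread("tile_r0_c1.tif")
overlap = 200                        # nominal overlap between neighbouring tiles, in pixels

# Cross-correlate the overlapping strips to estimate how far the right tile
# deviates from its nominal stage position.
shift, error, _ = phase_cross_correlation(left[:, -overlap:], right[:, :overlap])
print("estimated (row, col) offset:", np.round(shift).astype(int), "registration error:", error)

# A full stitcher repeats this for every adjacent tile pair, solves for globally
# consistent tile positions, and then pastes the tiles into a single mosaic array.

Each of those pairwise registrations is independent, so the per-plate work parallelizes trivially across cluster nodes.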

When we start processing with CP we normally need to transfer the images back to the processing computer; this generates CP databases whose paths are tied to that computer, so sharing the results of the analysis is challenging.

You can change the paths in the database pretty easily with a simple SQL call like the one below (I do this all the time when sending collaborators data or after moving images):

UPDATE Experiment_Per_Image
SET Image_PathName_Chan1 = REPLACE(Image_PathName_Chan1, '/path/from/processing', '\\central\server')
WHERE Image_PathName_Chan1 LIKE '%/path/from/processing%';
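If you have exported to SQLite instead of MySQL, the same fix is easy to script. Below is a minimal sketch in Python, reusing the table/column names from the example above; the DefaultDB.db file name and the two paths are placeholders:

# Rewrite stored image paths in a CellProfiler SQLite database after moving the images.
# Table/column names are taken from the SQL example above; the database file name
# and the two path prefixes are placeholders.
import sqlite3

old_prefix = "/path/from/processing"
new_prefix = r"\\central\server"

con = sqlite3.connect("DefaultDB.db")
with con:  # commits on success, rolls back on error
    con.execute(
        "UPDATE Experiment_Per_Image "
        "SET Image_PathName_Chan1 = REPLACE(Image_PathName_Chan1, ?, ?) "
        "WHERE Image_PathName_Chan1 LIKE ?",
        (old_prefix, new_prefix, "%" + old_prefix + "%"),
    )
con.close()

Remember to run the same UPDATE for each channel’s PathName column.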

Keeping track of all this processing and data transfer is difficult, and we fear that as the volume increases we will start losing data, or losing track of it.

So, do you have suggestions or tools on how to improve the workflow?

We haven’t “officially-officially” launched it yet (that will likely happen by the end of the year), but we’ve been working on a tool called Distributed CellProfiler, which lets you easily run CellProfiler on Amazon Web Services; the advantage is that the images are centrally located, and any member of the lab can run the analysis and/or pull the results from their own personal computer. Take a look at its wiki, and let us know if you have questions or comments.


#3

Thanks for your answers.
We will give Distributed CellProfiler a try as soon as it is out.

5 MB/well does not choke the system; that is why we reduce the well images to that size. However, the uncompressed images are ~125 MB each, and those do choke CP. Thus we were wondering how large image files can be, and what tricks are available for processing minimally compressed images.

We may come back with more questions.


#4

Thus we were wondering how large image files can be

It totally depends on your local computer, sorry.
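To put rough numbers on it: CellProfiler generally works on floating-point copies of each image internally, so the in-memory cost is several times the on-disk size. A back-of-the-envelope sketch (the 8000 x 8000 px, single-channel, 16-bit figures are hypothetical, chosen to roughly match a ~125 MB file):

# Back-of-the-envelope RAM estimate for one uncompressed stitched well image.
# The dimensions below are placeholders; adjust them to your actual images.
height, width, channels = 8000, 8000, 1

print(f"on disk (16-bit): ~{height * width * channels * 2 / 1e6:.0f} MB")
for bytes_per_px, label in [(4, "float32"), (8, "float64")]:
    in_memory = height * width * channels * bytes_per_px / 1e6
    print(f"in memory: ~{in_memory:.0f} MB per {label} working copy")

So a single ~125 MB TIFF can need a quarter to half a gigabyte of RAM per working copy, and a pipeline usually holds several derived copies at once, which is why the choking point depends so much on the machine.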

and what tricks are available for processing minimally compressed images

There is a program called Orbit that is designed for analyzing stitched images and can interface with CellProfiler pipelines; I’ve never used it personally, but it’s an option.

We will give Distributed CellProfiler a try as soon as it is out.

It’s ready to be used now; we’re just doing a final proofread of the documentation before we launch it. Feel free to try it, and let us know if you run into issues or find places where the documentation isn’t clear!