Distributed CellProfiler (DCP) continued... problem with loading files


#1

Hi,

After just recently getting DCP working on my Amazon network, I’ve been taking small steps to bring my initial example around to something more useful. If anyone is trying this at home, the attached files worked for me. Put the pipeline, the images, and the “csv” list of images and metadata on AWS S3. Put the config.py, job.json and fleet.json files on the “CP Control node”. The config.py, job.json and fleet.json contain AWS-specific information that will have to be specified for your AWS network. I used the examples provided on the DCP instance to get started. Sorry my zip has the same name as the example provided on the examples page.

ExampleSBSImages.zip (4.4 MB)

Just as a side note, it is not necessary to have permissions on S3 enabled for “Everyone” or “Any authenticated AWS user”. In addition to the configuration suggested for the ecsInstanceRole and the aws-ec2-spot-fleet-role, I have added AmazonS3FullAccess. I haven’t yet tried removing that to see whether an analysis still runs.

With the above analysis, all the images are in one folder (images) and the output for each analyzed image goes into its own folder in the output folder (you may need to make an empty output folder on AWS before starting). This output works for me: each image is analyzed separately and written to its own CSV file.

Ok, here are my questions:

When I run the above analysis, a “Spot Fleet” of two instances is started in addition to the one instance associated with the service/task in the cluster (newcluster in this case). That instance is visible in the EC2 Container Service. The spot fleet instances don’t appear to be doing anything and are terminated when the monitor winds the analysis down. How should I be configuring/organizing/initiating the spot fleet so it actually does something?

What would be convenient would be if I could have multiple input folders of images. For example, an experiment with multiple plates would have a folder of images for each one. I don’t want to “group” the analysis so that all the data from a plate goes into one CSV or for illumination correction, and I can live with separate folders for each analyzed image, but it would be nice not to have one big input folder with all the images from one screening experiment. I’ve tried a bunch of different things to get this to work and haven’t been able to do it. It seemed like this should have worked, but it did not:

test_06.zip (4.4 MB)

My last question at the moment: if the CP group can change and post new Docker images, which get incorporated when fab or config is run, how can I avoid having a working system stop working if a change/update is made that isn’t for the better?

Thanks for your time! John


#2

When I run the above analysis, a “Spot Fleet” of two instances is started in addition to the one instance associated with the service/task in the cluster (newcluster in this case). That instance is visible in the EC2 Container Service. The spot fleet instances don’t appear to be doing anything and are terminated when the monitor winds the analysis down. How should I be configuring/organizing/initiating the spot fleet so it actually does something?

You’re telling DCP in your spot fleet JSON that you want 2 instances (m4.xlarge), but in your config file you’re saying you only want 1 Docker container (both CLUSTER_MACHINES and TASKS_PER_MACHINE are set to 1). You’ve said there’s already another instance in this cluster; if it has sufficient free memory and CPUs, your Docker container is probably running on that instance before your newly spun-up instances are ready to start running Dockers. If you’re going to start 2 instances, you should set your CLUSTER_MACHINES to 2; if you’re worried about Docker containers landing on instances they weren’t meant to be on, you can start DCP up in a separate cluster.
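For reference, here is a minimal sketch of the two config.py settings in question, assuming you want two spot-fleet instances running one Docker container each (the values are examples for this case, not a general recommendation):

# Relevant excerpt of config.py (all other settings omitted)
CLUSTER_MACHINES = 2    # number of spot-fleet instances to request; match your fleet JSON
TASKS_PER_MACHINE = 1   # Docker containers to run on each instance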


What would be convenient would be if I could have multiple input folders of images. For example, an experiment with multiple plates would have a folder of images for each one. I don’t want to “group” the analysis so that all the data from a plate goes into one CSV or for illumination correction, and I can live with separate folders for each analyzed image, but it would be nice not to have one big input folder with all the images from one screening experiment. I’ve tried a bunch of different things to get this to work and haven’t been able to do it. It seemed like this should have worked, but it did not:

You absolutely can. Here’s a CSV from an experiment I ran on 2 plates from a publicly available data set; note that there are multiple paths represented in the PathName_ columns: BBBC022__AWS.csv (9.9 MB). I’m not entirely sure why yours didn’t work, but I suspect that if you make the path names absolute (starting with /home/ubuntu/bucket) you may have better success; I’ll update the wiki to reflect that. FWIW, how the output folders are created depends entirely on your groupings: if, instead of grouping by plate and image, you grouped by just plate, or by plate and well, etc., the output folders would be created accordingly.
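If it helps, here’s a minimal sketch (not DCP code, just an illustration) of rewriting the PathName_ columns of a LoadData CSV so they start with /home/ubuntu/bucket. The input and output file names are placeholders, and I’m assuming all of your path columns begin with PathName_, so adjust to match your own CSV:

import csv

BUCKET_MOUNT = "/home/ubuntu/bucket"  # where the workers mount the S3 bucket

# "load_data.csv" and "load_data_absolute.csv" are hypothetical file names.
with open("load_data.csv") as src, open("load_data_absolute.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for col, value in row.items():
            # Prefix any relative PathName_ entry with the bucket mount point.
            if col.startswith("PathName_") and not value.startswith(BUCKET_MOUNT):
                row[col] = BUCKET_MOUNT + "/" + value.lstrip("/")
        writer.writerow(row)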


My last question at the moment: if the CP group can change and post new Docker images, which get incorporated when fab or config is run, how can I avoid having a working system stop working if a change/update is made that isn’t for the better?

I have a bad answer, a hacky answer, and a good answer for this:

  • The bad answer is that we don’t update the working Docker (latest) until we’ve debugged any changes we make in branches, and we’re going to do everything within reason to ensure backwards compatibility, though of course, like anyone, we occasionally make mistakes.

  • The hacky answer is that you can always make your own copy of the Docker as it currently stands as a safety backup: just edit the Makefile to reflect your own Docker Hub account, cd into the worker folder, and type make to create your own version of the Docker. Then, in the config file, call your copy. We’ve intentionally left you this ability so that you can make changes to how DCP works; check the advanced configuration section of the wiki for ideas on tweaks you may decide you want to make.

  • The good answer is that because this started as just a lab-internal project, we honestly didn’t spend a ton of time thinking about release strategies, because we weren’t sure anyone outside the lab would ever use it (though we certainly hoped they would!). There now seem to be at least a couple of people who are trying to, so clearly we need to revisit that question. Rather than latest, we’ll probably introduce some kind of date-based or semantic versioning system (so either DCP_20170216 or DCP 1.0.0). I’ll talk with the other people on the team today and try to get back to you with an answer very soon.


#3

Update: we chatted as a team and decided that while we’ll keep pushing new versions to the latest tag of DCP, we’ll also push versioned releases in tandem; the current version is available with tag 1.0.0. Once we push our next version (likely in the next couple of weeks), we’ll add a page to the wiki detailing the changes at each release.
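If you want to guard against surprises, one option is to pin your config to the versioned tag (or to your own backup copy) instead of latest. This is only a sketch: it assumes your config.py exposes the image/tag as a variable, and the variable name DOCKERHUB_TAG and the image names shown are assumptions, so check your own config file.

# Pin DCP to a fixed release tag instead of "latest" (variable and image names assumed)
DOCKERHUB_TAG = "cellprofiler/distributed-cellprofiler:1.0.0"
# Or point at your own copy built from the Makefile, e.g.:
# DOCKERHUB_TAG = "yourdockerhubaccount/your-dcp-copy:1.0.0"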


#4

Thanks! I see you’ve specified the paths for every image. If you are specifying them there, what do you have for input in your job.json file?

I’ll try your other suggestions as well. Thanks! -John


#5

Input only matters if any of your modules are looking for the Default Input Folder (or a Default Input Folder sub-folder); in practice, for me that usually only comes up if I’m passing a text file to FlagImage or FilterObjects or something like that, though of course that may not be true for you. If none of the modules are looking for it, it is essentially a dummy variable that is passed but not actually used by CellProfiler.

ETA: it doesn’t matter whether LoadData is looking for the Default Input Folder or not; it just takes whatever CSV you give it as an override. So it’s really only Default Input Folder settings in actual analysis modules for which you need to worry about what Input is set to.


#6

So the bucket name specified in my config.py takes the place of the “/home/ubuntu/bucket” part of the path in the CSV. OK, I’ll try that. -John
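Just to check my understanding, here is a toy sketch of the mapping as I picture it (the key and file name are only examples):

# Toy illustration: an object at s3://<my-bucket>/<key> shows up inside the
# worker container under the mount point /home/ubuntu/bucket/<key>.
def mounted_path(key, mount_point="/home/ubuntu/bucket"):
    return mount_point + "/" + key.lstrip("/")

print(mounted_path("ExampleSBSImages/images/Channel1-01-A-01.tif"))
# -> /home/ubuntu/bucket/ExampleSBSImages/images/Channel1-01-A-01.tif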


#7

Hmm, your suggestion for specifying paths didn’t seem to work.

Here are my job.json file and my CSV input files:

And here is the log file output from AWS CloudWatch.

It seems like I get that message “IOError: Could not find file, /home/ubuntu/…/ExampleSBS_06.cppipe” even if the pipeline is fine but some other file can’t be found.

-John


#8

Hmmm, that’s a new one on me. Can you try logging into the Docker like before and running
cat {pathtopipeline} | head
(with the path to your pipeline, copied directly from the log, WITHOUT brackets), just to rule out a bucket-mounting issue? Otherwise I’ll have to think about it.


#9

[ec2-user@ip-172-31-45-189 ~]$ docker exec -i -t 24e74b123e21 /bin/bash
root@24e74b123e21:/home/ubuntu# cat /home/ubuntu/bucket/ExampleSBSImages/ExampleSBS_06.cppipe | head
cat: /home/ubuntu/bucket/ExampleSBSImages/ExampleSBS_06.cppipe: Not a directory


#10

root@24e74b123e21:/home/ubuntu# cat /home/ubuntu/bucket | head
cat: /home/ubuntu/bucket: Is a directory
root@24e74b123e21:/home/ubuntu# cat /home/ubuntu/bucket/ExampleSBSImages | head
root@24e74b123e21:/home/ubuntu# cat /home/ubuntu/bucket/ExampleSBSImages/ | head
cat: /home/ubuntu/bucket/ExampleSBSImages/: Not a directory


#11

When you do ls /home/ubuntu/bucket from inside a docker container, what do you get?


#12

this is showing what is there:
root@6b09adf55183:/home/ubuntu# ls /home/ubuntu/bucket
ExampleSBSImages ecsconfigs exportedlogs working_03 working_04 working_05

this does not:
root@6b09adf55183:/home/ubuntu# ls /home/ubuntu/bucket/ExampleSBSImages
/home/ubuntu/bucket/ExampleSBSImages


#13

Huh, that’s very weird; my best guess is some kind of permissions error on that directory (or the files therein). Can you ls the other directories?


#14

Hi, this isn’t really making sense. I’ve been moving the files (images, pipeline, CSV) in and out of one project folder while I’ve been trying things over the last few days. Yesterday I had a set of files that worked, and today I tried moving them back into the project folder and now they don’t work. I tried making a new bucket and setting open permissions (Everyone and authenticated AWS users) on files that worked yesterday, but the analysis did not run. This is the output when I log into the running container:

root@9bb70abd5ec7:/home/ubuntu# ls /home/ubuntu/bucket
ExampleSBSImages ecsconfigs
root@9bb70abd5ec7:/home/ubuntu# ls /home/ubuntu/bucket/ecsconfigs/
CP05_ecs.config
root@9bb70abd5ec7:/home/ubuntu# ls /home/ubuntu/bucket/ExampleSBSImages
/home/ubuntu/bucket/ExampleSBSImages
root@9bb70abd5ec7:/home/ubuntu# ls /home/ubuntu/bucket/ExampleSBSImages/images
ls: cannot access /home/ubuntu/bucket/ExampleSBSImages/images: Not a directory

What’s funny is that this file “CP05_ecs.config” has even fewer permissions (only me, not authenticated users or everyone) but is visible via ls, whereas the pipeline in ExampleSBSImages is completely open. I can download the pipeline to my computer while not logged into AWS, and I cannot do that with CP05_ecs.config. -John


#15

Did you move them or create the files with a different tool than you were using the last time it worked? I had an issue a week or two back where files I’d uploaded through the S3 browser tool weren’t working in CP (but ones I’d uploaded via the command line or my FTP tool worked just fine). I’d assumed it was user error in setting the permissions wrong in the browser upload, but maybe there was something greater at work. FWIW, in that situation I had to give a completely different path name in the end; going back and re-uploading the files via the command line to the old path didn’t allow them to be accessed by DCP, and only when I put them in a fresh location was I able to access them again.

I’m grasping at straws here a bit because you’re right, this doesn’t make a tremendous amount of sense. Sorry!


#16

Whew, I’m trying a lot and it’s not working. I made a new bucket and cluster and copied input files that worked two days ago into S3 via the command line. I triple-checked my config files (which worked two days ago), and the workers appear to start but don’t do anything or even give IOErrors. A single logging file gets copied to S3 (aws-logs-write-test: “Permission Check Successful”), but the logging on SQS does not get uploaded. I’m attaching screenshots of the worker instance log. If I log on to the worker, this is the output:

root@cf915b24507a:/home/ubuntu# ls /home/ubuntu/bucket
ExampleSBSImages ecsconfigs not_working_06 working_03 working_04 working_05
root@cf915b24507a:/home/ubuntu# ls /home/ubuntu/bucket/ecsconfigs/
CP04_ecs.config
root@cf915b24507a:/home/ubuntu# ls /home/ubuntu/bucket/ExampleSBSImages/
ls: cannot access /home/ubuntu/bucket/ExampleSBSImages/: Not a directory

The messages in the last screenshot just start repeating:






#17

Looks like there are two issues here: one looks like an old bug where something got hardcoded, and the other is a backwards-compatibility issue with 1.1.0. I just pushed a fix that should take care of both to the 1.1.0 and latest tags, and I’ll try to push a bugfix for 1.0.0 as well.

Sorry again!

ETA: I should add, though, that while what I just pushed should have ironed out those two bugs, neither explains why you can’t ls /home/ubuntu/bucket/ExampleSBSImages/, and I’m not sure what’s going on with that. If you can’t get into the directory you need on your bucket, you still won’t be able to run DCP successfully.

ETA2: Actually, I think the hardcoded bug may inadvertently have been a clue: there’s a strange issue we found in S3FS (which mounts the bucket) where you sometimes have to ls directories (without a trailing slash) before you can actually access the files inside, and the hardcoded line was a way to get around that issue. The fix I pushed a few minutes ago may therefore actually solve your inability to access files. I can’t say for sure, and there may still be a permissions issue, but it’s worth a try.
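In case it’s useful in the meantime, here’s a rough sketch of that workaround from inside a worker container; the pipeline path is just the one from your log, and this is only an illustration, not DCP’s actual code:

import os

pipeline = "/home/ubuntu/bucket/ExampleSBSImages/ExampleSBS_06.cppipe"

# List the parent directory (no trailing slash) first so S3FS populates it...
os.listdir(os.path.dirname(pipeline))

# ...then the file itself should be readable.
with open(pipeline) as f:
    print(f.readline())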