Getting DCP to run on AWS continued... problem with 'startCluster'


#1

Hi,

I've been working on getting DCP running on AWS. I opened a new account and configured everything up to the point of running the "startCluster" command. I recorded all the AWS and other setup steps with screenshots and text and compiled that into a .pdf. Unfortunately the forum doesn't accept PDF uploads, so I submitted it to the CP admin folks; hopefully they will post it somewhere.

My problem is this: after getting everything configured, I tried running a test analysis. fab setup and submitJob seem to work. I see an ecs.config file appear in S3 and the jobs in SQS, but when I try to run the next command, this is the output:

~/Distributed-CellProfiler$ python run.py startCluster files/fleet03.json
Traceback (most recent call last):
  File "run.py", line 338, in <module>
    startCluster()
  File "run.py", line 199, in startCluster
    userData=generateUserData(ecsConfigFile,DOCKER_BASE_SIZE)
  File "run.py", line 78, in generateUserData
    subprocess.Popen(cmd.split())
  File "/usr/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1343, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

Here is a zip with all the relevant files: project.zip (4.4 MB)

I obscured my numbers in a few places (e.g. xxxxx). I tried changing a bunch of things, but I keep getting the same error, "No such file or directory".

In case there was something weird going on with permissions, I did all my file manipulations on the Ubuntu CP control node. I downloaded from CellProfiler:

$ wget http://d1zymp9ayga15t.cloudfront.net/content/Examplezips/ExampleSBSImages.zip

and did all my editing with vi. I uploaded to S3 like this:
$ sudo aws s3 sync project/ s3://cp.project.bucket/project

The comment earlier about the CP Lab having set up their AWS services a long time ago seems like a red flag, or at least a clue that this could be a permissions issue.

Thanks again for your time. -John


#2

Do you have cloud-image-utils installed? It was added as a dependency a couple of weeks ago. Try

sudo apt-get install cloud-image-utils

on your control node, then run the startCluster step again.

If that doesn't work, or if you know it's already installed, can you check whether the two temp files created in the immediately preceding lines of the code (temp_config.txt and temp_boothook.txt) are there?
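If it's easier to check from Python, something like this will tell you whether the relevant binaries are actually on the PATH (just a sketch; I'm assuming generateUserData shells out to the helpers that ship in cloud-image-utils, such as write-mime-multipart):

# Sketch: confirm the cloud-image-utils helpers are on PATH before run.py
# tries to Popen() them. Which helper run.py actually calls is an assumption
# here; write-mime-multipart and cloud-localds both ship in that package.
from distutils.spawn import find_executable  # available on Python 2.7 and 3

for tool in ("write-mime-multipart", "cloud-localds"):
    path = find_executable(tool)
    print("%-20s -> %s" % (tool, path or "NOT FOUND (install cloud-image-utils)"))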


#3

Ah, the last requirement; I missed that, sorry. Different error now:

$ python run.py startCluster files/fleet03.json
Traceback (most recent call last):
  File "run.py", line 338, in <module>
    startCluster()
  File "run.py", line 206, in startCluster
    requestInfo = ec2client.request_spot_fleet(SpotFleetRequestConfig=spotfleetConfig)
  File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 253, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 543, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InvalidSpotFleetRequestConfig) when calling the RequestSpotFleet operation: Parameter: SpotFleetRequestConfig.IamFleetRole is invalid.

I used your exampleFleet.json to make mine. When I had this working for a day last week, I searched around and found other values for the image ID ("ami-xxxx") and the snapshot ID ("snap-xxxx"). Could these be an issue?

I checked on my control node and on S3 and don't see those temp files. Where would they be located?

$ sudo find / -name 'temp_config.txt'
$ sudo find / -name 'temp_boothook.txt'

(no output in the terminal) -John


#4

I checked on my control node and on S3 and don't see those temp files. Where would they be located?

The files would have been cleaned up already by the point of your second error, so they should be gone now.

I used your exampleFleet.json to make mine. When I had this working for a day last week, I searched around and found other values for the image ID ("ami-xxxx") and the snapshot ID ("snap-xxxx"). Could these be an issue?

I'm not sure, but I don't think so. The exact error is SpotFleetRequestConfig.IamFleetRole is invalid, so can you confirm that whatever role you have set as your IamFleetRole in your exampleFleet.json is set up correctly? Did you replace the XXXXXX string with your account ID, does the role have exactly that name, etc.? The safest thing to do is to go into the console and copy the role ARNs directly into the json file.
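If it helps, you can also pull the ARN programmatically rather than eyeballing it in the console. A quick sketch (the role name below is a placeholder; substitute whatever your spot fleet role is actually called):

# Sketch: print the exact ARN to paste into the fleet json's IamFleetRole.
# "aws-ec2-spot-fleet-role" is a placeholder; use your role's real name.
import boto3

iam = boto3.client("iam")
role = iam.get_role(RoleName="aws-ec2-spot-fleet-role")
print(role["Role"]["Arn"])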


#5

I looked more closely, and it turns out the AWS console IAM configuration I have for the platform that worked briefly last week is quite different from my new one. I added the roles we've been working with to the Administrator group:

And I changed the "Trusted Entities" for the spot fleet role:

The ecsInstance role has the default trusted entity "ec2.amazonaws.com". Now my error is one I also got the last time I tried this, when I had to root out some values that worked:

$ python run.py startCluster files/fleet03.json
Traceback (most recent call last):
  File "run.py", line 338, in <module>
    startCluster()
  File "run.py", line 206, in startCluster
    requestInfo = ec2client.request_spot_fleet(SpotFleetRequestConfig=spotfleetConfig)
  File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 253, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 543, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InvalidSpotFleetRequestConfig) when calling the RequestSpotFleet operation: Invalid Amazon Machine Image(s) specified: The image id '[ami-03562b14]' does not exist (Service: AmazonEC2; Status Code: 400; Error Code: InvalidAMIID.NotFound; Request ID: 6acafd93-c5fd-4d2b-b131-c1d2dee1b18f).

So, I used these values last time and they appear to work this time too:

"LaunchSpecifications": [
    {
        "ImageId": "ami-5ec1673e",
        "InstanceType": "m4.xlarge",
        "KeyName": "id_rsa",
        "IamInstanceProfile": {
            "Arn": "arn:aws:iam::8xxxxxxxxxx3:instance-profile/ecsInstanceRole"
        },
        "BlockDeviceMappings": [
            {
                "DeviceName": "/dev/xvda",
                "Ebs": {
                    "DeleteOnTermination": true,
                    "VolumeType": "gp2",
                    "VolumeSize": 8,
                    "SnapshotId": "snap-15cfb226"
                }
            },

Now… some spot fleet instances started, but they aren't doing anything. I notice that there is a service running in the EC2 Container Service, but there isn't a task or an instance. That is different from my other AWS account, where an instance is running or starts automatically and gives itself a name in the EC2 management console that matches the app name.

I’m going to post this before continuing to troubleshoot… -John


#6

It appears the ImageId and SnapshotId fields are region-specific, which I did not realize and will update the wiki to reflect.

I'm not certain why your instances weren't assigned correctly to your cluster. Did any error messages come up when you were running the startCluster script? It's possible that it's the AMI you used, which isn't the Amazon ECS AMI (it's not clear to me whether this was the AMI you used the time that it worked or not). If you're in region us-west-2, try running with ami-492ffd29 and snap-ef8c2213. If none of that helps, please send me the config file here (sanitized as much as you need, of course) or via email.
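As a quick sanity check before submitting another fleet request, you can also ask EC2 whether the ids in your json are visible from the region you're configured for. A sketch (swap in your own region and the ids from your fleet file):

# Sketch: verify that an ImageId / SnapshotId exist in the region you're using.
# The region and ids below are examples; substitute the values from your fleet json.
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-west-2")
try:
    ec2.describe_images(ImageIds=["ami-492ffd29"])
    ec2.describe_snapshots(SnapshotIds=["snap-ef8c2213"])
    print("Both ids exist in this region.")
except ClientError as err:
    # e.g. InvalidAMIID.NotFound or InvalidSnapshot.NotFound
    print("Lookup failed: %s" % err)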


#7

Hmm, now the startCluster step seems to be hung:
$ python run.py startCluster files/fleet03.json
Request in process. Wait until your machines are available in the cluster.
SpotFleetRequestId sfr-a803ce5d-498d-457f-84dc-5bcc3ab45af3

Nothing's happening, and usually something has by now…

To answer your question about errors: everything seemed normal before I made these AMI changes just now:

$ python run.py startCluster files/fleet03.json
Request in process. Wait until your machines are available in the cluster.
SpotFleetRequestId sfr-57d46a45-a6bf-4877-8783-b5761a4bb4f7
. . .
Cluster ready
Updating service
Service updated. Your job should start in a few minutes.

Except that it wasn't, of course. Now the spot fleet instances do not start. The SQS logs are created, but there aren't any entries, and the cluster is the same: one service but no instances or tasks.
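For anyone following along, checking this outside the console looks roughly like the following (a sketch; the queue and cluster names are placeholders for whatever the config defines):

# Sketch: check whether the job queue has messages and whether any container
# instances or tasks have actually registered with the cluster.
# "MyQueueName" and "MyClusterName" are placeholders; use your config's values.
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="MyQueueName")["QueueUrl"]
attrs = sqs.get_queue_attributes(
    QueueUrl=queue_url,
    AttributeNames=["ApproximateNumberOfMessages",
                    "ApproximateNumberOfMessagesNotVisible"],
)["Attributes"]
print(attrs)

ecs = boto3.client("ecs")
cluster = ecs.describe_clusters(clusters=["MyClusterName"])["clusters"][0]
print("registered instances: %s" % cluster["registeredContainerInstancesCount"])
print("running tasks:        %s" % cluster["runningTasksCount"])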


#8

You can look at the EC2 spot fleet console to find out if there are errors in your configuration or if you’re just priced out of a given machine at the moment.
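The same information is also available from the API if that's easier than clicking around the console; here's a sketch (fill in your own region and SpotFleetRequestId):

# Sketch: dump the event history for a spot fleet request. This surfaces
# configuration errors and price-related events without using the console.
import boto3
from datetime import datetime, timedelta

ec2 = boto3.client("ec2", region_name="us-west-2")
history = ec2.describe_spot_fleet_request_history(
    SpotFleetRequestId="sfr-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",  # your request id
    StartTime=datetime.utcnow() - timedelta(hours=2),
)
for record in history["HistoryRecords"]:
    print("%s  %s  %s" % (record["Timestamp"], record["EventType"],
                          record.get("EventInformation")))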

What's strange is that the AMI IDs listed don't come up when I search Amazon's own instance store for US-West-2. If I look for the same release version of the Amazon ECS AMI that we use, in your region I get ami-1ccd1f7c, which has a corresponding snapshot of snap-b2c2989c.
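For what it's worth, rather than hard-coding ids copied from another region, you can look the current ECS-optimized AMI up by name. A sketch (it assumes the standard "amazon-ecs-optimized" naming Amazon uses for these images, and that the first block device mapping is the root volume):

# Sketch: find the newest Amazon ECS-optimized AMI in a region, plus the
# snapshot behind its root volume, instead of hard-coding either value.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")
images = ec2.describe_images(
    Owners=["amazon"],
    Filters=[{"Name": "name", "Values": ["amzn-ami-*-amazon-ecs-optimized"]}],
)["Images"]
newest = max(images, key=lambda image: image["CreationDate"])
root_ebs = newest["BlockDeviceMappings"][0]["Ebs"]  # assumed to be the root volume
print("%s  %s" % (newest["ImageId"], newest["Name"]))
print("root snapshot: %s" % root_ebs["SnapshotId"])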


#9

OK, hang on: last time I had to create a new cluster in order to get an instance running; for some reason the default doesn't suffice.

This is what that looked like:

Note that down at the bottom I chose the spot fleet role. I notice it mentions an ecsInstance role being added automatically. On my other AWS account I only have the ecsInstance role…

Waiting for the startCluster to finish… gosh, I thought I had it, but it's still hanging on this command:

$ python run.py startCluster files/fleet03.json
Request in process. Wait until your machines are available in the cluster.
SpotFleetRequestId sfr-e1498129-4b6f-494f-880f-020037aa6960

Maybe it's the price… I'm bidding 0.3, but I'll restart with a higher price and see what happens.
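To see whether 0.3 is actually below the going rate, something like this should work (a sketch; adjust the region and instance type to match the fleet file):

# Sketch: recent spot prices for m4.xlarge, to compare against the fleet bid.
import boto3
from datetime import datetime, timedelta

ec2 = boto3.client("ec2", region_name="us-west-2")
history = ec2.describe_spot_price_history(
    InstanceTypes=["m4.xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.utcnow() - timedelta(hours=1),
)["SpotPriceHistory"]
for entry in history[:5]:
    print("%s  %s" % (entry["AvailabilityZone"], entry["SpotPrice"]))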


#10

Hmm, I noticed that when I purged the SQS queue, the new cluster I created was deleted… I restarted the process with a higher price and noticed that now a new cluster is created when fab setup is run.

But… startCluster is still hung, i.e. no new instances, no logs, etc.

$ python run.py startCluster files/fleet03.json
Request in process. Wait until your machines are available in the cluster.
SpotFleetRequestId sfr-8cebc8fc-60c9-463b-a5d0-f2ce43d45cac