DC-CP AWS Set-up Question


#1

I’m trying to get DC-CP set up on AWS and am running into some issues. This is my first time using the AWS environment, so it’s probably something silly that I’m not doing correctly. I followed the wiki guide and believe I have deployed the fleet request.

I have the following message in my default cluster:

service DistributedCPService was unable to place a task because no container instance met all of its requirements. Reason: No Container Instances were found in your cluster. For more information, see the Troubleshooting section.

I have an EC2 spot fleet role and an EC2 instance role set up as per the guide, so it doesn’t seem like those should be the cause of the problem. My fleet JSON is as follows:

{
    "IamFleetRole": "arn:aws:iam::xxxxxxxxxxxx:role/aws-ec2-spot-fleet-role",
    "AllocationStrategy": "lowestPrice",
    "TargetCapacity": 3,
    "SpotPrice": "0.02",
    "ValidFrom": "2018-01-20T20:28:54Z",
    "ValidUntil": "2018-07-20T20:28:54Z",
    "TerminateInstancesWithExpiration": true,
    "LaunchSpecifications": [
        {
            "ImageId": "ami-c9c87cb1",
            "InstanceType": "m4.xlarge",
            "KeyName": "cp-01.pem",
            "IamInstanceProfile": {
                "Arn": "arn:aws:iam::xxxxxxxxxxxx:instance-profile/ecsInstanceRole"
            },
            "BlockDeviceMappings": [
                {
                    "DeviceName": "/dev/xvda",
                    "Ebs": {
                        "DeleteOnTermination": true,
                        "VolumeType": "gp2",
                        "VolumeSize": 8,
                        "SnapshotId": "snap-0b52be5bdbda1ac5f"
                    }
                },
                {
                    "DeviceName": "/dev/xvdcz",
                    "Ebs": {
                        "DeleteOnTermination": true,
                        "VolumeType": "gp2"
                    }
                }
            ],
            "NetworkInterfaces": [
                {
                    "DeviceIndex": 0,
                    "SubnetId": "subnet-90808cd8",
                    "DeleteOnTermination": true,
                    "AssociatePublicIpAddress": true,
                    "Groups": [
                        "sg-d8718da7"
                    ]
                }
            ]
        }
    ],
    "Type": "maintain"
}

(the xxxxxxxxxxxx is replaced with my account ID in the actual file)
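
In case it helps with diagnosis, here is a rough boto3 sketch of what I think the fleet step boils down to: submitting this request and then checking whether any container instances ever register with my default cluster. The file path and the “default” cluster name are just my local choices/guesses, not anything from the guide.

import json
import time
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")
ecs = boto3.client("ecs", region_name="us-west-2")

# Submit the spot fleet request defined in the JSON above
# ("fleet.json" is just where I saved the file locally).
with open("fleet.json") as f:
    config = json.load(f)
response = ec2.request_spot_fleet(SpotFleetRequestConfig=config)
print("Spot fleet request id:", response["SpotFleetRequestId"])

# The ECS event says "No Container Instances were found in your cluster",
# so poll the (assumed) "default" cluster to see whether instances register.
for _ in range(10):
    instances = ecs.list_container_instances(cluster="default")
    print("Registered container instances:", len(instances["containerInstanceArns"]))
    time.sleep(30)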

Any help would be appreciated, thanks!


#2

Hi,

Can you try removing the “.pem” extension from your key file name and see if that works? We’ve got that listed as a possible trouble point on the “troubleshooting” page of the wiki, but I’ll make sure to add it to the “Step 3” wiki page too.
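
If it helps, you can also list the key pair names AWS actually knows about; KeyName in the fleet file has to match one of them exactly, and they won’t have a file extension. A minimal boto3 sketch (region assumed to be us-west-2):

import boto3

# List the EC2 key pairs registered in this region. "KeyName" in the fleet
# request must match one of these exactly (no ".pem" extension).
ec2 = boto3.client("ec2", region_name="us-west-2")
for key_pair in ec2.describe_key_pairs()["KeyPairs"]:
    print(key_pair["KeyName"])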


#3

Hi,

Sorry, I didn’t see the troubleshooting page. I just tried removing the .pem extension and re-sending everything, and it still displays the same message. Is it normal for the terminal to just continuously display “Service Updated” after running fleet.json?


#4

The startCluster function will just say “ServiceUpdated” and then finish; if you want to follow the processing interactively, you need to run the monitor function.

When you say it still displayed the same message, do you mean in the terminal, or that when you log in to AWS you still get the same error message about a bad config? If the latter, can you confirm you are indeed in the us-west-2 (Oregon) region?
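
One quick way to check, sketched with plain boto3 rather than whatever the wiki scripts use, and assuming the cluster is still named “default”:

import boto3

# Ask ECS in us-west-2 (Oregon) whether the cluster exists and how many
# container instances have registered with it. "default" is an assumption.
ecs = boto3.client("ecs", region_name="us-west-2")
result = ecs.describe_clusters(clusters=["default"])

for cluster in result["clusters"]:
    print(cluster["clusterName"], cluster["status"],
          "registered instances:", cluster["registeredContainerInstancesCount"])
for failure in result["failures"]:
    print("Not found:", failure["arn"], failure["reason"])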


#5

Sorry, yeah, I mean in the terminal. The script runs and then doesn’t give me a new prompt to enter any more commands.

I do indeed have everything set up in Oregon. I’ll double check all the config files and roles tonight to make sure I didn’t screw something up in there. Thanks for your help!


#6

To be clear- if you look in the AWS console, is your cluster now in fact launching and the problem is just the startCluster script not terminating properly, or is your cluster not launching at all?

If it’s the latter, you may want to roll back the AMI and snapshot updates I pushed a couple of weeks ago; the older versions should DEFINITELY work. I think the new ones should too, but let’s start with something we know works.


#7

Oh! I think the problem was that I had set the bid price to 0.02 per instance when the going rate was 0.06. Raising it seems to have solved the problem: three spot instances are created and it’s now attempting to do the analysis (a quick way to check the going rate is sketched after the questions below). Sorry, but I have some additional questions:

  1. The monitor script is currently outputting:

2018-01-30 11:59:11.338124 In process: 0 Pending 12
2018-01-30 12:02:12.549290 In process: 11 Pending 1
2018-01-30 12:03:12.620662 In process: 2 Pending 10
2018-01-30 12:04:12.663350 In process: 0 Pending 12

Does that seem normal? Is there a proper way to abort the job/fleet with a command? I’ve just been deleting/purging the queues and the cluster.

  2. Would you happen to have a public data set and sample dc-cp pipeline/job/fleet files available for testing purposes? I’m not 100% sure if the test pipeline that I’m using is properly configured.
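
(As mentioned above, here is a rough boto3 way to check the going spot rate so the bid in the fleet file isn’t set below it. The m4.xlarge instance type and Linux/UNIX product match my fleet file; the rest of the query is just a reasonable guess.)

from datetime import datetime, timedelta
import boto3

# Look up recent spot prices for the instance type in the fleet file so the
# bid ("SpotPrice") can be set at or above the going rate.
ec2 = boto3.client("ec2", region_name="us-west-2")
history = ec2.describe_spot_price_history(
    InstanceTypes=["m4.xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    MaxResults=10,
)
for entry in history["SpotPriceHistory"]:
    print(entry["AvailabilityZone"], entry["SpotPrice"], entry["Timestamp"])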

Thanks for your help!


#8

“Does that seem normal?”

Nope, it seems like your jobs aren’t processing properly. If you log in to AWS in your web browser and go to CloudWatch -> Logs, you can see what the problem is. My guess is that if they’re dying that fast, it’s a bad path to the CSV, a bad path WITHIN the CSV to the images, a bad path to the pipeline, OR a bad metadata configuration. Check out the logs and see what they say.
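
If you’d rather pull the logs from a script than the console, a rough boto3 sketch like this should dump the most recent events; the log group name below is a placeholder, use whatever group DC-CP actually writes to (you can see it in the CloudWatch console):

import boto3

# Dump recent events from the (placeholder) log group, newest streams first,
# to see why the jobs are dying so quickly.
logs = boto3.client("logs", region_name="us-west-2")
log_group = "your-dcp-log-group"  # placeholder; check CloudWatch->Logs for the real name

streams = logs.describe_log_streams(
    logGroupName=log_group, orderBy="LastEventTime", descending=True, limit=3
)
for stream in streams["logStreams"]:
    events = logs.get_log_events(
        logGroupName=log_group, logStreamName=stream["logStreamName"], limit=20
    )
    for event in events["events"]:
        print(event["message"])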

“Would you happen to have a public data set and sample dc-cp pipeline/job/fleet files available for testing purposes? I’m not 100% sure if the test pipeline that I’m using is properly configured.”

We don’t, but that’s a really good idea; I’ll put an issue in the DCP repo to create one. Unfortunately it won’t happen right away. In the meantime, I’d definitely recommend looking over the job submission and troubleshooting wiki pages.


#9

Also

“Is there a proper way to abort the job/fleet with a command? I’ve just been deleting/purging the queues and the cluster.”

If you purge the queue while the monitor is running, it should kill the fleet, tear down the queue, etc.
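
If you ever need to tear things down by hand (say the monitor isn’t running), something along these lines should also work with plain boto3; the fleet request id and queue name below are placeholders for whatever yours are actually called:

import boto3

# Manual teardown: cancel the spot fleet (terminating its instances) and
# purge the job queue. The id and queue name are placeholders.
ec2 = boto3.client("ec2", region_name="us-west-2")
sqs = boto3.client("sqs", region_name="us-west-2")

ec2.cancel_spot_fleet_requests(
    SpotFleetRequestIds=["sfr-xxxxxxxx"], TerminateInstances=True
)

queue_url = sqs.get_queue_url(QueueName="your-dcp-job-queue")["QueueUrl"]
sqs.purge_queue(QueueUrl=queue_url)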


#10

Okay, great, thanks! I’ll continue playing with it.