cluster authentication issues after register-key --force #175

Open
Description

@Jeltje
Contributor

I followed the READMEs for cgcloud-core and cgcloud-toil to set up on my (firewalled) podcloud VM.

Because I already had a key registered (from my old VM, which crashed and took its id_rsa.pub with it), I registered the new key with cgcloud register-key --force ~/.ssh/id_rsa.pub

Running cgcloud create-cluster --leader-instance-type m3.medium --instance-type c3.8xlarge --share shared/ --spot-bid 1.0 -s 1 toil failed at the rsync step that copies from shared/, so I ran the same command without the --share option.
This time the cluster was created:
cgcloud list toil-leader

INFO: Using zone 'us-west-2a' and namespace '/jeltje.van.baren/'
i-abcb3770      jeltje.van.baren_toil-leader    0       172.31.31.92    52.40.118.17    i-abcb3770      2016-05-26T17:48:29.000Z        running

However, cgcloud ssh toil-leader fails with an ssh error (full output pasted below).
I can't ping the machine either.

Ping and ssh to other machines work fine from the VM, so I'm assuming the authentication at EC2 is somehow messed up?

Full error:

INFO: Using zone 'us-west-2a' and namespace '/jeltje.van.baren/'
INFO: Binding to instance ...
INFO: ... waiting for instance i-abcb3770 ...
INFO: ... running, waiting for assignment of public IP ...
INFO: ... assigned, waiting for SSH port ...
INFO: ... open ...
INFO: ... instance ready.
Permission denied (publickey).
Traceback (most recent call last):
  File "/home/ubuntu/cgcloud/bin/cgcloud", line 9, in <module>
    load_entry_point('cgcloud-core==1.3.8', 'console_scripts', 'cgcloud')()
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/cli.py", line 49, in main
    app.run( args )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/lib/util.py", line 300, in run
    command.run( options )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 81, in run
    return self.run_in_ctx( options, ctx )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 105, in run_in_ctx
    return self.run_on_role( options, ctx, role )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 124, in run_on_role
    return self.run_on_box( options, box )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 164, in run_on_box
    self.run_on_instance( options, box )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 232, in run_on_instance
    self.ssh( options, box )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 219, in ssh
    status = box.ssh( user=self._user( box, options ), command=options.command )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/box.py", line 1050, in ssh
    raise RuntimeError( 'ssh failed' )
RuntimeError: ssh failed

Activity

hannes-ucsc (Contributor) commented on May 26, 2016

Delete your instances. Delete your key pair in the EC2 console and try register-key again, but without --force.

Jeltje (Contributor, Author) commented on May 26, 2016

I tried it. Same error:

INFO: === Copying the contents of /home/ubuntu/production/shared/ to ~/shared on leader ===
Connection closed by 52.40.186.164

hannes-ucsc (Contributor) commented on May 26, 2016

You didn't delete the key pair; I can still see the old one.

hannes-ucsc (Contributor) commented on May 26, 2016

You may also want to start from scratch with a new SSH key pair locally. Maybe the private key doesn't match the public key.
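
One way to test the mismatch hypothesis before generating new keys is to derive the public key from the private key with ssh-keygen -y and compare it with the contents of the .pub file. The following is a minimal Python 3 sketch, not part of cgcloud; the function names are made up for illustration, and it assumes ssh-keygen is on the PATH and that both files are in OpenSSH format.

```python
import subprocess
from pathlib import Path


def key_material(public_key_line):
    """Return (key type, base64 blob) from an OpenSSH public key line.

    The trailing comment (often an email address) is ignored, because it
    plays no part in authentication.
    """
    fields = public_key_line.split()
    if len(fields) < 2:
        raise ValueError("not an OpenSSH public key line: %r" % public_key_line)
    return fields[0], fields[1]


def derive_public_key(private_key_path):
    """Ask ssh-keygen to print the public key that belongs to a private key.

    ssh-keygen prompts for the passphrase if the private key is protected.
    """
    return subprocess.check_output(
        ["ssh-keygen", "-y", "-f", str(private_key_path)], text=True
    )


def keys_match(private_key_path, public_key_path):
    """True if the .pub file holds the public half of the private key."""
    derived = key_material(derive_public_key(private_key_path))
    stored = key_material(Path(public_key_path).read_text())
    return derived == stored
```

Calling keys_match with the paths of ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub returns False when the two files do not belong together, in which case the registered public key can never authenticate the local private key.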

Jeltje (Contributor, Author) commented on May 27, 2016

I tried a few new key pairs, with and without password protection. I verified that the key pair fingerprint changed on EC2 after running register-key. Below is the error I get when creating a cluster with --share:

INFO: .
INFO: ... cloud-init done.
INFO: === Copying the contents of /home/ubuntu/production/shared/ to ~/shared on leader ===
Connection closed by 52.40.25.136
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(226) [sender=3.1.1]
INFO: Terminating instance ...
Traceback (most recent call last):
  File "/home/ubuntu/cgcloud/bin/cgcloud", line 9, in <module>
    load_entry_point('cgcloud-core==1.3.8', 'console_scripts', 'cgcloud')()
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/cli.py", line 49, in main
    app.run( args )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/lib/util.py", line 300, in run
    command.run( options )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/cluster_commands.py", line 115, in run
    super( CreateClusterCommand, self ).run( options )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 81, in run
    return self.run_in_ctx( options, ctx )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/cluster_commands.py", line 37, in run_in_ctx
    self.run_on_cluster_type( ctx, options, cluster_type )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/cluster_commands.py", line 121, in run_on_cluster_type
    self.run_on_role( options, ctx, self.cluster.leader_role )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 124, in run_on_role
    return self.run_on_box( options, box )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 471, in run_on_box
    box.terminate( wait=False )
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 467, in run_on_box
    self.run_on_creation( box, options )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/cluster_commands.py", line 128, in run_on_creation
    leader.rsync( args=[ '-r', local_path, ":shared/" ], ssh_opts=options.ssh_opts )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/box.py", line 1057, in rsync
    subprocess.check_call( [ 'rsync', '-e', ' '.join( ssh_args ) ] + args )
  File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['rsync', '-e', u'ssh mesosbox@ec2-52-40-25-136.us-west-2.compute.amazonaws.com -A', '-r', '/home/ubuntu/production/shared/', ':shared/']' returned non-zero exit status 12

Jeltje (Contributor, Author) commented on May 27, 2016

When I start the cluster without --share, I can ssh ubuntu@52.40.39.137 just fine, but ssh mesosbox@52.40.39.137 gets Permission denied (publickey).

ssh -vvv mesosbox@52.40.39.137 full log output here

hannes-ucsc (Contributor) commented on May 27, 2016

What's CGCLOUD_KEYPAIRS set to?

Jeltje (Contributor, Author) commented on May 27, 2016

On the toil-leader, cat /home/ubuntu/.ssh/authorized_keys shows two different ssh-rsa keys, both ending with my email. The second key matches my id_rsa.pub.

/home/mesosbox/.ssh/authorized_keys shows only the first key, which explains why it won't let me log on.
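
The manual comparison above can be sketched in Python 3 as a check of whether a given public key is listed in an authorized_keys file. This is an illustration only: the function names are hypothetical, and the parser is simplified in that it assumes any per-key options contain no spaces.

```python
def key_blobs(authorized_keys_text):
    """Collect the base64 key blobs listed in an authorized_keys file.

    Each entry looks like "[options] keytype base64 [comment]", so the blob
    is the field that follows the key type. Comments such as an email
    address are ignored, which matters here because both keys end with the
    same email.
    """
    blobs = set()
    for line in authorized_keys_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        for i, field in enumerate(fields[:-1]):
            if field.startswith(("ssh-", "ecdsa-", "sk-")):
                blobs.add(fields[i + 1])
                break
    return blobs


def is_authorized(public_key_line, authorized_keys_text):
    """True if the key from a .pub file appears in the authorized_keys text."""
    blob = public_key_line.split()[1]
    return blob in key_blobs(authorized_keys_text)
```

Run against the two files described above, is_authorized would return True for /home/ubuntu/.ssh/authorized_keys and False for /home/mesosbox/.ssh/authorized_keys, which matches the observed Permission denied (publickey) for mesosbox.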

Jeltje (Contributor, Author) commented on May 27, 2016

CGCLOUD_KEYPAIRS on the master? Or on my VM? echo $CGCLOUD_KEYPAIRS gives nothing on either.

hannes-ucsc (Contributor) commented on May 27, 2016

Then you don't have it set.

hannes-ucsc (Contributor) commented on May 27, 2016

Upon investigation on the actual box, it turns out that dots in the namespace prevented cgcloudagent from creating the SQS queue. We should tweak the __me__ derivation to strip dots. We should also tighten the regex that validates namespaces to disallow dots.

The workaround for now is to set a namespace without dots, e.g. CGCLOUD_NAMESPACE=/foo/
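
For context, SQS queue names may only contain alphanumeric characters, hyphens and underscores, which is consistent with dots in the namespace breaking queue creation. The two proposed changes could look roughly like the Python sketch below. It is not cgcloud's actual code: the regex, the function names, and the choice to drop dots rather than substitute another character are all assumptions made for illustration.

```python
import re

# Hypothetical stricter pattern: slash-separated components made only of
# lowercase letters, digits, underscores and hyphens, so dots are rejected.
NAMESPACE_RE = re.compile(r"/([a-z0-9_-]+/)*")


def is_valid_namespace(namespace):
    """True if the whole namespace string matches the dot-free pattern."""
    return NAMESPACE_RE.fullmatch(namespace) is not None


def default_namespace(user_name):
    """Derive a dot-free namespace from a user name such as 'jeltje.van.baren'."""
    return "/" + user_name.replace(".", "") + "/"
```

With this pattern, /jeltje.van.baren/ would be rejected up front, while /foo/ and /jeltje/ would pass.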

Jeltje (Contributor, Author) commented on May 27, 2016

Changing the namespace hasn't fixed the problem.
export CGCLOUD_NAMESPACE=/jeltje/
cgcloud create -IT toil-box
cgcloud create-cluster --leader-instance-type m3.medium --instance-type c3.8xlarge --spot-bid 1.0 -s 1 toil

cgcloud list toil-leader

INFO: Using zone 'us-west-2a' and namespace '/jeltje/'
i-19eef3b5      jeltje_toil-leader      0       172.31.46.57    52.34.135.67    i-19eef3b5      2016-05-27T16:40:23.000Z        running

But I can't ssh to it. Yesterday I was at least able to ssh ubuntu@52.34.135.67 (but not ssh mesosbox@52.34.135.67), but that no longer works either, so I can't see what's going on with the ssh keys.

hannes-ucsc (Contributor) commented on May 27, 2016

The most recent failure was the result of a misconfiguration on the user's end (multiple SSH agent instances).
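
The thread does not say how the extra agents caused the failure. One plausible reading is that ssh was talking to an agent that did not hold the registered key, since ssh uses whichever agent SSH_AUTH_SOCK points to. Under that assumption, a quick Python 3 diagnostic might look like the sketch below; the function names are made up, and it assumes a POSIX ps and OpenSSH's ssh-add are available.

```python
import os
import subprocess


def agent_pids(ps_output):
    """Return the PIDs of ssh-agent processes in `ps -e -o pid= -o comm=` output."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 1)
        # Some platforms print the full path of the command, so compare basenames.
        if len(fields) == 2 and os.path.basename(fields[1].strip()) == "ssh-agent":
            pids.append(int(fields[0]))
    return pids


def report():
    """Print the running agents, the agent socket in use, and the keys it holds."""
    ps_output = subprocess.check_output(
        ["ps", "-e", "-o", "pid=", "-o", "comm="], text=True
    )
    print("ssh-agent PIDs:", agent_pids(ps_output))
    print("SSH_AUTH_SOCK:", os.environ.get("SSH_AUTH_SOCK"))
    # ssh-add -l lists the keys held by the agent behind SSH_AUTH_SOCK; it
    # exits non-zero when that agent is empty or cannot be reached.
    listing = subprocess.run(["ssh-add", "-l"], capture_output=True, text=True)
    print(listing.stdout or listing.stderr)
```

If report() shows more than one ssh-agent PID, or a key list that lacks the key registered with register-key, the current shell is probably attached to the wrong agent.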

          cluster authentication issues after register-key --force · Issue #175 · BD2KGenomics/cgcloud