Solved: 1.3 upgrade and core services failure

Joshua Warcop · 10-26-2016

I've lost a lot of time today doing all the troubleshooting I can. A few of the core services won't stay running, so localhost:14141 doesn't work either locally or remotely. Does TAC or anyone else want to run with it from here?

router=4.0.0.14944 goes into the FATAL state, and some of the others just exit and go into BACKOFF.

Since the services are not coming up, I can't evacuate any host or deal with the grape hosts, services, etc.

$ sudo service grapevine status
grapevine is running
grapevine_capacity_manager             RUNNING   pid 4372, uptime 0:13:25
grapevine_capacity_manager_lxc_plugin  RUNNING   pid 9794, uptime 0:00:31
grapevine_cassandra                    RUNNING   pid 3799, uptime 0:13:42
grapevine_client                       BACKOFF   Exited too quickly (process log may have details)
grapevine_coordinator_service          RUNNING   pid 3808, uptime 0:13:42
grapevine_dlx_service                  BACKOFF   Exited too quickly (process log may have details)
grapevine_log_collector                RUNNING   pid 3811, uptime 0:13:42
grapevine_root                         RUNNING   pid 5869, uptime 0:08:09
grapevine_supervisor_event_listener    STARTING
grapevine_ui                           RUNNING   pid 3797, uptime 0:13:42
reverse-proxy=4.0.0.14944              RUNNING   pid 3802, uptime 0:13:42
router=4.0.0.14944                     FATAL     Exited too quickly (process log may have details)
(grapevine)
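
For anyone who wants to pull the unhealthy services out of a status dump like the one above, here is a minimal Python sketch. It assumes the output follows the supervisor-style layout shown (service name, then an upper-case state, then optional detail); the function name `unhealthy_services` and the list of recognised states are my own choices for illustration, not part of grapevine.

```python
import re

# Assumption: states follow the supervisor-style vocabulary seen in the output above.
STATUS_LINE = re.compile(
    r"^(?P<name>\S+)\s+"
    r"(?P<state>STOPPED|STARTING|RUNNING|BACKOFF|STOPPING|EXITED|FATAL|UNKNOWN)\b"
)


def unhealthy_services(status_text):
    """Return (name, state) pairs for every service that is not RUNNING."""
    unhealthy = []
    for line in status_text.splitlines():
        match = STATUS_LINE.match(line.strip())
        # Lines such as "grapevine is running" or the shell prompt do not match.
        if match and match.group("state") != "RUNNING":
            unhealthy.append((match.group("name"), match.group("state")))
    return unhealthy
```

Fed the output above, this should pick out grapevine_client, grapevine_dlx_service, grapevine_supervisor_event_listener and router=4.0.0.14944, which are the ones to chase in the process logs.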

EDIT: I can probably hack my way through this to figure a few things out. I'd rather work here with some visibility. I don't want to bother Nick in TAC. I'll jump through some services and if I end up breaking it enough I'll rebuild.

Joshua Warcop · 11-01-2016

I fixed it without running a reset, but here is what I got from TAC after escalation:

Please perform the following steps to bring the cluster back to a clean/running state:

1. Ensure both VMs are powered "on"

2. SSH into one of the VMs and run the following command:

* reset_grapevine

3. A series of prompts will be presented, asking whether you want to delete specific data/configuration. Since the customer wants to save their cluster data, answer "no" to each prompt.

After answering all the prompts, the command will proceed to reset the cluster back to a clean/running state with their data.

Depending on the speed of their hardware, this operation will take around 30-60 minutes to complete.


aradford · 10-26-2016

Hi,

We have changed port 14141. It should redirect to

https://<apic>/controllerDevelopment

A common startup issue is NTP. Can you verify that the NTP server is reachable?
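
One way to check NTP reachability from any machine with Python is to send a bare SNTP client request and see whether a reply comes back. This is only a sketch under assumptions: the server name `ntp.example.com` is a placeholder for whatever NTP server the controller is configured with, and the helper names are made up for illustration.

```python
import socket
import struct

NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def build_sntp_request():
    # First byte 0x1B = leap indicator 0, version 3, mode 3 (client).
    return b"\x1b" + 47 * b"\0"


def parse_transmit_time(packet):
    """Return the server's transmit timestamp as Unix seconds."""
    if len(packet) < 48:
        raise ValueError("short NTP packet")
    (seconds,) = struct.unpack("!I", packet[40:44])
    return seconds - NTP_EPOCH_OFFSET


def query_ntp(server, timeout=3.0):
    """Send one SNTP request over UDP port 123 and return the server time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_sntp_request(), (server, 123))
        packet, _ = sock.recvfrom(512)
    return parse_transmit_time(packet)


if __name__ == "__main__":
    try:
        # Placeholder host: replace with the NTP server the controller uses.
        print("server time (unix):", query_ntp("ntp.example.com"))
    except OSError as exc:
        print("NTP server not reachable:", exc)
```

A timeout here only shows that this machine got no answer; running it from the controller host itself is the more meaningful test.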

Joshua Warcop · 10-27-2016

Sure, that redirect is handled by the reverse-proxy/router, but when I SSH into the box, the CLI still tries to connect to localhost:14141 to run service commands. All 'grape' commands report localhost:14141 as unavailable.

NTP is good.

Joshua Warcop · 10-27-2016

The core services can't connect to the message broker, which is RabbitMQ, right?

169.254.1.1:5672
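
A quick way to confirm whether anything is listening on the two endpoints mentioned in this thread (localhost:14141 and 169.254.1.1:5672) is a plain TCP connect test. This is a minimal sketch, not a grapevine tool; `port_open` is a hypothetical helper name, and a successful connect only proves the port accepts connections, not that the service behind it is healthy.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


if __name__ == "__main__":
    # Endpoints taken from this thread; run this on the controller host itself.
    for host, port in [("localhost", 14141), ("169.254.1.1", 5672)]:
        state = "open" if port_open(host, port) else "closed or unreachable"
        print(f"{host}:{port} is {state}")
```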

cchitnis · 10-27-2016

When you are redirected to controllerDevelopment, are you able to log in, or do you still get a login error? (This assumes a fresh login attempt made by typing the direct link to controllerDevelopment, not a session where you were already logged in through the cluster and redirected.)

A couple of the grapevine core services in your previous output appear to be in the BACKOFF state. Did they recover at all?

I understand "grape instance status" might not be working. Is it the same with the "grape application status" command?

Worst case, I'd suggest resetting grapevine so that ALL the services come up and running in their own sweet time; but of course, that's the last resort.

Joshua Warcop · 11-01-2016