Some virtual machines lost connection when migrating from VSS to Nexus 1000V

nvermande
Level 1

Hi there,

My problem is that when I migrate VMs from the standard vSwitch to the Nexus 1000V, some of them lose connectivity, even though their ports show as up in the Nexus.

Other VMs in the same VLAN keep their connectivity after being migrated into the same port-profile.

vPC-HM is configured with correct sub-groups (on two Catalyst 4948 switches, with one local EtherChannel of two ports on each). For example, I have:

eth1/1 sg0 sw1

eth1/2 sg0 sw1

eth1/3 sg1 sw2

eth1/4 sg1 sw2
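
For reference, a manual vPC-HM uplink port-profile matching this layout would typically look something like the sketch below (the profile name, interface numbers, and VLAN list are assumptions for illustration, not taken from the actual config):

port-profile type ethernet data-uplink
  switchport mode trunk
  switchport trunk allowed vlan 19,196
  channel-group auto mode on sub-group manual
  no shutdown
  state enabled

interface Ethernet1/1
  sub-group-id 0
interface Ethernet1/3
  sub-group-id 1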

This is on 4.0(4)SV1(2) with vSphere 4.0 build 208167.

Configuration of the vEthernet port-profile:

type: vethernet
  status: enabled
  capability l3control: no
  pinning control-vlan: -
  pinning packet-vlan: -
  system vlans: none
  port-group: data_19
  max ports: 1024
  inherit:
  config attributes:
    switchport mode access
    switchport access vlan 19
    no shutdown
  evaluated config attributes:
    switchport mode access
    switchport access vlan 19
    no shutdown
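
For reference, the show output above corresponds to a vEthernet port-profile configured roughly as follows (a sketch reconstructed from the attributes shown; the `vmware port-group` name is assumed to match the port-group field above):

port-profile type vethernet data_19
  vmware port-group data_19
  switchport mode access
  switchport access vlan 19
  no shutdown
  state enabled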

The uplink port-profiles are OK; everything is in production.

There are no errors in the standard logs.

One thing I noticed is the output of this command:

VSM(config)# module vem 4 execute vemcmd show port
  LTL    IfIndex   Vlan    Bndl  SG_ID Pinned_SGID  Type  Admin State  CBL Mode   Name
    8          0   3969       0     32          32  VIRT     UP    UP    4 Access l20
    9          0   3969       0     32          32  VIRT     UP    UP    4 Access l21
   10          0    301       0     32           1  VIRT     UP    UP    4 Access l22
   11          0   3968       0     32          32  VIRT     UP    UP    4 Access l23
   12          0    302       0     32           2  VIRT     UP    UP    4 Access l24
   13          0      1       0     32          32  VIRT     UP    UP    0 Access l25
   14          0   3967       0     32          32  VIRT     UP    UP    4 Access l26
   15          0   3967       0     32          32  VIRT     UP    UP    4 Access l27
   16   1a030000      1 T   305      0          32  PHYS     UP    UP    1  Trunk vmnic0
   17   1a030100      1 T   305      0          32  PHYS     UP    UP    1  Trunk vmnic1
   18   1a030200      1 T   304      2          32  PHYS     UP    UP    1  Trunk vmnic2
   19   1a030300      1 T   305      1          32  PHYS     UP    UP    1  Trunk vmnic3
   20   1a030400      1 T   304      1          32  PHYS     UP    UP    1  Trunk vmnic4
   48   1b030000     11       0     32           1  VIRT     UP    UP    4 Access vmk0
   49   1b030010     10       0     32           1  VIRT     UP    UP    4 Access vswif0
   50   1b030020     19       0     32           0  VIRT     UP    UP    4 Access PScs.eth0
   51   1b030030     19       0     32          32  VIRT     UP    UP    4 Access St.eth0
   53   1b030050     19       0     32           0  VIRT     UP    UP    4 Access PSsk.eth0
   54   1b030060     19       0     32           1  VIRT     UP    UP    4 Access AdM.eth0
   56   1b030080     10       0     32           1  VIRT     UP    UP    4 Access vmk1
   58   1b0300a0    196       0     32           0  VIRT     UP    UP    4 Access PSE ethernet0
   59   1b0300b0     19       0     32           1  VIRT     UP    UP    4 Access Citrix0.eth0
  304   16000008      1 T     0     32          32  VIRT     UP    UP    1  Trunk
  305   16000001      1 T     0     32          32  VIRT     UP    UP    1  Trunk

The VM at LTL 51 (St.eth0) is the one with no network link; you can see that its Pinned_SGID is 32! I suppose this is related to the problem.

The other VMs are distributed among the existing sub-groups and do have network connectivity.

What could be the origin of the problem, and how can we fix it?

thank you


mipetrin
Cisco Employee

Hi Nicolas,

You have correctly identified that the affected virtual machine has been pinned to an incorrect sub-group, which should be either 0 or 1, and thus it has no connectivity. I've seen issues like this in the past, caused by port-channel flaps or by a vMotion of a host (as in your case), as a result of CSCtg79060 - Virtual ports not getting pinned. To verify whether you have hit exactly the same issue, please run the following command on the VEM: "vemlog show all". If you see "No SG for pinning" or "No mbr ports in any sub_groups", then it is the same issue.
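
(For reference, this check can be run from the host's console along the following lines; a sketch, as the exact vemlog output format varies by release:)

~ # vemlog show all | grep -E "No SG for pinning|No mbr ports in any sub_groups"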

Let me know if this seems to be the same issue or not.

Thanks,
Michael

I'll try to get the output of that command as soon as possible and will let you know the results.

Thank you for your answer.

Nicolas

Hi Nicolas,

Could you also get the show running-config from the VSM, and from this host the output bundle of the 'vem-support all' command?

./Abhinav

I've been informed by the customer that he had recently connected an iSCSI device, through the software initiator emulated by the VMkernel (plugged into the Nexus 1000V).
This port-profile is in access mode, in a different VLAN of course. However, we can see the following errors in the "vemlog show all" output (LTL 57 corresponds to a VM in the PROD VLAN, not the VLAN of the VMkernel):

Oct 28 11:37:49.361055     58672   0   99  1   Error sf_port_attach_ack_handler : Setting veth_if_index ltl (57) veth_ifindex (0x1c000100)
Oct 28 11:37:49.361081     58673   0   99  1   Error Setting iscsi-multipath 0 failed for 57
Oct 28 11:37:49.367305     58674   0   99  1   Error remove port from BD failed oldBD 1

This VM (LTL 57) lost its network at the same time. But I don't know why we get iSCSI-related errors here, as everything iSCSI-related should be on the VMkernel, not on the VM network.

Are there known bugs with iSCSI under the Nexus 1000V?

How can we resolve this problem without creating a standard vSwitch?

We really need to see a show run. I see 5 vmnics and 2 uplink port-profiles. My guess is that the N1KV cannot figure out which uplink that VM is supposed to be on, hence it's not assigned at all. Can you post the running-config?

louis

Here is the tech support (svs).

5 NICs, 2 uplink profiles:

1 for COS, VMkernel (iSCSI), control, and packet.

1 for all VM VLANs.

Nicolas,

From a config standpoint it looks OK. Is there a reason you are using sub-group cdp for the system uplink and sub-group manual for the data uplink? Also, which port-profile are they using for iSCSI?

louis

We used manual sub-groups for everything related to the VM networks because we had some problems with CDP detecting link failures (it took too long, even with the CDP timer set to 5 s).

The iSCSI traffic is handled by the VMkernel, so it is on VLAN 11 (data_11), linked to the uplink port-profile system-uplink.

So this is why I'm asking why there are iSCSI-related errors on LTL 57, which is a virtual machine in data_19, linked to the uplink port-profile data-uplink. (data_11 is allowed on the system-uplink trunk and not on the data-uplink trunk.)
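
(For reference, the VLANs actually programmed on each uplink can be double-checked on the VEM side with something along these lines; a sketch, as exact output columns vary by VEM release:)

~ # vemcmd show port vlans
~ # vemcmd show trunk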

What seems to resolve the problem is reloading the primary VSM. The same problem is happening to another customer. The VSM appliance is always located on vSphere servers that have the VEM module installed. (I know that in some cases connection failures can happen between the VEM and the VSM, for example when upgrading a VEM or a VSM that is located on a vSphere host that also has the VEM installed.)
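
(For reference, on a redundant VSM pair a gentler equivalent of reloading the primary is to confirm the standby is healthy and force a switchover; a sketch, assuming an HA VSM pair is configured:)

n1kv# show system redundancy status
n1kv# system switchover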

Are there any known bugs with this kind of configuration?

I've just upgraded the VSM and VEM to version 1.3b, but still the same problem: some virtual machines don't have network connectivity. After a VSM reboot, it's OK.

Hi Nicolas,

I was going to suggest that you open a TAC case so that an engineer could troubleshoot the issue live with you. However, I see that you have already done so. I suggest that you work with your TAC engineer, and then we can update this forum with how to resolve this problem - for everyone else's reference.

Thanks,

Michael