
NSO in Docker/K8s Experiences

crench92
Spotlight

Just wondering from the community what your experience has been using NSO as a container or in a Kubernetes environment. Is anyone using it in production like this? Any suggestions or tuning when it comes to deployment or performance? Any suggestions on testing to run within a CI/CD pipeline?

 

I'm starting to get used to deploying and using NSO as a container and pod, but I worry that in a production environment there may be some hidden quirks, as we're still working in a development/research capacity.

6 Replies

u.avsec
Spotlight

Hey.

We do use a Docker host for our testing pipelines. When it comes to tearing down and spinning stuff up quickly, I find containers more convenient. Containers also let devs easily play around with LSA designs and different NSO versions.
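To give a sense of how quick that is, here is a rough sketch of a throwaway instance, assuming you already have an NSO dev image built (the image name and tag are placeholders for whatever your own build produces):

```bash
#!/usr/bin/env bash
# Spin up a disposable NSO instance, poke at it, throw it away.
# IMAGE is an assumption -- substitute whatever your own build/registry produces.
IMAGE=registry.example.com/nso-dev:6.1

docker run -d --name nso-scratch "$IMAGE"

# Interactive CLI session inside the container (Cisco-style CLI as user admin).
docker exec -it nso-scratch ncs_cli -C -u admin

# Tear it down; nothing is left behind on the host.
docker rm -f nso-scratch
```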

 

I would dare to say that for regular operations, running prod in containers or not doesn't make that much of a difference. It depends on the company infrastructure, I guess. We are in the cloud, so spinning up VMs is pretty much as convenient as spinning up containers.

 

A tip for automated testing with NSO: don't use NETSIM. The reason is that ConfD in NETSIM parses the config and takes a really long time for little benefit if you are just testing config validity at a "does it execute or break" level. Southbound-locked devices are the way to go; they can speed up pipeline execution by 60%+.
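For reference, a minimal sketch of adding such a device; the device name, authgroup, and ned-id are made up for illustration and depend on what your NSO image actually ships:

```bash
#!/usr/bin/env bash
# Create a device entry that NSO validates config against but never connects to.
# Everything device-specific below (name, authgroup, ned-id) is an assumption.
ncs_cli -C -u admin <<'EOF'
config
devices device test-router-0
 address 127.0.0.1
 authgroup default
 device-type cli ned-id cisco-ios-cli-6.77
 state admin-state southbound-locked
commit
EOF
```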

JamesHarr74967
Level 1

We're running NSO in Docker using the NSO-in-Docker project (created and largely managed by Kristian Larsson) and it's been a worthwhile endeavor. There are a few things you'll probably want in place to adopt it:

1. The project is based in large part around GitLab and GitLab-CI. GitHub Actions support is in the process of being ironed out, but I'm not sure it's ready for prime time.

2. You will need a place to host container images... Lots of container images.

  - Self-hosted GitLab comes with one that works pretty well out of the box; it can be backed by S3 if you don't want to worry about filling up the disk of your GitLab server. We started to run into stability issues and didn't really want to spend time figuring out what was going wrong. For that and a number of unrelated reasons, we migrated to a separate repository (SaaS).

  - You can find a number of SaaS container registries, including Amazon ECR, Google CR, Azure CR, and Artifactory; several offer security scanning and other tooling.

  - We have not tried the GitLab SaaS container registry.

  - With a separate container registry come some other things you need to solve, e.g. giving your project/runners access to the registry and giving all your developers access to it (there's a small sketch of the runner login after this list).

3. You're going to want some fairly beefy Gitlab-CI runners. The faster your runners, the faster your CI pipeline can execute.

4. You have to buy into the project structure and not fight it too much. I tried to bend it to my will at first (mono-repo) and re-learned a lot of lessons that Kristian had already learned and built into nso-docker.
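On the registry-access point under #2, the gist is usually just a docker login in the CI job before any pull or push; the variables below are GitLab-CI's predefined ones, and the image path is a placeholder:

```bash
#!/usr/bin/env bash
# Log the runner's docker daemon into the project's registry, then pull an image.
# CI_REGISTRY, CI_REGISTRY_USER and CI_REGISTRY_PASSWORD are predefined by GitLab-CI;
# the image path is a placeholder for whatever your NED/system pipelines publish.
set -eu

echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"

docker pull "$CI_REGISTRY/my-group/ned-cisco-ios/package:6.1"
```

Developer access is the same login, just with personal access tokens instead of the CI variables.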

 

Generally the project divides into a number of different repos:

1. A main `nso-docker` project that builds a Debian container with NSO installed. You can have multiple versions of NSO.

2. Each NED gets its own repo and builds independently (against each version of NSO you want to support)

3. You might have vendor-provided packages like `resource-manager` that each go into their own repository (again, built against each version of NSO you want to support)
4. You may have some custom NSO packages you wrote that you want to put into their own repo.
5. You will have one or more system repos that a) pull in the NEDs and packages you want from the above bullets and b) may contain packages of their own. Oh, and all of this against multiple versions of NSO.

 

If you're like us, you have one production install of NSO. You'll skip #4 and put all your customizations in the #5 repo. You still need the NED repos, and it's probably a good idea to have a repo for projects like resource-manager.

 

If you're a bigger network with multiple installs of NSO to handle scale, different parts of the network (e.g. packet vs optical vs CPE), or different business units with different needs, you might want to start putting your custom packages in #4 so you can create multiple system builds (#5), each with its own NEDs and packages as required, especially if some packages are re-used across system builds. We have no need for something like this, so we haven't gone down that route.

 

You may choose to put some custom packages in #4 if they take a long time to build and/or test and evolve at a slower cadence. We could probably speed up our build by migrating some packages that don't change much into their own repos and just letting them build on their own.

 

Overall, NSO-Docker has been a huge benefit to us. It means that our builds are consistent, migrating NSO versions or NED versions is pretty straightforward, tests are consistently executed, etc. Our production NSO server is pretty minimalistic: docker/moby plus a set of bash scripts to manage the production NSO instance (take backups, stop, relaunch to upgrade, set up high availability).
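As a rough sketch of the shape of those scripts (the container name, volume path, and image tag are placeholders, not our actual setup):

```bash
#!/usr/bin/env bash
# Back up the running production NSO, then relaunch it on a newer image.
# All names, paths and tags below are assumptions about the local setup.
set -eu

CONTAINER=nso-prod
NEW_IMAGE=registry.example.com/nso-system:6.1.2

# ncs-backup produces a backup archive of CDB and packages; keep the NSO run
# directory on a host volume so the backup and state survive the container.
docker exec "$CONTAINER" ncs-backup

docker stop "$CONTAINER"
docker rm "$CONTAINER"

docker run -d --name "$CONTAINER" \
  --restart unless-stopped \
  -v /data/nso:/nso \
  "$NEW_IMAGE"
```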

 

All in all, you're going to have some overhead to manage the components the way NSO-Docker wants to manage them, but it's been WELL worth it for us. I wholeheartedly recommend taking the leap, but know what you're in for in terms of setting up the supporting infrastructure (GitLab, GitLab runners, a container registry with lots of images).

 

We have not adopted NSO in Kubernetes, but it's something we're revisiting periodically because we want to run it there eventually.

 

Oh, and 100% agree on using southbound-locked devices in your CI tests. The I/O between NSO and the netsims is unnecessary when testing (most) packages because both work from the same YANG models, so if NSO accepts the config into the CDB, the netsim will accept it too. The one exception might be if a package uses live-status on the device.

crench92
Spotlight

Great input here. Really appreciate it. When we say southbound-locked, do we mean the device stays locked and is only used during tests? So the pipeline would unlock it only during the test and then lock it again?

What I have in mind is complete "ghost" devices. I add them like a normal device, leave them southbound-locked, and then add capabilities to them directly (I keep a list of capabilities per network platform so I can throw them into NSO when needed). No need for sync-from, or for the devices to exist as real or netsim devices in the first place.

This allows for scalability. Let's say there is a use case where a service has to be configured (for whatever reason) on at least 15 devices. I just run a loop in the test framework 15 times and get unique entities.

The downside is that NSO only works against the NED here. Sometimes that is not enough.

After the tests are done, the devices are deleted, or in my case the pipeline just scraps the whole container it was running the tests in.
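To make the scalability point concrete, here is a sketch of stamping out a batch of those ghost devices; the device names, authgroup, and ned-id are illustrative only:

```bash
#!/usr/bin/env bash
# Create 15 southbound-locked "ghost" devices to run a service test against.
# The authgroup and ned-id are assumptions -- use whatever your system build ships.
{
  echo "config"
  for i in $(seq 1 15); do
    echo "devices device ghost-$i address 127.0.0.1 authgroup default device-type cli ned-id cisco-ios-cli-6.77 state admin-state southbound-locked"
    echo "top"
  done
  echo "commit"
} | ncs_cli -C -u admin
```

No cleanup is needed if the whole test container gets scrapped afterwards.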

 

If your pipeline includes actual lab devices, then locking them would defeat the purpose, as in that case the tests also cover NSO-to-device interoperability; basically, the tests check whether your package code is OK and whether the NEDs are OK.

 

I guess it depends on what kind of testing you have in mind; you go for a different approach accordingly.

Ohhhh. Well, that's kinda brilliant. Whatever is being developed and tested only runs against the NED, so NSO doesn't communicate with any devices, netsim containers, etc.

 

I'd imagine that at some point in the pipeline, though, a different level of testing, where you test your service against real devices, needs to happen before confidently deploying to production.

Kristian Larsson did a talk on how DT/Terastream does this with NSO and vrnetlab. It's a pretty good talk and something to aspire to. We eventually want to get to a place where we can test our customer-facing route-policy, which is probably the most error-prone part of our services. Internally, we've dubbed tests that only push to netsims (or that use southbound-locked devices) management-plane tests.

 

https://www.youtube.com/watch?v=xAb1Dx7Dj0M

 

Another thing we've found is that having strong tests for RFSs (resource-facing services) means that CFS (customer-facing services) tests get a lot simpler.

 

E.g.: if you make sure that the RFS service that configures a sub-interface works in all the weird corner cases you can find (native encap, dot1q, QinQ, with/without l2transport) and errors out when it needs to, your higher-level L2VPN CFS service (which winds up using the sub-interface RFS service) can be more along the lines of a smoke test that makes sure a commit goes through without error. Same with your L3VPN service, your Internet-peer service, your backbone service, etc. You only need to test that RFS service thoroughly once.
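As a rough illustration of the smoke-test end of that spectrum (the service path and leaf names are entirely invented; the point is just "create one instance and make sure the commit goes through"):

```bash
#!/usr/bin/env bash
# CFS smoke test: configure one (hypothetical) L2VPN service instance against
# southbound-locked devices and fail the job if the commit is rejected.
set -eu

out=$(ncs_cli -C -u admin <<'EOF'
config
l2vpn smoke-test-1 endpoint ghost-1 interface GigabitEthernet0/0 vlan 100
top
l2vpn smoke-test-1 endpoint ghost-2 interface GigabitEthernet0/0 vlan 100
commit
EOF
)

# Heuristic: a rejected commit shows up as an "Aborted:" or "Error:" line.
if echo "$out" | grep -Eq 'Aborted|Error'; then
  echo "L2VPN CFS smoke test failed:"; echo "$out"; exit 1
fi
echo "L2VPN CFS smoke test passed"
```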

 

Occasionally we'll have a regression test for a tricky corner case for a CFS service, but this gets you 80% of the way there without a lot of infrastructure setup.
