
Fine-tuned Models for Faster Network Fault Isolation

Summary

AI assistants for network analysis tasks must analyze large, complex streams of time-series data accurately, quickly, securely, and cost-effectively. Fine-tuning smaller open models (instead of relying on large proprietary cloud-based LLMs) is a promising way to achieve these objectives, but we found that doing so required overcoming some challenging, leading-edge technical obstacles.


Complexity and Domain Specificity of the Cisco AI Assistant’s Views Explainability Feature

ThousandEyes Views Explanations (powered by Cisco AI Assistant) transforms complex multi-variate network telemetry time-series data into anomaly detections, plain-language fault domain isolations, and actionable insights. The following screenshot shows an example of the assistant in action.

Figure 1: The Cisco AI Assistant generating a fault domain assessment in Views Explanations

Motivations for Fine-Tuning

Cisco AI Assistant already performs the Views Explainability task well using readily available cloud-based frontier LLMs. However, there are many reasons to seek a smaller fine-tuned model: improving speed (both time to first token and output throughput in tokens/sec), meeting strict enterprise data privacy and security requirements (self-owned instances rather than multi-tenant public LLMs), lowering the cost of high-volume requests (lower memory and compute requirements), and, most importantly, achieving even better accuracy by showing the model better example responses directly, rather than relying on the indirect and fragile system prompt engineering that is the only tuning option for a closed proprietary LLM.

Challenges

Despite these common motivations, network domain-specific fine-tuning presents unique challenges not found in typical text-based tasks: it requires network domain expertise combined with the ability to interpret complex multivariate time-series data. Additionally, as in many fine-tuning efforts, there were numerous pipeline, infrastructure, and tooling issues we had to solve, which are outside the scope of this post.

First, a very large context length (e.g., 64k tokens) is common in our task, whereas more moderate lengths (e.g., 8k) are typical for most fine-tuning applications. Although many models can support large contexts during inference, it is much more difficult to do so during training, especially with the relatively limited GPU resources usually allotted to fine-tuning compared to pretraining. Indeed, many cloud vendors offering fine-tuning services do not even support such large context lengths.

Our unusually large context lengths arise from both large system prompts (required to instruct even large, capable cloud-based LLMs to handle our domain-specific complexity) and even larger user input data prompts: extensive telemetry signals gathered from many sources across millions of devices spanning the internet and customer networks, along with the historical statistics needed to detect the anomalies that inform fault domain analysis.

With such large contexts, even using LoRA or QLoRA instead of full fine-tuning is not enough, because the challenge is fitting not only the model weights in GPU memory but also the unusually large activation tensors. We found that cutting-edge techniques from the open-source Unsloth project best overcome this challenge, allowing us to reach context lengths of 64k and beyond when fine-tuning Llama 70B models. However, due to its extensive use of monkey patching and other low-level optimizations, Unsloth is currently limited to single-GPU use cases, which required us to carefully engineer and experiment with Unsloth and other configurations to enable reliable multi-GPU training.
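To make the setup concrete, the following is a minimal sketch of a long-context QLoRA configuration using Unsloth's public API; the model name, rank, and sequence length are illustrative values rather than our exact production settings.

```python
# Illustrative sketch: long-context QLoRA setup with Unsloth (single GPU).
# Exact arguments depend on the Unsloth/TRL versions; values here are examples.
from unsloth import FastLanguageModel

max_seq_length = 65536  # ~64k-token sequences (system prompt + telemetry + response)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.3-70B-Instruct",
    max_seq_length=max_seq_length,
    load_in_4bit=True,  # QLoRA: 4-bit quantized base weights
)

model = FastLanguageModel.get_peft_model(
    model,
    r=64,               # LoRA rank (example value)
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # checkpoint/offload activations so long contexts fit
)
```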

Second, optimal loss weighting for our task should not be uniform across all tokens. For example, accuracy in generating correct numbers is critically important and represents one of the largest potential sources of hallucinations, yet most tokens during training are not number-related. Standard uniform loss over all response tokens therefore undervalues the prediction of correct number-related tokens. Additionally, while our system (expert instructions) and user (data) prompts are large, the target response (view explanation) itself is typically small (e.g., 500 tokens or less)—often representing less than 1% of the total sequence length. We therefore revisited the loss calculation pipeline and modified the loss computation to include token-level weighting and masking strategies that better align with our problem's specific characteristics. These changes led to substantial improvement in training performance, with approximately 20% greater reduction in loss over the course of training.
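As a rough illustration of this idea (not our production implementation), the sketch below computes a per-token cross-entropy loss that masks prompt tokens and upweights response tokens containing digits; the digit heuristic and the weighting factor are simplified assumptions.

```python
# Illustrative sketch of token-weighted causal LM loss for a single sequence.
# Prompt tokens are masked out; response tokens that decode to digits get extra weight.
import re
import torch
import torch.nn.functional as F

def weighted_lm_loss(logits, labels, tokenizer, prompt_len, number_weight=5.0):
    """logits: [seq, vocab]; labels: [seq]; prompt_len: number of prompt tokens to mask."""
    # Shift so each position predicts the next token.
    shift_logits = logits[:-1, :]
    shift_labels = labels[1:]

    per_token = F.cross_entropy(shift_logits, shift_labels, reduction="none")

    weights = torch.ones_like(per_token)
    weights[: max(prompt_len - 1, 0)] = 0.0  # ignore system/user prompt tokens entirely

    # Upweight response tokens whose decoded text contains a digit (simplistic heuristic).
    for i, tok_id in enumerate(shift_labels.tolist()):
        if weights[i] > 0 and re.search(r"\d", tokenizer.decode([tok_id])):
            weights[i] = number_weight

    return (per_token * weights).sum() / weights.sum().clamp(min=1.0)
```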

Third, finding a good base model to fine-tune is challenging. Smaller is preferable for speed and cost, but larger may be necessary for accuracy, and extensively fine-tuning many candidate models would be expensive and time-consuming. We have found that various heuristics help reduce this search overhead, such as eliminating a smaller candidate base model that performs poorly on cheap basic requests (e.g., extracting the value of a given telemetry signal from the user input), presumably because it gets overwhelmed by the large system and data prompts. A sweet spot seems to be base models that already demonstrate such basic competencies (e.g., parsing the JSON input data), so that fine-tuning can data-efficiently focus on the complex, domain-specific aspects of our task.
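As an example of such a cheap screening probe, the hypothetical helper below asks a candidate model to read back a single signal value from a realistic-sized telemetry payload; the prompt wording and field names are purely illustrative.

```python
# Illustrative screening probe (hypothetical prompt wording and field names):
# ask a candidate base model to read back one value from a large JSON telemetry blob.
import json

def make_probe(telemetry: dict, signal: str, expected: float):
    prompt = (
        "You are given network telemetry as JSON.\n"
        f"{json.dumps(telemetry)}\n"
        f"Return only the latest value of '{signal}'."
    )

    def passed(model_output: str) -> bool:
        try:
            return abs(float(model_output.strip()) - expected) < 1e-6
        except ValueError:
            return False

    return prompt, passed

# A candidate that fails probes like this at realistic payload sizes is dropped
# before any fine-tuning budget is spent on it.
```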

Illustrative Results

The table below summarizes an example of the accuracies and speeds we achieved, averaged across hundreds of test examples. An LLM-as-a-judge scored each test response on two dimensions: "relevance" (how focused it was on the user question) and "faithfulness" (how accurate it was with respect to the telemetry data). Lower relevance scores often arise from responses that repeat or leak portions of the system prompt instead of answering the question, a relatively common issue with weaker LLMs, especially at the large context lengths (i.e., large system and user data prompts) typical of our Views Explainability task. Lower faithfulness scores often arise from hallucinations or poor reasoning. Although the 70B base model initially performed worse than the large (~trillion-parameter) frontier LLMs used for comparison, its fine-tuned version matched or exceeded them on these metrics. Additionally, the 70B local models were orders of magnitude faster than the frontier LLMs. We routinely use human domain experts to validate the LLM-as-a-judge evaluations.

| Model                          | Relevance | Faithfulness | Completion Time |
|--------------------------------|-----------|--------------|-----------------|
| Large frontier cloud-based LLM | 0.95      | 0.87         | ~30 s           |
| 70B fine-tuned local LLM       | 0.95      | 0.89         | ~1 s            |
| 70B base local LLM             | 0.79      | 0.85         | ~1 s            |

For the above example, QLoRA fine-tuning (4-bit quantized base weights with LoRA rank 64) of the Llama 3.3-70B-Instruct base model completed in approximately 1.5 hours on a single 8 × H100 GPU node using 8-way data parallelism, with a single epoch over thousands of training pairs (i.e., user prompts and corresponding target assistant responses generated by various LLMs and by human subject-matter experts). On the same node, generation with the efficient vLLM inference engine, using continuous batching and 8-way tensor parallelism, averaged approximately 1 second per outage sample across the 361 test examples for both the base and fine-tuned (LoRA adapter) 70B models.
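For reference, serving the base model with a LoRA adapter under tensor parallelism looks roughly like the sketch below; the model path, adapter name, prompt contents, and sampling settings are placeholders rather than our exact deployment configuration.

```python
# Rough sketch of batched inference with vLLM, tensor parallelism, and a LoRA adapter.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=8,   # 8-way tensor parallelism on one H100 node
    enable_lora=True,
    max_lora_rank=64,
    max_model_len=65536,
)

sampling = SamplingParams(temperature=0.0, max_tokens=512)
lora = LoRARequest("views-explainer", 1, "/path/to/lora_adapter")  # placeholder path

# Each prompt is a fully formatted system + telemetry prompt for one outage sample.
prompts = ["<system instructions>\n<telemetry JSON>\n<user question>"]

outputs = llm.generate(prompts, sampling, lora_request=lora)
```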


Our work illustrates the ability of properly fine-tuned local LLMs to achieve all four key criteria—accuracy, data privacy, latency, and cost-effectiveness—without compromise, on complex domain-specific expert tasks.
