Troubleshooting Real World Zscaler Issues

By Nitin Nayar


In our first post on monitoring Zscaler secure web gateways (SWGs), we looked at the overall architecture of the Zscaler service and its monitoring implications for IT and network administrators. We also explored how to use ThousandEyes to solve a Salesforce performance issue with a monitoring approach that combines end-to-end HTTP testing through the Zscaler GRE tunnel with network-layer testing to the Zscaler GRE tunnel endpoint and ZEN proxy server. In this post, we’ll walk through troubleshooting a real-world issue that we encountered with a ThousandEyes customer during their readiness assessment for deploying a cloud-based CRM system.

Understanding SaaS User Experience through Zscaler

Deploying a SWG like Zscaler is like extending your branch office network boundaries into the cloud to a data center (Zscaler ZEN) where your traffic is inspected for security threats. For the most part, this works amazingly well when you consider the physics involved, but there are times when issues arise. So, as with any new technology roll-out, it’s a good idea to plan for a readiness phase where you can see if there are any hiccups before you cut all your employee traffic over.

In this case, the customer wanted to understand the impact of a cloud-based inline secure proxy on Veeva, a CRM system for life sciences and pharmaceutical companies. We set up two Enterprise Agents to monitor and compare user experience and network performance between their pre-Zscaler and Zscaler-based architectures over a six-week period. Based on the Page Load, HTTP Server, and Network layer tests provisioned on the two Enterprise Agents, we built a series of monitoring reports to provide insights.

The first report that the customer looked at was a User Experience report that compares Page Load test results collected with and without Zscaler over a continuous 24-hour window. As we can see in Figure 1 below, inserting Zscaler into the data path significantly degraded the Veeva user experience: Page Load times increased by 97%, from 2.5 seconds to almost 5 seconds.

Figure 1: A comparative look at page load times with and without Zscaler.

The question that arises is why this is occurring. To find the answer, we looked further down the stack at reports focused on HTTP Server and Network layer performance metrics. As we see below in Figures 2 and 3, there is no noticeable difference in either HTTP or Network layer performance metrics when comparing the Zscaler and Direct Internet Access test results.

Figure 2: Comparing HTTP response time with and without Zscaler.
Figure 3: Comparing network latency with and without Zscaler.
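
The layer-by-layer comparison behind Figures 1 through 3 can be sketched in a few lines of Python. This is a minimal illustration rather than ThousandEyes code. The metric names, the HTTP and network sample values, and the 20% threshold are all assumptions made for the example, and the proxied page load value is simply the one that yields the 97% increase quoted above.

```python
from dataclasses import dataclass


@dataclass
class LayerMetric:
    name: str         # hypothetical label for the test layer
    direct: float     # mean value measured over Direct Internet Access
    via_proxy: float  # mean value measured through the secure web gateway


def percent_change(baseline: float, observed: float) -> float:
    """Relative change of observed versus baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (observed - baseline) / baseline * 100.0


def degraded_layers(metrics, threshold_pct=20.0):
    """Return names of layers whose metric grew by more than threshold_pct."""
    return [
        m.name
        for m in metrics
        if percent_change(m.direct, m.via_proxy) > threshold_pct
    ]


# Illustrative values only. The HTTP and network numbers are invented to
# mirror "no noticeable difference" at those layers.
metrics = [
    LayerMetric("page_load_s", 2.5, 4.925),
    LayerMetric("http_response_ms", 310.0, 318.0),
    LayerMetric("network_latency_ms", 42.0, 43.0),
]

for m in metrics:
    print(f"{m.name}: {percent_change(m.direct, m.via_proxy):+.1f}%")
print("degraded:", degraded_layers(metrics))
```

When only the page load layer crosses the threshold, as in this sample data, the problem most likely sits in how individual page objects are fetched rather than in the network path or the initial HTTP response.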

Troubleshooting Inferences and Further Analysis

We drew two main troubleshooting inferences from the above set of test results. The first, on a positive note, was that network connectivity through Zscaler was generally stable: when we observed spikes in HTTP Response Times, they were traceable to corresponding spikes in network latency. The second was that the overall degradation in user experience was likely due to specific objects on the Veeva page that were taking longer to download because of the Zscaler security analysis.

To further validate our inferences, we compared detailed Page Load waterfall diagrams for both the Zscaler and Direct Internet Access paths. In Figure 4 below, we can see that a single JavaScript file adds 2 seconds of download time to the Zscaler Page Load time. Note that this JS file is hosted by Fastly, a CDN provider used by Veeva, which highlights the complex matrix of service delivery paths that SaaS providers rely on to deliver an optimal user experience.

Figure 4: The Page Load waterfall shows that JavaScript file downloads are taking significantly longer through Zscaler.
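
A waterfall comparison of this kind boils down to a per-object diff. The sketch below assumes each waterfall has already been exported as a mapping from object URL to download time in milliseconds. The function name, URLs, and timings are hypothetical and are not taken from the customer's data.

```python
def slowest_deltas(direct, via_proxy, top_n=3):
    """Rank objects by extra download time (ms) on the proxied path.

    Both arguments map object URL -> download time in milliseconds.
    Objects missing from either waterfall are skipped.
    """
    common = direct.keys() & via_proxy.keys()
    deltas = [(url, via_proxy[url] - direct[url]) for url in common]
    deltas.sort(key=lambda item: item[1], reverse=True)
    return deltas[:top_n]


# Hypothetical waterfall data; URLs and timings are invented for illustration.
direct = {
    "https://crm.example.com/index.html": 180.0,
    "https://cdn.example.net/app.js": 240.0,
    "https://crm.example.com/style.css": 95.0,
}
via_proxy = {
    "https://crm.example.com/index.html": 195.0,
    "https://cdn.example.net/app.js": 2260.0,
    "https://crm.example.com/style.css": 101.0,
}

for url, delta_ms in slowest_deltas(direct, via_proxy):
    print(f"{delta_ms:8.0f} ms  {url}")
```

Sorting by the difference rather than by absolute download time keeps the focus on what the proxy adds, so an object that is slow on both paths does not crowd out the one that only slows down through the gateway.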

By identifying the precise objects on the page that were causing the issue, the customer was able to employ real data and collaborate with both Zscaler and Veeva to resolve the issue. Overall, the availability of detailed analyses gave the project team the ability to fix issues in their readiness phase on a per site and per Zscaler ZEN basis. Embracing a lifecycle approach and adopting monitoring early in the readiness phase enabled the operations team to establish baselines, set proactive alerts, establish new troubleshooting processes, and collaborate with their cloud vendor ecosystem.
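
As one way to picture the baseline-and-alert step, here is a minimal sketch that derives a threshold from historical page load samples using the median plus a multiple of the median absolute deviation (MAD). The choice of MAD, the multiplier of 3, and the sample values are assumptions made for illustration, not a description of how the customer or ThousandEyes configures alerts.

```python
import statistics


def baseline_threshold(samples, k=3.0):
    """Return (median, alert_threshold) with threshold = median + k * MAD."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    median = statistics.median(samples)
    mad = statistics.median([abs(x - median) for x in samples])
    return median, median + k * mad


def breaches(samples, new_values, k=3.0):
    """Return the new values that exceed the baseline-derived threshold."""
    _, threshold = baseline_threshold(samples, k)
    return [v for v in new_values if v > threshold]


# Hypothetical page load times in seconds from the readiness phase.
history = [2.4, 2.5, 2.6, 2.5, 2.4, 2.7, 2.5]

median, threshold = baseline_threshold(history)
print(f"baseline {median:.2f} s, alert above {threshold:.2f} s")
print("breaches:", breaches(history, [2.6, 4.9]))
```

A median-based baseline is less sensitive to a handful of outliers than a mean and standard deviation would be. Note that if the history is perfectly flat the MAD is zero, so any value above the median would trigger an alert, and a minimum margin may be worth adding in practice.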

Key Takeaways

When rolling out Zscaler or another cloud-based secure web gateway, it is important to remember that you’re dealing with multiple internal and external dependencies, including DNS, ISPs, SWG and SaaS providers, and your own network. Any of those could be the source of an issue, and the last thing you want to do is blame the wrong party. Measuring and benchmarking at multiple layers gives you the insight you need to isolate the problem domain and the data you need to get the right party to help you resolve it. If you’d like to learn more, download our white paper on monitoring cloud-based secure web gateways. If you’re ready to get started, request a demo.
