In this update from ThousandEyes Connect Santa Clara, we’ll summarize the presentation by Viet Nguyen, Senior Manager of Problem and Change Management and formerly Senior Manager of Network Engineering/Security/Operations at PayPal.
During his talk, Viet described how his team keeps their many service providers accountable by collecting and sharing objective data. He also discussed how he and his team operationalize the use of ThousandEyes within PayPal.
Trust But Always Verify
PayPal operates at massive scale: in Q1 of 2017 it reported 203 million active customer accounts, $3 billion in revenue and 1.7 billion payment transactions. While PayPal provides online payment services itself, it also relies on many external services, from SaaS providers to CDN and DDoS mitigation vendors, many of which are hosted offsite. Viet’s philosophy is to trust your service providers, but always verify the facts of your providers’ performance on your own.
Viet mentioned a number of monitoring use cases and service providers, including:
- DDoS mitigation: Monitor saturation attacks and mitigation performance
- CDN: Monitor content caching and acceleration
- SaaS: Monitor connectivity and performance to cloud-hosted applications
- Internal Services: Monitor internal apps, extranets and get an inside-out view from your employees’ devices to the applications they’re using
In his presentation, Viet discussed specific examples of detecting issues and keeping their DDoS mitigation and CDN providers accountable.
When Your DDoS Mitigation Provider Can’t Handle an Attack
Viet recounted an incident in which PayPal experienced a very large DDoS attack. They immediately turned to their DDoS mitigation provider, but found that the provider was experiencing saturation issues of its own. As a result, availability was impacted, though the attack was short-lived.
On seeing this, Viet’s team dug into the Path Visualization. Viet said, “Normally, when you experience these issues, you get called five minutes after the event’s over and you have to try to figure out what happened. The fact that ThousandEyes is saving all the traces and metrics for you is key. Going through the data and seeing what occurred is the magic of what ThousandEyes does.”
Using the ThousandEyes Path Visualization, Viet discovered that both upstream ISPs of their DDoS mitigation provider, Tata and Telia, were overloaded and experiencing high loss. Viet found that collateral damage can affect multiple customers of the same DDoS mitigation provider. For example, if another customer of the same provider is experiencing an attack, you may also be impacted, even if you yourself are not being attacked.
After uncovering these insights, Viet shared the data with their DDoS mitigation provider, along with the suggestion to diversify their set of Internet service providers. A year later, the DDoS mitigation provider had diversified to four different ISPs. Viet noted the importance of keeping providers accountable and asking questions like, “How are you monitoring our experience?” and to CDN providers, “How do you monitor how our customers get to you?”
Discovering Obscure Issues in PayPal’s CDN Provider
Viet also described an interesting issue that his team uncovered in their CDN provider. Availability measured to their CDN edge became jagged and far less consistent than before, a change that the team’s finely tuned alerts were able to catch.
In contrast, tests to the PayPal origin were rock solid and saw no degradation in performance. It was clear that “the problem was with our CDN provider, but it was still difficult to convince them that they were having issues, since nothing was wrong with their standard metrics.” But once the PayPal team shared their ThousandEyes data with the CDN provider, the provider understood the problem.
The ThousandEyes Customer Success team also worked with Viet’s team and ran a tcpdump on a Cloud Agent. They determined that the SSL handshake started and then paused for around 2 minutes, causing performance impacts and connection timeouts. It was something no one had seen before: the root cause was a config overload on their CDN provider’s edge server. Edge servers are shared among customers, and one customer was periodically pushing a new config that overloaded the server but didn’t degrade performance enough for the CDN provider to notice. Once their CDN provider fixed the issue the next morning, the PayPal team used ThousandEyes to immediately check that everything was back to normal.
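To illustrate the kind of check that surfaces a paused handshake, here is a minimal sketch that scans packet arrival times for long silences. It assumes the timestamps (in seconds) for one connection have already been extracted from a capture; the function name `find_stalls` and the 5-second default are hypothetical choices for illustration, not anything described in the talk.

```python
from typing import List, Sequence, Tuple


def find_stalls(
    timestamps: Sequence[float], min_gap_seconds: float = 5.0
) -> List[Tuple[int, float]]:
    """Return (packet_index, gap_seconds) for every packet that arrived
    at least min_gap_seconds after the packet before it.

    timestamps: arrival times in seconds for one connection, in capture
    order (assumed to be extracted from the capture beforehand).
    """
    stalls = []
    for i in range(1, len(timestamps)):
        gap = timestamps[i] - timestamps[i - 1]
        if gap >= min_gap_seconds:
            stalls.append((i, gap))
    return stalls


if __name__ == "__main__":
    # Hypothetical handshake: three quick packets, then a long silence.
    packet_times = [0.0, 0.5, 1.0, 121.0, 121.5]
    for index, gap in find_stalls(packet_times):
        print(f"packet {index} arrived after a {gap:.0f}s pause")
```

A pause of roughly two minutes in the middle of a handshake stands out immediately in output like this, even when the provider's aggregate metrics look healthy.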
Dashboards and Reports
Viet also described how his team uses dashboards to quickly compare the performance of multiple tests in a single pane. Viet noted, “For CDN services, we run availability graphs so we can tell if issues are affecting just us or all similar CDN provider’s customers. We compare with some of the ThousandEyes shared tests to other CDN customers, like Slack, to see if they were affected as well.”
Viet’s team also set up dashboards to compare performance metrics like availability and latency to both their origin and their CDN provider. As Viet explained, “by lining these up, you can tell if there’s an issue with either the CDN or origin. If our CDN provider dips at the same time as our origin, the issue is on our side. If not, it’s fairly clearly on the CDN side.”
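The rule of thumb in that quote can be written down as a small sketch. It assumes two availability series (in percent) sampled over the same time intervals, one for tests to the CDN edge and one for tests to the origin; the function name `localize_dips` and the 99% threshold are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def localize_dips(
    cdn: Sequence[float], origin: Sequence[float], threshold: float = 99.0
) -> List[str]:
    """Label each interval 'ok', 'cdn' or 'origin' by where a dip appears.

    cdn / origin: availability percentages for the same intervals.
    threshold: availability below this counts as a dip (an assumption).
    """
    if len(cdn) != len(origin):
        raise ValueError("both series must cover the same intervals")
    labels = []
    for cdn_pct, origin_pct in zip(cdn, origin):
        if origin_pct < threshold:
            # The origin dipped (with or without the CDN): look at our side.
            labels.append("origin")
        elif cdn_pct < threshold:
            # Only the CDN dipped while the origin stayed healthy.
            labels.append("cdn")
        else:
            labels.append("ok")
    return labels


if __name__ == "__main__":
    cdn_availability = [100, 92, 90, 100]
    origin_availability = [100, 100, 91, 100]
    print(localize_dips(cdn_availability, origin_availability))
```

In practice the dashboard does this comparison visually; the sketch only makes the decision rule explicit.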
In addition to real-time dashboards, Viet also uses reports for historical trend analysis. During a bake-off of CDN providers, Viet’s team ran tests from Cloud Agents to different CDN providers and synthesized the results in a report to see which had the best performance.
Viet also mentioned that “reports can help track provider SLAs; these reports have good facts that can push your agenda. They’re also eye candy for execs.”
If You Build It, They’ll Come
Viet then addressed a common problem that many organizations face: how do you operationalize a third-party service like ThousandEyes in your NOC?
One thing that helped PayPal was creating documentation wikis written in the team’s own internal language. As Viet said, “I was able to tell my NOC folks, if you run into a problem or something unusual, do a search in our internal wiki. There may be a write-up that shows how we researched that issue, along with share links and screenshots.”
Viet also noted the importance of fine-tuning alerts and collaborating with internal subject-matter experts on what the target performance and thresholds should be: “ThousandEyes can monitor a lot more than what your main expertise is. Bring in your load balancer folks and ask what the performance should be in a particular location; get the networking team’s input on BGP, or even hand over the keys and have them set up alerts sent to their console.” Otherwise, if there are too many alerts, no one will use the service. In Viet’s words, “If you spend the time to tune it, train people and document issues, you’ll see lots of reward.”
Once alerts were tuned, Viet’s team also implemented API-driven events. They poll the ThousandEyes API every minute, looking for any alerts on their tests, and those alerts then pop up on the consoles of PayPal’s NOC engineers (webhooks are another way to do this). Their alerting system has also become more sophisticated over time: “We can also pull the data down and make our own correlations across the data. When certain alerts go off, you can click on a link to an internal standard operating procedure (SOP) that has everything down to how to open tickets.”
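A polling loop of this kind might look roughly like the sketch below. It is not PayPal’s implementation and deliberately does not call the real ThousandEyes API: the `fetch_active_alerts` and `notify` callables, the `AlertPoller` class and the `"id"` field are all hypothetical stand-ins, so that the de-duplication logic can be shown on its own.

```python
import time
from typing import Callable, Dict, Iterable, List, Set

Alert = Dict[str, object]  # hypothetical shape: at least an "id" key


class AlertPoller:
    """Polls for active alerts and notifies once per newly raised alert."""

    def __init__(
        self,
        fetch_active_alerts: Callable[[], Iterable[Alert]],
        notify: Callable[[Alert], None],
    ) -> None:
        self._fetch = fetch_active_alerts  # stand-in for the API call
        self._notify = notify              # stand-in for the NOC console
        self._seen: Set[object] = set()

    def poll_once(self) -> List[Alert]:
        """Fetch active alerts and notify about ones not seen last poll."""
        active = list(self._fetch())
        new_alerts = [a for a in active if a["id"] not in self._seen]
        for alert in new_alerts:
            self._notify(alert)
        # Keep only currently active IDs, so an alert that clears and
        # later fires again triggers a fresh notification.
        self._seen = {a["id"] for a in active}
        return new_alerts

    def run(self, interval_seconds: float = 60.0) -> None:
        """Poll forever, once per interval (every minute by default)."""
        while True:
            self.poll_once()
            time.sleep(interval_seconds)
```

Injecting the fetch and notify functions keeps the loop testable without network access, and the `notify` callable is a natural place to attach a link to the relevant SOP.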
Viet has also made it a point to spread the word about ThousandEyes internally: “Once people start using ThousandEyes, they realize how useful it is.”
For more from ThousandEyes Connect Santa Clara, check out the post summarizing the talk from Intuit.