In this post from ThousandEyes Connect New York, we’ll summarize the presentation by Nabil Ismail, VP of Operations & Technology at Hi-Rez Studios.
During his talk, Nabil described his team’s formidable task of ensuring that they continually provide a smooth, fast online experience to their 15+ million gamers across the globe. He also discussed how they use ThousandEyes Cloud Agents to understand user experience from vantage points located around the world: “Right now we’re running latency of about 60ms across all of our ThousandEyes agents, and we feel that’s a good place for all the games we have.”
Unique Challenges in Online Gaming
Hi-Rez Studios was founded in 2005, and after the company found success in 2014 with the game Smite, it rapidly grew from 60 to more than 300 employees. Nabil described their niche as “online free-to-play games that are all based on monetization. So, our end users are the people we care about—if they’re not having a good experience, they’re not staying in our ecosystem and they’re not paying us.”
With a highly passionate user base, “our philosophy is very community-driven. We have a great transparency between us and our community, and we’re dedicated to building a game they love.” Nabil went on to describe the different challenges that each of the games presents: “We have a lot of different genres of games, and each of these games means something different from a network perspective. Massively Multiplayer (MMO) games are not as taxing on the network because there are a lot of social interactions within the game, and you’re not constantly in combat. On the other hand, in First Person Shooter (FPS) games, you’re constantly fighting and shooting. If there are ten other people playing in the same map, the server has to know where all the other clients are to calculate the fighting outcomes. From a latency standpoint, if I lag in that minute, all I’m going to come back to after that lag is: I’m dead.”
The network infrastructure at Hi-Rez consists of three parts: “Our core data center is where things like authentication, queuing and social aspects of the games happen. Then we have game servers that are located all over the world, which solved the lag problem for our gamers all over the world. Finally, we have the content delivery network—we maintain our own clients and patching system, so we’re unique in that we’re an independent developer and publisher.”
Nabil continued: “We have the core network in Atlanta, and we utilize Internap. Back in 2008, we evaluated providers to see which was best at delivering a great network, with private network access point (PNAP) and mirror technology. That means that the provider measures every connection to our data centers and reroutes you to another provider that has better latency and a better connection. We have that technology in all of our data centers. We also have a lot of entry points for our end users to connect to us. Our data centers are all over, we’re in many different exchange points and we also have five different CDNs that help us deliver data.”
Because Hi-Rez has a very aggressive patching system that frequently pushes new content to players for more than 10 games, the team has to patch every day of the week. As a result, Hi-Rez has more than 5 petabytes of data transfer a month, running more than 600 servers around the globe.
The below diagram shows the infrastructure design of the Hi-Rez network at a high level. In Nabil’s words, “We’re very dependent on the Internet and ISPs—things we have no control over. This is the message today, where you have no control. When you’re going to AWS you have no idea what’s going to happen.”
Managing Constant DDoS Attacks and User Issues
The Hi-Rez main core network “is very protected. We get DDoS’ed about three times a day, more than 70 Gbps every time, so we’re completely behind Prolexic. We don’t dare to get off Prolexic because attackers have monitors to check for the second we stop broadcasting through Prolexic.”
“Around the time we found ThousandEyes, we were being brought to our knees and we couldn’t do anything about it. When DDoS attackers see Prolexic they usually walk away, but this time they found every IP scheme that Internap has assigned to them in the public record, and they attacked every octet and network until they brought us down. They figured out the weak points in Internap’s network and brought everything down, including their headquarters and mail servers. The only solution was to fix those weak points in Internap.”
“In order for us to satisfy the community, we had to compress the reaction timeline, from the time we find the problem to the time we tell the community what’s happening and ask for their forgiveness. We lost tens of thousands, if not hundreds of thousands of players because we couldn’t get it together.”
“From that point on, we upped our network to 250 Gbps for both ingest and outgoing, and we also implemented ThousandEyes to figure out exactly what’s happening right away. We use everything that ThousandEyes offers: we put an Enterprise Agent in every one of our data centers, and we’re monitoring back to one server in our core data center, which gives us full visibility if something like a BGP route change happens. I no longer have to check if competitors like Blizzard or League of Legends are also down when we’re trying to figure out if a Tier 1 provider is having a problem or if it’s a larger Internet issue.”
“We also utilized a lot of Cloud Agents, so we can understand what the end users are seeing. We saw great success with that—our support team can easily log in and see that CenturyLink is having an issue in D.C., for example. All we have to do is take a snapshot and send it to the end user, which automatically gives us credibility because we can tell them why they’re having issues.”
If users report problems and ThousandEyes confirms them, the Hi-Rez team simply takes down the servers having issues, since they have enough capacity to move traffic around. Nabil noted that because AWS is the most reliable of their providers, they usually bring up cloud services for the users having issues, so they can continue playing while the underlying problems are fixed.
Below are screenshots from an in-house tool that the Hi-Rez team built. It gives a sense of where users access Hi-Rez games from and what the end-user experience looks like around the globe.
Formidable Network Challenges
Nabil went on to discuss the most significant problems that Hi-Rez now faces, and how his team has used ThousandEyes to mitigate these issues.
In the past, when the Hi-Rez data centers had problems, Nabil’s team would struggle with conflicting messages: all of their players were complaining of issues, but all of their monitoring solutions were green and showed no problems at all. At that point, “we had to run a lot of MTR reports and talk to three different departments just to get that data. Then we had to contact our users and ask if it’s okay to get their public IPs.” In Nabil’s words, “Thank God I don’t have to do any of that stuff anymore.”
Now that Nabil’s team uses ThousandEyes, when they get complaints, “I log into ThousandEyes, look for the relevant time period, and I see the problem. I take a snapshot and share it from ThousandEyes to my vendor, and use it to open a ticket. I keep track of those issues and have a bi-weekly meeting with my account executive to show him all the screw-ups. ThousandEyes really helped us identify all these issues and make relationships better—I’m not taxing my guys and our vendors, and we’re able to get to a resolution because we’re giving them concrete information. The last mile has been my pain for a very long time.”
Nabil then showed a ThousandEyes screenshot of a time period when their data centers were the targets of a DDoS attack. All of the agents observed extremely high loss, and “it was always very clear that we had a major issue.”
Tier 1 Providers
Nabil acknowledged that the Internet and ISPs are out of his team’s control: “With Tier 1 providers, you never know.” His team is working to make their response to issues with Tier 1 providers more efficient and intelligent. They’ve already built an intelligent response into their games: “If you’re not doing well in a game and worried it’ll affect your score on the leaderboard, you might just exit. For exiting, we give you a penalty by putting you in the queue for ten minutes where you can’t play. But whenever the game senses problems within our core, it issues a ‘safe mode’ so we don’t penalize players for our mistakes or for our services going down.”
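The exit-penalty and safe-mode behavior Nabil describes could be sketched roughly as follows. This is a toy model, not Hi-Rez’s actual code: the class and method names are illustrative, and only the ten-minute queue lockout comes from the talk.

```python
from datetime import datetime, timedelta

EXIT_PENALTY = timedelta(minutes=10)  # queue lockout after abandoning a match

class MatchmakingQueue:
    """Toy model of the exit-penalty / safe-mode behavior described in the talk."""

    def __init__(self):
        self.safe_mode = False   # set when core services are known to be degraded
        self.locked_until = {}   # player_id -> datetime when they may queue again

    def player_exited_match(self, player_id, now=None):
        """Penalize a mid-match exit, unless safe mode is on
        (i.e., the disconnect was likely the service's fault, not the player's)."""
        if self.safe_mode:
            return
        now = now or datetime.utcnow()
        self.locked_until[player_id] = now + EXIT_PENALTY

    def can_queue(self, player_id, now=None):
        """A player may queue once their lockout (if any) has expired."""
        now = now or datetime.utcnow()
        return now >= self.locked_until.get(player_id, now)
```

The key design point is that the penalty check consults a single `safe_mode` flag, so flipping it suppresses penalties everywhere at once when the platform itself is the problem.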
In the future, the Hi-Rez team plans to work on setting up an API service that talks to the ThousandEyes API. “That’s what we see as an API call—when the Internet is having issues, go immediately into safe mode.”
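A minimal sketch of what such an integration might look like: poll the ThousandEyes alerts API and flip safe mode on when an active alert spans several agents, suggesting a widespread Internet problem rather than one player’s connection. This assumes the v6 `alerts.json` endpoint with basic authentication; the three-agent threshold and the `should_enter_safe_mode` helper are illustrative, not part of Hi-Rez’s actual system.

```python
import base64
import json
import urllib.request

API_URL = "https://api.thousandeyes.com/v6/alerts.json"

def fetch_active_alerts(email, token):
    """Fetch currently active alerts from the ThousandEyes API (basic auth)."""
    req = urllib.request.Request(API_URL)
    creds = base64.b64encode(f"{email}:{token}".encode()).decode()
    req.add_header("Authorization", f"Basic {creds}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("alert", [])

def should_enter_safe_mode(alerts, min_affected_agents=3):
    """Enter safe mode when any active alert affects several agents at once,
    which points to an Internet-wide issue rather than a single player's line."""
    for alert in alerts:
        if alert.get("active") and len(alert.get("agents", [])) >= min_affected_agents:
            return True
    return False
```

In practice a small daemon could run this check every minute and toggle the game servers’ safe-mode flag accordingly.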
Below is a BGP Route Visualization that shows route changes and reachability issues from one of the Tier 1 providers to the Hi-Rez network. Nabil’s team quickly identified and communicated the problem to the provider, who promptly fixed the issue.
Unsurprisingly, Nabil said, “DDoS is my biggest pain.” He presented another BGP Route Visualization that showed reachability issues during a DDoS attack—Hi-Rez’s network was almost completely unreachable from route monitors located around the world.
However, being constantly attacked isn’t the only issue. The team also contends with the side effects of sitting permanently behind Prolexic. As Nabil said, “Prolexic offered us tons of protection, but they also brought some instability into our world. As they’re shunting traffic back and forth inside their scrubbing centers, they’re affecting us badly.”
“We had a problem where some parts of our servers would disconnect for no reason every eight minutes. This was especially bad because our users connect through persistent connections. If that sticky connection breaks, that means you’ve disconnected, and you have to start back at the login screen. Our users would get very upset when we had this issue because they would have to reconnect every eight minutes. It was a mystery, and we tried everything to the point of replacing our entire network and data center. We got new Juniper equipment and implemented a multi-hundred thousand dollar project because that’s what all the fingers were pointed at.”
“Since we implemented ThousandEyes, we found that the issues were occurring whenever Akamai made a change within its network. What solved the problem was sitting down with their engineers and simply telling them not to make those changes as often. ThousandEyes paid for itself for a whole year just by uncovering that.”
The Impact of ThousandEyes
In his closing remarks, Nabil summarized the difference that ThousandEyes has made for him and his team. “I’m able to do a lot of root cause analysis—I no longer live in the gray not knowing what happened during a network outage. Now we’re able to pinpoint everything. We use Freshservice as a ticketing service, so we issue every incident with a snapshot from ThousandEyes, and we’re able to be accountable in front of our stakeholders, executives and users. The analysis is quick—for me to be able to just log in and know what’s happening in a matter of a minute is huge. I can’t get that from any other network monitoring tools we’ve used.”
“Using ThousandEyes improved our partnerships with all of our vendors and resulted in happier players.” Best of all, ThousandEyes improved work-life balance for the IT team and allowed them to focus on more engaging, strategic projects: “In 2016, I implemented the policy that no one on my team can work over 50 hours a week, because we were going at the rate of more than 80 hours. ThousandEyes closed that gap for us and gave us the ability to proactively figure out issues. Now, it’s very easy to satisfy our president, who is constantly on Twitter and Facebook seeing user problems. The first ThousandEyes account was for our president, so he’d be able to see all of the alerts.”
For more from ThousandEyes Connect NYC, see our post on Optimizing WAN to Deliver SharePoint Online Globally.