In this post from ThousandEyes Connect Santa Clara, we’ll summarize the presentation by Donavan Fritz, Sr. Network Site Reliability Engineer at Netflix.
During his talk, Donavan described how his team tracks IP addresses within the AWS cloud — specifically, how they built a system to quickly figure out where a given IP address is allocated at a given time.
IP Addresses Mean Nothing
At Netflix, Donavan works on the Cloud Network Engineering team, which is responsible for DNS, network architecture inside AWS VPC (Virtual Private Cloud) as well as network triage and telemetry inside the cloud. Uniquely, they are a networking team that has no access to hardware; as Donavan says, “When you think about typical networking teams you generally think routers and switches, but we don’t have that luxury — we operate entirely in the overlay that Amazon provides.”
Donavan then showed a sample of netstat output. The problem with this output is that it’s impossible to tell what the IP addresses represent: “These IPs actually mean nothing to me; and as a member of the cloud networking team, this is a big problem that we need to solve. We should be able to quantify every one of these addresses.” To solve this problem, Donavan first set out to define a precise question to answer, because “what are these IPs?” is too vague to build a system around.
No Network Segmentation in the Cloud
To provide context, Donavan then explained their AWS operations: “In typical networks, you have the concept of network segmentation. With a well-defined IP scheme, you can look at an address and recognize that it’s in a particular data center or if it’s a domain controller. In the cloud, we don’t have this kind of segmentation.”
Tokens in a Bucket
Donavan thinks about IP addresses like tokens in a bucket: the VPC is a bucket of tokens, and resources need tokens to communicate on the network. At any point in time an IP could be assigned to an EC2 instance; when that instance is terminated, the IP address goes back to the bucket and is free to be used by something else.
As a result, an IP address can be used by many different applications. In Netflix’s environment, IPs change a lot. Because IPs represent different things at different times, the key question must be revised: “What are these IPs at this time?”
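The token analogy can be sketched as a toy address pool. This is only an illustration of the idea, not how AWS allocates addresses internally; the `IpBucket` class, the addresses and the resource names are all hypothetical.

```python
class IpBucket:
    """Toy model of a VPC address pool: addresses are tokens handed out and returned."""

    def __init__(self, addresses):
        self.free = list(addresses)  # tokens still in the bucket
        self.in_use = {}             # ip -> resource currently holding it

    def allocate(self, resource):
        ip = self.free.pop(0)
        self.in_use[ip] = resource
        return ip

    def release(self, ip):
        del self.in_use[ip]
        # Assumption for the demo: a returned token is the next one handed out.
        self.free.insert(0, ip)


bucket = IpBucket(["10.0.0.5", "10.0.0.6"])
ip = bucket.allocate("app-a instance")
bucket.release(ip)                          # the instance is terminated
reused = bucket.allocate("app-b instance")  # same address, different application
```

The same address ends up representing two different applications, which is why an IP on its own is not enough to identify a workload.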
A Well-Defined Question
Donavan then mentioned another issue: VPCs can overlap in IP space, so the same address can exist in more than one VPC, and the system has to account for this as well. He differentiates overlapping IP networks by putting them into different routing domains.
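A short sketch of why the routing domain has to be part of the key. The CIDR blocks, the domain names (`domain-a`, `domain-b`) and the application names are invented for illustration; only the standard-library `ipaddress` module is used.

```python
import ipaddress

# Two hypothetical VPCs whose address ranges overlap.
vpc_a = ipaddress.ip_network("10.0.0.0/16")
vpc_b = ipaddress.ip_network("10.0.128.0/17")
overlap = vpc_a.overlaps(vpc_b)

# The same address can be live in both VPCs at once, so records are keyed
# by (routing domain, IP) rather than by IP alone.
owners = {
    ("domain-a", "10.0.200.7"): "app-a",
    ("domain-b", "10.0.200.7"): "app-b",
}
```

Keyed by IP alone, one of these two entries would overwrite the other.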
He also adds more color to the key question: “We also want to define what is ‘what?’ If you tell me that this IP address belongs to an EC2 instance with id i-123, that doesn’t tell me much. I still want to know more detail.” So, the fully refined question is now: “What application has this IP, at this time, in this routing domain?” This is now a very well-defined question with explicit inputs and outputs — it’s a function that a system can be built around.
There’s one last thing to consider. Netflix uses another AWS technology called VPC FlowLogs, which provides small records of Layer 3 and Layer 4 data describing communication happening inside Netflix’s VPC. As with the netstat output, these records have little meaning on their own because there’s no indication as to what the IP addresses are. They do, however, provide the inputs required to answer the key question: IP addresses, timestamp and routing domain. Because VPC FlowLogs describes every piece of communication across all of Netflix’s accounts, instances and regions, the volume is huge. So when Donavan’s team built a system to figure out what each IP was, they had to design for that volume.
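To make the shape of a flow log record concrete, here is a minimal parser. It assumes the default AWS flow log format (version 2), which is a space-separated line of 14 fields; the talk does not say which format Netflix uses, and the sample line is illustrative only. The function name `parse_flow_log` is hypothetical.

```python
# Field order of the default (version 2) VPC flow log format.
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)
NUMERIC = ("version", "srcport", "dstport", "protocol",
           "packets", "bytes", "start", "end")


def parse_flow_log(line):
    """Split one default-format record into a dict with typed numeric fields."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    for name in NUMERIC:
        # NODATA and SKIPDATA records carry "-" in place of missing values.
        record[name] = None if record[name] == "-" else int(record[name])
    return record


sample = ("2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 "
          "20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK")
record = parse_flow_log(sample)
```

The record carries addresses, ports and a time window, but nothing that says which application sat behind either address.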
Due to the high volume of data, Donavan and his team are working on building an IP change stream to answer their key question about identifying IPs.
Donavan describes the process: “We look at both Amazon and internal data sources, poll them at regular intervals, look for things that have been added, deleted or changed, and then place any changes into an event stream. The stream stores every IP address in our environment and at what time it changed.”
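The polling step Donavan describes can be sketched as a diff between two snapshots. This is a simplified illustration under stated assumptions, not Netflix's implementation: each snapshot is assumed to be a dict mapping `(routing_domain, ip)` to an owner, and the function name and event fields are hypothetical.

```python
def diff_snapshots(previous, current, polled_at):
    """Compare two {(routing_domain, ip): owner} snapshots and emit change events."""
    events = []
    for key in sorted(current.keys() - previous.keys()):
        events.append({"time": polled_at, "key": key,
                       "change": "added", "owner": current[key]})
    for key in sorted(previous.keys() - current.keys()):
        events.append({"time": polled_at, "key": key,
                       "change": "deleted", "owner": None})
    for key in sorted(current.keys() & previous.keys()):
        if current[key] != previous[key]:
            events.append({"time": polled_at, "key": key,
                           "change": "changed", "owner": current[key]})
    return events


previous = {("domain-a", "10.0.0.5"): "app-a", ("domain-a", "10.0.0.6"): "app-b"}
current = {("domain-a", "10.0.0.5"): "app-c", ("domain-a", "10.0.0.7"): "app-d"}
events = diff_snapshots(previous, current, polled_at=1495670748)
changes = {event["key"][1]: event["change"] for event in events}
```

One limitation of this approach is that a change is only timestamped to the poll that noticed it, so the precision of the stream depends on the polling interval.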
“So what does this look like in practice?” Donavan asked. Inputting the IP address, timestamp and routing domain (e.g., “18.104.22.168 @ 1495670748 in inet.0”) yields useful details relevant to the IP address, including the application and placement in the network.
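A lookup of this kind can be sketched as a point-in-time search over the change history. The talk does not describe the storage layer, so the in-memory `IpHistory` class below is purely hypothetical, as are the addresses, timestamps and application names. It assumes changes for a given key are recorded in time order.

```python
import bisect
from collections import defaultdict


class IpHistory:
    """In-memory history of which owner held each (routing_domain, ip) over time."""

    def __init__(self):
        self._times = defaultdict(list)
        self._owners = defaultdict(list)

    def record(self, routing_domain, ip, timestamp, owner):
        # Assumes events for the same key arrive in increasing time order.
        key = (routing_domain, ip)
        self._times[key].append(timestamp)
        self._owners[key].append(owner)

    def lookup(self, ip, timestamp, routing_domain):
        """Return the owner as of `timestamp`, or None if nothing is known yet."""
        key = (routing_domain, ip)
        times = self._times.get(key, [])
        index = bisect.bisect_right(times, timestamp) - 1
        if index < 0:
            return None
        return self._owners[key][index]


history = IpHistory()
history.record("domain-a", "10.0.0.5", 1495670000, "app-a")
history.record("domain-a", "10.0.0.5", 1495671000, "app-b")

at_query_time = history.lookup("10.0.0.5", 1495670748, "domain-a")
later = history.lookup("10.0.0.5", 1495671500, "domain-a")
before_any_event = history.lookup("10.0.0.5", 1495669000, "domain-a")
```

The binary search finds the most recent change at or before the requested time, which matches the three inputs in the key question: IP, time and routing domain.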
Reaping the Benefits
All of this data is very useful: “This helps us discover what’s happening in our network. If we annotate every FlowLog with this data, we can derive a connectivity map of our network.”
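As a rough sketch of the annotation idea, each flow record's source and destination can be resolved to an application and the bytes summed per application pair. Everything here is an assumption for illustration: the `resolve` callable stands in for the IP lookup system, each flow is assumed to already carry a `routing_domain` field, and the sample data is invented.

```python
from collections import Counter


def connectivity_map(flows, resolve):
    """Sum bytes per (source app, destination app) pair across flow records."""
    edges = Counter()
    for flow in flows:
        src = resolve(flow["srcaddr"], flow["start"], flow["routing_domain"]) or "unknown"
        dst = resolve(flow["dstaddr"], flow["start"], flow["routing_domain"]) or "unknown"
        edges[(src, dst)] += flow["bytes"]
    return edges


# Stand-in resolver: a static table that ignores the timestamp.
known = {("domain-a", "10.0.0.5"): "app-a", ("domain-a", "10.0.0.9"): "app-b"}


def resolve(ip, timestamp, routing_domain):
    return known.get((routing_domain, ip))


flows = [
    {"srcaddr": "10.0.0.5", "dstaddr": "10.0.0.9", "start": 1495670748,
     "bytes": 4249, "routing_domain": "domain-a"},
    {"srcaddr": "10.0.0.5", "dstaddr": "10.0.0.9", "start": 1495670800,
     "bytes": 751, "routing_domain": "domain-a"},
    {"srcaddr": "10.0.0.9", "dstaddr": "10.0.0.77", "start": 1495670810,
     "bytes": 100, "routing_domain": "domain-a"},
]
edges = connectivity_map(flows, resolve)
```

The resulting counter is an edge list keyed by application pairs, which is the raw material for a connectivity map.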