Using Large Constants in MongoDB MapReduce Framework

By Bruno Nery | May 29, 2013 | 4 min read

Summary


If you are working on data aggregation problems with the MongoDB MapReduce framework and your job seems slow, read on: this post describes one way to speed it up.

In one of our data aggregation problems, we need to group subdomains under a single top domain (e.g. www.google.com, google.com, and mail.google.com are all grouped under google.com). To determine the primary domain, we find the longest eTLD (effective top-level domain) and take it together with the label that precedes it (e.g. for www.portugal.gov.pt, both .pt and .gov.pt are eTLDs; since .gov.pt is the longer one, we identify portugal.gov.pt as the domain).
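The grouping rule above can be sketched as follows. This is only an illustration, not the code we run in production: `primaryDomain`, `isExampleSuffix`, and the three-entry `exampleSuffixes` object are hypothetical names introduced here, and the sketch ignores the wildcard and exception rules that the real public suffix list contains.

```javascript
// Hypothetical stand-in for the real rule set: only the suffixes
// needed for the examples in the text.
var exampleSuffixes = { 'com': true, 'pt': true, 'gov.pt': true };

var isExampleSuffix = function(suffix) {
	return Object.prototype.hasOwnProperty.call(exampleSuffixes, suffix);
};

// Returns the longest matching suffix plus the label that precedes it.
var primaryDomain = function(hostname, isSuffix) {
	var labels = hostname.toLowerCase().split('.');
	// Candidates are tried from longest to shortest, so the first
	// match is the longest eTLD.
	for (var i = 1; i < labels.length; i++) {
		if (isSuffix(labels.slice(i).join('.'))) {
			return labels.slice(i - 1).join('.');
		}
	}
	// No known suffix: leave the hostname as it is.
	return labels.join('.');
};
```

With these assumptions, `primaryDomain('www.portugal.gov.pt', isExampleSuffix)` stops at `gov.pt` before it ever tries `pt`, which is what makes the longer eTLD win.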

To tell whether part of a domain is an eTLD, we match it against Mozilla's list of public suffix rules. We have a function that uses a set containing all of the rules, of which there are about 6,000. Since this is JavaScript, we use a plain object to mimic a set.
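For readers unfamiliar with the idiom, here is a minimal sketch of a plain object standing in for a set. The helper names `makeSet` and `setContains` and the sample rules are assumptions made for this example; our actual rule object may be built differently.

```javascript
// Builds a set-like object: each member becomes a key mapped to true.
var makeSet = function(items) {
	var set = {};
	for (var i = 0; i < items.length; i++) {
		set[items[i]] = true;
	}
	return set;
};

// Membership test. Going through hasOwnProperty avoids false positives
// for names inherited from Object.prototype, such as 'toString'.
var setContains = function(set, item) {
	return Object.prototype.hasOwnProperty.call(set, item);
};

// Hypothetical sample; the real list has about 6,000 entries.
var sampleRules = makeSet(['com', 'pt', 'gov.pt']);
```

A plain object like this contains only strings and booleans, so it survives `JSON.stringify` unchanged.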

Originally, the function had the following prototype:

var isPublicSuffix = function(suffix) {
	...
};

A global variable 'suffixRules' contained the set of all rules, and we used the 'scope' parameter to pass it to the MapReduce functions. With our current dataset (about 500 thousand records) and configuration, MongoDB took 36m56s to run the MapReduce job. Unsatisfied with that result, we tried a trick: since 'suffixRules' is in fact a constant, we embedded it in the function definition.
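For context, passing a variable through 'scope' looks roughly like the sketch below. The wrapper `runWithScope` and its parameter names are hypothetical, and the output target `{ inline: 1 }` is an assumption chosen for brevity; only the `scope` option itself is taken from the description above.

```javascript
// Hypothetical wrapper around the shell's collection.mapReduce call.
// 'collection' is expected to be a MongoDB shell collection object.
var runWithScope = function(collection, mapFn, reduceFn, suffixRules) {
	return collection.mapReduce(mapFn, reduceFn, {
		out: { inline: 1 },                  // assumed output target
		scope: { suffixRules: suffixRules }  // exposed as a global to map/reduce
	});
};
```

Every key in `scope` becomes a global variable visible inside the map and reduce functions, which is how the original 'isPublicSuffix' could refer to 'suffixRules' without receiving it as an argument.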

First, we changed it so that 'suffixRules' becomes a parameter:

var isPublicSuffixPrototype = function(suffixRules, suffix) {
	...
};

Then, we fixed the first parameter before passing the function to MongoDB (we use the 'system.js' collection to store functions on the server side) using the following partial application wrapper:

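var generateIsPublicSuffix = function(suffixRules) {
	// Build a one-argument function whose source text contains both the
	// prototype function and the rules serialized as an object literal,
	// so it no longer depends on any outside variable.
	return new Function('suffix', 'return ('
		+ isPublicSuffixPrototype.toString() + ')('
		+ JSON.stringify(suffixRules) + ', suffix);');
};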
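To see the wrapper in action outside MongoDB, here is a self-contained sketch. The body of `isPublicSuffixPrototype` is an assumption (the listing above elides it), reduced here to a plain membership test, and `sampleRules` is a hypothetical three-entry stand-in for the real list.

```javascript
// Assumed body: the real implementation is not shown in this post.
var isPublicSuffixPrototype = function(suffixRules, suffix) {
	return Object.prototype.hasOwnProperty.call(suffixRules, suffix);
};

// Same partial application wrapper as above.
var generateIsPublicSuffix = function(suffixRules) {
	return new Function('suffix', 'return ('
		+ isPublicSuffixPrototype.toString() + ')('
		+ JSON.stringify(suffixRules) + ', suffix);');
};

// Hypothetical sample rules.
var sampleRules = { 'com': true, 'pt': true, 'gov.pt': true };
var isPublicSuffix = generateIsPublicSuffix(sampleRules);

// The generated source now carries the rules as a literal.
var generatedSource = isPublicSuffix.toString();
```

Under these assumptions, `isPublicSuffix('gov.pt')` returns true and `isPublicSuffix('google.com')` returns false, and `generatedSource` contains the serialized rules, so the function can be stored and run without any 'scope' variable. One thing to keep in mind is that the object literal sits inside the function body, so in principle it is evaluated again on every call.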

This cut execution time down to 1m53s (from 2,216 seconds to 113 seconds), a roughly 20-fold improvement in performance. The performance loss from using the 'scope' variable seems to be related to how variables are synchronized between MongoDB and the embedded V8 engine. Any thoughts?
