If you are working on data aggregation problems using MongoDB's MapReduce framework and your job seems slow, read on: this post describes a trick that may speed it up.
In one of our data aggregation problems, we need to group subdomains under a single primary domain (e.g. www.google.com, google.com, and mail.google.com are all grouped into google.com). To determine the primary domain, we find the longest eTLD (effective top-level domain) and take it together with the label that precedes it (e.g. for www.portugal.gov.pt, both .pt and .gov.pt are eTLDs; since .gov.pt is the longer of the two, we identify portugal.gov.pt as the domain).
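The rule above can be sketched as follows. This is a minimal illustration, not our production code: 'getPrimaryDomain', 'toyRules' and 'toyIsPublicSuffix' are hypothetical names, the rule set is a three-entry toy, and the wildcard and exception rules of the real public suffix list are ignored.

```javascript
// Hypothetical helper: returns the primary domain of a host name, given a
// predicate that tells whether a dot-separated suffix is an eTLD.
var getPrimaryDomain = function(host, isPublicSuffix) {
    var labels = host.toLowerCase().split('.');
    // Try candidate suffixes from longest to shortest, so the first match
    // is the longest eTLD. Starting at 1 leaves at least one label in front.
    for (var i = 1; i < labels.length; i++) {
        var suffix = labels.slice(i).join('.');
        if (isPublicSuffix(suffix)) {
            // Keep the eTLD plus the single label that precedes it.
            return labels.slice(i - 1).join('.');
        }
    }
    // No known eTLD: return the (lowercased) host unchanged.
    return labels.join('.');
};

// Toy rule set for illustration only; the real list has about 6000 rules.
var toyRules = { 'com': true, 'pt': true, 'gov.pt': true };
var toyIsPublicSuffix = function(suffix) {
    return toyRules[suffix] === true;
};

getPrimaryDomain('www.portugal.gov.pt', toyIsPublicSuffix); // 'portugal.gov.pt'
getPrimaryDomain('mail.google.com', toyIsPublicSuffix);     // 'google.com'
getPrimaryDomain('google.com', toyIsPublicSuffix);          // 'google.com'
```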
To tell whether a part of a domain is an eTLD, we match it against Mozilla's list of public suffix rules. We have a function that looks the suffix up in a set of all the rules, of which there are about 6000. Since this is JavaScript, we use a plain object to mimic a set.
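For readers unfamiliar with the idiom, here is one way a plain object can stand in for a set. The helper names 'buildRuleSet' and 'hasRule' are made up for this sketch, and the three rules are just a sample.

```javascript
// Hypothetical helper: turn an array of rule strings into a set-like object
// whose keys are the rules and whose values are all true.
var buildRuleSet = function(rules) {
    var set = {};
    for (var i = 0; i < rules.length; i++) {
        set[rules[i]] = true;
    }
    return set;
};

// Membership test. Comparing against true (rather than using the 'in'
// operator) avoids false positives on inherited keys such as 'constructor'.
var hasRule = function(set, rule) {
    return set[rule] === true;
};

var sampleRules = buildRuleSet(['com', 'pt', 'gov.pt']);

hasRule(sampleRules, 'gov.pt');      // true
hasRule(sampleRules, 'google.com');  // false
hasRule(sampleRules, 'constructor'); // false
```

A lookup like this takes roughly constant time regardless of how many rules the object holds, which is why the size of the list matters little for the check itself.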
Originally, the function had the following signature:
var isPublicSuffix = function(suffix) { ... };
A global variable 'suffixRules' contained the set of all rules, and the 'scope' parameter was used to pass it to the MapReduce functions. With our current dataset (about 500 thousand records) and configuration, MongoDB took 36m56s to run the MapReduce job. Unsatisfied with this result, we tried a trick: since 'suffixRules' is in fact a constant, we embedded it in the function definition itself.
First, we changed it so that 'suffixRules' becomes a parameter:
var isPublicSuffixPrototype = function(suffixRules, suffix) { ... };
Then, we bound the first parameter to the actual rule set before passing the function to MongoDB (we use the 'system.js' collection to store functions on the server side), using the following partial application wrapper:
var generateIsPublicSuffix = function(suffixRules) {
    // Inline the rule set as a JSON literal in the source of the new function.
    return new Function('suffix',
        'return (' + isPublicSuffixPrototype.toString() + ')(' +
        JSON.stringify(suffixRules) + ', suffix);');
};
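To see what the wrapper produces, here is a small self-contained sketch. The body of 'isPublicSuffixPrototype' below is a hypothetical stand-in (a plain set lookup), since the real body is not shown in this post, and the three rules are a toy sample.

```javascript
// Hypothetical stand-in for the real function body: a plain set lookup.
var isPublicSuffixPrototype = function(suffixRules, suffix) {
    return suffixRules[suffix] === true;
};

// The wrapper from this post, unchanged in behavior.
var generateIsPublicSuffix = function(suffixRules) {
    return new Function('suffix',
        'return (' + isPublicSuffixPrototype.toString() + ')(' +
        JSON.stringify(suffixRules) + ', suffix);');
};

var isPublicSuffix = generateIsPublicSuffix({ 'com': true, 'pt': true, 'gov.pt': true });

isPublicSuffix('gov.pt');     // true
isPublicSuffix('google.com'); // false

// The rules are now part of the function's own source text, so the function
// no longer depends on any outside variable.
var source = isPublicSuffix.toString();
source.indexOf('"gov.pt":true') !== -1; // true

// Assumed legacy mongo shell usage for storing it on the server:
// db.system.js.save({ _id: 'isPublicSuffix', value: isPublicSuffix });
```

Because 'new Function' builds the function from a string, the generated function carries no closure: everything it needs is in its source, which is what makes it safe to serialize and store on the server.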
This cut execution time down to 1m53s, a roughly 20-fold improvement (2216 seconds versus 113 seconds). The performance loss when using the 'scope' variable seems to be related to how variables are synchronized between MongoDB and the embedded V8 engine. Any thoughts?