
Tuning Mahout’s Item-based Recommender

April 28, 2012


Mendeley Suggest (Figure 1) generates personalised research article recommendations for researchers.  It’s a pretty awesome feature that I like to think of as a personalised research advisor who’s at your beck and call 24/7.  It’s powered by Mahout, a machine learning and data mining library designed for large-scale data processing.  While Mahout’s recommendation algorithm implementation is lean, it can still burn a hole in your pocket when you’re generating recommendations for millions of users on a daily basis.  In this post, I look at ways to reduce the cost of generating recommendations, without sacrificing their quality, by better understanding the characteristics of the input data.

Figure 1: A screenshot of Mendeley Suggest, a research article recommendation system.

Measuring Recommenders

Mendeley generates article recommendations for all of its users on a weekly basis.  When generating recommendations, I’m primarily interested in two factors: recommendation quality; and cost.  Of course, I’m looking for high quality and low cost.  Typically, reducing the cost will also reduce the quality, but I’d like to reduce the cost without reducing the quality.  Let’s first look at how quality and cost are measured before putting some numbers on how Mahout’s recommender performs.

Recommendation Quality

Assume that we have 100 user libraries.  We can generate recommendations for all 100 users, but how can we tell if they are any good?  We could present the recommendations to the users and get them to rate how good they are.  This process, however, is quite time-consuming, especially when you have hundreds or even thousands of ways of generating recommendations and you want to know which one works best.

An alternative to asking users is to run cross-validation with a ‘leave some out’ method.  From the 100 libraries, split them into a training set and a testing set.  Let’s put 90 libraries in the training set and 10 in the testing set.  From the testing set, for each user library, hide some of their articles.  Train the recommender with the 90 training set libraries and the non-hidden articles from the testing set.  Predict recommendations for the users in the testing set and compare them with their hidden articles.  If your recommender manages to recommend any of the articles that were hidden, then it’s doing a pretty good job, because we know that these articles are of interest to that user, since they are in their library.  We can measure the quality of recommendations with those bread-and-butter staple metrics of information retrieval: precision; and recall.  Repeat this procedure several times, say ten, from the start, creating new training and testing sets, so that all user libraries appear in at least one testing set.  This is called ten-fold cross-validation, and it lets you run through many recommender configurations very quickly.
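As a concrete sketch of the scoring step, precision and recall against a user’s hidden articles might be computed like this; the class and method names are illustrative, not Mahout’s API:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative scoring of one test user: compare the top-N recommendations
// against the articles that were hidden from that user's library.
public class HoldoutScorer {

    // precision@N: fraction of the N recommendations that were hidden articles
    public static double precisionAtN(List<String> recommended, Set<String> hidden) {
        int hits = 0;
        for (String article : recommended) {
            if (hidden.contains(article)) {
                hits++;
            }
        }
        return recommended.isEmpty() ? 0.0 : (double) hits / recommended.size();
    }

    // recall: fraction of the hidden articles that were recommended
    public static double recallAtN(List<String> recommended, Set<String> hidden) {
        int hits = 0;
        for (String article : hidden) {
            if (recommended.contains(article)) {
                hits++;
            }
        }
        return hidden.isEmpty() ? 0.0 : (double) hits / hidden.size();
    }

    public static void main(String[] args) {
        List<String> top10 = Arrays.asList("a1", "a2", "a3", "a4", "a5",
                                           "a6", "a7", "a8", "a9", "a10");
        Set<String> hidden = new HashSet<>(Arrays.asList("a2", "a7", "a42"));
        System.out.println(precisionAtN(top10, hidden)); // 2 of the 10 were hidden articles
        System.out.println(recallAtN(top10, hidden));    // 2 of the 3 hidden articles found
    }
}
```

Averaging these scores over every user in every fold gives the precision @ 10 figures reported below.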


Recommendation Cost

Mahout’s distributed recommender is written in MapReduce and designed to run on Hadoop.  The quicker the recommender generates results, the less it costs to run.  Mahout can be run on Amazon’s Elastic MapReduce (EMR) framework.  When you spin up an EMR cluster of computers, you pay for the number of computers that you rent per hour.  When the MapReduce job completes, the EMR cluster is torn down and the final cost is computed.
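As a rough sketch of that billing model (the cluster size and hourly rate below are made up, and at the time EMR rounded partial instance-hours up to the next full hour):

```java
// Illustrative EMR cost model: cost = instances x billed hours x hourly rate.
// The figures used here are hypothetical, not actual AWS prices.
public class EmrCost {

    // partial hours are rounded up to the next full instance-hour
    public static double cost(int instances, double jobHours, double hourlyRatePerInstance) {
        long billedHours = (long) Math.ceil(jobHours);
        return instances * billedHours * hourlyRatePerInstance;
    }

    public static void main(String[] args) {
        // e.g. 20 machines running a 13.5 hour job: billed as 14 full hours each
        System.out.println(cost(20, 13.5, 1.25));
    }
}
```

The key point for what follows: shaving hours off a job directly shaves dollars off the bill.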

Generating Recommendations

Mahout 0.6’s distributed item-based collaborative filtering recommender has been tested on EMR under cross-validation conditions.  It was trained on a large set of 350,000 user libraries, all containing between 2 and 1,000 research articles.  In total, there are over 250 million articles in these libraries, representing around 35 million unique articles.  5,000 user libraries were tested.  On average, it cost $650 to generate recommendations for all users, with an average precision @ 10 of 0.15 (i.e. 1-2 good recommendations out of 10) (Figure 2).

Figure 2: Results from testing Mahout 0.6’s out of the box distributed item-based recommender.

While this level of precision looks low, it’s actually quite competitive.  First, we don’t know for sure whether a recommended article that does not appear in a user’s library would be of interest to that user or not.  That is, a user’s library is not a complete record of their interests, but a subset.  With real user testing, we tend to find that acceptance rates are higher than this precision predicts.  Second, this is a difficult task: there are very few personalised research article recommender systems in the world, and only a couple that operate on a large scale: Mendeley Suggest; and CiteULike.  How about the cost of generating the recommendations, though?  When I look at the steps of the MapReduce job in detail, I don’t get the impression that it’s making effective use of the resources available, meaning that the job takes longer to complete and costs more.

Mapper and Reducer Allocation

The Problem

In order to efficiently generate recommendations, Hadoop distributes Mahout’s MapReduce calculations across a cluster of computers.  Hadoop, however, doesn’t always make the best use of the resources available to it.  When I was running the recommender on EMR, I looked at how the jobs were being distributed across the cluster.  Mahout chains 10 MapReduce jobs together, running each step in the pipeline sequentially.  In one job in the pipeline, hundreds of gigabytes of data were all being poured into a single reducer task, despite there being 40 reducers available across the cluster (Figure 3).  The lone reducer took almost 40 hours to complete its task.  A more efficient use of the cluster would be to distribute the calculation across more reducers.
Figure 3: Reduce Completion Graph shows that 1 reducer has been allocated to the job.  It has progressed through less than 10% of the reduction phase, still copying data that is output from the mappers.

A Solution

Sometimes Hadoop needs a helping hand in allocating mappers and reducers.  The number of reducers allocated to a job can be set quite easily via the “mapred.reduce.tasks” property.  Setting this property guarantees the number of reducers allocated to the job.
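For example, assuming the old mapred API that Mahout 0.6 runs on, the property can be set in the job configuration, or passed on the command line as -Dmapred.reduce.tasks=40:

```xml
<property>
  <name>mapred.reduce.tasks</name>
  <value>40</value>
</property>
```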


If the number of mappers being allocated to a job isn’t appropriate, then Hadoop also provides some functionality for tuning the number.  You can change the split size that Hadoop applies to input files when it’s allocating the slices to mappers.  If, for example, the input is 128MB in size and the split size is 64MB, then the input is split between 2 mappers.  Reducing the split size to 32MB will split the input across 4 mappers.  Unlike reducer allocation, setting the split size will not guarantee the number of mappers that is allocated to a job, since multiple input files are split separately.  So, if the job takes 3 input files, each of size 65MB, and the split size is 64MB, then each of the input files will be split in two, leading to 6 mappers being allocated in total, despite their total size being only 195MB (i.e. 3 x 65MB), which could fit in 4 splits.
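The arithmetic behind this can be sketched as follows; this is an illustration of the allocation rule, not Hadoop’s actual code:

```java
// Illustrative arithmetic for mapper allocation: each input file is split
// independently, so the mapper count depends on per-file sizes, not the total.
public class SplitCounter {

    // number of map tasks for a set of input files at a given split size
    public static int countMappers(long[] fileSizes, long splitSize) {
        int mappers = 0;
        for (long size : fileSizes) {
            // each file contributes ceil(size / splitSize) splits
            mappers += (int) ((size + splitSize - 1) / splitSize);
        }
        return mappers;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // three 65MB files at a 64MB split size: 6 mappers, not 4
        System.out.println(countMappers(new long[]{65 * mb, 65 * mb, 65 * mb}, 64 * mb));
    }
}
```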


Continuing the previous example of reducer under-allocation, I ran the same job but with 40 reducers being allocated this time round (Figure 4).  The overall time for the job to complete reduced from almost 40 hours to around 14 hours.  That’s a time saving of 65% for that single job and shaves off 26 hours from the entire 10 job pipeline.

Figure 4: Reduce Completion Graph shows that 40 reducers are running in parallel.

Speed Up 1

I helped Hadoop to better allocate the number of mappers and reducers at all steps of the 10 job pipeline.  Overall, this had the effect of reducing the cost of generating recommendations by $150 (Figure 5).  That’s almost a quarter of the original cost.  Interestingly, helping Hadoop to make better use of its resources does not have an impact on the quality of the recommendations generated. As a result, we can generate recommendations of the same high quality but for less expense.

Figure 5: Comparison of the results from testing Mahout 0.6’s out of the box distributed item-based recommender (Mahout 0.6) and a version with better mapper and reducer allocation (Mahout 0.6′).  The latter version costs $150 less than the out of the box version.

Distribute Data into Even Partitions

The Problem

As previously discussed, we can reduce the overall running time, and therefore cost, for a job when we distribute tasks over multiple nodes in our cluster.  When Hadoop allocates multiple reducers to a job, it attempts to partition the reducers’ input data evenly so that each reducer has roughly the same amount of work to do.  This partitioning, however, isn’t always well distributed.  Figure 4 shows that we can distribute a job across multiple reducers, 40 in this case.  Looking at the amount of data being passed on to each reducer, however, reveals that the input data is not being evenly distributed (Figure 6).  Although the two reducers highlighted have been running for the same amount of time, one is already over 10% complete while the other is just over 1% complete.  The more complete reducer has got to 10% by copying only around 50KB of data, while the other reducer has already copied around 500MB of data.  Since the data is not evenly distributed, some reducers are given far more work to do than others, and the overall completion time for the job will not reduce as much as it would if the reducers shared evenly distributed loads.

Figure 6: Reduce Completion Table with file counters for two selected reducers.  One reducer has received around 50KB of data while the other has received around 500MB of data.
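To see why this happens, here is an illustrative sketch of default hash partitioning, in which a record is routed to reducer hash(key) mod numReducers; the key names and record counts below are made up:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative hash partitioning: when a few keys dominate the input,
// every record for a hot key lands on the same reducer, however many
// reducers are available.
public class PartitionSkew {

    // total records routed to each reducer, given per-key record counts
    public static long[] loadPerReducer(Map<String, Long> keyCounts, int reducers) {
        long[] load = new long[reducers];
        for (Map.Entry<String, Long> e : keyCounts.entrySet()) {
            int partition = (e.getKey().hashCode() & Integer.MAX_VALUE) % reducers;
            load[partition] += e.getValue();
        }
        return load;
    }

    public static void main(String[] args) {
        Map<String, Long> keyCounts = new HashMap<>();
        keyCounts.put("very-popular-article", 1000000L); // hypothetical hot key
        keyCounts.put("rare-article-1", 10L);
        keyCounts.put("rare-article-2", 10L);
        long[] load = loadPerReducer(keyCounts, 40);
        long max = 0;
        for (long l : load) {
            max = Math.max(max, l);
        }
        // the reducer that receives the hot key does almost all of the work
        System.out.println(max);
    }
}
```

Sampling the real key distribution, as described next, is what lets Hadoop pick partition boundaries that even this load out.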

A Solution

Hadoop allows you to define the way in which it partitions the input data fed into the reducers.  Since my input data isn’t evenly distributed, it’s important to sample it in order to find appropriate partition points.  Hadoop provides an InputSampler that will do just that.  It samples your data and then sends these partition points to a partitioner that allows your data to be more evenly distributed.

 // the sampling rate, sample count and max splits sampled are illustrative values
 InputSampler.Sampler<IntWritable, Text> sampler =
     new InputSampler.RandomSampler<IntWritable, Text>(0.1, 10000, 10);
 InputSampler.writePartitionFile(conf, sampler);
 conf.setPartitionerClass(TotalOrderPartitioner.class); // use the sampled partition points

Applying this partitioner to the misbehaving steps dramatically improves the run-time (Figure 7).  Since the amount of data being sent to each of the 40 reducers is well distributed, they all complete in roughly the same amount of time.  The run-time for the most ill-distributed job drops from around 14 hours to just 2 hours.

Figure 7: Reduce Completion Graph that shows 40 reducers running in parallel.  The reducers are progressing at roughly the same pace as one another.

Speed Up 2

By applying well-distributed partitioners to the steps in the Mahout pipeline, the overall cost of generating recommendations reduced by a further $270, a saving of over 50% (Figure 8).  Partitioning the data more evenly means that the cluster’s resources are being used more evenly.  Again, while the cost of generating recommendations has been reduced, their quality has not been compromised.

Figure 8: Comparison of the results from testing Mahout 0.6’s out of the box distributed item-based recommender (Mahout 0.6), a version with better mapper and reducer allocation (Mahout 0.6′) and a version with better mapper and reducer allocation plus better data partitioning (Mahout 0.6”).  The third version costs $270 less than the second version.


Conclusions

Mahout’s algorithms are very efficient.  Hadoop, however, sometimes needs a helping hand in making the best use of the resources available.  In this post, I considered two ways of reducing the overall cost of generating recommendations on a large scale without reducing the quality of the results.  The first is to set the number of mappers and reducers that are allocated to jobs, helping them to exploit the full resources of the cluster in which they are running.  The second is to evenly distribute the input data that is passed to jobs by evenly partitioning it.  By understanding the shape of your data, you can tune Hadoop to make better use of its resources and significantly reduce the cost of running jobs.  In this case, the overall cost of generating recommendations has been reduced from $650 to $240, a 63% reduction in cost (Figure 9).

Figure 9: Comparison of the results from testing Mahout 0.6’s out of the box distributed item-based recommender (Mahout 0.6), a version with better mapper and reducer allocation (Mahout 0.6′) and a version with better mapper and reducer allocation plus better data partitioning (Mahout 0.6”).  The third version costs $410 less than the out of the box version.
Comments

  1. Thanks for sharing.  Can I ask how you calculated the cost for each recommendation iteration?

  2. All of the experiments were run on Amazon’s Elastic MapReduce (EMR).  You can hire machines and are charged for them by the hour.  The costs that are reported are averages over multiple runs.

  3. Bartosz Kupisz

    Hello Kris,
    First, I want to say that I’m really impressed with the work you have been doing.  At the moment I’m working on my master’s thesis about recommendation engines on different cluster computing systems.  In the first phase, my job is to choose the best similarity metric for a given boolean dataset with Mahout, and here I’m really confused, since Mahout doesn’t support evaluation of distributed recommenders.  Could you share some knowledge on this topic?  How exactly did you evaluate the versions of the item-based recommender?  I know the theory, but writing a ‘framework’ for this seems quite tough to me.  I would be very grateful for any tips.  Thank you in advance.

