
Summary of Academic-Industrial Collaborations for Recommender Systems Workshop

Introduction

Mendeley Recommender Workshop

Last week, Mendeley hosted an all-day workshop on Academic-Industrial Collaborations for Recommender Systems.  As we're fast running out of space in our London office, we rented a nearby venue named Headrooms.  With friendly staff attentive to everyone's needs and great start-up décor, I'd definitely rent it again.  In the morning and early afternoon we were treated to seven talks from a variety of speakers who shared their experiences of academic-industrial collaborations and recommender systems.  We finished the afternoon by splitting into smaller groups to discuss the challenges involved in making such collaborations a success and to share useful advice with one another.  The day then finished, as all good days do, with a quick trip to the funkily named Giant Robot to taste some of their excellent cocktails.

Presentations

Seven presentations were delivered by our eight speakers, one of them being an entertaining double act.  We tried to film as much of the event as we could so that we could share the talks with you afterwards.

First off, Jagadeesh Gorla began with a presentation entitled A Bi-directional Unified Model.  Jagadeesh talked about the results presented in his WWW 2013 paper on group recommendations via Information Matching, a new probabilistic model based on ideas from the field of Information Retrieval, which learns probabilities expressing the match between arbitrary user and item features: this makes it both flexible and powerful.  He is currently working on developing an online implementation for deployment in an online gaming platform.

Our double act, Nikos Manouselis and Christoph Trattner then followed with the intriguingly entitled presentation Je T’aime… Moi Non Plus: Reporting on the Opportunities, Expectations and Challenges of a Real Academic-Industrial Collaboration.  They gave an honest and candid reflection of their expectations for working together and how some of their past experiences in other collaborations weren’t as successful as hoped.  It was great material that fed into the discussions later in the day.

Heimo Gursch then gave some Thoughts on Access Control in Enterprise Recommender Systems.  While his project is still in the early stages, he had quite a few experiences that he could share from working with industry partners from the perspective of an academic.  He was working on designing a system that would allow employees in a company to effectively share their access control rights with one another rather than relying on a top-down authority to provide them.  It's also the first time that I've seen a presenter give his car keys to a member of the audience.  I do hope that he got them back.

Maciej Dabrowski delivered an exciting presentation Towards Near Real-time Social Recommendations in an Enterprise.  He and his team have been working on a cross-domain recommendation system that works in a federated manner.  It exploits semantic data from linked data repositories to generate recommendations that span multiple domains.

Mark Levy, from our team here at Mendeley, then presented some of the work that he has been doing in a talk entitled Item Similarity Revisited.  The presentation was filled with useful advice from an industrial perspective on what makes a good recommender system.  He also explored the idea that simple algorithms may be more useful than complex ones in an industry setting, showing some impressive results to back it up.  I'll include Mark's presentation after his paper has been published.

Benjamin Habegger then took us on a rollercoaster ride exploring some of his successes and failures in his last startup, 109Lab: Feedback from a Start-up Experience in Collaboration with Academia.  He reflected on his experience of co-founding a startup and what he learned from it.  Although he worked with academia during the process, he wasn’t clear about the value that it actually brought.

Finally, Thomas Stone presented Venture Rounds, NDAs and Toolkits – Experiences in Applying Recommender Systems to Venture Finance.  Thomas had some nightmare experiences with NDAs during his PhD.  So much so, that he’s still unclear what he has the right to publish in his thesis.  He also gave a nice introduction to PredictionIO, an open source machine learning server.

Discussion Groups

Once the presentations were given, everyone was invited to think about the challenges and difficulties that they had faced in working in academic-industry collaborations and to write down some topics on a flip chart.  We then split into three groups and, using these topics as guidance, discussed the issues faced and presented some solutions.

A number of issues were identified including:

  • Prototypes vs production code – do the partners know what is expected from whom?
  • How to find the right partners
  • Access to data (e.g. NDA issues)
  • Evaluating systems
  • Best practices

After the three groups discussed the points we all gathered back to share our thoughts and conclusions.  In general, we all seemed to share similar problems in making academic-industry collaborations successful.  We agreed that there should always be a clear set of expectations agreed from the outset and that partners should know their roles.  Communication lines should be kept open and the spirit of collaboration encouraged.  What's more, it can help to have members of the teams working together in the same physical location, even if it's just for a brief period, in order to work together well.

Conclusions

Working in academic-industrial collaborations is hugely rewarding but it can be tough.  Finding the right partners who understand each other’s goals and constraints is important from the outset.  We can all learn from one another but we need to put in some effort in order to enjoy the rewards.

I'd like to thank everyone who put in the effort to make the workshop a success and, as I follow up on the several e-mails that I've received, I hope to start new and fruitful collaborations!


Workshop on Academic-Industrial Collaborations for Recommender Systems

Workshop Date: 10 July 2013

Submission Date: 1st July, 2013

Location: Headrooms, 1-2 St John’s Path, London, EC1M 4DD

What is this Workshop?

Recommender systems are fast becoming as standard a tool as search engines, helping users to discover content that interests them with very little effort.  Their commercial popularity has made them as hot a topic in industry as they are in academia.  With this shared interest, conferences like ACM’s Recommender Systems often attract as many academics as industry practitioners.  This workshop aims to bring local academics and industry practitioners together to share experiences and make new contacts.

This workshop provides a forum for researchers and developers alike to showcase their recommender system work and explore possibilities for new collaborations.  We’re interested in fostering new links between academia and industry, providing academics with real-world environments in which to solve problems and industrial partners with access to some of the local area’s top academic researchers in the field.  We’re also interested in hearing from researchers and developers who are currently or were previously involved in such collaborations, learning from what went well and not so well in their experiences.

Call for Presentations/Demonstrations

Researchers and developers from academia and industry are invited to give presentations in our 1-day workshop.  We expect each presenter will have a total of 20 minutes to present and/or demonstrate their recommender system work which will be followed by 10 minutes of discussion.  Topics of interest include, but are not limited to:

  • Academics' experiences of working with industry
  • Industry practitioners' experiences of working with academics
  • Experiences of working with recommender system software libraries
  • Experiences of testing recommenders offline
  • Experiences of testing recommenders online
  • Experiences of sharing data between academia and industry

If you would like to present, please contact kris.jack@mendeley.com with a short description of your presentation in no more than 300 words.  Also, please let us know if you are already involved in any academic-industry partnerships.  If so, what were the main challenges that you had in working together and how, if possible, did you overcome them?  If not, are you seeking some partners?  What would you like to get from the partnerships?  This will help us to identify useful topics for our round table discussion.

Important Dates

Presentation Abstract Deadline: 1st July, 2013

Notification of Acceptance Deadline: 3rd July, 2013

Workshop Date: 10th July, 2013

Please let us know (kris.jack@mendeley.com) if you will attend the workshop whether you intend to give a presentation or not.

On receiving your abstract, we will get back to you within days as to whether we think that your presentation would be interesting for the workshop audience.  So if you submit before 1st July, you can expect to receive a reply sooner than the notification deadline too.

Who is expected to attend?

Researchers and developers who work with recommender system technologies.  We invite researchers who are working on novel solutions to some of the most challenging problems faced in recommender system research and industry practitioners who grapple with the challenges of developing recommender systems into products for real people to use.  By bringing researchers and industry practitioners together for a day we hope that everyone will benefit from making new contacts in an open and friendly environment.  We particularly encourage both researchers and industrial partners who have had previous experiences of academia-industry collaborations.  As it is a one day event, we expect the audience to be quite local, made up mainly of participants from London and the UK although all are welcome.

Workshop Programme

9:00 Breakfast and coffee

9:20 Open workshop and welcome

9:30 Jagadeesh Gorla, A Bi-directional Unified Model

10:00 Break

10:20 Nikos Manouselis & Christoph Trattner, Je t’aime… moi non plus: reporting on the opportunities, expectations and challenges of a real academic-industrial collaboration

10:50 Heimo Gursch, Thoughts on Access Control in Enterprise Recommender Systems

11:20 Break

11:50 Maciej Dabrowski, Towards near real-time social recommendations in an enterprise

12:20 Mark Levy, Item Similarity Revisited

12:50 Lunch

14:05 Benjamin Habegger, 109Lab: feedback from a start-up experience in collaboration with academia

14:35 Thomas Stone, Venture Rounds, NDAs and Toolkits – experiences in applying recommender systems to venture finance

15:05 Break

15:25 Round table discussion on academic-industry collaborations

16:25 Break

16:45 Exploring possible collaborations

17:30 Drinks for all at Giant Robot!

Note: Presentation slots have 20 minutes for presentations followed by 10 minutes of discussion.

Accepted Presentation Abstracts

Maciej Dabrowski, Towards near real-time social recommendations in an enterprise

http://advansse.deri.ie/

This work proposes a new perspective on near real-time social recommendations in enterprise social platforms based on the challenges we identified in a comprehensive survey of existing approaches as well as the enterprise social platforms. We explore how Linked Data techniques associated primarily with the Semantic Web can be applied in a Corporate Enterprise context to provide near real-time social recommendations. We draw from research and experimental experience along with practical experiences based on the Webex Social platform, which is modeled on three elements: people, communities, and information. We evaluate a number of potential recommendation and notification techniques before proposing a novel method for generating recommendations based on the concept of Spreading Activation in an enterprise social graph. Our approach considers multiple information sources and is based on an experimental demonstrator using data from DBPedia and other sources.  The proposed method allows for dynamic inclusion of additional information sources with very low effort, significantly improving recommender flexibility. Furthermore, we define a system and architecture capable of making recommendations in near real-time, whereas the majority of existing recommendation systems focus on pre-computing results. This is critical in enterprise environments, where informational events, and more importantly their context and meaning, are time critical. The architecture that we propose may be deployed in a standalone mode or as an integral part of an enterprise social software platform. We highlight how our approach improves the management of knowledge and communication within an organization through more effective exploitation of information available in the corporate social web.

In addition to the contribution of this architecture and system, which contains a novel approach for enterprise social recommendations, we also propose extensions to certain IETF and W3C standards, namely the eXtensible Messaging and Presence Protocol (XMPP) and the SPARQL Protocol and RDF Query Language (SPARQL) respectively.

Jagadeesh Gorla, A Bi-directional Unified Model

The underlying problem of recommender systems is to present a user with a set of relevant items (products, ads, people, research papers, etc.). The simplest way to look at it is as the problem of estimating the probability of relevance between user-item pairs and ranking them based on that probability. Even though it seems simple, it poses a difficult challenge on how to utilise the available information about the users, items, and the interactions between various user-item pairs in estimating the relevance. A simple example: in collaborative filtering we only have information on the user-item interactions (clicks, ratings, etc.). Sometimes, we may only have user and product descriptions but not the interactions (e.g., age, product_type). We may have both (e.g., Xbox Live!). And it is desirable to use the available information in estimating the relevance.

The common approach to solve the problem is based on Matrix Factorisation (MF). MF assumes a common feature (factor) space between users and items, and these factors are estimated based on the interactions between user-item pairs. In MF, there is no straightforward approach to incorporating native features of users and products (e.g., age, location) unless they are modelled as users/items.

I will present a unified model that (a) does not compute explicit similarity (as in K-nearest neighbours), (b) does not assume a common feature space (as in MF) and (c) models users/items with interpretable features (as opposed to hidden ones), while using all the available information. It can be used for building large-scale personalised systems for recommendation & search.

Benjamin Habegger, 109Lab: feedback from a start-up experience in collaboration with academia

In this presentation I will talk about my experience as CTO of 109Lab, my first start-up, launched with two associates, Laurent Assouad and Bastien Marot. The goal of 109Lab was providing a “Big Memory” service helping people sort out, share and print the best of their digital memories (pictures, videos, notes, etc.).  Our target business model was primarily based on selling photo-products based on these memories. Since the beginning of the project, we had the will to work in connection with scientific research, in particular due to my background as a researcher. Therefore, we directly started off with a collaboration with researchers of the LIRIS Laboratory at the INSA of Lyon. The work we led with the LIRIS studied how semantically enriching digital memories with contextual information (e.g. exif, spatio-temporal meta-data) could help sorting, organizing and creating stories with these memories. Among this work, we investigated how we could use user-action logs to recommend future actions based on what other users had done in the past in a similar context. To help the user sort his memories, clustering and classification techniques were envisaged. After a two-year experience, launching two services Blaboum.com and OhMyToast.com, working on a seemingly hot topic for many people, gaining quite some notoriety within both the research and entrepreneurship communities in Lyon, we decided to stop for different, likely entangled, reasons. Far from a real failure, this experience has been very rich in lessons on which I am now building. This talk will be about this experience, the difficulties in making the mixture take, my view of the possible reasons it did not take for us, feedback on mixing academia with a start-up project but mostly why it was really worth trying and why I am likely to do so again!!

Heimo Gursch, Thoughts on Access Control in Enterprise Recommender Systems

This talk presents a project that the Know-Center is working on with the Virtual Vehicle and four large German automotive companies.  This project is set in the engineering domain of these automotive companies. The current information-gathering process is unsatisfying for their employees. Due to the diverse and heterogeneous nature of their data services, it is hard for them to find what they need. Strict company policies controlling access to the data make the situation even worse. Currently available enterprise search solutions are not satisfactory, since they are hardly used when introduced in the companies.

The proposed solution consists of a deep integration of recommender systems and the access control scheme that is applied. This presentation focuses on possible solutions to their problems and why information technology alone cannot solve all the issues. Starting with an overview of the current situation at our partners', the presentation will continue with aspects of access control as well as the recommender system.

Nikos Manouselis & Christoph Trattner, Je t’aime… moi non plus: reporting on the opportunities, expectations and challenges of a real academic-industrial collaboration

This presentation will be a live exchange of ideas & arguments between a representative of a start-up working on agricultural information management and discovery, and a representative of academia who has recently completed his PhD and is now leading a young and promising research team.

The two presenters will focus on the case of a recommendation service that is going to be part of a web portal for organic agriculture researchers and educators (called Organic.Edunet), which will help users find relevant educational material and bibliography. They currently develop this as part of an EU-funded initiative but would both be interested to find a way to further sustain this work: the start-up by including this in the bundle of services that it offers to the users of its information discovery packages, and the research team by attracting more funding to further explore recommendation technologies.

The start-up representative will describe his ongoing, helpless and aimless efforts to include a research activity on recommender systems within the R&D strategy of the company, for the sake of the good-old-PhD-times. And will explain why this failed.

The academia representative will describe the great things that his research can do to boost the performance of recommendation services in such portals. And why this does-not-work-yet-operationally because he cannot find real usage data that can prove his amazing algorithm outside what can be proven in offline lab experiments using datasets from other domains (like MovieLens and CiteULike).

Both will explain how they started working together in order to design, experimentally test, and deploy the Organic.Edunet recommendation service. And will describe their expectations from this academic-industry collaboration. Then, they will reflect on the challenges they see in such partnerships and how (if) they plan to overcome them.

Thomas Stone, Venture Rounds, NDAs and Toolkits – experiences in applying recommender systems to venture finance

Academic-Industrial Collaborations – I am undertaking research with a venture capital firm called Correlation Ventures and applying information retrieval techniques to areas such as industry classification, peer/competitor identification and matching investors with private companies. Due to the nature of the data I am working with, there have been extended delays in terms of getting agreements (NDAs) and receiving datasets (anonymizing, 3rd party consent). Previously I had faced several challenges (availability, privacy, IP issues) in finding suitable industry partners and have a handful of experiences to share.

Tools for Researchers – I have had experience working with several different toolkits and libraries for applying recommender systems and machine learning (MyMediaLite, LibFM, RapidMiner, Python (SciPy, NumPy, scikit-learn), SVDFeature). I am also now involved with an open-source project, PredictionIO, targeted at developers who want to build scalable predictive features in their own applications (content discovery, recommendation, personalization). I would be happy to share my experiences, both positive and negative, of using these different tools as a research student.

Mark Levy, Item Similarity Revisited

The announcement of the Netflix Prize back in 2006 saw the start of a rush to develop novel methods for preference prediction based on collaborative filtering of ratings, a theme which continues to be pursued to this day in academic circles. The impact of rating prediction methods on industry, however, is unclear. Netflix themselves commented on their tech blog in 2012 that rating predictions formed only a single (and relatively uninfluential) input feature to the ranking models which they actually use to generate recommendations. Meanwhile several other industry players, particularly those whose datasets contain only implicit feedback and not ratings, are known still to use simple item similarity methods as the basis of their recommender systems.

Item similarity methods offer fast computation at recommendation time, and natural explanations of why particular items are being recommended, but they have not been a focus of academic research, except as benchmarks which can apparently be easily beaten by more complex algorithms, perhaps because item similarity tends to give high quality recommendations only when carefully tuned for a particular dataset. An interesting paper from 2012 bucked the trend by introducing Sparse Linear Methods (SLIM), and showing that they easily outperformed more complex preference prediction models for top-N recommendation, but at rather a high computational cost compared to traditional item similarity methods when applied to large datasets.

In this presentation I show experimental results which suggest that a simple relaxation of the problem constraints solved by SLIM can lead to an item similarity method which outperforms model-based algorithms but at reasonable computational cost. I put this in the context of some reflections on the reality of running large-scale industrial recommender systems based on experience at Last.fm and Mendeley.

Why is Mendeley Hosting this Event?

Mendeley employs recommender systems technologies to help researchers organise their research, collaborate with one another and discover new research.  We have an in-house team of R&D engineers and Data Scientists who collaborate with academics in four European funded international research projects.  One of these projects is the TEAM project (http://team-project.tugraz.at/), a research project funded by the European Commission in FP7 in the context of the Industry-Academia Partnerships and Pathways (IAPP) Action within the Marie Curie Programme.  The TEAM project is sponsoring this event, encouraging collaborations to form between academia and industry.

Where is the Event?

The event will take place in Headrooms, 1-2 St John’s Path, London, EC1M 4DD (map).

If you have trouble finding the venue, from Britton Street, look for the Jerusalem Tavern and you should see the arch that leads to St John’s Path (see pictures below).

[Photos: the Jerusalem Tavern and the arch leading to St John's Path]

Finding Topics in Mendeley’s Research Catalogue, Part 2: Latent Dirichlet Allocation on a Sample of the Catalogue

Introduction

Mendeley’s 80 million+ research catalogue most likely covers all of the topics that have ever been worked on in research.  What are these topics though and how do we go about finding them?  In my previous post, I looked at these topics from the perspective of recognised academic disciplines and found that the catalogue’s main topics are Biological Sciences, Mathematics and Computer and Information Science.  That’s a familiar set of topics for us to see and gives us an idea of the catalogue’s content.  I wonder, however, if there’s a set of topics that more naturally describes the catalogue, without being constrained to fit in these disciplines that we already know.

In this post, I’d like to explore this idea through the application of a machine learning model named Latent Dirichlet Allocation (LDA).  Given a set of articles, LDA attempts to automatically find the topics that run through them.  I’ll start by describing LDA in a little more detail before comparing some of the implementations of this technique that are available to download.  I’ll then show, as a starting point, how to use LDA to find topics in one of the most read fields in Mendeley’s catalogue, Computer and Information Science, leaving finding topics in the entire catalogue to a future post.  Given that we already know the article classifications in sub-disciplines for this field (again, see Part 1 in this series of posts), I’ll compare how similar they are to the topics found by LDA.  Ok, enough of the preliminaries, let’s get started.

What's LDA?

LDA is a model that has often been applied to large collections of articles in order to find the topics that run through them.  It draws on our intuition that articles can cover several different topics to a greater or lesser degree.  Similar to the previous post, where an article can appear in several disciplines/sub-disciplines, LDA assumes that several topics can appear in each article too.  Unlike in the previous post, however, where the topics of academic disciplines and sub-disciplines were predefined by human experts, LDA does not require topics to be defined from the outset.  Instead, the algorithm attempts to find topics that do a good job of accounting for the words in the articles.

So if we don't start out with topics then what do the topics that LDA comes up with look like?  Topics created by LDA are distributions over words.  For example, LDA may come up with two topics when describing Computer and Information Science articles that are associated with software and hardware respectively (Table 1).  It will be able to give a score to each of the words in a topic to describe how important the word is to the topic (note that scores are not illustrated in the table for simplicity).  It can also describe how important each of these two topics is to a given article, based on the words that appear in the article.  LDA therefore generates two outputs: a list of topics that are each word distributions; and the importance of each topic to a list of articles.

Table 1: Example topics in Computer and Information Science
Topic 1 (software) Topic 2 (hardware)
software computer
game processor
code keyboard
programmer printer
program hardware

From a linguistic perspective, it is interesting to look at this kind of computational modelling as a step towards expressing the semantics in an article.  Language employs both syntax (i.e. words and rules) and semantics (i.e. meanings).  When you write an article, you take an idea and express it through the words in a language in an attempt to convey a meaning.  In other words, articles allow for the transmission of semantics through the medium of syntax.  Syntax is all that a computer model has to work with.  The topics that LDA pulls out, however, tend to invoke a coherent meaning when you read them.  It’s almost as if the model is chipping away at the syntax in order to expose the underlying semantics.  I find it pretty neat to think of it like that.

In probability and statistics, LDA is referred to as a generative model.  If you'd like to read more about LDA, the original papers on the topic are a great place to start.
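
For readers who like to see the model written down, here is a compact and slightly simplified statement of LDA's generative story (symmetric priors assumed), where K is the number of topics, D the number of articles, N_d the number of words in article d, alpha and beta are Dirichlet priors, theta_d is the topic distribution for article d and phi_k is the word distribution for topic k:

\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta) && k = 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && d = 1,\dots,D \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) && n = 1,\dots,N_d \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{align*}

In words: each topic draws a distribution over the vocabulary, each article draws a mixture over topics, and each word in an article is generated by first picking a topic z from the article's mixture and then picking a word w from that topic's distribution.  Training inverts this story, inferring the theta and phi distributions that best explain the observed words.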

LDA Libraries

There’s no shortage of LDA implementations out there.  Since we’re looking at modelling the topics in a very large corpus of 80 million+ articles, I’m ultimately interested in those that scale up well.  At Mendeley we have an in-house Hadoop cluster and prefer to work with code that scales horizontally rather than vertically.  I know of two LDA implementations that are written in MapReduce and run on Hadoop.  The first is implemented in Mahout, Apache’s large scale machine learning library.  The second comes out of the Cloud Computing Research Team based in the University of Maryland and is named Mr. LDA.  I tried both and found that, despite Mr. LDA’s cringeworthy name, it’s much faster than Mahout’s implementation so decided to go with that one.

Data Selection and Preparation

I'm interested in finding out which topics are in Mendeley's catalogue.  I'll leave answering that question to the next post, however, and first concentrate on getting some initial results from LDA on a sample set of data.  Since I'd like to get results out quickly, I'm going to limit my first corpus to 100,000 articles from a single discipline in the catalogue.  Here's how I selected them.

Out of the 80 million+ articles in the catalogue, I kept all those from the field of Computer and Information Science, leaving around six million.  I chose this field since it's one of the larger ones in the catalogue and I feel quite comfortable judging the quality of the topics that are found.  I then took articles with abstracts that contain between 100 and 150 words and randomly selected 100,000 of them.  Although 100,000 articles may seem like a lot, Mr. LDA will make quick work of them, allowing me to test lots of parameters, learn the tool's behaviour and get results out quickly.

Now that we have some articles, it's time to preprocess them.  As is common in many natural language processing tasks, I make use of Lucene's standard tokeniser to break down the abstracts into tokens (i.e. words), remove words that contain non-alphanumeric characters and filter out words that are very common in English (using CLEF's stop list).  This reduces the words that appear in the articles, speeding up the overall job without losing useful information for building topics.  I then use Lucene to stem the words so that their different forms are normalised (e.g. perform, performs and performing all become perform).  All results presented use this corpus of 100,000 preprocessed articles.
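
As a rough illustration of this preprocessing step, here is a sketch in Java (assuming a 2013-era Lucene 4.x release; the exact Version constant and class names vary between releases, and the class and method names below are just for illustration).  For brevity it uses Lucene's bundled English analyzer, which chains the standard tokeniser, lowercasing, a default English stop list and Porter stemming, rather than the CLEF stop list mentioned above:

// Rough sketch only: tokenise, stop-filter and stem an abstract with Lucene 4.x.
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AbstractPreprocessor {
    public static List<String> tokenise(String abstractText) throws Exception {
        List<String> terms = new ArrayList<String>();
        Analyzer analyzer = new EnglishAnalyzer(Version.LUCENE_43);
        TokenStream stream = analyzer.tokenStream("abstract", new StringReader(abstractText));
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            String token = term.toString();
            // Drop tokens containing non-alphanumeric characters; the rest are
            // already lower-cased, stop-filtered and stemmed by the analyzer.
            if (token.matches("[a-z0-9]+")) {
                terms.add(token);
            }
        }
        stream.end();
        stream.close();
        analyzer.close();
        return terms;
    }
}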

Running LDA

There are a few steps in running Mr. LDA.  First off, we need to parse our articles to prepare them for the modeller.  Then we hold some of them back, say 10%, so that we have something with which to evaluate the model once it's trained.  Then we train some models on the remaining 90% of articles, selecting a different number of topics for each one.  Finally, we test how good each of our models is, using the 10% of held-back articles, to see how many topics we should consider.  Let's go through those steps in more detail.

The first step is to parse the articles.  Mr. LDA encourages us to think about a couple of important points at this stage.  Every word that appears in an article can potentially appear in a topic.  We don't need to include words that are very common in our articles though.  These words will give the model more work to do and will not really help it to make useful topics.  Similarly, we should also think about removing words that appear too rarely.  If a word only appears once then it's not going to be very useful for making topics.  In this setup, I looked at the frequency distribution of the words in my 100,000 articles.  In total, there were 109,712 unique words.  From looking at the words sorted by frequency, I decided to remove words that appeared in more than 10% or in less than 1% of the articles.  74 unique words appeared in too many articles and 97,484 unique words appeared in too few articles.  This leaves 12,228 unique words.
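
The document-frequency pruning itself is simple to implement.  Here is a small, self-contained sketch (a hypothetical helper, not part of Mr. LDA) that counts how many articles each word appears in and keeps only the words falling between the two thresholds:

// Hypothetical helper: keep a word only if it appears in between minFraction
// and maxFraction of the articles (e.g. 0.01 and 0.10 in this post).
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class VocabularyPruner {
    public static Set<String> prune(List<Set<String>> articleWordSets,
                                    double minFraction, double maxFraction) {
        // Document frequency: in how many articles does each word appear?
        Map<String, Integer> docFreq = new HashMap<String, Integer>();
        for (Set<String> words : articleWordSets) {
            for (String word : words) {
                Integer count = docFreq.get(word);
                docFreq.put(word, count == null ? 1 : count + 1);
            }
        }
        int numArticles = articleWordSets.size();
        Set<String> kept = new HashSet<String>();
        for (Map.Entry<String, Integer> entry : docFreq.entrySet()) {
            double fraction = entry.getValue() / (double) numArticles;
            if (fraction >= minFraction && fraction <= maxFraction) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }
}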

Next we split the articles into a 90% (i.e. 90,000) training set and a 10% (i.e. 10,000) testing set.  It’s good practice to split a corpus into a training set and a testing set.  Once a model learns topics for the set of training articles, it can then be tested on the testing set to see how well it accounts for them.  Since it has not been exposed to the testing set during training, this gives us an idea of how generalisable the learned model is to new articles.

The next step is to start training some models.  Mr. LDA, as with several implementations of LDA, requires that you provide the number of topics that are to be learned from the outset of training.  Choosing the correct number of topics is somewhat based on experience so don't be afraid to experiment.  Since you have training and testing data sets you'll be able to validate just how well a particular number of topics generalises to new data.  To kick things off, I first decided to run training with five topics, increasing it to 10, 15, 20, 25, 30, 40, 50 and then 100.  I ran each of the training sessions for 40 iterations, which tends to give reasonably good results for Mr. LDA's brand of LDA, variational approximation for Bayesian inference.

Mr. LDA allows you to take a model that has previously been learned and to test how well it generalises to another set of articles.  Once each model was trained, it was tested.  The output of testing a model on a set of articles is expressed as a log likelihood value.  The higher the value, the better the model's likelihood of describing the data.  Don't worry too much about the actual value produced.  Instead, compare the values produced across the different models that you have tested to find out which one is the best fit for the articles.  You will probably find different likelihood values for models that are trained to learn a different number of topics.  This will help you decide how many topics are in your articles.

The log likelihood values produced by testing the nine learned models show that likelihood changes with the number of topics (Figure 1).  We can see that the peak appears between 20 and 25 topics, with 25 topics being slightly more likely.

Figure 1: Likelihood for 5, 10, 15, 20, 25, 30, 40, 50 and 100 topics.

Topics

The 25 topics that were learned by Mr. LDA are summarised in Tables 2-6.  Note that the words appear in their stemmed forms due to the stemming performed in the preprocessing steps.  For example, the word increase appears in the tables as increas, indicating that it has been stemmed.  The top 15 words from each topic are shown, ordered by probability, with the most probable appearing higher up in the list.  The implementation of LDA used did not give the topics names so I tried to add some myself, shown in brackets after the topic number.  When I couldn’t think of an appropriate name I added a question mark.  I couldn’t think of names for 6 out of 25 of the topics.

Table 2: Topics 1-5
Topic 1 (?) Topic 2 (internet) Topic 3 (programming & hardware) Topic 4 (image processing) Topic 5 (security)
organ file program imag signatur
loss tcp code segment vessel
weld increas parallel signal malwar
site laser memori filter ultrasound
crack determin game detect botnet
determin segment execut video arteri
wavelength clock hardwar region fractur
resid surfac fault nois detect
forens smooth power code lip
panel sensor architectur estim flow
form target processor color beat
growth shadow run frequenc coronari
element remov dynam channel materi
optic charact circuit error blood
observ coat reduc surfac deform
Table 3: Topics 6-10
Topic 6 (genetics) Topic 7 (e-publishing & e-commerce) Topic 8 (software engineering) Topic 9 (?) Topic 10 (genetics)
sequenc blog product watermark gene
gene blogger architectur ambient protein
protein vein agent auction plant
alloi rate integr fire sequenc
laser asl project nest immun
express increas busi forest dna
weld bicycl framework embed express
dna ebxml tool uma speci
surfac lesion resourc bid genom
acid sign qualiti sistema yield
increas lp engin ma content
amino cosmic compon handwrit grain
radix investig cost información biolog
mutat clerk context crash genet
dental dermoscopi plan digit clone
Table 4: Topics 11-15
Topic 11 (e-learning/e-education) Topic 12 (HCI) Topic 13 (networks) Topic 14 (?) Topic 15 (?)
learn interact node group weld
social behavior mobil collect stress
knowledg visual sensor concentr crack
student human wireless conduct temperatur
practic interfac protocol garbag materi
librari event rout found metal
articl speech traffic water laser
educ virtual scheme correl mechan
univers physic energi combin measur
scienc task layer forecast surfac
review activ packet energi fractur
author displai delai measur element
year subject access chines forc
public cognit link translat load
digit measur power polici field
Table 5: Topics 16-20
Topic 16 (security and privacy) Topic 17 (health care) Topic 18 (body and health care) Topic 19 (robotics) Topic 20 (information retrieval & semantic web)
secur patient ecg robot web
attack health weld track languag
trust medic arsen recognit semant
protocol care group human document
protect clinic heart motion queri
privaci treatment laser face search
kei diseas bone learn retriev
authent ag signal detect databas
detect cancer increas posit ontolog
scheme diagnosi breath visual text
access children dev sensor content
polici hospit seizur map knowledg
vulner year ventricular camera word
intrus breast cardiac vision domain
encrypt group atr locat relat
Table 6: Topics 21-25
Topic 21 (bio + chem + phys) Topic 22 (algorithms) Topic 23 (?) Topic 24 (?) Topic 25 (machine learning)
cell optim color burst learn
activ estim gamut bond predict
temperatur space guitar menu classif
water graph weld song pattern
mass paramet rate mutat train
star solut amp mutant classifi
concentr point increas music class
acid measur concentr observ rule
protein approxim low ob neural
magnet search activ myocardi select
compound linear ey optic fuzzi
increas obtain condit region cluster
observ size metal content accuraci
chemic local quantum loci machin
alpha constraint rat coronari decis

Comparing LDA’s Topics with Academic Disciplines

LDA generated 25 topics describing the 90,000 Computer and Information Science articles.  How do these topics compare with the human-created discipline classification for the same field?  Taking each of LDA's topics, I compared them one by one with the academic disciplines, attempting to find similarities and differences between their meanings.  In some cases, topics matched up to each other pretty well, having roughly equivalent scopes; in others they shared a subset-superset relationship (i.e. one topic was more focussed than the other); while in other cases there was no obvious overlap (Figure 3).  Note that during the discussion, I'll use capital letters when referring to discipline topics and lower case letters for LDA-derived topics.

Figure 3: Comparison of sub-disciplines and topics derived from LDA for the field of Computer Science

12 out of 21 academic topics have good matches in terms of meaning with 13 of the 25 LDA topics.  Four topics from each set match very well with one another, having similar meanings and scope.  These topics are computer security, networks, human-computer interaction and software engineering.  It's good to see such nice matches between the two topic groupings.  It gives us confidence that LDA is working well and encourages us to explore all the topics that it has generated in more detail.

Algorithms, Artificial Intelligence, E-commerce, E-publishing, Graphics, Information Retrieval, Neural Networks and Programming Languages all have LDA topics that match up to them reasonably well, but with scopes that match less well than the previous four topics.  For example, the academic disciplines of Artificial Intelligence and Neural Networks match most closely to the LDA topics of robotics, image processing and machine learning.  Artificial Intelligence normally covers robotics, image processing and machine learning.  It makes me question whether Neural Networks and Artificial Intelligence should appear at the same level in the discipline classification.  I tend to think of Neural Networks as a subset of Artificial Intelligence rather than living at the same level.  Also, the LDA results make me wonder if Artificial Intelligence should be represented as the specialisations of robotics, image processing and machine learning.  If these are the three topics of Artificial Intelligence that are most published then perhaps our classification system should reflect it.

As a second example of topics with slightly different scopes, I now turn to Information Retrieval and information retrieval & semantic web.  Clearly, the LDA topic is a superset of the discipline.  It could be that the semantic web fits naturally in the Information Retrieval discipline and that I made it explicit when naming the LDA topic.  By representing these two subjects in the same topic, LDA shows that they are similar to one another.  This is intuitively appealing as much research on the semantic web stems from work in information retrieval.

Not all topics match up with one another.  LDA found two topics that refer to genetics, a subject that doesn't appear in the academic discipline classification.  Undoubtedly, there is much collaboration between computer science and genetics researchers, creating quite a vibrant field of cross-disciplinary research.  It's useful to see this reflected in LDA's topics.  Similarly, there are topics of health care and body & health care that also highlight cross-disciplinary research between computer science and health care researchers.  There is also a topic of bio & chem & phys.  This topic was difficult to name since the words in it are so mixed.  It's difficult to say much more about it other than that, since it originates from a corpus of articles in the discipline of Computer and Information Science, it highlights how much cross-disciplinary research is conducted between these fields.

There is one more topic that LDA generated that I think deserves some attention.  The topic of e-learning/e-education has attracted increasing attention from researchers as of late but doesn’t appear in the academic discipline classification.  I think that this is a good example of classification schemes needing to be updated.  The academic classification scheme, as good as it is, will inevitably have to be updated in order to reflect the changing topics in research.  LDA, being supplied not only with articles from decades gone by but also with articles that were just published in the last month, is able to come up with topics that are up-to-date.  We should use the LDA generated topics to help inform the process of updating the academic discipline classification and make sure that it provides useful and relevant categories for modern researchers.

Conclusions

What are the topics in Mendeley’s catalogue?  We took a step closer to answering this question by examining a selection of Computer and Information Science articles.  We trained a machine learning algorithm named Latent Dirichlet Allocation on a corpus of 90,000 articles and discovered 25 topics.  Studying those topics suggested some overlap with an established discipline classification which gives us enough confidence that LDA was telling us something useful.  Furthermore, it presented a number of topics that, while well known to researchers, do not appear in the academic discipline classification, thus suggesting that the classification system could benefit from being updated.

In future posts, I’d like to look at what Latent Dirichlet Allocation can tell us about other disciplines in Mendeley’s catalogue and eventually look at the entire catalogue as a whole.  The techniques and library used here all scale up well to very large collections of data.  I’ll also consider a set of tools that can be built using  LDA’s topics to further help researchers with their work.

Finding Topics in Mendeley’s Research Catalogue, Part 1: Crowdsourcing

Introduction

Mendeley has crowdsourced a massive catalogue that you can explore to find the most interesting research from all over the world.  The number of research articles in the catalogue is truly staggering.  Over two million users have added over 350 million articles to Mendeley which, after we de-duplicate them, leaves us with over 80 million unique articles (Figure 1).  That's 80 million articles on pretty much every research topic that has ever been published.  So what are those topics?

Figure 1: Number of documents, members and research groups in Mendeley (http://www.mendeley.com/) on 21 March 2013.

In a series of posts, beginning with this one, I'd like to explore that question.  In this post, I'll look at the topics in Mendeley's research catalogue with respect to their academic disciplines.  These disciplines are used to classify research into high-level topics such as Biological Sciences, Economics and Computer and Information Sciences and, for each discipline, into lower-level sub-disciplines such as Artificial Intelligence, Information Retrieval and Database Systems for Computer and Information Sciences.  Based on these disciplines, let's ask which topics are in Mendeley's research catalogue.  First, I'll explain how articles are assigned to disciplines, then show how they are proportioned in terms of main and sub-disciplines.

Assignment of Articles to Disciplines

Articles are classified into 25 main disciplines, each of which has a number of sub-disciplines.  The classification was created by human experts and gives a good picture of which fields exist in research.  When a researcher signs up for a Mendeley account, they identify which research discipline they work in using this classification scheme.  They can also indicate which sub-disciplines they work in.

Articles that are added to Mendeley are assigned disciplines based on the disciplines of the researchers who added them.  For example, the paper MapReduce: Simplified Data Processing on Large Clusters has been added by 2,008 different researchers to Mendeley (Figure 2).  78% of those readers identify themselves as working in Computer and Information Sciences, while only 6% work in Biological Sciences and 3% in Engineering.  Such a crowdsourced approach to article classification is interesting as it allows us to see which disciplines of researchers are interested in which articles and allows articles to appear in more than a single discipline.

Figure 2: Readership statistics for MapReduce: Simplified Data processing on Large Clusters.
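
To make the crowdsourced assignment concrete, here is a tiny illustrative sketch (a hypothetical helper, not Mendeley's actual code) that turns the disciplines reported by an article's readers into the kind of percentages shown in Figure 2:

// Hypothetical sketch: derive an article's discipline breakdown from the
// self-reported disciplines of the readers who added it to their libraries.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DisciplineBreakdown {
    public static Map<String, Double> percentages(List<String> readerDisciplines) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String discipline : readerDisciplines) {
            Integer count = counts.get(discipline);
            counts.put(discipline, count == null ? 1 : count + 1);
        }
        Map<String, Double> result = new HashMap<String, Double>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            // e.g. "Computer and Information Sciences" -> 78.0
            result.put(entry.getKey(), 100.0 * entry.getValue() / readerDisciplines.size());
        }
        return result;
    }
}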

Main Disciplines as Topics

Let’s first try to answer the question of which topics are in the catalogue from the perspective of main (i.e. high level) academic disciplines.  You can explore Mendeley’s research catalogue by browsing through its papers based on research disciplines (Figure 3).

Figure 3: Mendeley's research catalogue (http://www.mendeley.com/research-papers/) with academic disciplines circled.

From a research discipline perspective Mendeley's catalogue contains 25 topics (Figure 4).  The most frequently appearing topic is Biological Sciences which accounts for 22.89% of the catalogue's articles, followed by Mathematics with 15.45% and Computer and Information Science with 7.54%.  Social Sciences, Environmental Sciences, Arts and Literature, Law, Sports and Recreation and Astronomy/Astrophysics/Space Science each account for less than 1% of articles in the catalogue.

Figure 4: Articles classified by main discipline.

Sub-disciplines as Topics

When you browse Mendeley’s catalogue by main discipline, you are presented with a list of papers that appear in that discipline and a further set of sub-disciplines that you can drill down into.  For example, if you choose to browse papers that appear in Computer and Information Sciences then you’ll be presented with a list of 21 sub-disciplines (Figure 5).  By clicking on a sub-discipline you’ll be presented with papers that appear in that sub-discipline.

Figure 5: Mendeley’s research catalogue for papers in Computer and Information Sciences (http://www.mendeley.com/disciplines/computer-and-information-science/) with sub-disciplines circled.

Sub-disciplines give a more detailed classification of the research work that is conducted in a given discipline.  In the case of Computer and Information Sciences, there are 21 sub-disciplines in Mendeley (Figure 6).  The sub-disciplines that have the most articles in them are Artificial Intelligence with 16.53%, Human-Computer Interaction with 12.37% and Information Science with 12.05%.  The topics of Real-Time Systems, Operating Systems, Information Storage, Electronic Commerce, Design Automation and Electronic Publishing each make up 1% or less of the papers in Computer and Information Sciences.

Figure 6: Articles classified by sub-discipline.

For more detailed statistics on Mendeley's data, please see Mendeley's Global Research Report 2012.

Conclusion

In this post, I looked at topics in Mendeley's research catalogue from two levels.  First, I looked at the proportion of topics for the whole catalogue with respect to main disciplines and then I looked at the topics in Computer and Information Sciences with respect to sub-disciplines.  As shown, the topics that account for the largest number of articles in the catalogue are Biological Sciences, Mathematics and Computer and Information Science, while within Computer and Information Science, the topics of Artificial Intelligence, Human-Computer Interaction and Information Science are popular.

In the next post, I'm going to step away from using a hand-crafted classification system to describe topics.  I'm going to show what happens when a machine learning algorithm named Latent Dirichlet Allocation is applied to the catalogue.  This algorithm takes the abstracts of articles in the catalogue and tries to discover topics that implicitly run through them.  I expect that the results will be interesting.

Tuning Mahout’s Item-based Recommender

Introduction

Mendeley Suggest (Figure 1) generates personalised research article recommendations for researchers.  It's a pretty awesome feature that I like to think of as a personalised research advisor who's at your beck and call 24/7.  It's powered by Mahout, a machine learning and data mining library that's designed for large-scale data processing.  While Mahout's recommendation algorithm implementation is lean, it can still burn a hole in your pocket when you're generating recommendations for millions of users on a daily basis.  In this post, I look at ways to reduce the cost of generating recommendations without sacrificing their quality, by better understanding the characteristics of the input data.

Figure 1: A screenshot of Mendeley Suggest, a research article recommendation system.

Measuring Recommenders

Mendeley generates article recommendations for all of its users on a weekly basis.  When generating recommendations, I'm primarily interested in two factors: recommendation quality and cost.  Of course, I'm looking for high quality and low cost.  Typically, reducing the cost will also reduce the quality, but I'd like to reduce the cost without reducing the quality.  Let's first look at how quality and cost are measured before putting some numbers on how Mahout's recommender performs.

Recommendation Quality

Assume that we have 100 user libraries.  We can generate recommendations for all 100 users but how can we tell if they are any good?  We could present the recommendations to the users and get them to rate how good they are.  This process, however, is quite time consuming, especially when you have several hundred or even thousands of ways of generating recommendations and you want to know which one works best.

An alternative to asking users is to run cross-validation with a 'leave some out' method.  Split the 100 libraries into a training set and a testing set.  Let's put 90 libraries in the training set and 10 in the testing set.  From the testing set, for each user library, hide some of their articles.  Train the recommender with the 90 training set libraries and the non-hidden articles from the testing set.  Predict recommendations for the users in the testing set and compare them with their hidden articles.  If your recommender manages to recommend any of the articles that were hidden, then it's doing a pretty good job because we know that these articles are of interest to that user, since they are in their library.  We can measure the quality of recommendations with those bread-and-butter staple metrics of information retrieval: precision and recall.  Repeat this procedure several times, say ten, from the start, creating new training and testing sets, so that all user libraries appear in at least one testing set.  This is called ten-fold cross-validation and it lets you run through many recommender configurations very quickly.
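
As a concrete illustration of the evaluation step (a simplified sketch, not our actual evaluation harness; the class and method names are just for illustration), here is how precision and recall at N could be computed for one fold, given each test user's hidden articles and the recommendations generated for them:

// Simplified sketch of hold-out evaluation for one fold: compare the top-N
// recommendations for each test user against the articles hidden from their library.
import java.util.List;
import java.util.Map;
import java.util.Set;

public class HoldOutEvaluation {
    public static double[] precisionRecallAtN(Map<String, List<String>> recommendations,
                                              Map<String, Set<String>> hiddenArticles,
                                              int n) {
        double precisionSum = 0.0;
        double recallSum = 0.0;
        int users = 0;
        for (Map.Entry<String, List<String>> entry : recommendations.entrySet()) {
            Set<String> hidden = hiddenArticles.get(entry.getKey());
            if (hidden == null || hidden.isEmpty()) {
                continue;  // nothing was hidden for this user, so skip them
            }
            List<String> topN = entry.getValue().subList(0, Math.min(n, entry.getValue().size()));
            int hits = 0;
            for (String articleId : topN) {
                if (hidden.contains(articleId)) {
                    hits++;  // a hidden article was successfully recommended
                }
            }
            precisionSum += hits / (double) n;
            recallSum += hits / (double) hidden.size();
            users++;
        }
        if (users == 0) {
            return new double[] { 0.0, 0.0 };
        }
        return new double[] { precisionSum / users, recallSum / users };  // averaged over test users
    }
}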

Cost

Mahout’s distributed recommender is written in MapReduce and designed to run on Hadoop.  The quicker the recommender generates results, the less it costs to run.  Mahout can be run on Amazon’s Elastic Map Reduce (EMR) framework.  When you spin up an EMR cluster of computers, you pay for the number of computers that you rent per hour.  When the MapReduce job completes, the EMR cluster is torn down and the final cost is computed.

Generating Recommendations

Mahout 0.6’s distributed item-based collaborative filtering recommender has been tested on EMR under cross-validation conditions.  It was trained on a large set of 350,000 user libraries, all containing between 2 and 1,000 research articles.  In total, there are over 250 million articles in these libraries, representing around 35 million unique articles.  5,000 user libraries were tested.  On average, it cost $650 to generate recommendations for all users, with an average precision of 0.15 @ 10 (i.e. 1-2 good recommendations out of 10) (Figure 2).

Figure 2: Results from testing the Mahout 0.6’s out of the box distributed item-based recommender.

While this level of precision looks low, it's actually quite competitive.  First, we don't know for sure if a recommended article that does not appear in a user's library would be of interest to that user or not.  That is, a user's library is not a complete set of their interests, but a subset.  With real user testing, we tend to find that acceptance rates are higher than this precision predicts.  Second, this is a difficult task.  There are very few personalised research article recommender systems in the world and only a couple that operate on a large scale: Mendeley Suggest and CiteULike.  This is a competitive rate of precision in this setting.  How about the cost for generating the recommendations though?  When I look at the steps of the MapReduce job in detail, I don't get the impression that it's making effective use of the resources available, meaning that the job takes longer to complete and costs more.

Mapper and Reducer Allocation

The Problem

In order to efficiently generate recommendations, Hadoop distributes Mahout’s MapReduce calculations across a cluster of computers.  Hadoop, however, doesn’t always make the best use of the resources available to it.  When I was running the recommender on EMR, I looked at how the jobs were being distributed across the cluster.  Mahout chains 10 MapReduce jobs together, running each step in the pipeline sequentially.  In one job in the pipeline, hundreds of gigabytes of data were all being poured into a single reducer task, despite there being 40 reducers available across the cluster (Figure 3).  The lone reducer took almost 40 hours to complete its task.  A more efficient use of the cluster would be to distribute the calculation across more reducers.

Figure 3: Reduce Completion Graph shows that 1 reducer has been allocated to the job.  It has progressed through less than 10% of the reduction phase, still copying data that is output from the mappers.

A Solution

Sometimes Hadoop needs a helping hand in allocating mappers and reducers.  The number of reducers allocated to a job can be set quite easily using the “mapred.reduce.tasks” property, and setting it guarantees how many reducers the job will get.

// Explicitly request a fixed number of reduce tasks for this job.
job.getConfiguration().setInt("mapred.reduce.tasks", numReducers);

If the number of mappers being allocated to a job isn’t appropriate, then Hadoop also provides some functionality for tuning it.  You can change the split size that Hadoop applies to input files when it’s allocating splits to mappers.  If, for example, the input is 128MB in size and the split size is 64MB, then the input is split between 2 mappers.  Reducing the split size to 32MB will split the input across 4 mappers.  Unlike reducer allocation, this will not guarantee the number of mappers allocated to a job, since each input file is split separately.  If a job takes 3 input files of 65MB each and the split size is 64MB, then each file is split in two, leading to 6 mappers in total, even though the total input is only 195MB (i.e. 3 x 65MB) and could, in principle, be covered by 4 mappers.

// Cap the input split size (in bytes) so that more mappers are allocated.
job.getConfiguration().set("mapred.max.split.size", String.valueOf(splitSize));

Continuing the previous example of reducer under-allocation, I ran the same job but with 40 reducers allocated this time round (Figure 4).  The overall time for the job to complete dropped from almost 40 hours to around 14 hours.  That’s a time saving of 65% for that single job and it shaves 26 hours off the entire 10-job pipeline.

Figure 4: Reduce Completion Graph shows that 40 reducers are running in parallel.

Speed Up 1

I helped Hadoop to better allocate the number of mappers and reducers at all steps of the 10-job pipeline.  Overall, this had the effect of reducing the cost of generating recommendations by $150 (Figure 5).  That’s almost a quarter of the original cost.  Interestingly, helping Hadoop to make better use of its resources has no impact on the quality of the recommendations generated.  As a result, we can generate recommendations of the same quality at a lower cost.

Figure 5: Comparison of the results from testing Mahout 0.6’s out of the box distributed item-based recommender (Mahout 0.6) and a version with better mapper and reducer allocation (Mahout 0.6′).  The latter version costs $150 less than the out of the box version.

Distribute Data into Even Partitions

The Problem

As previously discussed, we can reduce the overall running time, and therefore the cost, of a job when we distribute its tasks over multiple nodes in the cluster.  When Hadoop allocates multiple reducers to a job, it attempts to partition the reducers’ input data evenly so that each reducer has roughly the same amount of work to do.  This partitioning, however, isn’t always well balanced.  Figure 4 shows that we can distribute a job across multiple reducers, 40 in this case.  Looking at the amount of data being passed to each reducer, however, reveals that the input data is not evenly distributed (Figure 6).  Although the two reducers highlighted have been running for the same amount of time, one is already over 10% complete while the other is just over 1% complete.  The more advanced reducer got to 10% by copying only around 50KB of data, while the other reducer has already copied around 500MB.  Because the data is unevenly distributed, some reducers are given far less work than others, and the overall completion time for the job does not drop as much as it would if the load were shared evenly.

Figure 6: Reduce Completion Table with file counters for two selected reducers.  One reducer has received around 50KB of data while the other has received around 500MB of data.

A Solution

Hadoop allows you to define the way in which it partitions the input data fed into the reducers.  Since my input data isn’t evenly distributed, it’s important to sample it in order to find appropriate partition points.  Hadoop provides an InputSampler that does just that: it samples your data and writes out a set of partition points, which the TotalOrderPartitioner then uses to spread the data more evenly across the reducers.

// Sample the input to find balanced partition points (the sampler's parameters are elided, as in the original).
InputSampler.Sampler<IntWritable, Text> sampler =
    new InputSampler.RandomSampler<IntWritable, Text>(...);
// Write the partition points to a file and tell the job to partition its keys with them.
InputSampler.writePartitionFile(conf, sampler);
conf.setPartitionerClass(TotalOrderPartitioner.class);
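
For completeness, here’s a minimal sketch of what a fully parameterised version might look like with the old mapred API that the snippet above uses.  The sampling frequency, sample count, number of sampled splits and partition file path are illustrative values, not the ones used in this pipeline.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.InputSampler;
import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

public class EvenPartitioningSketch {
  public static void configure(JobConf conf) throws Exception {
    // Sample ~1% of records, up to 10,000 samples, from at most 10 input splits
    // (illustrative sampling parameters).
    InputSampler.Sampler<IntWritable, Text> sampler =
        new InputSampler.RandomSampler<IntWritable, Text>(0.01, 10000, 10);

    // Tell the partitioner where its partition file lives (hypothetical path),
    // then write the sampled partition points to that file.
    TotalOrderPartitioner.setPartitionFile(conf, new Path("/tmp/_partitions.lst"));
    InputSampler.writePartitionFile(conf, sampler);

    // Each reducer now receives a roughly even share of the key space.
    conf.setPartitionerClass(TotalOrderPartitioner.class);
  }
}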

Applying this partitioner to the misbehaving steps yields a dramatic improvement in run-time (Figure 7).  Since the amount of data sent to each of the 40 reducers is now well distributed, they all complete in roughly the same amount of time.  The run-time for the most ill-distributed job drops from around 14 hours to just 2 hours.

Figure 7: Reduce Completion Graph that shows 40 reducers running in parallel.  The reducers are progressing at roughly the same pace as one another.

Speed Up 2

By applying well-distributed partitioners to the steps in the Mahout pipeline, the overall cost of generating recommendations was reduced by a further $270, a saving of over 50% (Figure 8).  Partitioning the data more evenly means that the cluster’s resources are being used more evenly.  Again, while the cost of generating recommendations has been reduced, their quality has not been compromised.

Figure 8: Comparison of the results from testing Mahout 0.6’s out of the box distributed item-based recommender (Mahout 0.6), a version with better mapper and reducer allocation (Mahout 0.6′) and a version with better mapper and reducer allocation plus better data partitioning (Mahout 0.6”).  The third version costs $270 less than the second version.

Conclusion

Mahout’s algorithms are very efficient.  Hadoop, however, sometimes needs a helping hand in making the best use of the resources available to it.  In this post, I considered two ways of reducing the overall cost of generating recommendations on a large scale without reducing the quality of the results.  The first is to set the number of mappers and reducers allocated to jobs, helping them to exploit the full resources of the cluster in which they run.  The second is to partition the input data passed to jobs so that it is evenly distributed across the reducers.  By understanding the shape of your data, you can tune Hadoop to make better use of its resources and significantly reduce the cost of running jobs.  In this case, the overall cost of generating recommendations fell from $650 to $240, a reduction of 63% (Figure 9).

Figure 9: Comparison of the results from testing Mahout 0.6’s out of the box distributed item-based recommender (Mahout 0.6), a version with better mapper and reducer allocation (Mahout 0.6′) and a version with better mapper and reducer allocation plus better data partitioning (Mahout 0.6”).  The third version costs $410 less than the out of the box version.

Under the Bonnet of Mahout’s Item-based Recommender

Introduction

Mahout is an excellent machine learning library built on Hadoop.  It implements a wide range of algorithms designed to run on large-scale data sets.  In this post, I look at Mahout’s distributed recommendation engine in detail.  Although the theory behind collaborative filtering is relatively straightforward to understand, the distributed implementation of the algorithm needs some explanation.  I hope that this will help anyone who is keen to understand how Mahout’s recommender code implements an item-based collaborative filtering algorithm.

Item-based Recommendations

First off, let’s recap the theory behind how an item-based recommender works.  Given a set of items that someone is known to like, recommend a new set of items that are similar to them.  Two items can be regarded as similar to one another if many of the same people like both of them.  In terms of pseudocode:

for every item i that u has no preference for yet
  for every item j that u has a preference for
    compute a similarity s between i and j
    add u‘s preference for j, weighted by s, to a running average
return top items, ranked by weighted average
– Owen, S., Anil, R., Dunning, T., and Friedman, E. 2010. Mahout in Action. Manning Publications.

Sean Owen, the main developer behind Mahout’s recommender, has previously given a very good, high-level description of Mahout’s item-based filtering algorithm.  He shows that an item-based recommendation algorithm can be implemented as a matrix multiplication, where the recommendations for a given user are calculated by multiplying the item-item similarity matrix (i.e. the definition of item similarity) with an item-user vector (i.e. the user’s preferences for the items).  Don’t worry, the mathematics isn’t too tough; it’s just basic matrix multiplication: the recommendation scores R for a user are given by R = S × U, where S is the item-item similarity matrix and U is that user’s item-preference vector.

This multiplication can be broken down in order to satisfy MapReduce’s distributed processing constraints, allowing it to be implemented in Mahout.

Believe it or not, matrix multiplication can generate pretty good recommendations.
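
To make that concrete, here’s a toy, in-memory sketch of the multiplication described above, with a hand-written similarity matrix and preference vector.  None of this is Mahout code; in practice, items that the user already has a preference for are filtered out of the final ranking.

// Toy item-based recommendation via matrix multiplication (not Mahout code).
public class ToyItemBasedRecommender {
  public static void main(String[] args) {
    // similarity[i][j]: how similar item i is to item j;
    // preferences[j]: the user's preference for item j (0.0 where none expressed).
    double[][] similarity = {
        {1.0, 0.6, 0.1},
        {0.6, 1.0, 0.4},
        {0.1, 0.4, 1.0}
    };
    double[] preferences = {1.0, 0.0, 1.0};   // the user likes items 0 and 2

    double[] scores = new double[preferences.length];
    for (int i = 0; i < similarity.length; i++) {
      for (int j = 0; j < preferences.length; j++) {
        scores[i] += similarity[i][j] * preferences[j];   // weighted sum over liked items
      }
    }
    // scores[1] = 0.6*1.0 + 1.0*0.0 + 0.4*1.0 = 1.0, so item 1 is the natural recommendation
    // (items 0 and 2 are already in the user's preferences and would be filtered out).
    System.out.println(java.util.Arrays.toString(scores));
  }
}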

Under Mahout’s Recommender’s Bonnet

Now, let’s delve into Mahout’s code.  The following description is based on Mahout 0.6’s implementation.  Our starting point is org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.  The code is currently structured so that it breaks the recommender into four logical parts:

  1. prepare the preference matrix
  2. generate the similarity matrix
  3. prepare for matrix multiplication
  4. multiply matrices

I’ll follow this structure in describing the code.
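
Before diving into those four parts, it may help to see how the whole pipeline is typically kicked off.  RecommenderJob implements Hadoop’s Tool interface (via Mahout’s AbstractJob), so it can be driven programmatically.  The paths and option values below are illustrative, and the exact set of options can differ between Mahout versions, so treat this as a sketch rather than a definitive invocation.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

public class RunRecommenderJob {
  public static void main(String[] args) throws Exception {
    // Illustrative arguments: the HDFS paths and option values are hypothetical.
    String[] jobArgs = {
        "--input", "/data/user-item-preferences",      // userId,itemId[,preference] lines
        "--output", "/data/recommendations",
        "--booleanData", "true",                       // library data has no ratings
        "--similarityClassname", "SIMILARITY_LOGLIKELIHOOD",
        "--numRecommendations", "10",
        "--tempDir", "/tmp/recommender-work"
    };
    int exitCode = ToolRunner.run(new Configuration(), new RecommenderJob(), jobArgs);
    System.exit(exitCode);
  }
}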

Prepare the preference matrix

The code for the preference matrix preparation is contained in the PreparePreferenceMatrixJob class, which comprises the following three MapReduce jobs.

1. ItemIDIndexMapper and ItemIDIndexReducer – maps item ids in the input data from Long types to Integer types.  This keeps Mahout’s non-distributed recommender, formerly known as Taste, compatible with the distributed recommender being described here.  Of course, mapping Longs to Integers can result in collisions, since Longs are 64-bit while Integers are only 32-bit (http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html).  In my customised version of the recommendation job, I remove this step, along with the later step that maps the Integer values back to Longs.

2. ToItemPrefsMapper and ToUserVectorsReducer – generates the user-item vectors
3. ToItemVectorsMapper and ToItemVectorsReducer – generates the item-user vectors

Generate the Similarity Matrix

Next, the similarity matrix is generated.  This code appears in the RowSimilarityJob, again as a set of three MapReduce jobs.

1. VectorNormMapper and MergeVectorsReducer – reads in the item-user vectors and outputs them as user-item vectors, calculating statistical matrix properties in the process. These statistics are later employed in matrix calculations.

2. CooccurrencesMapper and SimilarityReducer – reads in the user-item vectors from the previous step and creates half of an item-item similarity matrix.  Only half of the matrix is generated at this stage because it is symmetric about its diagonal, so the other half can be inferred.  The matrix also lacks values along its diagonal.

3. UnsymmetrifyMapper and MergeToTopKSimilaritiesReducer – reads half of the item-item similarity matrix and generates an almost complete one, missing only the values along its diagonal.

Prepare for Matrix Multiplication

With both the item-item similarity matrix and the user-item vectors, it’s now possible to multiply them together and generate recommendations for users.  In this part of the code, the matrix multiplication begins, again, by implementing three MapReduce jobs.

1. SimilarityMatrixRowWrapperMapper and default Reducer – completes the item-item similarity matrix by adding in Double.NaN values along its diagonal.  In preparation for the item-item matrix being multiplied with the item-user vectors, it converts the matrix from VectorWritable to VectorOrPrefWritable.

2. UserVectorSplitterMapper and Reducer – reads in user-item vectors and outputs them as item-user vectors, formatted as VectorOrPrefWritable, to make them compatible for multiplication with the item-item matrix.

3. Default Mapper and ToVectorAndPrefReducer – combines the output from the previous two steps, thus starting the multiplication of the item-item similarity matrix with the item-user vectors.

Multiply Matrices

Now that the item-item similarity matrix and the item-user vectors have been combined together into a single source of data, the matrix multiplication can be completed and user recommendations generated.  This process is implemented in a single MapReduce job.

1. PartialMultiplyMapper and AggregateAndRecommendReducer – completes the matrix multiplication, producing item recommendations for users.

Conclusion

The theory behind collaborative filtering is intuitive to grasp.  Producing a distributed implementation of it, however, isn’t so straight-forward.  Thankfully, that’s where libraries like Mahout come in, letting us benefit from work that smart folks have already done.  After you have played with the system for some time though, you’ll perhaps want to get your hands dirty and tweak it to be more suitable for the problem that you’re trying to solve.  I hope that this post can help to serve as an entry point for people who want to understand the implementation in more detail.  Enjoy Mahout, I know that I do!

Visualising my Mendeley Contact Network

Keeping up with the latest advances in your research field isn’t easy.  I find it particularly difficult to keep up with the fast pace at which computer science research moves.  Building and maintaining a good contact network can help though.  Contacts can let you know about any important new work that pops up on their radar, making it less likely that you’ll miss it.  I use Mendeley to manage my contact network.  As of today, 8th March 2012, I’ve got 162 contacts (http://www.mendeley.com/profiles/kris-jack/).  I thought that I would have a play around with some visualisation software to see what it could do with my contact network.

Using Vizster (http://hci.stanford.edu/jheer/projects/vizster/), an open-source library released under a BSD license, I created a visualisation of my contact network (Figure 1).  My profile is in the centre of the network and a line is drawn between me and each of my connections.  Similarly, lines are drawn between my contacts who are connected to one another.  Vizster attempts to cluster people together based upon the connections between them and to colour in the communities that it finds.  This is a pretty neat feature and it works quite well for my contacts.  I recognise the people in the right-most grouping as colleagues from the University of Dundee, where I studied, and the majority of people in the far left group as colleagues from Mendeley, where I work.  It’s interesting to see the groups that have formed and the people who appear at the boundaries between them.  Prof. Granitzer, for example, is surrounded by people he used to work with at TU Graz, but he’s also quite close to the Mendeley group, which makes sense given how closely he collaborates with us.  Not everyone in my contact network has been well placed though.  The people who swarm around me in the middle, for example, come from several groups of contacts that Vizster didn’t identify.  I’d like to look more into how such technologies can help not only me, but all Mendeley users, to manage and better understand their own contact networks.

Figure 1: Visualisation of contact network with names and pictures of contacts

Visualising my contact network has helped me to see how my contacts are connected and the natural groups that they form.  It’s useful to see, at a glance, where the groups touch one another and who straddles the boundaries.  It also helps me to get to know my existing contacts better by seeing who knows whom.  Such visualisations do have their limitations, but that’s to be expected.  Overall, I’m pretty pleased with the results of my little foray into the world of visualisation, and glad that I’ve got Mendeley’s contact network to keep up to date with the latest advances in my field.