
Is Content still King or has Usage Usurped it? A Case Study in Recommending Scientific Literature.

March 30, 2015

Introduction

Imagine that you’re a researcher and you’ve just found an article that’s really useful for you. It asks lots of the questions that you’ve been asking and even gives some interesting answers. How can you find other articles that are related to that one?

I’ve been working with some top researchers (Roman Kern and Michael Granitzer) on trying to build a system that helps solve this problem. Given any of the millions of articles in Mendeley’s Research Catalogue, can we automatically propose other articles that are related to the one you’re reading?  We currently try to do this for Mendeley users by recommending related research (Figure 1).


Figure 1: Related research for an article in Mendeley’s catalogue.

Alternative Solutions

This is a pretty common problem in the fields of Information Retrieval and Information Filtering.  Many search engines have built-in features that support it, like Lucene’s MoreLikeThis, and it’s also commonplace in recommender systems, like Mahout’s ItemSimilarity.  There are three basic approaches to solving this problem:

  1. Take the content of the research article (e.g. title, abstract) and try to find other articles that have similar content.  This is often called a content-based approach.
  2. Look at who reads the research article and try to find other articles that are also read by these people.  This tends to be called a collaborative filtering approach.
  3. Take both the content and the usage data (i.e. who reads what) into account to find related articles.  Unsurprisingly, this gets called a hybrid approach.

We can set up a few experiments that use each of these approaches as we have all of the data that we need at Mendeley.  In terms of content, we have the metadata for millions of research articles and in terms of usage data, we can look at reading patterns throughout Mendeley’s community.  It’ll be interesting to see which approach gives the best results.
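
To make the first two approaches a bit more concrete, here is a minimal sketch in Python of what each similarity signal might look like.  The article texts and reader sets below are invented toy data, and the TF-IDF/Jaccard choices are illustrative stand-ins rather than a description of the system we actually built.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalogue: article id -> title/abstract text (invented for illustration).
articles = {
    "a1": "latent semantic analysis for measuring document similarity",
    "a2": "probabilistic topic models and document similarity",
    "a3": "collaborative filtering for implicit feedback datasets",
}
ids = list(articles)

# Content-based: represent each article by TF-IDF over its text and
# compare articles by cosine similarity.
tfidf = TfidfVectorizer().fit_transform([articles[i] for i in ids])
content_sim = cosine_similarity(tfidf)  # content_sim[i][j] = similarity of ids[i] and ids[j]

# Usage-based: represent each article by the set of people who read it
# and compare articles by the overlap (Jaccard) of their reader sets.
readers = {
    "a1": {"u1", "u2", "u3"},
    "a2": {"u2", "u3", "u4"},
    "a3": {"u4", "u5"},
}

def usage_sim(a, b):
    """Jaccard overlap between two articles' reader sets."""
    union = readers[a] | readers[b]
    return len(readers[a] & readers[b]) / len(union) if union else 0.0
```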

Experiment

We set up an experiment to compare content-based, collaborative filtering and hybrid approaches to retrieving related research articles.  First off, like in many experiments, we need a data set that already tells us which articles are related to which other articles, so that we have something to evaluate against.  While we don’t know how all articles are related to one another (if we did then we wouldn’t need to build a system that tries to automatically find related articles), we do have some data for a subset of these relationships.  Namely, we can look at which articles researchers add to publicly accessible groups on Mendeley and make the assumption that any two articles that appear in the same group are more likely to be related to one another than any two articles that do not.  OK, this isn’t perfect, but it’s a good evaluation set for us to use to compare different approaches.
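
To illustrate that assumption, here is roughly how public group membership can be turned into a set of “related” article pairs for evaluation.  The groups below are invented; in reality this is built from Mendeley’s public group data at a much larger scale.

```python
from itertools import combinations

# Hypothetical snapshot of public groups: group id -> article ids added to it.
public_groups = {
    "g1": {"a1", "a2", "a5"},
    "g2": {"a2", "a3"},
}

# Any two articles that appear together in at least one public group
# are treated as related for evaluation purposes.
related_pairs = set()
for members in public_groups.values():
    related_pairs.update(combinations(sorted(members), 2))

def is_related(x, y):
    """Ground-truth check: do the two articles share a public group?"""
    return (min(x, y), max(x, y)) in related_pairs
```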

We then set up three systems, one for each approach, which we’ll call content-based, usage-based (i.e. the collaborative filtering approach) and hybrid.  To test them, we selected a random sample of articles from the publicly accessible Mendeley groups and, for each one, requested that the three systems return an ordered list of the five articles most related to it.  If a returned article was related to the article input into the system (i.e. they both appear in a public group together), this counted as a correct guess for the system.  Otherwise, the guess was judged incorrect.
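
The scoring itself amounts to precision@5 averaged over the sampled articles.  A minimal sketch, assuming each system is a function that returns an ordered list of suggested article ids and using the `is_related` check sketched above:

```python
def average_precision_at_5(system, sampled_article_ids):
    """Average fraction of a system's top-five suggestions that share a
    public group with the query article (average precision@5)."""
    scores = []
    for query in sampled_article_ids:
        top5 = system(query)[:5]
        correct = sum(1 for suggestion in top5 if is_related(query, suggestion))
        scores.append(correct / 5.0)
    return sum(scores) / len(scores)
```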

Results

The relative accuracy of the three approaches can be seen in Figure 2.  Here, accuracy is the average proportion of correctly retrieved related articles over all tests (average precision@5).  The content-based approach performs worst, retrieving just under 2/5 related articles on average (0.361); the usage-based approach doesn’t perform much better, retrieving around 2/5 related articles on average (0.407); while the hybrid approach performs best, retrieving just under 3/5 related articles on average (0.578).


Figure 2: Comparison of three approaches to retrieving related research. Each approach could retrieve a maximum of five correct results.

These results are interesting.  From an information retrieval perspective, given that research articles are rich in content, it’s reasonable to assume that a content-based approach would perform very well.  Yet this rich content only allows us to retrieve around two relevant articles out of five attempts, which seems quite low.  From a recommender systems perspective, it’s common for usage-based approaches to outperform content-based ones, but the results here do not indicate a big difference between them.  Perhaps the content is, in fact, rich enough here to be a reasonable competitor to usage-based approaches.

The clear winner, however, is the hybrid approach.  By combining content-based and usage-based approaches, we see significant gains.  This suggests that the information being exploited by these two approaches is not the same, which is why we get better accuracy when we combine them.
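
For a sense of what “combining” can mean in practice, one simple option is a weighted blend of the two similarity scores.  The exact method we used is described in the write-up mentioned in the conclusion, so treat this purely as an illustrative sketch.

```python
def hybrid_score(content_score, usage_score, alpha=0.5):
    """Blend content-based and usage-based similarity into one score.
    Both inputs are assumed to be normalised to [0, 1]; alpha weights content."""
    return alpha * content_score + (1 - alpha) * usage_score
```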

Conclusion

The question raised in the title is whether content is still king or if usage has usurped it.  In this study, when we try to automatically recommend related research literature, content-based and usage-based approaches produce roughly the same quality of results.  In other words, usage has not usurped content.  Interestingly, however, when we combine content and usage data, we can produce a new system that performs better than using just one of the data sources alone.

This is a very high level take on the work that we have done.  For those of you who’d like more details, a write-up has been published as Recommending Scientific Literature: Comparing Use-Cases and Algorithms.



4 Comments
  1. joeranb

    Hi Kris,

    that’s a really nice blog post, worth publishing as a short paper at JCDL or WOSP ;-). I could imagine that over time (when even more people use Mendeley and the usage-based recommender system has more data available), the accuracy increases even further.

    I think there is one more promising recommendation approach besides content-based and collaborative filtering (and hybrid), and that is graph-based approaches: with all the articles in Mendeley’s catalog it should be possible to build a huge citation network (including not only citations but also entities such as authors, venues, …). Then you could determine the relatedness of articles with graph algorithms such as Random Walk (with Restart) etc. Some interesting reads on this topic are the papers from “TheAdvisor” https://scholar.google.de/scholar?q=%22TheAdvisor%22&btnG=&hl=de&as_sdt=0%2C5

    I would be really interested to see how graph-based approaches perform compared to CBF and CF.

    For others interested in research paper recommender systems, my literature survey might be interesting 🙂 http://docear.org/papers/research_paper_recommender_systems_–_a_literature_survey.pdf

    • You’re spot on with graph-based approaches being promising. In addition to the citation graph, we also have coauthor graphs and social networks. I hope that we can share some results for them soon!

  2. jtkane

    Hi Kris, I have to agree with Joe and yourself that graph-based approaches are most promising. I found your blog entry via http://getprismatic.com/ who have recently implemented an interest graph approach very successfully for curated content based upon the users’ interests. It kicks ass and can almost be considered addictive for reading content you are interested in. Certainly better than other recommendation methods I’ve worked on in open source search engine technologies. I look forward to reading more about your results with citation and coauthor graphs! – John

  3. I think the hybrid approach wins big because the two approaches it is made up of return recommendations that are complementary rather than overlapping. It’s the kind of difference you would see between item-item and user-user results – the first is “tighter” around the content and the second contains “leaps” that make sense but don’t follow directly from content.
