A new version of Mendeley Suggest has been released. It’s in Beta, so please do give feedback!
Mendeley Suggest is an article recommender system for researchers. It’s designed to help them discover new research based on their short and long term interests and to keep them up-to-date with what’s popular and trending in their domains.
The first set of recommendations is based on all of the articles that you have added to your personal library in Mendeley. These recommendations are very good for me, as much of my recent work has been on recommender systems.
The second set of recommendations is based on activity in Mendeley’s community. It recommends articles that are important in your domain but not necessarily related to your immediate interests. It’s a nice tool for helping researchers keep up-to-date with position papers and reviews that have made an impact in their broader domain which, in my case, is Computer Science.
Twitter recently introduced some new functionality. Now you can create polls that stay open for 24 hours and allow you to tap into the Twitter community to get some answers to your burning questions. Anyone on Twitter can vote for one of the two options in the poll and their choice remains anonymous. I thought I’d have a little fun with them.
First a Few Words about how I use Twitter
Out of all of the tools that I use to keep up-to-date with what’s going on in the world of work, Twitter has to be the tool that I rely on the most. I never use Twitter’s main site, other than to check out analytics from time to time, but I regularly use Tweetdeck to keep on top of several topics at a glance.
When I want more of a summary of interesting articles being tweeted on a particular topic I also dip into Right Relevance to see what’s happening.
I also like to contribute to Twitter and tweet quite regularly. I find it’s a great way to release random thoughts. Crowdfire really helped me to find interesting people to follow and build up my own network.
Now on to the Polls
The first poll that I created was about Data Science. In particular, it was about the job title of Data Scientist. Are people happy with their Data Scientist titles?
Interestingly, only 55% of the respondents like their title. Perhaps Data Scientist isn’t the sexiest job of the 21st century any longer. I love my job, but this fits with what I’ve recently seen in the community. I increasingly see data science professionals from a range of companies becoming frustrated by the lack of clear meaning in their job title. There are several roles that data scientists play (this recent study from Bob Hayes goes into detail) and lumping them all together as if they were one isn’t helpful. It can also lead to envy and resentment across teams, where one kind of data scientist does a very different job from another although upper management sees their contributions as equal. Similarly, the same title can mean very different things in different companies, which makes it hard to know what you’re dealing with when you see it on someone’s CV.
That result was interesting enough to encourage me to play more. This time, I sent out a much more provocative poll to see what would happen.
This one got over double the respondents: 48 people voted in it versus 20 for the first one. Looks like I’m not alone in thinking that LinkedIn has a problem with recruiters.
I then went back to the topic of Data Science. I was surprised that, in Bob Hayes’ study about Data Science roles, so few of the roles involved much knowledge of Machine Learning. For me, machine learning is central to the job. So, I opened up a new poll.
46 votes and a resounding yes with 80%. So, although I don’t know who answered the poll, it looks like they have the same expectation as I do that Data Scientists should be experts in Machine Learning.
As a follow up from the first poll about whether Data Scientists like their job title, I then got to thinking that perhaps there’s a mismatch between what employees expect a Data Science job to be and what employers advertise as a Data Science job.
It turns out that people don’t think that Data Science job adverts describe the role well. This might also be a reason for some of the unrest in the Data Science community. It’s hard to keep good people if you aren’t clear and honest about what they will be working on during the interview process.
Finally, I wasn’t sure if the results of Twitter polls could really be trusted so I put it to the vote.
At the time of writing this post, the poll is still ongoing. I wonder what the answer will be?
Are these Poll Results Credible?
In a word, no. Anyone who knows the first thing about creating polls and surveys will know not to put much stock in these results. There’s plenty of selection bias going on, no way to tell what the demographics of the respondents were, no way to ensure that the respondents were eligible to answer, no way to follow up and dig deeper, and only a small number of people responded. Some of these problems are particular to this test, but others apply to Twitter polls generally, casting doubt on their usefulness.
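To put a rough number on that small-sample problem, here’s a quick back-of-envelope sketch. The 80% and 46 votes come from the machine learning poll above (roughly 37 yes votes); using a Wilson score interval here is my own choice for illustration, not anything Twitter provides:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - half, centre + half)

# The machine learning poll: 80% "yes" out of 46 votes, i.e. about 37 yes votes.
low, high = wilson_interval(37, 46)
print(f"95% interval: {low:.2f} to {high:.2f}")  # roughly 0.67 to 0.89
```

So even that “resounding” 80% could plausibly be anywhere from about two thirds to nine tenths, before we even worry about who the respondents were.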
Let’s not let science get in the way of a little fun though.
Twitter polls are great, I’m addicted already! I think I’m going to enjoy playing with them over the next few months. I’ll be taking all of the results with a substantial helping of salt but hey, it’s fun.
Finally, as if it were planned, I just noticed that I have sent 1,499 tweets so far. That’ll make my next tweet, which will be about this blog post, the big 1,500! Thanks Twitter!
There’s something big happening in the world of technology. Over the past couple of years there’s been a resurgent interest in neural networks and excitement over the challenging problems that they can now solve. Folks are talking about Deep Learning… I’ve been keeping an eye on what’s happening in the field to see if there’s anything that we can use to build better tools for researchers here at Mendeley. Recently I went along to the Re-Work’s Deep Learning Summit and thought I’d share my experiences of it here.
The Deep Learning Summit
The 300 attendees were mainly a mixture of technology enthusiasts, machine learning practitioners and data scientists, who all share an interest in what can be done uniquely with deep learning. We heard from around 50 speakers over two days, explaining how they are using deep learning. There were lightning talks, normal ones with Q&A and some fireside chats. There were lots of questions, chief amongst them being ‘What is deep learning?’, and some notable highlights.
Paul Murphy, CEO of Clarify, gave a brief history of deep learning. While neural networks were popular in the 1980s, they went out of fashion for a couple of decades as other techniques were shown to outperform them. They were great at solving toy world problems but beyond some applications, such as automatic handwriting recognition, they had difficulty in gaining traction in other areas. Yann LeCun, Yoshua Bengio, Geoffrey Hinton, and many others persevered with neural network research and, with the increase in computational processing power, more data and a number of algorithmic improvements, have shown that they can now outperform the state-of-the-art in several fields including speech recognition, visual object recognition and object detection. Deep Learning was born.
There was an interesting fireside chat with Ben Medlock (@Ben_Medlock) from SwiftKey, the productivity tool that predicts the next word you’re about to type to save you time. I love this app. Ben spoke about the challenges involved in natural language processing and how many of the current syntactic approaches don’t exploit the semantics of the sentences. This is where deep learning comes in. Using tools like word2vec, you can compare words in a semantic space and use that to improve text prediction. He spoke about how they have done some work with recurrent neural networks, possibly the deepest of deep learning, to improve their tools.
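As a toy illustration of what “comparing words in a semantic space” means, here’s a sketch using hand-made three-dimensional vectors. Real word2vec embeddings are learned from large corpora and have hundreds of dimensions; these numbers are invented purely to show the idea:

```python
import math

# Toy "embeddings": semantically related words get nearby vectors.
# These values are hand-made for illustration, not learned by word2vec.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Related words score higher than unrelated ones.
print(cosine(vectors["king"], vectors["queen"]))  # close to 1
print(cosine(vectors["king"], vectors["apple"]))  # much lower
```

A text predictor can use scores like these to prefer candidate words that are semantically close to the words already typed.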
A lot of the work presented was in the area of vision. This is a field in which deep learning has made consistent advances. Matthew Zeiler presented some impressive demos from Clarifai. They take videos and automatically tag them with concept tags in real-time, from a selection of over 10,000 tags. They report that deep learning has significantly improved the quality of results here. It’s available through their API as a service and they already have a number of high profile customers such as Vimeo, Vodafone and Trivago.
Some early work on Neural Turing Machines also piqued my interest. Alex Graves, from Google DeepMind, told us that modern machine learning is good at finding statistics. For example, you can train a network model to give you the probability of an image, given a label, or the probability of some text given some audio, but the techniques behind them don’t tend to generalise very far. To improve them, the usual solution is to make better representations of the data, such as making the representations more sparse or disentangling them into different factors. Neural Turing Machines offer another solution: instead of learning statistical patterns, they learn programs. These programs are made up of all of the usual things that programming languages provide, like variables, routines, subroutines and indirection. The main advantage is that these machines are good at solving problems whose solutions fit well into algorithmic structures. I assume that this also makes the solutions much more readable, adding some transparency to the typical black-box solutions.
Finally, a fun one, but certainly with serious science behind it. Koray Kavukcuoglu, also from Google DeepMind, spoke about their agent-based systems that use deep learning to learn how to play Atari games. For him, deep learning takes us from solving narrow AI problems to general AI. He showed that through reinforcement learning, where agents learn from observations in their (simulated) environments and are not given explicit goals, they trained Deep Q networks (convolutional neural networks) to play a number of Atari games. In several games they perform at a human-like level and even go beyond it in some cases. They built this using Gorila, Google’s reinforcement learning architecture, designed to help scale deep learning up to real-world problems.
Deep Learning is not just hype, which was one of my worries before going to the summit. It clearly can solve lots of real-world problems with a level of accuracy that we haven’t seen before. I’ve kicked off a few hacks and spikes to explore what we can build for Mendeley’s users using some of these techniques. If we get good results then expect Deep Learning to be powering some of our features soon!
With this first post I’m going to introduce Mendeley Research Maps. At Mendeley we have monthly hackdays when we can experiment with new technologies, work on side projects or simply learn something new and have fun. During one of my first hackdays I started to work on a two-dimensional visualisation of research interests, inspired by an idea suggested to me by a good friend and future colleague, Davide Magatti. The first hack produced the following visualisation:
In this picture disciplines are arranged on the map based on how related they are: medicine is very broad and close to many other disciplines, such as biology and psychology. In the opposite corner computer science is close to engineering and economics. From this first draft I started to build a web application for Mendeley users, using the Mendeley Python SDK for querying the Mendeley API, Flask as a web framework, CherryPy…
Imagine that you’re a researcher and you’ve just found an article that’s really useful for you. It asks lots of the questions that you’ve been asking and even gives some interesting answers. How can you find other articles that are related to that one?
I’ve been working with some top researchers (Roman Kern and Michael Granitzer) on trying to build a system that helps solve this problem. Given any of the millions of articles in Mendeley’s Research Catalogue, can we automatically propose other articles that are related to the one you’re reading? We currently try to do this for Mendeley users by recommending related research (Figure 1).
This is a pretty common problem in the fields of Information Retrieval and Information Filtering. Many search engines have built-in features that support it, like Lucene’s MoreLikeThis, and it’s also commonplace in recommender systems, like Mahout’s ItemSimilarity. There are three basic approaches to solving this problem:
- Take the content of the research article (e.g. title, abstract) and try to find other articles that have similar content. This is often called a content-based approach.
- Look at who reads the research article and try to find other articles that are also read by these people. This tends to be called a collaborative filtering approach.
- Take both the content and usage data (i.e. who reads what) into account to find related articles. Unsurprisingly, this gets called a hybrid approach.
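Here’s a minimal sketch of the shape of each approach, using Jaccard similarity over made-up article data. The real systems (Lucene’s MoreLikeThis, Mahout’s ItemSimilarity) are far more sophisticated, and the blend weight in the hybrid is illustrative only:

```python
# Toy data: each article has some content words and a set of reader ids.
articles = {
    "a": {"words": {"neural", "networks", "learning"}, "readers": {1, 2, 3}},
    "b": {"words": {"deep", "neural", "learning"},     "readers": {2, 3, 4}},
    "c": {"words": {"protein", "folding"},             "readers": {1, 2}},
}

def jaccard(s, t):
    return len(s & t) / len(s | t) if s | t else 0.0

def content_score(x, y):        # content-based: overlapping words
    return jaccard(articles[x]["words"], articles[y]["words"])

def usage_score(x, y):          # collaborative filtering: shared readers
    return jaccard(articles[x]["readers"], articles[y]["readers"])

def hybrid_score(x, y, w=0.5):  # hybrid: blend both signals
    return w * content_score(x, y) + (1 - w) * usage_score(x, y)

for other in ("b", "c"):
    print(other, content_score("a", other),
          usage_score("a", other), hybrid_score("a", other))
```

Note how “c” looks unrelated to “a” by content but related by readership; the hybrid score captures both views at once.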
We can set up a few experiments that use each of these approaches as we have all of the data that we need at Mendeley. In terms of content, we have the metadata for millions of research articles and in terms of usage data, we can look at reading patterns throughout Mendeley’s community. It’ll be interesting to see which approach gives the best results.
We set up an experiment to compare content-based, collaborative filtering and hybrid approaches to retrieving related research articles. First off, like in many experiments, we need to have a data set that already tells us which articles are related to which other articles, so that we can use this to evaluate against. While we don’t know how all articles are related to one another (if we did then we wouldn’t need to build a system that tries to automatically find related articles) we do have some data for a subset of these relationships. Namely, we can look at which articles researchers add to publicly accessible groups on Mendeley and make the assumption that any two articles that appear in the same group together are more likely to be related to one another than any two articles that do not appear in the same group together. Ok, this isn’t perfect but it’s a good evaluation set for us to use to compare different approaches.
We then set up three systems, one for each approach, which we’ll call content-based, usage-based (i.e. the collaborative filtering approach) and hybrid. To test them we selected a random sample of articles from the publicly available Mendeley groups and, for each one, requested that the three systems return an ordered list of the five articles most related to it. If a returned article was related to the article input into the system (i.e. they both appear in a public group together) then this counted as a correct guess for the system. Otherwise, the guess was judged as incorrect.
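In code, the scoring rule looks something like this (the group memberships and the returned candidate list are invented for illustration):

```python
# Toy public groups: two articles count as related if they co-occur
# in at least one group.
groups = [{"a", "b", "c"}, {"c", "d"}]

def related(x, y):
    return any(x in g and y in g for g in groups)

def precision_at_k(query, returned, k=5):
    """Fraction of the top-k returned articles that are related to the query."""
    hits = sum(1 for doc in returned[:k] if related(query, doc))
    return hits / k

# Suppose a system returned these five candidates for query article "a":
print(precision_at_k("a", ["b", "c", "d", "e", "f"]))  # b and c related -> 0.4
```

Averaging this score over all test queries gives the average precision@5 reported in the next section.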
The relative accuracy of the three approaches can be seen in Figure 2. Here, accuracy is a measure of the average number of correctly retrieved related articles over all tests (average precision@5). The content-based approach performs worst, retrieving just under 2/5 related articles on average (0.361); the usage-based approach doesn’t perform much better, retrieving just over 2/5 related articles on average (0.407); while the hybrid approach performs best, retrieving just under 3/5 related articles on average (0.578).
These results are interesting. From an information retrieval perspective, given that research articles are rich in content, it’s reasonable to assume that a content-based approach would perform very well. This rich content only allows us to retrieve around two relevant articles out of five attempts. This seems quite low. From a recommender systems perspective, it’s common for usage-based approaches to outperform content-based ones, but the results here do not indicate that there is a big difference between them. Perhaps the content is, in fact, rich enough here to be a reasonable competitor for usage-based approaches.
The clear winner, however, is the hybrid approach. By combining content-based and usage-based approaches, we see significant gains. This suggests that the information being exploited by the two approaches is not the same, which is why we get better accuracy when we combine them.
The question raised in the title is whether content is still king or if usage has usurped it. In this study, when we try to automatically recommend related research literature, content-based and usage-based approaches produce roughly the same quality of results. In other words, usage has not usurped content. Interestingly, however, when we combine content and usage data, we can produce a new system that performs better than using just one of the data sources alone.
This is a very high level take on the work that we have done. For those of you who’d like more details, a write up has been published in Recommending Scientific Literature: Comparing Use-Cases and Algorithms.
Authors: Phil Gooch and Kris Jack
One of the most used features at Mendeley is our metadata extraction tool. We’ve recently developed a new service that tries to automatically pull out the metadata from research article PDFs. This is currently used by the new Mendeley Web Library and iOS applications (Figure 1). We’re often asked how well it works so we thought we’d put this short post together to answer that question.
Automated metadata extraction is one of those problems in AI that appears very easy to solve but is actually quite difficult. Given a research article that has been well formatted by a publisher, it’s normally easy for a person to spot key metadata such as its title, the authors, and where and when it was published. The fact that publishers use a diverse range of layouts and typesetting for articles isn’t a problem for human readers.
For example, the five articles in Figure 2 all have titles but in each case they appear in different positions, are surrounded by different text and are in different fonts. Despite this, humans find it very easy to locate and read the titles in each case. For machines, however, this is quite a challenge.
At Mendeley, we have been working on building a tool that can automatically extract metadata from research articles. Our tool needs to be able to help researchers from any discipline of research, meaning that we need to be able to cope with PDFs across the full range of diverse styles that publishers have created. In the next section, we’ll look at the metadata extraction tool that we have developed to address this problem.
The Solution: Metadata Extraction
The metadata extraction tool takes a PDF as input and produces a structured record of metadata (e.g. title, authors, year of publication) as output (Figure 3).
In this section, we’re going to open up the bonnet of the metadata extraction tool to reveal that it’s actually a pipeline of several components that depend on one another (Figure 4):
- We take the PDF
- Convert to text: pdftoxml extracts text from the PDF into a format that provides the size, font and position of every character on the page.
- Extract metadata: This information is converted into features that can be understood by a classifier that decides whether a sequence of characters represents the title, the author list, the abstract or something else. Once the classifier has determined which bits of text are the relevant metadata fields, the fields are cleaned up to make them more presentable. The text conversion and metadata extraction components are wrapped up in an open-source tool called Grobid, which we have modified so that it can be integrated easily with our codebase.
- Enrich with catalogue metadata: The extracted fields are used to generate a query to the Mendeley metadata lookup API. If we find a match in the Mendeley Catalogue, we enrich the extracted metadata with the catalogue’s information; otherwise, we return the directly extracted metadata without enriching it.
- You get the metadata
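To make the pipeline’s shape concrete, here’s a sketch in Python. The function bodies are stand-ins (the real service uses pdftoxml and a modified Grobid), but the flow of convert, extract, then enrich-or-pass-through matches the steps above:

```python
# Sketch of the pipeline: each stage either produces output for the next
# stage or drops the document. All logic here is a toy stand-in.

def convert_to_text(pdf):
    # pretend conversion: fails (returns None) on "broken" PDFs
    return None if pdf.get("broken") else pdf.get("text")

def extract_metadata(text):
    # pretend classifier: take the first line as the title
    first_line = text.splitlines()[0].strip() if text else ""
    return {"title": first_line} if first_line else None

def enrich(record, catalogue):
    # enrich with catalogue metadata when a match exists; otherwise
    # return the extracted record unchanged (this stage never drops docs)
    match = catalogue.get(record["title"])
    return {**record, **match} if match else record

# A toy catalogue keyed by title.
catalogue = {"Deep Learning": {"year": 2015, "venue": "Nature"}}

pdf = {"text": "Deep Learning\nYann LeCun..."}
text = convert_to_text(pdf)
record = extract_metadata(text)
print(enrich(record, catalogue))
```

The important structural point, which matters for the diagnostics later in this post, is that the first two stages can lose documents while the enrichment stage cannot.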
How well does the Solution Work?
Ok, enough with describing the tool: how well does it work? Well, that’s a complicated question to answer. It depends on what you’re using it for. Different uses will have different requirements on the quality of the metadata extracted.
For example, if you want it to pull out perfect metadata so that you can use it to generate citations from Mendeley Desktop without having to make any manual corrections then you have pretty high quality requirements. If, however, you don’t need perfect metadata (e.g. the case of a letter may be wrong, a character in the title may be incorrectly extracted, the order of authors is incorrect) but need it to be good enough so that you can easily organise your research, then the quality doesn’t need to be so high. In this section, we’re going to consider the most strict case where the metadata extracted needs to be perfect so that it can be used to generate a citation.
We curated a data set of 26,000 PDFs for which we have perfect citable metadata records. This is our test data set. It’s large enough that it gives us a 99.9% level of confidence that the results from this evaluation would hold when tested against a much larger set of articles.
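As a back-of-envelope check on why a sample of 26,000 is comfortably large, here’s the standard margin-of-error formula for a proportion at the 99.9% confidence level (z of roughly 3.29). The exact confidence calculation behind the figure quoted above may well differ; this is just a sanity check:

```python
import math

def margin_of_error(p, n, z=3.29):
    """Normal-approximation margin of error for a proportion p on a sample of n."""
    return z * math.sqrt(p * (1 - p) / n)

# For a result near 83.9% on 26,000 PDFs, the margin of error is tiny.
print(margin_of_error(0.839, 26000))  # well under 1 percentage point
```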
We then entered these 26,000 PDFs into our metadata extraction tool and found that 21,823 (83.9%) metadata records were generated with perfect, citable metadata: authors, title, year, and publication venue (e.g. journal, conference, magazine) (Figure 5). This means that 16.1% of PDFs either can’t be processed or return incorrect/incomplete metadata.
So that’s how well the metadata extraction tool works from a bird’s eye perspective. As we’ve already shown, however, the metadata extraction tool is a pipeline that chains together a number of different components. We should be able to go a step further and see how well the individual components in our pipeline are performing. These also act as useful machine diagnostics that help us to understand which components need to be improved first in order to improve the tool’s overall performance.
Let’s look at how many PDFs make it through each stage of the pipeline (Figure 6). First, we check whether the PDF is processable. That is, can we convert it to text? If we can’t, then we don’t have any data to work with. In our tests, we were able to extract text out of 25,327 (97.4%) of the PDFs. This suggests that using a better PDF-to-text conversion tool could increase our overall performance by at most 2.6% which, compared to the 16.1% of PDFs that we didn’t get citable metadata for, isn’t that much.
For the text that we could extract, we attempted to pull out the major metadata fields of title, authors, DOI, publication venue, volume, year and pages. In practice, not all PDFs will contain all of these fields in the text, but the vast majority should, as publishers tend to include them. Whether they are in the PDF or not, we need to be able to find these fields in order to present them to the user as a citable record. In 23,961 PDFs (92.2%), we were able to extract these metadata fields. This means that we lose another 5.3% of the total number of PDFs at this stage. This suggests that it may be more worthwhile to improve the metadata extraction component than the PDF-to-text one. We can probe this further to see precisely which fields were extracted, and how accurately, but we’ll discuss that in a later blog post.
We can then query the Mendeley Catalogue with the 23,961 metadata records that we have extracted. Querying the catalogue enriches the extracted fields with additional metadata. For example, if we failed to extract the volume or the page numbers, the catalogue record may have this information. Here, 23,291 out of the 23,961 queries (97.2%) returned a result that allowed us to enrich the metadata. This suggests that the catalogue enrichment step is performing pretty well, but we can probe this further to determine the cause of the 2.8% of failed catalogue lookups (e.g. does the entry in fact not exist, or was the query too imprecise). In this step, we don’t lose any records as this is only an enrichment step. That is, out of the 23,961 metadata records extracted, we have now been able to enrich 23,291 of them but we don’t lose the remaining 670 records (23,961 – 23,291), they just don’t get enriched. Improving the catalogue lookup step would therefore, at most, allow us to enrich another 670 records, which accounts for 2.6% of the original 26,000 PDFs.
Finally, we evaluate the number of PDFs that yield definitive, citable metadata. Out of the 26,000 PDFs tested, 21,823 (83.9%) have precisely correct metadata. Out of the 16.1% of PDFs that we don’t get correct metadata for, there may be gains of 2.6%, 5.3% and 2.6% from improving the PDF-to-text tool, the metadata extraction component and the catalogue lookup service respectively. In practice, more detailed diagnostics than these are needed to decide which component to improve next, and we really need to take their relative costs into account too, but that’s a far more detailed conversation than we’ll go into here.
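The funnel numbers above can be reproduced with a few lines of arithmetic (the counts are taken directly from the figures quoted in this post):

```python
# Reproduce the pipeline funnel: how many PDFs survive each stage,
# and what each stage's loss costs as a share of the original 26,000.
total = 26000
stages = [
    ("convertible to text", 25327),
    ("metadata fields extracted", 23961),
    ("perfect citable metadata", 21823),
]

prev = total
for name, count in stages:
    share = count / total
    stage_loss = (prev - count) / total
    print(f"{name}: {count} ({share:.1%} of all PDFs, stage loss {stage_loss:.1%})")
    prev = count
```

Running this gives the 97.4%, 92.2% and 83.9% survival rates, and stage losses of 2.6% and 5.3% for the first two stages.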
So we can see how the performance of the various steps of the pipeline impacts the overall performance that you experience when you drop in a PDF and (hopefully) see a clean metadata record appear in your library.
As a further demonstration of how well the metadata extraction tool works, here are 10 randomly selected Open Access journal article PDFs that we dropped into Mendeley Library, along with the metadata that was pulled out from each (Figures 7-16). A perfect citable metadata record was generated for 8 of the 10 PDFs. In the two cases that failed, the publication venue was incorrect, whilst the other metadata was correct. As a result, you would have to manually correct these two fields before using them to generate citations. This test on our random sample of 10 articles fits with our prediction that around 8 out of 10 metadata records created should have perfect citable data.
Integrate Metadata Extraction into your own App
We have made this tool available through Mendeley’s Developer Portal meaning that you can integrate it into your own applications. So if you want to build an app that has the wow factor of built-in metadata extraction then go for it.
So, all in all, Mendeley’s metadata extraction tool works pretty well. If you drop 10 PDFs into your Mendeley Library then, on average, you’ll get perfect, citable metadata for 8-9 of them. There’s always room for improvement though so while we’re continually working on it, please feel free to help us move in the right direction by getting in touch with feedback directly.
While this post was written by both Phil Gooch and myself, I wanted to add a personal acknowledgements section here at the end. Everything we do at Mendeley is a team effort, but Phil Gooch is the main man to thank for this tool. He managed to put in place a high quality system in a short amount of time and it’s a unique offering for both the research and developer communities. Huge thanks.
We would both like to give a few shout outs for some key folks who made this project come to fruition – Kevin Savage, Maya Hristakeva, Matt Thomson, Joyce Stack, Nikolett Harsányi, Davinder Mann, Richard Lyne and Michael Watt. Well done all, drinks are on us!
Everyone who spends time surfing the web comes into regular contact with both search engines and recommender systems, whether they know it or not. But what is it that makes them different from one another? This is a question that I’ve been asking myself more and more as of late so I thought that I’d start to put some of my thoughts down.
The Underlying Technologies
I know plenty of folk who begin to answer this question by saying that search engines and recommender systems are different technologies. In the early days of their development the people working on these systems split largely into two different communities, one of which focussed more on Information Retrieval and the other on Information Filtering. As a result, different avenues of research were pursued, which, despite some cross-fertilisation, shaped the fields, and resulting technologies, in different ways.
These two communities are increasingly coming back together as advances in search engines include lessons learned from Information Filtering techniques (e.g. collaborative search) and recommender systems start exploiting well established Information Retrieval techniques (e.g. learning to rank). As such, it’s becoming less relevant to distinguish search engines and recommender systems based on their underlying technologies.
So how should we distinguish them? I think that their primary difference is found in how users interact with them.
How to Spot a Search Engine
- You see a query box where you type in what you’re looking for and they bring back a list of results. From gigantic search engines like Google to the discrete search boxes that index the contents of a single page blog, they all have query boxes.
- You start with a query that you enter into a query box. You have an idea of what you’re looking for. That thing may or may not exist but if it does then you want the search engine to find it for you.
- You find yourself going back to reformulate your query as you see what results it gives and you widen or narrow the search.
- You may even generate a query by clicking on items that interest you (e.g. movies that you like). The search engine will then retrieve more movies for you.
How to Spot a Recommender System
- Some content has just appeared on your screen that is relevant to you but you didn’t request it. Where did that magic come from? That would be a recommender system.
- You don’t build a query and request results. Recommendation engines observe your actions and construct queries for you (often without you knowing).
- They tend to power adverts, which can give them a bad image, especially if the content is embarrassing.
Search Engines are not the same as Recommender Systems. Both can provide personalised content that matches your needs but it’s not what they do or the techno-magic that they use to do it but more how you interact with them that distinguishes them from one another.