
How well does Mendeley’s Metadata Extraction Work?

March 12, 2015

Authors: Phil Gooch and Kris Jack

Introduction

One of Mendeley's most used features is our metadata extraction tool.  We've recently developed a new service that tries to automatically pull out the metadata from research article PDFs. It's currently used by the new Mendeley Web Library and iOS applications (Figure 1).  We're often asked how well it works, so we thought we'd put this short post together to answer that question.

Figure 1.  Steps in metadata extraction in Mendeley Library. Step 1, add a PDF. Step 2, wait a second for the metadata to be extracted. Step 3, see the extracted metadata.

The Problem

Automated metadata extraction is one of those problems in AI that looks easy to solve but is actually quite difficult.  Given a research article that has been well formatted by a publisher, it's normally easy for a person to spot key metadata such as its title, its authors, and where and when it was published.  The fact that publishers use a diverse range of layouts and typesetting conventions isn't a problem for human readers.

For example, the five articles in Figure 2 all have titles but in each case they appear in different positions, are surrounded by different text and are in different fonts.  Despite this, humans find it very easy to locate and read the titles in each case.  For machines, however, this is quite a challenge.

Figure 2: Five Articles with Different Layouts. The titles in each article have been highlighted. They all appear in different positions, are surrounded by different text and are rendered in different fonts.

At Mendeley, we have been working on building a tool that can automatically extract metadata from research articles.  Our tool needs to help researchers from any discipline, meaning that we need to cope with PDFs across the full range of diverse styles that publishers have created.  In the next section, we'll look at the metadata extraction tool that we have developed to address this problem.

The Solution: Metadata Extraction 

The metadata extraction tool takes a PDF as input and produces a structured record of metadata (e.g. title, authors, year of publication) as output (Figure 3).

Figure 3: Simplified Metadata Extraction Architecture
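
To make that output more concrete, here's a rough sketch of the kind of structured record we're talking about. The field names and values below are our own illustration, not Mendeley's actual schema.

```python
# Illustrative only: a made-up record showing the kind of fields the tool
# extracts. The field names are hypothetical, not Mendeley's real schema.
extracted_record = {
    "title": "An Example Research Article Title",
    "authors": [{"first_name": "Jane", "last_name": "Doe"},
                {"first_name": "Joe", "last_name": "Bloggs"}],
    "year": 2014,
    "source": "Journal of Examples",        # publication venue
    "volume": "12",
    "pages": "345-356",
    "identifiers": {"doi": "10.1234/jex.2014.0345"},
}
```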

In this section, we’re going to open up the bonnet of the metadata extraction tool to reveal that it’s actually a pipeline of several components that depend on one another (Figure 4): 

  1. We take the PDF
  2. Convert to text: pdftoxml extracts text from the PDF into a format that provides the size, font and position of every character on the page.
  3. Extract metadata: This information is converted into features that can be understood by a classifier that decides whether a sequence of characters represents the title, the author list, the abstract or something else. Once the classifier has determined which bits of text are the relevant metadata fields, the fields are cleaned up to make them more presentable. The text conversion and metadata extraction components are wrapped up in an open-source tool called Grobid, which we have modified so that it can be integrated easily with our codebase.
  4. Enrich with catalogue metadata: The extracted fields are used to generate a query to the Mendeley metadata lookup API.  If the query finds a match in the Mendeley Catalogue, we enrich the extracted metadata with the catalogue record; otherwise, we return the metadata that was extracted directly from the PDF, unenriched.  (A rough code sketch of these steps follows Figure 4.)
  5. You get the metadata
Figure 4: Metadata Extraction Pipeline.  Given a PDF, there are three main components invoked to produce a metadata record.
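
To make the pipeline more concrete, here is a minimal sketch of the three components in Python. It is an illustration of the flow rather than Mendeley's actual implementation: it assumes the pdftoxml command-line tool is installed, that a stock (unmodified) Grobid service is running locally on its default port, and that you have a Mendeley API access token. The catalogue endpoint and parameters shown are assumptions based on the public API; check the Developer Portal for the current details.

```python
"""Minimal sketch of the pipeline's three components (illustration only, not
Mendeley's implementation). Assumptions: the pdftoxml CLI is on the PATH, a
stock Grobid service is running at http://localhost:8070, and ACCESS_TOKEN is
a valid Mendeley API OAuth token obtained via the Developer Portal."""
import subprocess
import requests

ACCESS_TOKEN = "..."  # placeholder

def pdf_to_xml(pdf_path: str, xml_path: str) -> None:
    # Step 2 (convert to text): produce XML recording the size, font and
    # position of every character on the page.
    subprocess.run(["pdftoxml", pdf_path, xml_path], check=True)

def extract_header_metadata(pdf_path: str) -> str:
    # Step 3 (extract metadata): ask Grobid to classify the header fields
    # (title, authors, abstract, ...). Returns TEI XML.
    with open(pdf_path, "rb") as f:
        resp = requests.post(
            "http://localhost:8070/api/processHeaderDocument",
            files={"input": f},
            timeout=60,
        )
    resp.raise_for_status()
    return resp.text

def lookup_in_catalogue(title: str, author: str) -> list:
    # Step 4 (enrich): query the Mendeley catalogue with the extracted
    # fields. Endpoint and parameters are assumptions based on the public
    # API docs; verify them on the Developer Portal.
    resp = requests.get(
        "https://api.mendeley.com/search/catalog",
        params={"title": title, "author": author, "limit": 1},
        headers={
            "Authorization": f"Bearer {ACCESS_TOKEN}",
            "Accept": "application/vnd.mendeley-document.1+json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```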

How well does the Solution Work? 

Ok, enough describing the tool: how well does it work?  Well, that's a complicated question to answer.  It depends on what you're using it for.  Different uses place different requirements on the quality of the extracted metadata.

For example, if you want it to pull out perfect metadata so that you can generate citations from Mendeley Desktop without making any manual corrections, then you have pretty high quality requirements.  If, however, you don't need perfect metadata (e.g. the case of a letter may be wrong, a character in the title may be incorrectly extracted, the order of authors may be wrong) but just need it to be good enough to organise your research easily, then the quality doesn't need to be so high.  In this section, we're going to consider the strictest case, where the extracted metadata needs to be perfect so that it can be used to generate a citation.

We curated a data set of 26,000 PDFs for which we have perfect citable metadata records.  This is our test data set.  It’s large enough that it gives us a 99.9% level of confidence that the results from this evaluation would hold when tested against a much larger set of articles.

We then ran these 26,000 PDFs through our metadata extraction tool and found that 21,823 of them (83.9%) yielded perfect, citable metadata: authors, title, year, and publication venue (e.g. journal, conference, magazine) (Figure 5).  This means that the remaining 16.1% of PDFs either can't be processed or return incorrect or incomplete metadata.

Figure 5: Simplified Metadata Extraction Evaluation. Given 26K PDFs, the tool can produce citable metadata for 83.9% of them.
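
As a back-of-envelope check on the sample size (our own arithmetic, not a figure reported in the evaluation), a normal-approximation confidence interval shows that with 26,000 PDFs the 99.9% margin of error around the 83.9% estimate is under one percentage point:

```python
from math import sqrt

n, correct = 26_000, 21_823
p = correct / n                     # observed proportion: ~0.839
z = 3.2905                          # z-value for a two-sided 99.9% interval
margin = z * sqrt(p * (1 - p) / n)  # normal-approximation margin of error
print(f"{p:.1%} +/- {margin:.2%}")  # -> 83.9% +/- 0.75%
```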

So that’s how well the metadata extraction tool works from a bird’s eye perspective.  As we’ve already shown, however, the metadata extraction tool is a pipeline that chains together a number of different components.  We should be able to go a step further and see how well the individual components in our pipeline are performing.  These also act as useful machine diagnostics that help us to understand which components need to be improved first in order to improve the tool’s overall performance.

Let's look at how many PDFs make it through each stage of the pipeline (Figure 6). First, we check whether the PDF is processable; that is, can we convert it to text?  If we can't, then we don't have any data to work with.  In our tests, we were able to extract text from 25,327 (97.4%) of the PDFs.  This suggests that using a better PDF-to-text conversion tool could increase our overall performance by at most 2.6%, which, compared to the 16.1% of PDFs that we didn't get citable metadata for, isn't that much.

Figure 6: Metadata Extraction Pipeline Evaluation. Breakdown of how many PDFs are processed at each stage in the pipeline.

For the text that we could extract, we attempted to pull out the major metadata fields of title, authors, DOI, publication venue, volume, year and pages. In practice, not all PDFs will contain all of these fields in their text, but the vast majority should, as publishers tend to include them.  Whether or not they appear in the PDF, we need these fields in order to present the user with a citable record.  For 23,961 PDFs (92%), we were able to extract these metadata fields.  This means that we lose another 5.3% of the total number of PDFs at this stage.  It also suggests that it may be more worthwhile to improve the metadata extraction component than the PDF-to-text one.  We can probe this further to see precisely which fields were extracted, and how accurately, but we'll discuss that in a later blog post.

We can then query the Mendeley Catalogue with the 23,961 metadata records that we have extracted.  Querying the catalogue enriches the extracted fields with additional metadata. For example, if we failed to extract the volume or the page numbers, the catalogue record may have this information. Here, 23,291 out of the 23,961 queries (97.2%) returned a result that allowed us to enrich the metadata.  This suggests that the catalogue enrichment step is performing pretty well, but we can probe this further to determine the cause of the 2.8% of failed catalogue lookups (e.g. does the entry in fact not exist, or was the query too imprecise).  In this step, we don’t lose any records as this is only an enrichment step.  That is, out of the 23,961 metadata records extracted, we have now been able to enrich 23,291 of them but we don’t lose the remaining 670 records (23,961 – 23,291), they just don’t get enriched.  Improving the catalogue lookup step would therefore, at most, allow us to enrich another 670 records, which accounts for 2.6% of the original 26,000 PDFs. 
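
Conceptually, the enrichment step is a field-level merge. A hypothetical sketch (our illustration, not Mendeley's actual logic) might look like this:

```python
from typing import Optional

def enrich(extracted: dict, catalogue_match: Optional[dict]) -> dict:
    # Hypothetical illustration of the enrichment step: fill any gaps in the
    # extracted record from a matching catalogue record. If no match was
    # found, the extracted record is returned untouched, so no records are
    # lost at this stage.
    if not catalogue_match:
        return extracted
    merged = dict(extracted)
    for field, value in catalogue_match.items():
        if not merged.get(field):   # only fill fields we failed to extract
            merged[field] = value
    return merged
```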

Finally, we look at the number of PDFs that yield definitive, citable metadata. Out of the 26,000 PDFs tested, 21,823 (83.9%) have precisely correct metadata.  We can see that, of the 16.1% of PDFs that we don't get correct metadata for, there may be gains of at most 2.6%, 5.3% and 2.6% from improving the PDF-to-text tool, the metadata extraction component and the catalogue lookup service respectively.  In practice, more detailed diagnostics than these are needed to decide which component to improve next, and we really need to take their relative costs into account too, but that's a far more detailed conversation than we'll go into here.
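
For reference, the same funnel can be reproduced as a few lines of arithmetic over the numbers reported above (percentages relative to the original 26,000 PDFs):

```python
total = 26_000
stages = {
    "text extracted (pdftoxml)":     25_327,  # 97.4%
    "metadata fields extracted":     23_961,  # 92.2%
    "enriched via catalogue lookup": 23_291,  # 89.6% (unmatched records are kept, just not enriched)
    "perfect, citable metadata":     21_823,  # 83.9%
}
for stage, count in stages.items():
    print(f"{stage:32s} {count:6d}  ({count / total:.1%})")
```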

So we can see how the performance of each step in the pipeline affects the overall performance you experience when you drop in a PDF and (hopefully) see a clean metadata record appear in your library.

Examples 

As a further demonstration of how well the metadata extraction tool works, here are 10 randomly selected Open Access journal article PDFs that we dropped into Mendeley Library, together with the metadata that was pulled out of each (Figures 7-16).  A perfect citable metadata record is generated for 8 of the 10 PDFs.  In the two cases that fail, the publication venue was incorrect, whilst the rest of the metadata was correct.  As a result, you would have to manually correct the venue in these two records before using them to generate citations.  This random sample of 10 articles fits with our prediction that roughly 8-9 out of every 10 metadata records created should have perfect citable data.

Figure 7: Metadata Extraction Example 1
Figure 8: Metadata Extraction Example 2
Figure 9: Metadata Extraction Example 3
Figure 10: Metadata Extraction Example 4
Figure 11: Metadata Extraction Example 5
Figure 12: Metadata Extraction Example 6
Figure 13: Metadata Extraction Example 7
Figure 14: Metadata Extraction Example 8
Figure 15: Metadata Extraction Example 9
Figure 16: Metadata Extraction Example 10

Integrate Metadata Extraction into your own App

We have made this tool available through Mendeley's Developer Portal, meaning that you can integrate it into your own applications.  So if you want to build an app with the wow factor of built-in metadata extraction, go for it.
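
As a rough idea of what that integration could look like, here's a sketch based on our understanding of the public API, with a placeholder token and file path; verify the exact endpoint, headers and media types on the Developer Portal. POSTing a PDF to the documents endpoint creates a library document whose metadata is extracted from the file.

```python
import requests

ACCESS_TOKEN = "..."    # placeholder: obtain via the Developer Portal's OAuth flow
PDF_PATH = "paper.pdf"  # placeholder: a local PDF to add to the library

# Assumed usage of the public Mendeley API; verify against the Developer Portal.
with open(PDF_PATH, "rb") as f:
    resp = requests.post(
        "https://api.mendeley.com/documents",
        data=f,
        headers={
            "Authorization": f"Bearer {ACCESS_TOKEN}",
            "Content-Type": "application/pdf",
            "Content-Disposition": 'attachment; filename="paper.pdf"',
        },
        timeout=120,
    )
resp.raise_for_status()
print(resp.json().get("title"))  # the extracted title, if all went well
```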

Conclusion

So, all in all, Mendeley's metadata extraction tool works pretty well.  If you drop 10 PDFs into your Mendeley Library then, on average, you'll get perfect, citable metadata for 8-9 of them.  There's always room for improvement, though, so while we continue to work on it, please feel free to help us move in the right direction by getting in touch with your feedback directly.

Kris’ Acknowledgements

While this post was written by both Phil Gooch and myself, I wanted to add a personal acknowledgements section here at the end.  Everything we do at Mendeley is a team effort, but Phil Gooch is the main man to thank for this tool.  He managed to put a high quality system in place in a short amount of time, and it's a unique offering for both the research and developer communities.  Huge thanks.

We would both like to give a few shout-outs to some key folks who made this project come to fruition: Kevin Savage, Maya Hristakeva, Matt Thomson, Joyce Stack, Nikolett Harsányi, Davinder Mann, Richard Lyne and Michael Watt.  Well done all, drinks are on us!



4 Comments
  1. Reblogged this on Oleg Baskov.

  2. Have you analyzed the “16.1% of PDFs either can’t be processed or return incorrect/incomplete metadata.”? For instance, if some publishers predominate, the scientific community and librarians could put pressure on these publishers to improve their production processes.

    Can you share any characteristics of your original 26K set (e.g. mainly since 1995, science? Or…)?

    Thanks for the work you’re doing and for describing your process!

    • No, we haven't analysed the 16.1% yet. This set of PDFs includes some where we couldn't pull out any text (~2.6%), some where we couldn't find a title or DOI (~5.2%), some where we couldn't match the extracted data with anything in Mendeley's catalogue (~2.6%) and finally some that have some metadata (~5.7%) but would not be citable without manual intervention (e.g. the title may have a character wrong, a missing author name, a missing year). During development, we eyeballed the output that was wrong to see if there were any consistent errors being made and put fixes in place. The 16.1% that are left now are pretty much the difficult cases, which will need more work. Since that blog post, we have also enhanced the metadata extraction tool to pull out XMP metadata. Unfortunately, this doesn't make a big difference to the overall results.

      The original 26K set was a random sample. The distribution will reflect the make-up of Mendeley’s community.

  3. Thanks Kris. Interesting that XMP metadata wasn’t part of the first pass — I don’t know how widely it’s being used by small publishers these days (though I wish it were).
