Monday, November 16, 2015

PLOS and DBpedia – an experiment towards Linked Data

Editor’s Note: This article is coauthored by Bob Kasenchak, Director of Business Development/Taxonomist at Access Innovations.

PLOS publishes articles covering a huge range of disciplines. This was a key factor in PLOS deciding to develop its own thesaurus – currently with 10,767 Subject Area terms for classifying the content.

We wondered whether matching software could establish relationships between PLOS Subject Areas and corresponding terms in external datasets. These relationships could enable links between data resources and expose PLOS content to a wider audience. So we set out to see if we could populate a field for each term in the PLOS thesaurus with a link to an external resource that describes (or is “the same as”) the concept in the thesaurus. If so, we could:

• Provide links between PLOS Subject Area pages and external resources
• Import definitions to the PLOS thesaurus from matching external resources

For example, adding Linked Data URIs to the Subject Areas would facilitate making the PLOS thesaurus available as part of the Semantic Web of linked vocabularies.

We decided to use DBpedia for this trial for two reasons:

Firstly, as stated by DBpedia: “The DBpedia knowledge base is served as Linked Data on the Web. As DBpedia defines Linked Data URIs for millions of concepts, various data providers have started to set RDF links from their data sets to DBpedia, making DBpedia one of the central interlinking-hubs of the emerging Web of Data.”

Figure 1: Linked Open Data Cloud with PLOS shown linking to DBpedia – the concept behind this project.
Secondly, DBpedia is constantly (albeit slowly) updated from frequently used Wikipedia pages, so it has a way to stay current. It also offers a way to add content to DBpedia pages and so provide inbound links, meaning people can reach PLOS Subject Area Landing Pages (directly or indirectly) via DBpedia.

Figure 2: ‘Cognitive psychology’ pages in PLOS and DBpedia 
Which matching software to trial?
We considered two possibilities: Silk and Spotlight.

  • The Silk method might have allowed more granular, specific, and accurate queries, but it would have required us to learn a new query language. 
  • Spotlight, on the other hand, is executable by a programmer via API and required little effort to run, re-run, and check results; it took only a matter of minutes to get results from a list of terms to match. 

So we decided to use Spotlight for this trial.
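As a rough illustration of how little code such an API call needs, here is a minimal Python sketch that builds a request for Spotlight's candidate list for a single term. The endpoint URL, the `confidence` value, and the function names are assumptions for illustration, not the setup used in this trial; a self-hosted Spotlight server would use a different base URL.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public demo endpoint; not necessarily the one used in the trial.
SPOTLIGHT_CANDIDATES = "https://api.dbpedia-spotlight.org/en/candidates"


def build_request(term, confidence=0.0):
    """Build a GET request asking Spotlight for candidate resources for one term."""
    query = urllib.parse.urlencode({"text": term, "confidence": confidence})
    return urllib.request.Request(
        f"{SPOTLIGHT_CANDIDATES}?{query}",
        headers={"Accept": "application/json"},
    )


def fetch_candidates(term, timeout=30):
    """Send the request and return the decoded JSON response as a dict."""
    with urllib.request.urlopen(build_request(term), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_candidates("Cognitive psychology")` would return the service's JSON response, which an editor can then scan for the ranked candidate URIs. The sketch deliberately leaves the response unparsed, since its exact shape depends on the Spotlight version.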

Which sector of the thesaurus to target?
We chose the Psychology division of 119 terms (see Appendix) as a good starting point because it provides a reasonable number of test terms so that trends could emerge, and a range of technical terms (e.g. Neuropsychology) as well as general-language terms (e.g. Attention) to test the matching software.


Figure 3: Work flow.
Step 1: We created the External Link and Synopsis DBpedia fields in the MAIstro Thesaurus Master application to store the identified external URIs and definitions. The External Link field accommodates the corresponding external URI, and the Synopsis DBpedia field houses the definition – “dbo:abstract” in DBpedia.
Step 2: Matching DBpedia concepts with PLOS Subject Areas using Spotlight:

  • Phase 1: For the starting set of terms we chose Psychology (a Tier 2 term) and the 21 Narrower Terms that sit in Tier 3 immediately beneath Psychology (listed in the Appendix).
  • Phase 2: For Phase 2 we included the remaining 98 terms from Tier 4 and deeper beneath Psychology (listed in the Appendix).
Step 3: Importing External Link/Synopsis DBpedia to PLOS Thesaurus: Once a list of approved matching PLOS-term-to-DBpedia-page correspondences was established, another quick DBpedia Spotlight query provided the corresponding Definitions. Access Innovations populated the fields by loading the links and definitions to the corresponding term records. For the “Cognitive psychology” example these are:

Synopsis DBpedia: Cognitive psychology is the study of mental processes such as “attention, language use, memory, perception, problem solving, creativity, and thinking.” Much of the work derived from cognitive psychology has been integrated into various other modern disciplines of psychological study including educational psychology, social psychology, personality psychology, abnormal psychology, developmental psychology, and economics.
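The definitions in this trial were retrieved with a Spotlight query. As an alternative illustration only, the same dbo:abstract text can be requested from DBpedia's SPARQL endpoint. The sketch below assumes the public endpoint at https://dbpedia.org/sparql is reachable; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public DBpedia SPARQL endpoint is available.
DBPEDIA_SPARQL = "https://dbpedia.org/sparql"


def abstract_query(resource_uri, lang="en"):
    """Return a SPARQL query selecting the dbo:abstract of one resource in one language."""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        f"SELECT ?abstract WHERE {{ <{resource_uri}> dbo:abstract ?abstract . "
        f'FILTER (lang(?abstract) = "{lang}") }}'
    )


def fetch_abstract(resource_uri, lang="en", timeout=30):
    """Run the query and return the abstract text, or None if nothing matched."""
    params = urllib.parse.urlencode({
        "query": abstract_query(resource_uri, lang),
        "format": "application/sparql-results+json",
    })
    with urllib.request.urlopen(f"{DBPEDIA_SPARQL}?{params}", timeout=timeout) as resp:
        bindings = json.load(resp)["results"]["bindings"]
    return bindings[0]["abstract"]["value"] if bindings else None
```

For the example above, the resource URI would be http://dbpedia.org/resource/Cognitive_psychology.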
How did it go?
The table shows the distribution of results for the 119 Subject Areas in the Psychology branch of the PLOS thesaurus:
Thus a total of 96 matches could be found by any method (80.7% of terms – top three rows of the Table). Of these, 86 terms (72.3% of terms) were matched as one of the top 5 Spotlight hits (top two rows of the Table), as compared to 71 matches (59.7% of terms) being identified correctly and directly by Spotlight as the top hit (top row of the Table).
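The percentages quoted here follow directly from the cumulative counts. A short Python check, using only the figures given in the text (the variable names are ours):

```python
TOTAL_TERMS = 119

# Cumulative match counts quoted in the text.
cumulative = {"top hit": 71, "top 5 hits": 86, "any method": 96}


def pct(count, total=TOTAL_TERMS):
    """Percentage of all terms, rounded to one decimal place."""
    return round(100 * count / total, 1)


for label, count in cumulative.items():
    print(f"{label}: {count} terms ({pct(count)}%)")

# Per-category counts derived from the cumulative figures.
positions_2_to_5 = cumulative["top 5 hits"] - cumulative["top hit"]
manual_only = cumulative["any method"] - cumulative["top 5 hits"]
unmatched = TOTAL_TERMS - cumulative["any method"]
print(positions_2_to_5, manual_only, unmatched)
```

The derived counts are 15 terms matched in positions 2-5, 10 matched only manually, and 23 with no match.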

Figure 4 shows the two added fields “Synopsis DBpedia” and “External Link” in MAIstro, for “Cognitive Psychology”.

Figure 4: Addition of Synopsis DBpedia and External Link fields to MAIstro.
We had set out to establish whether matching software could define relationships between PLOS thesaurus terms and corresponding terms in external datasets. We used the Psychology division of the PLOS thesaurus as our test vocabulary, Spotlight as our matching software, and DBpedia as our target external dataset.

We found that unambiguous suitable matches were identified for 59.7% of terms. Expressed another way, mismatches were identified as the top hit in 35 cases (29.4% of terms), which is a high burden of inaccuracy. This is too low a quality outcome for us to consider adopting Spotlight suggestions without editorial review.

As well as those terms that were matched as a top hit, a further 12.6% of terms (or 31% of the terms not successfully matched as a top hit) had a good match in Spotlight hit positions 2-5. So Spotlight successfully matched 72.3% of terms within the top 5 Spotlight matches.

Having the Spotlight hit list for each term did bring efficiency to finding the correspondences. Both the “hits” and the “misses” were straightforward to identify. As an aid to the manual establishment of these links Spotlight is extremely useful.

Stability of DBpedia: We noticed that the dbo:abstract content in DBpedia is not stable. It would be an enhancement to append the Synopsis DBpedia field contents with URI and date stamp as a rudimentary versioning/quality control measure.
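One possible shape for such a stamp, as a minimal Python sketch; the bracketed suffix format and the function name are assumptions, not an existing MAIstro feature:

```python
from datetime import date


def stamp_synopsis(abstract, source_uri, retrieved=None):
    """Append the source URI and retrieval date to an imported abstract."""
    retrieved = retrieved or date.today()
    return f"{abstract.strip()} [Source: {source_uri}; retrieved {retrieved.isoformat()}]"
```

With the stamp stored alongside the text, an editor can later tell which version of the DBpedia abstract was imported and when it should be re-checked.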

Can we improve on Spotlight? Possibly. We wouldn’t be comfortable with any scheme that linked PLOS concepts to the world of Linked Data sources without editorial quality control. But we suspect that a more sophisticated matching tool might promote those hits that fell within Spotlight matches 2-5 to the top hit, and would find some of the 8.4% of terms which were found manually but which Spotlight did not suggest in the top 5 hits at all. We hope to invest some effort in evaluating Silk, and establishing whether or not any other contenders are emerging.

Introducing PLOS Subject Area URIs into DBpedia pages: This was explored and it seemed likely that the route to achieve this would be to add the PLOS URI first to the corresponding Wikipedia page, in the “External Links” section.

Figure 5: The External Links section of Wikipedia: Cognitive psychology
As DBpedia (slowly) crawls through all associated Wikipedia pages, eventually the new PLOS link would be added to the DBpedia entry for each page.

To demonstrate this methodology, we added a backlink to the corresponding PLOS Subject Area page in the Wikipedia article shown above (Cognitive psychology) as well as all 21 Tier 3 Psychology terms.

Figure 6: External Links at Wikipedia: Cognitive psychology showing link back to the corresponding PLOS Subject Area page
Were DBpedia to re-crawl this page, the link to the PLOS page would be added to DBpedia’s corresponding page as well.

However, Wikipedia questioned the value of the PLOS backlinks (“link spam”) and their appropriateness to the “External Links” field in the various Wikipedia pages. A Wiki administrator can deem them inappropriate and remove them from Wikipedia (as has happened for some if not all of them by the time you read this).

We believe the solution is to publish the PLOS thesaurus as Linked Open Data (in SKOS or OWL format) and assert the link to the published vocabulary from DBpedia (using the property owl:sameAs instead of dbo:wikiPageExternalLink). We are looking into the feasibility and mechanics of this.
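To give a sense of what one published term might look like, the Python sketch below emits a SKOS concept in Turtle with an owl:sameAs link to its DBpedia counterpart. The PLOS concept URI is a hypothetical placeholder, and the link is shown from the PLOS side for illustration; how DBpedia would assert the link on its side is not covered here.

```python
def concept_turtle(concept_uri, pref_label, dbpedia_uri):
    """Return a Turtle document describing one SKOS concept linked to DBpedia."""
    return (
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .\n\n"
        f"<{concept_uri}> a skos:Concept ;\n"
        f'    skos:prefLabel "{pref_label}"@en ;\n'
        f"    owl:sameAs <{dbpedia_uri}> .\n"
    )


# Hypothetical PLOS concept URI; the real URI scheme is not yet defined.
print(concept_turtle(
    "http://example.org/plos/subject/cognitive-psychology",
    "Cognitive psychology",
    "http://dbpedia.org/resource/Cognitive_psychology",
))
```

Plain string formatting keeps the sketch dependency-free; a production export would more likely use an RDF library to handle escaping and serialization.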

Once the PLOS thesaurus is published in this way, the most likely candidate for interlinking data would be the Silk Linked Data Integration Framework, and we look forward to exploring that possibility.

Appendix: The Psychology division of the PLOS thesaurus. LD_POC_blog.appendix

 November 10, 2015
