Querying the British National Bibliography

Following up on the earlier announcement that the British Library has made the British National Bibliography available under a public domain dedication, the JISC Open Bibliography project has worked to make this data more usable.

The data has been loaded into a Virtuoso store that can be queried through the SPARQL Endpoint. The URIs that we have assigned to each record are made dereferenceable by the ORDF software, which supports content auto-negotiation and embeds RDFa in the HTML representation.
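As a rough illustration of what content negotiation means for a client, the sketch below builds an HTTP request that asks for a particular serialization of a record URI, using only the Python standard library. The media types listed and the helper names are assumptions for illustration; the formats the server actually offers are not enumerated in this post.

```python
import urllib.request

# Media types a client might ask for. These are assumptions for
# illustration; the formats actually served are not listed in this post.
ACCEPT_TYPES = {
    "turtle": "text/turtle",
    "rdfxml": "application/rdf+xml",
    "html": "text/html",
}


def build_request(uri, fmt="rdfxml"):
    """Return a Request whose Accept header asks for the given format."""
    return urllib.request.Request(uri, headers={"Accept": ACCEPT_TYPES[fmt]})


def fetch(uri, fmt="rdfxml", timeout=30):
    """Dereference the URI and return (content type, body bytes)."""
    with urllib.request.urlopen(build_request(uri, fmt), timeout=timeout) as resp:
        return resp.headers.get("Content-Type"), resp.read()
```

Calling fetch("http://bnb.bibliographica.org/entry/GB8102507", "turtle") would then request a Turtle rendering of that record, provided the server honours that media type.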

The data contains some 3 million individual records and some 173 million triples. Transforming and loading the source data took about five hours. Indexing the data was a very CPU-intensive process that took approximately three days.

To get an idea of the shape of the data, let us consider a sample resource, http://bnb.bibliographica.org/entry/GB8102507. Apart from the links between the various representations, the description of the entity itself is as follows:

@prefix ov: <http://open.vocab.org/terms/> .
@prefix isbd: <http://iflastandards.info/ns/isbd/elements/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix bio: <http://purl.org/vocab/bio/0.1/> .
@prefix dc: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://bnb.bibliographica.org/entry/GB8102507> a bibo:Book, bibo:Document;
    dc:source <http://bnb.bibliographica.org/dataset/BNBrdfdc03.xml#183143>;
    dc:isPartOf <http://bnb.bibliographica.org/dataset>;
    rdfs:seeAlso <http://purl.org/NET/book/isbn/0241105161#book>,
                 <http://www4.wiwiss.fu-berlin.de/bookmashup/books/0241105161>;

    dc:title "A good man in Africa";
    dc:language [ rdf:value "eng"^^dc:ISO639-2 ];
    dc:extent [ rdfs:label "251p" ];

    dc:contributor [ a foaf:Agent;
                     foaf:name "Boyd, William";
                     skos:notation "Boyd, William, 1952-";
                     bio:event [ a bio:Birth;
                                 bio:date "1952"^^xsd:gYear ];
                     = <http://bibliographica.org/entity/735b02a8f051e2249e40fbd48112d033>;
                   ];

    dc:subject [ rdfs:label "Fiction in English" ],
               [ rdfs:label "1945-" ],
               [ rdfs:label "Texts" ],
               [ a skos:Concept;
                 skos:inScheme <http://dewey.info/scheme/e18>;
                 skos:notation "823/.9/1"^^<ddc:Notation> ],
               [ a skos:Concept;
                 skos:inScheme <http://dewey.info/scheme/e19>;
                 skos:notation "823/.914"^^<ddc:Notation> ];

    dc:publisher [ a foaf:Agent;
                   foaf:name "Hamilton";
                   skos:notation "Hamilton";
                   = <http://bibliographica.org/entity/c080da5b03a0786efa61e61123b359d9>;
                 ];
    dc:issued "1981"^^xsd:gYear;
    isbd:hasPlaceOfPublicationProductionDistribution [ rdfs:label "London" ];

    bibo:identifier "GB8102507";
    bibo:isbn <urn:isbn:0241105161>;
    ov:blid "008042853".

Some of the salient features of this representation are:

The entire dataset can be queried through the SPARQL Endpoint, which makes use of some of the extended features of Virtuoso such as full-text indexing. Full-text search is done with the bif:contains built-in function, which is also what powers the search functionality on the website. The default (example) query returns some details about all books that have "Edinburgh" in their titles:

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?book ?title ?name ?description
WHERE {
    ?book a bibo:Book .
    ?book dc:title ?title . ?title bif:contains "Edinburgh" .
    OPTIONAL { ?book dc:description ?description } .
    OPTIONAL {
        ?book dc:contributor ?author . ?author foaf:name ?name
    }
} LIMIT 50
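A query like this can also be sent from a script using the standard SPARQL protocol (an HTTP GET with a query parameter). The following is a minimal sketch only: the endpoint address is a hypothetical placeholder, the helper names are invented for illustration, and it assumes the endpoint can return results as application/sparql-results+json. The query is a cut-down version of the one above.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint address, assumed for illustration only.
ENDPOINT = "http://bnb.bibliographica.org/sparql"

QUERY_TEMPLATE = """
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
SELECT DISTINCT ?book ?title
WHERE {
    ?book a bibo:Book .
    ?book dc:title ?title . ?title bif:contains "%s" .
} LIMIT %d
"""


def build_query(word, limit=50):
    """Fill the template with a single search word and a row limit."""
    # Only accept a single plain word so nothing can break out of the
    # quoted search expression.
    if not word.isalnum():
        raise ValueError("search term must be a single alphanumeric word")
    return QUERY_TEMPLATE % (word, limit)


def build_url(word, limit=50, endpoint=ENDPOINT):
    """Encode the query as a SPARQL protocol GET request URL."""
    params = urllib.parse.urlencode({"query": build_query(word, limit)})
    return endpoint + "?" + params


def search_titles(word, limit=50):
    """Run the query and return a list of (book URI, title) pairs."""
    req = urllib.request.Request(
        build_url(word, limit),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        data = json.load(resp)
    rows = data["results"]["bindings"]
    return [(row["book"]["value"], row["title"]["value"]) for row in rows]
```

Restricting the search term to one alphanumeric word keeps the sketch simple; multi-word or phrase searches need the quoting rules of bif:contains, which are not covered here.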

It should be noted that only some predicates are indexed for full-text searching, namely,

Further Work

An ultimate goal of our work in the Open Bibliography group at the OKF is to enable the collection of rich metadata about the relationships between works and authors, to document and map the scholarly discourse. This dataset is an important building block to help ground the references in such a project. More immediately, however, we will:

  • Make available a voiD description of this dataset that describes its properties in more detail.
  • Make available a dump of our dataset derived from the BNB so that the data can be easily mirrored and copied for local processing.
  • Correct the errors listed in the Errata section below.

though not necessarily in that order.

Errata

  • ISBNs were represented in the source dataset as string literals of the form URN:ISBN:0123456789 and were erroneously transformed to URIs in violation of the rdfs:range of bibo:isbn.
  • [FIXED] The predicate linking the resource to its representations, foaf:isPrimaryTopicOf, contained a typo, which may have made the data difficult to use with RDF browsing clients that do not infer the inverse of foaf:primaryTopic.
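Until the ISBN representation is corrected, a consumer of the data could normalise the values locally. The helper below is a hypothetical sketch, not part of the dataset tooling: it assumes values arrive either as urn:isbn: URIs (as in the current data) or as URN:ISBN: strings (as in the source), and returns the bare ISBN as a plain string.

```python
def isbn_from_urn(value):
    """Strip a case-insensitive 'urn:isbn:' prefix and return the bare ISBN."""
    prefix = "urn:isbn:"
    text = value.strip()
    # Accept the angle-bracketed form used for URIs in Turtle listings.
    if text.startswith("<") and text.endswith(">"):
        text = text[1:-1]
    if text.lower().startswith(prefix):
        text = text[len(prefix):]
    # Drop hyphens and upper-case a trailing 'x' check digit.
    return text.replace("-", "").upper()
```

For the sample record, isbn_from_urn("<urn:isbn:0241105161>") yields the string "0241105161".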