RDF Storage: Apache Jena TDB

gowebing 2024-11-30

Overview

RDF Storage is a database used to store and query RDF data. RDF stands for Resource Description Framework, which is a standard model for describing and interchanging data on the web.

RDF storage contains collections of RDF statements, which are three-part statements known as triples. Each triple has a resource (subject), property (predicate), and property value (object).

Basic triple architecture

Basic triple architecture

For example, the following statement represents “Alphabet Inc has a CEO named Pichai Sundararajan.”.

Triple Statement

Triple Statement

Moreover, RDF statements can be linked together to create a graph. For instance:

Graph example

Graph example

Unlike a relational database, RDF storage is a type of graph database that stores RDF triple statements.

There are many RDF storages available in the market, such as MarkLogic, Amazon Neptune, Virtuoso, and Apache Jena – TDB. This article only introduces Apache Jena – TDB, which is a free and open-source Java framework to manage RDF data.

Apache Jena TDB

TDB is a component of Apache Jena for RDF storage and query. Apache Jena is a free and open-source Java web framework that provides several APIs and components to process RDF data. The following picture shows the framework architecture of Apache Jena:

Apache Jena architecture

Apache Jena architecture

The core APIs of Apache Jena is the RDF API used to process RDF data and SPARQL API used to query RDF data.

SPARQL is an RDF query language able to retrieve and manipulate data stored in RDF format, including Apache Jena TDB. It is similar to SQL in a relational database. Typically, the SPARQL consists of two parts:

The SELECT clause identifies the variables (prefixed with ‘?’) to appear in the query results
The WHERE clause provides the triple pattern to match against the RDF data

     SPARQL 
   
xxxxxxxxxx

SELECT ?subject ?predicate ?object
WHERE {
  ?subject ?predicate ?object
}

For example, to query who is the CEO of Alphabet Inc. from the above example, the SPARQL looks like:

     SPARQL 
   
xxxxxxxxxx

SELECT ?NAME
WHERE {
  “Alphabet Inc.” hasCEO ?Name 
}

The value of the variable (?NAME) will be “Pichai Sundararajan”.

The SPARQL can be sent to the Apache Jena TDB through the TDB command line (bat\tdb2_tdbquery.bat) or the Apache Jena Fuseki component. Apache Jena Fuseki is a SPARQL server that can run as a Java web application (WAR file) and provide the web interface for users to send SPARQL queries to the Apache Jena TDB.

Apache Jena Fuseki

Apache Jena Fuseki

Use Case Scenario: Open PermID Entity Bulk Download

In this section, I will demonstrate how to load RDF data into the Apache Jena TDB in the real use case scenario. To do this, I will use as an example, Open PermID from Refinitiv. Open PermID is freely available at https://permid.org/.

PermID is a shortening of “Permanent Identifier”, which is a machine-readable number assigned to entities, securities, organizations (companies, government agencies, universities, etc.), quotes, individuals, and more.

It is specifically designed for use by machines to reference related information programmatically. Open PermID also provides bulk files (one per entity type), containing the complete lists of the entities including organization, instrument, quote, asset class, currency, instrument code, and person entities. These files are updated weekly. The following picture represents relationships among entities.

Open PermID

Open PermID

The following steps demonstrate how to load organization, industry, and quote files into the Apache Jena TDB. Apache Jena and Apache Jena Fuseki must be installed properly on the machine. To install Apache Jena and Apache Jane Fuseki, please refer to the Apache Jena website.

First download organization, industry, and quote files from the permid.org website. There are two types of files (ttl and ntriples). In this step, ntriples files will be used.

Entity Download Files

Entity Download Files

After downloading, decompress those files and rename the files’ extensions from ntriples to nt.

Next, run the following command in the Apache Jena directory to load those files to Apache Jena TDB. The location of the database is at c:\workspace\database.

     Plain Text 
   
xxxxxxxxxx

C:\workspace\apache-jena-3.14.0>bat\tdb2_tdbloader.bat --loc c:\workspace\database OpenPermID-bulk-organization-xxx.nt OpenPermID-bulk-industry-xxx.nt OpenPermID-bulk-quote-xxx.nt

tdb2_tdbloader.bat output

It may take more than four minutes to populate the database.

At this point, the database is ready and SPARQL can be used to query the database. For example, the following SPARQL lists active US companies in the Healthcare Providers and Services industry.

     SPARQL 
   
 
 
      
    
x

              
          
prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix vcard: <http://www.w3.org/2006/vcard/ns#> 
prefix tr-org: <http://permid.org/ontology/organization/> 
prefix tr-fin: <http://permid.org/ontology/financial/>
 
SELECT ?orgPermID ?orgName ?exchangeCode ?ticker ?mic ?ric
WHERE {
  ?industry skos:prefLabel "Healthcare Providers & Services" .
  ?orgPermID tr-org:hasPrimaryIndustryGroup ?industry .
  ?orgPermID tr-org:isIncorporatedIn <http://sws.geonames.org/6252001/> .
  ?orgPermID tr-org:hasActivityStatus tr-org:statusActive .
  ?orgPermID vcard:organization-name ?orgName .
  ?orgPermID tr-fin:hasOrganizationPrimaryQuote ?orgQuote .
  OPTIONAL {?orgQuote tr-fin:hasExchangeCode ?exchangeCode .}
  OPTIONAL {?orgQuote tr-fin:hasExchangeTicker ?ticker .}
  OPTIONAL {?orgQuote tr-fin:hasMic ?mic .}
  OPTIONAL {?orgQuote tr-fin:hasRic ?ric .} 
}

The query can be run through the tdb2_tdbquery.bat command or Apache Jena Fuseki.

tdb2_tdbquery.bat output

tdb2_tdbquery.bat output

Apache Jena Fuseki

Apache Jena Fuseki

The results contain organizations’ PermIDs, organization names, exchange codes, tickers, Market Identifier Codes (MICs), and Reuters Instrument Codes (RICs).

Summary

Apache Jena TDB is a free RDF database used to store and query RDF data. RDF is a standard model for describing and interchanging data on the web. The underlying structure of RDF is a collection of triples, each consisting of a subject, a predicate, and an object.

Triples can be linked together to create graphs. Therefore, RDF storage is a type of graph database. The database can be populated with the tdb2_tdbloader.bat command. It uses SPARQL as a query language. The query can be run through the tdb2_tdbquery.bat command or Apache Jena Fuseki.

References

Apache Jena. n.d. Apache Jena. [online] Available at: <https://jena.apache.org/index.html> [Accessed 25 March 2020].
Apache Jena. n.d. Apache Jena - Apache Jena Fuseki. [online] Available at: <https://jena.apache.org/documentation/fuseki2/index.html> [Accessed 25 March 2020].
Apache Jena. n.d. Apache Jena - Jena Architecture Overview. [online] Available at: <https://jena.apache.org/about_jena/architecture.html> [Accessed 26 March 2020].
Apache Jena. n.d. Apache Jena - TDB. [online] Available at: <https://jena.apache.org/documentation/tdb/index.html> [Accessed 25 March 2020].
Cambridge Semantics. n.d. Learn RDF. [online] Available at: <https://www.cambridgesemantics.com/blog/semantic-university/learn-rdf/> [Accessed 25 March 2020].
Open PermID. n.d. Permid. [online] Available at: <https://permid.org/> [Accessed 25 March 2020].
W3C. 2014. RDF - Semantic Web Standards. [online] Available at: <https://www.w3.org/RDF/> [Accessed 25 March 2020].
Db-engines.com. n.d. RDF Stores - DB-Engines Encyclopedia. [online] Available at: <https://db-engines.com/en/article/RDF+Stores> [Accessed 25 March 2020].
Wikipedia, the free encyclopedia. 2020. SPARQL. [online] Available at: <https://en.wikipedia.org/wiki/SPARQL> [Accessed 25 March 2020].
W3C. 2013. SPARQL 1.1 Overview. [online] Available at: <https://www.w3.org/TR/2013/REC-sparql11-overview-20130321/> [Accessed 25 March 2020].
Wikipedia, the free encyclopedia. 2019. Triplestore. [online] Available at: <https://en.wikipedia.org/wiki/Triplestore> [Accessed 25 March 2020].
Ontotext. n.d. What Is RDF? Making Data Triple Their Power. [online] Available at: <https://www.ontotext.com/knowledgehub/fundamentals/what-is-rdf/> [Accessed 25 March 2020].