Tripliser User Guide 1.0

Basic model

Tripliser uses the XML mapping file to convert inputs to outputs. The inputs could be in various forms, but most typically are XML files. The output is one or more 'triple graphs', or a serialisation of the graph(s).

Concepts

Mime types

Tripliser can produce all MIME types supported by the Jena library:

  • N3: text/rdf+n3
  • X-Turtle: application/x-turtle
  • Plain text: text/plain
  • RDF XML: application/rdf+xml
  • RDF XML Abbreviated*: application/rdf+xml+abbr

*A non-standard MIME type used to produce abbreviated RDF XML.

Primary inputs

Primary inputs are the principal source of the data used to construct the output. They are often referred to simply as 'inputs'; the term 'primary input' is used only to contrast with supporting input.

Supporting inputs

Supporting inputs provide a way of supplying additional reference data to a conversion process. Supporting inputs must be of the same form as the input data. These inputs are provided as a map, allowing them to be referenced by name in the mapping file using the input attribute.

Reports

Tripliser produces at least one report for every conversion. Reports contain details about the conversion process and an assessment as to whether it was successful.

Functions

Functions can be provided to the conversion process to allow more advanced querying, such as lookups to third-party services, data type conversion or advanced validation.

Complete model

Command line tool

The command line tool gives access to the key functionality of the Java library.

Installation

The only currently available version is "1.0". Replace [version] with this value in the command line examples below.

Download Jar

requires Java 1.6
wget https://github.com/downloads/daverog/tripliser/tripliser-[version]-jar-with-dependencies.jar
wget https://github.com/daverog/tripliser/raw/master/src/main/sh/triplise
chmod u=r+x triplise
./triplise [input] [mapping]

From source - Install Saxon into Maven Repository

Download Saxon-HE 9.3, then unzip it and install it into your local Maven repository:

unzip saxonhe9-3[version]j.zip
mvn install:install-file -Dfile=saxon9he.jar -DgroupId=net.sf.saxon -DartifactId=saxon-he -Dversion=9.3 -Dpackaging=jar

From source - Build Tripliser

cd [workspace directory]
git clone git://github.com/daverog/tripliser
cd tripliser
mvn assembly:assembly -DdescriptorId=jar-with-dependencies
cp src/main/sh/triplise [installation directory]
cp target/tripliser-[version]-jar-with-dependencies.jar [installation directory]
cd [installation directory]
chmod u=r+x triplise
./triplise [input] [mapping]

Usage

triplise [input] [mapping]

Where input is the input file and mapping is the mapping XML file. Unless specified, the output will use the input filename and apply a '.rdf' extension.

The tool provides usage help which will explain the available options. To show the usage help, call the tool with no arguments.

Here is a sample usage message:

Java

Tripliser is principally designed as a Java library. As such, the functionality is richer than simply using the command line tool.

Maven dependency

First, follow the instructions in 'From source - Install Saxon into Maven Repository' to install Saxon-HE in your local Maven repository. Next, install the latest Tripliser library into the same repository:

wget -O tripliser-[version].pom https://github.com/downloads/daverog/tripliser/pom.xml
wget https://github.com/downloads/daverog/tripliser/tripliser-[version].jar
mvn install:install-file -Dfile=tripliser-[version].jar -DpomFile=tripliser-[version].pom

To use in your project, include the following dependency in your pom.xml file:

Usage

To get started, a TripliserFactory is needed. A TripliserFactory requires a mapping file, as an InputStream. The resulting Tripliser can be used to process different inputs in different configurations. Both the factory and the Tripliser use a builder pattern to provide a clean syntax.

The following is the simplest possible usage of the Tripliser library:

Using the builder pattern, more options can be applied:

For further details on each builder method, please consult the javadoc.

Supplying functions

When using the Java library, functions can be supplied to enhance the query language. Currently, only Saxon XPath functions can be added. See the Saxon XPath Function Definitions section.

TripleGraphs

TripleGraphs can be produced by Tripliser, and are the encapsulation of a triple graph and associated meta-data. They comprise the following:

  • A name
  • A Jena model, containing the actual graph data
  • A report, detailing issues with the conversion
  • Tags, providing additional meta-data

Scope & Merging

Tripliser uses the concept of scope in various areas. The various scopes are defined as follows, broadly speaking, from the widest to the narrowest:

  • MAPPING: The whole mapping file
  • INPUT: Each primary input (not supporting) supplied to the tripliser
  • GRAPH_MAPPING: Each graph element in the mapping file
  • GRAPH (default): Each query result for each graph element in the mapping file
  • RESOURCE_MAPPING: Each resource element in the mapping file
  • RESOURCE: Each query result for each resource element in the mapping file
  • PROPERTY: Each query result for each property element in the mapping file

Scopes are provided to the setMergeScope(scope) method. This determines how the graphs are merged during the conversion process. For example, if a merge scope of RESOURCE_MAPPING was specified, a graph would be produced for each resource element in the mapping file. In the universe example, we would produce one graph full of stars, and one graph full of planets, regardless of the input(s).
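The effect of a merge scope can be pictured as grouping produced resources under a key chosen by the scope. The following is a minimal, self-contained sketch of that idea and does not use the Tripliser API: the Produced class, the mergeKey and merge methods and the sample file and resource names are all hypothetical, and only three of the scopes are modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MergeScopeSketch {
    // Scope names as listed in this guide, widest first.
    enum Scope { MAPPING, INPUT, GRAPH_MAPPING, GRAPH, RESOURCE_MAPPING, RESOURCE, PROPERTY }

    // Hypothetical record of one produced resource and where it came from.
    static final class Produced {
        final String input;
        final String resourceMapping;
        final String uri;

        Produced(String input, String resourceMapping, String uri) {
            this.input = input;
            this.resourceMapping = resourceMapping;
            this.uri = uri;
        }
    }

    // Chooses the key under which a resource is merged (assumption: one graph per key).
    static String mergeKey(Scope scope, Produced p) {
        switch (scope) {
            case MAPPING: return "mapping";
            case INPUT: return p.input;
            case RESOURCE_MAPPING: return p.resourceMapping;
            default: throw new UnsupportedOperationException("Not modelled in this sketch: " + scope);
        }
    }

    // Groups resource URIs into "graphs" keyed by the merge scope.
    static Map<String, List<String>> merge(Scope scope, List<Produced> produced) {
        Map<String, List<String>> graphs = new LinkedHashMap<>();
        for (Produced p : produced) {
            graphs.computeIfAbsent(mergeKey(scope, p), k -> new ArrayList<>()).add(p.uri);
        }
        return graphs;
    }

    // Hypothetical data in the spirit of the universe example: two inputs, two resource mappings.
    static List<Produced> universe() {
        return Arrays.asList(
            new Produced("a.xml", "stars", "sun"),
            new Produced("a.xml", "planets", "earth"),
            new Produced("b.xml", "stars", "sirius"),
            new Produced("b.xml", "planets", "mars"));
    }

    public static void main(String[] args) {
        // prints {stars=[sun, sirius], planets=[earth, mars]}
        System.out.println(merge(Scope.RESOURCE_MAPPING, universe()));
    }
}
```

With RESOURCE_MAPPING the two inputs collapse into one graph of stars and one of planets, whereas INPUT would give one graph per file and MAPPING a single graph.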

TripleGraphCollections

TripleGraphCollections are produced by Tripliser.generateTripleGraphCollection(). In addition to providing an iteration over the triple graphs in the collection, the collection also contains a report of its own. This report records entries that are not specific to any particular graph, such as an error reading an input.

Triple graphs can also be obtained from the collection by name (if this is appropriate to the chosen merge scope).

Iteration vs. Collection

There are two approaches to obtaining multiple triple graphs. One is to create a TripleGraphCollection with the generateTripleGraphCollection method; the other is to create a TripleGraphIterator with the getTripleGraphIterator method.

Using getTripleGraphIterator is much more likely to be efficient as data is processed and stored only when necessary. The efficiency gain will depend on various factors, particularly the merge scope, but the following example illustrates the difference.

Assuming the following:

  • A mapping file with one graph element
  • 100 input files
  • 100 'instances' of a graph per input file (e.g. the graph XPath query returns 100 results for each input file)

Using the collection approach, 10,000 graphs will be added to a collection, each containing a report.

Using the iteration approach, each of the 10,000 graphs will be returned one at a time. At any one time, only one input file will be open and only one triple graph and one report will be in memory.
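The difference between the two approaches is the usual one between an eager list and a lazy iterator. The sketch below illustrates it with plain Java collections rather than the Tripliser classes: graphs are stood in for by strings, and buildGraph, collectAll and iterate are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class IterationVsCollection {
    static final int INPUTS = 100;     // input files
    static final int PER_INPUT = 100;  // graph instances per input file

    // Stand-in for the work of building one graph from one query result.
    static String buildGraph(int input, int instance) {
        return "graph-" + input + "-" + instance;
    }

    // Collection approach: all 10,000 graphs are built and held at once.
    static List<String> collectAll() {
        List<String> all = new ArrayList<>();
        for (int input = 0; input < INPUTS; input++) {
            for (int instance = 0; instance < PER_INPUT; instance++) {
                all.add(buildGraph(input, instance));
            }
        }
        return all;
    }

    // Iteration approach: each graph is built only when next() is called.
    static Iterator<String> iterate() {
        return new Iterator<String>() {
            private int position = 0;

            @Override
            public boolean hasNext() {
                return position < INPUTS * PER_INPUT;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                int input = position / PER_INPUT;
                int instance = position % PER_INPUT;
                position++;
                return buildGraph(input, instance);
            }
        };
    }

    public static void main(String[] args) {
        System.out.println("Held in memory at once: " + collectAll().size());
        Iterator<String> graphs = iterate();
        int seen = 0;
        while (graphs.hasNext()) {
            graphs.next(); // only this one graph needs to exist right now
            seen++;
        }
        System.out.println("Visited one at a time: " + seen);
    }
}
```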

Mapping file

The mapping file describes how a particular format of input should be converted into triples. This is done using queries, which 'extract' the values from the input and 'inject' them into a graph. Below is an example mapping file:

Constants

To avoid repeating frequently occurring values throughout the mapping file, constants can be used. They are simple key-value pairs, defined as follows:

The defined constant can then be applied, using ${name} syntax, elsewhere in the mapping file:

Constants can be used inside the following attributes of property mappings (about or property):

  • prepend
  • append
  • value
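The ${name} substitution can be pictured as a simple find-and-replace over an attribute value. The following sketch uses only java.util.regex and is not taken from the Tripliser source: the class and method names and the baseUri constant are hypothetical, and leaving unknown constants untouched is an assumption rather than documented behaviour.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConstantSketch {
    // Matches ${name} and captures the constant name.
    private static final Pattern CONSTANT = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces each ${name} with its value; unknown names are left as-is (an assumption).
    static String apply(String attributeValue, Map<String, String> constants) {
        Matcher matcher = CONSTANT.matcher(attributeValue);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = constants.get(matcher.group(1));
            String replacement = (value != null) ? value : matcher.group(0);
            // quoteReplacement stops '$' and '\' in the value being treated as group references.
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> constants = new HashMap<>();
        constants.put("baseUri", "http://example.org/"); // hypothetical constant
        // prints http://example.org/stars/
        System.out.println(apply("${baseUri}stars/", constants));
    }
}
```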

Namespaces

Tripliser supports the use of namespace prefixes. Each declared namespace can be used elsewhere in the mapping file. Only namespaces that are used in any particular graph will be declared in resulting RDF serialisations. Namespaces are defined as follows:

The defined namespace can then be applied, using prefix:entity syntax, elsewhere in the mapping file:

Direct use of namespace prefixes is supported by the following attributes of property mappings (about or property):

  • name
  • dataType
  • value

Namespace prefixes can also be used indirectly, where the results of queries can incorporate the prefix form. For example, the following XPath query would be valid:

Where @dc_property results in a valid Dublin Core property.

The RDF namespace (http://www.w3.org/1999/02/22-rdf-syntax-ns#), with a prefix of 'rdf', is built-in by default.
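Expanding a prefix:entity name into a full URI amounts to a lookup of the prefix followed by concatenation. The sketch below shows that idea in plain Java and is not the Tripliser implementation: PrefixSketch, namespaces and expand are hypothetical names, the foaf declaration is an example, and returning names with unknown prefixes unchanged is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PrefixSketch {
    // Declared namespaces: 'rdf' is built in per this guide; 'foaf' is an example declaration.
    static Map<String, String> namespaces() {
        Map<String, String> namespaces = new LinkedHashMap<>();
        namespaces.put("rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#");
        namespaces.put("foaf", "http://xmlns.com/foaf/0.1/");
        return namespaces;
    }

    // Expands prefix:entity to a full URI; names with no known prefix are returned unchanged.
    static String expand(String name, Map<String, String> namespaces) {
        int colon = name.indexOf(':');
        if (colon < 0) {
            return name;
        }
        String namespace = namespaces.get(name.substring(0, colon));
        return (namespace == null) ? name : namespace + name.substring(colon + 1);
    }

    public static void main(String[] args) {
        // prints http://xmlns.com/foaf/0.1/name
        System.out.println(expand("foaf:name", namespaces()));
        // prints http://www.w3.org/1999/02/22-rdf-syntax-ns#type
        System.out.println(expand("rdf:type", namespaces()));
    }
}
```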

Default namespace

The default namespace is extracted from the XML file's xmlns attribute. A minor performance gain can be achieved by indicating the default namespace in the mapping file as follows:

Graphs, Resources & Properties

The most important parts of the mapping file are the graphs, resources and properties, each defined as follows:

  • Graph: A collection of resources intended to form a single triple graph
  • Resource: A resource within a graph
  • About: A special property to define the identity of the resource
  • Property: A specific property of a resource

With the exception of the 'about' property, all the parts described above can also describe multiple things. This is the case when a query is used that returns multiple results, a multi-result-query. For example, if a multi-result-query is used on the graph element, a different graph will be produced for each result. Multiple resources and multiple properties work in the same way. Note that queries are often only potentially multi-result, where the number of results is not known until the input is supplied.

Properties common to graph and resource elements

There are several properties common to graph and resource elements. These are as follows:

  • name: A name used to identify the graph or resource in reports. Also, if merge scope GRAPH_MAPPING is used, graphs that have names can be obtained from a TripleGraphCollection with the method getTripleGraphByName("name")
  • query: The result(s) of this query are only used as input for the queries at a lower level. For example, XPath queries must return 'nodes' which are then passed as a 'context node' to the resources or properties.
  • input: Determines the supporting input on which the query will be run by providing the input's name. If omitted, the query will run on the primary input.
  • required: If true, raises the level of the report entry for a missing graph or resource from warning to error. This causes missing graphs or resources to result in a failed mapping. The default is equal to the strict attribute applied to the root element, which itself defaults to true.
  • comment: A comment to describe the graph or resource mapping.

Property mapping

There are two forms of property mapping:

  • The about element: This mapping generates the URI of the resource and is therefore mandatory.
  • The property element: Top-level properties are placed inside the properties element and generate the properties of the resource. Properties can be nested to create anonymous resources (blank nodes). Properties that contain nested properties cannot have a value and must be a resource.

Property mappings provide the following attributes to configure their behaviour:

  • prepend: The object's value or URI will be prepended with this attribute's value. Constants will be converted.
  • append: The object's value or URI will be appended with this attribute's value. Constants will be converted.
  • value: The object's value or URI will be this attribute's value, with any prepended or appended values applied. Constants will be converted.
  • query: The object's value or URI will be the result of the query, with any prepended or appended values applied.
  • input: Determines the supporting input on which the query will be run by providing the input's name. If omitted, the query will run on the primary input.
  • resource: If true, the final value, created from the attributes above, is expected to be a URI. If false, it is expected to be a string value, unless another data type is supplied. Defaults to false normally, true for properties with nested properties.
  • dataType: A datatype URI defining the type of the value. Values for XSD datatypes will be validated. If supplied, property cannot be a resource URI.
  • validationRegex: A regular expression (Java-compatible) which is applied after the value or resource URI has been generated. If a match is not found, the property is not added to the graph.
  • required: If true, raises the level of the report entry for a missing property mapping from warning to error. This causes missing properties to result in a failed mapping. The default is equal to the strict attribute applied to the root element, which itself defaults to true.
  • comment: A comment to describe the property mapping.
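The way prepend, append and validationRegex combine can be sketched in a few lines of Java. This is an illustration of the attribute descriptions above and not the library's code: PropertyValueSketch and build are hypothetical names, the raw argument stands for either the value attribute or a query result, and interpreting 'a match is not found' as Matcher.find() (a match anywhere in the value) is an assumption.

```java
import java.util.Optional;
import java.util.regex.Pattern;

public class PropertyValueSketch {
    // Builds the final object value, or returns empty if the validation regex does not match.
    static Optional<String> build(String prepend, String raw, String append, String validationRegex) {
        // Prepended and appended text wraps the raw value or query result.
        String value = (prepend == null ? "" : prepend) + raw + (append == null ? "" : append);

        // The regex is applied after the value has been assembled (find() semantics assumed).
        if (validationRegex != null && !Pattern.compile(validationRegex).matcher(value).find()) {
            return Optional.empty(); // the property would not be added to the graph
        }
        return Optional.of(value);
    }

    public static void main(String[] args) {
        // prints Optional[http://example.org/planets/mars]
        System.out.println(build("http://example.org/planets/", "mars", null, "^http://"));
        // prints Optional.empty
        System.out.println(build(null, "Test Name", null, "^\\d+$"));
    }
}
```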

None of the attributes above is mandatory; however, a property mapping must satisfy one of the following conditions:

  • Has a value attribute
  • Has a query attribute
  • Has nested properties

Tags

The purpose of tags is to extract additional meta-data from the input, to allow post-processing of the triple graph. For example, you might extract an ID which is not used in the ontology, but uniquely identifies the resource in other systems.

Tags can be applied to graphs as follows:

Tag queries will be run when a graph is produced. The results of tag queries, instead of being added as triples, are provided as a simple key-value map, accessible as follows:

Strict mapping

The strict attribute applied to the root rdf-mapping element determines the default of the required attribute for all property mappings. If a required property fails to result in a new triple on the graph, then this is considered a failure in the report. Therefore, a strict mapping is more likely to fail.

Reports

Tripliser produces at least one report for every conversion. Reports contain detail about the conversion process including:

  • Status (error, warning, advice etc...)
  • Scope (resource, graph, etc...)
  • Property name
  • A parent mapping, if one exists
  • Cause (An exception class)
  • Message
  • Query details (from the mapping file)

These reports can be in three forms:

  • Logging: A report that logs to stdout
  • Plain text: A human readable, line by line output of the report
  • XML: An XML report with each field of each entry within its own element

The level of detail is the same for all types of report. If using the command line tool, options are available to switch report type. When using the library, a report can be extracted at the end as XML or plain text.

Example

The following is an example report, in text and XML format:

[warning@property|foaf:name] (DatatypeFormatException) Lexical form 'Test Name' is not a legal instance of Datatype[http://www.w3.org/2001/XMLSchema#int -> class java.lang.Integer] Invalid value format {query='name'}
[failure@property|foaf:name] (InvalidPropertyMappingException) A required property did not have any triples created {query='name'}
[success@resource|resource1] {query='//simple'}
-----------------------
Overall status: failure
Success: false

Statuses

Report statuses indicate the successes or failures of a conversion process. The following statuses are used:

  • SUCCESS: A successful conversion was made
  • ADVICE: An event of note, but not a problem
  • WARNING: An allowable problem with the conversion
  • FAILURE: An unacceptable problem with the conversion
  • ERROR: A more fundamental error, such as an IO error

Overall status is determined as being equal to the most severe report entry status, where ERROR is the most severe status.
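Taking the most severe entry can be expressed as a maximum over an ordered enum. The sketch below is a stand-alone illustration and not the Tripliser report classes: OverallStatusSketch and overall are hypothetical names, and it assumes that the order in which the statuses are listed above is the severity order, which the guide confirms only for ERROR.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class OverallStatusSketch {
    // Declared from least to most severe (assumed to follow the order of the list above).
    enum Status { SUCCESS, ADVICE, WARNING, FAILURE, ERROR }

    // The overall status is the most severe status among the entries.
    // Enums compare by declaration order, so Collections.max picks the most severe one.
    // An empty report is not handled here; Collections.max throws NoSuchElementException.
    static Status overall(List<Status> entries) {
        return Collections.max(entries);
    }

    public static void main(String[] args) {
        // The three entries of the example report: a warning, a failure and a success.
        List<Status> entries = Arrays.asList(Status.WARNING, Status.FAILURE, Status.SUCCESS);
        // prints FAILURE
        System.out.println(overall(entries));
    }
}
```

This agrees with the example report, whose warning, failure and success entries give an overall status of failure.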

The status of a given event can vary depending on the configuration of the mapping file. Required properties will result in a FAILURE if no valid value is found, whereas unrequired properties will only result in a WARNING. The strict attribute, applied to the root element of the mapping file, determines the default value for the required attribute on all properties.

The expression 'no valid value is found' is key to understanding how statuses are assigned. The following table shows how different scenarios map to different statuses; most scenarios apply equally to graph, resource and property queries:

Scenario | Status (required=true) | Status (required=false)
---------|------------------------|------------------------
The input file cannot be read | ERROR | ERROR
The query syntax is invalid | FAILURE | FAILURE
The query results in no values | FAILURE | ADVICE
The query results in one value | SUCCESS* | SUCCESS*
The query results in multiple values | SUCCESS* | SUCCESS*
The query results in multiple values, only one of which is a valid URI (assuming resource=true) | WARNING (for each URI conversion failure) and SUCCESS* | WARNING (for each URI conversion failure) and SUCCESS*
The query results in multiple values, none of which is a valid URI (assuming resource=true) | WARNING (for each URI conversion failure) and FAILURE | WARNING (for each URI conversion failure) and ADVICE
The query results in multiple values, but all fail the validation regular expression | WARNING (for each regex mismatch) and FAILURE | WARNING (for each regex mismatch) and ADVICE

*SUCCESS status is used only to indicate a successfully created resource, not for every successful query.
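The value-related rows of the table follow one pattern: a WARNING for each rejected value, then a final status that depends on whether any valid value remains. The following sketch encodes only that pattern and is not the library's logic: QueryStatusSketch and entries are hypothetical names, and the unreadable-input and invalid-syntax rows are not modelled.

```java
import java.util.ArrayList;
import java.util.List;

public class QueryStatusSketch {
    enum Status { SUCCESS, ADVICE, WARNING, FAILURE, ERROR }

    // Report entries for one query, given how many of its values were valid or rejected.
    // A value is "rejected" if it failed URI conversion or the validation regex.
    static List<Status> entries(boolean required, int validValues, int rejectedValues) {
        List<Status> entries = new ArrayList<>();
        for (int i = 0; i < rejectedValues; i++) {
            entries.add(Status.WARNING); // one warning per rejected value
        }
        if (validValues > 0) {
            // Per the footnote, SUCCESS is recorded only for created resources.
            entries.add(Status.SUCCESS);
        } else {
            // No valid value found: severity depends on the required attribute.
            entries.add(required ? Status.FAILURE : Status.ADVICE);
        }
        return entries;
    }

    public static void main(String[] args) {
        // prints [FAILURE]
        System.out.println(entries(true, 0, 0));
        // prints [WARNING, WARNING, ADVICE]
        System.out.println(entries(false, 0, 2));
        // prints [WARNING, SUCCESS]
        System.out.println(entries(true, 1, 1));
    }
}
```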

XPath query language

The XPath engine used by Tripliser is the Saxon-HE library, which provides support for XPath 2.0. When using XPath as the query language (the default behaviour), certain rules apply to how the context of the query affects its execution. These are outlined in the following sections.

Context node inheritance

If a query returns one or multiple node results, each node is passed down the graph->resource->property hierarchy and used as context for the next. For example, assuming the source XML:

If a resource uses the XPath query:

Each 'thing' element will be passed as context to the properties, allowing a property to define a relative XPath query:

Resetting context

XPath allows the context to be reset by using the '//' prefix. This does not circumvent the multiplicity of a query result, but allows for absolute references.

Another alternative is to switch document using the input attribute. Once the input document has been changed, the context node is ignored.
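Context inheritance and the '//' reset can be reproduced outside Tripliser with the JDK's built-in XPath engine. Note that this sketch uses javax.xml.xpath (XPath 1.0), not Saxon, and does not involve a mapping file; the sample XML, the class name and the method names are hypothetical and chosen only to echo the 'thing' elements mentioned above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ContextNodeSketch {
    // Hypothetical input document.
    static final String XML =
        "<things><thing><name>Sun</name></thing><thing><name>Earth</name></thing></things>";

    private static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    // "Resource" query selects each thing; the relative "property" query runs against each one.
    static List<String> namesPerThing(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList things = (NodeList) xpath.evaluate("//thing", parse(xml), XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < things.getLength(); i++) {
                Node context = things.item(i);
                names.add(xpath.evaluate("name", context)); // relative to the context node
            }
            return names;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // A '//' query evaluated from a context node still searches the whole document.
    static int thingsSeenFromFirstThing(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node first = (Node) xpath.evaluate("//thing[1]", parse(xml), XPathConstants.NODE);
            Double count = (Double) xpath.evaluate("count(//thing)", first, XPathConstants.NUMBER);
            return count.intValue();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(namesPerThing(XML));
        System.out.println(thingsSeenFromFirstThing(XML));
    }
}
```

The relative query yields one name per 'thing', while the '//' query returns the same document-wide result whichever 'thing' is the current context.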

Saxon XPath Function Definitions

When creating a TripliserFactory, Saxon XPath functions can be supplied for use in the XPaths of the mapping file. These functions need to extend the ExtensionFunctionDefinition class.

These functions can be added via this method:

Below is an example function to reverse a String:

Once installed, the function can be used in the mapping file as follows:

Copyright (C) 2010, 2011 by David Rogers