Tripliser User Guide 1.0
Basic model
Tripliser uses the XML mapping file to convert inputs to outputs. The inputs could be in various forms, but most typically are XML files. The output is one or more 'triple graphs', or a serialisation of the graph(s).
Concepts
Mime types
Tripliser can produce all mime-types supported by the Jena library:
| Format | Mime-type |
|---|---|
| N3 | text/rdf+n3 |
| X-Turtle | application/x-turtle |
| Plain text | text/plain |
| RDF XML | application/rdf+xml |
| RDF XML Abbreviated* | application/rdf+xml+abbr |
*A non-standard mime-type used to produce abbreviated RDF XML
Primary inputs
Primary inputs are the principal source of the data used to construct the output. Primary inputs are often referred to simply as 'inputs'; the term 'primary input' is used only to contrast with 'supporting input'.
Supporting inputs
Supporting inputs provide a way of supplying additional reference data to a conversion process. Supporting inputs must be of the same form as the input data. These inputs are provided as a map, allowing them to be specifically referenced in the mapping file using the input attribute.
Reports
Tripliser produces at least one report for every conversion. Reports contain details about the conversion process and an assessment as to whether it was successful.
Functions
Functions can be provided to the conversion process to allow more advanced querying, such as lookups to third-party services, data type conversion or advanced validation.
Installation
The only currently available version is "1.0"; replace [version] with this value in the command line examples below.
Download Jar
Requires Java 1.6.
    wget https://github.com/downloads/daverog/tripliser/tripliser-[version]-jar-with-dependencies.jar
    wget https://github.com/daverog/tripliser/raw/master/src/main/sh/triplise
    chmod u=r+x triplise
    ./triplise [input] [mapping]
From source - Install Saxon into Maven Repository
Download Saxon-HE 9.3 from here. Next...
    unzip saxonhe9-3[version]j.zip
    mvn install:install-file -Dfile=saxon9he.jar -DgroupId=net.sf.saxon -DartifactId=saxon-he -Dversion=9.3 -Dpackaging=jar
From source - Build Tripliser
    cd [workspace directory]
    git clone git://github.com/daverog/tripliser
    cd tripliser
    mvn assembly:assembly -DdescriptorId=jar-with-dependencies
    cp src/main/sh/triplise [installation directory]
    cp target/tripliser-[version]-jar-with-dependencies.jar [installation directory]
    cd [installation directory]
    chmod u=r+x triplise
    ./triplise [input] [mapping]
Usage
triplise [input] [mapping]
Where input is the input file and mapping is the mapping XML file. Unless specified, the output will use the input filename and apply a '.rdf' extension.
The tool provides usage help which will explain the available options. To show the usage help, call the tool with no arguments.
Here is a sample usage message:
Java
Tripliser is principally designed as a Java library. As such, the functionality is richer than simply using the command line tool.
Maven dependency
First follow the instructions in 'From source - Install Saxon into Maven Repository' to install Saxon-HE in your local Maven repo. Next, install the latest Tripliser library into your maven repository:
    wget -O tripliser-[version].pom https://github.com/downloads/daverog/tripliser/pom.xml
    wget https://github.com/downloads/daverog/tripliser/tripliser-[version].jar
    mvn install:install-file -Dfile=tripliser-[version].jar -DpomFile=tripliser-[version].pom
To use in your project, include the following dependency in your pom.xml file:
Usage
To get started, a TripliserFactory is needed. A TripliserFactory requires a mapping file, as an InputStream. The resulting Tripliser can be used to process different inputs in different configurations. Both the factory and the Tripliser use a builder pattern to provide a clean syntax.
The following is the simplest possible usage of the Tripliser library:
Using the builder pattern, more options can be applied:
For further details on each builder method, please consult the javadoc.
Supplying functions
When using the Java library, functions can be supplied to enhance the query language. Currently, only Saxon XPath functions can be added. See the Saxon XPath Function Definitions section.
TripleGraphs
TripleGraphs can be produced by Tripliser, and are the encapsulation of a triple graph and associated meta-data. They comprise the following:
- A name
- A Jena model, containing the actual graph data
- A report, detailing issues with the conversion
- Tags, providing additional meta-data
Scope & Merging
Tripliser uses the concept of scope in various areas. The various scopes are defined as follows, broadly speaking, from the widest to the narrowest:
- MAPPING: The whole mapping file
- INPUT: Each primary input (not supporting) supplied to the tripliser
- GRAPH_MAPPING: Each graph element in the mapping file
- GRAPH (default): Each query result for each graph element in the mapping file
- RESOURCE_MAPPING: Each resource element in the mapping file
- RESOURCE: Each query result for each resource element in the mapping file
- PROPERTY: Each query result for each property element in the mapping file
Scopes are provided to the setMergeScope(scope) method. This determines how the graphs are merged during the conversion process. For example, if a merge scope of RESOURCE_MAPPING was specified, a graph would be produced for each resource element in the mapping file. In the universe example, we would produce one graph full of stars, and one graph full of planets, regardless of the input(s).
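As a rough illustration of the RESOURCE_MAPPING case, the sketch below groups already-produced resources by the resource element that created them. It is not Tripliser code: the ProducedResource class, the file names and the URIs are hypothetical, and only the grouping idea comes from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MergeScopeSketch {

    // Hypothetical stand-in for a resource produced during conversion.
    static final class ProducedResource {
        final String input;            // primary input the resource came from
        final String resourceMapping;  // name of the resource element that produced it
        final String uri;

        ProducedResource(String input, String resourceMapping, String uri) {
            this.input = input;
            this.resourceMapping = resourceMapping;
            this.uri = uri;
        }
    }

    // One bucket per resource element; the input is deliberately ignored.
    static Map<String, List<String>> mergeByResourceMapping(List<ProducedResource> resources) {
        Map<String, List<String>> graphs = new LinkedHashMap<>();
        for (ProducedResource r : resources) {
            List<String> graph = graphs.get(r.resourceMapping);
            if (graph == null) {
                graph = new ArrayList<>();
                graphs.put(r.resourceMapping, graph);
            }
            graph.add(r.uri);
        }
        return graphs;
    }

    // Sample data: two inputs, two resource elements ("stars" and "planets").
    static List<ProducedResource> sample() {
        return Arrays.asList(
            new ProducedResource("a.xml", "stars", "http://example.org/stars/sun"),
            new ProducedResource("a.xml", "planets", "http://example.org/planets/earth"),
            new ProducedResource("b.xml", "stars", "http://example.org/stars/sirius"),
            new ProducedResource("b.xml", "planets", "http://example.org/planets/mars"),
            new ProducedResource("b.xml", "planets", "http://example.org/planets/venus"));
    }
}
```

With the sample data, two buckets result (stars and planets) even though the resources come from two different inputs.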
TripleGraphCollections
TripleGraphCollections are produced by Tripliser.generateTripleGraphCollection(). In addition to providing an iteration over the triple graphs in the collection, the collection also contains a report of its own. This report records entries that are not specific to any particular graph, such as an error reading an input.
Triple graphs can also be obtained by name, from the collection (if this is appropriate to the chosen merge scope).
Iteration vs. Collection
There are two approaches to obtaining multiple triple graphs. One is to create a TripleGraphCollection with the generateTripleGraphCollection method; the other is to create a TripleGraphIterator with the getTripleGraphIterator method.
Using getTripleGraphIterator is much more likely to be efficient as data is processed and stored only when necessary. The efficiency gain will depend on various factors, particularly the merge scope, but the following example illustrates the difference.
Assuming the following:
- A mapping file with one graph element
- 100 input files
- 100 'instances' of a graph per input file (e.g. the graph XPath query returns 100 results for each input file)
Using the collection approach, 10,000 graphs will be added to a collection, each containing a report.
Using the iteration approach, each of the 10,000 graphs will be returned one at a time. At any one time, only one input file will be open and only one triple graph and one report will be in memory.
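The difference between the two approaches can be sketched in plain Java, independently of Tripliser. In this sketch a graph is reduced to a string, and the class and method names are hypothetical; the point is only that the list holds all 10,000 items at once, while the iterator builds each item on demand.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class IterationVsCollectionSketch {
    static final int INPUTS = 100;
    static final int GRAPHS_PER_INPUT = 100;

    // Stand-in for building one graph; a real graph would hold a model and a report.
    static String buildGraph(int input, int instance) {
        return "graph-" + input + "-" + instance;
    }

    // Collection approach: every graph is built and held before the caller sees any.
    static List<String> collectAll() {
        List<String> graphs = new ArrayList<>(INPUTS * GRAPHS_PER_INPUT);
        for (int i = 0; i < INPUTS; i++) {
            for (int g = 0; g < GRAPHS_PER_INPUT; g++) {
                graphs.add(buildGraph(i, g));
            }
        }
        return graphs;
    }

    // Iteration approach: each graph is built only when next() is called.
    static Iterator<String> iterate() {
        return new Iterator<String>() {
            private int produced = 0;

            @Override
            public boolean hasNext() {
                return produced < INPUTS * GRAPHS_PER_INPUT;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                String graph = buildGraph(produced / GRAPHS_PER_INPUT, produced % GRAPHS_PER_INPUT);
                produced++;
                return graph;
            }

            @Override
            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }
}
```

A caller that only needs to serialise each graph and move on can discard each item after use, so the iterator keeps the working set to a single graph.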
Mapping file
The mapping file describes how a particular format of input should be converted into triples. This is done using queries, which 'extract' the values from the input and 'inject' them into a graph. Below is an example mapping file:
Constants
To avoid replication of frequently occurring values throughout the mapping file, constants can be used. They are simple key value pairs, defined as follows:
The defined constant can then be applied, using ${name} syntax, elsewhere in the mapping file:
Constants can be used inside the following attributes of property mappings (about or property):
- prepend
- append
- value
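To make the ${name} syntax concrete, here is a minimal sketch of placeholder substitution in plain Java. It is not Tripliser's implementation: the class name, the sample constant and the choice to leave unknown names untouched are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ConstantSubstitutionSketch {
    // Matches ${name} and captures the name.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces each ${name} with its constant. Unknown names are left as-is
    // (an assumption of this sketch, not documented behaviour).
    static String apply(String attributeValue, Map<String, String> constants) {
        Matcher m = PLACEHOLDER.matcher(attributeValue);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = constants.get(m.group(1));
            if (replacement == null) {
                replacement = m.group(0);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Hypothetical constant, as it might be declared in a mapping file.
    static Map<String, String> sampleConstants() {
        Map<String, String> constants = new HashMap<>();
        constants.put("base", "http://example.org/universe/");
        return constants;
    }
}
```

For instance, a prepend attribute written as ${base}stars/ would resolve to http://example.org/universe/stars/ under the sample constant.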
Namespaces
Tripliser supports the use of namespace prefixes. Each declared namespace can be used elsewhere in the mapping file. Only namespaces that are used in any particular graph will be declared in resulting RDF serialisations. Namespaces are defined as follows:
The defined namespace can then be applied, using prefix:entity syntax, elsewhere in the mapping file:
Direct use of namespace prefixes is supported by the following attributes of property mappings (about or property):
- name
- dataType
- value
Where @dc_property results in a valid Dublin Core property.
The RDF namespace (http://www.w3.org/1999/02/22-rdf-syntax-ns#), with a prefix of 'rdf', is built-in by default.
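The prefix:entity syntax amounts to replacing a declared prefix with its namespace URI. The following sketch shows that expansion in plain Java; the class and method names are hypothetical, and the Dublin Core entry stands in for a namespace that a mapping file would have to declare itself.

```java
import java.util.HashMap;
import java.util.Map;

final class PrefixExpansionSketch {

    // 'rdf' is described as built-in; 'dc' represents a user-declared namespace.
    static Map<String, String> sampleNamespaces() {
        Map<String, String> namespaces = new HashMap<>();
        namespaces.put("rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#");
        namespaces.put("dc", "http://purl.org/dc/elements/1.1/");
        return namespaces;
    }

    // Expands prefix:entity to a full URI; names with no known prefix are returned unchanged.
    static String expand(String name, Map<String, String> namespaces) {
        int colon = name.indexOf(':');
        if (colon <= 0) {
            return name;
        }
        String namespace = namespaces.get(name.substring(0, colon));
        return namespace == null ? name : namespace + name.substring(colon + 1);
    }
}
```

Under these assumptions, dc:title expands to http://purl.org/dc/elements/1.1/title, while a full URI such as http://example.org/x passes through untouched because 'http' is not a declared prefix.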
Default namespace
The default namespace is extracted from the XML file's xmlns attribute. A minor performance gain can be achieved by indicating the default namespace in the mapping file as follows:
Graphs, Resources & Properties
The most important parts of the mapping file are the graphs, resources and properties, each defined as follows:
- Graph: A collection of resources intended to form a single triple graph
- Resource: A resource within a graph
- About: A special property to define the identity of the resource
- Property: A specific property of a resource
With the exception of the 'about' property, all the parts described above can also describe multiple things. This is the case when a query is used that returns multiple results, a multi-result-query. For example, if a multi-result-query is used on the graph element, a different graph will be produced for each result. Multiple resources and multiple properties work in the same way. Note that queries are often only potentially multi-result, where the number of results is not known until the input is supplied.
Properties common to graph and resource elements
There are several properties common to graph and resource elements. These are as follows:
- name: A name used to identify the graph or resource in reports. Also, if merge scope GRAPH_MAPPING is used, graphs that have names can be obtained from a TripleGraphCollection with the method getTripleGraphByName("name")
- query: The result(s) of this query are only used as input for the queries at a lower level. For example, XPath queries must return 'nodes' which are then passed as a 'context node' to the resources or properties.
- input: Determines the supporting input on which the query will be run by providing the input's name. If omitted, the query will run on the primary input.
- required: If true, raises the level of the report entry for a missing graph or resource from warning to error. This causes missing graphs or resources to result in a failed mapping. The default is equal to the strict attribute applied to the root element, which itself defaults to true.
- comment: A comment to describe the graph or resource mapping.
Property mapping
There are two forms of property mapping:
- The about element: This mapping generates the URI of the resource and is therefore mandatory.
- The property element: Top-level properties are placed inside the properties element and generate the properties of the resource. Properties can be nested to create anonymous resources (blank nodes). Properties that contain nested properties cannot have a value and must be a resource.
Property mappings provide the following attributes to configure their behaviour:
- prepend: The object's value or URI will be prepended with this attribute's value. Constants will be converted.
- append: The object's value or URI will be appended with this attribute's value. Constants will be converted.
- value: The object's value or URI will be this attribute's value, with any prepended or appended values applied. Constants will be converted.
- query: The object's value or URI will be the result of the query, with any prepended or appended values applied.
- input: Determines the supporting input on which the query will be run by providing the input's name. If omitted, the query will run on the primary input.
- resource: If true, the final value, created from the attributes above, is expected to be a URI. If false, it is expected to be a string value, unless another data type is supplied. Defaults to false normally, true for properties with nested properties.
- dataType: A datatype URI defining the type of the value. Values for XSD datatypes will be validated. If supplied, the property cannot be a resource URI.
- validationRegex: A regular expression (Java-compatible) which is applied after the value or resource URI have been generated. If a match is not found, the property is not added to the graph.
- required: If true, raises the level of the report entry for a missing property mapping from warning to error. This causes missing properties to result in a failed mapping. The default is equal to the strict attribute applied to the root element, which itself defaults to true.
- comment: A comment to describe the property mapping.
None of the attributes above are mandatory. However, a property mapping must satisfy one of the following conditions:
- Has a value attribute
- Has a query attribute
- Has nested properties
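The way prepend, append and validationRegex combine can be sketched as follows. This is an illustration in plain Java rather than Tripliser's own code: the class and method names are hypothetical, and using a "find anywhere" match (rather than a whole-string match) is an assumption based on the wording "if a match is not found".

```java
import java.util.regex.Pattern;

final class PropertyValueSketch {

    // Builds the final object value from a literal value or query result.
    // Returns null when the validation regex does not match, meaning the
    // property would not be added to the graph.
    static String build(String prepend, String valueOrQueryResult, String append, String validationRegex) {
        String result = (prepend == null ? "" : prepend)
                + valueOrQueryResult
                + (append == null ? "" : append);
        // Validation runs after prepend/append have been applied.
        if (validationRegex != null && !Pattern.compile(validationRegex).matcher(result).find()) {
            return null;
        }
        return result;
    }
}
```

For example, a query result of sun with a prepend of http://example.org/stars/ yields http://example.org/stars/sun, and a validationRegex of ^[0-9]+$ would reject any non-numeric value.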
Tags
The purpose of tags is to extract additional meta-data from the input, to allow post-processing of the triple graph. For example, you might extract an ID which is not used in the ontology, but uniquely identifies the resource in other systems.
Tags can be applied to graphs as follows:
Tag queries will be run when a graph is produced. The results of tag queries, instead of being added as triples, are provided as a simple key-value map, accessible as follows:
Strict mapping
The strict attribute applied to the root rdf-mapping element determines the default of the required attribute for all property mappings. If a required property fails to result in a new triple on the graph, then this is considered a failure in the report. Therefore, a strict mapping is more likely to fail.
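The defaulting rule can be written down as a small function. This is a sketch of the rule as described in this guide, not Tripliser's code; the class and method names are hypothetical, and null stands for an attribute that is absent from the mapping file.

```java
final class RequiredDefaultSketch {

    // strict:   the root element's strict attribute, or null if absent (defaults to true).
    // required: the mapping's own required attribute, or null if absent.
    static boolean isRequired(Boolean strict, Boolean required) {
        boolean effectiveStrict = strict == null || strict;
        // An explicit required attribute always wins over the strict default.
        return required == null ? effectiveStrict : required;
    }
}
```

So a mapping with neither attribute set is treated as required, while strict="false" on the root makes every mapping optional unless it sets required="true" itself.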
Reports
Tripliser produces at least one report for every conversion. Reports contain detail about the conversion process including:
- Status (error, warning, advice etc...)
- Scope (resource, graph, etc...)
- Property name
- A parent mapping, if one exists
- Cause (An exception class)
- Message
- Query details (from the mapping file)
Reports can be produced in the following forms:
- Logging: A report that logs to stdout
- Plain text: A human readable, line by line output of the report
- XML: An XML report with each field of each entry within its own element
Example
The following is an example report, in text and XML format:
[warning@property|foaf:name] (DatatypeFormatException) Lexical form 'Test Name' is not a legal instance of Datatype[http://www.w3.org/2001/XMLSchema#int -> class java.lang.Integer] Invalid value format {query='name'}
[failure@property|foaf:name] (InvalidPropertyMappingException) A required property did not have any triples created {query='name'}
[success@resource|resource1] {query='//simple'}
-----------------------
Overall status: failure
Success: false
Statuses
Report statuses indicate the successes or failures of a conversion process. The following statuses are used:
- SUCCESS: A successful conversion was made
- ADVICE: An event of note, but not a problem
- WARNING: An allowable problem with the conversion
- FAILURE: An unacceptable problem with the conversion
- ERROR: A more fundamental error, such as an IO error
The status of a given event can vary depending on the configuration of the mapping file. Required properties will result in a FAILURE if no valid value is found, whereas unrequired properties will only result in a WARNING. The strict attribute, applied to the root element of the mapping file, determines the default value for the required attribute on all properties.
The expression 'no valid value is found' is key to understanding how statuses are assigned. Most scenarios apply equally to graph, resource and property queries. The following table shows how different scenarios map to different statuses:
| Scenario | Status (required=true) | Status (required=false) |
|---|---|---|
| The input file cannot be read | ERROR | ERROR |
| The query syntax is invalid | FAILURE | FAILURE |
| The query results in no values | FAILURE | ADVICE |
| The query results in one value | SUCCESS* | SUCCESS* |
| The query results in multiple values | SUCCESS* | SUCCESS* |
| The query results in multiple values, only one of which is a valid URI (assuming resource=true) | WARNING (for each URI conversion failure) and SUCCESS* | WARNING (for each URI conversion failure) and SUCCESS* |
| The query results in multiple values, none of which is a valid URI (assuming resource=true) | WARNING (for each URI conversion failure) and FAILURE | WARNING (for each URI conversion failure) and ADVICE |
| The query results in multiple values, but all fail the validation regular expression | WARNING (for each regex mismatch) and FAILURE | WARNING (for each regex mismatch) and ADVICE |
*SUCCESS status is used only to indicate a successfully created resource, not for every successful query
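The core of the table can be condensed into a small decision function. The sketch below is an interpretation, not Tripliser's implementation: the names are hypothetical, the per-value WARNING entries are left out, and the idea that the overall status is the most severe entry is an assumption (it is consistent with the example report above, where a warning, a failure and a success give an overall status of failure).

```java
final class StatusSketch {

    // Listed in the same order as the statuses above, assumed to be ascending severity.
    enum Status { SUCCESS, ADVICE, WARNING, FAILURE, ERROR }

    // Final status for one query, given how many valid values it produced.
    // Note: per the table's footnote, SUCCESS is only actually reported for created resources.
    static Status forQuery(boolean required, boolean syntaxValid, int validValues) {
        if (!syntaxValid) {
            return Status.FAILURE;   // invalid query syntax fails regardless of 'required'
        }
        if (validValues > 0) {
            return Status.SUCCESS;   // at least one usable value
        }
        return required ? Status.FAILURE : Status.ADVICE;
    }

    // Assumed roll-up: the overall status is the most severe entry.
    static Status overall(Status... entries) {
        Status worst = Status.SUCCESS;
        for (Status s : entries) {
            if (s.compareTo(worst) > 0) {
                worst = s;
            }
        }
        return worst;
    }
}
```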
XPath query language
The XPath engine used by Tripliser is the Saxon-HE library, which provides support for XPath 2.0. When using XPath as the query language (the default behaviour), certain rules apply to how the context of the query affects its execution. These are outlined in the following sections.
Context node inheritance
If a query returns one or multiple node results, each node is passed down the graph->resource->property hierarchy and used as context for the next. For example, assuming the source XML:
If a resource uses the XPath query:
Each 'thing' element will be passed as context to the properties, allowing a property to define a relative XPath query:
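Context node inheritance can be demonstrated outside Tripliser with the XPath engine bundled in the JDK (XPath 1.0, used here instead of Saxon so the sketch has no dependencies). The XML document, the name child element and the class name are hypothetical; only the 'thing' element comes from the description above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ContextNodeSketch {
    // Hypothetical source document.
    static final String XML =
        "<things><thing><name>Sun</name></thing><thing><name>Sirius</name></thing></things>";

    static List<String> names() {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // 'Resource' query: one result node per thing element.
            NodeList things = (NodeList) xpath.evaluate(
                "//thing", new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < things.getLength(); i++) {
                Node thing = things.item(i);
                // 'Property' query: relative path, evaluated against the inherited context node.
                names.add(xpath.evaluate("name", thing));
            }
            return names;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Each relative query sees only its own thing element, so the two results stay paired with the resource they belong to.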
Resetting context
XPath allows the context to be reset by using the '//' prefix. This does not circumvent the multiplicity of a query result, but allows for absolute references.
Another alternative is to switch document using the input attribute. Once the input document has been changed, the context node is ignored.
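The effect of the '//' prefix can be seen with the JDK's built-in XPath 1.0 engine (again used in place of Saxon to keep the sketch dependency-free). The document and class name below are hypothetical. Starting from a single context node, the relative query sees only that node's children, whereas the '//' query searches the whole document.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ResetContextSketch {
    // Hypothetical source document.
    static final String XML =
        "<things><thing><name>Sun</name></thing><thing><name>Sirius</name></thing></things>";

    // Returns { matches for relative "name", matches for absolute "//name" },
    // both evaluated with the first thing element as context node.
    static int[] counts() {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node first = (Node) xpath.evaluate(
                "//thing[1]", new InputSource(new StringReader(XML)), XPathConstants.NODE);
            NodeList relative = (NodeList) xpath.evaluate("name", first, XPathConstants.NODESET);
            NodeList absolute = (NodeList) xpath.evaluate("//name", first, XPathConstants.NODESET);
            return new int[] { relative.getLength(), absolute.getLength() };
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Note that the absolute query is still evaluated once per context node, which is why resetting the context does not change how many times a nested query runs.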
Saxon XPath Function Definitions
When creating a TripliserFactory, Saxon XPath functions can be supplied for use in the XPaths of the mapping file. These functions need to extend the ExtensionFunctionDefinition class.
These functions can be added via this method:
Below is an example function to reverse a String:
Once installed, the function can be used in the mapping file as follows:
Copyright (C) 2010, 2011 by David Rogers