This page provides an example for running queries against RDF data using SemWeb. Three query methods are supported. The first is the GraphMatch engine which does simple entailment, matching a simple graph with no disjunctions or optional edges against target data. The second method runs SPARQL queries against any data source supported by SemWeb using a SPARQL query engine. The third method passes SPARQL queries to a remote SPARQL endpoint over HTTP.
We'll use an RDF description of the people in the U.S. Congress for this example. Download the data files at http://www.govtrack.us/data/rdf/people.rdf.gz, http://www.govtrack.us/data/rdf/bills.108.rdf.gz, and http://www.govtrack.us/data/rdf/bills.108.cosponsors.rdf.gz and un-gzip them (on Windows use WinZip).
To simplify things, we'll put the contents of these three files into a single Notation3 file using the following command. (You may need to adjust the path to rdfstorage.exe. It should be in SemWeb's bin directory.)
$ mono rdfstorage.exe --out n3:congress.n3 people.rdf bills.108.rdf bills.108.cosponsors.rdf
rdfstorage.exe reads RDF files into a StatementSink, either an RdfWriter or a Store. The default is to read files in RDF/XML format (with the RdfXmlReader). We specified the output as n3:congress.n3, which means to write the data in Notation 3 (N3) format to the file congress.n3. The command outputs the following:
people.rdf 0m5s, 106423 statements, 19041 st/sec
bills.108.rdf 0m13s, 212142 statements, 15866 st/sec
bills.108.cosponsors.rdf 0m8s, 145743 statements, 16814 st/sec
Total Time: 0m27s, 464308 statements, 16787 st/sec
The first query method is the GraphMatch method using my own "RSquary" query format, which is actually just plain RDF (think RDF-squared query because it's an RDF query over RDF data). A simple RSquary query is just a graph to be matched against the target data model, here in N3 format:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix bill: <tag:govshare.info,2005:rdf/usbill/> .

?bill rdf:type bill:SenateBill .
?bill bill:congress "108" .
?bill bill:number "1024" .
?bill bill:cosponsor ?person .
?person foaf:name ?name .
A benefit of using N3 is that it allows entity names starting with "?", which the N3Reader reads in as Variable objects. (The Variable class is a subclass of BNode.) In fact, BNodes in queries are treated as variables too. This makes sense because a BNode in the query graph could not possibly match a BNode in the target data model, since a BNode cannot appear in two documents. So only named entities (with URIs) and literals are used to match against the target data model.
The query above says: Find all bindings for the variables ?bill, ?person, and ?name such that 1) ?bill is a Senate bill identified by congress 108 and number 1024, 2) ?bill has ?person as one of its cosponsors, and 3) ?name is a name of ?person.
Save the above query as congress_query.n3.
SemWeb contains a program called rdfquery.exe, which runs a query against a target data model. To run the query, execute:
$ mono rdfquery.exe n3:congress.n3 < congress_query.n3
rdfquery.exe reads a query from standard input (hence the redirect) and matches it against the data sources listed as arguments on the command line. It will take a few moments to load the roughly 464,000 statements from the congress.n3 file before it outputs the results. By default, the output is in the standard SPARQL result XML format, shown below. (Some XML comments appear at the top to tell you how the query was executed; they are omitted here.)
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="bill" />
    <variable name="person" />
    <variable name="name" />
  </head>
  <results ordered="false" distinct="false">
    <result>
      <binding name="bill"> <uri>tag:govshare.info,2005:data/us/congress/108/bills/s1024</uri> </binding>
      <binding name="person"> <uri>tag:govshare.info,2005:data/us/congress/people/C001041</uri> </binding>
      <binding name="name"> <literal>Hillary Clinton</literal> </binding>
    </result>
    <result>
      <binding name="bill"> <uri>tag:govshare.info,2005:data/us/congress/108/bills/s1024</uri> </binding>
      <binding name="person"> <uri>tag:govshare.info,2005:data/us/congress/people/C000880</uri> </binding>
      <binding name="name"> <literal>Michael Crapo</literal> </binding>
    </result>
    <result>
      <binding name="bill"> <uri>tag:govshare.info,2005:data/us/congress/108/bills/s1024</uri> </binding>
      <binding name="person"> <uri>tag:govshare.info,2005:data/us/congress/people/L000174</uri> </binding>
      <binding name="name"> <literal>Patrick Leahy</literal> </binding>
    </result>
    <result>
      <binding name="bill"> <uri>tag:govshare.info,2005:data/us/congress/108/bills/s1024</uri> </binding>
      <binding name="person"> <uri>tag:govshare.info,2005:data/us/congress/people/M001153</uri> </binding>
      <binding name="name"> <literal>Lisa Murkowski</literal> </binding>
    </result>
    <result>
      <binding name="bill"> <uri>tag:govshare.info,2005:data/us/congress/108/bills/s1024</uri> </binding>
      <binding name="person"> <uri>tag:govshare.info,2005:data/us/congress/people/M001111</uri> </binding>
      <binding name="name"> <literal>Patty Murray</literal> </binding>
    </result>
    <result>
      <binding name="bill"> <uri>tag:govshare.info,2005:data/us/congress/108/bills/s1024</uri> </binding>
      <binding name="person"> <uri>tag:govshare.info,2005:data/us/congress/people/S000148</uri> </binding>
      <binding name="name"> <literal>Charles Schumer</literal> </binding>
    </result>
  </results>
</sparql>
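Because the result format is plain XML, the output of rdfquery.exe can be consumed by any program. The following is a minimal sketch, independent of SemWeb, that reads such a document with System.Xml.Linq and returns one dictionary per result. The class name SparqlResultsParser is made up for this illustration, and the code assumes the elements are in the http://www.w3.org/2005/sparql-results# namespace, as in the listing above.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical helper, not part of SemWeb.
public static class SparqlResultsParser
{
    // Namespace used by the SPARQL result XML format.
    static readonly XNamespace Ns = "http://www.w3.org/2005/sparql-results#";

    // Returns one dictionary per <result>, mapping variable name to value text.
    public static List<Dictionary<string, string>> Parse(string xml)
    {
        var rows = new List<Dictionary<string, string>>();
        XDocument doc = XDocument.Parse(xml);
        foreach (XElement result in doc.Descendants(Ns + "result"))
        {
            var row = new Dictionary<string, string>();
            foreach (XElement binding in result.Elements(Ns + "binding"))
            {
                string name = (string)binding.Attribute("name");
                // A binding wraps a single child such as <uri> or <literal>;
                // take the first child element, whatever its kind.
                XElement value = null;
                foreach (XElement child in binding.Elements()) { value = child; break; }
                if (name != null && value != null) row[name] = value.Value;
            }
            rows.Add(row);
        }
        return rows;
    }
}
```

If the output above were saved to a file, the cosponsor names could then be listed by looping over SparqlResultsParser.Parse(File.ReadAllText(path)) and printing row["name"] for each row.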
The query took 15 seconds to execute on my machine, with a good portion of that just loading the data from the file into memory. We could speed things up by first putting the RDF data into a SQL database and then querying the database directly. This way, the data is not loaded into memory and queries against the database make use of indexes already present.
For executing SPARQL queries over data sources that don't support SPARQL themselves (e.g. the MemoryStore and SQLStore), SemWeb uses the SPARQL query engine by Ryan Levering. The library is written in Java, but for SemWeb I convert it to a .NET assembly using IKVM.
The advantage of SPARQL over RSquary is that it supports much more complex queries, including optional statements, disjunctions/unions, and special filters.
The equivalent of the query above in SPARQL is:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX bill: <tag:govshare.info,2005:rdf/usbill/>
SELECT ?bill ?person ?name
WHERE {
  ?bill rdf:type bill:SenateBill .
  ?bill bill:congress "108" .
  ?bill bill:number "1024" .
  ?bill bill:cosponsor ?person .
  ?person foaf:name ?name .
}
Put that in congress_query.sparql and then run it with:
$ mono rdfquery.exe -type sparql n3:congress.n3 < congress_query.sparql
This has the same output as above.
You can also use the rdfquery tool to query a remote SPARQL endpoint. Invoke it like this:
echo "DESCRIBE <tag:govshare.info,2005:data/us>" | \
    mono bin/rdfquery.exe -type sparql sparql-http:http://www.govtrack.us/sparql
To run a query from a program, you need to 1) create a Query object, 2) create a QueryResultSink object that will receive the results of the query, and 3) call Run on the Query object.
There are two types of Query objects: GraphMatch objects, which perform simple entailment queries (i.e. RSquary), and SparqlEngine objects, which perform SPARQL queries.
The GraphMatch class takes a graph with variables and figures out all of the ways the variables can be assigned ("bound") to values in the target data model so that the statements in the query are all found in the target data model. Each set of variable assignments becomes a result.
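To make the idea of binding variables concrete, here is a toy matcher that does not use SemWeb at all. It is only a sketch of the general technique (a backtracking search over the query statements), not a description of how GraphMatch is actually implemented. Statements are plain three-element string arrays, terms starting with "?" are variables, and the name ToyGraphMatch is invented for this illustration.

```csharp
using System;
using System.Collections.Generic;

// Toy illustration only; SemWeb's GraphMatch may work quite differently.
public static class ToyGraphMatch
{
    // Terms starting with "?" are variables; everything else must match exactly.
    static bool IsVar(string term) { return term.Length > 0 && term[0] == '?'; }

    // Tries to match one query term against one data term under the current bindings.
    static bool Unify(string pattern, string value, Dictionary<string, string> bindings)
    {
        if (!IsVar(pattern)) return pattern == value;
        string bound;
        if (bindings.TryGetValue(pattern, out bound)) return bound == value;
        bindings[pattern] = value;
        return true;
    }

    // Returns every set of bindings under which all query statements occur in the data.
    public static List<Dictionary<string, string>> Match(string[][] query, string[][] data)
    {
        var results = new List<Dictionary<string, string>>();
        Extend(query, data, 0, new Dictionary<string, string>(), results);
        return results;
    }

    static void Extend(string[][] query, string[][] data, int i,
                       Dictionary<string, string> bindings,
                       List<Dictionary<string, string>> results)
    {
        // All query statements matched: this set of bindings is one result.
        if (i == query.Length) { results.Add(bindings); return; }
        foreach (string[] triple in data)
        {
            // Work on a copy so a failed attempt leaves earlier bindings untouched.
            var trial = new Dictionary<string, string>(bindings);
            if (Unify(query[i][0], triple[0], trial) &&
                Unify(query[i][1], triple[1], trial) &&
                Unify(query[i][2], triple[2], trial))
            {
                Extend(query, data, i + 1, trial, results);
            }
        }
    }
}
```

Given the two data statements (ex:s1024, bill:cosponsor, ex:murray) and (ex:murray, foaf:name, "Patty Murray"), the query statements (?bill, bill:cosponsor, ?person) and (?person, foaf:name, ?name) produce a single result that binds all three variables. This sketch scans the whole data set for every query statement, which is fine for illustration but would be far too slow for a data set the size of congress.n3.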
Create a GraphMatch object (the class is in the SemWeb.Query namespace) by passing a StatementSource to its constructor. Remember that RdfReaders and MemoryStores are StatementSources, so you can pass it either a reader over a file or a store in which you've programmatically constructed the query.
Query query = new GraphMatch(new N3Reader(queryfile));
Then load the data that the query will be run against:
MemoryStore data = new MemoryStore();
data.Import(new N3Reader(datafile));
Next, create a QueryResultSink. This class has an Add method, called once for each query result, that receives the variable bindings for that result. The variable bindings say how each variable in the query was bound to a resource in the target data model. There is one implementation of this class in SemWeb, the SparqlXmlQuerySink, which writes the standardized XML output format for SPARQL results. Note that you can use this output format with any Query object, not just the SparqlEngine class. The constructor takes a TextWriter or XmlWriter to which the results are written.
QueryResultSink sink = new SparqlXmlQuerySink(Console.Out);
You can, of course, create your own subclass of QueryResultSink, which you will have to do if you want to do anything interesting with the results of the query. Here's an example QueryResultSink that simply prints the variable bindings to the console. (Note that there are several other methods that can be overridden, which are executed at the start and end of the query.)
public class PrintQuerySink : QueryResultSink {
    public override bool Add(VariableBindings result) {
        foreach (Variable var in result.Variables) {
            // Skip anonymous variables and variables left unbound.
            if (var.LocalName != null && result[var] != null) {
                Console.WriteLine(var.LocalName + " ==> " + result[var].ToString());
            }
        }
        // Blank line to separate one result from the next.
        Console.WriteLine();
        return true;
    }
}
Lastly, run the query with Run, passing it the target data model and the result sink.
query.Run(data, sink);
To create a SPARQL query instead, construct a new SparqlEngine object (in the SemWeb.Query namespace but in the separate SemWeb.Sparql.dll assembly!).
Query query = new SparqlEngine(new StreamReader(queryfile));
Run the query the same way as with GraphMatch. There are several types of SPARQL queries, not all of which result in a list of variable bindings. For instance, the DESCRIBE and CONSTRUCT query types return RDF triples. You can run queries generically and output the results to a TextWriter just by passing a TextWriter to Run instead of a QueryResultSink. Or, see the API documentation on the SparqlEngine class for more control over the output of SPARQL queries.
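As a sketch of the TextWriter variant just described, the following program runs a query of any type and writes whatever it produces to standard output. It uses only calls that appear elsewhere on this page; the class name DescribeExample and the argument order are chosen for this illustration, and the program assumes you compile against both SemWeb.dll and SemWeb.Sparql.dll.

```csharp
using System;
using System.IO;
using SemWeb;
using SemWeb.Query;

public class DescribeExample {
    public static void Main(string[] argv) {
        if (argv.Length < 2) {
            Console.WriteLine("Usage: describe.exe queryfile datafile");
            return;
        }

        // argv[0]: file holding a SPARQL query, e.g. a DESCRIBE or CONSTRUCT query.
        SparqlEngine query = new SparqlEngine(new StreamReader(argv[0]));

        // argv[1]: N3 data file, e.g. congress.n3.
        MemoryStore data = new MemoryStore();
        data.Import(new N3Reader(argv[1]));

        // Pass a TextWriter instead of a QueryResultSink, as described above.
        query.Run(data, Console.Out);
    }
}
```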
An entire program for querying is below:
// This example runs a query.

using System;
using System.IO;

using SemWeb;
using SemWeb.Query;

public class Example {

    public static void Main(string[] argv) {
        if (argv.Length < 3) {
            Console.WriteLine("Usage: query.exe format queryfile datafile");
            return;
        }

        string format = argv[0];
        string queryfile = argv[1];
        string datafile = argv[2];

        Query query;

        if (format == "rsquary") {
            // Create a simple-entailment "RSquary" query
            // from the N3 file.
            query = new GraphMatch(new N3Reader(queryfile));
        } else {
            // Create a SPARQL query by reading the file's
            // contents.
            query = new SparqlEngine(new StreamReader(queryfile));
        }

        // Load the data file from disk
        MemoryStore data = new MemoryStore();
        data.Import(new N3Reader(datafile));

        // First, print results in SPARQL XML Results format...

        // Create a result sink where results are written to.
        QueryResultSink sink = new SparqlXmlQuerySink(Console.Out);

        // Run the query.
        query.Run(data, sink);

        // Second, print the results via our own custom QueryResultSink...
        query.Run(data, new PrintQuerySink());
    }

    public class PrintQuerySink : QueryResultSink {
        public override bool Add(VariableBindings result) {
            foreach (Variable var in result.Variables) {
                // Skip anonymous variables and variables left unbound.
                if (var.LocalName != null && result[var] != null) {
                    Console.WriteLine(var.LocalName + " ==> " + result[var].ToString());
                }
            }
            // Blank line to separate one result from the next.
            Console.WriteLine();
            return true;
        }
    }
}
It is also possible to run SPARQL queries directly against remote HTTP endpoints. The rdfquery.exe command-line program can be used to run queries directly. Take the following query in the file "dbp.q" to query the DBpedia database (a semantified Wikipedia) for all statements that use the literal "John McCain":
SELECT * WHERE { ?s ?p "John McCain" . }
Run this against the remote SPARQL endpoint at http://DBpedia.org/sparql using:
mono bin/rdfquery.exe sparql-http:http://DBpedia.org/sparql -type sparql < dbp.q
The output is given below:
<sparql>
  <head>
    <variable name="s"/>
    <variable name="p"/>
  </head>
  <results distinct="false" ordered="true">
    <result>
      <binding name="s"><uri>http://dbpedia.org/resource/John_McCain</uri></binding>
      <binding name="p"><uri>rdfs:label</uri></binding>
    </result>
    <result>
      <binding name="s"><uri>http://dbpedia.org/resource/John_McCain</uri></binding>
      <binding name="p"><uri>http://dbpedia.org/property/name</uri></binding>
    </result>
  </results>
</sparql>
It is also possible to query remote endpoints programmatically using the SemWeb.Remote.SparqlHttpSource class. For example:
SparqlHttpSource source = new SparqlHttpSource("http://DBpedia.org/sparql");
source.RunSparqlQuery("SELECT * WHERE { ?s ?p \"John McCain\" . }", Console.Out);
(or)
source.RunSparqlQuery("SELECT * WHERE { ?s ?p \"John McCain\" . }", new SparqlXmlQuerySink(Console.Out));
There are other overloads of RunSparqlQuery that provide better access to the results than dumping the output to a TextWriter. See the API documentation for details.
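For readers curious about what travels over the wire, the standard SPARQL protocol lets a client send a query as the "query" parameter of an HTTP GET request. The sketch below shows that request made by hand with the .NET class library, without SemWeb. It is not a description of how SparqlHttpSource is implemented (which may, for instance, use POST), and the class and method names are invented for this illustration.

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of SemWeb.
public static class SparqlHttp
{
    // Builds a GET URL carrying the query text in the "query" parameter.
    public static string BuildQueryUrl(string endpoint, string sparql)
    {
        // Append with "&" if the endpoint URL already has a query string.
        string separator = endpoint.Contains("?") ? "&" : "?";
        return endpoint + separator + "query=" + Uri.EscapeDataString(sparql);
    }

    // Sends the query and returns the raw response body.
    // Requires network access to the endpoint.
    public static string Run(string endpoint, string sparql)
    {
        using (WebClient client = new WebClient())
        {
            // Ask for the SPARQL result XML format.
            client.Headers[HttpRequestHeader.Accept] = "application/sparql-results+xml";
            return client.DownloadString(BuildQueryUrl(endpoint, sparql));
        }
    }
}
```

Calling SparqlHttp.Run("http://DBpedia.org/sparql", "SELECT * WHERE { ?s ?p \"John McCain\" . }") would then return the endpoint's response as a string, assuming the endpoint is reachable and accepts GET requests.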