by Brian D. Eubanks

Wicked Cool Java: Crawling the Semantic Web

Dec 5, 2005 · 18 mins

Get started with RDF

In this article, we examine techniques for extracting and processing data in the World Wide Web and the Semantic Web. The World Wide Web completely changed the way that people access information. Before the Web existed, finding obscure pieces of information meant taking a trip to the library, along with hours or perhaps days of research. In extreme cases, it meant calling or writing a letter to an expert and waiting for a reply. Today, not only are there Websites on every imaginable topic, but there are search engines, encyclopedias, dictionaries, maps, news, electronic books, and an incredible array of other data available online. Using search engines, we can find information on any topic within a few seconds. The Google search engine has even become so well known that it is now often used as a verb: “I Googled a solution.” Online information is growing exponentially, and because of it, we have a completely new problem on our hands that is not solved by simply using keyword searches to find our data. The problem is infoglut. Keyword searches return too many documents, and most of those documents don’t have the information that we want.

Suppose that we wanted to search for a Java class library that converts data from one format to another. With all the open source projects out there, someone may have already solved the problem for us, and we’d rather not reinvent the wheel. In theory, we should be able to search for matching projects that meet our needs. But running a query on related keywords may give us many results that are not related to what we really want. In an ideal world, we should be able to ask the computer a question: “Is there an open source Java API that converts between FORMAT1 and FORMAT2?” The computer should then search the Web and give us the name of a suitable API if it exists, along with a short description of the standard and links to more detailed information. For this to happen, information about a hypothetical “J-convert-1-2” API would need to be encoded in such a way that the computer can find it easily without performing a keyword search and extracting data from the text results.

Information on the World Wide Web is mostly free-form text contained in HTML pages and is mostly not organized into categories and structures that search programs can easily query. At the very least, all Web content ought to have subject indicators similar to the Library of Congress and Dewey Decimal codes for books. This is not yet the case, although it will most likely happen soon. Several new standards are rapidly leading us in that direction. So far, all of these standards rely on Web content developers adding special tags to their data, and few developers know about these standards at the present time. In short, it’s a mess out there, and we’re trudging through this messy data looking for nuggets of gold.

The Semantic Web is the next-generation web of concepts linked to other concepts, rather than a collection of hypertext documents linked by keywords. If you think about it, an HTML anchor tag (link) is a keyword reference to another document. It supplies a word or phrase that links to another document, usually displayed as underlined text in a browser. But the link doesn’t say how the two documents are related to each other, and its text may be extremely vague. A new standard, the Resource Description Framework (RDF), makes it possible to be much more specific about how things are related to each other. In fact, RDF describes much more than documents: any entities or concepts can be linked together. This is the basic idea behind the Semantic Web: concepts, rather than documents, can be linked together.

As Java developers, how can we participate in building the Semantic Web? First, you’ll need to know something about official standards such as RDF. You will then need to tag your documents appropriately. Many sites are already starting to do some of this by creating RDF Site Summary (RSS) feeds. An RSS feed syndicates the content from a Website so that it can be combined with information from other sites and delivered to the users as aggregated content. RSS makes a small portion of a site available as a summary, similar to what you see in an article or news abstract. However, RSS enabling is only the first step in moving toward a Semantic Web. In this article I discuss enough to get you started working with RDF and introduce some APIs that help in producing or consuming content.

This somethings that: A short introduction to N3 and Jena

The theory behind the RDF standard is actually quite simple. Everything has a Uniform Resource Identifier (URI), and, by this, I mean everything: not only documents, but also generic concepts and relationships between them. Even though you are not a document (Or are you?), there could be a URI assigned to represent you as an entity. This URI can then be used to make connections to other things. For the “you” URI, these connections might represent related organizations, addresses, and phone numbers. URIs do not have to return an actual document! This is what sometimes confuses developers when they see a URI referenced somewhere and find that there is nothing at the location. These addresses are often used as markers or unique identifiers to represent concepts. We make links between URIs to represent relationships between things. This functions much like a simple sentence in English: Programmers enjoy Java.

To begin with, let’s use a shorthand notation, called N3, to encode this sentence as an RDF graph. N3 is an easy way to learn RDF because the syntax is only slightly more complex than the sentence above! In essence, N3 is merely a set of triples, or “subject-predicate-object” relationships. Here is the N3 version of the sentence:

 @prefix wcj: <http://example.org/wcjava/uri/> .
wcj:programmers wcj:enjoy wcj:java .

We first define a prefix to make the N3 code less verbose. The prefix stands for the beginning of a URI wherever it appears in the document, so wcj:java becomes http://example.org/wcjava/uri/java. (In the prefix declaration the URI is written between < and > markers; these have nothing to do with XML.) The three items together are called a triple, and the verb is usually called a predicate. RDF makes a link by stating that a subject URI is related by a predicate URI to an object URI. The predicate represents some relationship between the subject and object: it tells how things link together. This is very different from an anchor in HTML, because here the relationship type is clearly defined. Remember that URIs in RDF can identify anything, whether concepts or documents, and that the object of a triple may in some cases be a string literal instead of a URI. In theoretical terms, we are creating a labeled directed graph of the relationship. A graph representation of the above might look like Figure 1.
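To make the prefix mechanism concrete, here is a minimal sketch in plain Java of how a prefixed name could be expanded into a full URI and written out as a triple. The class PrefixExpander and its methods are hypothetical names invented for this illustration; they are not part of Jena or of any N3 parser, and a real parser handles many more cases.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Jena.
class PrefixExpander {
    // Maps a prefix such as "wcj" to the namespace URI it stands for.
    private final Map<String, String> prefixes = new HashMap<>();

    void declare(String prefix, String namespaceUri) {
        prefixes.put(prefix, namespaceUri);
    }

    // Turns "wcj:java" into a full URI. Names without a colon, or with
    // a prefix that was never declared, are returned unchanged.
    String expand(String name) {
        int colon = name.indexOf(':');
        if (colon < 0) {
            return name;
        }
        String namespace = prefixes.get(name.substring(0, colon));
        return namespace == null ? name : namespace + name.substring(colon + 1);
    }

    // Writes one subject-predicate-object triple with full URIs.
    String triple(String subject, String predicate, String object) {
        return "<" + expand(subject) + "> <" + expand(predicate)
                + "> <" + expand(object) + "> .";
    }
}
```

After declaring the prefix wcj for http://example.org/wcjava/uri/, calling expand("wcj:java") returns http://example.org/wcjava/uri/java, which is the expansion described above.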

As you might expect, there is a Java API for creating and managing RDF and N3 documents. Jena is an open source API for working with RDF graphs. Here is one way to create the graph in Jena and serialize it to an N3 document:

 

import com.hp.hpl.jena.rdf.model.*;
import java.io.FileOutputStream;

Model model = ModelFactory.createDefaultModel();
Resource programmers =
   model.createResource("http://example.org/wcjava/uri/programmers");
Property enjoy =
   model.createProperty("http://example.org/wcjava/uri/enjoy");
Resource java =
   model.createResource("http://example.org/wcjava/uri/java");
model.add(programmers, enjoy, java);
FileOutputStream outStream = new FileOutputStream("out.n3");
model.write(outStream, "N3");
outStream.close();

Here, Jena is using the term property to refer to the predicate and resource to refer to something used as a subject or object. The model’s write() method also has options to write out the document in other formats besides N3. With the Jena API, you can connect many entities together into very large semantic networks. Let’s make some additional relationships using the entities and relationships that we just created. We will produce the graph shown in Figure 2.

Here is the additional code to produce the network in Figure 2:

 Property typeOf =
   model.createProperty("http://example.org/wcjava/typeOf");
Property use =
   model.createProperty("http://example.org/wcjava/use");
Property understand =
   model.createProperty("http://example.org/wcjava/understand");
Resource computers =
   model.createResource("http://example.org/wcjava/computers");
Resource progLang =
   model.createResource("http://example.org/wcjava/progLang");
model.add(java, typeOf, progLang);
model.add(programmers, use, computers);
model.add(computers, understand, progLang);
FileOutputStream outStream2 = new FileOutputStream("out2.n3");
model.write(outStream2, "N3");
outStream2.close();

The N3 output of this code is the following:

 

<http://example.org/wcjava/uri/java> <http://example.org/wcjava/typeOf> <http://example.org/wcjava/progLang> .

<http://example.org/wcjava/computers> <http://example.org/wcjava/understand> <http://example.org/wcjava/progLang> .

<http://example.org/wcjava/uri/programmers> <http://example.org/wcjava/uri/enjoy> <http://example.org/wcjava/uri/java> ; <http://example.org/wcjava/use> <http://example.org/wcjava/computers> .

The semicolon in the N3 document is a shortcut that indicates we are going to attach another property to the same subject (“programmers enjoy Java, and programmers use computers”). The meanings of elements within a document are often defined in terms of a predefined set of resources and properties called a vocabulary. Your RDF data can be combined with other data in existing vocabularies to allow semantic searches and analysis of complex RDF graphs. In the next section, I illustrate how to build upon existing RDF vocabularies to build your own vocabulary.
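The semicolon shortcut amounts to grouping triples by subject before writing them out. The following is a minimal sketch of that idea in plain Java, assuming every subject, predicate, and object is a URI. The class SubjectGrouper is a hypothetical name used only here; it is not how Jena's N3 writer is implemented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the N3 semicolon shortcut; not Jena code.
class SubjectGrouper {
    // subject URI -> its "<predicate> <object>" pairs, in insertion order
    private final Map<String, List<String>> bySubject = new LinkedHashMap<>();

    void add(String subject, String predicate, String object) {
        bySubject.computeIfAbsent(subject, k -> new ArrayList<>())
                 .add("<" + predicate + "> <" + object + ">");
    }

    // Writes one statement per subject, joining that subject's
    // predicate-object pairs with semicolons.
    String toN3() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, List<String>> entry : bySubject.entrySet()) {
            out.append("<").append(entry.getKey()).append("> ")
               .append(String.join(" ; ", entry.getValue()))
               .append(" .\n");
        }
        return out.toString();
    }
}
```

Adding the two programmers triples and then calling toN3() yields a single statement in which the subject appears once and the two predicate-object pairs are separated by a semicolon.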

Triple the fun: Creating an RDF vocabulary for your organization

An RDF graph creates a web of concepts. It makes assertions about logical relationships between entities. RDF was meant to fit into a dynamic knowledge representation system rather than a static database structure. Once you have information in RDF, it can be linked with graphs made elsewhere, and software can use this to make inferences. If you define how your own items are related in terms of higher-level concepts, your data can fit into a much larger web of concepts. This is the basis of the Semantic Web.

Every organization has relationships between information that is held in a datastore such as a database or flat file (or human memory!). If your data is in a relational database, your data items probably have relationships between them that are hidden or implied within the database structure itself. Your data may not be completely accessible, because there are relationships that an application cannot query. As an example, suppose that we have a relational database containing employees and departments within a company. A common approach is to create an Employee table, with columns for employee information such as ID number, date of birth, name, hire date, supervisor name, and department. There are many relationships hidden within the table and column names, and it is up to an application to know these relationships and take advantage of them. Column names alone would not give you the following information:

  • A and B are employees
  • An employee is a person
  • A supervisor is an employee who directs another employee
  • C is a company
  • A company is an organization
  • A and B work for C

Column and table names in a database are simply local identifiers and don’t automatically map to any concepts that might be defined elsewhere. But this is domain knowledge that could be used more effectively by the application if it were defined in an extensible and machine-readable way. Having such information available would give our applications more flexibility, and this knowledge could also be reused elsewhere. How can we encode this information so that applications can make use of these relationships? And how can our application relate this to other information that we might find on the Semantic Web?

It may not make sense to put this metadata in your database, but you can create an RDF mapping outside the database schema that describes each item relative to the Semantic Web as a whole. We can represent some of these concepts using existing vocabularies. The rest of them we can define in our own terms. If you don’t know where to connect a concept to an existing vocabulary, you can always define a URI for that concept now and make the connection to other systems later. At least you can use it to share data within your own organization if your vocabulary is well documented and the meaning of each item is clear. There are many basic vocabularies that RDF applications can use, and new ones are constantly being created (like yours!).

The first step is to define a URI for each concept that is even remotely related to your application. This is much like the object-oriented development process, but these entities may also be things that are not directly used by the application. By defining your terms within a larger context, you can later map these entities to existing concepts on the Web. Let’s try it with our employee example by first listing some related concepts and their meanings (in English text). Here is a simplistic attempt to define some terms:

  • http://example.org/wcjava/employee = an employee
  • http://example.org/wcjava/person = a person
  • http://example.org/wcjava/organization = an organization
  • http://example.org/wcjava/employer = an organization that employs an employee

The important point is to make sure that each concept has a unique identifier. Make sure that the URIs will still be around a few years from now; you are building a complete concept space around these identifiers! If you have control over your domain name, it might be wise to have a policy that forbids anyone placing actual content under URIs beginning with some prefix (such as http://yourdomain/uri). We are using these names as globally unique identifiers, not as URLs for retrieving documents. There is nothing wrong with a document being there, but it could lead to confusion between the concept and the document. In this example, we are using the example.org domain, which is reserved solely for illustrative purposes within documentation. If you want to define a permanent URI, there are sites that will let you define your own permanent URI independent of future domain name ownership changes. The best known of these is http://purl.org.

After you have identified some concept URIs, it’s time to define relationships between them. In the previous section, we showed how to do this in Jena using our own relationships. Now let’s use some predefined relationships created by others and apply them to our entities. Adding another entity that was defined elsewhere is easy: just add its URI to the graph we are building.

But if we want to do anything useful with these entities, we will also need to import the statements that define their related properties and resources. In our example, we will use the subClassOf property defined in RDF Schema, which works similarly to a subclass relationship in object-oriented programming. The graph in Figure 3 shows the relationships between our resources.
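One reason subClassOf links are useful is that software can follow them transitively. The sketch below shows this in plain Java as a simple graph search; it assumes nothing beyond the standard library. The class SubClassGraph is a hypothetical name, and this is not how Jena or any RDF reasoner is implemented.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of following subClassOf links; not Jena code.
class SubClassGraph {
    // class URI -> URIs of its direct superclasses
    private final Map<String, Set<String>> parents = new HashMap<>();

    void addSubClassOf(String child, String parent) {
        parents.computeIfAbsent(child, k -> new HashSet<>()).add(parent);
    }

    // True if 'ancestor' can be reached from 'type' by following zero or
    // more subClassOf links. The 'seen' set guards against cycles.
    boolean isSubClassOf(String type, String ancestor) {
        Deque<String> todo = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        todo.push(type);
        while (!todo.isEmpty()) {
            String current = todo.pop();
            if (current.equals(ancestor)) {
                return true;
            }
            if (seen.add(current)) {
                for (String parent : parents.getOrDefault(
                        current, Collections.<String>emptySet())) {
                    todo.push(parent);
                }
            }
        }
        return false;
    }
}
```

If we record that employee is a subclass of person and add an assumed supervisor class as a subclass of employee, then isSubClassOf reports that a supervisor is also a person, even though no statement says so directly.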

At first, you should do this mapping with pen and paper (archaic, but always accessible) or using an RDF visualization tool. When you have finished, you will have a graph of the relationships between entities in your system. Once you’ve created a hierarchy and vocabulary, you can create N3 or RDF/XML files that you can use as metadata. Most RDF visualization tools will do this for you automatically. You’ll want to familiarize yourself with some of the existing RDF vocabularies on which you can base your own hierarchy. Once you have designed a hierarchy, you can create and manipulate it from Jena. The next section shows how to do this.

Who’s a what? Using RDF hierarchies in Jena

Earlier we created a hierarchy of terms to use for our metadata. We used the word vocabulary to refer to this collection of terms, but it is often called an ontology if it defines relationships between the terms. According to the Wikipedia definition, an ontology (in the computer science sense) is a “data structure containing all the relevant entities and their relationships and rules (theorems, regulations) within a domain.”

In Jena, there are built-in helper classes for working with commonly used ontologies. RDF Schema is one of these: Jena has a helper class called RDFS, which has a static variable for the subClassOf property. You can create the graph in the previous section by using this code:

Model model = ModelFactory.createDefaultModel();
model.setNsPrefix("wcj", "http://example.org/wcjava/");
// createResource() takes a full URI and does not expand prefixed names
// such as "wcj:employee", so build each URI from the namespace string.
String ns = model.getNsPrefixURI("wcj");
Resource employee = model.createResource(ns + "employee");
Resource person = model.createResource(ns + "person");
Resource employer = model.createResource(ns + "employer");
Resource organization = model.createResource(ns + "organization");
Property hires = model.createProperty(ns + "hires");
model.add(employer, hires, employee);
model.add(employer, RDFS.subClassOf, organization);
model.add(employee, RDFS.subClassOf, person);
model.write(new FileWriter("ourEntities.rdf"), "RDF/XML");

The second line sets a namespace prefix for our graph, which makes the code easier to read because we can describe the URIs in a simpler way. There is nothing special about the choice of wcj as our prefix. It could have been any string of letters, but whichever value is used becomes the prefix that is sent to the output file. The RDF/XML output type is the XML representation of our RDF graph. Most applications will exchange RDF graphs using the XML format rather than N3. As you can see, Jena’s RDF model can work with either type.

Once you have an RDF vocabulary defined for your data, you will want to put it onto a Website so that applications can use it. You can use your new vocabulary to semantically tag any components within applications. For the database example above, you might create a new table to hold metadata linking each column and table name to their RDF types. It could be as simple as an entry for each table/column name and the corresponding URI from your RDF vocabulary that describes its meaning. You might use this for automatically generating documentation or in analyzing and reusing application code. Using RDF for this type of metadata is a convenient way to tag the data without changing anything in the existing data structures. For our Java classes, we could also add code annotations or Javadoc tags to semantically mark up our code to facilitate its reuse.
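The metadata table described above can be pictured as a lookup from table and column names to concept URIs. Here is a minimal in-memory sketch in plain Java, assuming a simple "Table.column" key. The class ColumnVocabulary and the department URI in the usage note are hypothetical; a real system would more likely keep these entries in a database table or an RDF file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of tagging schema names with concept URIs.
class ColumnVocabulary {
    // "Table" or "Table.column" -> URI of the concept it represents
    private final Map<String, String> conceptByName = new LinkedHashMap<>();

    // Tags a table (pass null for column) or one of its columns.
    void tag(String table, String column, String conceptUri) {
        conceptByName.put(key(table, column), conceptUri);
    }

    // Returns the concept URI for the name, or null if it is untagged.
    String conceptFor(String table, String column) {
        return conceptByName.get(key(table, column));
    }

    // Produces one documentation line per tagged name, in insertion order.
    String describe() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> entry : conceptByName.entrySet()) {
            out.append(entry.getKey()).append(" -> ")
               .append(entry.getValue()).append('\n');
        }
        return out.toString();
    }

    private static String key(String table, String column) {
        return column == null ? table : table + "." + column;
    }
}
```

For example, tagging the Employee table with http://example.org/wcjava/employee and its department column with an assumed http://example.org/wcjava/department URI lets describe() generate a simple listing of what each name means, without touching the existing schema.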

There are some well-known standard RDF vocabularies that you can use to build your own vocabulary. The first one to consider is a vocabulary extension to RDF created by the World Wide Web Consortium: the OWL Web Ontology Language. It includes vocabulary along with formal semantics that you can use in your own definitions. OWL builds on the framework created by the RDF and RDF Schema vocabularies. Although we used RDF Schema’s subClassOf property, OWL offers a much more comprehensive vocabulary that adds formal semantics such as property restrictions and set operations. Jena has an OWL helper class with static variables for each of the OWL resources and properties.

Another common RDF standard is the Dublin Core (DC), an element set for describing metadata about information resources of any kind. It defines generic properties such as title, creator, type, format, language, and rights. The type property uses values from the Type Vocabulary, part of the Dublin Core. Some examples of types are collection, dataset, interactive resource, and software. In Jena, there is a DC class with static Property variables for each of the Dublin Core properties. You can add a type property to an item within a model by using:

 model.add(myDatabaseResource, DC.type, DCTypes.Dataset);

This marks the resource myDatabaseResource as being of type Dataset. Combined with RDF Schema or OWL, these terms give you a baseline for creating your own hierarchy. For example, you might create terms for “JDBC-accessible database,” “relational database table,” and “relational database column” that are RDF subclasses of Dataset. You could then define unique URIs for specific instances of these and make statements about them in RDF: “MySQL instance #743234 at OurOrganization contains data about employees, stored in the table named Employee.” Having such metadata available can make managing IT resources much easier.

Eventually, there will probably be a standard upper-level ontology for all information technology terms. Many groups are working to create standard vocabularies for various domains. One effort, the Suggested Upper Merged Ontology (SUMO), aims to develop an upper-level hierarchy for all abstract concepts. Future applications that use ontologies based on this may be able to make high-level inferences using data from entirely different domains. There are some domain-specific hierarchies that are also based on SUMO.

Brian D. Eubanks is a consultant, speaker, author, and trainer specializing in Internet technologies and the founder of Eu Technologies. He has more than 20 years of experience as a computer programmer, network engineer, and systems consultant. His current work focuses on Java, XML, and Flash.