Compiling touchlib on Ubuntu 11.04

In the event that it might be useful for someone, here's how to compile touchlib on Ubuntu 11.04:

  • Install some dependencies and utils:
    sudo apt-get install cmake libcv2.1 libcvaux2.1 libcv-dev libcvaux-dev libglut3 fftw-dev g++ libxmu-dev libglut3-dev libhighgui-dev subversion
  • Check out the code into the 'multitouch' directory: 
    svn co http://touchlib.googlecode.com/svn/trunk/ multitouch
    NB: we received revision 403.
  • Change into the new directory:
    cd multitouch
  • Fix an include problem: Add the line 
    #include <stdio.h>
    to the top of
    src/RectifyFilter.cpp
  • Tell the build script where to find the libraries:
    export OpenCV_ROOT_DIR=/usr/
  • Create the build scripts:
    cmake .
  • Build and compile:
    make
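
The include fix above can also be scripted so that it is safe to run more than once. What follows is a minimal Python sketch, not part of touchlib: the helper name ensure_include is made up for illustration, and the path is the one given in the steps above.

```python
from pathlib import Path


def ensure_include(path, header="stdio.h"):
    """Prepend `#include <header>` to the file unless it is already there.

    Returns True if the file was modified, False otherwise.
    """
    source_file = Path(path)
    text = source_file.read_text()
    include_line = f"#include <{header}>"
    # Skip the edit if some line already is this exact include.
    if any(line.strip() == include_line for line in text.splitlines()):
        return False
    source_file.write_text(include_line + "\n" + text)
    return True


if __name__ == "__main__":
    # Intended to be run from inside the 'multitouch' checkout.
    target = Path("src/RectifyFilter.cpp")
    if target.exists():
        changed = ensure_include(target)
        print("patched" if changed else "already patched")
    else:
        print(f"{target} not found; run this from the multitouch directory")
```

Running it a second time leaves the file untouched, which is handy if you re-run the whole build recipe.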

Tutorial: Using Kohana as a library

If you're working with an existing codebase it's often difficult to modernise the code, as doing so would mean a complete rewrite and there's rarely the time. An alternative is to improve the codebase incrementally as best you can, gradually outsourcing code to external libraries to reduce the amount of old code there is to maintain.

This post is a tutorial on how to include the Kohana PHP framework into existing PHP applications, without having to use the routing and HMVC request handling features.

Privacy, Confidentiality and Linked Data

Our first post on Linked Closed Data missed some important factors when we argued the inevitability of closed Linked Data publishing, namely — as the title of this post implies — privacy and confidentiality.

The need to provide access to sensitive data while maintaining confidentiality will be a major motivation for Closed Linked Data publishing. Rather than adopt a second format for publishing sensitive data, publishers will be keen to re-use existing Linked Data publishing infrastructure. The Linked Data community needs to converge on standards and develop implementations to support this as soon as possible.

Chris Gutteridge highlighted this in a post on the institutional data of the University of Southampton. For a university there are a number of uses a student might have for their own personal data, data which is confidential and thus cannot be published publicly. 

Chris points out that in this domain there are also some complicated issues regarding student sponsors which might arise if certain assessment data were available electronically. These issues are of course not specific to Linked Data publishing, but it is good to know that people give them thought.

 

Linked Data as an Economic Good

To understand how revenue models for Linked Data might work, it is helpful to consider how Linked Data is classified as an economic good. To begin with we will consider the simplest case: Linked Open Data where the dataset has been declared public domain.

  • Non-rivalrous
    Information, and thus Linked Data, is a non-rivalrous good: one which can be enjoyed simultaneously by any number of consumers (ignoring technological limitations such as network bandwidth and processing power).
  • Durable
    Informational goods are also durable; one person's use of a piece of information does not expend that resource and subsequently prevent any others from using it.
  • Non-excludable
    A Linked Open Dataset which has no restriction on access is a non-excludable good; it is not possible to prevent people who have not paid for it from enjoying access to it.
  • Intangible
    Information goods are generally intangible goods: goods which are not themselves physical objects. Intangible goods are commonly also non-rivalrous and non-excludable.

Goods which are both non-rivalrous and non-excludable are classed as 'public goods' in economic terms. Goods which are both rival and excludable, which are the more common sort of good, are known as 'private goods'.
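
The two-way classification can be written as a small lookup. This is only an illustrative Python sketch, and the function name is made up. The two mixed cases it returns (club goods and common-pool resources) are the standard economic terms for the combinations not discussed in this post.

```python
def classify_good(rivalrous: bool, excludable: bool) -> str:
    """Classify a good by the two properties discussed above."""
    if rivalrous and excludable:
        return "private good"
    if not rivalrous and not excludable:
        return "public good"
    if not rivalrous and excludable:
        return "club good"
    return "common-pool resource"


# Linked Open Data in the public domain: non-rivalrous and non-excludable.
print(classify_good(rivalrous=False, excludable=False))  # public good
```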

Public goods are understood to be difficult to charge for directly, as their non-excludability rules out payment-for-access revenue models. Indeed, economists believe that markets are neither a practical nor an efficient means of allocating pure public goods.

Naturally, producers and sellers of public goods have a vested interest in ensuring their continued income. Historically, technology and legislation have been the methods used to achieve this: attempting to make what was a public good into something which behaves more like a private good, by making it rival and/or excludable. Digital Rights Management software and copyright law are examples of these technological and legal methods.

Alternatively, content holders may seek revenue through other means, to offset the impact of freeloading. Advertising is perhaps the most common method, whereby paid adverts are placed alongside or sometimes integrated with the content. Sponsorship is another method, where costs are covered by investment from another party which does not seek advertising in return, for example, government funding.

This post elaborates on our arguments on the economic nature of Linked Data, from our paper on Linked Closed Data which we recently posted about.

Linked Closed Data

The use of Linked Open Data is becoming increasingly widespread, boosted by recent moves to increase government transparency and efficiency by publishing non-sensitive datasets for free online. There is now a large 'cloud' of interlinked datasets, as evidenced by efforts to catalogue and visualise the Web of Linked Data.

Content owners such as governments and research institutions are in a unique position; they have the means to invest in the creation of datasets, yet none of the financial pressures of private companies which require them to turn a profit from such investments. So far, all datasets published as Linked Data have been published for free, without access restrictions. However, as Linked Data technology moves beyond the research and development stage and is incorporated into commercial products and services, pressures to generate a return on investment will increase. In the face of those pressures it is inevitable that some will seek to monetize Linked Data.

In response to these pressures we can expect to see the rise of Linked Closed Data: datasets which are linked in adherence to Linked Data principles, but to which access, or access to some content, is restricted to paying members. It may be possible to meet these financial pressures through other means, such as advertising; however, we are sceptical of this (this will be the subject of a later post).

Linked Closed Data will not mean the end of the Web of open Data; closed datasets are unlikely to displace the free alternatives, as commercial datasets are sold on their quality and depth, something which free datasets do not generally assure. It will, however, enable a market for high quality Semantic data, which may benefit both companies and consumers.

My colleagues and I recently submitted a paper discussing this subject to the Consuming Linked Data workshop (COLD2010), which unfortunately was not accepted. This post explores our ideas about Linked Closed Data from the paper.

A Selective Register Globals Function for PHP

A friend was lamenting the lack of register_globals in his PHP setup (a tad late, as it became non-default back in April 2002). While register_globals was terrible in terms of security, it did excel in its simplicity.

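// Copy only the whitelisted keys from the source arrays into the global scope.
function register_my_globals($allowed, $first, $second = array())
{
    // Merge the sources (values in $second win), then keep only allowed keys.
    $vars = array_intersect_key(
                array_merge($first, $second),
                array_combine($allowed, $allowed));
    foreach ($vars as $key => $value) {
        $GLOBALS[$key] = $value;
    }
}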

// First parameter defines allowed variables
// The second and third are arrays in which to look for these keys.
// Keys in the third array override keys in the second.
register_my_globals(array('green', 'blue'), $_GET, $_POST);
// Take variables only from POST data
register_my_globals(array('cake'), $_POST);

The above snippet allows you to recreate the behaviour of register_globals, but on a case-by-case basis. You must specify the names of the variables you wish to allow, and the source arrays you’d like them taken from (almost always either HTTP GET parameters, or HTTP POST data).
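
To make the precedence concrete, here is the same selection logic sketched in Python, with plain dictionaries standing in for $_GET and $_POST. It only illustrates the whitelist-then-merge behaviour and is not a replacement for the PHP function; the function name and sample values are invented.

```python
def select_allowed(allowed, first, second=None):
    """Merge the two sources (second wins) and keep only whitelisted keys."""
    merged = {**first, **(second or {})}
    return {key: value for key, value in merged.items() if key in allowed}


get_params = {"green": "from GET", "red": "ignored"}
post_data = {"green": "from POST", "blue": "from POST"}

print(select_allowed(["green", "blue"], get_params, post_data))
# {'green': 'from POST', 'blue': 'from POST'}
```

The 'red' key never reaches the result because it is not whitelisted, and 'green' takes its value from the later source.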

On-demand TCP over DNS server

Virtual servers have become quite cheap these days, to the point where I can justify paying the monthly charge on one even when I'm not sure how much I'll use it. One of the things I have been running on my VM is a TCP-over-DNS server; it allows you access to the Internet through some access points where you're forced to log in, though it relies on the network administrator neglecting to block certain types of DNS query. The author has posted a good how-to and overview of how it works, so I'm not going to go into that here.

Now, I don't anticipate using this tunneller very often, so it'd be nice not to run the daemon all the time, but I obviously can't start it myself unless I know in advance that I won't have Internet access. Ideally, therefore, I want the server to run only when I want to use it.

Fortunately Linux has long had a means of doing this with the inetd daemon. The inetd daemon will monitor a network socket, waiting for incoming traffic, and launch your daemon only when it is needed. It then passes the daemon process the existing sockets and waits for it to finish, at which point it'll go back to watching for traffic again. The config line you'll need for inetd is as follows:

domain  dgram   udp     wait    root    /usr/bin/java   java -jar /path/to/tcp-over-dns-server.jar --domain delegated.domain.com --forward-port 22 --forward-address 127.0.0.1 --mtu 1500 --log-level 1 --idle-timeout 10 --log-file /var/log/tcp-over-dns

Aside from modifying the server to support inherited channels I have:

  1. Added an idle time limit (so the program can exit if it sees no clients after a set number of seconds, and let inetd monitor the port again)
  2. Added a log file option (programs launched by inetd can't log to the standard output or error channels as inetd will pipe them into the inherited connection.)
  3. Changed the default behaviour (If a channel is inherited the server will no longer try to bind on its default port)

If you're interested you can download the source code or just the pre-compiled jar file.
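
For anyone unfamiliar with how a service launched by inetd behaves, the inherited socket and the idle timeout described above can be sketched roughly as follows. This is a generic Python illustration, not the Java code from the modified server. It assumes an IPv4 UDP socket handed over on file descriptor 0, and the log path, the echo handler and the --inetd switch are all made up for the example.

```python
import socket
import sys


def serve_until_idle(sock, idle_timeout, handle):
    """Answer datagrams on `sock` until none arrive for `idle_timeout` seconds.

    Returns the number of datagrams handled, so the caller can exit and
    let inetd go back to watching the port.
    """
    sock.settimeout(idle_timeout)
    handled = 0
    while True:
        try:
            data, addr = sock.recvfrom(4096)
        except socket.timeout:
            return handled
        reply = handle(data)
        if reply is not None:
            sock.sendto(reply, addr)
        handled += 1


def main():
    # With a "dgram ... wait" entry, inetd hands over the already-bound
    # socket as file descriptor 0, so there is nothing to bind here.
    inherited = socket.fromfd(0, socket.AF_INET, socket.SOCK_DGRAM)
    # stdout and stderr are tied to the socket too, so log to a file instead.
    with open("/var/log/example-service.log", "a") as log:  # hypothetical path
        count = serve_until_idle(inherited, 10, lambda data: data)
        log.write(f"handled {count} datagrams, exiting\n")


# The --inetd switch is invented so the sketch does nothing when run by hand.
if __name__ == "__main__" and "--inetd" in sys.argv:
    main()
```

Because the entry uses "wait", inetd leaves the port alone until the process exits, which is why the idle timeout matters.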

Apache versions and .htaccess files

The other day I was looking into a problem someone had reported with an Apache RewriteRule, only to conclude that it was using features of the regex library which weren't available in their version of Apache. I found a means of detecting the different versions of Apache using the mod_version module. This allows you to write .htaccess files which fall back to other rules for older versions of Apache.

Unfortunately mod_version has only been available since version 2.0.56, so a server without it is running either Apache 1.x or one of the earlier 2.0 releases (or simply doesn't have the module loaded). I will concede that this sounds very obvious, but there was a surprising lack of results in Google for any of the keywords that I thought to try. For the benefit of anyone searching for this, below is an example of how you can use mod_version in practice:
<IfModule !mod_version.c>
        # mod_version is unavailable: Apache 1.x, or a 2.0 release
        # earlier than 2.0.56.
</IfModule>
<IfModule mod_version.c>
        # Version 2.0.56 or later
        <IfVersion < 2.2>
                # Before version 2.2
        </IfVersion>
        <IfVersion >= 2.2>
                # Version 2.2 or later
        </IfVersion>
</IfModule>

Observational Identity

We argued previously that there is a need for a system of identity for Semantic Web agents, particularly in the process of making judgements of trust. Examining the requirements of such a system, we recognise that it cannot count on universal uptake among Semantic Web agents, and therefore it cannot require each agent to state an identity for itself. Additionally, even if universal uptake could be relied upon, we cannot count on the honest and benevolent behaviour of every Semantic Web agent. Thus, as we briefly mentioned at the end of our previous post, a system of identity for the Semantic Web must be built primarily around observable characteristics as a measure of identity.

As an analogy: when surfing the Web you would not rely on a website's claim that it is your bank's online portal; you would rely on the factors you can observe (such as the domain name and the digital certificate) to inform your judgement. Digital certificates are especially important if you are connected to the Internet over an untrusted network connection.

Building on our earlier example of a rudimentary HTTP-based Semantic Web agent, suppose we request a URI from it, and receive some RDF in response. The data we collect about the identity of the agent may look something like the following:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix ex: <http://example.com/ont/>.

_:agent1
        rdf:type         ex:HTTPAgent;
        ex:port          80;
        ex:host          "agent.example.com";
        ex:ip            "10.0.0.1";
        ex:time          "2010-04-14T14:37:37Z"^^xsd:dateTime.
Suppose at some later date we again communicate with the agent at the domain agent.example.com, and in the process observe that the DNS entry has changed, and the domain now refers to a new IP address. Do we then consider this to be the same agent which we have previous experience of? Further, is the information we have sufficient to make such a decision?

Other attributes may influence the judgement of similarity if they significantly alter the behaviour of the agent: software version numbers or digital certificates, for example. Returning to our analogy, if your browser stored the credentials for your bank's online banking portal, you would specify very strict criteria, very similar to what we described above, to dictate which websites are permitted to see this information. Below follows a second observation record, for an interaction with the same agent at a different IP address.
_:agent2
        rdf:type         ex:HTTPAgent;
        ex:port          80;
        ex:host          "agent.example.com";
        ex:ip            "10.0.0.2";
        ex:time          "2010-04-18T10:24:12Z"^^xsd:dateTime.
It is possible to encode our criteria for equivalence using OWL (to some degree) such that a reasoner can identify that two agents are in fact the same entity. This involves declaring a class of all things which meet the criteria of being a particular agent such that those which meet the necessary and sufficient criteria may be considered the same. Unfortunately the equivalence afforded by OWL causes the effective merging of the identifiers, such that, as below, the metadata from the two different requests becomes inseparable.
_:agent1
        owl:sameAs           _:agent2;
        rdf:type         ex:HTTPAgent;
        ex:port          80;
        ex:host          "agent.example.com";
        ex:ip            "10.0.0.1";
        ex:ip            "10.0.0.2";
        ex:time          "2010-04-18T10:24:12Z"^^xsd:dateTime;
        ex:time          "2010-04-14T14:37:37Z"^^xsd:dateTime.
The problem with this approach is not the use of OWL classification (though it is somewhat ill-suited to this task); rather, it is the result of a simplistic ontology design. We acknowledge that this crude example ontology has many flaws (the assumption that an HTTP agent operates on a sole port and network address, for example); however, to fully satisfy our potential requirements we must adopt an event-based ontology design, as these observations are inherently temporal in nature.
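
To illustrate the difference, here is a rough sketch of what keeping observations as separate events might look like, written in Python rather than RDF for brevity. It is not a proposed ontology. The class, the function names and the equivalence criterion (same host and port) are assumptions made purely for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One interaction with an agent, kept as its own event record."""
    host: str
    port: int
    ip: str
    time: str  # ISO 8601 timestamp of the observation


def identity_key(obs):
    # Hypothetical equivalence criterion: same host name and port.
    # A stricter policy could also compare certificates or versions.
    return (obs.host, obs.port)


def group_by_agent(observations):
    """Group observations by inferred agent without merging their fields."""
    agents = defaultdict(list)
    for obs in observations:
        agents[identity_key(obs)].append(obs)
    return dict(agents)


observations = [
    Observation("agent.example.com", 80, "10.0.0.1", "2010-04-14T14:37:37Z"),
    Observation("agent.example.com", 80, "10.0.0.2", "2010-04-18T10:24:12Z"),
]

for key, events in group_by_agent(observations).items():
    print(key, [(e.ip, e.time) for e in events])
```

Both observations end up attributed to one agent, yet each IP address stays paired with the time at which it was seen, which is exactly what the merged owl:sameAs record loses.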

Trust and identity on the Semantic Web

Open Data movements are gradually gaining traction; government transparency efforts in the US and the UK have begun to release datasets, some of which are published in Linked Data form. As the range and variety of Semantic Web data publishers grows, it is increasingly important that we address the problem of trust. Previously we discussed the challenges of a trust layer for the Semantic Web, and more recently, how we think these challenges should be faced. We are convinced that provenance and reputation information will be a crucial basis for Semantic Web trust decisions.

Reputation and provenance are by no means new subjects in the domain of Computer Science; both are grounded in substantial bodies of literature. Existing techniques will likely require some adaptation in order to match the challenges of the Web of Linked Data. Hartig and Zhao's provenance vocabulary for Linked Data does exactly this, taking existing provenance techniques in a Web-friendly direction, recognising the distinctions between data curation, publishing and access.

To do something similar for reputation mechanisms will not be prohibitively difficult; however, there remains a missing piece of the technological puzzle: a system of identity. A notion of identity is necessary for any judgement of trust in order to fully link together available information. The FOAF vocabulary gives us identifiers for people, and the FOAF+SSL proposals allow us to prove the ownership of (Web of Trust, or PKI style) digital certificates; however, there is as yet no accepted means of identifying a Semantic Web software agent (e.g. a Webserver) beyond the foaf:Agent type.

In order to properly describe the identity of a Semantic Web agent we require more information than a single URI. For example, in the case of an HTTP-based Semantic Web agent (a Webserver), metadata such as the hostname and network port are, for some purposes, integral to the identity of the agent.

To avoid coining a new identity with every HTTP request we must have some criteria by which we judge that the other parties of different data exchanges are the same entity. An important point to make here is that we cannot rely on declarative identities alone; that is, we cannot count on universal uptake among Semantic Web agents of a vocabulary in which to assert identity. Thus an appropriate identity mechanism must consider both observational identities (identities coined by another agent based on its observations) and declarative identities.