Internet Radio

Feb 4, 2008

When my dad introduced me to his internet radios, I thought, “wow, there’s something for everyone!” So I went looking for something for me. And yesterday I found it: Kink ClassX. It plays “alternative” music from the 60s to the present day. I listened to Bowie, Marilyn Manson, Aretha Franklin, Squeeze, Siouxsie, Otis Redding, the White Stripes: a really diverse set of music, with one thing in common – All Good.

OK, so you have to sit through German-language news bulletins every hour on the hour, but to me that just adds to the charm.

Extension primitives in XSDL

Jan 19, 2008

Michael Sperberg-McQueen (CMSMcQ) has written a couple of interesting posts about datatypes in W3C’s XML Schema (XSDL). (The second is a response to a comment from John Cowan, and attempts to justify some of the seemingly arbitrary decisions made in the set of datatypes present in XSDL 1.0.) The posts discuss one of the issues against XSDL 1.1 raised by Michael Kay:

Michael proposes: just specify that implementations may provide additional implementation-defined primitive types. In the nature of things, an implementation can do this however it wants. Some implementors will code up email dates and CSS lengths the same way they code the other primitives. Fine. Some implementors will expose the API that their existing primitive types use, so they choose, at the appropriate moment, to link in a set of extension types, or not. Some will allow users to provide implementations of extension types, using that API, and link them at run time. Some may provide extension syntax to allow users to describe new types in some usable way (DTLL, anyone?) without having to write code in Java or C or [name of language here].

XSLT 2.0 Q&A: Linking elements in different documents

Jan 18, 2008

The first of what will probably become a series of posts where I publicly answer questions that people send me privately (with permission, of course)…

How should I model (and store) data for Courses, while being able to pull info about a Course into a particular context (a Semester or Curriculum)?

I’m not quite sure how to do this in terms of writing the schema (and consequently the XML), and/or how to connect it with XSL (if that is appropriate). My experience with XSLT is limited to pretty much straight templates, and I’ve never cross-referenced two nodesets and used the result to provide a different output before.

The important thing is that within whatever XML you use, you have identifiers of some description that you can use to work out what a particular element means. The simplest and most general purpose identifiers are xml:id attributes, but they’re a bit limited because they have to be legal IDs.

If you have something like a course that’s identified by a number then it makes more sense to use the number of the course as the identifier; that can’t live in an xml:id attribute, so you have to use some other attribute for it (eg number). (You can use an element instead of an attribute, of course, but identifiers are usually metadata, and metadata should usually be an attribute unless it’s structured.)

You might have something that is actually uniquely identified by a combination of values. For example, a course might be identified by the department that offers the course plus the number identifying the course; it might be possible for two courses to have the same number, but be different courses, offered by different departments. Again, ultimately all that’s really important is that it’s possible to identify the department from the course. The identifier itself might be on the element representing the course or on one of its ancestors; it doesn’t really matter.

In the XSLT, the first task is to pull in all the documents that hold the information you want to use and store them as global variables:

<xsl:variable name="curriculum" as="document-node()"
  select="document('curriculum.xml')" />
<xsl:variable name="transcript" as="document-node()"
  select="document('transcript.xml')" />
<xsl:variable name="database" as="document-node()"
  select="document('database.xml')" />

Then you should set up keys that will create indexes of the information in your documents based on their identifier(s). A simple key would look like:

<xsl:key name="departments" match="dept" use="@xml:id" />

The name attribute is a name for the key; you can call it anything you like, but for identifiers I usually just use the plural of a noun for the thing I’m identifying. The match attribute is a pattern that matches the elements that you’re indexing (don’t forget namespaces, if you have them). The use attribute holds an XPath that should return a value for a given node that you’re indexing; in the above case it’s the value of the xml:id attribute on the <dept> element.

For elements that use a combination of values as an identifier, you can use the concat() function to create a unique value that combines the identifying values. For example, if your XML looks like:

<courses dept="CMSC">
  <course number="131">...</course>
  <course number="198W">...</course>
  <course number="434">...</course>
</courses>

then you could index each course by its department and number with:

<xsl:key name="courses" match="course" use="concat(../@dept, ':', @number)" />

It’s not really necessary in this case, but I’ve put a separator in the concat() call out of habit as it helps prevent problems such as something identified as 'a' + 'bc' being given the same identifier as something identified through 'ab' + 'c'.
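The collision that the separator guards against is easy to see outside XSLT. The following is only a sketch in Python, with hypothetical helper names standing in for the concat() call in the key's use attribute; it assumes the separator character never appears in the identifying values themselves.

```python
# Hypothetical helpers that mimic concat() with and without a separator.
def key_without_separator(dept, number):
    return dept + number


def key_with_separator(dept, number):
    # Assumes ':' never occurs inside dept or number.
    return dept + ":" + number


# 'a' + 'bc' and 'ab' + 'c' both give 'abc', so the two items collide.
collide = key_without_separator("a", "bc") == key_without_separator("ab", "c")

# With the separator the values are 'a:bc' and 'ab:c', which stay distinct.
distinct = key_with_separator("a", "bc") != key_with_separator("ab", "c")
```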

Note that the keys don’t indicate which document the information is held in. It’s only when you call the key that you say which document you want to use. For example:

key('courses', 'CMSC:198W', $curriculum)

would pull out the <course> element whose parent had a dept attribute equal to 'CMSC' and a number attribute with the value '198W' from the document held in the $curriculum variable. (This used to be harder to manage in XSLT 1.0, when there wasn’t a third argument; without the third argument, the XSLT processor looks in the document you’re currently in.)

You can just call the key() function directly, but if you’re using XSLT 2.0 then I’d suggest wrapping it in a function like this:

<xsl:function name="my:course" as="element(course)">
  <xsl:param name="dept" as="xs:string" />
  <xsl:param name="number" as="xs:string" />
  <xsl:variable name="identifier" select="concat($dept, ':', $number)" />
  <xsl:sequence select="key('courses', $identifier, $curriculum)" />
</xsl:function>

This means you can use

my:course('CMSC', '198W')

to locate the <course> element you’re after.

Of course, most of the time you won’t have fixed values for the arguments for that function: you’ll have some XML that refers to the course in its own way. For example, you might have:

<semester season="fall" year="2007">
  <course dept="ARTT" number="210" />
  <course dept="INFM" number="210" />
</semester>

and want to make a list of the titles of the courses. You could do this with:

<xsl:template match="course">
  <li>
    <xsl:apply-templates select="my:course(@dept, @number)/title" />
  </li>
</xsl:template>

Just a final thought: if you have control over it, it’s useful to make a clear distinction between elements that define information and elements that reference those definitions. One way of doing that is to name them differently (eg <course> and <courseRef>) or make sure that they have different sets of attributes (eg number and numberRef). Using the name of an element can be more useful because you can use that when you’re defining the types of parameters and return values. For example:

<xsl:function name="my:courses" as="element(course)+">
  <xsl:param name="semester" as="element(semester)" />
  <xsl:variable name="courseRefs" as="element(courseRef)+"
    select="$semester/courseRef" />
  <xsl:sequence select="$courseRefs/my:course(@dept, @number)" />
</xsl:function>

Web 2.0 project: privacy in genealogy

Jan 17, 2008

There were a couple of comments on my previous post about RDF and uncertainty in our Web 2.0 genealogy project concerning the problems of privacy in a genealogy app. My ideas about this aren’t fully thought through, let alone implemented, but I thought I’d share them anyway.

First, the things we could restrict access to are:

  • sources of information (eg birth records)
  • personas (eg Charles Darwin) and assertions about them
  • events (eg the Beagle Voyage) and assertions about them
  • groups (eg the Royal Society) and assertions about them

There are different kinds of things you might do to the resources:

  • who can know it exists?
  • who can read it?
  • who can change it?
  • who can delete it?

and different levels of access for each of those things:

  • global (public)
  • group (restricted)
  • user (private)

I imagine that at any point, a user will have a default set of permissions in play. For example, they might be generating information that anyone can know exists and can read, but only a restricted set of people can change, and only the user themselves can delete.
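To make the shape of such a permission set concrete, here is a minimal sketch in Python. Nothing here is implemented in the project: the names (Level, example_defaults, allowed) are invented for illustration, and it assumes that the owner of an item always counts as a member of its group.

```python
from enum import Enum


class Level(Enum):
    GLOBAL = "public"
    GROUP = "restricted"
    USER = "private"


# The example from the text: anyone can know the item exists and read it,
# a restricted set of people can change it, only its creator can delete it.
example_defaults = {
    "know_exists": Level.GLOBAL,
    "read": Level.GLOBAL,
    "change": Level.GROUP,
    "delete": Level.USER,
}


def allowed(action, permissions, user, owner, group_members):
    """Decide whether user may perform action on an item owned by owner."""
    level = permissions[action]
    if level is Level.GLOBAL:
        return True
    if level is Level.GROUP:
        # Assumption: the owner is implicitly part of the group.
        return user == owner or user in group_members
    return user == owner
```

With these defaults, a stranger can read an item, a group member can also change it, and only the owner can delete it.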

Searchability

I’m going to side-track for a second here to explain why I’ve separated out “knowing something exists” from “readable”.

Our genealogy application is evidence-based, which means that to say that someone exists you must have a source for that information, be it a birth certificate or a picture or the transcription of an interview. The persona that you believe to exist on the basis of one source may or may not be the same persona that you believe to exist on the basis of a different source. A separate step is required to link the two personas together.

The intention is that our application will help you link personas (and events, groups and anything else that might be quoted in more than one piece of evidence) together by searching other personas (etc) to find those that are similar. They might have similar names, but most of the evidence for similarity will come from similar assertions having been made about them.

Now say that I have some evidence about Charles Darwin and enter it into the system, and you similarly have some evidence about Charles Darwin that you enter into the system. We both create “Charles Darwin” personas based on our evidence. I’d like it to be the case that, even if we don’t want the information we’ve captured about Charles Darwin to be readable by others, we can still make it searchable so that others can find out we’re working on that person too so that (after some interpersonal negotiation) the two sets of information can be brought together.

So the “who can know it exists?” access is about making your information searchable, even if the details aren’t readable.
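One way to picture the difference between the two kinds of access is what a search would hand back. This is a sketch only, in Python; the function name and the fields are invented, and which fields a "searchable but not readable" hit should reveal (here just a label and the owner to contact) is an assumption rather than a decision we have made.

```python
def search_hit(item, viewer_can_know, viewer_can_read):
    """Return what a search should show the viewer for one matching item."""
    if not viewer_can_know:
        # Entirely private: the viewer never learns the item exists.
        return None
    if viewer_can_read:
        return dict(item)
    # Searchable but not readable: enough to start a conversation, no details.
    return {"label": item["label"], "owner": item["owner"]}


# Hypothetical persona record.
darwin = {
    "label": "Charles Darwin",
    "owner": "jeni",
    "assertions": ["was a passenger on the Beagle Voyage"],
}
```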

Default access

As I understand it, in genealogy circles it would be bad form to make any information about living people publicly available. Some would argue that any information that is currently publicly available (on the web, or even off it, in electoral rolls, phone books, wherever) is fair game: if it’s already “out there” then you can use it. But I don’t agree with that. There’s a big difference between there being multiple disparate, human-readable sources of information about an individual “out there somewhere” and providing a single aggregated public source for all this data in a computer-readable format. In short, I agree with Jeff “Coding Horror” Atwood’s recent post on privacy in which he characterises privacy as an inalienable right.

Also, some people make money from doing genealogical research, and it would be short-sighted to prevent people from using the application if they simply wanted to keep what they were doing private for their own reasons.

So there must be a setting such that information can be kept entirely private, to the extent that only the person who constructed it knows that it exists.

On the other hand, many of the benefits of Web 2.0 applications come from integrating your own material with other people’s. Genealogy is a good example of this, because there’s a distinct thrill in hooking into other people’s research and discovering other branches of the family.

So there must be a setting such that information can be made globally available to all and sundry, even to the extent of others changing and deleting it.

I think the default has to be that the information you enter is completely private. I’d like to encourage people to share their information, but I think that the way to do that is through social mechanisms (eg the more useful information you share, the more kudos you get) rather than technological ones.

User Groups

Public and private access aside, there’s a huge middle ground of access that’s somehow restricted to a subset of users. I don’t think I’m being particularly radical if I say that the current state of the art of social software access control isn’t great. The “restricted” subset is usually limited to “my friends”, but people don’t have one social circle: they operate in different spheres, and have a different set of friends/colleagues in each.

The same is true in our genealogy app. A given user might be working (collaboratively) on several projects, with different teams. For example, I might work on a family tree with my mother for her side of the family and my father for his, with no need for either of them to be aware of the other (unless they overlap, in which case “searchability” comes into play).

So I think it makes sense to use the familiar notion from *nix of “user groups”, and assign each item (sources/people/events/groups) to one or more user groups. User groups will probably mostly arise from those working on particular projects, but could be more arbitrary. I’m thinking:

  • moderated (by-invitation) and open (by-subscription) user groups
  • private, public and restricted permissions on the user groups

The other thing I was considering here was having a notion of “degrees of separation”. If I entered details about our daughters into a family tree, I might be happy for immediate family, and their immediate family, to see them: a separation of two. I don’t know whether this kind of network-based user group would be useful or a pain to manage, or even implementable; it’s just an idea I had.
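On the question of whether it is implementable: given a graph of who counts as whose immediate family, finding everyone within a given separation is a breadth-first search. The sketch below is in Python and is purely illustrative; the function name and the sample family are invented, and it assumes the links are stored as a simple mapping from each person to their immediate family.

```python
from collections import deque


def within_degrees(links, start, max_degrees):
    """Return everyone reachable from start in at most max_degrees steps."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distance[person] == max_degrees:
            continue  # do not look beyond the allowed separation
        for relative in links.get(person, ()):
            if relative not in distance:
                distance[relative] = distance[person] + 1
                queue.append(relative)
    return {person for person in distance if person != start}


# Hypothetical family links (immediate family only).
family = {
    "me": ["mum", "dad"],
    "mum": ["me", "aunt"],
    "dad": ["me"],
    "aunt": ["mum", "cousin"],
    "cousin": ["aunt"],
}
```

Here a separation of two from "me" takes in mum, dad and aunt, but not cousin. Whether this is pleasant to manage is a separate question from whether it can be computed.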

Web 2.0 project: RDF and uncertainty

Dec 21, 2007

I’ve been thinking a bit recently about how to deal with certainty in our Genealogical Web 2.0 application. We’ve come round to using an RDF model to represent what the Gentech data model calls “assertions”; assertions such as “Charles Darwin was a passenger on the Beagle Voyage” are represented as an RDF Statement in which (a resource representing) “Charles Darwin” is the subject, (a resource representing) “Beagle Voyage” is the object, and “was a passenger on” is the predicate/property.

All the statements in the genealogical application should be based on some source of information, either an external piece of evidence (such as a marriage certificate) or a combination of existing statements. Either way, there’s certain metadata that we want to store about it, such as

  • who created the statement
  • when it was made
  • the date(s) when the statement was true
  • the certainty in the statement
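As a rough picture of a statement together with this metadata, here is a sketch in Python. It is not our RDF representation: the class and field names are invented, plain strings stand in for RDF resources, and the creator name is hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Assertion:
    # The RDF-style triple (strings stand in for resources here).
    subject: str
    predicate: str
    object: str
    # Metadata about the statement itself.
    created_by: str
    created_on: date
    true_from: Optional[date] = None   # start of the period when it was true
    true_until: Optional[date] = None  # end of that period, if known
    certainty: Optional[float] = None  # left unset until it has been assessed


beagle = Assertion(
    subject="Charles Darwin",
    predicate="was a passenger on",
    object="Beagle Voyage",
    created_by="jeni",  # hypothetical user name
    created_on=date(2007, 12, 21),
)
```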

The certainty factor is interesting. For statements based directly on evidence, there are three factors that come into play:

  • the reliability of the evidence itself; for example, a marriage certificate is more reliable than a diary entry for a wedding
  • the certainty the user has in drawing their conclusion based on the evidence; for example, you would be more certain in the statement that the groom named on a marriage certificate is a man than in the statement that the witness named on a marriage certificate is a friend of the groom
  • the reliability of the user who has made the statement: an expert in family history is likely to draw more accurate conclusions than someone who has only just started

So now the question is how to assess these factors. The usual Web 2.0 method is to use ratings. We could get users to rate each other to provide the third score. We could then get users to rate the reliability of particular pieces of evidence, modify that score based on the users’ reliability, and aggregate those scores.

The final certainty of the statement would be a combination of this score for evidence reliability and ratings from multiple users, again weighted according to the users’ reliability.
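The weighting could be sketched as follows, in Python. This is one possible reading rather than a settled design: it assumes all scores and reliabilities lie between 0 and 1, uses a weighted mean for aggregation, and multiplies the two aggregated scores to get the final certainty. Both the function names and the choice of multiplication are assumptions.

```python
def weighted_mean(ratings):
    """Aggregate (score, rater_reliability) pairs, trusting reliable raters more."""
    total_weight = sum(weight for _, weight in ratings)
    if total_weight == 0:
        return 0.0  # no ratings, or only ratings from wholly unreliable users
    return sum(score * weight for score, weight in ratings) / total_weight


def statement_certainty(evidence_ratings, conclusion_ratings):
    """Combine evidence reliability with users' ratings of the conclusion.

    Assumption: the two aggregated scores are combined by multiplying them.
    """
    return weighted_mean(evidence_ratings) * weighted_mean(conclusion_ratings)
```

Under these assumptions, a rating from a user whose own reliability is zero has no effect on the result, and a statement can never be more certain than the evidence it rests on.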