Web 2.0 project: RDF and uncertainty

I’ve been thinking a bit recently about how to deal with certainty in our Genealogical Web 2.0 application. We’ve come round to using an RDF model to represent what the Gentech data model calls “assertions”; assertions such as “Charles Darwin was a passenger on the Beagle Voyage” are represented as an RDF Statement in which (a resource representing) “Charles Darwin” is the subject, (a resource representing) “Beagle Voyage” is the object, and “was a passenger on” is the predicate/property.

All the statements in the genealogical application should be based on some source of information, either an external piece of evidence (such as a marriage certificate) or a combination of existing statements. Either way, there’s certain metadata that we want to store about each statement, such as

  • who created the statement
  • when it was made
  • the date(s) when the statement was true
  • the certainty in the statement
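As a rough sketch of how one statement and its metadata might hang together, here is the Darwin example written as plain Python data rather than RDF. All the names here (the `ex:` identifiers, the field names, the creator, the creation time and the certainty value) are made up for illustration; none of them come from the Gentech data model or from our application.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass(frozen=True)
class Statement:
    """A subject/predicate/object triple, mirroring an RDF Statement."""
    subject: str
    predicate: str
    obj: str


@dataclass
class StatementRecord:
    """A statement plus the metadata listed above (hypothetical field names)."""
    statement: Statement
    created_by: str                 # who created the statement
    created_at: datetime            # when it was made
    valid_from: Optional[date]      # start of the period when it was true
    valid_to: Optional[date]        # end of the period when it was true
    certainty: float                # assumed here to be a score between 0 and 1


beagle = StatementRecord(
    statement=Statement("ex:CharlesDarwin", "ex:wasPassengerOn", "ex:BeagleVoyage"),
    created_by="ex:users/example",            # illustrative user
    created_at=datetime(2007, 1, 1, 12, 0),   # illustrative timestamp
    valid_from=date(1831, 12, 27),
    valid_to=date(1836, 10, 2),
    certainty=0.9,                            # illustrative value
)
```

In RDF itself the metadata would need to be attached to the statement in some way (reification and named graphs are the two usual candidates); the sketch deliberately sidesteps that choice.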

The certainty factor is interesting. For statements based directly on evidence, there are three factors that come into play:

  • the reliability of the evidence itself; for example, a marriage certificate is more reliable than a diary entry for a wedding
  • the certainty the user has in drawing their conclusion based on the evidence; for example, you would be more certain in the statement that the groom named on a marriage certificate is a man than in the statement that the witness named on a marriage certificate is a friend of the groom
  • the reliability of the user who has made the statement: an expert in family history is likely to draw more accurate conclusions than someone who has only just started

So now the question is how to assess these factors. The usual Web 2.0 method is to use ratings. We could get users to rate each other to provide the third score. We could then get users to rate the reliability of particular pieces of evidence, modify that score based on the users’ reliability, and aggregate those scores.

The final certainty of the statement would be a combination of this score for evidence reliability and ratings from multiple users, again weighted according to the users’ reliability.
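To make that concrete, here is a minimal sketch of one possible scoring scheme. It assumes that all ratings and reliabilities are numbers between 0 and 1, that "weighted" means a reliability-weighted mean, and that the two scores are combined by multiplying them. Those are assumptions made for the example, not decisions we have taken, and the function names are invented.

```python
from typing import Dict, List, Tuple


def weighted_mean(pairs: List[Tuple[float, float]]) -> float:
    """Weighted mean of (score, weight) pairs; 0.0 if the total weight is zero."""
    total_weight = sum(weight for _, weight in pairs)
    if total_weight == 0:
        return 0.0
    return sum(score * weight for score, weight in pairs) / total_weight


def reliability_weighted(ratings: Dict[str, float],
                         user_reliability: Dict[str, float]) -> float:
    """Aggregate per-user ratings, weighting each by that user's reliability.

    Users with no known reliability get a weight of 0 (an assumption).
    """
    pairs = [(score, user_reliability.get(user, 0.0))
             for user, score in ratings.items()]
    return weighted_mean(pairs)


def statement_certainty(evidence_ratings: Dict[str, float],
                        statement_ratings: Dict[str, float],
                        user_reliability: Dict[str, float]) -> float:
    """Combine evidence reliability and statement ratings by multiplication."""
    evidence_score = reliability_weighted(evidence_ratings, user_reliability)
    statement_score = reliability_weighted(statement_ratings, user_reliability)
    return evidence_score * statement_score


# Illustrative data: alice is rated as more reliable than bob.
user_reliability = {"alice": 1.0, "bob": 0.5}
evidence_ratings = {"alice": 0.8, "bob": 0.5}    # weighted mean is about 0.7
statement_ratings = {"alice": 1.0, "bob": 0.4}   # weighted mean is about 0.8

certainty = statement_certainty(evidence_ratings, statement_ratings, user_reliability)
print(round(certainty, 2))  # roughly 0.56 (0.7 * 0.8)
```

Multiplying means a statement can never be more certain than the evidence it rests on, which seems a reasonable property, but other combinations (a minimum, or another weighted mean) would fit the description just as well.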

Comments

Re: Web 2.0 project: RDF and uncertainty

Hi Jenny. Really interesting topic! I’ve started playing around with GEDCOM exports from popular family history sites lately, and have been wondering along similar lines how to reflect our uncertainties in RDF. The biggest change in the RDF universe since I last looked into this is the standardisation of “named graph” stuff within SPARQL, allowing us to talk about properties of sets of triples, and even construct queries that mix across these two layers, e.g. “find me birthdates for parents of the person who is primary topic of document x, according to sources with a quality of 4 or more” kinda thing. Do you have any RDFS/OWL or instance data samples, which could be a basis for trying out test cases in SPARQL using such a technique?

Data privacy in this area is also going to be interesting. By which I mean, hard. I was chatting with a cab driver a few weeks ago, who initially claimed to be barely capable of sending a text message, then revealed that he’s a wizard with these family history sites. He was telling me that, after some investigation, you end up realising quite private secrets of families from the other side of the world. It’s far from clear how automation will interact with all that. Some of these sites now have adverts for dna-based services too, which reminds me that I put foaf:dna_checksum into FOAF some years ago, largely as a (slightly grim) joke. But now it’s rather topical. As far as I know there’s no cross-site interchange format for the bio data, although presumably things are happening in the govt-to-govt world (passports etc).

Re: Web 2.0 project: RDF and uncertainty

Have you considered the idea of evidences supporting assertions?

Evidences have interpretations, which back up assertions (or claims) of the existence of individuals or relationships. [I hope I haven’t misused the word assertion]

Most genealogy systems record assertions of individuals or relationships as fact, which is problematic and provides no simple way of recognizing or resolving conflicts, or even of safely maintaining conflicts indefinitely! Some allow recording of supporting evidences and allow doubts to be expressed freestyle.

The large problem is the merging of individual researchers’ “trees of doubt” into a larger tree of doubt.

Historic individuals can only be represented by themselves (and they’re usually not around to bestow their identity on the various assertions) but researchers may agree (and later disagree) on whether or not the same individual would accept the assertions (or claims) as referring to themselves - or in other words “we’re looking at the same person”

Each researcher then must have their own view of the tree and own definition of assertions which comprise an identity; and may have aliases which refer to an identity record of another researcher which probably uses many of the same assertions.

While the assertions may be semantically identical (or similar - perhaps some assert to a lesser degree of accuracy) they may be based on a largely common set of interpretations which may be based on a largely common set of evidences.

And it all wants versioning! So if a researcher A accepts from researcher B evidence Ba and interpretation Ba1 and assertion Ba1c and identity Ba1c99 and then researcher B re-interprets the evidence as Ba2 for assertion Ba2p to support and add information to individual Ba2p238, researcher A will need to be able to see the change and follow the conclusion and change of B or not. And then change their mind later if B is discredited, more evidence is found, etc.

What are your thoughts on that?

Re: Web 2.0 project: RDF and uncertainty

Well, the first thing is that your description of evidence-based assertions is exactly how the GENTECH data model works, and that’s what we’re using as the basis of our application. In the GENTECH data model, if a person is mentioned in a piece of evidence, a persona is created for that person. If they are mentioned again in a different piece of evidence, another persona is created. If you want to say that those individuals are actually one and the same, that is a separate assertion, with all the uncertainty that that entails.

(The same goes for groups and events, by the way.)

I think what we’ll end up with is a kind of hierarchy of personas: the bottom layer will be personas that are directly generated based on the evidence. Higher layers will be created through “these are the same person” assertions. At any point it should be possible to snip off a subtree by saying “actually, these aren’t the same person after all”.

Another, more webby, way of viewing this is to say that the higher levels are aggregations (like feed aggregations). What you see about a persona are the assertions made about that persona directly. If an assertion is made that two personas are the same, you get a new persona who is an aggregation of the two that it’s based on: you see an aggregation of the assertions about the two personas.
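Here is a small sketch of that aggregation idea, assuming each persona simply holds a list of assertions and a list of the lower-level personas it has been declared the same as. The class, its method names and the sample assertions are all hypothetical; this is not the GENTECH model itself, just one way to picture the hierarchy and the “snipping”.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)  # compare personas by identity, not by field values
class Persona:
    name: str
    # Assertions made directly about this persona.
    assertions: List[str] = field(default_factory=list)
    # Lower-level personas that a "same person" assertion has merged into this one.
    same_as: List["Persona"] = field(default_factory=list)

    def all_assertions(self) -> List[str]:
        """Own assertions followed by those of every merged persona, recursively."""
        collected = list(self.assertions)
        for child in self.same_as:
            collected.extend(child.all_assertions())
        return collected

    def snip(self, child: "Persona") -> None:
        """Undo a 'same person' assertion: detach that subtree."""
        self.same_as.remove(child)


# Bottom layer: one persona per piece of evidence (illustrative data).
certificate = Persona("Charles Darwin (marriage certificate)", ["named as groom"])
passenger_list = Persona("Charles Darwin (passenger list)",
                         ["was a passenger on the Beagle Voyage"])

# Higher layer: created by asserting that the two personas are the same person.
merged = Persona("Charles Darwin", same_as=[certificate, passenger_list])
print(merged.all_assertions())
# ['named as groom', 'was a passenger on the Beagle Voyage']

# "Actually, these aren't the same person after all."
merged.snip(passenger_list)
print(merged.all_assertions())
# ['named as groom']
```

Because the bottom-layer personas are never modified, snipping a subtree loses nothing: the evidence-level personas and their assertions are still there to be re-aggregated differently.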

We could do some clever things when aggregating assertions, such as comparing the statements to see if they’re contradictory, but (at least to begin with) I think we might just rely on people-power to correct mistaken assertions.

Versioning is going to be interesting. I’ll have to think about that some more.

Re: Web 2.0 project: RDF and uncertainty

Hi Dan. No RDFS/OWL or instance data yet, I’m afraid, but I can let you know when we do (and I’ll probably blog about the ontology as it’s developed, to get the experts’ opinions). I’ll write up my ideas about privacy in another post…

Re: Web 2.0 project: RDF and uncertainty

Historically, privacy has been a significant concern for genealogy research. Aside from problems arising from ‘who to filter’, there is the more obvious problem (that you allude to) of actually maintaining privacy.

So far as I’m aware, there’s no foolproof way to completely privatize current information about an individual, other than hoping that programs abide by protocols of decency and either implement filtering on their own or abide by the filtering rules established by others (to be fair, it’s more that filtering rules are applied to GEDCOM files before they are submitted, by tagging records as private).

GEDCOM (or perhaps an extension of it; I’m not entirely clear) also supports embedded encryption of data (albeit just the textual values) that depends on having the proper key to decrypt some data. I’m not entirely sure that RDF has a similar mechanism other than… say… through some external agreement that the object of some enc:encryptedData property is to be interpreted as decryptable RDF triples that may only be inserted into the data store if it’s able to decrypt it.

It still doesn’t solve issues with potentially making a local triple store public while containing data that violates privacy (again without some sort of implicit agreement implemented in the SPARQL endpoint to prevent certain triples from being returned).

Re: Web 2.0 project: RDF and uncertainty

Some interesting musings on certainty, especially with regards to genealogy, here. I myself have played around with trying to apply a modified version of the Gentech data model in RDF to describe genealogical facts as a side project in a Semantic Web class recently given by Jim Hendler. I’ve also played around with ontologies for this purpose by interacting with Hilton of the Distributed Family Tree project.

In any case, with a particular look into certainty, it’s interesting that there are really two different paradigms here that interact. Reliability of a fact from the document and reliability of a document itself are rather closely related (as they are based on documentary evidence), but reliability of a researcher’s evidence is a secondary scale that qualifies the entire previous reliability, almost like an annotation on the researcher rather than an annotation on the assertion itself.

As for my own experiences with attempting to implement reliability, they are rather disparate: My experience with DFT did not directly relate or deal with reliability, though some conceptualization there mostly dealt with a related, but not identical, concept of proofs of certainty, especially using named graphs for indicating provenance, and presumably certainty as well.

My independent project had less to apply with regards to certainty (I didn’t move far enough into the Gentech schemata to really be able to explore how that would be denoted) but instead applied rules-based logic through cwm to attempt to derive information from assertions. This might not in and of itself be directly useful for producing certainty values, but the general idea of using custom rules to qualify individual certainty of assertions might be beneficial in a Web 2.0 format. In particular, people might want to filter assertions based on individual aspects of rated certainty, as well as judging assertions based on the number of supporting pieces of evidence of a given certainty. It might be worthwhile to look into this…

In a somewhat different vein, it seems almost quaintly coincidental that I ran into some of your other work when trying to solve another issue with trying to hold genealogical data in an RDF format: namely your work on DTLL, and trying to apply it to unambiguously specify dates in formats other than (proleptic) Gregorian dating, especially in cases where imported GEDCOM data might have non-Gregorian dating systems. I’m not sure what, if any, work might have been done to link RDF datatyping with DTLL, and that was something I’m personally eager to look into… But I ramble on.

In any case, I hope that there was SOME sort of usefulness to be gleaned from my comment, but if there isn’t… Well at least it’s not spam. :-)

Re: Web 2.0 project: RDF and uncertainty

I'd be interested to hear how you go about this; the certainty question is one I've wondered about for a while. I still haven't seen anything to suggest one approach over another in practice, starting with: reification or named graphs?

Anyhow, if you find you need a term for simple ratings, there's one in our Review vocab (suggestions welcome).

btw, I've left a note for Ian Davis - he's spent time with genealogy in RDF.