Possibilities to store metadata
Update: I have not updated this page for quite some time, so I will just write down the current state. I will keep the first ideas below.
I have been testing SemWeb as a metadata store and ran into some problems. I am now thinking about how to solve them, and I see two ways: keep SemWeb and fix what can be fixed, or put everything directly into the Lucene DB. I will first explain the problems and then list the solution ideas I came up with for both cases. Right now I am in favour of the Lucene-only way, but I will discuss this over the next few days.
The problems I ran into:
- File locks: Files are locked while new items are indexed or the store is read. Beagle handles this quite well for Lucene indexes. We might have to implement a queue for accessing SemWeb as well.
- Entities and even statements that already exist may be duplicated, so we have to query for existence before inserting.
- Fetching properties from statements for every hit almost doubles response times. This is a conceptual flaw: we have to search the Lucene store first and come back with a URI that we can then use to query SemWeb. So we go through all our data twice!
- It is not possible to delete (all connected) statements in the SemWeb DB when a document has been deleted, because you cannot tell whether they were indexed because of that document. For example: can you remove the information about a person's email address if the address book entry is removed? Maybe the email address was indexed from an RSS feed. That is where the policy for metadata comes in.
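The "query for existence before inserting" step from the duplicates problem can be sketched as follows. This is only an illustration in Python, not SemWeb's API: `Statement`, `StatementStore` and `add_if_missing` are hypothetical names, and the store is a plain in-memory set.

```python
from typing import NamedTuple, Set


class Statement(NamedTuple):
    """One subject-predicate-object statement (hypothetical type)."""
    subject: str
    predicate: str
    obj: str


class StatementStore:
    """Hypothetical in-memory stand-in for a statement store."""

    def __init__(self) -> None:
        self._statements: Set[Statement] = set()

    def contains(self, statement: Statement) -> bool:
        return statement in self._statements

    def add_if_missing(self, statement: Statement) -> bool:
        """Query for existence first; return True only if something was inserted."""
        if self.contains(statement):
            return False
        self._statements.add(statement)
        return True

    def __len__(self) -> int:
        return len(self._statements)


store = StatementStore()
mail = Statement("contact:max", "has_email", "mailto:max@example.org")
first = store.add_if_missing(mail)   # e.g. indexed from the address book
second = store.add_if_missing(mail)  # e.g. seen again in an RSS feed
```

Note that this only prevents duplicates. It does not record which source contributed a statement, so the deletion problem from the last point remains.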
So I decided to take a step back and look at it from the beginning. Why did we want SemWeb in the first place?
1. Cluechaining at indexing time instead of query time (cluechaining is easier if you just have to follow the RDF graph)
2. Adding properties to documents we are not indexing at the moment we find out about them (for example email address and homepage of people from an RSS feed)
3. Adding statements about things that are not linked to any document (email address of people who are not in our address book)
4. Complex queries in an RDF query language would be possible
5. RDF IO
So, going through that list (I'm pretty sure I missed something):
1. To solve the speed problem we might read properties from SemWeb only for hits that have been selected for a detailed view. If we do this with a new special query we don't really gain much here. The query is faster than searching the Lucene index for a string, but there would be room for improvement in the way we do it with Lucene now (see 1) below). We could still store the ID of a document's entity in the SemWeb storage to increase query speed, but I doubt that helps much, and it only works if the duplicates issue gets solved, i.e. if an entity has one unique ID.
2. From my point of view, the main problem with the way we do it now is not that we can't add new fields (I'm pretty sure we can) but that they might be overwritten the next time the original document is indexed. Of course that is not what should happen, but just putting everything in SemWeb does not really solve this problem: instead of removing everything that is not in the original document, we would just keep everything. We would actually need a policy on when to remove information from SemWeb as well (see the last problem above).
3. SemWeb would solve this. But the question remains: when would we delete such data?
4. It does not look like this will be possible, because SPARQL in SemWeb depends on IKVM, so we don't get that for free either.
5. That would be really easy with SemWeb, but we would still need a policy for it. If you can just read an RDF file into the Beagle storage, how do we make sure this information is useful and does not just clutter your store, when would we remove it, etc.?
So what would be needed to achieve (almost) the same with just Lucene?
1. Cluechaining etc.: The advantage of RDF is that objects can be entities which are subjects in other statements. In Lucene something similar could be achieved if we had fields that pointed to other documents, like a fixme:mailfrom that pointed to contact:...
   - I'd like to introduce a new type of property for those: a reference. A reference property would always hold the URI of a different document. Querying for references should be fast because we only search the URI fields, and it could be sped up further if we can map URIs to indexes; then we would only have to query that one index. Looking up references might still have to be triggered by the client per hit, otherwise we would end up querying for tons of contacts for all the email hits returned at once.
2. We could have documents built from different sources. Properties could be stored in fields with a number for the source they came from, so if a source is removed they could be removed as well.
   - Another possibility would be to have a separate document for each source but link them with the references mentioned in 1).
3. We could create a new document just for the metadata. We would have to make sure that when more information about the same object (say, a person) is added, it really gets added here or in a linked document.
4. This might be really difficult due to the different structure in Lucene; joins etc. would be necessary. I don't see any direct use for it now, so I would drop that. If we really want it later, we could also export things to an explicit RDF-based structure like Tracker.
5. See 4.
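The reference properties and the per-source fields described in this list can be sketched roughly like this. It is a toy model in Python, not Beagle or Lucene code: `Document`, `Index`, `resolve_reference`, the source numbers and the URIs `contact:max` and `email:42` are all made up for illustration, and mapping a URI to its index by URI scheme is just one assumed way of doing it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Document:
    uri: str
    # Each property is (key, value, source_id); the source number records
    # where the property came from so it can be dropped with that source.
    properties: List[Tuple[str, str, int]] = field(default_factory=list)
    # A reference property always holds the URI of a different document.
    references: Dict[str, str] = field(default_factory=dict)

    def remove_source(self, source_id: int) -> None:
        """Drop every property that was contributed by the given source."""
        self.properties = [p for p in self.properties if p[2] != source_id]


class Index:
    """Hypothetical stand-in for one index, keyed by document URI."""

    def __init__(self) -> None:
        self._docs: Dict[str, Document] = {}

    def add(self, doc: Document) -> None:
        self._docs[doc.uri] = doc

    def get(self, uri: str) -> Optional[Document]:
        return self._docs.get(uri)


def resolve_reference(indexes: Dict[str, Index], doc: Document,
                      key: str) -> Optional[Document]:
    """Look up one reference on demand, querying only the matching index."""
    target_uri = doc.references.get(key)
    if target_uri is None:
        return None
    scheme = target_uri.split(":", 1)[0]  # assumed URI -> index mapping
    index = indexes.get(scheme)
    return index.get(target_uri) if index is not None else None


ADDRESS_BOOK, RSS_FEED = 1, 2  # made-up source numbers

contacts = Index()
contact = Document("contact:max", properties=[
    ("email", "max@example.org", ADDRESS_BOOK),
    ("homepage", "http://example.org/max", RSS_FEED),
])
contacts.add(contact)

mail = Document("email:42", references={"fixme:mailfrom": "contact:max"})
indexes = {"contact": contacts}

# The client resolves the reference only for the hit it wants details on.
sender = resolve_reference(indexes, mail, "fixme:mailfrom")

# Removing the RSS source drops only what that source contributed.
contact.remove_source(RSS_FEED)
remaining_keys = [key for key, _value, _source in contact.properties]
```

Resolving references lazily per hit keeps a search that returns many emails from also pulling in every referenced contact at once.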
Outdated:
The Lucene way...(LuceneCommon.cs)
Lucene stores fields as key-value pairs for documents in AddPropertyToDocument and reads them using GetPropertyFromDocument. So using a separate Lucene index might be a possibility, as Lukas pointed out: http://mail.gnome.org/archives/dashboard-hackers/2006-May/msg00061.html
Advantages:
- The key-value pairs can be indexed automatically. This is really integrated into Beagle.
- We keep the flexibility Beagle offers when moving or deleting files etc.
- It would integrate really nicely with the queries as well.
Disadvantages:
- It's difficult to read the information with other applications.
Outsourcing
In some cases Beagle does not store the metadata itself but rather refers to the information of a different application. For example, information about IM buddies is retrieved directly from the Gaim/Kopete files.
Advantages:
- The integration with the other apps is no problem
- Changes in the app can be synced using inotify
Disadvantages:
- These files have to be queried every time.
- The file format depends on the application. Not all metadata can be associated with an app.
- Some items would have to be stored in more than one place, for example the link between IM and email in Kopete, Gaim and Evolution.
Tagging
There are different apps using tags: leaftag and F-Spot, as well as the emblems in Nautilus. I looked into the way leaftag and F-Spot store their tags. It is quite similar; the only real difference is that F-Spot has a tag hierarchy.
Advantages:
- If this could be integrated with leaftag, the leaftag bindings and tools would make integration with other apps really nice.
- Tagging enables an easy way to browse through search results:
I would prefer to have a way to store things like
Person has IM contact "mawx@jabber.org"
Person has email address "mwiehle2 at ix.urz.uni-heidelberg.de"
This could be done by tagging them both with "person:Max Wiehle". The tags could be shown in the details pane. The tag display could include an image and a context menu that would be reused everywhere the tag appears. A click on the tag would start a search for "person: Max Wiehle" that would return the contact as the top search result.
- Tagging all mails on a mailing list could be done by tagging the mlist property. The tag would then appear on all search results from that list.
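The "person:Max Wiehle" idea above can be sketched as a simple two-way tag index. This is a Python illustration only, not leaftag's or F-Spot's storage format: `TagIndex` and its methods are hypothetical, and the item URIs (including the placeholder email address) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set


class TagIndex:
    """Hypothetical two-way mapping between tags and tagged items."""

    def __init__(self) -> None:
        self._items_by_tag: Dict[str, Set[str]] = defaultdict(set)
        self._tags_by_item: Dict[str, Set[str]] = defaultdict(set)

    def tag(self, item: str, tag: str) -> None:
        self._items_by_tag[tag].add(item)
        self._tags_by_item[item].add(tag)

    def items_for(self, tag: str) -> List[str]:
        """Everything carrying this tag, e.g. for a click on the tag."""
        return sorted(self._items_by_tag.get(tag, set()))

    def tags_for(self, item: str) -> List[str]:
        """Tags to show in the details pane of one search result."""
        return sorted(self._tags_by_item.get(item, set()))


tags = TagIndex()
# Both the IM contact and the email address get the same person tag.
tags.tag("im:mawx@jabber.org", "person:Max Wiehle")
tags.tag("mailto:max@example.org", "person:Max Wiehle")  # placeholder address

linked = tags.items_for("person:Max Wiehle")
shown = tags.tags_for("im:mawx@jabber.org")
```

The tag here is just a string, so the relation ("has IM contact", "has email address") is not expressed anywhere except in the item's own URI.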
Disadvantages
- The main problem with this way of tagging is that it lacks some RDF features. The tags themselves can't be nodes, so storing something like: file://.../beagle.tgz was downloaded from "http://www.beagle-project.org" is not possible. And there is no way of expressing the relation besides putting it into the tag or URI like this: "downloaded_from:http..." or "mailto:mwiehle2...".
- It is difficult to return all IM logs containing mawx... if "person: Max Wiehle" was searched.
- All metainformation has to be queried for tags on it.
- inotify could be used to update, but it is not specific: as soon as the DB changes, all tags and metainfos would have to be updated.
MetadataStorage
MetadataStorage is based on SemWeb and enables RDF-like storage and retrieval of metadata from SQLite, XML and other formats. It is currently used, for example, to load metadata from F-Spot's SQLite DB.
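To give an idea of what reading RDF-like statements out of an application's SQLite DB could look like, here is a small Python sketch. It does not use MetadataStorage or SemWeb, and the schema (tables `photos`, `tags`, `photo_tags`) is invented for the example; it is not F-Spot's real schema.

```python
import sqlite3
from typing import List, Tuple


def load_statements(conn: sqlite3.Connection) -> List[Tuple[str, str, str]]:
    """Turn rows of the (invented) tag tables into (subject, predicate, object)."""
    rows = conn.execute(
        "SELECT photos.uri, tags.name FROM photo_tags "
        "JOIN photos ON photos.id = photo_tags.photo_id "
        "JOIN tags ON tags.id = photo_tags.tag_id "
        "ORDER BY photos.uri, tags.name"
    )
    return [(uri, "has_tag", name) for uri, name in rows]


# Build a throwaway in-memory DB with the invented schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE photos (id INTEGER PRIMARY KEY, uri TEXT);
CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE photo_tags (photo_id INTEGER, tag_id INTEGER);
INSERT INTO photos VALUES (1, 'file:///photos/a.jpg');
INSERT INTO tags VALUES (1, 'holiday');
INSERT INTO tags VALUES (2, 'family');
INSERT INTO photo_tags VALUES (1, 1);
INSERT INTO photo_tags VALUES (1, 2);
""")

statements = load_statements(conn)
```

Each row becomes one statement with the photo URI as subject, which is the kind of structure an RDF-like store can then query or link further.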
Advantages
- High flexibility of data format, so integration with other apps is easy
- You can create both simple tags and links
- There are some working integrations, like F-Spot
- Maybe joined SQL queries are possible
Disadvantages
- Not Lucene: it lacks the nice field search support, fuzzy searches, etc.
