Introduction
A filter in Beagle is an object which extracts metadata and/or text from a blob of data. These blobs may come from files, emails, or anywhere else. For more info on filters and the whole indexing process, see the Beagle Architecture Overview.
This document was written by Joe Shaw but is largely based on a previous filter tutorial by Debajyoti Bera.
Hello World Filter
using System.IO;
using Beagle.Daemon;

namespace Beagle.Filters {
    public class HelloWorldFilter : Beagle.Daemon.Filter {
        public HelloWorldFilter ()
        {
        }

        override protected void RegisterSupportedTypes ()
        {
            AddSupportedFlavor (FilterFlavor.NewFromMimeType ("text/x-patch"));
        }

        override protected void DoOpen (FileInfo info)
        {
        }

        override protected void DoPullProperties ()
        {
            AddProperty (Beagle.Property.New ("fixme:comment", "Hello world!"));
        }

        override protected void DoPullSetup ()
        {
        }

        override protected void DoPull ()
        {
            AppendText ("Hello World");
            Finished ();
        }

        override protected void DoClose ()
        {
        }
    }
}
This trivial filter creates a single property named "fixme:comment" with the value "Hello world!" and adds the words "Hello" and "World" to the index as text. It overrides every method a filter can override, but does nothing in most of them. You don't need to override all of the methods as we did here.
Technical details of a filter
- Filters are typically (but not necessarily) in the Beagle.Filters namespace
- Filters should include "using Beagle.Daemon;"
- Filters must derive from Beagle.Daemon.Filter or a subclass
- Filters must advertise the MIME types and/or file extensions they support. In our example above, we handle the "text/x-patch" type (files created with the "diff" command).
- Filters are versioned. If a filter version changes, all files handled by that filter are reindexed. You can set the version of a filter with the SetVersion() method in the filter constructor.
Filters run in the index helper process. A new instance of a filter is created for every file to be indexed. Performance can be improved by declaring constant data like lookup tables as static variables. Several more suggestions are included in [1].
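Here is a minimal, standalone sketch of why static lookup tables help when a new instance is created for every file. KeywordTagger and its keyword list are hypothetical and not part of Beagle; only the static-versus-instance pattern matters here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical filter-like class; it does NOT derive from Beagle.Daemon.Filter.
class KeywordTagger {
    // Counts how many times the lookup table has been built.
    public static int TableBuilds;

    // Built once per process and shared by every instance.
    static readonly HashSet<string> Keywords = BuildKeywords ();

    static HashSet<string> BuildKeywords ()
    {
        TableBuilds++;
        return new HashSet<string> { "if", "then", "else", "fi" };
    }

    public bool IsKeyword (string word)
    {
        return Keywords.Contains (word);
    }
}

static class Program {
    static void Main ()
    {
        // One instance per "file", mimicking how a filter is instantiated.
        for (int i = 0; i < 1000; i++)
            new KeywordTagger ().IsKeyword ("fi");

        Console.WriteLine (KeywordTagger.TableBuilds);   // prints 1
    }
}
```

Had Keywords been an instance field, the table would have been rebuilt for each of the 1000 instances.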
Filter methods
In our HelloWorldFilter we chose to override all of Beagle.Daemon.Filter's virtual methods, but we did very little in most of them. We'll go over each method and examine what it is used for. Note that, with the exception of DoPull(), each of these methods is called only once per instance.
- The constructor - A constructor taking no parameters is required. Filters should set the SnippetMode variable to true if content from this filter should be shown as snippets in search results. Set the OriginalIsText variable to true only if the extracted text is plain text and not structured or marked up. To increase the version of a filter (e.g. when a new revision of the filter extracts more properties), call SetVersion(version); the default initial version is 0.
- RegisterSupportedTypes() - Supported MIME types and extensions should be registered here.
- DoOpen(FileInfo file) - This method is called when your filter starts. Override this method if you need to set one-time state for your filter or you need to open the file specially. The contents of the file are available to you through the Stream variable, so you only need to open the file if you need to handle it outside of the normal .NET Stream class.
- DoPullProperties() - This method should extract metadata from the file and add it as properties. In our example filter we set a fixme:comment property. Obviously a real filter won't hardcode the value. Almost all filters will override this method.
- DoPullSetup() - Do any one-time setup needed to pull the text content from the data here. The underlying stream is reset to the beginning before this method is called. Few filters override this method, and it's completely unnecessary if you don't override DoPull(). All properties and child indexables must be added by this point; if they are added later they will be silently dropped. Child indexables are typically created and added in this method.
- DoPull() - This method is called repeatedly until either Finished() or Error() is called. This method extracts text content from the underlying data. The filter should call AppendText() for all text, AppendWhiteSpace() whenever appropriate (such as a linebreak) and AppendStructuralBreak() for things like paragraph breaks.
- DoClose() - This method should clean up after anything done in DoOpen().
At any point during the sequence, the Finished() method can be called to indicate a successful run of the filter and Error() can be called to indicate some sort of failure. It is very important to call one of these methods if you override the DoPull() method or else the filter will loop forever.
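The call sequence described above can be sketched with a small standalone program. MiniFilter is a hypothetical stand-in written for this illustration, not Beagle.Daemon.Filter itself, and the exact order in its Run() method is an assumption based on the order of the method list above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for Beagle.Daemon.Filter; NOT the real class.
abstract class MiniFilter {
    bool done;
    public bool Failed { get; private set; }
    public readonly List<string> Text = new List<string> ();

    protected void AppendText (string s) { Text.Add (s); }
    protected void Finished () { done = true; }
    protected void Error () { done = true; Failed = true; }

    protected virtual void DoOpen () { }
    protected virtual void DoPullProperties () { }
    protected virtual void DoPullSetup () { }
    protected virtual void DoPull () { Finished (); }   // assumed default
    protected virtual void DoClose () { }

    // Assumed driver: every step runs once, except DoPull().
    public void Run ()
    {
        DoOpen ();
        if (!done) DoPullProperties ();
        if (!done) DoPullSetup ();
        while (!done)
            DoPull ();   // repeated until Finished() or Error() is called
        DoClose ();
    }
}

class ThreeChunkFilter : MiniFilter {
    int chunk;

    protected override void DoPullSetup () { chunk = 0; }

    protected override void DoPull ()
    {
        if (chunk == 3) {
            Finished ();   // without this call, Run() would never return
            return;
        }
        AppendText ("chunk " + chunk++);
    }
}

static class Program {
    static void Main ()
    {
        ThreeChunkFilter f = new ThreeChunkFilter ();
        f.Run ();
        Console.WriteLine (string.Join (", ", f.Text.ToArray ()));
        // prints: chunk 0, chunk 1, chunk 2
    }
}
```

Removing the Finished() call from ThreeChunkFilter.DoPull() makes the while loop spin forever, which is the failure mode the paragraph above warns about.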
Implementing the methods
Unfortunately there is no simple filter which implements all of the methods above. Ones that do tend to be complicated and have to deal with parsing a complex data format. So instead of showing a complete example, we'll show snippets from existing filters to see how to implement each piece.
DoOpen() and DoClose()
The email filter uses a library called GMime to process emails. GMime uses low-level POSIX file descriptors instead of .NET streams. In our DoOpen() method, we open the file using POSIX I/O and create a GMime.Message which persists for the life of our filter and which we dispose of in DoClose():
// This has been edited for illustrative purposes
protected override void DoOpen (FileInfo info)
{
    int mail_fd = Mono.Unix.Native.Syscall.open (info.FullName, Mono.Unix.Native.OpenFlags.O_RDONLY);

    if (mail_fd < 0) {
        Log.Error ("Unable to open mail for reading!");
        Error ();
        return;
    }

    GMime.StreamFs stream = new GMime.StreamFs (mail_fd);
    GMime.Parser parser = new GMime.Parser (stream);
    this.message = parser.ConstructMessage ();
    stream.Dispose (); // this closes mail_fd
    parser.Dispose ();

    // Parsing failed, so there is nothing to index
    if (this.message == null)
        Error ();
}

protected override void DoClose ()
{
    if (this.message != null)
        this.message.Dispose ();
}
DoPullProperties()
Again in the email filter, we set a number of properties based on fields in the GMime.Message instance:
// This has been edited for illustrative purposes
protected override void DoPullProperties ()
{
    string subject = GMime.Utils.HeaderDecodePhrase (this.message.Subject);
    AddProperty (Property.New ("dc:title", subject));

    AddProperty (Property.NewDate ("fixme:date", message.Date.ToUniversalTime ()));

    using (GMime.InternetAddressList addrs = this.message.GetRecipients (GMime.RecipientType.To)) {
        foreach (GMime.InternetAddress ia in addrs)
            AddProperty (Property.NewKeyword ("fixme:to", ia.ToString ()));
    }
}
In this shortened example we add the email subject, the date the mail was sent, and everyone it was sent to as properties. There are a few different ways to add properties:
- Property.New () - The text is analyzed and the original text is stored in the index for retrieval later.
- Property.NewUnstored () - The text is analyzed but the original text is not stored.
Analyzed text is stemmed, and each word can be searched on. For example, "bowl" and "cherry" will both match on "A bowl of cherries". The original text ("A bowl of cherries") cannot be retrieved if you use NewUnstored().
- Property.NewKeyword () - The text is not analyzed, but it is searchable. The value is stored in the index.
- Property.NewUnsearched () - The text is not analyzed or searched. This is only for storing the value in the index for retrieval later.
Unanalyzed text is not stemmed, and the entire value must be matched. For NewKeyword(), "A bowl of cherries" is the only thing that will match "A bowl of cherries". Nothing will match in the NewUnsearched() case.
- Property.NewBool () - For storing boolean values in the index.
- Property.NewDate () - For storing date values in the index.
NewBool() is similar to NewUnsearched() but for boolean values. Like NewUnsearched(), its only use is for pulling information out later, for example whether or not an email has been read. NewDate() is similar to NewKeyword(), but for dates. Dates can be searched on.
DoPullSetup()
The shell script filter only indexes up to the first 20k of a shell script. Its DoPullSetup() is very simple:
override protected void DoPullSetup ()
{
    this.count = 0;
}
Each time data is read in DoPull(), this.count is increased and checked against the 20k threshold.
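A standalone sketch of that counting pattern might look like the following. LimitedPuller is hypothetical and does not derive from Beagle.Daemon.Filter; its AppendText() and Finished() are local stand-ins, and counting characters rather than bytes is an assumption, since the shell script filter's actual DoPull() is not shown here.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical stand-in that mimics the DoPullSetup()/DoPull() pair.
class LimitedPuller {
    const int Limit = 20 * 1024;   // assumed meaning of "20k": characters

    readonly TextReader reader;
    readonly StringBuilder indexed = new StringBuilder ();
    int count;

    public bool Done { get; private set; }
    public int IndexedLength { get { return indexed.Length; } }

    public LimitedPuller (TextReader reader) { this.reader = reader; }

    // Local stand-ins for the filter methods of the same name.
    void AppendText (string s) { indexed.Append (s); }
    void Finished () { Done = true; }

    public void DoPullSetup ()
    {
        count = 0;
    }

    public void DoPull ()
    {
        string line = reader.ReadLine ();
        if (line == null) {
            Finished ();           // end of input
            return;
        }

        AppendText (line);
        count += line.Length;
        if (count >= Limit)
            Finished ();           // threshold passed; stop pulling
    }
}

static class Program {
    static void Main ()
    {
        // 100 lines of 1000 characters each, far more than the limit.
        StringBuilder input = new StringBuilder ();
        for (int i = 0; i < 100; i++)
            input.Append (new string ('x', 1000)).Append ('\n');

        LimitedPuller p = new LimitedPuller (new StringReader (input.ToString ()));
        p.DoPullSetup ();
        while (!p.Done)
            p.DoPull ();

        Console.WriteLine (p.IndexedLength);   // prints 21000
    }
}
```

The check runs after each line is appended, so the sketch indexes the line that crosses the threshold and then stops, leaving the remaining 79 lines unread.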
DoPull()
The simplest DoPull() implementation is in the plain text filter. It reads the text file line by line and adds the text.
override protected void DoPull ()
{
    string str = TextReader.ReadLine ();
    if (str == null)
        Finished ();
    else if (str.Length > 0) {
        AppendText (str);
        AppendStructuralBreak ();
    }
}
All text added with AppendText() is analyzed before being added to the index. That is why individual words match documents. As mentioned before, remember to call Finished() when all the data has been processed. DoPull() is called repeatedly until Finished() or Error() is called.
In addition to regular text, some filters generate "hot" text. Hot text is text that is somehow highlighted in a document to make it more important. Examples might include words that are bolded in a word processor document, or headings in an HTML document. You can index hot text in one of two ways:
- Call HotUp(), append text using AppendText (string text), and then call HotDown(). All text added between the HotUp() and HotDown() calls will be considered hot text.
- Call AppendText (string regular, string hot).
Registering filter with beagle
You will have to register your filter with Beagle. Beagle looks for filter DLLs in the locations given by the environment variable BEAGLE_FILTER_PATH and in a few default locations. If you are building the filter as part of the Beagle source tree, you can add your filter to beagle/Filters/AssemblyInfo.cs. If you are building the filter as an independent DLL, you have to add the following at the beginning of the filter file:
using Beagle.Filters;

[assembly: Beagle.Daemon.FilterTypes (
    typeof (FilterCompressedFiles)
)]
Building filter
If you are building the filter as part of the beagle source tree, you have to add the filter filename to the appropriate makefile.
You can also build the filter outside the Beagle source tree, using the command below. You need to know the paths to Util.dll, Beagle.dll and BeagleDaemonPlugins.dll; they are generally installed in /usr/lib/beagle or /usr/local/lib/beagle.
$ gmcs FilterFile.cs -target:library -out:filtername.dll -r:/path/to/Util.dll -r:/path/to/Beagle.dll -r:/path/to/BeagleDaemonPlugins.dll
(For Beagle < 0.2.15, use mcs instead of gmcs.)
Then place the generated filtername.dll in a location pointed to by the environment variable BEAGLE_FILTER_PATH or the default location (usually /usr/lib/beagle/Filters/ or /usr/local/lib/beagle/Filters/).
Testing filters using beagle-extract-content
Once you've written and built your filter, you'll want to test it. Instead of waiting for the Beagle daemon to come across a file, you can use the beagle-extract-content program to test it. beagle-extract-content outputs the MIME type, all of the extracted properties, and both the regular and hot text extracted by the filter.
Example output on a PDF file:
Filename: file:///home/joe/nytimes-firefox-final.pdf
Debug: Loaded 46 filters from /home/joe/cvs/beagle/Filters/Filters.dll
Filter: Beagle.Filters.FilterPdf
MimeType: application/pdf
Properties:
  dc:appname = Adobe PDF Library 6.0
  dc:creator = Adobe InDesign CS (3.0.1)
  dc:title = nytimes_spread_v4.indd
  fixme:page-count = 2
Content:
Firefox 1.0 “I installed Firefox on my laptop today. It’s so fast — I never knew there could be that much of a difference.” — Stephen Cropp, New Zealand “I was tired of my browser crashing everyday, so I tried Firefox. Now I can’t live without it. Pop-up blocking, secure browsing and no spyware. Best of all... not a crash since I switched.” — Justin Henderson, USA “I thought changing and learning a new web browser would be difficult, but with Firefox I had no problems at all. Browsing is now smooth.” — Jouni Hätinen, Finland Firefox is the free, open source web browser from the Mozilla Foundation that lets you surf faster and more efficiently and helps avoid annoying pop-ups and spyware. Join us and make the switch today — Firefox imports your Favorites, settings and other information, so there’s nothing to lose. Find out what more than 10 million users from around the world already know: there is an alternative! Introducing Mozilla GetFirefox.com This message has been brought to you by the thousands who contributed funds to the Mozilla Foundation, a non-profit organization dedicated to promoting choice and innovation on the Internet. Special thanks to the employees of Haberman & Associates, MozSource, Oracle, Red Herring, Red Hat, Sourceforge.net, Speakeasy and Sun Microsystems for helping to make this possible. Download today from
(no hot content)
