Categories
Java Web Development

Deleting a document with Lucene

Lucene keeps on blowing my mind, but finding out how to do rudimentary things with it is not too simple.
Suppose you want the index to no longer show a document that you deleted. As far as I understand – after some research pain – this involves six steps: [of course, there is definitely more than one way to do this, and I am by no means a Lucene expert]

1. Find the document’s id. That is the id Lucene, not you, gave the document.
2. Get a Directory object for the index directory.
3. Get an IndexReader for that directory
4. Unlock that directory
5. Delete the document
6. Close the IndexReader object

Each step is almost its own procedure.
1. Find the document’s id
This is the most elaborate step. You need to search your index for the document you wish to delete. To do so, I ran a query against the index.
(This sample query will show you the names and ids of documents that match on a field called “contents”):

// Open the index directory (false = do not create a new index)
Directory fsDir = FSDirectory.getDirectory(indexDir, false);
IndexSearcher is = new IndexSearcher(fsDir);
// Match search_term against the "contents" field
Query query = QueryParser.parse(search_term, "contents", new StandardAnalyzer());
Hits hits = is.search(query);
System.out.println("Found " + hits.length() + " document(s) that matched query '" + search_term + "':");
for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    System.out.println(doc.get("filename") + " score: " + hits.score(i) + " id: " + hits.id(i));
}

Finding the id, as you see, involves the Hits object, which holds the precious id(int hit_position) method that returns the id.

Now that you have the id, you can proceed and start the real deletion process:

2. Get a Directory object
Similar to what we did above, you get a Directory object from the FSDirectory. That is easy enough.

3. Get an IndexReader object
IndexReader is an abstract class, so to get a concrete implementation you go through its static open method:
IndexReader ir = IndexReader.open(fsDir);
where fsDir is the Directory object we created in step 2.

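4. Unlock the Directory
Lucene uses file locks to secure the index and the updates happening to it. If a lock is still held on the index, the delete cannot go through, so you first unlock the directory. unlock is a static method on IndexReader, and it removes the lock forcibly, so only call it when you know nothing else is writing to the index:
IndexReader.unlock(fsDir);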

5. Delete the document
Finally, we ask the IndexReader to delete the document using the id we found in step 1 - which we intuitively put in a variable called docId:
ir.delete(docId);

6. Close the IndexReader object
Nothing will happen unless you close the IndexReader object - the document will not be deleted. Easy enough, close it then:
ir.close();

Voila.

Categories
Computing Java Web Development

Lucene and friends

As my current project draws to a close, I need to deliver document search capabilities.
SQL Server, despite how much I love it, requires you to have the documents reside *inside* the database for it to index them. That is ugly. So I ventured out and went with the much-hyped and totally-cool Lucene from Apache. I am an Apache bitch, and why not, I love them and they love everyone.

So I ran some tests and everything works great and I can sleep at night.

Today, Saturday, after we release the first iteration of the live site and I am in post-stress bliss, I discover that Lucene does not index PDFs out of the box. Text is cool, PDF ain’t cool.

Rapid searches turn into slow searches, during which I find bizarro projects such as Docco or Multivalent – both of which are as hostile and unhelpful as they get. Do you REALLY expect documentation? Blech! RTF(non-existent)M!
I even stumble across over-the-top solutions that provide you a complete web application that will index the ass off your website – Zilverline, and that’s very cool but I need a component or library I can integrate into my code and Lucene was so neat and fit that bill.

Finally, I come across PDFBox – which pretty much was it. It reads PDFs, converts them to text, and Lucene can now play with PDFs!

Now on to Word where Apache’s Jakarta POI is supposed to provide the goods for Excel, Word and even PowerPoint… am I being too optimistic?

Categories
Java

JSP/Servlets and Form Checkboxes

This one caught me off-guard.

Suppose you have a form with multiple checkboxes that share the same name. You check off some boxes and submit the form.

In ColdFusion, checkbox values submitted from a form are presented to you inside a variable that contains a comma-delimited list of the checked values.

Java does not play the same way.

In Java, what you get is an array of Strings, returned when you ask the request object (via getParameterValues) for the parameter name the checkboxes share. You then need to iterate through the array to get all the values and do whatever you wish with them.
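Here is a rough sketch of what that looks like, written as plain Java so it runs without a servlet container. The array in main stands in for what request.getParameterValues("toppings") would hand back in a real servlet; the "toppings" field name and the two helper methods are made up for illustration. One thing to watch for: getParameterValues returns null, not an empty array, when no box was checked.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class CheckboxValues {

    // Wraps the array returned for a shared checkbox name in a list.
    // A null array means no box was checked, so we return an empty list.
    static List<String> checkedValues(String[] raw) {
        if (raw == null) {
            return Collections.emptyList();
        }
        return Arrays.asList(raw);
    }

    // Rebuilds a ColdFusion-style comma-delimited list from the same array.
    static String asCommaList(String[] raw) {
        return String.join(",", checkedValues(raw));
    }

    public static void main(String[] args) {
        // Stand-in for request.getParameterValues("toppings") (hypothetical field name)
        String[] submitted = {"cheese", "olives", "mushrooms"};

        // Iterate through the values, as the servlet code would
        for (String value : checkedValues(submitted)) {
            System.out.println("checked: " + value);
        }

        // Or get the familiar comma-delimited form back
        System.out.println(asCommaList(submitted));
    }
}
```

If you miss the ColdFusion behavior, asCommaList gives you the same kind of string back, but iterating over the array directly is usually the less fragile choice, since a checkbox value may itself contain a comma.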

What surprised me is that there is really no mention of this behavior in any of the books I checked or around the web. Well, it is noted now.
