Author: yuval

Java

NoClassDefFound Error in Tomcat

Post author By yuval
Post date October 20, 2004

This is like missing a dot somewhere, or a semicolon that screws everything up.
Tomcat DOES NOT HOT DEPLOY JAR files when they are placed in its common/lib directory.
YOU NEED TO RESTART TOMCAT.
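
If you want to confirm that this is what bit you, a tiny check like the one below can tell you whether a class is visible to the running code. It is only a sketch: the ClassCheck name and its isLoadable method are made up for illustration and are not part of Tomcat.

```java
// Hypothetical helper (not part of Tomcat): reports whether a class is
// visible to the class loader that loaded this helper. Inside a webapp,
// that is normally the webapp's own class loader.
final class ClassCheck {

    static boolean isLoadable(String className) {
        try {
            // initialize = false: just locate the class, do not run its static initializers
            Class.forName(className, false, ClassCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            // The class itself could not be found
            return false;
        } catch (LinkageError e) {
            // NoClassDefFoundError is a LinkageError: the class was found,
            // but something it depends on could not be loaded
            return false;
        }
    }
}
```

Call ClassCheck.isLoadable with the fully qualified name of a class from the JAR you just dropped in. If it returns false before a restart and true after one, you have your answer.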

Thank you.

Java Web Development

Deleting a document with Lucene

Post author By yuval
Post date October 20, 2004

Lucene keeps on blowing my mind, but finding out how to do rudimentary things with it is not too simple.
Suppose you want the index to no longer show a document that you deleted. As far as I understand – after some research pain – this involves six steps: [of course, there is definitely more than one way to do this, and I am by no means a Lucene expert]

1. Find the document’s id. That is the id Lucene, not you, gave the document.
2. Get a Directory object for the index directory.
3. Get an IndexReader for that directory
4. Unlock that directory
5. Delete the document
6. Close the IndexReader object

Each step is almost its own procedure.
1. Find the document’s id
This is the most elaborate step. You need to search your index for the document you wish to delete. To do so, I ran a query against the index.
(This sample query will show you the names and ids of documents that match on a field called “contents”):
// Open the existing index directory (false = do not create a new index)
Directory fsDir = FSDirectory.getDirectory(indexDir, false);
IndexSearcher is = new IndexSearcher(fsDir);

// Search the "contents" field for search_term
Query query = QueryParser.parse(search_term, "contents", new StandardAnalyzer());
Hits hits = is.search(query);
System.out.println("Found " + hits.length() + " document(s) that matched query '" + search_term + "':");

// Print each hit's filename, score and the id Lucene assigned to it
for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    System.out.println(doc.get("filename") + " score: " + hits.score(i) + " id: " + hits.id(i));
}

Finding the id, as you see, involves the Hits object, which holds the precious id(int hit_position) method that returns the id.

Now that you have the id, you can proceed and start the real deletion process:

2. Get a Directory object
Similar to what we did above, you get a Directory object from the FSDirectory. That is easy enough.


3. Get an IndexReader object

IndexReader is an abstract class, so you cannot instantiate it directly. Instead, you get a concrete implementation from its static open method, using a call like:

IndexReader ir = IndexReader.open(fsDir);

where fsDir is the Directory object we created in step 2.

4. Unlock the Directory

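Lucene uses lock files to protect the index while it is being updated. As far as I can tell, if a lock has been left behind (say, by a writer that crashed), the delete will fail, so you unlock the directory first. IndexReader will be happy to do that for you, but note that unlock is a static method that forcibly removes the lock, so only call it when you are sure nothing else is writing to the index:

IndexReader.unlock(fsDir);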
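
To picture what a lock that lives on disk means, here is a minimal sketch of the general idea. It is not Lucene's code: the SimpleLockFile class and its method names are made up for illustration. It is also written against a newer JDK than the one I use here, because it relies on UncheckedIOException.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustration only, NOT Lucene's implementation: a lock that exists
// as a plain file on disk. Whoever manages to create the file holds the lock.
final class SimpleLockFile {
    private final File lockFile;

    SimpleLockFile(File lockFile) {
        this.lockFile = lockFile;
    }

    // Returns true if we created the lock file, false if it already exists.
    boolean obtain() {
        try {
            // createNewFile is atomic: it returns false if the file is already there
            return lockFile.createNewFile();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    boolean isLocked() {
        return lockFile.exists();
    }

    // Removes the lock file no matter who created it. This is why a forced
    // unlock is dangerous while another process is still writing.
    void release() {
        lockFile.delete();
    }
}
```

The point of the sketch is that a lock file left behind by a dead process looks exactly like one held by a live process, so somebody has to decide when it is safe to remove it.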

5. Delete the document

Finally, we ask the IndexReader to delete the document using the id we found in step 1 - which we intuitively put in a variable called docId:

ir.delete(docId);

6. Close the IndexReader object

Nothing will happen unless you close the IndexReader object - the document will not be deleted. Easy enough, close it then:

ir.close();

Voila.
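
For reference, the six steps can be strung together roughly like this. Treat it as a sketch, not the one true way: it assumes the Lucene 1.4-era API used in this post and the "contents" field from the sample query, and the LuceneDeleter class and deleteMatching method are names I made up. It takes the Directory from step 2 as a parameter, and it only forces the lock open if the index actually reports one.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;

// Sketch against the Lucene 1.4-era API; class and method names are hypothetical.
final class LuceneDeleter {

    // Deletes every document whose "contents" field matches searchTerm
    // and returns how many documents were deleted.
    static int deleteMatching(Directory dir, String searchTerm)
            throws IOException, ParseException {
        // Step 1: collect the Lucene-assigned ids of the matching documents
        int[] ids;
        IndexSearcher searcher = new IndexSearcher(dir);
        try {
            Query query = QueryParser.parse(searchTerm, "contents", new StandardAnalyzer());
            Hits hits = searcher.search(query);
            ids = new int[hits.length()];
            for (int i = 0; i < ids.length; i++) {
                ids[i] = hits.id(i);
            }
        } finally {
            searcher.close();
        }

        // Step 3: open a reader on the directory from step 2
        IndexReader reader = IndexReader.open(dir);
        try {
            // Step 4: force the lock open only if one is present.
            // Assumption: nothing else is writing to this index right now.
            if (IndexReader.isLocked(dir)) {
                IndexReader.unlock(dir);
            }
            // Step 5: delete by id
            for (int i = 0; i < ids.length; i++) {
                reader.delete(ids[i]);
            }
        } finally {
            // Step 6: the deletions are only written out when the reader is closed
            reader.close();
        }
        return ids.length;
    }
}
```

Closing in a finally block means the reader is released even if one of the deletes throws. The ids are only good as long as nobody modifies the index between the search and the delete, so do the two back to back.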

Computing Java Web Development

Lucene and friends

As my current project draws to a close, I need to deliver document search capabilities.
SQL Server, despite how much I love it, requires you to have the documents reside *inside* the database for it to index them. That is ugly. So I ventured out and went with the much-hyped and totally-cool Lucene from Apache. I am an Apache bitch, and why not, I love them and they love everyone.

So I ran some tests and everything works great and I can sleep at night.

Today, Saturday, after we release the first iteration of the live site and I am in post-stress bliss, I discover that Lucene does not index PDFs out of the box. Text is cool, PDF ain’t cool.

Rapid searches turn into slow searches, during which I find bizarro projects such as Docco or Multivalent – both of which are as hostile and unhelpful as they get. Do you REALLY expect to find documentation? Blech! RTF(non-existent)M!
I even stumble across over-the-top solutions that provide you a complete web application that will index the ass off your website – Zilverline, and that’s very cool but I need a component or library I can integrate into my code and Lucene was so neat and fit that bill.

Finally, I come across PDFBox – which pretty much was it. It reads PDFs, converts them to text, and Lucene can now play with PDFs!

Now on to Word where Apache’s Jakarta POI is supposed to provide the goods for Excel, Word and even PowerPoint… am I being too optimistic?