As my current project draws to a close, I need to deliver document search capabilities.
SQL Server, much as I love it, requires the documents to reside *inside* the database before it will index them. That is ugly. So I ventured out and went with the much-hyped and totally cool Lucene from Apache. I am an Apache bitch, and why not? I love them and they love everyone.
So I ran some tests and everything works great and I can sleep at night.
Today, Saturday, after we release the first iteration of the live site and I am in post-stress bliss, I discover that Lucene does not index PDFs out of the box. Text is cool, PDF ain’t cool.
Rapid searches turn into slow searches, during which I find bizarro projects such as Docco or Multivalent, both of which are as hostile and unhelpful as they get. Do you REALLY expect documentation? Blech! RTF(non-existent)M!
I even stumble across over-the-top solutions such as Zilverline, which give you a complete web application that will index the ass off your website. That’s very cool, but I need a component or library I can integrate into my code, and Lucene was so neat and fit that bill.
Finally, I come across PDFBox, which pretty much was it. It reads PDFs and converts them to text, so Lucene can now play with PDFs!
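For the curious, here is roughly the shape of that "convert to text first, then index" idea. This is only a sketch: `TextExtractor` and `TinyIndex` are names I made up for illustration, the tiny in-memory index is a stand-in and not Lucene's API, and the PDF extractor is a placeholder where a real PDF-to-text call would go.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ExtractThenIndex {
    /** Hypothetical hook: anything that can turn raw file bytes into plain text. */
    interface TextExtractor {
        String extract(byte[] raw);
    }

    /** Toy inverted index standing in for a real one: term -> ids of documents containing it. */
    static class TinyIndex {
        private final Map<String, Set<String>> postings = new HashMap<>();

        /** Lowercases the text, splits it on non-word characters, and records each term. */
        void add(String docId, String text) {
            for (String term : text.toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
                }
            }
        }

        /** Returns the ids of documents containing the term (case-insensitive). */
        Set<String> search(String term) {
            return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
        }
    }

    public static void main(String[] args) {
        // Plain text needs no conversion at all.
        TextExtractor plainText = raw -> new String(raw, StandardCharsets.UTF_8);
        // Placeholder for the PDF case: a real version would hand the bytes
        // to a PDF library here and return the text it extracts.
        TextExtractor fakePdf = raw -> "text pulled out of a pdf";

        TinyIndex index = new TinyIndex();
        byte[] notes = "Lucene indexes plain text".getBytes(StandardCharsets.UTF_8);
        index.add("notes.txt", plainText.extract(notes));
        index.add("report.pdf", fakePdf.extract(new byte[0]));

        System.out.println(index.search("text")); // prints [notes.txt, report.pdf]
    }
}
```

The point is that the index only ever sees plain text, so supporting a new file format means writing one more extractor and leaving the indexing code alone.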
Now on to Word where Apache’s Jakarta POI is supposed to provide the goods for Excel, Word and even PowerPoint… am I being too optimistic?