My client was experiencing difficulties when trying to index Word files into Lucene.
I am using the text extraction library from TextMining.org, but the issue also occurs when using Apache POI (which TextMining.org is related to).
The exception being thrown is:
Exception while extracting Word file: Invalid header signature
After opening one of the problematic files, I found that they were actually RTF files saved with a Word .doc extension. Only after saving the file under a different name (using Save As…) and explicitly specifying the Word Document format was the file saved properly, and its text was then extracted successfully.
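A mismatch like this can be caught before the file ever reaches the extractor by looking at its first few bytes. The following is only a minimal sketch, not part of TextMining.org or Apache POI: the class name DocSniffer and its methods are hypothetical names chosen for illustration. It relies on two well-known signatures: binary Word files are OLE2 compound documents starting with the bytes D0 CF 11 E0 A1 B1 1A E1, while RTF files start with the ASCII text "{\rtf". The "Invalid header signature" message is most likely the result of that OLE2 signature check failing.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of POI or TextMining.org): guesses the real
// format of a ".doc" file from its leading bytes.
public class DocSniffer {
    // Signature of an OLE2 compound document, the container of binary .doc files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // RTF files begin with the ASCII characters {, \, r, t, f.
    private static final byte[] RTF_MAGIC = {'{', '\\', 'r', 't', 'f'};

    // True if the first len valid bytes of data begin with magic.
    private static boolean startsWith(byte[] data, int len, byte[] magic) {
        if (len < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Classifies a header buffer; len is the number of valid bytes in it.
    public static String sniff(byte[] header, int len) {
        if (startsWith(header, len, OLE2_MAGIC)) {
            return "OLE2";
        }
        if (startsWith(header, len, RTF_MAGIC)) {
            return "RTF";
        }
        return "UNKNOWN";
    }

    // Reads up to 8 bytes from the file and classifies them (requires Java 9+).
    public static String sniff(Path file) throws IOException {
        byte[] header = new byte[OLE2_MAGIC.length];
        try (InputStream in = Files.newInputStream(file)) {
            int n = in.readNBytes(header, 0, header.length);
            return sniff(header, n);
        }
    }
}
```

In an indexing pipeline, a file reported as "RTF" could be routed to an RTF-capable extractor or flagged for re-saving, and only files reported as "OLE2" passed on to the Word extractor. An OLE2 signature only identifies the container, so it does not by itself prove the file is a Word document.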
Also, make sure that Word's Fast Save option is turned off, as it too will cause issues when extracting text.