Friday, March 6, 2009

Speeding up PDF indexing - Alfresco Hack #3

What is your document type distribution like? If the majority of the documents stored in your Alfresco installation are PDFs, this article is for you. It describes how to speed up full-text indexing of PDFs dramatically, by a factor of about 4! But relax, no coding required :)

To estimate the impact on your installation, some rough distribution numbers can be gathered like this:

dmc@alfresco: find ./alf_data/contentstore -type f -exec file -inb {} \;| sort |uniq -c|sort -nr
Example from our installation, with the counts converted to rough percentages:
image/png          28 %
application/msword 13 %
application/pdf    10 %
image/gif           8 %
other              41 %
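
The pipeline above prints raw counts per MIME type, not percentages. A minimal sketch of the conversion follows, assuming input lines of the form "count mime-type" as produced by uniq -c. The class name MimeShares is made up for this illustration, and the handling of a trailing "; charset=..." is an assumption about what your version of file prints.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper: turns "uniq -c" count lines into percentage shares. */
class MimeShares {

    /**
     * Parses lines of the form "  <count> <mime type>[; charset=...]" and
     * returns each MIME type's share of the total, in percent.
     */
    static Map<String, Double> shares(List<String> uniqCountLines) {
        Map<String, Long> counts = new LinkedHashMap<>();
        long total = 0;
        for (String line : uniqCountLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] parts = trimmed.split("\\s+", 2);
            if (parts.length < 2) {
                continue; // a count without a type; skip it
            }
            long count = Long.parseLong(parts[0]);
            // file(1) may append "; charset=..." to the MIME type; drop that part.
            String mime = parts[1].split(";", 2)[0].trim();
            counts.merge(mime, count, Long::sum);
            total += count;
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            result.put(e.getKey(), 100.0 * e.getValue() / total);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Reads the output of the find/file/sort/uniq pipeline from standard input.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        List<String> lines = in.lines().collect(Collectors.toList());
        shares(lines).forEach((mime, pct) -> System.out.printf("%-30s %5.1f %%%n", mime, pct));
    }
}
```

To use it, append "| java MimeShares" to the pipeline after compiling the class. The order of the input lines is preserved, so the most frequent types still come first.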

PDF is the most used text format after Microsoft Word. These numbers make it clear that faster PDF indexing would be a great benefit for the overall system. The indexing process first extracts the plain text content from the PDF document with a pdf->text transformer based on the PDFBox library. The extracted text is then fed to the Lucene indexer component.

To test the text extraction, I created a small collection of 15 PDF documents, 15 MB in total. To rule out Java startup and library loading times, I used this small sample program to time the PDFBox text stripper:

package de.dmc.alfresco.pdfbox;

import java.io.IOException;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class ExtractText {
    // Extract the text from the PDFs given on the command line and time it
    public static void main(String[] args) throws IOException {
        long start = System.currentTimeMillis();
        for (String pdfFile : args) {
            PDDocument document = PDDocument.load(pdfFile);
            try {
                PDFTextStripper stripper = new PDFTextStripper();
                stripper.getText(document);
            } finally {
                // release the document even if extraction fails
                document.close();
            }
        }
        long stop = System.currentTimeMillis();
        long diff = stop - start;
        System.err.printf("pdfs: %d, total: %d seconds, average: %.3f seconds per document%n",
                args.length, diff / 1000, diff / 1000d / args.length);
    }
}
Output on my 2 GHz dual core:
pdfs: 15, total: 10 seconds, average: 0.727 seconds per document
Result: about three quarters of a second is spent on text extraction for every PDF added to Alfresco!
In our projects, we replaced the PDFBox transformer with the pdftotext console tool.
lothar@lothar-laptop:~/devenv/lib/collections/pdf$ time for pdffile in *.pdf; do pdftotext "$pdffile" - >/dev/null; done

real 0m2.738s
user 0m2.460s
sys 0m0.136s
Result: about 3 seconds for all 15 PDF files, an average of about 0.2 seconds per document. That is roughly a quarter of the time PDFBox needed.

Configuration of the transformer:

Add this configuration to shared/classes/alfresco/extension/pdf-transformer-context.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springframework.org/dtd/spring-beans.dtd">
<beans>
        <!-- disable the standard PDFBox text transformer by overriding its bean with a plain String -->
        <bean id="transformer.PdfBox" class="java.lang.String"/>
        <!-- this bean has transformer.PdfBox injected, so it is overridden too; its replacement is defined below -->
        <bean id="transformer.complex.OpenOffice.PdfBox" class="java.lang.String"/>

        <!-- pdftotext command line binary -->
        <bean id="transformer.PdfToTextTool"
                class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformer"
                parent="baseContentTransformer">
                <property name="transformCommand">
                        <bean name="transformer.pdftotext.Command"
                                class="org.alfresco.util.exec.RuntimeExec">
                                <property name="commandMap">
                                        <map>
                                                <entry key="Linux.*">
                                                        <value>${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext-linux -enc UTF-8 ${options} ${source} ${target}</value>
                                                </entry>
                                                <entry key="Windows.*">
                                                        <value>${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext-win32.exe -enc UTF-8 ${options} ${source} ${target}</value>
                                                </entry>
                                        </map>
                                </property>
                                <property name="defaultProperties">
                                        <props>
                                                <prop key="options"></prop>
                                        </props>
                                </property>
                        </bean>
                </property>
                <!-- ensure executable bits of binaries on unix -->
                <property name="checkCommand">
                        <bean name="transformer.pdftotext.checkCommand"
                                class="org.alfresco.util.exec.RuntimeExec">
                                <property name="commandMap">
                                        <map>
                                                <entry key="Linux.*">
                                                        <value>chmod 775 ${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext-linux</value>
                                                </entry>
                                                <entry key="Windows.*">
                                                <!--  dummy value -->
                                                        <value>cmd.exe /C dir</value>
                                                </entry>
                                        </map>
                                </property>
                                <property name="defaultProperties">
                                        <props>
                                                <prop key="options"></prop>
                                        </props>
                                </property>
                        </bean>
                </property>
                <property name="explicitTransformations">
                        <list>
                                <bean
                                        class="org.alfresco.repo.content.transform.ContentTransformerRegistry$TransformationKey">
                                        <constructor-arg>
                                                <value>application/pdf</value>
                                        </constructor-arg>
                                        <constructor-arg>
                                                <value>text/plain</value>
                                        </constructor-arg>
                                </bean>
                        </list>
                </property>
        </bean>

   <!-- replaces bean transformer.complex.OpenOffice.PdfBox -->
   <bean id="transformer.complex.OpenOffice.PdfToTextTool"
        class="org.alfresco.repo.content.transform.ComplexContentTransformer"
        parent="baseContentTransformer" >
      <property name="transformers">
         <list>
            <ref bean="transformer.OpenOffice" />
            <ref bean="transformer.PdfToTextTool" />
         </list>
      </property>
      <property name="intermediateMimetypes">
         <list>
            <value>application/pdf</value>
         </list>
      </property>
   </bean>
</beans>
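
To make the transformCommand values above easier to read: the ${options}, ${source} and ${target} placeholders are filled in for each transformation before the binary is started. The following is only an illustration of that substitution, not Alfresco's RuntimeExec code. The class name CommandTemplate and the temp file names are made up, and I am assuming that ${catalina.base} is resolved separately when the Spring context is loaded, which is why the sketch leaves unknown placeholders alone.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of ${name} placeholder expansion in a command line. */
class CommandTemplate {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces every ${name} that has a value in the map; others are kept as-is. */
    static String expand(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            // Unknown placeholders are left untouched so they stay visible.
            String replacement = (value != null) ? value : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("options", "");             // same default as in the XML above
        values.put("source", "/tmp/in.pdf");   // made-up temp file names
        values.put("target", "/tmp/out.txt");
        System.out.println(expand(
                "pdftotext-linux -enc UTF-8 ${options} ${source} ${target}", values));
    }
}
```

Running the expanded command by hand against one of your own PDFs is a quick way to check that the binary is in place and executable before restarting Alfresco.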

Let me know how it worked for you!

7 comments:

Anonymous said...

This was very helpful, although for a different application. We found a bug when extracting text from PDFs created by a Fujitsu fi-6010N network scanner. The scanner performs OCR and creates the PDF. The default Alfresco text extractor does not add spaces between words, so every line ran together as one word and was not indexed properly.

Using your helpful hack, we were able to replace PDFBox with pdftotext which does not have the same extraction issue and content is indexed as expected.

Thanks for the helpful post!

Martin Wildam said...

Does not seem to work with current Labs 3.2:
https://issues.alfresco.com/jira/browse/ALFCOM-3288

Ganesh Kolhe said...

Hi,
Can you please let me know whether there will be a significant time difference when we set up Alfresco for FULL index recovery mode?

Ganesh Kolhe said...

I have followed the steps given in the blog, but it is not working for me. How can I make sure that full reindexing uses the pdftotext tool?

Anonymous said...

Hi,

I have a message for the webmaster/admin here at thinkalfresco.blogspot.com.

Can I use part of the information from this blog post above if I give a link back to this website?

Thanks,
Oliver

Anonymous said...

Nice work, Thanks

KRUTIK JAYSWAL said...

I am using Alfresco 5.0b. I am trying to use Tesseract OCR for TIFF to PDF, but it is not picking up my custom transformer. It uses the out-of-the-box feature when I apply a rule. Can anyone help me with this?
