Friday, March 6, 2009

Speeding up PDF indexing - Alfresco Hack #3

What does your document type distribution look like? If the majority of the documents stored in your Alfresco repository are PDFs, this article is for you. It describes how to speed up full-text indexing of PDFs by a factor of about 4. But relax, no coding required :)

To estimate the impact on your installation, some rough distribution numbers can be gathered like this:

dmc@alfresco: find ./alf_data/contentstore -type f -exec file -inb {} \;| sort |uniq -c|sort -nr
Example figures from our installation, summarized as percentages:
image/png          28 %
application/msword 13 %
application/pdf    10 %
image/gif           8 %
other              41 %
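
The command above prints raw counts per MIME type, not percentages. As a minimal sketch of the conversion, the following helper reads the `uniq -c` lines and prints each type's share. The class name `MimeDistribution` is made up for this illustration, and the input format (a count followed by `type; charset=...`) is an assumption about what `file -i` prints on your system.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MimeDistribution {
    // Turn "uniq -c" lines such as "  28 image/png; charset=binary"
    // into a percentage per MIME type (the charset suffix is ignored).
    public static Map<String, Double> percentages(List<String> lines) {
        Map<String, Long> counts = new LinkedHashMap<>();
        long total = 0;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] parts = trimmed.split("\\s+", 2);
            if (parts.length < 2) {
                continue; // a count without a type, skip it
            }
            long count = Long.parseLong(parts[0]);
            String mime = parts[1].split(";", 2)[0].trim();
            counts.merge(mime, count, Long::sum);
            total += count;
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            result.put(entry.getKey(), 100.0 * entry.getValue() / total);
        }
        return result;
    }

    // Reads the output of the find/sort/uniq pipeline from standard input.
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        List<String> lines = in.lines().collect(Collectors.toList());
        for (Map.Entry<String, Double> entry : percentages(lines).entrySet()) {
            System.out.printf("%-30s %5.1f %%%n", entry.getKey(), entry.getValue());
        }
    }
}
```

To use it, pipe the output of the command above into `java MimeDistribution`.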

PDF is the most used text format after Microsoft Word. These numbers make it clear that speeding up PDF indexing would benefit the overall system. The indexing process first extracts the plain text from the PDF document with a pdf->text transformer based on the PDFBox library. The extracted text is then fed to the Lucene indexer component.

To test the text extraction, I created a small collection of 15 PDF documents, 15 MB in total. To rule out Java startup and library loading times, I used this small sample program to time the PDFBox text stripper:

package de.dmc.alfresco.pdfbox;

import java.io.IOException;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class ExtractText {
    // Extract the text from the PDFs given on the command line and time the whole run
    public static void main(String[] args) throws IOException {
        long start = System.currentTimeMillis();
        for (String pdfFile : args) {
            PDDocument document = PDDocument.load(pdfFile);
            try {
                // The extracted text is discarded; only the elapsed time matters here
                PDFTextStripper stripper = new PDFTextStripper();
                stripper.getText(document);
            } finally {
                // Release the document even if extraction fails
                document.close();
            }
        }
        long stop = System.currentTimeMillis();
        long diff = stop - start;
        // diff/1000 is integer division, so the total is truncated to whole seconds
        System.err.printf("pdfs: %d, total: %d seconds, average: %.3f seconds per document\n", args.length, diff/1000, diff/1000d/args.length );
    }
}
Output on my 2 GHz dual core:
pdfs: 15, total: 10 seconds, average: 0.727 seconds per document
Result: roughly three quarters of a second is spent on text extraction for every PDF added to Alfresco!
In our projects, we replaced the PDFBox transformer with the pdftotext console tool. Timing it on the same collection:
lothar@lothar-laptop:~/devenv/lib/collections/pdf$ time for pdffile in *.pdf; do pdftotext "$pdffile" - >/dev/null;done

real 0m2.738s
user 0m2.460s
sys 0m0.136s
Result: about 3 seconds for all 15 PDF files, an average of roughly 0.2 seconds per document. That is about four times faster than PDFBox.
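
If you prefer to take both measurements from the same harness, the shell loop can be mirrored in Java. This is only a sketch: the class name `TimePdfToText` is made up, it assumes that `pdftotext` is on the PATH, and it needs Java 9 or later for `Redirect.DISCARD`. Note that the time includes spawning one process per document, just like the shell loop does.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class TimePdfToText {
    // Build the same command line as the shell loop: pdftotext <file> -
    public static List<String> buildCommand(String pdfFile) {
        return Arrays.asList("pdftotext", pdfFile, "-");
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        long start = System.currentTimeMillis();
        for (String pdfFile : args) {
            Process process = new ProcessBuilder(buildCommand(pdfFile))
                    // Throw the extracted text away, like >/dev/null in the shell loop
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    // Keep error messages visible on the console
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                System.err.println("pdftotext failed for " + pdfFile + " (exit code " + exitCode + ")");
            }
        }
        long diff = System.currentTimeMillis() - start;
        double average = args.length == 0 ? 0 : diff / 1000d / args.length;
        System.err.printf("pdfs: %d, total: %.3f seconds, average: %.3f seconds per document%n",
                args.length, diff / 1000d, average);
    }
}
```

Run it with the same file list as the PDFBox test, for example `java TimePdfToText *.pdf`.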

Configuration of the transformer:

Add this configuration to shared/classes/alfresco/extension/pdf-transformer-context.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springframework.org/dtd/spring-beans.dtd">
<beans>
        <!-- disable the standard PDFBox text transformer by overriding its bean with a dummy -->
        <bean id="transformer.PdfBox" class="java.lang.String"/>
        <!-- this bean has the one above injected; disable it too, it is recreated below as transformer.complex.OpenOffice.PdfToTextTool -->
        <bean id="transformer.complex.OpenOffice.PdfBox" class="java.lang.String"/>

        <!-- pdftotext command line binary -->
        <bean id="transformer.PdfToTextTool"
                class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformer"
                parent="baseContentTransformer">
                <property name="transformCommand">
                        <bean name="transformer.pdftotext.Command"
                                class="org.alfresco.util.exec.RuntimeExec">
                                <property name="commandMap">
                                        <map>
                                                <entry key="Linux.*">
                                                        <value>${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext-linux -enc UTF-8 ${options} ${source} ${target}</value>
                                                </entry>
                                                <entry key="Windows.*">
                                                        <value>${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext-win32.exe -enc UTF-8 ${options} ${source} ${target}</value>
                                                </entry>
                                        </map>
                                </property>
                                <property name="defaultProperties">
                                        <props>
                                                <prop key="options"></prop>
                                        </props>
                                </property>
                        </bean>
                </property>
                <!-- ensure executable bits of binaries on unix -->
                <property name="checkCommand">
                        <bean name="transformer.pdftotext.checkCommand"
                                class="org.alfresco.util.exec.RuntimeExec">
                                <property name="commandMap">
                                        <map>
                                                <entry key="Linux.*">
                                                        <value>chmod 775 ${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext-linux</value>
                                                </entry>
                                                <entry key="Windows.*">
                                                <!--  dummy value -->
                                                        <value>cmd.exe /C dir</value>
                                                </entry>
                                        </map>
                                </property>
                                <property name="defaultProperties">
                                        <props>
                                                <prop key="options"></prop>
                                        </props>
                                </property>
                        </bean>
                </property>
                <property name="explicitTransformations">
                        <list>
                                <bean
                                        class="org.alfresco.repo.content.transform.ContentTransformerRegistry$TransformationKey">
                                        <constructor-arg>
                                                <value>application/pdf</value>
                                        </constructor-arg>
                                        <constructor-arg>
                                                <value>text/plain</value>
                                        </constructor-arg>
                                </bean>
                        </list>
                </property>
        </bean>

   <!-- replaces bean transformer.complex.OpenOffice.PdfBox -->
   <bean id="transformer.complex.OpenOffice.PdfToTextTool"
        class="org.alfresco.repo.content.transform.ComplexContentTransformer"
        parent="baseContentTransformer" >
      <property name="transformers">
         <list>
            <ref bean="transformer.OpenOffice" />
            <ref bean="transformer.PdfToTextTool" />
         </list>
      </property>
      <property name="intermediateMimetypes">
         <list>
            <value>application/pdf</value>
         </list>
      </property>
   </bean>
</beans>
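
The configuration expects the pdftotext binary at the path given in the command map, and the check command sets its executable bits on Linux. Before restarting Alfresco you may want to verify that yourself. The following is a small sketch of such a check; the class name `CheckPdfToTextBinary` is made up, and you pass in the resolved path of your own installation.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CheckPdfToTextBinary {
    // True only if the path points to a regular file that the current user may execute
    public static boolean isUsable(Path binary) {
        return Files.isRegularFile(binary) && Files.isExecutable(binary);
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java CheckPdfToTextBinary <path to pdftotext binary>");
            System.exit(2);
        }
        Path binary = Paths.get(args[0]);
        if (isUsable(binary)) {
            System.out.println("OK: " + binary + " exists and is executable");
        } else {
            System.out.println("Problem: " + binary + " is missing or not executable");
            System.exit(1);
        }
    }
}
```

Run it as the same user that runs Tomcat, since the executable check depends on that user's permissions.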

Let me know how it worked for you!