Friday, March 6, 2009

Speeding up PDF indexing - Alfresco Hack #3

What does your document type distribution look like? If the majority of the documents stored in Alfresco are PDFs, this article is for you. It describes how to speed up full-text indexing of PDFs by a factor of about 4. But relax, no coding required :)

To estimate the impact on your installation, some rough distribution numbers can be gathered like this:

dmc@alfresco: find ./alf_data/contentstore -type f -exec file -inb {} \;| sort |uniq -c|sort -nr
Example output from our installation, summarized as percentages:
image/png          28 %
application/msword 13 %
application/pdf    10 %
image/gif           8 %
other              41 %
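
The command above prints raw counts per MIME type. A minimal sketch of how those counts could be turned into percentages follows. The function name mime_percentages is made up for this example, and it assumes the "count type; charset=..." line shape that file -i piped through uniq -c typically produces:

```shell
# Read "count mimetype; charset=..." lines on stdin and print one
# "mimetype percent %" line per type, largest share first.
mime_percentages() {
  awk '{
         sub(/;$/, "", $2)          # drop the trailing ";" after the MIME type
         count[$2] += $1
         total += $1
       }
       END {
         for (m in count)
           printf "%-30s %5.1f %%\n", m, 100 * count[m] / total
       }' |
    sort -k2,2nr
}

# Possible usage (path as in the command above):
# find ./alf_data/contentstore -type f -exec file -inb {} \; | sort | uniq -c | mime_percentages
```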

After Microsoft Word, PDF is the most used text format. These numbers make it clear that speeding up PDF indexing would benefit the overall system. The indexing process first extracts the plain text content from the PDF document with the help of a PDF-to-text transformer based on the PDFBox library. The extracted text is then fed to the Lucene indexer component.

To test the text extraction, I created a small collection of 15 PDF documents, 15 MB in total. To rule out Java startup and library loading times, I used this small sample program to time the PDFBox text stripper:

package de.dmc.alfresco.pdfbox;

import java.io.IOException;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class ExtractText {
    // Extract the text from the PDFs given on the command line and time it
    public static void main(String[] args) throws IOException {
        long start = System.currentTimeMillis();
        for (String pdfFile : args) {
            PDDocument document = PDDocument.load(pdfFile);
            try {
                PDFTextStripper stripper = new PDFTextStripper();
                // the extracted text is discarded; only the elapsed time matters here
                stripper.getText(document);
            } finally {
                // always release the document, even if extraction fails
                document.close();
            }
        }
        long stop = System.currentTimeMillis();
        long diff = stop - start;
        // note: diff / 1000 is integer division, so the total is truncated to whole seconds
        System.err.printf("pdfs: %d, total: %d seconds, average: %.3f seconds per document\n",
                args.length, diff / 1000, diff / 1000d / args.length);
    }
}
Output on my 2 GHz dual core:
pdfs: 15, total: 10 seconds, average: 0.727 seconds per document
Result: roughly three quarters of a second is spent on text extraction for every PDF added to Alfresco! (The printed total is truncated to whole seconds; 15 × 0.727 s is about 10.9 s.)
In our projects, we replaced the PDFBox transformer with the pdftotext console tool. The same collection, timed from the shell:
lothar@lothar-laptop:~/devenv/lib/collections/pdf$ time for pdffile in *.pdf; do pdftotext "$pdffile" - >/dev/null; done

real 0m2.738s
user 0m2.460s
sys 0m0.136s
Result: about 3 seconds for all 15 PDF files, an average of roughly 0.2 seconds per document.
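
As a rough cross-check of the factor of 4: the sketch below divides the PDFBox total (15 documents times the 0.727 s average) by the 2.738 s wall-clock time of the pdftotext loop. The helper name speedup is only illustrative.

```shell
# speed-up = (documents * average seconds with PDFBox) / total seconds with pdftotext
speedup() {
  awk -v docs="$1" -v avg="$2" -v new_total="$3" \
    'BEGIN { printf "%.1f\n", (docs * avg) / new_total }'
}

speedup 15 0.727 2.738   # prints 4.0
```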

Configuration of the transformer:

Add this configuration to shared/classes/alfresco/extension/pdf-transformer-context.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springframework.org/dtd/spring-beans.dtd">
<beans>
        <!-- disable the standard PDFBox text transformer by overriding its bean -->
        <bean id="transformer.PdfBox" class="java.lang.String"/>
        <!-- this bean has the one above injected; it is re-created below as transformer.complex.OpenOffice.PdfToTextTool -->
        <bean id="transformer.complex.OpenOffice.PdfBox" class="java.lang.String"/>

        <!-- pdftotext command line binary -->
        <bean id="transformer.PdfToTextTool"
                class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformer"
                parent="baseContentTransformer">
                <property name="transformCommand">
                        <bean name="transformer.pdftotext.Command"
                                class="org.alfresco.util.exec.RuntimeExec">
                                <property name="commandMap">
                                        <map>
                                                <entry key="Linux.*">
                                                        <value>${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext-linux -enc UTF-8 ${options} ${source} ${target}</value>
                                                </entry>
                                                <entry key="Windows.*">
                                                        <value>${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext-win32.exe -enc UTF-8 ${options} ${source} ${target}</value>
                                                </entry>
                                        </map>
                                </property>
                                <property name="defaultProperties">
                                        <props>
                                                <prop key="options"></prop>
                                        </props>
                                </property>
                        </bean>
                </property>
                <!-- ensure executable bits of binaries on unix -->
                <property name="checkCommand">
                        <bean name="transformer.pdftotext.checkCommand"
                                class="org.alfresco.util.exec.RuntimeExec">
                                <property name="commandMap">
                                        <map>
                                                <entry key="Linux.*">
                                                        <value>chmod 775 ${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext-linux</value>
                                                </entry>
                                                <entry key="Windows.*">
                                                <!--  dummy value -->
                                                        <value>cmd.exe /C dir</value>
                                                </entry>
                                        </map>
                                </property>
                                <property name="defaultProperties">
                                        <props>
                                                <prop key="options"></prop>
                                        </props>
                                </property>
                        </bean>
                </property>
                <property name="explicitTransformations">
                        <list>
                                <bean
                                        class="org.alfresco.repo.content.transform.ContentTransformerRegistry$TransformationKey">
                                        <constructor-arg>
                                                <value>application/pdf</value>
                                        </constructor-arg>
                                        <constructor-arg>
                                                <value>text/plain</value>
                                        </constructor-arg>
                                </bean>
                        </list>
                </property>
        </bean>

   <!-- replaces bean transformer.complex.OpenOffice.PdfBox -->
   <bean id="transformer.complex.OpenOffice.PdfToTextTool"
        class="org.alfresco.repo.content.transform.ComplexContentTransformer"
        parent="baseContentTransformer" >
      <property name="transformers">
         <list>
            <ref bean="transformer.OpenOffice" />
            <ref bean="transformer.PdfToTextTool" />
         </list>
      </property>
      <property name="intermediateMimetypes">
         <list>
            <value>application/pdf</value>
         </list>
      </property>
   </bean>
</beans>
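
Before restarting Alfresco with this configuration, it may be worth checking that the binary the bean points to is really there and executable. A small sketch, assuming the binaries were copied to WEB-INF/bin as in the configuration above; check_extractor is a hypothetical helper name and CATALINA_BASE is assumed to be set in your shell:

```shell
# Report whether an extractor binary exists and has its executable bit set.
check_extractor() {
  tool=$1
  if [ ! -f "$tool" ]; then
    echo "missing: $tool"
    return 1
  fi
  if [ ! -x "$tool" ]; then
    echo "not executable: $tool"
    return 1
  fi
  echo "ok: $tool"
}

# Possible usage:
# check_extractor "$CATALINA_BASE/webapps/alfresco/WEB-INF/bin/pdftotext-linux"
```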

Let me know how it worked for you!

Tuesday, February 3, 2009

Indexing Freemind MindMaps with Alfresco - Alf Hack # 2

The idea of this Alfresco hack is to use a command line tool to extract the text from Freemind .mm files. The steps to integrate this into Alfresco are:
  1. Add the MIME type application/x-freemind for .mm
  2. Add a transformer from application/x-freemind to text/plain
This article covers the second step. For adding a new MIME type please refer to the Alfresco Wiki. The MIME type of Freemind mind maps is application/x-freemind. There is also a nice blog post about adding the Freemind MIME type and a nice map integration available.

Extract the text

Here is how Freemind stores a sample map in an XML file:
<map version="0.7.1">
  <node TEXT="Alfresco Hack No 2">
    <node TEXT="Explore how Freemind XML looks like" POSITION="right">
    </node>
  </node>
</map>
Quite simple XML without namespaces. The text of each map node is stored in the value of its TEXT attribute. To extract the text I will use a quick-and-dirty XSLT:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- emit plain text, not XML -->
  <xsl:output method="text" encoding="UTF-8"/>
  <xsl:template match="/">
    <xsl:call-template name="t1"/>
  </xsl:template>
  <!-- write the TEXT attribute of every node, separated by spaces -->
  <xsl:template name="t1">
    <xsl:for-each select="//node">
      <xsl:value-of select="@TEXT"/>
      <xsl:value-of select="' '"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

Applying this XSLT to the Freemind XML yields the extracted text:
Alfresco Hack No 2 Explore how Freemind XML looks like

Add transformer to Alfresco
To keep things simple, I will use Alfresco's feature for doing content transformations with external tools or programs. This is done by configuring a RuntimeExecutableContentTransformer bean. But first, the command line of the external tool has to be figured out. I will use the xmlstarlet command line tool from http://xmlstar.sourceforge.net/. Depending on your Linux distribution, the executable is called either xml or xmlstarlet. There is also a Windows version available from the download page. Translating the above XSLT into an xmlstarlet command line results in:
xmlstarlet sel -t -m //node -v @TEXT -o ' ' Alfresco\ Hack\ No\ 2.mm
Sadly, the output always goes to stdout and no output file can be specified. But the RuntimeExecutableContentTransformer requires a target file, so a simple wrapper script is needed. I put the following into the file /home/lothar/bin/freemind2text.sh (made executable with chmod 775), which will be referenced in the transformer bean:
#!/bin/bash
# save arguments to variables
SOURCE=$1
TARGET=$2

# to see what gets extracted append arguments to logfile
echo "from $SOURCE to $TARGET" >>/tmp/freemindtransform.log

# call xmlstarlet tool and redirect output to $TARGET
xmlstarlet sel --text --encoding UTF-8 -t -m //node -v @TEXT -o ' ' "$SOURCE" > "$TARGET"
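
The same wrapper idea works for any tool that only writes to stdout. Below is a generic sketch with basic argument checking; to_target is a hypothetical name, not something Alfresco or xmlstarlet provides, and it assumes the tool accepts the source file as its last argument:

```shell
# Run a stdout-only extractor and store its output in the target file.
# Usage: to_target SOURCE TARGET COMMAND [ARGS...]
to_target() {
  if [ "$#" -lt 3 ]; then
    echo "usage: to_target SOURCE TARGET COMMAND [ARGS...]" >&2
    return 2
  fi
  source_file=$1
  target_file=$2
  shift 2
  if [ ! -r "$source_file" ]; then
    echo "cannot read source: $source_file" >&2
    return 2
  fi
  # Run the extractor with the source as its last argument and capture stdout;
  # the function returns the extractor's exit status.
  "$@" "$source_file" > "$target_file"
}

# Possible usage inside freemind2text.sh:
# to_target "$1" "$2" xmlstarlet sel --text --encoding UTF-8 -t -m //node -v @TEXT -o ' '
```

Returning the tool's exit status lets the caller notice a failed extraction instead of silently indexing an empty file.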
Now we are ready to configure the RuntimeExecutableContentTransformer bean:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springframework.org/dtd/spring-beans.dtd">
<beans>
  <bean id="transformer.freemindToText" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformer" parent="baseContentTransformer">
    <property name="transformCommand">
      <bean name="transformer.freemind.Command" class="org.alfresco.util.exec.RuntimeExec">
        <property name="commandMap">
          <map>
            <entry key="Linux.*">
              <value>/home/lothar/bin/freemind2text.sh ${source} ${target}</value>
            </entry>
            <entry key="Windows.*">
              <value>...whatever windows needs here....</value>
            </entry>
          </map>
        </property>
        <property name="defaultProperties">
          <props>
            <prop key="options"/>
          </props>
        </property>
      </bean>
    </property>
    <property name="explicitTransformations">
      <list>
        <bean class="org.alfresco.repo.content.transform.ContentTransformerRegistry$TransformationKey">
          <constructor-arg>
            <value>application/x-freemind</value>
          </constructor-arg>
          <constructor-arg>
            <value>text/plain</value>
          </constructor-arg>
        </bean>
      </list>
    </property>
  </bean>
</beans>



Finished!
Now Freemind mind maps will be indexed. On the plus side: no Java coding, just configuration of standard Alfresco features. On the down side: ...is there anything? Could anybody contribute the Windows batch file wrapper for the xmlstarlet call?

Monday, February 2, 2009

CMIS Link collection

Random link collection about CMIS: blogs, specs, and samples from Alfresco, EMC and others.
John Newton F2F
http://craigrandall.net/archives/2008/09/cmis/
https://community.emc.com/servlet/JiveServlet/previewBody/1606-102-1-2762/h3951-cmis-wp_2.pdf
http://chucksblog.typepad.com/chucks_blog/2008/09/cmis----its-not.html
https://community.emc.com/docs/DOC-1606

OASIS
CMIS home: http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=cmis
Members: http://www.oasis-open.org/committees/membership.php?wg_abbrev=cmis
JIRA: http://tools.oasis-open.org/issues/browse/CMIS
CMIS TC list: http://lists.oasis-open.org/archives/cmis/
CMIS comments list: http://lists.oasis-open.org/archives/cmis-comment/
http://xml.coverpages.org/cmis.html
http://info.emc.com/mk/get/DAP_RE?P.ctp_program_execution.Source_ID=16706
https://community.emc.com/community/labs/cmis
http://roy.gbiv.com/untangled/2008/no-rest-in-cmis
http://intertwingly.net/blog/?q=cmis
http://www-01.ibm.com/software/data/content-management/cm-interoperablity-services.html
http://blogs.msdn.com/ecm/archive/2008/09/09/announcing-the-content-management-interoperability-services-cmis-specification.aspx
http://blogs.msdn.com/ecm/
http://blogs.nuxeo.com/sections/blogs/florent_guillaume/2009_02_02_cmis-meeting-notes
http://blogs.the451group.com/information_management/2008/09/10/cmis-and-industry-standards-in-ecm/
Also a nice link collection on CMIS
http://weblogs.goshaky.com/weblogs/test/search?q=collaboration

Tuesday, January 27, 2009

Groovy Scripting for Alfresco - Alf Hack # 1

This is the first post of my Alfresco Hacks series, showing some very useful tricks for developing with Alfresco. After working with Alfresco since the beginning of 2007, starting with version 1.4, I now feel well prepared to share some code. I appreciate any feedback, so feel free to comment and add suggestions.

I'm tired of the somewhat lengthy cycle of editing Java source, compiling, building alfresco.war, deploying to Tomcat and finally starting Tomcat. Just to try something out, this is too tedious. Therefore I gave the Groovy Server from http://iterative.com/GroovyServer.tar.gz a try. It gives you access to a Groovy shell using:
telnet localhost 6789
The simple steps I did:
  • build the groovyserver.jar
  • copy groovyserver.jar, groovy-all*.jar, jline*.jar to WEB-INF/lib/
  • add this to a Spring context file groovyserver-context.xml in the extension directory (inside the <beans> element):
<bean id="groovyService" abstract="true" init-method="initialize" destroy-method="destroy">
  <property name="bindings">
    <map>
      <!-- expose the Alfresco ServiceRegistry to the Groovy shell -->
      <entry key="ServiceRegistry" value-ref="ServiceRegistry"/>
    </map>
  </property>
</bean>

<bean id="groovyShellService" class="com.iterative.groovy.service.GroovyShellService" parent="groovyService">
  <property name="socket" value="6789"/>
  <property name="launchAtStart" value="true"/>
</bean>
After starting Alfresco, the Groovy shell can be accessed with a telnet client connection to port 6789 on the Alfresco server:
lothar@lothar-laptop:~$ telnet localhost 6789
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Groovy Shell (1.6-RC-2, JVM: 1.6.0_11)
Type 'go' to execute statements; Type 'help' for more information.
groovy>
As an example, searching for the term "alfresco":
import org.alfresco.service.cmr.repository.StoreRef;
import org.alfresco.repo.transaction.RetryingTransactionHelper.RetryingTransactionCallback;
workspaceStoreRef = new StoreRef("workspace://SpacesStore");
ServiceRegistry.getAuthenticationService().authenticate("admin", "admin".toCharArray());
retryTXService = ServiceRegistry.getRetryingTransactionHelper();
def doWork() {
results = ServiceRegistry.getSearchService().query(workspaceStoreRef, "lucene", "alfresco");
for(r in results) { out.println(r.document); }
}
def sow = [ execute: { doWork() } ] as RetryingTransactionCallback<Void>;
retryTXService.doInTransaction(sow);
go
Looks simple? Not at first glance, but it is easy. The real work goes into the doWork() function, lines 6 to 9. The rest can stay the same; it is just plumbing code.

Conclusion: Now the Alfresco API is just a very small step away. Would you like to try the VersionService? Just call ServiceRegistry.getVersionService() and fire up your Groovy script.

Other links to Alfresco and Groovy:
  • WebScripts with Groovy: http://gradecak.blogspot.com/2008/04/alfresco-webscripts-with-groovy.html
  • Alfresco with Grails: http://forge.alfresco.com/projects/minigrails/
Mission statement:
  • Share thoughts about Alfresco ECM in general
  • Share some of my favorite Alfresco hacks
  • Share thoughts about Alfresco architecture
  • Talk about computing books
  • achieve at least some points from the above:)