Canoo WebTest

PDF Step pdfToTextFilter

Description

Extracts all text content from within the current PDF document.

PDF documents can place text on a page using a variety of mechanisms. The text may be stored as a stream of characters in reading order; it may be stored in some other order, with explicit positioning placing each piece at the correct location on the page; or it may be present only as graphical representations of the characters. For these reasons, this filter may not always produce the output you expect, and you may need to experiment to find settings that work for your documents.

Set the mode attribute to "groupByLines" to handle content that is placed with explicit coordinates. In this mode the filter groups fragments of text into lines so that the text is extracted in roughly the order you would expect. The algorithm used to determine which fragments belong together on one line is fairly simple and may not work for some documents; for example, it probably fails for documents with vertical text.
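
Since the right mode depends on how a given PDF stores its text, one way to experiment is to save the filtered output in both modes and inspect the two files. The following is a minimal sketch rather than a recommended setup: it reuses only the steps and attributes that appear in the example under Details, the file names given in toFile are hypothetical, and it assumes the PDF is still the current response when the second compareToExpected runs.

```xml
<steps>
    <invoke url="testDocBookmarks.pdf"/>
    <!-- hypothetical output file: text in the order it is found in the PDF -->
    <compareToExpected saveFiltered="true" readFiltered="false" toFile="extracted-normal.txt">
        <pdfToTextFilter mode="normal" description="extract PDF text in document order"/>
    </compareToExpected>
    <!-- hypothetical output file: text fragments grouped into lines -->
    <compareToExpected saveFiltered="true" readFiltered="false" toFile="extracted-groupByLines.txt">
        <pdfToTextFilter mode="groupByLines" description="extract PDF text grouped by lines"/>
    </compareToExpected>
</steps>
```

Comparing the two saved files should show whether grouping by lines brings the text closer to the order seen on the page.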

Parameters

description
Required? no
The description of this test step.
fragSep
Required? no, default is a single space
The fragment separator string to use, e.g. "" or " " or "," or " | ". Only used if mode is "groupByLines".
lineSep
Required? no, default is platform line separator
The line separator string to use, e.g. " " or "\n".
mode
Required? no, default is normal
The mode to use when extracting text, either normal, which extracts text in the order it is found, or groupByLines, which tries to group text together into lines.
pageSep
Required? no, default is [+++ NEW PAGE +++]\n
The page separator string to use, e.g. "\n" or "------".
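
The separator parameters can be combined to make the structure of the extracted text easier to see. The sketch below uses only values that the parameter descriptions above give as examples; whether these separators suit a particular document is something to check by experiment.

```xml
<!-- all attribute values are taken from the examples in the parameter descriptions -->
<pdfToTextFilter mode="groupByLines"
                 fragSep=" | "
                 lineSep="\n"
                 pageSep="------"
                 description="extract PDF text with visible fragment and page separators"/>
```

The fragment separator takes effect here only because mode is groupByLines. The element is meant to be nested inside a step such as compareToExpected, as in the example under Details.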

Details

Here is an example of using pdfToTextFilter:

pdfToTextFilter example
<steps>
    <invoke url="testDocBookmarks.pdf"/>
    <compareToExpected saveFiltered="true" readFiltered="false" toFile="${expectedFile}">
        <pdfToTextFilter mode="groupByLines" lineSep="\n" description="extract PDF text"/>
        <lineSeparatorFilter description="normalise line separators"/>
    </compareToExpected>
</steps>

Invoking the above steps creates a file containing something like the following:

pdfToTextFilter output
Heading One
Subheading
[+++ NEW PAGE +++]
Heading Two
[+++ NEW PAGE +++]
