PDF Step pdfToTextFilter
Description
Extracts all text content from within the current PDF document.
PDF documents can place text using a variety of mechanisms. The text may be stored as a stream of characters in the expected reading order; it may be stored out of order and rely on explicit positioning to appear in the correct place; or it may be present only as graphical representations of the characters. For these reasons, this filter may not always produce what you expect, and you will have to experiment to see what works for your documents.
Setting the mode attribute to "groupByLines" makes the filter try to parse content that is placed with explicit coordinates and extract the text in roughly the order you might expect. The algorithm used to determine which fragments of text belong together on one line is fairly simple. It may not work for some documents, e.g. it probably fails for documents with vertical text.
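As a minimal sketch of the two modes, the filter could be declared in either of the following ways inside a step that accepts filters (such as compareToExpected in the example under Details). The description texts are illustrative, and only one of the two declarations would normally be used at a time:
<!-- default mode ("normal"): text is extracted in the order it is found in the PDF -->
<pdfToTextFilter description="extract text in stored order"/>
<!-- "groupByLines": fragments placed with explicit coordinates are grouped into lines -->
<pdfToTextFilter mode="groupByLines" description="extract text grouped into lines"/>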
Parameters
- description
- Required? no
- The description of this test step.
- fragSep
- Required? no, default is a single space
- The fragment separator string to use, e.g. "" or " " or "," or " | ". Only used if mode is "groupByLines".
- lineSep
- Required? no, default is platform line separator
- The line separator string to use, e.g. " " or "\n".
- mode
- Required? no, default is normal
- The mode to use when extracting text, either normal, which extracts text in the order it is found, or groupByLines, which tries to group text together into lines.
- pageSep
- Required? no, default is [+++ NEW PAGE +++]\n
- The page separator string to use, e.g. "\n" or "------".
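The separator parameters can be combined in a single filter declaration. The following is only a sketch: the file name invoice.pdf is hypothetical, the separator values are taken from the examples listed above, and the surrounding compareToExpected step mirrors the usage shown under Details.
<steps>
  <invoke url="invoice.pdf"/>
  <compareToExpected saveFiltered="true" readFiltered="false" toFile="${expectedFile}">
    <!-- fragSep only takes effect because mode is "groupByLines" -->
    <pdfToTextFilter mode="groupByLines" fragSep=" | " lineSep="\n" pageSep="------" description="extract PDF text with custom separators"/>
  </compareToExpected>
</steps>
With these settings, fragments grouped onto one line would be joined by " | ", and each page break would be marked by "------" instead of the default [+++ NEW PAGE +++] marker.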
Details
Here is an example of using pdfToTextFilter:
<steps>
  <invoke url="testDocBookmarks.pdf"/>
  <compareToExpected saveFiltered="true" readFiltered="false" toFile="${expectedFile}">
    <pdfToTextFilter mode="groupByLines" lineSep="\n" description="extract PDF text"/>
    <lineSeparatorFilter description="normalise line separators"/>
  </compareToExpected>
</steps>
Invoking the above steps would create a file containing something like the following:
Subheading
[+++ NEW PAGE +++]
Heading Two
[+++ NEW PAGE +++]
WebTestRecorder