ImageParser Syntax

The ImageParser is a CodX PostOffice module for controlling the processing of images. An XML configuration file specifies how an image is to be processed.

The XML file consists of rules (ParserRule) with screen elements (Element) and evaluation criteria (Criteria).

Example ImageParser configuration file

<ImageParser Name="Parser 1" Timeout="10000" Reference="SN-AZD, FE" Remark="This is a remark">
  <ParserRule Name="Role 1" Rotate="0, 30, 60, 90" Origin="top_left" Trim="Bottom, Right" FirstPage="1" LastPage="3">
    <Element Name="Element 1" x="100 px" y="200 px" h="50 mm" w="40 mm" Type="Text" Rotate="0, 90, 180, 270" ValidValue="[a-z]+" Prerequisite="Mandatory" SBB-CF="Alternative code"></Element>
    <Element Name="Element 2" x="200 px" y="400 px" h="50 mm" w="40 mm" Type="Barcode" BarcodeType="All" Rotate="0" PreProcess="BRectS" InvalidValue="(?&lt;Invalid&gt;[^[0-9a-zA-Z()-/\s\öäüéàèÖÄÜ])"></Element>
    <Element Name="Element 3" x="300 px" y="500 px" h="50 mm" w="40 mm" Type="UPOC" Rotate="0" ValidValue="[0-9]+"></Element>
    <Element Name="PostAdr1" Type="PostalAddress" ValidValue="[AdrLevel_City]"></Element>
    <Criteria Name="Criteria 1" Operation="AND" ElementName="Element 1" RegEx="[a-z]+"></Criteria>
    <Criteria Name="Criteria 2" Operation="AND" ElementName="Element 2" RegEx="[0-9]+"></Criteria>
    <Criteria Name="Criteria 3" Operation="OR" ElementName="Element 3" RegEx="04[0-9]+"></Criteria>
    <Criteria Name="Criteria 4" Operation="AND" ElementName="PostAdr1" RegEx="[AdrLevel_City]"></Criteria>
  </ParserRule>
  <ParserRule Name="Role 2" Rotate="0, -30, -60, -90" Origin="top_left" Trim="Top">
    <Element Name="Element 1" x="100 px" y="200 px" h="50 mm" w="40 mm" Type="Text" Rotate="0, 90, 180, 270" ValidValue="[a-z]+"></Element>
    <Element Name="Element 2" x="200 px" y="400 px" h="50 mm" w="40 mm" Type="Barcode" Rotate="0" ValidValue="[a-z]+"></Element>
    <Element Name="Element 3" x="300 px" y="500 px" h="50 mm" w="40 mm" Type="UPOC" Rotate="0" ValidValue="[0-9]+"></Element>
    <Criteria Name="Criteria 1" Operation="AND" ElementName="Element 1" RegEx="[a-z]+"></Criteria>
    <Criteria Name="Criteria 2" Operation="AND" ElementName="Element 2" RegEx="[0-9]+"></Criteria>
    <Criteria Name="Criteria 3" Operation="OR" ElementName="Element 3" RegEx="04[0-9]+"></Criteria>
  </ParserRule>
</ImageParser>

Tags

The XML configuration file is structured as follows:

ImageParser

Contains one or more 'ParserRule' tags.

Attribute: Description
Name: Name of the ImageParser.
Timeout: Maximum processing time in milliseconds. If the timeout is reached, processing is aborted and no result is returned.
This attribute is optional; the default value is 3500 ms and the maximum permitted timeout is 100000 ms.
If the timeout is 0 or negative, it is set to the default value; if it is greater than the permitted maximum, it is reset to the maximum.
Reference: Reference to the module/function that uses this ImageParser.
If an ImageParser is used in several modules, the module names are listed separated by commas.
The following values are possible:
  • R-SCAN
    CxLetterScan in 'R-Scan' mode
    CxLetterScan in 'Maintenance' mode
    Module R-Scan
  • SCANNER
    CxLetterScan in 'Scanner' mode
  • CAPTURE
    CxLetterScan in 'Capture' mode
  • SORT
    CxLetterScan in 'Sort' mode
  • DIGITAL
    Digitization
Remark: Remark on the ImageParser.
This attribute is optional.
SafeMode: Optional, value range: 0/1 (false/true), default: 0 (false).
If 1 (true), the ImageParser is operated in safe mode: only one thread and various optimizations for minimum RAM consumption are used internally.
If the image to be processed is in colour or has a higher resolution than defined in the SafeModeResolution attribute, it is automatically converted to an 8-bit greyscale image before analysis.
ATTENTION! Only use this option if very large images are being processed!
Safe mode greatly increases the processing time (approx. 5 to 10 times); adjust the Timeout attribute accordingly.
SafeModeResolution: Optional, value range: 100..300 [DPI], default: 200 [DPI].
Is only used if the SafeMode attribute is set.
Defines the image resolution [DPI] used in safe mode (see above).
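
As an illustration of the attributes above, the following minimal sketch configures an ImageParser for very large scans in safe mode. All names, coordinates and the concrete Timeout and SafeModeResolution values are assumptions chosen for this example only. The Timeout is set to ten times the 3500 ms default because safe mode is described as roughly 5 to 10 times slower.

```xml
<!-- Hypothetical configuration: all names and values are examples only -->
<ImageParser Name="Large scans" Timeout="35000" Reference="DIGITAL" Remark="Safe mode for very large images" SafeMode="1" SafeModeResolution="150">
  <ParserRule Name="Rule 1" Rotate="0" Origin="top_left">
    <Element Name="Element 1" x="100 px" y="200 px" h="50 mm" w="40 mm" Type="Text" ValidValue="[a-z]+"></Element>
    <Criteria Name="Criteria 1" Operation="AND" ElementName="Element 1" RegEx="[a-z]+"></Criteria>
  </ParserRule>
</ImageParser>
```

The chosen Timeout stays below the permitted maximum of 100000 ms, so it would not be reset.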

ParserRule

Contains one or more 'Element' and 'Criteria' tags.

Attribute: Description
Name: Name of the ParserRule.
Rotate: List of angles at which the image is to be read.
The angles are separated by commas.
The angles increase clockwise from 0 to 360°.
ATTENTION: Each additional angle multiplies the processing time!
Origin: Origin of the image.
All coordinates refer to this origin.
Possible values:
- top_left
- top_right
- bottom_left (default)
- bottom_right
This attribute is optional.
Trim: Optional.
If this attribute is defined, the corresponding part of the image that is significantly darker than the rest of the image is cut off.
Possible values are a combination of the following values, separated by commas:
- Top
- Bottom
- Left
- Right
FirstPage: This attribute is only used if several documents are processed at the same time, e.g. in digitization.
This ParserRule is applied from this page onwards.
If the number of the page to be processed is smaller, the ParserRule is ignored.
LastPage: This attribute is only used if several documents are processed at the same time, e.g. in digitization.
This ParserRule is applied up to this page.
If the number of the page to be processed is larger, the ParserRule is ignored.
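
The page attributes can be combined to apply different rules to different pages. The sketch below assumes a digitization job in which the first page carries a barcode and the following pages are searched for text only. Rule names, element names, coordinates and patterns are hypothetical, and the "60 %" notation is an assumption modelled on the "100 px" notation of the example configuration.

```xml
<!-- Fragment to be placed inside an ImageParser tag; all names and values are examples -->
<ParserRule Name="Cover page" Rotate="0, 180" Origin="top_left" Trim="Top, Left" FirstPage="1" LastPage="1">
  <Element Name="Job code" x="60 %" y="5 %" h="15 %" w="35 %" Type="Barcode" BarcodeType="Code128, DataMatrix" ValidValue="[0-9]+"></Element>
  <Criteria Name="Has job code" Operation="AND" ElementName="Job code" RegEx="[0-9]+"></Criteria>
</ParserRule>
<ParserRule Name="Following pages" Rotate="0" Origin="top_left" FirstPage="2" LastPage="10">
  <Element Name="Page text" x="10 %" y="10 %" h="30 %" w="80 %" Type="Text" ValidValue="[a-z]+"></Element>
  <Criteria Name="Has text" Operation="AND" ElementName="Page text" RegEx="[a-z]+"></Criteria>
</ParserRule>
```

The first rule lists two angles and the second only one, since each additional angle multiplies the processing time.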

Element

Describes the element to be parsed.

Attribute: Description
Name: Name of the element; must be present!
CxLetterScan requires certain elements depending on the mode (use case). See the online help for the corresponding mode.
Type: Type of element; optional, default value: Text.
The following types are possible:
- Text: The section of the element is analyzed with OCR and the recognized text is output [*1].
- Barcode: The section of the element is analyzed for a barcode [*1].
- UPOC: The section of the element is analyzed for a barcode and checked to see whether it is a valid UPOC (type, client, ID, checksum) [*1].
- PostalAddress: The section of the element is analyzed with OCR and the address tokenizer, and the recognized address is output [*2]. See also Postal addresses.
- Subject: The subject (most prominent line of text) is searched for in the defined section of the element, see Subject.
OverwriteMode: Optional; defines whether and how existing data of a consignment is overwritten.
Default: NotEmpty
The rule is only applied if the ImageParser reads a valid value for the consignment attribute.
If no value or an empty value is read, the original value of the consignment is never overwritten.
Possible values:
- NotEmpty: Only overwrite the existing consignment attribute if it is empty
- Never: Never overwrite the existing consignment attribute
- Always: Always overwrite the existing consignment attribute
x: X coordinate of the top left corner of the element.
The coordinate refers to the origin of the image.
Optionally, a unit can be specified, default = px (pixels).
The following units are possible:
- px: Pixels (default)
- %: Percent, in relation to the entire image
- mm: Millimeters (only if the resolution is known)
y: Y coordinate of the top left corner of the element.
The coordinate refers to the origin of the image.
Optionally, a unit can be specified, default = px (pixels).
The following units are possible:
- px: Pixels (default)
- %: Percent, in relation to the entire image
- mm: Millimeters (only if the resolution is known)
h: Height of the element.
Optionally, a unit can be specified, default = px (pixels).
The following units are possible:
- px: Pixels (default)
- %: Percent, in relation to the entire image
- mm: Millimeters (only if the resolution is known)
w: Width of the element.
Optionally, a unit can be specified, default = px (pixels).
The following units are possible:
- px: Pixels (default)
- %: Percent, in relation to the entire image
- mm: Millimeters (only if the resolution is known)
BarcodeType: Type of barcode. Only relevant for the 'Barcode' type. Several barcode types can be specified, separated by commas.
The following barcode types are possible:
- All (or no specification): All of the barcode types listed below are searched for.
- All1D: All 1D barcode types are searched for
- All2D: All 2D barcode types are searched for
- AustralianPostCode
- Aztec
- Circular2of5
- Codabar
- CodablockF
- Code128
- Code16K
- Code39
- Code39Extended
- Code39Mod43
- Code39Mod43Extended
- Code93
- DataMatrix
- EAN13
- EAN2
- EAN5
- EAN8
- GS1
- GS1DataBarExpanded
- GS1DataBarExpandedStacked
- GS1DataBarLimited
- GS1DataBarStacked
- GS1DataBarOmnidirectional
- GTIN12 (UPC-A with 12 symbols)
- GTIN13 (EAN-13)
- GTIN14 (I2of5 with 14 digits)
- GTIN8 (EAN-8)
- IntelligentMail
- Interleaved2of5
- ITF14 (I2of5 with 14 digits)
- MaxiCode
- MICR
- MicroPDF
- MSI
- PatchCode
- PDF417
- Pharmacode
- PostNet
- PZN
- QRCode
- RoyalMail
- RoyalMailKIX
- TriopticCode39
- UPCA
- UPCE
- UPU
This attribute is optional, the default value is 'All'.
Rotate: Optional, default = 0 (0°).
Angle in degrees [°] by which the element must be rotated so that it can be read horizontally from left to right.
Several angles are entered separated by commas.
The angles increase clockwise from 0 to 360°.
Negative entries are permitted; these are automatically converted to the corresponding positive angle, e.g. -20° = +340°.
The element is rotated after the rotation of the full image (see the Rotate attribute of the ParserRule tag) and after the element has been cut out according to the x, y, h, w attributes.
ATTENTION: Each additional angle multiplies the processing time!
To avoid incorrect readings of barcode and especially UPOC elements, it is important that the possible angles are defined correctly.
The internally used barcode OCR engine supports the following discrete rotation angles: 0°, 11°, 22°, 45°, 90°, 135°, 158°, 169°, 180°, 191°, 202°, 225°, 270°, 315°, 338°, 349°. The specified angles are rounded to the nearest of these angles.
Additional angle specifications are mandatory if the rotation of the barcode is greater than the arctangent (tan⁻¹) of the ratio of bar height to barcode length.
Example: Bar height = 10 mm, total length = 50 mm. Thus: arctan(10/50) = 11.3°.
So: if the barcode can be rotated by more than 11°, the value 11 must be added as a rotation angle.
PreProcess: Optional; pre-processing of the image for barcode OCR reading. Only available for the Barcode type! Default value: <blank> (no preprocessing).
The following preprocessing options are possible:
- BRectS: Cuts out small used areas and processes them individually. Improves the reading of barcodes and DataMatrix codes. Long processing time.
- BRectM: Cuts out medium used areas and processes them individually. Improves the reading of barcodes and DataMatrix codes. Medium processing time.
- BRectL: Cuts out large used areas and processes them individually. Improves the reading of barcodes and DataMatrix codes.
ValidValue: RegEx expression that defines a valid value of the element.
The element only has a valid value if the value read matches the RegEx expression. Otherwise, the value of the element is empty.
Not to be confused with 'RegEx' of the criterion.
If the 'ValidValue' attribute is not specified or is empty, the following default values apply:
- Text: "[0-9,a-z,A-Z]+"
- Barcode: "[0-9,a-z,A-Z]+"
- UPOC: <empty>; the value is checked by verifying the UPOC syntax.
If a RegEx is specified, the value is also checked against the UPOC syntax.
InvalidValue: RegEx expression that defines all invalid values of an element.
The element is only valid if it contains none of these invalid values.
If this attribute is specified, it has priority over the 'ValidValue' attribute.
If the 'InvalidValue' attribute is not specified or is empty, the default value remains empty and the rule for 'ValidValue' is active.
OCRMinConfidence: Optional, default = 0; only available for the Text and PostalAddress types.
Defines the minimum quality (confidence) that the complete text recognized by OCR must have for further processing.
The global setting xxx is always used for each individual line!
Range: 0% ... 100%
Well-recognized texts have a confidence >= 50%, poorly recognized ones < 30%.
Prerequisite: This attribute is only used for digitization. It defines whether the value of this element must be present in order to complete the digitization.
If the element is not found, manual post-processing or capture is mandatory.
The following values are possible:
- Optional
- Mandatory
- None
SBB-CF: Defines the name of the SBB-Custfield (extended consignment attribute) in which the determined value of the element is to be saved.
ATTENTION: Only SBB-Custfields of type text are supported!
- Enter the name and type (fixed text) of the SBB-Custfield
- If the SBB-Custfield is to be displayed in the UI of the entry modules, it must be configured accordingly in the services.
Ref: Optional; defines a relative element, see Relative elements.
Defines the name of the parent element.
RefOrigin: Optional; defines the reference point on the parent element.
Must be defined for relative elements.
Possible values:
- top_left
- top_right
- bottom_left (default)
- bottom_right
dx, dy: Optional; define the position of the relative element relative to the reference point (RefOrigin) on the parent element.
Style: Optional; defines the style of the element.
Several values are possible, separated by commas (",").
Valid values:
- Normal: Text is not bold
- Bold: Text is bold
- Larger: Text is larger than the referenced element
- Smaller: Text is smaller than the referenced element
- Size:XX: Text has exactly this size (in points)
Example: Style="Size:10,Bold"
Restrictions:
- Not permitted for elements of type Barcode and UPOC
- Values Normal and Bold are not both permitted (mutual exclusion)
- Values Larger and Smaller are not both permitted (mutual exclusion)
- Values Larger or Smaller are only permitted on relative elements
- Values Larger or Smaller are only permitted if the parent is not of type Barcode or UPOC
FontsizeDetMode: Optional; defines how the Fontsize property is calculated based on the font size of the individual words.
Is only used for elements of type Subject and Text (if Style contains Larger, Smaller or Size:XX).
Default: MajorityChars
Possible values:
- MajorityChars: Fontsize of the majority of characters
- MajorityWords: Fontsize of the majority of words
- LargestWord: Fontsize of the largest word
- SmallestWord: Fontsize of the smallest word
BoldDetMode: Optional; defines how the Bold property is determined based on the individual words.
Is only used for elements of type Subject and Text (if Style contains Normal or Bold).
Default: MajorityWords
Possible values:
- MajorityChars: Line is Bold if the majority of characters are Bold
- MajorityWords: Line is Bold if the majority of words are Bold
- AtLeastOneWord: Line is Bold if at least one word is Bold
- AllWords: Line is Bold if all words are Bold

[*1]: If Type = Text, Barcode or UPOC, all attributes x, y, h, w must be defined!
[*2]: If Type = PostalAddress, either all or none of the attributes x, y, h, w must be defined!
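
The arctangent rule given for the Rotate attribute of the Element tag follows from simple geometry: a horizontal scan line crosses every bar of a rotated barcode only as long as its vertical drift over the full barcode length stays within the bar height. A short derivation as a sketch, assuming the engine reads along straight horizontal lines; the symbols h (bar height), L (total barcode length) and θ (rotation of the barcode) are introduced here for illustration only:

```latex
% Vertical drift of a horizontal scan line over the barcode length L
% at rotation \theta must not exceed the bar height h:
\[
L \tan\theta \le h
\quad\Longrightarrow\quad
\theta \le \theta_{\max} = \arctan\!\left(\frac{h}{L}\right)
\]

% Values from the example above: h = 10 mm, L = 50 mm
\[
\theta_{\max} = \arctan\!\left(\frac{10}{50}\right) = \arctan(0.2) \approx 11.3^\circ
\]
```

A barcode that may be rotated further than this limit needs the next supported angle in its list, for example Rotate="0, 11".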

Criteria

Describes the criteria that must be met for this ParserRule to take effect.

If the criteria are not met, the next ParserRule is processed.
This must not be confused with 'ValidValue' of the element!

Attribute: Description
Name: Name of the criterion.
Operation: Specifies how the criterion is logically linked.
Possible values: 'AND', 'OR'
ElementName: Name of the element that is to be checked.
RegEx: RegEx expression that must be fulfilled.
Elements of type PostalAddress do NOT support criteria; their validity is determined on the basis of internal rules!
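
Because a ParserRule whose criteria are not met is skipped in favour of the next one, rules can be ordered from specific to general. The following fragment is a sketch with hypothetical rule names, element names, coordinates and patterns: the first rule is meant to take effect only for barcodes containing a 04 prefix pattern, the second for any numeric barcode. How AND and OR criteria are grouped when mixed is not specified here, so the sketch uses a single AND criterion per rule.

```xml
<!-- Fragment to be placed inside an ImageParser tag; all names and values are examples -->
<ParserRule Name="Codes starting with 04" Rotate="0" Origin="top_left">
  <Element Name="Code" x="200 px" y="400 px" h="50 mm" w="40 mm" Type="Barcode" ValidValue="[0-9]+"></Element>
  <Criteria Name="Prefix 04" Operation="AND" ElementName="Code" RegEx="04[0-9]+"></Criteria>
</ParserRule>
<ParserRule Name="Any numeric code" Rotate="0" Origin="top_left">
  <Element Name="Code" x="200 px" y="400 px" h="50 mm" w="40 mm" Type="Barcode" ValidValue="[0-9]+"></Element>
  <Criteria Name="Numeric" Operation="AND" ElementName="Code" RegEx="[0-9]+"></Criteria>
</ParserRule>
```

Note the two different roles: ValidValue decides whether the element has a value at all, while the RegEx of the criterion decides whether the rule takes effect.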

Postal addresses

Elements of type PostalAddress are processed as follows:
The attributes x, y, h, w are optional. If these are not specified (or are all 0), the entire image is scanned for text blocks.
If the attributes x, y, h, w are defined, only the defined section is used (analogous to the element type "Text").
The text blocks found are sorted according to certain criteria and analyzed with OCR and the SortTree.

The first/best recognized address is output.
If NO value is defined as ValidValue, all detected address blocks are returned as a string; there is NO analysis with the SortTree!

Specific attributes for element type "PostalAddress"

The following attributes are specific to elements of type "PostalAddress".

ATTENTION: The attributes CutLength/CutWidth and ExclusionZone* are mutually exclusive; only one of the two variants can be used!
For new definitions, the exclusion zone should be defined with ExclusionZone*.

Attribute: Description
ValidValue: The following pseudo-regex values are supported (default: <empty>):
- <empty> (default)
- [AdrLevel_Country]
- [AdrLevel_City]
- [AdrLevel_Street]
- [AdrLevel_House]
- [AdrLevel_Name]
CutLength: Optional, default = 0 (no exclusion zone).
Defines an exclusion zone in [mm] for automatically detected text blocks (x, y, h, w = 0) on both long sides of the image.
Detected text blocks that overlap this zone are ignored.
CutWidth: Optional, default = 0 (no exclusion zone).
Defines an exclusion zone in [mm] for automatically detected text blocks (x, y, h, w = 0) on both short sides of the image.
Detected text blocks that overlap this zone are ignored.
ValidateAddress: Optional, value range: 0/1, default: 0.
Defines how the addresses found are checked according to ValidValue.
0: The check is carried out by splitting the address with the tokenizer and testing for non-empty parts (up to the level defined by ValidValue).
1: The check is carried out against the area data of the district administration. The address must be valid at least up to the level specified by ValidValue.
ExclusionZoneTop: Optional; if not specified, the default setting is used.
Defines an exclusion zone at the top margin [mm].
Text blocks that overlap this zone are ignored.
ExclusionZoneBottom: Optional; if not specified, the default setting is used.
Defines an exclusion zone at the bottom margin [mm].
Text blocks that overlap this zone are ignored.
ExclusionZoneLeft: Optional; if not specified, the default setting is used.
Defines an exclusion zone at the left margin [mm].
Text blocks that overlap this zone are ignored.
ExclusionZoneRight: Optional; if not specified, the default setting is used.
Defines an exclusion zone at the right margin [mm].
Text blocks that overlap this zone are ignored.
DisableCodingZone: Optional, value range: 0/1 (false/true), default: 0 (false).
If 1 (true), the internally defined fixed coding zones are not applied.
DisableFrankingZone: Optional, value range: 0/1 (false/true), default: 0 (false).
If 1 (true), the internally defined fixed franking zones are not applied.
RefPointMode: Optional, default = 0 (no exclusion zone).
The following values are supported:
- AutoBasedFullImage
- AutoBasedExclZone
- ValueBasedExclZone
If AutoBasedExclZone or ValueBasedExclZone is defined, the CutLength and CutWidth attributes must NOT be defined.
The exclusion zones must be defined by ExclusionZone*.
If ValueBasedExclZone is defined:
- Both attributes RefPointX and RefPointY must be defined
- Attributes CutLength and CutWidth must NOT be defined
- DisableCodingZone and DisableFrankingZone must be True
RefPointX: X coordinate of the reference point in [mm], [%] or pixels (from the left).
Is only used if RefPointMode = ValueBasedExclZone.
RefPointY: Y coordinate of the reference point in [mm], [%] or pixels (from the top).
Is only used if RefPointMode = ValueBasedExclZone.
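
The constraints listed for RefPointMode = ValueBasedExclZone can be seen together in one element. The following is a sketch only: the zone sizes and reference point are invented values, the exclusion zones are assumed to be given as plain numbers in millimetres, and the "60 mm" notation for the reference point is assumed to follow the notation of the x, y, h, w attributes.

```xml
<!-- Hypothetical PostalAddress element; no x, y, h, w, so the entire image is scanned -->
<Element Name="PostAdr1" Type="PostalAddress" ValidValue="[AdrLevel_Street]" ValidateAddress="0"
         ExclusionZoneTop="40" ExclusionZoneBottom="10" ExclusionZoneLeft="10" ExclusionZoneRight="10"
         RefPointMode="ValueBasedExclZone" RefPointX="60 mm" RefPointY="50 mm"
         DisableCodingZone="1" DisableFrankingZone="1"></Element>
```

CutLength and CutWidth are deliberately absent, both reference point attributes are set, and both Disable* attributes are 1 (true), as required for this mode.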

Relative elements

TODO

Regex

Regex expressions must be entered according to the DEELX Regular Expression Syntax.

ATTENTION: To embed regex expressions in XML attributes, these may have to be escaped!
Use an appropriate online tool for this, e.g. Code Beauty.
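
Only a few characters actually need escaping inside a double-quoted XML attribute value: & becomes &amp;amp;, < becomes &amp;lt; and " becomes &amp;quot; (> may additionally be written as &amp;gt;). Backslashes and brackets of the regex are left untouched. The element below is a hypothetical example; its name, coordinates and pattern are assumptions chosen to show all four characters.

```xml
<!-- Raw regex:    [^<>&"]+                    -->
<!-- Escaped form: [^&lt;&gt;&amp;&quot;]+     -->
<Element Name="Free text" x="100 px" y="200 px" h="10 mm" w="60 mm" Type="Text" ValidValue="[^&lt;&gt;&amp;&quot;]+"></Element>
```

After the XML file has been parsed, the attribute value again contains the raw regex shown in the first comment.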

Named capture groups

The image parser supports named capture groups.

The group name must be the same as the name of the element.
Example: (?<Subject>.*)
Complete element including escaping: <Element Name="Subject" Type="Text" ValidValue="(?&lt;Subject&gt;.*)"></Element>

Single-/Multiline, Mode Modifier

Example: To match only the first line in a multi-line text: (?

-

Subject

The subject is searched for in the defined section of the element (most prominent line of text).

A score is calculated for each line of text read; the line with the highest score contains the subject.

Formula for calculating the line score:
LineScore = (BlockScore * OwnBlockFactor) + (ParagraphScore * OwnParaFactor) + (BoldScore * BoldFactor) + (FontsizeScore * FontsizeFactor) + (PositionScore * PositionFactor) + (KeywordScore * KeywordFactor)
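
A worked example may help to see how the factors weigh the individual scores. The score values below are purely hypothetical (the value ranges of the individual scores are not specified here); the factors are the default values listed in the attribute table of this section. The aligned environment assumes the amsmath package.

```latex
% Hypothetical scores for one line of text:
%   BlockScore = 1, ParagraphScore = 1, BoldScore = 1,
%   FontsizeScore = 0.5, PositionScore = 0.8, KeywordScore = 0
% Default factors:
%   OwnBlockFactor = 0.0, OwnParaFactor = 0.05, BoldFactor = 0.40,
%   FontsizeFactor = 0.40, PositionFactor = 0.30, KeywordFactor = 1.0
\[
\begin{aligned}
\text{LineScore}
  &= (1 \cdot 0.0) + (1 \cdot 0.05) + (1 \cdot 0.40)
   + (0.5 \cdot 0.40) + (0.8 \cdot 0.30) + (0 \cdot 1.0) \\
  &= 0 + 0.05 + 0.40 + 0.20 + 0.24 + 0 \\
  &= 0.89
\end{aligned}
\]
```

With the default factors, the bold and font size terms contribute most in this example, while the block term contributes nothing because its default factor is 0.0.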

The following attributes are specific to elements of type Subject.

Attribute: Description; default value
LinePosDetMode: Optional; defines which reference point is used to calculate the PositionScore (relative to the center of the cropped image). Default: LineCenter
Possible values:
- LineCenter: Center (center of gravity) of the line
- ShortestCorner: Shortest distance of a corner point of the line
Keywords: Optional; defines keywords to identify the subject. Enter a list of keywords, separated by commas. Default: <empty>
KeywordFactor: Optional; defines the keyword factor for calculating the line score. Default: 1.0
OwnBlockFactor: Optional; defines the OwnBlock factor for calculating the line score. Default: 0.0
OwnParaFactor: Optional; defines the OwnParagraph factor for calculating the line score. Default: 0.05
BoldFactor: Optional; defines the bold factor for calculating the line score. See also element attribute BoldDetMode. Default: 0.40
FontsizeFactor: Optional; defines the font size factor for calculating the line score. See also element attribute FontsizeDetMode. Default: 0.40
PositionFactor: Optional; defines the position factor for calculating the line score. Default: 0.30
MinLength: Optional; defines the minimum length of a line. Default: 4
MaxLength: Optional; defines the maximum length of a line. Default: 0 (no maximum length)
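
Put together, a Subject element might look like the following sketch. The element name, section coordinates, keywords and MaxLength value are hypothetical, the percent notation is assumed to follow the notation of the x, y, h, w attributes, and the factors simply repeat the default values for illustration.

```xml
<!-- Hypothetical Subject element searching the upper middle part of the page -->
<Element Name="Subject" Type="Subject" x="10 %" y="25 %" h="30 %" w="80 %"
         Keywords="Invoice, Reminder, Contract" KeywordFactor="1.0"
         BoldFactor="0.40" FontsizeFactor="0.40" PositionFactor="0.30"
         BoldDetMode="MajorityWords" FontsizeDetMode="MajorityChars"
         MinLength="4" MaxLength="120"></Element>
```

Attributes left at their default values could be omitted; they are spelled out here only to show where each factor of the line score formula is configured.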

Functionality

The functionality and further information can be found in the AdminDoc (search for 'ImageParser').




CodX Software AG
Sinserstrasse 47
6330 Cham
Switzerland
Support: http://support.codx.ch
CxSpickel