ImageParser Syntax

The ImageParser is a CodX PostOffice module for controlling the processing of images. An XML configuration file specifies how the image is to be processed.

The XML file consists of rules (ParserRule) with screen elements (Element) and evaluation criteria (Criteria).

Example ImageParser configuration file

<ImageParser Name="Parser 1" Timeout="10000" Reference="SN-AZD, FE" Remark="This is a remark">
  <ParserRule Name="Role 1" Rotate="0, 30, 60, 90" Origin="top_left" Trim="Bottom, Right" FirstPage="1" LastPage="3">
    <Element Name="Element 1" x="100 px" y="200 px" h="50 mm" w="40 mm" Type="Text" Rotate="0, 90, 180, 270" ValidValue="[a-z]+" Prerequisite="Mandatory" SBB-CF="Alternative code"></Element>
    <Element Name="Element 2" x="200 px" y="400 px" h="50 mm" w="40 mm" Type="Barcode" BarcodeType="All" Rotate="0" PreProcess="BRectS" InvalidValue="(?&lt;Invalid&gt;[^[0-9a-zA-Z()-/\s\öäüéàèÖÄÜ])"></Element>
    <Element Name="Element 3" x="300 px" y="500 px" h="50 mm" w="40 mm" Type="UPOC" Rotate="0" ValidValue="[0-9]+"></Element>
    <Element Name="PostAdr1" Type="PostalAddress" ValidValue="[AdrLevel_City]"></Element>
    <Criteria Name="Criteria 1" Operation="AND" ElementName="Element 1" RegEx="[a-z]+"></Criteria>
    <Criteria Name="Criteria 2" Operation="AND" ElementName="Element 2" RegEx="[0-9]+"></Criteria>
    <Criteria Name="Criteria 3" Operation="OR" ElementName="Element 3" RegEx="04[0-9]+"></Criteria>
    <Criteria Name="Criteria 4" Operation="AND" ElementName="PostAdr1" RegEx="[AdrLevel_City]"></Criteria>
  </ParserRule>
  <ParserRule Name="Role 2" Rotate="0, -30, -60, -90" Origin="top_left" Trim="Top">
    <Element Name="Element 1" x="100 px" y="200 px" h="50 mm" w="40 mm" Type="Text" Rotate="0, 90, 180, 270" ValidValue="[a-z]+"></Element>
    <Element Name="Element 2" x="200 px" y="400 px" h="50 mm" w="40 mm" Type="Barcode" Rotate="0" ValidValue="[a-z]+"></Element>
    <Element Name="Element 3" x="300 px" y="500 px" h="50 mm" w="40 mm" Type="UPOC" Rotate="0" ValidValue="[0-9]+"></Element>
    <Criteria Name="Criteria 1" Operation="AND" ElementName="Element 1" RegEx="[a-z]+"></Criteria>
    <Criteria Name="Criteria 2" Operation="AND" ElementName="Element 2" RegEx="[0-9]+"></Criteria>
    <Criteria Name="Criteria 3" Operation="OR" ElementName="Element 3" RegEx="04[0-9]+"></Criteria>
  </ParserRule>
</ImageParser>

Tags

The XML configuration file is structured as follows:

ImageParser

Contains one or more 'ParserRule' tags.

Attribute	Description
Name	Name of the ImageParser
Timeout	Maximum processing time in milliseconds. If the timeout is reached, processing is aborted and no result is returned. This attribute is optional; the default value is 3500 ms and the maximum permitted timeout is 100000 ms.
If the timeout is 0 or negative, the default value is used; if it exceeds the permitted maximum, it is reduced to the maximum value.
Reference	Reference to the module/function that uses this ImageParser. If an ImageParser is used in several modules, the module names are listed separated by commas.
The following values are possible: - R-SCAN: CxLetterScan in 'R-Scan' mode, CxLetterScan in 'Maintenance' mode, module R-Scan - SCANNER: CxLetterScan in 'Scanner' mode - CAPTURE: CxLetterScan in 'Capture' mode - SORT: CxLetterScan in 'Sort' mode - DIGITAL: Digitization
Remark	Remark on the ImageParser.
This attribute is optional.
SafeMode	Optional, value range: 0/1 (false/true), default: 0 (false). If 1 (true), the ImageParser is operated in safe mode.
Only one thread is used internally, together with various optimizations for minimum RAM consumption. If the image to be processed is multicoloured or has a higher resolution than defined in the SafeModeResolution attribute, the image is automatically converted to an 8-bit greyscale image before analysis. ATTENTION! Only use this option if very large images are being processed!
Safe mode greatly increases the processing time (approx. 5 to 10 times); adjust the Timeout attribute accordingly.
SafeModeResolution	Optional, value range: 100..300 [DPI], default: 200 [DPI]. Is only used if the *SafeMode* attribute is set. Defines the image resolution [DPI] used in safe mode (see above).
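
The Timeout rules above (default, lower bound, upper bound) can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not the module's actual code; the function name effective_timeout_ms and the use of None for a missing attribute are assumptions.

```python
DEFAULT_TIMEOUT_MS = 3500    # documented default
MAX_TIMEOUT_MS = 100_000     # documented maximum


def effective_timeout_ms(configured_ms=None):
    """Return the timeout that would apply for a configured Timeout value.

    configured_ms is None when the attribute is missing (assumption).
    """
    # Missing, zero or negative values fall back to the default.
    if configured_ms is None or configured_ms <= 0:
        return DEFAULT_TIMEOUT_MS
    # Values above the permitted maximum are reduced to the maximum.
    return min(configured_ms, MAX_TIMEOUT_MS)


if __name__ == "__main__":
    for value in (None, 0, -5, 10_000, 250_000):
        print(value, "->", effective_timeout_ms(value))
```

When SafeMode is enabled, the configured value should be chosen with the 5 to 10 times longer processing time in mind.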

ParserRule

Contains one or more 'Element' and 'Criteria' tags.

Attribute	Description
Name	Name of the ParserRule
Rotate	List of angles at which the image is to be read.
The angles are separated by commas. The angles increase clockwise from 0 to 360°. ATTENTION: Each additional angle multiplies the processing time!
Origin	Origin of the image.
All coordinates refer to this origin. Possible values: - top_left - top_right - bottom_left (default) - bottom_right This attribute is optional
Trim	Optional.
If this attribute is defined, the corresponding part of the image that is significantly darker than the rest of the image is cut off. Possible values are a combination of the following values, separated by commas: - Top - Bottom - Left - Right
FirstPage	This attribute is only used if several documents are processed at the same time, e.g. in digitization. This ParserRule is applied from this page onwards.
If the number of the page to be processed is smaller, the ParserRule is ignored.
LastPage	This attribute is only used if several documents are processed at the same time, e.g. in digitization. This ParserRule is applied up to this page.
If the number of the page to be processed is larger, the ParserRule is ignored.
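
The FirstPage/LastPage check can be pictured as a simple range test. The following Python sketch is illustrative only; the function name rule_applies_to_page is hypothetical, and treating a missing attribute as "no limit" is an assumption that the text above does not state.

```python
def rule_applies_to_page(page, first_page=None, last_page=None):
    """Return True if a ParserRule would be applied to the given page number.

    first_page / last_page are None when the attribute is not set
    (assumed to mean "no limit").
    """
    # Pages before FirstPage: the rule is ignored.
    if first_page is not None and page < first_page:
        return False
    # Pages after LastPage: the rule is ignored.
    if last_page is not None and page > last_page:
        return False
    return True


if __name__ == "__main__":
    # Values from the example configuration: FirstPage="1" LastPage="3"
    for page in (1, 3, 4):
        print(page, rule_applies_to_page(page, first_page=1, last_page=3))
```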

Element

Describes the element to be parsed.

Attribute	Description
Name	Name of the element, must be present! CxLetterScan requires certain elements depending on the mode (use case); see the online help for the corresponding mode.
Type	Type of element; optional, default value: Text. The following types are possible: - Text: The section of the element is analyzed with OCR and the recognized text is output [1]. - Barcode: The section of the element is analyzed for a barcode [1]. - UPOC: The section of the element is analyzed for a barcode and checked to see whether it is a valid UPOC (type, client, ID, checksum) [1]. - PostalAddress: The section of the element is analyzed with OCR and the address tokenizer and the recognized address is output [2]; see also Postal addresses. - Subject: The subject is searched for in the defined section of the element (most prominent line of text); see Subject.

OverwriteMode	Optional, defines whether/how existing data of a consignment is overwritten.
Default: NotEmpty. The rule is only applied if the ImageParser reads a valid value for the consignment attribute. If no value or an empty value is read, the original value of the consignment is never overwritten.
Possible values: - NotEmpty: Only overwrite the existing consignment attribute if it is empty - Never: Never overwrite the existing consignment attribute - Always: Always overwrite the existing consignment attribute
x	X coordinate of the top left corner of the element. The coordinate refers to the origin of the image. Optionally, a unit can be specified, default = px (pixels). The following units are possible: - px: Pixels (default) - %: Percent, in relation to the entire image - mm: Millimeters (only if the resolution is known)
y	Y coordinate of the top left corner of the element. The coordinate refers to the origin of the image. Optionally, a unit can be specified, default = px (pixels). The following units are possible: - px: Pixels (default) - %: Percent, in relation to the entire image - mm: Millimeters (only if the resolution is known)
h	Height of the element. Optionally, a unit can be specified, default = px (pixels). The following units are possible: - px: Pixels (default) - %: Percent, in relation to the entire image - mm: Millimeters (only if the resolution is known)
w	Width of the element. Optionally, a unit can be specified, default = px (pixels). The following units are possible: - px: Pixels (default) - %: Percent, in relation to the entire image - mm: Millimeters (only if the resolution is known)
BarcodeType	Type of barcode. Only relevant for the 'Barcode' type. Several barcode types can be specified. These are separated by commas. The following barcode types are possible: - All (or no specification): All of the barcode types listed below are searched for. - All1D: All 1D barcode types are searched for - All2D: All 2D barcode types are searched for - AustralianPostCode - Aztec - Circular2of5 - Codabar - CodablockF - Code128 - Code16K - Code39 - Code39Extended - Code39Mod43 - Code39Mod43Extended - Code93 - DataMatrix - EAN13 - EAN2 - EAN5 - EAN8 - GS1 - GS1DataBarExpanded - GS1DataBarExpandedStacked - GS1DataBarLimited - GS1DataBarStacked - GS1DataBarOmnidirectional - GTIN12 (UPC-A with 12 symbols) - GTIN13 (EAN-13) - GTIN14 (I2of5 with 14 digits) - GTIN8 (EAN-8) - IntelligentMail - Interleaved2of5 - ITF14 (I2of5 with 14 digits) - MaxiCode - MICR - MicroPDF - MSI - PatchCode - PDF417 - Pharmacode - PostNet - PZN - QRCode - RoyalMail - RoyalMailKIX - TriopticCode39 - UPCA - UPCE - UPU This attribute is optional, the default value is 'All'.
Rotate	Optional, default = 0 (0°). Angle in degrees [°] by which the element must be rotated so that it can be read horizontally from left to right. Several angles are entered separated by commas. The angles increase clockwise from 0 to 360°. Negative entries are permitted; these are automatically converted to the corresponding positive angle, e.g. -20° = +340°. The element is rotated after the rotation of the full image (see Rotate attribute of the ParserRule tag) and after the element has been cut out according to the x, y, h, w attributes. ATTENTION: Each additional angle multiplies the processing time! To avoid incorrect readings of barcode and especially UPOC elements, it is important that the possible angles are defined correctly. The internally used barcode OCR engine supports the following discrete rotation angles: 0°, 11°, 22°, 45°, 90°, 135°, 158°, 169°, 180°, 191°, 202°, 225°, 270°, 315°, 338°, 349°. The specified angles are rounded up/down to the nearest of these angles. Additional angle specifications are mandatory if the rotation of the barcode is greater than the arc tangent (tan⁻¹) of the ratio of bar height to barcode length. Example: Bar height = 10 mm, total length = 50 mm. Thus: arctan(10/50) = 11.3°.
So: If the barcode can be rotated by more than 11°, the value 11 must be added as a rotation angle.
PreProcess	Optional, pre-processing of the image for barcode OCR reading; only available for the Barcode type! Default value: <blank> (no preprocessing). The following preprocessing options are possible: - BRectS: Cuts out small used areas and processes them individually. Improves the reading of barcodes and DataMatrix. Long processing time. - BRectM: Cuts out medium used areas and processes them individually. Improves the reading of barcodes and DataMatrix. Medium processing time. - BRectL: Cuts out large used areas and processes them individually. Improves the reading of barcodes and DataMatrix.
ValidValue	RegEx expression that defines a valid value of the element.
The element only has a valid value if the value matches the RegEx expression. Otherwise, the value of the element is empty. Not to be confused with 'RegEx' from the criterion. If the 'ValidValue' attribute is not specified or is empty, the following default values apply: - Text: "[0-9,a-z,A-Z]+" - Barcode: "[0-9,a-z,A-Z]+" - UPOC: <empty>, the value is checked by verifying the UPOC syntax.
If a RegEx is specified for UPOC, the value is checked against it in addition to the UPOC syntax.
InvalidValue	RegEx expression that defines all invalid values of an element.
The element is only valid if it contains none of these invalid values.
If this attribute is specified, it has priority over the 'ValidValue' attribute. If the 'InvalidValue' attribute is not specified or is empty, the default value remains empty and the rule for 'ValidValue' is active.
OCRMinConfidence	Optional, default = 0, only available for the Text and PostalAddress types. The attribute defines the minimum quality (confidence) that the complete text recognised by the OCR must have for further processing.
The global setting xxx is always used for each individual line! Range: 0% ... 100%. Well-recognized texts have a confidence >= 50%, bad ones < 30%.
Prerequisite	This attribute is only used for digitization. It defines whether the value of this element must be present in order to complete the digitization.
If the element is not found, manual post-processing or capture is mandatory. The following values are possible: - Optional - Mandatory - None
SBB-CF	Defines the name of the SBB-Custfield (extended consignment attributes) in which the determined value of the element is to be saved.
ATTENTION: Only SBB-Custfields of type *text* are supported! - Enter the name and type (fixed *text*) of the SBB-Custfield - If the SBB-Custfield is to be displayed in the UI of the entry modules, it must be configured accordingly in the services.
Ref	Optional, defines a relative element, see Relative elements. Defines the name of the parent element.
RefOrigin	Optional, defines the reference point on the parent element. Must be defined for relative elements.
Possible values: - top_left - top_right - bottom_left (default) - bottom_right
dx	Optional, defines the position of the relative element relative to the reference point (RefOrigin) on the parent element.
Style	Optional, defines the style of the element.
Several values are possible, separated by a comma (",")

Valid values:
- Normal: Text is not bold
- Bold: Text is bold
- Larger: Text is larger than the referenced element
- Smaller: Text is smaller than the referenced element
- Size: XX: Text has exactly this size (in points)

Example: Style="Size:10,Bold"
Restrictions:
- Not permitted for elements of type Barcode and UPOC
- Values Normal and Bold are not both permitted (mutual exclusion)
- Values Larger and Smaller are not both permitted (mutual exclusion)
- Values Larger or Smaller are only permitted on relative elements
- Values Larger or Smaller are only permitted if the parent is not Barcode or UPOC
FontsizeDetMode	Optional; defines how the Fontsize property is calculated based on the font size of the individual words.
Is only used for elements of type Subject and Text (if Style contains Larger, Smaller or Size:XX). Default: MajorityChars. Possible values: - MajorityChars: Fontsize of the majority of characters - MajorityWords: Fontsize of the majority of words - LargestWord: Fontsize of the largest word - SmallestWord: Fontsize of the smallest word
BoldDetMode	Optional; defines how the Bold property is determined based on the individual words. Is only used for elements of type Subject and Text (if Style contains Normal or Bold).

Default: MajorityWords
Possible values:
- MajorityChars: Line is Bold if the majority of characters are Bold
- MajorityWords: Line is Bold if the majority of words are Bold
- AtLeastOneWord: Line is Bold if at least one word is Bold
- AllWords: Line is Bold if all words are Bold
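
The four BoldDetMode values can be illustrated with a short Python sketch. It is not the ImageParser's implementation: the representation of a line as (text, is_bold) pairs and the reading of "majority" as "strictly more than half" are assumptions made for the example.

```python
def line_is_bold(words, mode="MajorityWords"):
    """Decide whether a line counts as bold.

    words: list of (text, is_bold) tuples, one per recognized word
           (hypothetical input format).
    mode:  one of the documented BoldDetMode values.
    """
    if not words:
        return False
    if mode == "MajorityChars":
        bold_chars = sum(len(text) for text, is_bold in words if is_bold)
        total_chars = sum(len(text) for text, _ in words)
        return bold_chars * 2 > total_chars      # more than half of the characters
    if mode == "MajorityWords":
        bold_words = sum(1 for _, is_bold in words if is_bold)
        return bold_words * 2 > len(words)       # more than half of the words
    if mode == "AtLeastOneWord":
        return any(is_bold for _, is_bold in words)
    if mode == "AllWords":
        return all(is_bold for _, is_bold in words)
    raise ValueError("unknown BoldDetMode: " + mode)


if __name__ == "__main__":
    line = [("Invoice", True), ("no.", False), ("42", False)]
    for mode in ("MajorityChars", "MajorityWords", "AtLeastOneWord", "AllWords"):
        print(mode, line_is_bold(line, mode))
```

The sample line shows why the mode matters: one long bold word dominates the character count but not the word count.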

[1]: If Type = Text, Barcode or UPOC, all attributes x, y, h, w must be defined!
[2]: If Type = PostalAddress, either all or none of the attributes x, y, h, w must be defined!
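
The angle rules described for the Rotate attribute of the Element tag (conversion of negative angles, rounding to the discrete angles of the barcode engine, and the arc tangent rule of thumb) can be worked through in Python. This is a sketch under assumptions: the function names are hypothetical, and the behaviour for an angle exactly halfway between two supported angles is not documented (here the smaller listed angle wins).

```python
import math

# Discrete rotation angles listed for the internal barcode OCR engine.
SUPPORTED_ANGLES = (0, 11, 22, 45, 90, 135, 158, 169,
                    180, 191, 202, 225, 270, 315, 338, 349)


def normalize_angle(angle):
    """Convert any angle to the range 0..359, e.g. -20 -> 340."""
    return angle % 360


def circular_distance(a, b):
    """Smallest distance between two angles on the 360 degree circle."""
    diff = abs(a - b) % 360
    return min(diff, 360 - diff)


def snap_to_supported(angle):
    """Round an angle to the nearest supported discrete angle."""
    normalized = normalize_angle(angle)
    return min(SUPPORTED_ANGLES, key=lambda s: circular_distance(normalized, s))


def max_skew_degrees(bar_height_mm, barcode_length_mm):
    """Largest rotation that still works without an additional angle."""
    return math.degrees(math.atan(bar_height_mm / barcode_length_mm))


if __name__ == "__main__":
    print(normalize_angle(-20))                  # 340
    print(snap_to_supported(-20))                # 338
    print(round(max_skew_degrees(10, 50), 1))    # 11.3
```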

Criteria

Describes the criteria that must be met for this ParserRule to take effect.

If the criteria are not met, the next ParserRule is processed.
This must not be confused with 'ValidValue' of the element!

Attribute	Description
Name	Name of the criterion
Operation	Specifies how the criterion is logically linked. Possible values: 'AND', 'OR'
ElementName	Name of the element that is to be checked.
RegEx	RegEx expression that must be fulfilled.

Elements of type PostalAddress do NOT support criteria; their validity is determined on the basis of internal rules!

Postal addresses

Elements of type PostalAddress are processed as follows:
The attributes x, y, h, w are optional. If these are not specified (or are all 0), the entire image is scanned for text blocks.
If the attributes x, y, h, w are defined, only the defined section is used (analogous to the element type "Text").
The text blocks found are sorted according to certain criteria and analyzed with OCR and the SortTree.

The first/best recognized address is output.
If NO value is defined as ValidValue, all detected address blocks are returned as a string; there is NO analysis with the SortTree!

Specific attributes for element type "PostalAddress"

The following attributes are specific to elements of type "PostalAddress"

ATTENTION: The attributes CutLength/CutWidth and ExclusionZone* are mutually exclusive, only one of the two variants can be used!
For new definitions, the exclusion zone should be defined with ExclusionZone*.


Attribute	Description
ValidValue	The following pseudo-regex (default: <empty>) are supported: - <empty> (default) - [AdrLevel_Country] - [AdrLevel_City] - [AdrLevel_Street] - [AdrLevel_House] - [AdrLevel_Name]
CutLength	Optional, default = 0 (no exclusion zone). Defines an exclusion zone in [mm] for automatically detected text blocks (x,y,h,w = 0) on both long sides of the image.
Detected text blocks that overlap this zone are ignored.
CutWidth	Optional, default = 0 (no exclusion zone). Defines an exclusion zone in [mm] for automatically detected text blocks (x,y,h,w = 0) on both short sides of the image.
Detected text blocks that overlap this zone are ignored.
ValidateAddress	Optional, value range: 0/1, default: 0. Defines how the addresses found are checked according to *ValidValue*. 0: The check is carried out by splitting via the tokenizer and testing for non-empty values (up to the level defined by *ValidValue*). 1: The check is carried out against the area data of the district administration.
The address must be valid up to at least the level specified by *ValidValue*.
ExclusionZoneTop	Optional, if not available, the default setting is used. Defines an exclusion zone at the top margin [mm]. Text blocks that overlap this zone are ignored.
ExclusionZoneBottom	Optional, if not available, the default setting is used. Defines an exclusion zone at the bottom margin [mm]. Text blocks that overlap this zone are ignored.
ExclusionZoneLeft	Optional, if not available, the default setting is used. Defines an exclusion zone at the left margin [mm]. Text blocks that overlap this zone are ignored.
ExclusionZoneRight	Optional, if not available, the default setting is used. Defines an exclusion zone at the right margin [mm]. Text blocks that overlap this zone are ignored.
DisableCodingZone	Optional, value range: 0/1 (false/true), default: 0 (false). If 1 (true), the internally defined fixed coding zones are not applied.
DisableFrankingZone	Optional, value range: 0/1 (false/true), default: 0 (false). If 1 (true), the internally defined fixed franking zones are not applied.
RefPointMode	Optional, default = 0 (no exclusion zone). The following values (default: <empty>) are supported: - AutoBasedFullImage - AutoBasedExclZone - ValueBasedExclZone. If AutoBasedExclZone or ValueBasedExclZone is defined, the CutLength and CutWidth attributes must NOT be defined; the exclusion zones must be defined by ExclusionZone*. If ValueBasedExclZone is defined: - Both attributes RefPointX and RefPointY must be defined - Attributes CutLength and CutWidth must NOT be defined - DisableCodingZone and DisableFrankingZone must be True
RefPointX	X coordinate of the reference point in [mm], [%] or pixels (from the left). Is only used if RefPointMode = ValueBasedExclZone.
RefPointY	Y coordinate of the reference point in [mm], [%] or pixels (from above). Is only used if RefPointMode = ValueBasedExclZone.

Relative elements

TODO

Regex

Regex expressions must be entered according to DEELX Regular Expression Syntax

ATTENTION: To embed regex expressions in XML attributes, these may have to be escaped! Use an appropriate online tool for this, e.g. Code Beauty.

Named capture groups

The ImageParser supports named capture groups.

The group name must be the same as the name of the element.
Example: (?<Subject>.*)
Complete element including escaping: <Element Name="Subject" Type="Text" ValidValue="(?&lt;Subject&gt;.*)">
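
Instead of an online tool, the escaping can also be done with a few lines of Python from the standard library. The sketch below only illustrates XML attribute escaping; the helper name build_element is hypothetical, and Python is used purely as a scripting aid (the regex itself is still interpreted by the ImageParser in DEELX syntax, not by Python).

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def build_element(name, element_type, valid_value):
    """Return an Element tag with all attribute values XML-escaped and quoted."""
    return "<Element Name={} Type={} ValidValue={}></Element>".format(
        quoteattr(name), quoteattr(element_type), quoteattr(valid_value)
    )


if __name__ == "__main__":
    regex = "(?<Subject>.*)"
    xml_text = build_element("Subject", "Text", regex)
    print(xml_text)
    # Round trip: an XML parser hands back the original, unescaped regex.
    assert ET.fromstring(xml_text).get("ValidValue") == regex
```

Reading the attribute back with an XML parser is a quick way to confirm that the escaped form still represents the intended expression.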

Single-/Multiline, Mode Modifier

Example: To match only the first line in a multi-line text: (?

Subject

The subject is searched for in the defined section of the element (most prominent line of text)

A score is calculated for each line of text read; the line with the highest score contains the subject

Formula for calculating the line score:
LineScore = (BlockScore * OwnBlockFactor) + (ParagraphScore * OwnParaFactor) + (BoldScore * BoldFactor) + (FontsizeScore * FontsizeFactor) + (PositionScore * PositionFactor) + (KeywordScore * KeywordFactor)

The following attributes are specific to elements of type Subject.


Attribute	Description	Default value
LinePosDetMode	Optional; defines which reference point is used to calculate the PositionScore (relative to the center of the cropped image). Possible values: - LineCenter: Center (center of gravity) of the line - ShortestCorner: Shortest distance of a corner point of the line	LineCenter
Keywords	Optional; defines keywords to identify the subject. Enter a list of keywords, separated by commas.	<empty>
KeywordFactor	Optional; defines the keyword factor for calculating the line score.	1.0
OwnBlockFactor	Optional; defines the OwnBlock factor for calculating the line score.	0.0
OwnParaFactor	Optional; defines the OwnParagraph factor for calculating the line score.	0.05
BoldFactor	Optional; defines the bold factor for calculating the line score. See also element attribute *BoldDetMode*	0.40
FontsizeFactor	Optional; defines the font size factor for calculating the line score. See also element attribute *FontsizeDetMode*	0.40
PositionFactor	Optional; defines the position factor for calculating the line score.	0.30
MinLength	Optional; defines the minimum length of a line.	4
MaxLength	Optional; defines the maximum length of a line.	0 (no maximum length)
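
The line score formula and the default factors from the table can be tried out with the following Python sketch. It is a worked illustration only: the function names, the dictionary-based input and the assumption that the individual scores lie between 0 and 1 are not taken from the ImageParser; how the individual scores are computed is not covered here.

```python
# Default factors as listed in the attribute table.
DEFAULT_FACTORS = {
    "OwnBlockFactor": 0.0,
    "OwnParaFactor": 0.05,
    "BoldFactor": 0.40,
    "FontsizeFactor": 0.40,
    "PositionFactor": 0.30,
    "KeywordFactor": 1.0,
}


def line_score(scores, factors=None):
    """Weighted sum of the per-line scores; missing scores count as 0."""
    f = dict(DEFAULT_FACTORS)
    f.update(factors or {})
    return (scores.get("BlockScore", 0.0) * f["OwnBlockFactor"]
            + scores.get("ParagraphScore", 0.0) * f["OwnParaFactor"]
            + scores.get("BoldScore", 0.0) * f["BoldFactor"]
            + scores.get("FontsizeScore", 0.0) * f["FontsizeFactor"]
            + scores.get("PositionScore", 0.0) * f["PositionFactor"]
            + scores.get("KeywordScore", 0.0) * f["KeywordFactor"])


def pick_subject(lines, factors=None):
    """Return the text of the line with the highest score.

    lines: list of (text, scores) tuples (hypothetical input format).
    """
    return max(lines, key=lambda item: line_score(item[1], factors))[0]


if __name__ == "__main__":
    lines = [
        ("Dear Sir or Madam", {"FontsizeScore": 0.5, "PositionScore": 0.2}),
        ("Invoice 42", {"BoldScore": 1.0, "FontsizeScore": 1.0,
                        "PositionScore": 0.5, "KeywordScore": 1.0}),
    ]
    print(pick_subject(lines))
```

With the default factors, a keyword hit outweighs every other single criterion, which is why KeywordFactor is the first value to adjust when the wrong line is selected.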

Functionality

The functionality and further information can be found in the AdminDoc (search for 'ImageParser')