ImageParser Syntax

The ImageParser is a module of CodX PostOffice to control the processing of images. An XML configuration file specifies how the image should be processed.

The XML file consists of rules (ParserRule) with picture elements (Element) and evaluation criteria (Criteria).

Example ImageParser configuration file

<ImageParser Name="Parser 1" Timeout="1000" Reference="SN-AZD, FE" Remark="This is a remark">
  <ParserRule Name="Role 1" Rotate="0, 30, 60, 90" Origin="top_left" Trim="Bottom, Right" FirstPage="1" LastPage="3">
    <Element Name="Element 1" x="100 px" y="200 px" h="50 mm" w="40 mm" Type="Text" Rotate="0, 90, 180, 270" ValidValue="[a-z]+" Prerequisite="Mandatory" SBB-CF="Alternative code"></Element>
    <Element Name="Element 2" x="200 px" y="400 px" h="50 mm" w="40 mm" Type="Barcode" BarcodeType="All" Rotate="0" PreProcess="BRectS" InvalidValue="(?&lt;Invalid>[^[0-9a-zA-Z()-/\s\öäüéàèÖÄÜ])"></Element>
    <Element Name="Element 3" x="300 px" y="500 px" h="50 mm" w="40 mm" Type="UPOC" Rotate="0" ValidValue="[0-9]+"></Element>
    <Element Name="PostAdr1" Type="PostalAddress" ValidValue="[AdrLevel_City]"></Element>
    <Criteria Name="Criteria 1" Operation="AND" ElementName="Element 1" RegEx="[a-z]+"></Criteria>
    <Criteria Name="Criteria 2" Operation="AND" ElementName="Element 2" RegEx="[0-9]+"></Criteria>
    <Criteria Name="Criteria 3" Operation="OR" ElementName="Element 3" RegEx="04[0-9]+"></Criteria>
    <Criteria Name="Criteria 4" Operation="AND" ElementName="PostAdr1" RegEx="[AdrLevel_City]"></Criteria>
  </ParserRule>
  <ParserRule Name="Role 2" Rotate="0, -30, -60, -90" Origin="top_left" Trim="Top">
    <Element Name="Element 1" x="100 px" y="200 px" h="50 mm" w="40 mm" Type="Text" Rotate="0, 90, 180, 270" ValidValue="[a-z]+"></Element>
    <Element Name="Element 2" x="200 px" y="400 px" h="50 mm" w="40 mm" Type="Barcode" Rotate="0" ValidValue="[a-z]+"></Element>
    <Element Name="Element 3" x="300 px" y="500 px" h="50 mm" w="40 mm" Type="UPOC" Rotate="0" ValidValue="[0-9]+"></Element>
    <Criteria Name="Criteria 1" Operation="AND" ElementName="Element 1" RegEx="[a-z]+"></Criteria>
    <Criteria Name="Criteria 2" Operation="AND" ElementName="Element 2" RegEx="[0-9]+"></Criteria>
    <Criteria Name="Criteria 3" Operation="OR" ElementName="Element 3" RegEx="04[0-9]+"></Criteria>
  </ParserRule>
</ImageParser>

Tags

The XML configuration file is structured as follows:

ImageParser

Contains 1 or more tags of 'ParserRule'.

Attribute	Description
Name	Name of the ImageParser
Timeout	Maximum processing time in milliseconds. When the timeout is reached, processing is aborted and no result is returned. This attribute is optional. The default value is 1000 ms and the maximum allowed value is 99999 ms. If the timeout is 0 or less, it is set to the default value; if it is greater than the maximum allowed value, it is set to the maximum value.
Reference	Reference to the module/function that uses this ImageParser. If an ImageParser is used in multiple modules, the module names are listed separated by commas. The following values are possible: - R-SCAN: CxLetterScan in 'R-Scan' mode, CxLetterScan in 'Maintenance' mode, module R-Scan - SCANNER: CxLetterScan in 'Scanner' mode - CAPTURE: CxLetterScan in 'Capture' mode - SORT: CxLetterScan in 'Sort' mode - DIGITAL: Digitizing
Remark	Remark about the ImageParser. This attribute is optional.
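The timeout rule described above can be illustrated with a minimal sketch. The function name effective_timeout and the treatment of a missing attribute as None are assumptions made for this illustration; only the default (1000 ms), the maximum (99999 ms) and the clamping behaviour come from the table.

```python
DEFAULT_TIMEOUT_MS = 1000   # default stated in the attribute table
MAX_TIMEOUT_MS = 99999      # maximum stated in the attribute table


def effective_timeout(raw):
    """Return the timeout in ms that would apply for a raw attribute value.

    `raw` is the attribute text, or None if the attribute is missing
    (hypothetical representation, not part of the product).
    """
    if raw is None or str(raw).strip() == "":
        return DEFAULT_TIMEOUT_MS       # attribute is optional
    value = int(raw)
    if value <= 0:
        return DEFAULT_TIMEOUT_MS       # 0 or less falls back to the default
    return min(value, MAX_TIMEOUT_MS)   # larger values are capped at the maximum
```

For example, a configured value of 500000 would result in an effective timeout of 99999 ms under this reading.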

ParserRule

Contains 1 or more tags of 'Element' and 'Criteria'.

Attribute	Description
Name	Name of the ParserRule
Rotate	List of angles at which the value is to be read. The angles are separated by commas. The angles increase clockwise from 0 to 360°. ATTENTION: Each additional angle multiplies the processing time!
Origin	Origin of the image. All coordinates refer to this origin. This attribute is optional. Possible values: - top_left - top_right - bottom_left (default) - bottom_right
Trim	Optional. If this attribute is defined, any part of the image on the specified side that is significantly darker than the rest of the image is cut off. Possible values are a combination of the following values, separated by commas: - Top - Bottom - Left - Right
FirstPage	This attribute is only used when several documents are processed at the same time, e.g. in digitization. The ParserRule is applied from this page on. If the number of the page to be processed is smaller, the ParserRule is ignored.
LastPage	This attribute is only used when several documents are processed at the same time, e.g. in digitization. The ParserRule is applied up to this page. If the number of the page to be processed is larger, the ParserRule is ignored.
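As a small illustration of FirstPage and LastPage, the following sketch checks whether a ParserRule would be applied to a given page. The function name is hypothetical, and two points are assumptions not stated in the table: both limits are taken as inclusive, and a missing attribute is taken to mean "no limit".

```python
def rule_applies_to_page(page, first_page=None, last_page=None):
    """Return True if a ParserRule with the given page limits covers `page`.

    Assumption: both limits are inclusive and None means the attribute
    is not set.
    """
    if first_page is not None and page < first_page:
        return False   # page is before FirstPage: rule is ignored
    if last_page is not None and page > last_page:
        return False   # page is after LastPage: rule is ignored
    return True
```

With FirstPage="1" and LastPage="3" as in the example configuration, pages 1 to 3 would be handled by the rule and page 4 would not.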

Element

Describes the element to be parsed.

Attribute	Description
Name	Name of the element, must be present! The CxLetterScan requires certain elements depending on the mode (use case). See the online help for the corresponding mode.
Type	Type of element; optional, default value: Text. The following types are possible: - Text: The section of the element is analyzed with OCR and the recognized text is output. - Barcode: The section of the element is analyzed for a barcode. - UPOC: The section of the element is checked for a barcode and whether it is a valid UPOC (type, client, ID, checksum). - PostalAddress: The section of the element is analyzed with OCR and the address tokenizer and the recognized address is output. See also Postal Addresses.
x	X-coordinate of the upper left corner of the element. The coordinate refers to the origin of the image. Optionally, a unit can be specified, default = px (pixel). The following units are possible: - px: Pixel (default) - %: Percent, related to the whole image - mm: Millimeter (only if resolution is known).
y	Y-coordinate of the upper left corner of the element. The coordinate refers to the origin of the image. Optionally, a unit can be specified, default = px (pixel). The following units are possible: - px: Pixel (default) - %: Percent, related to the whole image - mm: Millimeter (only if resolution is known).
h	Height of the element. Optionally, a unit can be specified, default = px (pixel). The following units are possible: - px: Pixel (default) - %: Percent, related to the whole image - mm: Millimeter (only if resolution is known).
w	Width of the element. Optionally, a unit can be specified, default = px (pixel). The following units are possible: - px: Pixel (default) - %: Percent, related to the whole image - mm: Millimeter (only if resolution is known).
BarcodeType	Barcode type. Only relevant for type 'Barcode'. Several barcode types can be specified, separated by commas. This attribute is optional, default value is 'All'. The following barcode types are possible: - All (or no specification): All barcode types listed below are searched for. - All1D: All 1D barcode types are searched for. - All2D: All 2D barcode types are searched for. - AustralianPostCode - Aztec - Circular2of5 - Codabar - CodablockF - Code128 - Code16K - Code39 - Code39Extended - Code39Mod43 - Code39Mod43Extended - Code93 - DataMatrix - EAN13 - EAN2 - EAN5 - EAN8 - GS1 - GS1DataBarExpanded - GS1DataBarExpandedStacked - GS1DataBarLimited - GS1DataBarStacked - GS1DataBarOmnidirectional - GTIN12 (UPC-A with 12 symbols) - GTIN13 (EAN-13) - GTIN14 (I2of5 with 14 digits) - GTIN8 (EAN-8) - IntelligentMail - Interleaved2of5 - ITF14 (I2of5 with 14 digits) - MaxiCode - MICR - MicroPDF - MSI - PatchCode - PDF417 - Pharmacode - PostNet - PZN - QRCode - RoyalMail - RoyalMailKIX - TriopticCode39 - UPCA - UPCE - UPU
Rotate	Optional, default = 0 (0°). Angle in degrees [°] by which the element must be rotated to be readable horizontally from left to right. Several angles are separated by commas. The angles increase clockwise from 0 to 360°. Negative entries are allowed; they are automatically converted to the corresponding positive angle, e.g. -20° = +340°. The element is rotated after the rotation of the full image (see attribute Rotate of tag ParserRule) and after the element has been cut out according to the attributes x, y, h, w. ATTENTION: Each additional angle multiplies the processing time! To avoid incorrect readings of barcode and especially UPOC elements, it is important that the possible angles are defined correctly. The internally used barcode OCR engine supports the following discrete rotation angles: 0°, 11°, 22°, 45°, 90°, 135°, 158°, 169°, 180°, 191°, 202°, 225°, 270°, 315°, 338°, 349°. Specified angles are rounded up or down to the nearest of these angles. Additional angles are mandatory if the twist can be greater than the arc tangent (arctan) of the ratio of bar height to barcode length. Example: bar height = 10 mm, total length = 50 mm, so arctan(10/50) = 11.3°. If the barcode can be rotated by more than 11°, the value 11 must be added as a rotation angle.
PreProcess	Optional. Preprocessing of the image for barcode OCR reading; only available for the type Barcode! Default value: <blank> (no preprocessing). The following preprocessing options are possible: - BRectS: Cut out small used areas and single processing. Improves the reading of barcodes and DataMatrix. Long processing time. - BRectM: Cutting of medium used areas and single processing. Improves the reading of barcodes and DataMatrix. Medium processing time. - BRectL: Cutting of large used areas and single processing. Improves the reading of barcodes and DataMatrix. Short processing time.
ValidValue	RegEx expression defining a valid value of the element. The element has a valid value only if the value read matches the RegEx expression; otherwise the value of the element is empty. Not to be confused with 'RegEx' of the criterion. If the attribute 'ValidValue' is not specified or empty, the following default values apply: - Text: "[0-9,a-z,A-Z]+" - Barcode: "[0-9,a-z,A-Z]+" - UPOC: <empty>, the value is checked by verifying the UPOC syntax. If a RegEx is specified, it is evaluated in addition to the UPOC syntax check.
InvalidValue	RegEx expression which defines all invalid values of an element. The element is valid only if it contains none of these invalid values. If this attribute is specified, it has priority over the 'ValidValue' attribute. If the attribute 'InvalidValue' is not specified or empty, the default value remains empty and the rule for 'ValidValue' is active.
OCRMinConfidence	Optional, default = 0, only available for the types Text and PostalAddress. The attribute defines the minimum quality (confidence) that the text recognized by OCR must have for further processing. Range: 0% ... 100%. Well recognized texts have a confidence >= 50%, poorly recognized texts < 30%.
Prerequisite	This attribute is only used for digitization. It defines whether the value of this element must be present in order to complete the digitization. If the element is not found, manual postprocessing or capture must be carried out. The following values are possible: - Optional - Mandatory - None
SBB-CF	This attribute is only used for digitization. It defines in which consignment-custfield the value of the current element is to be stored.
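The angle handling described for the Element attribute Rotate can be sketched as follows. The list of supported angles, the conversion of negative angles and the arctan rule are taken from the table; the function names are hypothetical, and the tie-breaking between two equally close supported angles is an assumption, since the table does not specify it.

```python
import math

# Discrete angles supported by the barcode OCR engine (from the table).
SUPPORTED_ANGLES = [0, 11, 22, 45, 90, 135, 158, 169,
                    180, 191, 202, 225, 270, 315, 338, 349]


def normalize_angle(angle):
    """Map any angle to the range 0 <= angle < 360, e.g. -20 -> 340."""
    return angle % 360


def circular_distance(a, b):
    """Smallest distance between two angles on the 360 degree circle."""
    d = abs(a - b) % 360
    return min(d, 360 - d)


def snap_to_supported(angle):
    """Round to the nearest supported angle.

    Assumption: on a tie, the first angle in SUPPORTED_ANGLES wins.
    """
    a = normalize_angle(angle)
    return min(SUPPORTED_ANGLES, key=lambda s: circular_distance(a, s))


def max_twist_without_extra_angle(bar_height_mm, total_length_mm):
    """Twist in degrees above which an additional angle must be configured."""
    return math.degrees(math.atan(bar_height_mm / total_length_mm))
```

For the example in the table, max_twist_without_extra_angle(10, 50) evaluates to roughly 11.3, which matches the stated threshold of about 11°.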

Criteria

Describes the criteria that must be met for this ParserRule to take effect. If the criteria are not met, the next ParserRule is processed.
This must not be confused with 'ValidValue' of the element!

Attribute	Description
Name	Name of the criterion
Operation	Specifies how the criterion is logically linked. Possible values: 'AND', 'OR'.
ElementName	Name of the element to be checked.
RegEx	RegEx expression which must be fulfilled. Elements of type PostalAddress do NOT support criteria; their validity is determined by internal rules!
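How criteria might be combined can be sketched as below. This is only one possible reading: the table does not state the precedence of 'AND' and 'OR', so the sketch assumes a plain left-to-right evaluation in the order of the XML, assumes that the RegEx may match anywhere in the value, and uses Python's re module, whose syntax may differ from the RegEx dialect of the product. The function names and the dictionary layout are hypothetical.

```python
import re


def criterion_matches(criterion, values):
    """Check one criterion against the value read for its element."""
    value = values.get(criterion["ElementName"], "")
    # Assumption: a match anywhere in the value is sufficient.
    return re.search(criterion["RegEx"], value) is not None


def criteria_fulfilled(criteria, values):
    """Combine all criteria left to right using their Operation.

    Assumption: no operator precedence; an empty list counts as not fulfilled.
    """
    result = None
    for criterion in criteria:
        matched = criterion_matches(criterion, values)
        if result is None:
            result = matched                 # first criterion starts the chain
        elif criterion["Operation"] == "OR":
            result = result or matched
        else:                                # "AND"
            result = result and matched
    return bool(result)
```

Under these assumptions, the three criteria of the second ParserRule in the example would be fulfilled as soon as "Element 3" yields a value containing 04 followed by digits, even if "Element 2" is empty.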

Postal addresses

Elements of the type PostalAddress are processed as follows:
The attributes x, y, h, w are optional. If they are not specified (or all are 0), the entire image is scanned for text blocks.
If the attributes x, y, h, w are defined then only the defined section is used (analogous to element type "Text").
The found text blocks are sorted according to certain criteria and analyzed with OCR and the SortTree.
The first/best recognized address is output.
If NO value is defined as ValidValue, all detected address blocks are returned as a string; there is NO analysis with the SortTree!

Specific attributes for element type "PostalAddress"

The following attributes can be defined specifically for elements of the type "PostalAddress":

Attribute	Description
ValidValue	The following pseudo RegEx values are supported (default: <empty>): - <empty> (default) - [AdrLevel_Country] - [AdrLevel_City] - [AdrLevel_Street] - [AdrLevel_House] - [AdrLevel_Name]
CutLength	Optional, default = 0 (no exclusion zone). Defines an "exclusion zone" in [mm] for automatically found text blocks (x,y,h,w = 0) on both long sides of the image. Detected text blocks that overlap this zone are ignored.
CutWidth	Optional, default = 0 (no exclusion zone). Defines an "exclusion zone" in [mm] for automatically found text blocks (x,y,h,w = 0) on both short sides of the image. Detected text blocks that overlap this zone are ignored.
ValidateAddress	Optional, value range: 0/1, default: 0. Defines how the addresses found according to *ValidValue* are checked. 0: The check is performed by decomposition via the tokenizer and a test for non-empty values (except for the level defined by *ValidValue*). 1: The check is made against the area data of the district administration. The address must be valid up to at least the level defined by *ValidValue*.

Functionality

The ImageParser works as follows:

The image is captured with the corresponding device (OCR-Station, document scanner, CxLetterScan, etc.).
The ImageParser loads the corresponding XML configuration file and reads it out.
The image is parsed according to the first ParserRule. The elements are cut out and processed accordingly (e.g. OCR, barcode reading, etc.):
- The original image is freed from dark edges according to the ParserRule.Trim attribute.
- The trimmed image is rotated according to the ParserRule.Rotate attribute.
- The ROI to be processed is calculated based on the attributes ParserRule.Origin and Element.x,y,w,h.
- The ROI is rotated and cropped according to the Element.Rotate attribute.
- The image section is processed with OCR/BCR.
The criteria are evaluated on the basis of the read data. If individual elements contain rotations, they are processed one after the other until the criterion is fulfilled or no further rotations are present.
Once all criteria have been evaluated, a check is made as to whether the criteria are fulfilled according to their operations. If this is the case, processing is stopped and the read values are returned. If the ParserRule contains rotations, the above steps are performed for each rotation until all criteria are met or there are no more rotations.
If no variant fulfills the defined criteria, the image is processed analogously with the second ParserRule.
Practically any number of ParserRules can be defined. They are processed in the order of the XML configuration file until all criteria of a ParserRule are fulfilled.
If no ParserRule fulfills the defined criteria or the timeout is reached, processing is aborted and empty values are returned. The following processing step defines the further procedure in the process.
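The overall control flow of the steps above can be summarized in a simplified sketch. It is not the actual implementation: the function parse_image, the callbacks read_elements and criteria_fulfilled, and the dictionary layout of a rule are all hypothetical, and rotations on element level are left out for brevity. Only the order (rule by rule, rotation by rotation, stop at the first fulfilled rule, empty result on timeout) follows the description.

```python
import time


def parse_image(rules, read_elements, criteria_fulfilled, timeout_ms=1000):
    """Try each rule and each rule rotation until one fulfills its criteria.

    read_elements(rule, angle)       -> dict of element values (hypothetical)
    criteria_fulfilled(rule, values) -> bool (hypothetical)
    """
    deadline = time.monotonic() + timeout_ms / 1000.0
    for rule in rules:                          # order of the XML file
        for angle in rule.get("Rotate", [0]):   # rotations of the whole image
            if time.monotonic() > deadline:
                return {}                       # timeout: empty values
            values = read_elements(rule, angle)
            if criteria_fulfilled(rule, values):
                return values                   # first fulfilled variant wins
    return {}                                   # no rule fulfilled its criteria
```

The nested loops make the cost of additional rotations visible: every extra angle in ParserRule.Rotate adds another full pass over the elements of that rule.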

Sequence of element processing

The order in which elements are processed is as follows:

All elements that are part of criteria (from all ParserRules, in the order of the ParserRules in the XML).
All text elements from all ParserRules, in the order of the ParserRules in the XML.
Remaining elements of all ParserRules, in the order of the ParserRules in the XML.
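The three groups listed above can be expressed as a small ordering sketch. The dictionary layout and the function name are hypothetical; the sketch also assumes that a criterion refers to an element of the same ParserRule and that a missing Type means Text, as stated for the Type attribute.

```python
def processing_order(rules):
    """Return (rule name, element name) pairs in the described order."""
    in_criteria, text_elements, remaining = [], [], []
    for rule in rules:                                   # order of the XML
        referenced = {c["ElementName"] for c in rule.get("Criteria", [])}
        for element in rule.get("Elements", []):
            key = (rule["Name"], element["Name"])
            if element["Name"] in referenced:
                in_criteria.append(key)                  # group 1: used by criteria
            elif element.get("Type", "Text") == "Text":
                text_elements.append(key)                # group 2: text elements
            else:
                remaining.append(key)                    # group 3: everything else
    return in_criteria + text_elements + remaining
```

Each group keeps the order of the ParserRules, so an element of a later rule that is used in a criterion is still processed before a plain text element of the first rule.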

Tips for optimal performance

Keep the number of variants to be processed as low as possible!
Define only the absolutely necessary rotations per parser rule (attribute ParserRule.Rotate).
Record the order of rotations in ascending order of probability (attribute ParserRule.Rotate)
If not all elements of a parser rule need a rotation, use rotation on the element (attribute Element.Rotate) instead of rotation on the parser rule.
Define only the criteria that are really needed.
Text elements: define the position/area of the element no larger than necessary.
Barcode/UPOC elements: define the position/area of the element no larger than necessary, and specify the attributes BarcodeType and PreProcess.
Postal addresses (Element Type = "PostalAddress"): use the attributes "CutLength" and "CutWidth" with as high a value as possible; this avoids unnecessary text blocks.
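To get a feeling for how rotations add up, the following sketch estimates the worst-case number of element reads of a configuration. It is a rough model under stated assumptions, not a measurement: it assumes that every rule rotation triggers one read per element rotation, ignores early termination and the timeout, and uses a hypothetical dictionary layout and function name.

```python
def worst_case_reads(rules):
    """Estimate the maximum number of OCR/BCR reads for a configuration."""
    total = 0
    for rule in rules:
        rule_angles = len(rule.get("Rotate") or [0])          # default: one angle
        reads_per_pass = sum(len(element.get("Rotate") or [0])
                             for element in rule.get("Elements", []))
        total += rule_angles * reads_per_pass
    return total
```

For the first ParserRule of the example (four rule angles, one element with four angles and three elements with one angle each) this gives 4 * 7 = 28 reads, compared with 7 reads if the rule itself defined a single angle.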