ImageParser Syntax
The ImageParser is a module of CodX PostOffice to control the processing of images.
An XML configuration file specifies how the image should be processed.
The XML file consists of rules (ParserRule) with picture elements (Element) and evaluation criteria (Criteria).
Example ImageParser configuration file
<ImageParser Name="Parser 1" Timeout="1000" Reference="SN-AZD, FE" Remark="This is a remark">
  <ParserRule Name="Role 1" Rotate="0, 30, 60, 90" Origin="top_left" Trim="Bottom, Right" FirstPage="1" LastPage="3">
    <Element Name="Element 1" x="100 px" y="200 px" h="50 mm" w="40 mm" Type="Text" Rotate="0, 90, 180, 270" ValidValue="[a-z]+" Prerequisite="Mandatory" SBB-CF="Alternative code"></Element>
    <Element Name="Element 2" x="200 px" y="400 px" h="50 mm" w="40 mm" Type="Barcode" BarcodeType="All" Rotate="0" PreProcess="BRectS" InvalidValue="(?&lt;Invalid>[^[0-9a-zA-Z()-/\s\öäüéàèÖÄÜ])"></Element>
    <Element Name="Element 3" x="300 px" y="500 px" h="50 mm" w="40 mm" Type="UPOC" Rotate="0" ValidValue="[0-9]+"></Element>
    <Element Name="PostAdr1" Type="PostalAddress" ValidValue="[AdrLevel_City]"></Element>
    <Criteria Name="Criteria 1" Operation="AND" ElementName="Element 1" RegEx="[a-z]+"></Criteria>
    <Criteria Name="Criteria 2" Operation="AND" ElementName="Element 2" RegEx="[0-9]+"></Criteria>
    <Criteria Name="Criteria 3" Operation="OR" ElementName="Element 3" RegEx="04[0-9]+"></Criteria>
    <Criteria Name="Criteria 3" Operation="AND" ElementName="PostAdr1" RegEx="[AdrLevel_City]"></Criteria>
  </ParserRule>
  <ParserRule Name="Role 2" Rotate="0, -30, -60, -90" Origin="top_left" Trim="Top">
    <Element Name="Element 1" x="100 px" y="200 px" h="50 mm" w="40 mm" Type="Text" Rotate="0, 90, 180, 270" ValidValue="[a-z]+"></Element>
    <Element Name="Element 2" x="200 px" y="400 px" h="50 mm" w="40 mm" Type="Barcode" Rotate="0" ValidValue="[a-z]+"></Element>
    <Element Name="Element 3" x="300 px" y="500 px" h="50 mm" w="40 mm" Type="UPOC" Rotate="0" ValidValue="[0-9]+"></Element>
    <Criteria Name="Criteria 1" Operation="AND" ElementName="Element 1" RegEx="[a-z]+"></Criteria>
    <Criteria Name="Criteria 2" Operation="AND" ElementName="Element 2" RegEx="[0-9]+"></Criteria>
    <Criteria Name="Criteria 3" Operation="OR" ElementName="Element 3" RegEx="04[0-9]+"></Criteria>
  </ParserRule>
</ImageParser>
The XML configuration file is structured as follows:
ImageParser
Contains 1 or more tags of 'ParserRule'.
Attribute |
Description |
Name |
Name of the ImageParser |
Timeout |
Maximum processing time in milliseconds. When the timeout is reached, processing is aborted and no result is returned. This attribute is optional. The default value is 1000 ms and the maximum allowed value is 99999 ms. A value of 0 or less is replaced by the default value; a value above the maximum is reduced to the maximum. |
Reference |
Reference to the module/function that uses this ImageParser. If an ImageParser is used in multiple modules, the module names are listed separated by commas.
The following values are possible:
- R-SCAN: CxLetterScan in 'R-Scan' mode, CxLetterScan in 'Maintenance' mode, module R-Scan
- SCANNER: CxLetterScan in 'Scanner' mode
- CAPTURE: CxLetterScan in 'Capture' mode
- SORT: CxLetterScan in 'Sort' mode
- DIGITAL: Digitizing
|
Remark |
Remark about the ImageParser.
This attribute is optional
|
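The Timeout rules described above can be sketched in a few lines of Python. This is only an illustration of the documented clamping behavior, not the ImageParser's actual code; the function name `effective_timeout` and the handling of a missing attribute as `None` are assumptions made for this example.

```python
DEFAULT_TIMEOUT_MS = 1000   # documented default
MAX_TIMEOUT_MS = 99999      # documented maximum


def effective_timeout(raw=None):
    """Return the timeout in ms that applies for a given Timeout attribute.

    `raw` is the attribute value as read from the XML (string or int),
    or None if the attribute is omitted (hypothetical representation).
    """
    if raw is None or str(raw).strip() == "":
        return DEFAULT_TIMEOUT_MS          # attribute is optional
    value = int(raw)                       # assumes a numeric value
    if value <= 0:
        return DEFAULT_TIMEOUT_MS          # 0 or less falls back to the default
    return min(value, MAX_TIMEOUT_MS)      # values above the maximum are capped
```

For example, `effective_timeout("250000")` yields 99999 and `effective_timeout(0)` yields 1000.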
ParserRule
Contains 1 or more tags of 'Element' and 'Criteria'.
Attribute |
Description |
Name |
Name of the ParserRule |
Rotate |
List of angles by which the full image is rotated for reading. The angles are separated by commas.
The angles increase clockwise from 0 to 360°.
ATTENTION: Each additional angle multiplies the processing time! |
Origin |
Origin of the image. All coordinates refer to this origin.
Possible values:
- top_left
- top_right
- bottom_left (default)
- bottom_right
This attribute is optional. |
Trim |
Optional. If this attribute is defined, areas at the specified edges of the image that are significantly darker than the rest of the image are cut off.
Possible values are a combination of the following values, separated by commas:
- Top
- Bottom
- Left
- Right |
FirstPage |
This attribute is only used when several documents are processed at the same time, for example in digitization.
This ParserRule is applied from this page on. If the number of the page to be processed is lower, the ParserRule is ignored. |
LastPage |
This attribute is only used when several documents are processed at the same time, for example in digitization.
This ParserRule is applied up to this page. If the number of the page to be processed is higher, the ParserRule is ignored. |
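As a rough illustration of the FirstPage/LastPage behavior, the following Python sketch decides whether a ParserRule applies to a given page. The function name `rule_applies`, the 1-based page numbering, and the treatment of a missing attribute as "no limit" are assumptions for this example and are not confirmed by the documentation.

```python
def rule_applies(page, first_page=None, last_page=None):
    """Return True if a ParserRule should be applied to `page`.

    Assumptions: pages are counted from 1, and an omitted FirstPage or
    LastPage attribute (None) means there is no limit on that side.
    """
    if first_page is not None and page < first_page:
        return False   # page lies before FirstPage: rule is ignored
    if last_page is not None and page > last_page:
        return False   # page lies after LastPage: rule is ignored
    return True
```

With FirstPage="1" and LastPage="3" as in the example configuration, `rule_applies(2, 1, 3)` is true and `rule_applies(4, 1, 3)` is false.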
Element
Describes the element to be parsed.
Attribute |
Description |
Name |
Name of the element, must be present!
The CxLetterScan requires certain elements depending on the mode (use case). See the online help for the corresponding mode. |
Type |
Type of element; optional, default value: Text.
The following types are possible:
- Text: The section of the element is analyzed with OCR and the recognized text is output.
- Barcode: The section of the element is analyzed for a barcode.
- UPOC: The section of the element is checked for a barcode and whether it is a valid UPOC (type, client, ID, checksum).
- PostalAddress: The section of the element is analyzed with OCR and the address tokenizer and the recognized address
is output. See also Postal Addresses. |
x |
X-coordinate of the upper left corner of the element.
The coordinate refers to the origin of the image.
Optionally, a unit can be specified, default = px (pixel).
The following units are possible:
- px: Pixel (default)
- %: Percent, related to the whole image
- mm: Millimeter (only if resolution is known) |
y |
Y-coordinate of the upper left corner of the element.
The coordinate refers to the origin of the image.
Optionally, a unit can be specified, default = px (pixel).
The following units are possible:
- px: Pixel (default)
- %: Percent, related to the whole image
- mm: Millimeter (only if resolution is known). |
h |
Height of the element.
Optionally, a unit can be specified, default = px (pixel).
The following units are possible:
- px: Pixel (default)
- %: Percent, related to the whole image
- mm: Millimeter (only if resolution is known).
|
w |
Width of the element.
Optionally, a unit can be specified, default = px (pixel).
The following units are possible:
- px: Pixel (default)
- %: Percent, related to the whole image
- mm: Millimeter (only if resolution is known).
|
BarcodeType |
Barcode type. Only relevant for type 'Barcode'. Several barcode types can be specified, separated by commas.
The following barcode types are possible:
- All (or no specification): All barcode types listed below are searched for.
- All1D: All 1D barcode types are searched for.
- All2D: Searches for all 2D barcode types.
- AustralianPostCode
- Aztec
- Circular2of5
- Codabar
- CodablockF
- Code128
- Code16K
- Code39
- Code39Extended
- Code39Mod43
- Code39Mod43Extended
- Code93
- DataMatrix
- EAN13
- EAN2
- EAN5
- EAN8
- GS1
- GS1DataBarExpanded
- GS1DataBarExpandedStacked
- GS1DataBarLimited
- GS1DataBarStacked
- GS1DataBarOmnidirectional
- GTIN12 (UPC-A with 12 symbols)
- GTIN13 (EAN-13)
- GTIN14 (I2of5 with 14 digits)
- GTIN8 (EAN-8)
- IntelligentMail
- Interleaved2of5
- ITF14 (I2of5 with 14 digits)
- MaxiCode
- MICR
- MicroPDF
- MSI
- PatchCode
- PDF417
- Pharmacode
- PostNet
- PZN
- QRCode
- RoyalMail
- RoyalMailKIX
- TriopticCode39
- UPCA
- UPCE
- UPU
This attribute is optional, default value is 'All'. |
Rotate |
Optional, default = 0 (0°).
Angle in degrees [°] by which the element must be rotated to be readable horizontally from left to right.
Multiple angles are separated by commas.
The angles increase clockwise from 0 to 360°.
Negative entries are allowed; they are automatically converted to the corresponding positive angle, e.g. -20° = +340°.
The element is rotated after the rotation of the full image (see attribute Rotate of tag ParserRule) and after the element has been cut out according to the attributes x, y, h, w.
ATTENTION: Each additional angle multiplies the processing time!
To avoid incorrect readings of barcode and especially UPOC elements, it is important that the possible angles are defined correctly.
The internally used barcode OCR engine supports the following discrete rotation angles: 0°, 11°, 22°, 45°, 90°, 135°, 158°, 169°, 180°, 191°, 202°, 225°, 270°, 315°, 338°, 349°. Specified angles are rounded to the nearest of these angles.
Additional angles are mandatory if the skew of the barcode can exceed the arctangent (tan⁻¹) of the ratio of bar height to barcode length.
Example: bar height = 10 mm, total length = 50 mm, so arctan(10/50) = 11.3°. If the barcode can be rotated by more than 11°, the value 11 must be added as a rotation angle.
|
PreProcess |
Optional. Preprocessing of the image for barcode reading; only available for the type Barcode. Default value: <blank> (no preprocessing)
The following preprocessings are possible:
- BRectS: Small used areas are cut out and processed individually. Improves the reading of barcodes and DataMatrix. Long processing time.
- BRectM: Medium used areas are cut out and processed individually. Improves the reading of barcodes and DataMatrix. Medium processing time.
- BRectL: Large used areas are cut out and processed individually. Improves the reading of barcodes and DataMatrix. Short processing time. |
ValidValue |
Regular expression defining a valid value of the element. The element has a valid value only if the value matches this regular expression. Otherwise the value of the element is empty.
Not to be confused with 'RegEx' from the criterion.
If the attribute 'ValidValue' is not specified or empty, the following default values apply:
- Text: "[0-9,a-z,A-Z]+"
- Barcode: "[0-9,a-z,A-Z]+"
- UPOC: <empty>, the value is checked by verifying the UPOC syntax.
If a RegEx is specified, the value is checked against it in addition to the UPOC syntax. |
InvalidValue |
Regular expression which defines all invalid values of an element. The element is valid only if it contains none of these invalid values. If this attribute is specified, it has priority over the 'ValidValue' attribute.
If the attribute 'InvalidValue' is not specified or empty, the default value remains empty and the rule for 'ValidValue' is active. |
OCRMinConfidence |
Optional, default = 0, only available for the types Text and PostalAddress.
The attribute defines the minimum quality (confidence) that the text recognized by OCR must have for further processing.
Range: 0% ... 100%
Well recognized texts have a confidence >= 50%, bad < 30%.
|
Prerequisite |
This attribute is only used for digitization. It defines whether the value of this element must be present in order to complete the digitization. If the element is not found, manual postprocessing or capture must be carried out.
The following values are possible:
- Optional
- Mandatory
- None
|
SBB-CF |
This attribute is only used for digitization. It defines in which consignment-custfield the value of the current element is to be stored. |
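The angle handling described for the Rotate attribute (conversion of negative angles, rounding to the supported discrete angles, and the arctangent rule) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, rounding is assumed to work "around the circle" (so 355° is closest to 0°), and ties between two equally close angles are resolved to the smaller listed angle, which the documentation does not specify.

```python
import math

# Discrete rotation angles supported by the barcode engine (from the table above).
SUPPORTED_ANGLES = (0, 11, 22, 45, 90, 135, 158, 169,
                    180, 191, 202, 225, 270, 315, 338, 349)


def normalize_angle(angle):
    """Convert any angle to the range 0..359, e.g. -20 becomes 340."""
    return angle % 360


def snap_to_supported(angle):
    """Round an angle to the nearest supported discrete angle."""
    a = normalize_angle(angle)

    def circular_distance(candidate):
        d = abs(a - candidate)
        return min(d, 360 - d)   # assumption: distance wraps around at 360

    return min(SUPPORTED_ANGLES, key=circular_distance)


def max_skew_without_extra_angle(bar_height_mm, barcode_length_mm):
    """Skew in degrees above which an additional angle is needed."""
    return math.degrees(math.atan(bar_height_mm / barcode_length_mm))
```

For the example from the table, `max_skew_without_extra_angle(10, 50)` evaluates to roughly 11.3, and `snap_to_supported(-20)` returns 338.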
Criteria
Describes the criteria that must be met for this ParserRule to take effect. If the criteria are not met, the next ParserRule is processed.
This must not be confused with 'ValidValue' of the element!
Attribute |
Description |
Name |
Name of the criterion |
Operation |
Specifies how the criterion is logically linked.
Possible values: 'AND', 'OR'. |
ElementName |
Name of the element to be checked. |
RegEx |
RegEx expression which must be fulfilled.
Elements of type PostalAddress do NOT support criteria; their validity is determined by internal rules!
|
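To make the difference between an element's ValidValue and a criterion's RegEx more tangible, here is a small Python sketch. It is not the ImageParser's implementation: the function names are hypothetical, Python's `re.search` stands in for whatever regex engine the product uses, and the way AND/OR operations are combined (strictly left to right, with the first criterion providing the starting value) is an assumption, since the documentation does not define the precedence.

```python
import re


def element_value(raw, valid_value):
    """Element level: keep the read value only if it matches ValidValue."""
    return raw if re.search(valid_value, raw) else ""


def rule_matches(criteria, values):
    """Rule level: combine all criteria of a ParserRule.

    `criteria` is a list of dicts with the keys Operation, ElementName, RegEx.
    `values` maps element names to their (already validated) values.
    Assumption: criteria are combined left to right; the Operation of the
    first criterion is not used.
    """
    result = None
    for c in criteria:
        hit = re.search(c["RegEx"], values.get(c["ElementName"], "")) is not None
        if result is None:
            result = hit
        elif c["Operation"] == "OR":
            result = result or hit
        else:  # "AND"
            result = result and hit
    return bool(result)


# Usage with values resembling the example configuration.
values = {
    "Element 1": element_value("abc", "[a-z]+"),
    "Element 2": element_value("12345", "[0-9]+"),
    "Element 3": element_value("x", "[0-9]+"),   # fails ValidValue, becomes ""
}
criteria = [
    {"Operation": "AND", "ElementName": "Element 1", "RegEx": "[a-z]+"},
    {"Operation": "AND", "ElementName": "Element 2", "RegEx": "[0-9]+"},
    {"Operation": "OR", "ElementName": "Element 3", "RegEx": "04[0-9]+"},
]
matched = rule_matches(criteria, values)
```

Under these assumptions the rule matches, because the first two criteria are fulfilled and the third is linked with OR.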
Postal addresses
Elements of the type PostalAddress are processed as follows:
The attributes x, y, h, w are optional. If they are not specified (or all are 0), the entire image is scanned for text blocks.
If the attributes x, y, h, w are defined then only the defined section is used (analogous to element type "Text").
The found text blocks are sorted according to certain criteria and analyzed with OCR and the SortTree.
The first/best recognized address is output.
If NO value is defined as ValidValue, all detected address blocks are returned as a string; there is NO analysis with the SortTree!
Specific attributes for element type "PostalAddress"
The following attributes can be defined specifically for elements of the type "PostalAddress":
Attribute |
Description |
ValidValue |
The following pseudo regex (default: <empty>) are supported:
- <empty> (default)
- [AdrLevel_Country]
- [AdrLevel_City]
- [AdrLevel_Street]
- [AdrLevel_House]
- [AdrLevel_Name] |
CutLength |
Optional, default = 0 (no exclusion zone).
Defines an "exclusion zone" in [mm] for automatically found text blocks (x,y,h,w = 0) on both long sides of the image. Detected text blocks that overlap this zone are ignored. |
CutWidth |
Optional, default = 0 (no exclusion zone).
Defines an "exclusion zone" in [mm] for automatically found text blocks (x,y,h,w = 0) on both short sides of the image. Detected text blocks that overlap this zone are ignored. |
ValidateAddress |
Optional, value range: 0/1, default: 0
Defines how the addresses found according to ValidValue are checked.
0: The check is performed by decomposition via the tokenizer and a test for non-empty values (up to the address level defined by ValidValue).
1: The check is made against the area data of the district administration. The address must be valid at least up to the level defined by ValidValue. |
Functionality
The ImageParser works as follows:
- The image is captured with the corresponding device (OCR-Station, document scanner, CxLetterScan, etc.).
- The ImageParser loads and reads the corresponding XML configuration file.
- The image is parsed according to the first ParserRule. The elements are cut out and processed accordingly (e.g. OCR, barcode reading, etc.):
  - The original image is freed from dark edges according to the ParserRule.Trim attribute.
  - The trimmed image is rotated according to the ParserRule.Rotate attribute.
  - The ROI to be processed is calculated based on the attributes ParserRule.Origin and Element.x, y, w, h.
  - The ROI is rotated and cropped according to the Element.Rotate attribute.
  - The image section is processed with OCR/BCR.
- The criteria are evaluated on the basis of the read data. If individual elements contain rotations, these are processed one after the other until the criterion is fulfilled or no further rotations are present.
- Once all criteria have been evaluated, a check is made as to whether the criteria are fulfilled according to their operations. If this is the case, processing stops and the read values are returned. If the ParserRule contains rotations, the above steps are performed for each rotation until all criteria are met or there are no more rotations.
- If no variant fulfills the defined criteria, the image is processed analogously with the second ParserRule.
- Practically any number of ParserRules can be defined. They are processed in the order of the XML configuration file until all criteria of a ParserRule are fulfilled.
- If no ParserRule fulfills the defined criteria or the timeout is reached, processing is aborted and empty values are returned. The following processing step defines the further procedure in the process.
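The overall control flow described above (rules in XML order, each rule tried with each of its rotations, stop at the first variant whose criteria are met, give up on timeout) could be sketched like this in Python. The sketch is deliberately abstract: `run_image_parser` and the callback `parse_with_rule` are hypothetical names, the actual image processing is hidden behind the callback, and checking the timeout only between variants is a simplifying assumption.

```python
import time


def run_image_parser(rules, parse_with_rule, timeout_ms=1000):
    """Try each ParserRule and rotation until one fulfills its criteria.

    rules: list of dicts with 'Name' and 'Rotate' (list of angles), in XML order.
    parse_with_rule(rule, angle): returns a dict of read values if all
        criteria of the rule are met for this rotation, otherwise None.
    Returns the read values, or an empty dict if nothing matched or the
    timeout was reached.
    """
    deadline = time.monotonic() + timeout_ms / 1000.0
    for rule in rules:
        # Assumption: a rule without rotations is tried once at 0 degrees.
        for angle in rule.get("Rotate") or [0]:
            if time.monotonic() >= deadline:
                return {}                      # timeout: empty values
            values = parse_with_rule(rule, angle)
            if values is not None:
                return values                  # criteria met: stop here
    return {}                                  # no ParserRule matched


# Usage with a stand-in parser that only succeeds for the second rule.
calls = []


def fake_parse(rule, angle):
    calls.append((rule["Name"], angle))
    return {"Element 1": "abc"} if rule["Name"] == "Role 2" else None


rules = [
    {"Name": "Role 1", "Rotate": [0, 30]},
    {"Name": "Role 2", "Rotate": [0, -30]},
]
result = run_image_parser(rules, fake_parse)
```

In this run, both rotations of "Role 1" are tried before "Role 2" succeeds at its first rotation, so the remaining rotation of "Role 2" is never processed.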
Sequence of element processing
The order in which elements are processed is as follows:
-
All elements that are part of criteria (from all ParserRules, in the order of the ParserRules in the XML).
-
All text elements from all ParserRules, in the order of the ParserRules in the XML.
-
Remaining elements of all ParserRules in the order of the ParserRules in the XML
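The processing order listed above amounts to sorting elements into three groups while keeping the XML order within each group. The following Python sketch illustrates this; the data layout (dicts with `Elements` and `Criteria` lists) and the function name are assumptions for this example, as is the idea that a criterion refers to an element of its own ParserRule.

```python
def processing_order(rules):
    """Return (rule name, element name) pairs in processing order.

    rules: list of dicts in XML order, each with
        'Name', 'Elements' (dicts with 'Name' and optional 'Type')
        and 'Criteria' (dicts with 'ElementName').
    """
    in_criteria, text, remaining = [], [], []
    for rule in rules:
        referenced = {c["ElementName"] for c in rule["Criteria"]}
        for element in rule["Elements"]:
            entry = (rule["Name"], element["Name"])
            if element["Name"] in referenced:
                in_criteria.append(entry)          # group 1: used by criteria
            elif element.get("Type", "Text") == "Text":
                text.append(entry)                 # group 2: text elements
            else:
                remaining.append(entry)            # group 3: everything else
    return in_criteria + text + remaining


# Usage with two small, made-up rules.
rules = [
    {"Name": "A",
     "Elements": [{"Name": "E1", "Type": "Text"},
                  {"Name": "E2", "Type": "Barcode"},
                  {"Name": "E3", "Type": "Text"}],
     "Criteria": [{"ElementName": "E1"}]},
    {"Name": "B",
     "Elements": [{"Name": "E4", "Type": "Barcode"},
                  {"Name": "E5"}],                 # Type omitted: defaults to Text
     "Criteria": [{"ElementName": "E4"}]},
]
order = processing_order(rules)
```

Here the criteria elements E1 and E4 come first, followed by the text elements E3 and E5, and finally the barcode element E2.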
Tips for optimal performance
-
Keep the number of variants to be processed as low as possible!
-
Define only the absolutely necessary rotations per parser rule (attribute ParserRule.Rotate).
-
List the rotations in order of probability, with the most likely angle first (attribute ParserRule.Rotate).
-
If not all elements of a parser rule need a rotation, use rotation on the element (attribute Element.Rotate) instead of rotation on the parser rule.
-
Define only the criteria that are really needed.
-
Text elements: define the position/area of the element no larger than necessary.
-
Barcode/UPOC elements: define the position/area of the element no larger than necessary and specify the attributes BarcodeType and PreProcess.
-
Postal addresses (Element Type = "PostalAddress"): use the attributes "CutLength" and "CutWidth" with values as high as possible; this avoids unnecessary text blocks.