Blocks Classification
This data set has been used to try different
simplification methods for decision trees.
The problem consists of classifying all the blocks
of the page layout of a document that have been
detected by a segmentation process. This is an
essential step in document analysis in order to
separate text from graphic areas. Indeed, the five
classes are: text (1), horizontal line (2), picture
(3), vertical line (4) and graphic (5).
For a detailed presentation of the problem see:
Esposito F., Malerba D., & Semeraro G.
Multistrategy Learning for Document Recognition
Applied Artificial Intelligence, 8, pp. 33-84, 1994
The 5473 examples come from 54 distinct documents.
Each observation concerns one block. All attributes
are numeric.
Number of Instances: 5473.
Number of Attributes: 10 (all numeric).
Class Distribution:
Class | Frequency | Percent | Valid Percent | Cumulative Percent |
text | 4913 | 89.8 | 89.8 | 89.8 |
horiz. line | 329 | 6.0 | 6.0 | 95.8 |
graphic | 28 | .5 | .5 | 96.3 |
vert. line | 88 | 1.6 | 1.6 | 97.9 |
picture | 115 | 2.1 | 2.1 | 100.0 |
Total | 5473 | 100.0 | 100.0 | |
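The percentage columns follow directly from the frequencies, which sum to 5473, the stated number of instances. As a minimal sketch in Python (the names CLASS_COUNTS and distribution are illustrative and not part of the data set distribution), the table can be recomputed as follows:

```python
# Class frequencies copied from the Class Distribution table.
CLASS_COUNTS = {
    "text": 4913,
    "horiz. line": 329,
    "graphic": 28,
    "vert. line": 88,
    "picture": 115,
}


def distribution(counts):
    """Return the total and (label, frequency, percent, cumulative percent) rows."""
    total = sum(counts.values())
    rows = []
    running = 0
    for label, freq in counts.items():
        running += freq
        rows.append((
            label,
            freq,
            round(100 * freq / total, 1),      # share of all examples
            round(100 * running / total, 1),   # cumulative share so far
        ))
    return total, rows


total, rows = distribution(CLASS_COUNTS)
for label, freq, pct, cum in rows:
    print(f"{label:<12} {freq:>5} {pct:>6.1f} {cum:>6.1f}")
print(f"{'Total':<12} {total:>5}")
```

Since text blocks make up 89.8% of the examples, a classifier that always predicts text would already reach about that accuracy, which is worth keeping in mind when comparing decision trees on this data.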
Summary Statistics:
Variable | Mean | Std Dev | Minimum | Maximum | Correlation |
HEIGHT | 10.47 | 18.96 | 1 | 804 | .3510 |
LENGTH | 89.57 | 114.72 | 1 | 553 | -.0045 |
AREA | 1198.41 | 4849.38 | 7 | 143993 | .2343 |
ECCEN | 13.75 | 30.70 | .007 | 537.00 | .0992 |
P_BLACK | .37 | .18 | .052 | 1.00 | .2130 |
P_AND | .79 | .17 | .062 | 1.00 | -.1771 |
MEAN_TR | 6.22 | 69.08 | 1.00 | 4955.00 | .0723 |
BLACKPIX | 365.93 | 1270.33 | 7 | 33017 | .1656 |
BLACKAND | 741.11 | 1881.50 | 7 | 46133 | .1565 |
WB_TRANS | 106.66 | 167.31 | 1 | 3212 | .0337 |
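A summary row of this kind could be computed as sketched below. This is only an illustration under stated assumptions: the text does not say what the Correlation column is measured against, so the sketch assumes a Pearson correlation with the numeric class code, and it assumes the sample (n - 1) standard deviation. The helper names and the toy values are made up and are not taken from the data set.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def summarize(values, target):
    """One summary row: mean, sample std dev, min, max, correlation with target."""
    return {
        "mean": statistics.fmean(values),
        "std_dev": statistics.stdev(values),   # assumption: sample std dev
        "minimum": min(values),
        "maximum": max(values),
        "correlation": pearson(values, target),  # assumption: vs. class code
    }


# Made-up toy values for illustration only, NOT rows from the data set.
height = [5, 6, 7, 30, 2]
class_code = [1, 1, 1, 5, 2]

row = summarize(height, class_code)
print({name: round(value, 4) for name, value in row.items()})
```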
Vista visualization: 442 unseparated items (8%)
* Some small clusters cannot be separated well.