Blocks Classification

This data set has been used to try different
simplification methods for decision trees.

The problem consists of classifying all the blocks
of a document's page layout that have been
detected by a segmentation process. This is an
essential step in document analysis, needed to
separate text from graphic areas. The five
classes are: text (1), horizontal line (2), picture
(3), vertical line (4) and graphic (5).
For a detailed presentation of the problem see:
Esposito F., Malerba D., & Semeraro G.
Multistrategy Learning for Document Recognition
Applied Artificial Intelligence, 8, pp. 33-84, 1994
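
The class codes listed above can be kept as a small lookup table when
reading the data. This is a minimal sketch, assuming the numeric codes in
the data file match the list in this description; `label_for` is a
hypothetical helper name, not part of the data set.

```python
# Class codes as given in the problem description above.
CLASS_LABELS = {
    1: "text",
    2: "horizontal line",
    3: "picture",
    4: "vertical line",
    5: "graphic",
}


def label_for(code):
    """Return the class name for a numeric class code (hypothetical helper)."""
    key = int(code)  # accepts 1, "1" or 1.0
    if key not in CLASS_LABELS:
        raise ValueError(f"unknown class code: {code!r}")
    return CLASS_LABELS[key]
```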

The 5473 examples come from 54 distinct documents.
Each observation concerns one block. All attributes
are numeric.

Number of Instances: 5473.

Number of Attributes: 10 (the variables listed under Summary Statistics), plus the class.

Class Distribution:

Class         Frequency  Percent  Valid Percent  Cumulative Percent
text               4913     89.8           89.8                89.8
horiz. line         329      6.0            6.0                95.8
graphic              28      0.5            0.5                96.3
vert. line           88      1.6            1.6                97.9
picture             115      2.1            2.1               100.0
Total              5473    100.0          100.0
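
The percentage columns follow directly from the frequencies. The sketch
below recomputes the total, the percent and the cumulative percent from
the five class counts. It also computes the accuracy of always predicting
the most frequent class, which is a useful reference point for such a
skewed distribution. The names `distribution` and `majority_baseline` are
illustrative assumptions.

```python
# Class frequencies copied from the table above.
frequencies = [
    ("text", 4913),
    ("horiz. line", 329),
    ("graphic", 28),
    ("vert. line", 88),
    ("picture", 115),
]


def distribution(freqs):
    """Return (total, rows); each row is (label, count, percent, cumulative percent)."""
    total = sum(count for _, count in freqs)
    rows = []
    running = 0
    for label, count in freqs:
        running += count
        rows.append((label, count, 100.0 * count / total, 100.0 * running / total))
    return total, rows


total, rows = distribution(frequencies)
for label, count, pct, cum in rows:
    print(f"{label:<12}{count:>6}{pct:>7.1f}{cum:>7.1f}")
print(f"{'Total':<12}{total:>6}")

# Accuracy of a classifier that always predicts the most frequent class.
majority_baseline = max(count for _, count in frequencies) / total
```

Any decision tree evaluated on this data should be compared against
`majority_baseline` rather than against chance.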

Summary Statistics:

Variable        Mean   Std Dev   Minimum   Maximum  Correlation
HEIGHT         10.47     18.96         1       804       0.3510
LENGTH         89.57    114.72         1       553      -0.0045
AREA         1198.41   4849.38         7    143993       0.2343
ECCEN          13.75     30.70     0.007    537.00       0.0992
P_BLACK         0.37      0.18     0.052      1.00       0.2130
P_AND           0.79      0.17     0.062      1.00      -0.1771
MEAN_TR         6.22     69.08      1.00   4955.00       0.0723
BLACKPIX      365.93   1270.33         7     33017       0.1656
BLACKAND      741.11   1881.50         7     46133       0.1565
WB_TRANS      106.66    167.31         1      3212       0.0337
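
A table like the one above could be recomputed from the raw data along
the following lines. This is only a sketch under stated assumptions: it
supposes one block per line, whitespace-separated, with the ten
attributes in the order of the table followed by the class code, and it
uses the sample standard deviation (n - 1), which may or may not be the
convention used for the table. The `summarize` function and the two
sample rows are made up for illustration and are not taken from the data
set.

```python
import statistics

# Column names from the Summary Statistics table; the trailing CLASS column
# and the whitespace-separated layout are assumptions about the data file.
COLUMNS = ["HEIGHT", "LENGTH", "AREA", "ECCEN", "P_BLACK", "P_AND",
           "MEAN_TR", "BLACKPIX", "BLACKAND", "WB_TRANS", "CLASS"]


def summarize(lines):
    """Return {column: (mean, std dev, minimum, maximum)} for the attributes."""
    rows = [[float(v) for v in line.split()] for line in lines if line.strip()]
    for row in rows:
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(row)}")
    summary = {}
    for i, name in enumerate(COLUMNS[:-1]):  # skip the class column
        values = [row[i] for row in rows]
        summary[name] = (
            statistics.mean(values),
            statistics.stdev(values),  # sample std dev (n - 1); needs >= 2 rows
            min(values),
            max(values),
        )
    return summary


# Made-up rows for illustration only; they are NOT taken from the data set.
sample = [
    "5 7 35 1.4 0.4 0.6 2.0 14 21 7 1",
    "7 9 63 1.0 0.5 0.8 4.0 30 50 9 2",
]
for name, (mean, std, low, high) in summarize(sample).items():
    print(f"{name:<9}{mean:>10.2f}{std:>10.2f}{low:>10g}{high:>10g}")
```

Because `summarize` accepts any iterable of lines, an open file handle
can be passed in place of the sample list.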

Vista visualization: 442 unseparated items (8%)
* Some small clusters cannot be separated well.
