## Untitled

Technical University of Crete
Department of Electronic and Computer Engineering
A thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Computer Engineering
Chania, October 2006

**Abstract **
The availability of large volumes of digital content in modern applications (e.g.,
digital libraries and organization intranets) and the on internet has generated
additional interest in methods and tools for effective management of shared content.
Data clustering is a means for achieving better organization of the information by
partitioning the data space into groups of entities with similar content. Clustering of
large document collections is the problem this thesis is dealing with. State of the art
clustering algorithms are reviewed first (e.g. partitional and hierarchical algorithms).
Initially, we focus on partitional clustering methods due to their low time
complexity (i.e., linear on the number of documents). Hierarchical clustering methods
are considered as well. We examine several variants of the original K-Means
algorithm and we propose the so-called "Incremental K-Means" which differs from
K-Means in the way the centroids are updated during each clustering iteration.
However, both K-Means and its variants produce a flat partition of the data collection.
Efficient methods which are able to provide effective organization of information
(like hierarchical clustering) are preferred.
We propose a novel hierarchical clustering approach, which we call "BIC-
Means". BIC-Means produces a hierarchy of clusters by recursively applying the
Incremental K-Means on a document collection. BIC-Means combines the strengths
of partitional and hierarchical clustering methods. The main advantage of BIC-Means
is that it does not terminate when singleton clusters are reached at the bottom of the
hierarchy. To prevent over-splitting of clusters, BIC-Means incorporates the use of the
Bayesian Information Criterion (BIC) or Schwarz Criterion for terminating the
splitting of the hierarchy when meaningful clusters are reached. We use BIC to
perform a splitting test at each leaf cluster in order to decide whether a cluster should
be split or not. BIC-Means terminates when there is no separable cluster according to
the BIC function.
We run several sets of experiments on two TREC standard document
collections (Reuters and OHSUMED). Our experimental results show that the main
advantage of BIC-Means is that it requires significantly less time to build a cluster
hierarchy than the standard Bisecting K-Means algorithm (BIC-Means does not have
to reach singleton clusters at the leafs). In terms of clustering quality, BIC-Means
achieves approximately the same performance as the basic Bisecting technique (the
exhaustive approach). Therefore, BIC-Means is more suitable than its competitors for
clustering very large document collections effectively. This is not only due to its low
computational requirements, but also due to its comparable clustering performance.
We also explore Medical Subject Headings (MeSH) as features for
representing medical documents i.e., each document is represented as a vector of
MeSH terms (multi-word terms) rather than as vectors of single-word terms. Our
evaluation shows that MeSH-based representation of documents improves
significantly the performance of BIC-Means in terms of clustering time and clustering
Finally, we examine how hierarchical clustering could be used to improve the
effectiveness and efficiency of retrieval from large medical document collections. We
propose several cluster-based retrieval strategies using MeSH terms as document
representation. Experimental results show that the best proposed cluster-based
retrieval strategy is almost as effective as exhaustive searching (i.e. searching without
clustering). Cluster-based retrieval not only saves a huge amount of computation but
does so without significant loss in precision and recall.
Ο συνεχώς αυξανόµενος όγκος της ψηφιακής πληροφορίας σε παλιές και νέες
εφαρµογές (ψηφιακές βιβλιοθήκες, εσωτερικά δίκτυα οργανισµών κ.α.) και στο
διαδίκτυο έχουν αυξήσει σηµαντικά το ενδιαφέρον για µεθόδους που επιτυγχάνουν
την οργάνωση της πληροφορίας στα µέσα αποθήκευσης όπως είναι µέθοδοι
«οµαδοποίησης» (clustering). Τα δεδοµένα οργανώνονται σε µικρό αριθµό οµάδων
(clusters) όπου κάθε οµάδα περιέχει παρόµοιο πληροφοριακό περιεχόµενο. Η
οµαδοποίηση σε πολύ µεγάλες συλλογές κειµένων (document clustering) είναι το
βασικό θέµα της συγκεκριµένης εργασίας.
Αρχικά κάνουµε µια ανασκόπηση των πιο γνωστών µεθόδων οµαδοποίησης
που έχουν παρουσιαστεί στη βιβλιογραφία. Εστιάζουµε σε «διαµεριστικούς»
(partitional) αλγόριθµούς οµαδοποίησης κειµένων (partitional clustering) εξαιτίας της
χαµηλής (γραµµικής) πολυπλοκότητάς τους ως προς τον αριθµό των κειµένων αλλά
και της καλής απόδοσης που έχουν επιδείξει. Ωστόσο, για την οµαδοποίηση
συλλογών κειµένων προτιµούνται µέθοδοι οι οποίες παρέχουν αποτελεσµατική
πλοήγηση, οργάνωση και απεικόνιση της πληροφορίας όπως οι ιεραρχικοί µέθοδοι
(hierarchical clustering). Οι περισσότερες γνωστές ιεραρχικές µέθοδοι, αν και
ακριβείς έχουν τετραγωνική πολυπλοκότητα και για αυτό το λόγο δεν εφαρµόζονται
σε µεγάλες συλλογές κειµένων.
Η µέθοδος Κ-µέσων (K-Means) και οι παραλλαγές του παράγουν ένα επίπεδο
διαµερισµό των δεδοµένων (flat partitioning). Εξετάζουµε διάφορες παραλλαγές του
κλασσικού αλγορίθµου των Κ-µέσων και προτείνουµε µία παραλλαγή του, την
«Επαυξητική µέθοδο Κ-µέσων» (Incremental K-Means) η οποία ανανεώνει το κέντρο
(centroid) µιας οµάδας µόλις ένα κείµενο προστεθεί σε αυτόν. Προτείνουµε επίσης
µία νέα ιεραρχική µέθοδο που ονοµάζουµε "BIC-Means". Η µέθοδος BIC-Means
παράγει µια ιεραρχία από οµάδες δεδοµένων (κειµένων στην περίπτωσή µας)
εφαρµόζοντας επαναληπτικά την επαυξητική µέθοδο Κ-µέσων σε µια συλλογή
κειµένων. Η µέθοδος συνδυάζει τα πλεονεκτήµατα των επίπεδων και ιεραρχικών
τεχνικών δηλαδή, είναι ιεραρχική µέθοδος ενώ είναι πιο γρήγορη από την επαυξητική
µέθοδο. Αυτό οφείλεται στο ότι η µέθοδος BIC-Means δεν είναι εξαντλητική,
δηλαδή δεν χρειάζεται να τερµατίσει όταν οι οµάδες περιέχουν ένα µόνο κείµενο
(singleton clusters). Για να επιτευχθεί αυτό, o BIC-Means ενσωµατώνει το Bayesian
Information Criterion (BIC) ή Schwarz Criterion. Το κριτήριο αυτό εφαρµόζεται για
να σταµατήσει τις διασπάσεις των οµάδων σε ανώτερα επίπεδα της ιεραρχίας όταν η
περαιτέρω διάσπασή τους δεν θα οδηγήσει σε καλύτερη οµαδοποίηση. Ο BIC-Means
τερµατίζει όταν έχουν εξεταστεί όλες οι υποψήφιες προς διάσπαση οµάδες (οµάδες
που βρίσκονται στα φύλλα της ιεραρχίας) και δεν υπάρχει άλλη υποψήφια οµάδα για
διάσπαση.
Για την αξιολόγηση της απόδοσης των αλγορίθµων που αναπτύξαµε κάναµε
ένα σύνολο πειραµάτων σε δύο πολύ διαδεδοµένες συλλογές κειµένων (Reuters,
OHSUMED). Σύµφωνα µε τα πειραµατικά αποτελέσµατα, το βασικό πλεονέκτηµα
του BIC-Means είναι ότι χρειάζεται πολύ λιγότερο χρόνο (εν σε σχέση µε τον βασικό
επαυξητικό αλγόριθµο Κ-µέσων) για να δηµιουργήσει µια ιεραρχία από οµάδες ενώ
αποδίδει το ίδιο καλά µε την επαυξητική µέθοδο που προτείναµε (η οποία είναι ήδη
πολύ αποτελεσµατική επιτυγχάνοντας ακρίβεια οµαδοποίησης πάνω από 75% εν
σχέση µε µία οµαδοποίηση που παράγουν ειδικοί χρήστες). Εποµένως, ο BIC-Means,
συγκρινόµενος µε τις γνωστές µεθόδους οµαδοποίησης είναι εξίσου ακριβής και
µπορεί να εφαρµοστεί για την ιεραρχική οµαδοποίηση πολύ µεγάλων συλλογών
κειµένων. Παράλληλα, εξετάζουµε την χρήση ειδικών ιατρικών όρων (από την MeSH
ταξινοµική ιεραρχία) για την αναπαράσταση ιατρικών κειµένων. Με αυτόν τον τρόπο
κάθε κείµενο αναπαριστάται µε διανύσµατα πολυ-λεκτικών MeSH όρων (multi-word
terms) αντί µε διανύσµατα απλών µονο-λεκτικών όρων (single-word terms). Οι
παραστάσεις αυτές περιγράφουν καλύτερα το ιατρικό περιεχόµενο των κειµένων σε
ιατρικές εφαρµογές (π.χ. κείµενα της συλλογής OHSUMED) και είναι πιο συµπαγείς
(περιέχουν λιγότερους όρους). Τα αποτελέσµατα των πειραµάτων έδειξαν, ότι η
παράσταση των κειµένων µε MeSH όρους βελτιώνει σηµαντικά την απόδοση του
BIC-Means, τόσο σε σχέση µε την ποιότητα της οµαδοποίησης όσο και µε τον χρόνο
που απαιτείται.
Ολοκληρώνοντας, εξετάζουµε πώς µία ιεραρχική οµαδοποίηση (που έχει
παραχθεί µε µία µέθοδος όπως η BIC-Means) µπορεί να χρησιµοποιηθεί για την
γρηγορότερη ανάκτηση πληροφορίας (information retrieval) σε µεγάλες ιατρικές
συλλογές κειµένων. Προτείνουµε µια σειρά από στρατηγικές ανάκτησης που κάνουν
χρήση των δεδοµένων της ιεραρχίας. Τα πειραµατικά αποτελέσµατα έδειξαν ότι η
καλύτερη από τις προτεινόµενες στρατηγικές ανάκτησης αποδίδει το ίδιο καλά µε την
εξαντλητική µέθοδο ανάκτησης (χωρίς χρήση οµαδοποίησης) δηλαδή, µειώνει κατά
πολύ τον απαιτούµενο υπολογιστικό χρόνο, χωρίς να προκαλεί µείωση της απόδοσης
της ανάκτησης.
This dissertation could not have been completed without the help of many people. I
am foremost grateful to my supervisor, Professor Euripides Petrakis, for always being
available, for his invaluable advice, his tireless support, the encouragement and the
confidence he showed to me and to my work. His comments and corrections were
always up to the point showing me the way.
I am also very thankful to the rest members of the supervisory committee,
Professors Michail Lagoudakis and Vassilis Samoladas for valuable contributions to
this work. They got into deep questions and with their comments helped me to
improve the content of this dissertation.
Special thanks are given to Professor Evangelos Milios (Computer Science
Department, Dalhousie University) for his critical analysis and recommendations and
for being an outstanding mentor since my initial research steps.
I feel grateful to my colleagues in the Intelligent Systems Laboratory of
Technical University of Crete. Special thanks are given to Angelos Hliaoutakis,
Epimenidis Voutsakis, Paraskevi Raftopoulou and Giannis Varelas for their
collaboration and valuable comments and the harmonic and enjoyable coexistence
they provided. I wish for all of them the best in their life.
I would also like to thank my friends who offered me ungrudgingly their
friend-ship. I want to profoundly thank Patra Papadaki for her encouragement and
patience during this thesis. She always was there whenever I needed her.
Above all, my deepest thanks go to my parents Giorgos and Voula, and my
brother Vaggelis for their endless love and encouragement in all aspects of my life.

**Table of Contents**
BISECTING INCREMENTAL K-MEANS ALGORITHMS 60

**List of Figures**
**List of Tables **
**Chapter 1**
In recent years, we have seen a tremendous explosion of electronic information
available on the Internet, digital libraries and organizational intranets. Large
collections of documents are becoming increasingly common and widely available to
the public. On the other hand, the World Wide Web (WWW) continues to expand at
an amazing rate. Due to the huge size of document collections, searching for
information in such collections has become a very challenging task.
Document or text clustering plays an important role toward this goal. It refers
to the process of automatic grouping of text documents into clusters, so that each
cluster consists of similar documents (documents in different clusters are dissimilar).
Document clustering is the fundamental tool for enabling efficient document
summarization, organization, and navigation in very large data sets. It provides the
infrastructure for developing tools supporting navigation and browsing mechanisms
by organizing enormous amounts of documents into meaningful clusters.
Document clustering has been widely applied in various scientific fields for
supporting search engines, text mining, and Inform
used also as a post-retrieval tool for organizing query results into thematic topics.
These organized results can be interactively browsed, visualized, and explored by the
Text Clustering is an unsupervised learning process of grouping documents
into clusters. There are no pre-defined classes available in document clustering and
this is how text clustering differs from text classification. In text classification, we are
provided with a training set of labeled documents and we are asked to assign to one of
new, yet unlabeled documents the pre-
categorization is a supervised learning task.
CHAPTER 1. INTRODUCTION

**1.1. Motivation **
Document clustering has been extensively studied in the literature and a variety of
algorithmsalgorithms can be
categorized along different dimensions. First, clustering can be either static or
dynamic. Static clustering algorithms usually refer to static document collections. On
the other hand, dynamic clustering is applied on data sets that change dynamically.
Consider for example the flow of information that arrives continuously on news wires
message systems such as Reuters, Marketwatch, etc. In this case the clusters must
adapt to the incoming flow or deletions of documents. Dynamic clustering has not
been widely studied, while static clustering methods can be further improved.
Based on the nature of the membership function the clustering can be either
hard or soft (fuzzy). Hard clustering algorithms produce hard clusters (i.e., each
document is assigned to a single cluster) while in soft clustering, documents may be
instances of more than one cluster. Notice that a fuzzy clustering can be converted to
a hard clustering by assigning each data object to the closest cluster. In this thesis, we
focus on static, hard clustering algorithms.
Based on the underlying algorithmic methodology, the standard clustering
algorithms can be categorized into hierar
partitionaerarchical clustering algorithms proceeds either
bottom-up (agglomerative), or top-down (divisive). Hierarchical Agglomerative
Clustering (HAC) starts with all documents as individual clusters and works by
merging the most similar ones iteratively until a single cluster with all documents is
produced at the root of the hierarchy. Divisive algorithm approaches start with all
documents in the same root cluster and work by iteratively splitting each cluster into a
number of smaller ones until clusters with one document (singleton clusters) are
produced at the leafs of the hierarchy. Both types of methods produce a tree hierarchy
of clusters called a "dendrogram". Contrary to hierarchical clustering techniques,
partitional algorithms create a flat (un-nested) partitioning of documents. K-Means is
a widely used partitional clustering method. It partitions the entire collection into K
clusters, where K stands for the desired number of output clusters and must be known
In recent years, various experimental resu
that partitional clustering algorithms are well-suited for clustering large data sets due
TECHNICAL UNIVERSITY OF CRETE
to their low computational requirements (linear in the number of documen
The time complexity of most hierarchical clustering methods is quadratic in the
number of documents. Due to the large size of document collections in modern
applications and the explosion of the WWW there is an increased need for effective
clustering, which scales-up well for very large data sets. Hierarchical clustering
provides the infrastructure for developing effective navigation, organization, and
visualization tools for such large amounts of information. For this reason, hierarchical
clustering solutions are preferred. However, traditional hierarchical clustering
algorithms have limited applicability in large document collections due to their
quadratic time complexity. Furthermore, partitional techniques usually lead to better
clustering solutions than agglomerative algorithms
Partitional clustering methods can also be used to obtain a hierarchical
clustering solution via a sequence of repeated application of the K-Means algorithm.
Bisecting K-Means method is such an approach. The method starts with all documents
in a single cluster. Initially, this cluster is partitioned into two clusters by applying K-
Means. The algorithm continues by splitting similarly each produced cluster until
singleton clusters are obtained at the leafs or until K clusters have been produced.
The so-obtained clusters are structured as a hierarchical binary tree. The overall
hierarchy is built in

*O*(

*n *log

*n*) time (in case of a balanced hierarchy), where

*n * is the
number of documents.
An important issue in divisive clustering approaches is to determine a strategy
to terminate the divisive procedure. Without prior knowledge on the number of
clusters the algorithm executes exhaustively. In the case of large document collections
the efficiency of clustering decreases significantly. For this reason, additional
termination criteria must be introduced to increase the efficiency of the algorithm and
prevent it from over-partitioning. This is exactly one of the problems this work is
dealing with. Notice that, without a termination criterion, even meaningful clusters
(i.e., clusters corresponding to real classes) are further split until singleton clusters are
reached at the leafs of the tree hierarchy.
NIKOLAOS HOURDAKIS
CHAPTER 1. INTRODUCTION

**1.2. Contributions **
In this thesis, our main objective is to develop a highly efficient algorithm for
clustering very large document collections. We focus on partitional clustering
methods due to their low time complexity. We implemented several partitional
clustering algorithms and we studied their performance.
Initially, we examined the standard K-Means clustering approach. We
developed several variants of the original K-Means method and we proposed the so-
called "Incremental K-Means". Incremental K-Means differs from basic K-Means in
the way the centroids are updated during each clustering iteration. In Incremental K-
Means each cluster centroid is updated after each document is assigned to a cluster
(rather than re-computing each cluster centroid after each iteration when all
documents have been assigned to clusters).
We investigate how Incremental K-Means can be used effectively to build a
hierarchy of clusters. A hierarchical solution is obtained by recursively applying the
Incremental K-Means on a document collection. All documents are initially
partitioned into two clusters. Then, the least cohesive leaf cluster is selected for
further splitting. This process of selecting and bisecting a leaf cluster continues until
all clusters at the bottom of the hierarchy contain a single document. We call the
proposed algorithm "Bisecting Incremental K-Means". As indicated he
basic Bisecting approach significantly outperforms agglomerative hierarchical
clustering algorithm in terms of clustering quality and efficiency. Thus, we focus our
research on Bisecting Incremental K-Means.
As mentioned in Section 1.1, despite its effectiveness, the main disadvantage
of Bisecting Incremental K-Means is that it terminates when each leaf cluster contains
a single document which is not only slow but also produces lots of meaningless
clusters at the bottom of the hierarchy (e.g., singleton clusters). The produced
clustering result is not appropriate for navigation, data summarization and browsing
of information. For this, a terminating condition must be defined.
We propose a novel hierarchical clustering approach which incorporates the
use of the "Bayesian Information Criterion (BIC)" or Schw
terminating the splitting of the Bisecting Incremental K-Means algorithm. We suggest
using BIC as the splitting criterion of a cluster. BIC estimates the cohesiveness of
clusters in order to denote whether a cluster should split. If the BIC score of the
TECHNICAL UNIVERSITY OF CRETE
produced children clusters is less than the BIC score of the parent cluster we do not
accept the split and keep the parent cluster as is. We terminate the divisive procedure
when there is no separable leaf cluster according to the BIC function. Similarly to our
K-Means, first used BIC to estimate the
best K in a given range of values. Notice that X-Means is a partitional clustering
Overall, we propose the so-called here-after "BIC-Means" clustering
algorithm which is the main contribution of this thesis. It produces a hierarchical
clustering solution and combines all the following ideas:
1. Bisecting clustering approach to build a hierarchy of clusters effectively.
2. Incremental K-Means as the proposed partitional method to bisect the
selected leaf cluster at each bisecting step.
3. A termination criterion from preventing clustering from over-splitting based
on Bayesian Information Criterion (BIC).
The proposed BIC-Means terminates before each leaf cluster becomes a single
document. As a result, the obtained clusters are more meaningful as compared to
meaningless singleton clusters of standard hierarchical algorithms. The proposed
algorithm combines the strengths of partitional and hierarchical clustering methods.
We focus on evaluating the performance of the proposed clustering algorithms
in terms of clustering quality and time required to obtain clustering solutions. We
used two standard document collections (OHSUMED and Reuters). F-Measure was
used to examine the quality of the produced clustering results. It measures the
performance of clustering methods in terms of how well the documents belonging to
each of the pre-defined classes match the documents belonging to the corresponding
Experimental results indicated that Bisecting Incremental K-Means performs
significantly better than K-Means and (our proposed variant) Incremental K-Means in
terms of F-Measure on both test collections. We also observed that Incremental K-
Means always produces better partitional clustering solutions than standard K-Means.
We also explored Medical S
describing medical literature, as features for representing medical documents in
NIKOLAOS HOURDAKIS
CHAPTER 1. INTRODUCTION
OHSUMED i.e., each document is represented as a vector of MeSH terms (multi-
word composite terms) rather than by vectors of single word terms (the state of the art
approach). This leads to a more compact representation (each vector contains less
terms) which is directly perceivable by humans. Our evaluation showed that MeSH-
based representation of documents improves noticeably the performance of Bisecting
Incremental K-Means with respect to clustering time and clustering quality.
The performance of our proposed BIC-Means algorithm is evaluated and
compared against the performance of hierarchical clustering methods such as
Bisecting Incremental K-Means. Experimental results indicated that the main
advantage of BIC-Means is that requires significantly less time to build a cluster
hierarchy than Bisecting Incremental K-Means (is executed exhaustively). In terms of
clustering quality, BIC-Means performs approximately the same as our initial
Bisecting approach. Therefore, the proposed BIC-Means is well-suited for obtaining
effective hierarchical clustering solutions of large data sets. This is not only due to its
low computational requirements, but also comparable performance. Notice that, BIC-
Means terminates at meaningful clusters (clusters which are likely to correspond to
Having established the quality of the implemented document clustering
algorithms, we examine how hierarchical clustering could be used to improve the
effectiveness and efficiency of retrievals on large medical document collections. Our
main goal is to noticeably reduce the number of required similarity computations
between the user's query and documents within a collection. We produce a hierarchy
of clusters using the BIC-Means. We propose and evaluate several cluster-based
strategies for searching hierarchical clustered document collection based on the idea
that only leaf clusters need to be searched (intermediate level clusters combine
information from lower level leaf clusters). We retrieve the documents contained in
the N top-ranked clusters. Notice that, we use MeSH terms to build the document,
cluster and query vectors.
The experimental results indicated that among all cluster-based retrieval
strategies proposed in this thesis the best results on OHSUMED are obtained in case
we examine only the leaf clusters which contain all the MeSH terms of the query in
their centroid vectors. Contrary to exhaustive search (233,445 documents are
searched), the proposed cluster-based retrieval strategy search only 30% of
OHSUMED documents. Experiments also showed that the proposed search strategy is
TECHNICAL UNIVERSITY OF CRETE
almost as effective as the retrieval by exhaustive search on OHSUMED.
Summarizing, this cluster-based retrieval not only saves a huge amount of
computation, but does so without loss of retrieval effectiveness.

**1.3. Thesis **
**Structure **
The rest of this thesis is organized as follows.
Chapter 2 provides a review of related work in the fields of document
clustering, evaluation methodology of clustering quality and stopping criteria on
hierarchical clustering.
Chapter 3 describes in detail our proposed clustering algorithms for efficient
clustering of large document collections. We discuss techniques to improve the
quality of the obtained clustering results.
Chapter 4 presents the two sets of experiments that we performed for
evaluating our proposed methodology. The first one focuses on evaluating the quality
of the clustering solutions produced by the several proposed clustering algorithms.
The second set of experiments examined how hierarchical clustering could be used to
improve the effectiveness and efficiency of retrieval by exhaustive search on large
document collections.
Chapter 5 summarizes the achievements of this thesis and points out possible
directions for future research.
Appendix A talks in detail about many parts that took place in the
implementation process and describes some technical issues about the developed
NIKOLAOS HOURDAKIS
CHAPTER 1. INTRODUCTION
TECHNICAL UNIVERSITY OF CRETE

**Chapter 2 **
**Background and Related Work **
In this chapter, we provide an overview of research related to document clustering.
We highlight the key issues in the area of Information Retrieval (IR), presenting the
three common Information Retrieval Models. We describe the main text clustering
techniques and we explain some technical issues. Then, we present the basic
measures of cluster quality. Finally, we focus on stopping criteria in hierarchical
document clustering algorithms, emphasizing on the so-called Bayesian Information
Criterion (BIC). We conclude by describing MeSH control vocabulary which we use
in our experimental evaluation.

**2.1 Information Retrieval **
The internet is expanding at increasing rate, and search for information is becoming
more difficult in this gigantic digital library. This fact calls for improved automatic
methods for searching and organizing documents so requested information can be
accessed quickly and accurately. The term Information Retrieval (IR) defines all
those activities that can be used to retrieve documents of interest from a given
collection of documents.

**2.1.1 Information Retrieval Models **
The three classic mo
Probabilistic. In the Vector Space moents and queries are represented
by vectors in a multi-dimensional space, where each dimension corresponds to a
unique word in corpus. Thus, we say that the model is algebraic. In the Boolean
model, documents and queries are represented as a set of index terms, thus this model
is set theoretic. Finally, in the Probabilistic model the representation of documents
and queries is based on probabilities of occurrence of terms in a corpus. Among the
CHAPTER 2. BACKGROUND & RELATED WORK
three models, vector space is the most commonly used. The various clustering
algorithms that are described and implemented in this thesis are based on the vector

**2.1.2 Vector Space Model **
The Vector Space modeost popular IR model. It has been shown
to perform at least as good as the other two models. In VSM, both documents

*d * and
queries

*q * are considered to be vectors in a multidimensional term space. The VSM
assigns to the terms non-binary weights which are used to compute the degree of
similarity between a document and a query or between two documents.
The weights assigned to each term can be either the term frequency (

*tf *) or
term frequency-inverse document frequency (

*tf *−

*idf *)
frequency of occurrence for a term in a document is included in the vector

*d *= (

*tf *,

*tf *,.,

*tf*
, where

*tf * is the frequency of the

*th*
*i * term in the document.
Usually, very common words are removed and the terms are stemmed. A refinement
to this weighting scheme is the so-called

*tf *−

*idf * weighting scheme. In this approach,
a term that appears in many documents should not be regarded as more important
than the one that appears in few documents, and for this reason it needs to be de-
Let

*N * be the total number of documents in the collection;

*df * (document
frequency) be the number of documents in which the

*k * term appears, and

*freq * be

*i*,

*j*
the raw frequency of the term

*k * in the document

*d *. The inverse document
frequency (

*idf *) for

*k * is defined as:
The

*tf *−

*idf * weight of term

*i * is computed by:
To account for documents of different length, each vector is normalized so that it is of
TECHNICAL UNIVERSITY OF CRETE
There are many variations of this basic formula. The one we use in our implemented
algorithms is described in section 3.2.1.

**Similarity Computation **
The Vector Space Model computes the degree of similarity between a document

*d *
and a query

*q * (or between two documents). The similarity between two documents is
computed as the cosine of the angle between their document vectors in the
multidimensional term

*d *⋅

*d*
∑

*w *×

*w*
*d *×

*d*
∑

*w *∑

*w*
This measure simplifies to cos ( , )

*T*
=

*d d * due to the unit length of
document vectors. All vectors are normalized by document length. The measure takes
values between 1 (the two documents are identical) and 0 (the two documents have no
The main advantages of Vector Space Model (V
♦ The documents are sorted by decreasing similarity with the query

*q *.
♦ The terms are weighted by importance.
♦ It allows for partial matching: the documents need not have exactly the same
terms with the query.
One disadvantage of VSM is that the terms are assumed to be independent. Moreover,
weighting is intuitive and not very formal.

**2.1.3 Boolean Model **
ost simple among the three models and relies on the
use of Boolean operators and set theory. The terms in a query are combined together
with

*AND *,

*OR * and

*NOT * operators. A document is predicted as relevant to a query
expression if it satisfies the query Boolean expression. Each term is either present (1)
or absent (0). The basic advantage of Boolean model is that is very simple (based on
set theory). It is easy to understand and implement.
NIKOLAOS HOURDAKIS
CHAPTER 2. BACKGROUND & RELATED WORK
The Boolean model also has its drawbacks. It only retrieves exact matching (a
retrieved document contains exactly the same terms with the query). The retrieved
documents are all equally ranked with respect to relevance. Furthermore, all terms are
equally important. Boolean operator usage has much more influence than a critical
word. Also query language is expressive but complicated due to complex Boolean
expressions. The Boolean retrieval model has been extended and refined to solve
these problems weighting operations make ranking of
documents possible, where the terms in the document could be weighted according to
their frequency in the document.

**2.1.4 Probabilistic Model **
The Probabilistic Retrieobability that a document
is similar to the query. It assumes that for each document an ideal answer set of
similar documents exists for each query. Given a query

*q *, a subset of documents,

*R *
is relevant to

*q *. The probability that a specific document will be judged relevant to a
specific query is based on the assumption that the terms are distributed differently in
relevant and non-relevant documents. The weights take binary values (a term exists in
a document or not). In general, the Probabilistic model attempts to answer a basic
question: "

*What is the probability that this document is relevant to this query?*".
If retrieved documents are ordered by decreasing probability of relevance on
the data available, then the system's effectiveness is the best that is obtainable on the
basis of those data (Probability Ranki
feedback can improve the ranking by giving better term probability estimates.
In conclusion, Probabilistic Model uses probability theory to model the
uncertainty in the retrieval process. It evaluates probability of relevance based on the
occurrence of terms in queries and in documents.

**2.2 A Variety of Document Clustering Algorithms **
In this section, we review the most common clustering algorithms. Over the past few
years, clustering techniques have been devegoal of clustering is
to group the points in a feature space optimally based on proximity. Document or text
TECHNICAL UNIVERSITY OF CRETE
clustering relates to the automatic grouping of documents into clusters, so that
documents within a cluster have high similarity in comparison to one another, but are
very dissimilar to documents in other clusters.
Text Clustering differs from text classification. Text classification is a
supervised learning process that involves pre-defined category labels; while text
clustering is an unsupervised task (no pre-defined category labels are available).
Document Clustering is widely applicable in areas such as web mining, text
mining and information retrieval. Recently, it has been used in browsing large
collection of docum
to help users find relevant documents (within the query results) fa
This section focuses on the techniques used in document clustering and offers
a brief review of hierarchical and partitional clustering methods, which are used in
this study. These techniques are the most common and differ in the way clusters are
organized. Hierarchical algorithms produce a hierarchy of clusters, while partitional
algorithms generate a flat partition of the data objects.

**2.2.1 Hierarchical Clustering Algorithms **
Hierarchical Text Clustering creates a hierarchical decomposition of the documents
tions with a single cluster at the top and
individual documents at the bottom of the hierarchy. Each cluster at the intermediate
level can be viewed as combining two or more clusters from the next lower level, or
splitting a cluster from the next higher level. A hierarchical clustering defines a tree
called a dendrogram is a tree structure that displays the clusters
that are merged during clustering. Figure 2.1 shows how five documents can be
merged into a single cluster. The parent-child relationship among the nodes in the
dendrogram provides taxonomy and facilitates browsing.
NIKOLAOS HOURDAKIS
CHAPTER 2. BACKGROUND & RELATED WORK

**Dendrogram **
**Figure 2.1: **Sample Dendrogram

There are two basic approaches to generate a hierarchical clustering.
Hierarchical clustering algorithms are either agglomerative (bottom-up) or divisive

**Hierarchical Agglomerative Clustering (HAC) **
Hierarchical Agglomerative Clustering proceeds bottom-up. It starts with the
documents as individual clusters and, at each step, computes the similarity between
all pairs of clusters and merges the most similar pair. The algorithm continues until a
single cluster is formed at the top of the hierarchy. A definition of cluster similarity or
distance is required. In the following page we present the most commonly used
techniques for calculating the similarity between two clusters.
The following summarizes the basic hierarchical agglomerative clustering
1. Treat each document as a cluster on its own.
2. Compute the similarity between all pairs of clusters, calculate the
similarity matrix whose

*th*
*ij * entry gives the similarity between the

*th*
*j * clusters.
3. Merge the most similar two clusters.
4. Update the similarity matrix entries for the newly formed cluster and the
5. Repeat steps 3 and 4 until only one cluster remains.
TECHNICAL UNIVERSITY OF CRETE
The time complexity of hierarchical agglomerative clustering algorithm is

*O*(

*n *)
where

*n * is the number of documents. All hierarchical methods need to compute
similarity for all pairs of n individual instances.

**Divisive Methods **
The Divisive approach follows the opposite strategy. It starts with all documents in
the same root cluster. It works by iteratively splitting each cluster into a number of
smaller ones until clusters with one document (singleton clusters) are produced at the
leafs of the tree hierarchy or until the desired number of clusters is reached.

**Methods to Compute Similarity between Clusters **
In Hierarchical Agglomerative Clustering a number of different methods have been
proposed for determining the next pair of clusters to be merged, i.e. how we define
cluster similarity. There are four comm
compilarity method.
•

**Single-Link Method **
In the single-link method, the similarity of a pair of clusters (

*C *,

*C *) is the maximum
similarity between any two individuals, one in each cluster:

*sim*(

*C *,

*C *) = max

*sim*(

*x*,

*y*) (2.4)
where

*x * and

*y * are documents in cluster

*C * and

*C * correspondingly.
However, this method is highly susceptible to noise, outliers, artifacts and
s a tendency to form loosely bound clusters
remains popular due to its simplicity and the availability
of an optimal space and time comp
•

**Complete Link Method**
In complete-link algorithm, the similarity between two clusters (

*C *,

*C *) is the
minimum of all pairwise similarities between documents in the two clusters:

*sim*(

*C *,

*C *) = min

*sim*(

*x*,

*y*) (2.5)
NIKOLAOS HOURDAKIS
CHAPTER 2. BACKGROUND & RELATED WORK
where

*x * is a document in cluster

*C * and

*y * in cluster

*C *. This method produces
"tighter" clusters that are typically preferred.
•

**Group Average Method (UPGMA) **

Computes the average similarity across all pairs of documents within the two clusters

(

*C *,

*C *) that will be merged (including the pairs of documents within each one of two
∑

*sim*(

*x*,

*y*)

*sim*(

*C *,

*C *)

*C *∗

*C*
where

*x * is a document in cluster

*C * and

*y * in cluster

*C *.
However, due to the complexity of computing the similarity between every
pair of clusters, UPGMA does not scale up well for large data sets.
•

**Centroid Similarity Technique **
The similarity between two clusters (

*C *,

*C *) is defined as the cosine between their
centroid vectors. The centroid vector of a cluster is defined as the mean vector of data
objects. The similarity between two centroids is:

*sim*(

*C *,

*C *) = cos(

*c *,

*c *) =

*c *•

*c */

*c *∗

*c * (2.7)
where

*c *,

*c * are the centroid vectors of the two clusters. Note that the centroid
vectors will not necessarily be of unit length.
Among the four methods discussed above, group average is the preferred one
performing for documenorate schemes have also
been developed. See for examk

**2.2.2 Partitional Clustering Algorithms **
Contrary to hierarchical clustering techniques, a partitional clustering algorithm
creates a flat (non-hierarchical) clustering of data objects. There are many partitional
clustering techniques available. The K-Means algorithm is widely used in document
clustering because it is easy to implement and has low time complexity.
TECHNICAL UNIVERSITY OF CRETE

*K * stands for the desired number of output clusters. For K-Means we use the notion
of the centroid, which is the mean or the median point of a group of data objects.
Given a set

*S * of documents and their corresponding vector representations, the
centroid vector

*C * is the vector obtained by averaging the weights of the various
terms presented in the documents of

*S *. Note that a centroid almost never
corresponds to an actual data object. The similarity between a document

*d * and a
centroid

*c * is computed according to Equation 2.4. Note that even though the
document vectors are of unit length the centroid vector is not necessarily of length
The basic K

*-*Means algorithm works as fo
1. Randomly select

*K * points as the initial cluster centroids (seeds).
2. For each point, put the point in the cluster whose centroid is the closest (most
similar). The most common measure to calculate the similarity between a
document and a centroid is the vector cosine measure which we use in this
3. Re-compute the centroid of each cluster using the current cluster members.
4. Repeat steps 2 and 3 until an objective criterion is met.
At step 4 of the algorithm, there are two most commonly used objective functions.
♦ The procedure terminates when there is no re-assignment of instances to new
♦ The second popular objective function is the mean squared distance function,
which tend to work well with isolated and compact clusters. The square error
criterion function for a clustering of

*N * documents (containing

*K * clusters),

*e *(

*K *,

*N *) = ∑∑

*x *−

*c * (2.8)
where (

*j*)
is the

*th*
*i * document belonging to the

*th*
*j * cluster and

*j*
*c * is the
centroid of the

*th*
*j * cluster.
At step 3 of basic K-Means algorithm, there are two ways to update the centroid:
NIKOLAOS HOURDAKIS
CHAPTER 2. BACKGROUND & RELATED WORK
♦ Continuously: The centroid is updated after each data object is assigned. ♦ Non-continuously: The centroid is adjusted only at the end of iteration, when
all the data objects have been assigned.
Experimeat the continuous centroid adjustment is

*K*-Means and other partitional clustering techniques are well-suited for
clustering large data sets due to their low computational requirements. Their time
complexity is

*O*(

*n*) where n is the number of data objects (documents). They are
more efficient as compared to the HAC algorithm which has quadratic computation
However, a drawback of Κ-Means is that

*K * must be known in advance. An
incorrect estimation of the input parameter may lead to poor accuracy. To avoid this,
we try out several

*K * and the best configuration is obtained (the one that optimizes
the objective function). Apart from that, X- implemented
to avoid this inaccuracy.
Κ-Means is also sensitive to the selection of the initial centroids. Several
methods have been reported in the literature, which attend to select a good initial
partition. The most efficient are:
♦ Run

*K*-Means several times with different initial centroids and pick the best
♦ Use heuristics to pick initial centroids.

**Hybrid Approaches to Pick Good Initial Centroids **
Buckshot and Fractionation are two methods designed to find the initial centroids in
order to avoid random selection. Both techniques are based on other clustering
algorithms. They cluster well, but their run time is slower than plain K-Means.
•

**Buckshot Algorithm **

The Buckshot algorithm avoids problems of bad seed selection. It combines

Hierarchical Agglomerative Clustering and

*K*
Buckshot randomly picks

*Kn * documents from an input set of

*n * documents. Then,
it runs group-average HAC on this sample. This algorithm has

*O*( (

*Kn*) ) =

*O*(

*Kn*)
TECHNICAL UNIVERSITY OF CRETE
time complexity. The

*K * centroids resulting from HAC become the initial centers for
•

**Fractionation Algorithm **

hot, to build a bottom-up hierarchy. If we

want to get

*K * clusters, fractionation algorithm splits

*N * documents into

*M *>

*K *
groups of size

*N */

*M * each (

*M * is a parameter). Then uses HAC for each of the

*M *
groups to generate

*pN */

*M * clusters (

*p * is a parameter). The iteration terminates when
only

*K *groups remain. The algorithm takes

*O*(

*Kn*) time.
Summarizing, Buckshot applies HAC to sample the documents randomly in
order to find initial centroids. Fractionation uses successive applications of HAC over
particular groups of documents to find centroids. Fractionation is more accurate than
hot is significantly faster, so it is more appropriate for
many applications.

**Bisecting K-Means **
Partitional algorithms can also be used to obtain hierarchical clustering solutions via a
sequence of repeated applications of K-Means algorithm. The Bisecting K-Means
algorithm uses this approach to build a hierarchy of clusters. It is very effective in
many applications (browsing, indexing, navigation, and information retrieval
systems). Bisecting K-Means algorithm starts with a single cluster of all the
1. Choose a cluster to split (starting with the initial cluster).
2. Apply the basic

*K*-Means algorithm to split this cluster into 2 sub-clusters
(Bisecting step).
3. Repeat step 2 for ITER times and take the split that produces the clustering
with the highest overall similarity (the average pairwise similarity between all
documents in the cluster). We want to maximize that sum over all clusters.
4. Repeat steps 1, 2 and 3 until a pre-defined stopping criterion is met.
At step 1, there is a number of different ways to select which cluster to split from the
list of leaf clusters. We can select either the cluster with the least cohesion (the least
overall similarity), or the one with the largest size. Alternatively, a criterion based on
NIKOLAOS HOURDAKIS
CHAPTER 2. BACKGROUND & RELATED WORK
both overall similarity and cluster size can be used. Experime
small differences between these possible methods.
To summarize, the Bisecting

* K*-Means algorithm is a divisive hierarchical
clustering technique. Its time complexity is about linear to the number of documents,

*O*(

*n **

*M *) , where

*M * is the number of the produced clusters. In chapter 3, we present
in detail our implementation of Bisecting K-Means algorithm.

**2.2.3 Representation of clusters **
The meaning of cluster representation was introduced in ma
In many applications the resulting clusters have to be represented or described in a
compact form to achieve data abstraction. Jaarized three representation
schemes and indicated that among them, the use of the centroid to represent a cluster
is the most popular way. In many cases, the cluster can be effectively represented by
a number of the highest weighted term

**2.2.4 Comparison of Document Clustering Techniques **
Experime group-average hierarchical clustering
algorithm (UPGMA) is the best performing hierarchical technique. However, it has
limited applicability because of its quadratic time complexity. K-Means and its
variants are commonly preferred due to their time complexity which is linear to the
number of documents. Moreover, partitional algorithms can also be used to obtain
hierarchical clustering solutions via a sequence of repeated bisections (Bisecting K-
Means). Notice that, Bisecting K-Means has a linear time complexity.
♦ Bisecting

*K*-Means is better than regular

*K*-Means and UPGMA. ♦ Although, results of basic K-Means can vary from one run to another, K-
Means is generally better than UPGMA (i.e., achieves better clustering

*K*-Means has linear time complexity as opposed to the quadratic
time complexity of HAC. Furthermore, Bisecting

*K*-Means is not susceptible to
initialization issues. Finally, it is ideal for clustering large document collections not
only due to its linear time complexity, but also due to its higher clustering quality.
TECHNICAL UNIVERSITY OF CRETE

**2.3 Clustering Quality **
The quality of various clustering algorithms can be evaluated with regards to both
internal andrnal measures compare different sets of
clusters without reference to external knowledge. The cohesiveness of a cluster,
which is called "overall similarity" and is based on the pairwise similarity of the
documents in a cluster, is an internal measure. Contrary to internal measures, external
measures evaluate the clustering quality by comparing the clusters produced from
clustering algorithms against already defined classes. The most common external
measures are "entropy" and "

*F*-Measure". Internal and external metrics are
subsequently discussed.

**2.3.1 Overall Similarity **
In the absence of class labels, as external information, overall similarity is an internal
cluster quality msolution, objects within a cluster are most
similar to each other than objects that come from different clusters. Particularly, the
cluster cohesiveness is defined as the average pairwise similarity between objects in a
cluster

*S *:
cos(

*d *,′

*d *) (2.9)
The above Equation is just the squared length of the cluster centroid vector,
equivalence is shown in section 3.3.

**2.3.2 Entropy **
Entropy is an external measure of cluster easure of
quality for un-nested clusters or for the clusters at a certain hierarchy of clusters.
Initially, we calculate the entropy of each cluster, i.e., for cluster

*j * we compute

*p *
the probability that a member of cluster

*j *belongs to class

*i *. Then, the entropy of
each cluster

*j *is defined to be:

*E *= -∑

*p *∗log(

*p *) (2.10)
NIKOLAOS HOURDAKIS
CHAPTER 2. BACKGROUND & RELATED WORK
where the sum is taken over all classes. The entropy of the entire clustering solution is
defined to be the sum of the individual cluster entropies weighted by the size of each

*n *∗

*E*
*Entropy *= ∑
where

*n * is the size of cluster

*j *,

*m * is the number of clusters and

*n * is the total
number of data objects. A clustering solution is perfect when the entropy is zero.

**2.3.3 F-Measure **
*F*ore suitable for measuring the effectiveness of not only
partitional but also of hierarchical clustering. We use this metric in this work to
evaluate our clustering implementations.

*F*-Measure combines the precision and
recall ideas from information retrieval area.
For each manually labelled category (topic)

*T *, we assume that a cluster

*C *
corresponding to the topic

*T * will be formed somewhere in the hierarchy. To find the
cluster

*C * corresponding to category

*T *, traverse the hierarchy computing precision,
recall and

*F*-Measure. For any category

*T * and cluster

*C *, we define:
P(

*C*,

*T *) =

*N */

*C *
R(

*C*,

*T *) =

*N */

*T *
*F *−

*Measure *= 2 ∗

*P *∗

*R */(

*P *+

*R*) (2.14)
where

*N * is the number of members of category

*T * in cluster

*C ,* *C *is the number of
documents in cluster

* C *,

*T * is the number of documents in category

*T *.
For hierarchical clustering, we consider the cluster with the highest F-Measure
to be the cluster corresponding to the category

*T *. The overall F-Measure for the
hierarchy is computed by taking the weighted average of the

*F*-Measure for each
topic

*T * and is defined as:
∑

*T *∗

*F*(

*T*)

*Overall *_

*F *−

*Measure *=
where

*S * is the set of categories,

*T * is the number of documents in topic

*T * and

*F *(

*T *) is the

*F*-Measure for topic

*T *.
TECHNICAL UNIVERSITY OF CRETE

*F*-Measure score ranges from 0 to 1. A higher

* F*-Measure score indicates a
better clustering solution.

**2.4 Stopping Criteria in Bisecting K-Means Algorithm **
As mentioned Bisecting K-Means is a divisive hierarchical clustering technique. Step
4 of the basic algorithm indicates that the procedure terminates when a stopping
criterion is met. Thus, a strategy to stop bisections is needed. In recently published
studies there are various criteria which have been proposed as stopping rules. In this
section, we make a brief review of all known methods.
¾ The most commonly used method suggests stopping the algorithm when no
more clusters can be split. In case of document clustering this implies that the
algorithm continues until each leaf cluster of the hierarchy contains a single
document. The reasons of this are that: 1) no prior knowledge of the desired number
of clusters is available in a specific application; 2) the purpose of clustering is to find
the complete hierarchy.
¾ Bisecting K-Means algorithm continues partitioning until the desired number K
ple and can be applied if there
is prior knowledge of the desired number of clusters.
¾ alternative stopping criterion for
terminating the divisive procedure. Specifically, they stop splitting a cluster if it
contains less that 5% of the total number of documents. The algorithm terminates
when all the resulted clusters meet the stopping condition.
¾ the Min-Max Cut
algorithmilarity concepts and was based on a min-
max clustering principle: "Data should be grouped into clusters such that similarity
between different clusters is minimized while the similarity within each cluster is
We briefly describe the Min-Max algorithm. If n is the number of data objects
and

*W *= (

*w *) is the pairwise similarity matrix, where

*w * is the similarity between

*i *,

*j *, we desire to partition the data into two clusters 1

*A * using the min-max
clustering principle. The similarity between
is defined to be
NIKOLAOS HOURDAKIS
CHAPTER 2. BACKGROUND & RELATED WORK

*s*(

*A *,

*A *) = ∑∑
. The similarity within a cluster

*A * is the sum of pairwise
similarities within

*s*(

*A *,

*A *) , while

*A *. The clustering principle requires minimizing
maximizing

*s*(

*A *,

*A *) and

*s*(

*A *,

*A *) simultaneously. These requirements lead to the
minimization of the Min-Max Cut objective function,

*s*(

*A *,

*A *)

*s*(

*A *,

*A *)

*s*(

*A *,

*A *)

*s*(

*A *,

*A *)
Generally, for

*K *≥ 3 , we define

*J*
(

*A *,.,

*A *) ≡

*J*
*s*(

*A *,

*A *)
(

*K *) = ∑

*J*
(

*A *,

*A *)

*s*(

*A *,

*A *)
where

*A *= ∑

*A * is the complement of

*A *.
After that brief explanation of Min-Max Cut algorithm, we return to its use as
a stopping criterion for terminating Bisecting K-Means technique. The authors
proposed that the algorithm terminates when

*J*
(computed on recent leaf clusters)
exceeds a pre-defined threshold value

*J*
. They showed that as the algorithm
continues (the number of leaf clusters increases)

*J*
increases. This strategy to stop
bisections can be used in applications where the correct K, as the desired number of
clusters, is not known.

**2.5 Bayesian Information Criterion (BIC) **
In section 3.5 we will propose a strategy for terminating our implementation of
Bisecting algorithm. The proposed stopping criterion is based on the Bayesian
Information Criterion (BIC) or Schwarz Cr
Building upon the BIC criterion, Pelleg
new K-Means variant algorithm. X-Means first adapted BIC for estimating the best

*K * clusters from a given range of values automatically. The algorithm searches over
the values of

*K * and scores each clustering result using the BIC criterion. An
equivalent technique called Minimum Description Length (MDL) is ap
The problem of model selection is how to choose the best one among a set of
candidate models (we assume as model in this case each clustering result in X-
TECHNICAL UNIVERSITY OF CRETE
Let

*D * be a set of documents {

*x *,

*x *,.,

*x *}.

*D * can be partitioned into disjoint
subsets

*D *,

*D *,.,

*D *. In the case of Bisecting K-Means

*K *= 2
centroid of the

*th*
*j * cluster (1 ≤

*j *≤

*K *). Let (

*i*) be the index of the centroid which is
closest to the

*i *−

*th * data point. For example, µ is the centroid nearest to the −
data point during an iteration (1 ≤

*i *≤

*n *). Let

*D *⊆

*D * be the set of data points that
have µ as their closest centroid. Let

*R *=

*D * and

*R *=

*D *. The number of
dimensions is M. Notice that, in our case the initial data set is partitioned into two
The BIC of the model

*M * (i.e., in our case the parent cluster or the two

*BIC*(

*M *) =

*l*
where ˆ

*l D * is the log-likelihood of the data according to the model

*M *, while

*p *=

*K *(

*M *+1) is the number of independent parameters in

*M *.
The BIC, according to Equation 2.18 contains two components. The first term
(log-likelihood of the data points) can be used as a measure of the cohesiveness of a
cluster in order to denote whether a cluster should split or not. We estimate how close
to the centroid are the documents of the cluster. More specific, given a cluster of
points, drawn from a Gaussian distribution

*N *(µ,σ ) , log-likelihood is the
probability that a neighborhood of points follows this distribution. The second term
penalizes the complexity of the moe assume that some data points belong to
the cluster. However, due to the complexity of the model (many parameters or many
data points), the data points, in addition to Gaussian, may follow other distributions.
For this reason, we give a penalty by the second term of Equation 2.18.
The maximum likelihood estimate (MLE) for the variance is given by:
∑

*x *− µ (2.19)

*R *−

*K*
where

*R * denotes the number of documents in cluster

*D *.Given a cluster of data

*P x *) is the probability that a point

*x * follows the distribution

*N *(µ,σ )
produced by the cluster.
NIKOLAOS HOURDAKIS
CHAPTER 2. BACKGROUND & RELATED WORK

*P *(

*x *)
Thus, the log-likelihood of the data in cluster

*C * can be calculated as the logarithm of
the product of probabilities:
ˆ

*l*(

*C *=

*R *−

*K*
+

*R *log

*R *−

*R *log

*R*
To extend the formula in Equation 2.18 for all centroids instead of one, we use the
fact that the log-likelihood of the data points that belong to all centroids is the sum of
the log-likelihood of the individual centroids. Thus, Equation 2.18 can be re-written

*BIC *(

*M *= ∑

*l C *−

*R * (2.22)
The number of free parameters

*p * is the sum of:
♦

*K *−1 class probability ♦

*M **

*K * centroid coordinates ♦ One variance estimate
The variance (Equation 2.19) estimates the average of the square of the
distance of each document from the centroid (mean) of the cluster. This is a measure
of the cohesiveness of the cluster. By computing BIC we estimate how close to the
centroid are the documents of the cluster.
Given a set of clustering results, the one with the highest BIC score,
arg max

*BIC M *, is selected. X-Means uses the BIC in order to determine the
number of clusters K in K-Means method.
TECHNICAL UNIVERSITY OF CRETE

**Chapter 3 **
**Clustering Algorithms Implemented **
In this chapter, we present our methodology for efficient document clustering. We
describe in detail our implemented clustering algorithms and suggest methods that
improve their performance. We deal with K-Means which is the basic partitional
clustering technique. We focus on a variant of the standard K-Means algorithm, the
so-called Incremental K-Means which is combined with a Bisecting K-Mean

*s*
algorithm for obtaining hierarchical clustering solutions. Finally, we suggest
incorporating the Bayesian Information Criterion (BIC) as a splitting criterion within
the above Bisecting Incremental hierarchical clustering approach. All these suggested
techniques are integrated in a new proposed algorithm

**, **called here-after "BIC-Means"

and is described in detail in this chapter.

**3.1 Proposed Methods **
In this study, our main objective is to develop a highly efficient algorithm for
clustering very large document collections (such as OHSUMED), We focus on
partitional clustering techniques due to their low time complexity (which is linear on
the number of documents). Partitional methods have advantages in applications with
large data sets for which the construction of a dendrogram using hierarchical
clustering with agglomerative method is computationally prohibitive. We
implemented and evaluated various partitional clustering methods. Our method can
organize large collections of documents into a hierarchical binary tree. To prevent
over-splitting of clusters (and termination at singleton clusters) we propose a strategy
based on Bayesian Information Criterion (BIC) to stop the divisive procedure. The
cluster splitting stops when meaningful clusters are reached. The combination of
Bisecting Incremental K-Means with Bayesian Information Criterion is the main
CHAPTER 3. CLUSTERING ALGORITHMS IMPLEMENTED
contribution of this work. All methods are implemented and evaluated using standard
document collections (such as Reuters and OHSUMED)
We implemented several variants of the original K-Means method and we
proposed the so-called "Incremental K-Means" a variant of the original K-Means
method that differs from K-Means in the way that the centroids are updated during
each clustering iteration. In Incremental K-Means each cluster centroid is adjusted
after each document is assigned to a cluster rather than recomputing each cluster
centroid after each iteration when all documents have been assigned to clusters. The
experimental results indicated that the Incremental K-Means algorithm performs
better than K-Means and needs less iterations to produce a good clustering result.
Despite their linear time complexity, the main disadvantage of K-Means and
Incremental K-Means is that K must be known in advance. An incorrect estimation of
the number of clusters K may lead to poor clustering accuracy. Both K-Means and its
variants produce a flat partition of the data collection. As mentioned in the
introductory chapter, methods which are able to provide effective navigation and
organization of information (like hierarchical clustering) are preferred. Thus, we are
led to organize information in a hierarchical structure.
In the following we present our version of Bisecting K-Means algorithm,
which combines the strengths of partitional and hierarchical clustering methods.
Furthermore, it is not as sensitive to initialization issues. A hierarchy is built by
recursively applying our version of Incremental K-Means algorithm. For this reason,
we call our implemented algorithm "Bisecting Incremental K-Means". The so-
obtained clusters are structured as a hierarchical binary tree (or a binary taxonomy).
This is the reason why the bisecting approach is very suitable and effective in many
applications (e.g. document retrieval, indexing, browsing, navigation systems). The
algorithm proceeds until each leaf node of the cluster hierarchy contains one
The experimental results indicated that Bisecting K-Means outperforms basic
K-Means and our variant Incremental K-Means in terms of accuracy and efficiency.
This confirms the resu
An important issue in divisive clustering approaches is to determine a strategy
to terminate the divisive procedure of the Bisecting algorithm. In most cases there is
no prior knowledge about the desired number of clusters. For this, a stopping
condition must be defined.
TECHNICAL UNIVERSITY OF CRETE
We propose the use of Bayesian Information Criterion (BIC) or Schwarz
gy to stop the divisive procedure. We suggest using BIC
as the splitting criterion of a cluster. We compute the BIC score to measure the
improvement of a cluster when it is split. If the BIC score of the new cluster structure
is less than the BIC score of the parent cluster we do not split the initial cluster. In
such cases we keep the parent cluster as is and we do not select it as a candidate
cluster to split in the next iteration of the algorithm. Consequently, we terminate the
bisecting algorithm when there is no separable cluster according to the BIC function.
Overall, we propose the so-called BIC-Means clustering algorithm. BIC-
Means produces a hierarchical clustering solution and combines all these ideas:
1. Bisecting algorithm to build a hierarchy of clusters effectively.
2. Incremental K-Means as the proposed partitional algorithm to bisect the
selected leaf cluster at each bisecting step.
3. A stopping criterion for terminating the divisive procedure using the
Bayesian Information Criterion (BIC).
Our proposed method is described in detail below.

**3.2 Preliminaries on Document Modeling **
Representing documents for clustering and other text mining tasks is fundamental in
the process of knowledge discovery. We define the similarity measure which is used
to compute the similarity between two documents.

**3.2.1 Document Representation **
The various clustering algorithms are described and implemented based upon the
Vector Space Model (Veasuring document similarity. In this model,
each document d is considered to be a vector in a multi-dimensional term space. Each
dimension of the space corresponds to a unique word from the corpus.
Let D a collection of documents and

*T *= {

*t *,

*t *,.,

*t *} the set of unique terms
appearing in at least one document in D. Firstly, individual words are further
processed by stop-word removal. Using this preprocessing technique we remove
NIKOLAOS HOURDAKIS
CHAPTER 3. CLUSTERING ALGORITHMS IMPLEMENTED
words without inherent meaning, such as articles or pronouns ("a", "the", "and", etc).

*d *∈

*D * is represented as a vector

*d *= {

*w *,

*w *,.,

*w *} , where

*w *is the weight of the term

*t * within document d. In this work, a variant of

*tf *−

*idf *
weighting scheme is used. The weight

*w * for each term t in document

*d * is defined as:
))× log

*N*
∑ (1+log

*tf *) ×(log( ))
where

*tf * is the number of times word t occurs in document

*d * and

*df * is the number
of documents in the data set in which the word t occurs. To account for documents of
different lengths, we scaled the length of each document vector so that it is of unit
Accordingly, a cluster, which is a set of documents, is represented in the
similar way such as a document. A cluster is represented by its centroid vector (i.e.,
the mean vector of all its contained documents).

**3.2.2 Similarity computation **
Document clustering is based on the definition of document similarity. We measure
the similarity between two documents

*d * and

*d * (or between a document and a

*d *•

*d*
cos(

*d *,

*d *)

*d *∗

*d*
If the document vectors are of unit length, the above formula can be simplified
to cos(

*d *,

*d *) =

*d *•

*d * (by normalizing by document length).

**3.3 Methods Implemented **
In Section 2.2.2, we determined that a cluster is represented by the centroid
vector which is the mean or the median point of a cluster. Given a set S of documents
and their corresponding vector representation, we define the centroid vector

*CS * as:
TECHNICAL UNIVERSITY OF CRETE
∑

*d * (3.3)
The centroid vector is the vector obtained by averaging the weights of the various
terms in the document set. The centroid vector has the following properties:
♦ Computing the cosine measure between a document and a centroid vector
is equivalent to computing the similarity between the document and every
other document within the same cluster:
cos(

*d *,

*c*) = d • c =
∑d •d = ∑cos(d ,d) (3.4)
♦ The square of the length of the centroid vector is the average pairwise
similarity between all documents in a cluster:
∑cos(d ,d) = ∑d * ∑d = c•c = c (3.5)
Notice that Equation 3.5 includes the pairwise similarities involving the
same pairs of vectors. In section 2.3 (where methods for evaluating the
clustering quality were described) we used the average pairwise similarity
within a cluster ("overall similarity") as a measure for cluster compactness
or cluster quality.

**3.3.1 K-Means Implementation **
Experimental results have shown that partitional clustering methods always lead to
better clustering solutions than agglomerative algorithms. Moreover, those are well
suited for clustering large docum
K-Means creates a flat, non-hierarchical clustering solution that is consisted of
K clusters. It takes as input a data set and a parameter K which is the number of
clusters desired. Then K-Means typically finds all K-Clusters.
We will use the symbol S to denote the set of n documents that we want to
cluster. Let

*S *,

*S *,.,

*n *,

*n *,.,

*S * be the K desired clusters and

*n * be the sizes of the
corresponding clusters.
Initially, the algorithm picks K documents (at random) as initial centroids.
Then the algorithm assigns each document to each one of these random centroids. The
clusters (and their centroids) are adjusted iteratively by the algorithm until
NIKOLAOS HOURDAKIS
CHAPTER 3. CLUSTERING ALGORITHMS IMPLEMENTED
convergence (i.e., the centroids do not change significantly). Clustering results can
vary based on the selection of initial centroids. For this, there are more sophisticated
methods for selecting starting centroids. These methods use a heuristic or the results
of another method. Buckshot and Fractionation are the most popular seed selection
approaches. We explained these methods in section 2.2.2. However, experimental
indicated that random seed selection is
significantly faster than the other two methods. In the following, we chose the random
seed selection in the implementation of K-Means.
Once K seeds are selected as centroids, we compute the similarity between
each document and all cluster centroids. The similarity is computed according to
Equation 3.2. Each document is assigned to the closest cluster centroid. Notice that
even though the document vectors are of length one, the centroids vectors will not
necessarily be of unit length. The similarity between a document

*d * and a centroid

*c *
cos(

*d*,

*c*) =

*d *•

*c*/

*d * ∗

*c * =

*d *•

*c*/

*c * (3.6)
The next step is the "centroid re-computation". All docs assigned to the same
centroid are averaged to compute a new centroid using Equation 3.3. This results to K
We repeat the above procedure for ITER times (ITER is user defined) and take
the k-way clustering result that produces the clustering with the highest overall
similarity. The overall similarity of a resulting clustering is defined as the sum of the
average pairwise similarities between all documents assigned to each cluster and is

*Clustering *−

*Overall *−

*Similarity *= ∑
∑ cos(

*d *,

*d *) (3.7)

*r d *,

*d*
The above formula can be re-written as:

*Clustering *−

*Overall *−

*Similarity *= ∑

*c * (3.8)
Therefore, the clustering overall similarity is simplified as the sum of the square of the
length of each centroid vector. After the K-Means algorithm has been executed ITER
times we take the clustering result which has the maximum overall similarity. This is
the final k-way partition. The experimental results indicated that satisfactory results
are obtained when the parameter ITER is set to 5 or 6.
TECHNICAL UNIVERSITY OF CRETE
In most applications, K-Means algorithm continues until the centroids do not
change significantly between iterations. However, due to the fact that the centroids
rarely stop moving entirely and extra time is required to check for minimal
movement; more advantages are obtained from determining when to stop. For this
reason we chose to use the parameter ITER in our implemented version of K-Means.
Figure 3.1 summarizes the K-Means clustering algorithm.

** Input: **K: Number of Clusters, ITER: Number of iterations,

S: (

*d *,

*d *,.,

*d *) document collection

**Output:** K clusters

*S *,

*S *,.,

**Step 1. **Initialize clustering. Randomly select K documents as the initial

centroids of K clusters.

**Step 2. **Assign each document

*d * to the cluster

*S * with the most similar

centroid. The similarity is computed according to Equation 3.4 for
all clusters

*S *,

*S *,.,

**Step 3. **Re-calculate the cluster centroids from assigned documents.

**Step 4. **Repeat steps 2 and 3 for ITER times and take the split that produces

the clustering with the highest overall similarity.

**Figure 3.1: **Our implementation of K-Means algorithm

Figure 3.2 demonstrates an example of K-Means clustering algorithm. We
show the initialization phase and how the iterations proceed.
Arbitrarily select K
documents as initial
cluster centroids

**Figure 3.2: **Example of K-Means Clustering algorithm

NIKOLAOS HOURDAKIS
CHAPTER 3. CLUSTERING ALGORITHMS IMPLEMENTED
K-Means has linear time complexity on the number of documents (much more
effective when compared to the quadratic computation time of hierarchical clustering

**3.3.2 Incremental K-Means **
In this section, we present a partitional clustering method, which we call "Incremental
K-Means". Incremental K-Means is based on our implementation of K-Means
clustering algorithm. The main point of this method is that the centroid is updated
incrementally, as each document is assigned to a cluster. This has been shown to
improve the effectiveness of basic K-Means algorithm in both execution time and
clustering quality.
In K-Means during an iteration the centroid remains fixed. New centroids are
computed after each iteration (after all documents have been examined). Incremental
K-Means updates centroids after a document is assigned to a cluster. This way the
cluster adjusts to information collected during an iteration and the centroid better
reflects properties of the documents collected so far within a cluster. The following
formula is used to update the centroid of a cluster with centroid

*C * after a new
document

*d*
is assigned to the same cluster.
(

*S *1))

*d assigned*
is the new centroid,

*C * is the centroid before the assignment of the
new document,

*d*
is the document which is added to the cluster and

*S * is the
new size of the cluster. The time requirement to update the centroid is constant.
After all iterations of the algorithm, new updated centroids have been
computed. Then, all documents are removed from the clusters and we iterate over all
documents in sequence assigning each document to the closest centroid.
Figure 3.3 summarizes Incremental K-Means:
TECHNICAL UNIVERSITY OF CRETE

**Input: **K: Number of Clusters, ITER: Number of iterations,

S: (

*d *,

*d *,.,

*d *) document collection

** Output: **K clusters

*S *,

*S *,.,

**Step 1. **Initialize clustering. We randomly select K documents as the initial

centroids of the K clusters.

**Step 2. **The documents are visited in a random order. When a document is

assigned to a cluster, we update the corresponding centroid.

**Step 3. **We "clean" the clusters and iterate over the documents assigning

each document to the closest centroid.

**Step 4. **Repeat steps 2 and 3 for ITER times and take the split with the

highest overall similarity.

**Figure 3.3: **Incremental K-Means algorithm

Notice that step 2 examines documents in a random order. Otherwise, in a
given data set the clustering will always generate the same cluster solution.
An important advantage of Incremental K-Means over K-Means is that it
requires less iterations to produce an acceptable clustering result. As we shall see in
the experiments, one or two iterations are sufficient. Furthermore, Incremental K-
Means is not as susceptible to the seed selection technique. Experiments by Larsen
cremental K-Means creates equally good
clustering results with random, buckshot or fractionation seed selection algorithm.
Similarly to K-Means, the time complexity of Incremental K-Means is

*O*(

*n*) ,
where n is the number of documents.

**3.3.3 Bisecting Incremental K-Means **
K-Means and Incremental K-Means create a flat, non-hierarchical clustering of a data
set. Due to the tremendous growth in classic document collections and the internet
there is an increased need for effective clustering allowing also for faster browsing
through the contents of a data set. For this hierarchical clustering is more appropriate
than partitional clustering. Hierarchical clustering also provides effective navigation,
data summarization and organization of information by organizing large data
collections into any given number of clusters which are structured as a hierarchical
NIKOLAOS HOURDAKIS
CHAPTER 3. CLUSTERING ALGORITHMS IMPLEMENTED
binary tree. Agglomerative clustering is often thought as the best quality clustering
approach for this purpose. However, it is not as effective due to its quadratic time
complexity. Experimental results inindicated that Bisecting K-Means always
lead to better hierarchical solutions than agglomerative algorithms.
In this section, we present our implementation of the Bisecting K-Means
clustering algorithm. It is derived from the standard approach. In our approach, we
suggest some modifications.
We produce a hierarchical clustering solution via a sequence of repeated
bisections. We chose to use our version of Incremental K-Means, as described in
section 3.3.2, to bisect a cluster at each bisecting step. For this reason, we call our
method "Bisecting Incremental K-Means". The choice of this algorithm instead of
basic K-Means is based on our experimental results which are presented in chapter 4.
These show that Incremental K-Means is better than the standard K-Means clustering
The algorithm starts with a single cluster with all documents. Initially, we use
our Incremental K-Means algorithm to bisect the entire collection into two clusters.
Then, one of two clusters is selected and is further bisected, leading to a total of three
clusters. This process of selecting and bisecting a leaf cluster continues

*n *−1 times,
until

*n * leaf clusters are obtained. In this case, each leaf cluster will contain a single
document. Note that

*n * is the number of documents of the entire collection.
There are a number of different ways to choose which cluster to split from the
list of leaf clusters. In our approach, we choose to split the cluster with the least
overall similarity. Overall similarity is often called "intra cluster similarity" and is
∑cos(d ,d) (3.10)

*S * d'

*S*
where S is the set of documents in the cluster. However, as is derived from Equation
3.5, we can calculate the overall similarity of a cluster by just computing the squared
length of the cluster centroid,

*c *. This simplification decreases the time
requirements to compute overall similarity before each bisecting step. Therefore, we
can quickly choose the cluster to split.
Figure 3.4 summarizes Bisecting Incremental K-Means algorithm:
TECHNICAL UNIVERSITY OF CRETE

**Input **K=2 in Incremental K-Means, S: (

*d *,

*d *,.,

*d *) document collection

** Output: **A hierarchy of clusters (leaf clusters contain a single documents)

**Step 1. **Treat all the documents as one initial cluster.

**Step 2. **Pick a leaf cluster C (or initial) to split. Choose the cluster with the

least overall similarity.

**Step 3. **Bisecting Step: Use Incremental K-Means, as described in section

3.3.3 to split cluster C into two sub-clusters, 1

*C * and

*C *.

**Step 4. **Add the two clusters that are produced from the partition to the list

of leaf clusters (candidate clusters to split).

**Step 5. **Repeat steps 2, 3 and 4 until each cluster at the bottom of the

hierarchy contains a single document.

**Figure 3.4:** Bisecting Incremental K-Means algorithm

The basic Bisecting K-Means stops when the desired number of clusters is
reached. Bisecting Incremental K-Means terminates when each leaf cluster contains a
single document. The reason of this modification is that usually there is no prior
knowledge on the desired number of clusters.
Figure 3.5 demonstrates an example of Bisecting Incremental K-Means. We
use a small data set consisting of five documents and show how our method can be
applied to this collection. The algorithm generates a hierarchical binary tree step-by-
step. At each step the hierarchical tree is expanded by adding two new leafs. The
process starts with a single cluster

*C *, which consists of all the documents and
continues until five leaf clusters are obtained, each containing one document. The
final five leaf clusters are the

*C *,

*C *,

*C *,

*C *. At each step, the leaf clusters
with the least overall similarity is split in two new clusters (leaf nodes in Figure 3.5).
For example, among leaf clusters

*C *,

*C * and

*C * we assume that

*C * is the one with the
least overall similarity, so we bisect it into clusters 5

*C *. Then, we continue the
procedure likewise. In Figure 3.5, the clusters which are selected for bisection are
highlighted orange.
NIKOLAOS HOURDAKIS
CHAPTER 3. CLUSTERING ALGORITHMS IMPLEMENTED
Split Cluster

*C *
**Figure 3.5: **Bisecting Incremental K-Means produce a hierarchical tree

It is obvious that Bisecting Incremental K-Means is a divisive hierarchical
clustering procedure. It builds a hierarchical binary tree from top (i.e. a cluster of all
the documents) to bottom (each cluster contains a single document), as opposed to
agglomerative approaches which build the hierarchy bottom-up.
TECHNICAL UNIVERSITY OF CRETE
The run time of Bisecting Incremental K-Means method is

*O*(

*n *log (

*n*)) ,
where

*n * is the number of documents in the entire collection. Thus, it is appropriate
technique for clustering large datasets and producing hierarchy of clusters.

**3.4 The Bayesian Information Criterion (BIC) **
In the previous section, we described that the Bisecting Incremental K-Means
algorithm continues until

*n * leaf clusters are obtained, each containing a single
document. In this case,

*n * is the number of documents in the collection. We
exhaustively execute the algorithm, because usually there is no prior knowledge on
the desired number of clusters. However, terminating the procedure when each leaf
cluster has one document is time-consuming. Moreover singleton clusters are
meaningless. To prevent over-splitting of clusters we must define a strategy to stop
the Bisecting algorithm when meaningful clusters are reached.
Toward this goal, we propose the use of Bayesian Information Criterion (BIC)
or Schwarz Crite splitting criterion of a cluster. As discussed in section
Means algorithm) first adapted the BIC to
clustering algorithms for estimating the best K in a given range of values. The
algorithm searches over many values of K and scores each clustering result using the
so-called Bayesian Information Criterion. X-Means choose the clustering result with
the best BIC score in the data (i.e., the

*K * clusters with the highest BIC score).
In this study, we use the BIC to perform a splitting test at each cluster in order
to decide whether a cluster should split or not. The BIC score is used to measure the
improvement of the cluster structure between the parent cluster and its two children
clusters. We compute the BIC score to initial cluster and to the resulting (child)
clusters. If the BIC score of the produced children clusters is less than the BIC score
of their parent cluster we do not accept the split. We keep the parent cluster as it is
(we do not select it again). Otherwise, we accept the split and the algorithm proceeds
similarly at lower levels.
NIKOLAOS HOURDAKIS
CHAPTER 3. CLUSTERING ALGORITHMS IMPLEMENTED

**3.4.1 Computing the BIC Score **
In our case we have a collection of documents and we partition it into two clusters.
The parent cluster can be considered as a model and the resulting cluster structure
(with the two children clusters) as a second model. For each model, we compute the
BIC score. We compare the BIC score of the two models and we accept the split if the
BIC score of the second model is higher than the BIC score of the first model. Below,
we describe how BIC is computed.

*D * be a set of documents {

*x *,

*x *,.,

*x *}.

*D * can be partitioned into disjoint
subsets

*D *,

*D *,.,

*D *. In the case of Bisecting K-Means

*K *= 2
centroid of the

*th*
*j * cluster (1 ≤

*j *≤

*K *). Let (

*i*) be the index of the centroid which is
closest to the

*i *−

*th * data point. For example, µ is the centroid nearest to the −
data point during an iteration (1 ≤

*i *≤

*n *). Let

*D *⊆

*D * be the set of data points that
have µ as their closest centroid. Let

*R *=

*D * and

*R *=

*D *. The number of
dimensions is M. Notice that, in our case the initial data set is partitioned into two
The BIC of the model

*M * (i.e., in our case the parent cluster or the two

*BIC*(

*M *) =

*l D *−

*R * (3.11)
where ˆ

*l *(

*D*) is the log-likelihood of the data according to the model

*M *, while

*p *=

*K *(

*M *+1) is the number of independent parameters in

*M *.
The BIC, according to Equation 3.11, contains two components. The first term
(log-likelihood of the documents) can be used as a measure of the cohesiveness of a
cluster in order to denote whether a cluster should split or not. We estimate how close
to the centroid are the documents of the cluster. More specific, given a cluster of
points, that produces a Gaussian distribution

*N *(µ,σ ) , log-likelihood is the
probability that a neighborhood of data points follows this distribution. The second
term penalizes the complexity of the mode We assume that some documents
belong to the cluster. However, due to the complexity of the model (many parameters
TECHNICAL UNIVERSITY OF CRETE
or many data points), some data points, in addition to Gaussian may follow and other
distributions. For this reason, we give a penalty by the second term of Equation 3.11.
In the case of BIC score of the parent cluster K is set to 1, while in case of the
two resulting clusters

*K * is set to 2.

*M * is the number of terms in the representations
of documents. The maximum likelihood estimate for the variance is given by:
∑

*x *−µ (3.12)

*R *−

*K i*
The variance (Equation 3.12) estimates the average of the square of the
distance of each document from the centroid (mean) of the cluster. This is a measure
of the cohesiveness of the cluster.
According to Equation 2.21 the maximum log-likelihood of the data in cluster

*D * can be computed as:

*R *−

*R*
*R * (3.13)

*n *log (2 )

*R * denotes the number of documents in cluster

*D *. The maximum log-likelihood is
computed separately for the parent cluster and for each one of the two children
clusters. In Equation 3.13, we always set the variable

*K * to 1, as we pertain to the log-
likelihood of a single cluster. The maximum likelihood estimate (MLE) for the
σ is computed separately for each cluster according to Equation 3.12.
To extend the formula in Equation 3.11 for two centroids (two children
clusters) instead of one, we use the fact that the log-likelihood of the data points that
belong to the two centroids is the sum of the log-likelihood of the individual
centroids. Thus, Equation 3.11 can be re-written as:

*BIC *(

*M *= ∑

*l C *−

*R * (3.14)
The number of free parameters

*p * is the sum of:
♦

*K *−1 class probability ♦

*M **

*K * centroid coordinates ♦ One variance estimate
We can see in Equation 3.13 that, as 2
σ increases, the likelihood decreases
and therefore the BIC score (see Equation 3.11) decreases. As a result the parent
NIKOLAOS HOURDAKIS
CHAPTER 3. CLUSTERING ALGORITHMS IMPLEMENTED
cluster must be partitioned. We can conclude that 2
σ mostly determines the BIC
score of a cluster as it provides a measure of the cohesiveness of the cluster.
As far as Equation 3.12 is concerned, computationally significant
simplifications can be applied. This allows for a fast and memory efficient
implementation of BIC. Particularly, we rewrite the sum ∑

*x *− µ
3.12. By some simple algebraic manipulations this sum can be re-written as:
∑

*x *−µ =
∑

*x *−

*x *= 2 ∑ 1−cos

*x *,

*x *
*n x *,

*x*
*x *,

*x*
∑ cos(

*x *,

*x *
*n x *,

*x*
*n x *,

*x*
∑ cos(

*x *,

*x *
*n x *,

*x*
∑ cos

*x x * (3.15)

*Rn x *,

*x D*
By using Equation 3.5 the above formula can be re-written as:
∑

*x *− µ =2

*R *−2

*R*
∑ cos

*x *,

*x *
*n x *,

*x*
= 2

*R *− 2

*R c *
*c * is the square of the length of the centroid vector. Thus, the Equation 3.12
after the modifications can be re-written as:
2∗

*R *∗(1−

*c *) (3.17)

*R *−

*K*
Summarizing, the value of 2
σ for a cluster, as defined in Equation 3.17, is
used in Equation 3.13 for computing the maximum log-likelihood of a cluster. In case
of the parent cluster, the value of log-likelihood is applied in Equation 3.11 to
compute the BIC score of the initial cluster. In case of the two resulting clusters, we
compute the log-likelihood value separately for each cluster. Then, the two computed
values are added, as we can see in Equation 3.14 and the BIC score of the resulting
model (two children clusters) is computed.
TECHNICAL UNIVERSITY OF CRETE
Figure 3.6 demonstrates an example of the use of BIC as splitting criterion of a
cluster. We assume cluster

*C * is a leaf cluster in the hierarchy and according to its
overall similarity (see Equations 3.5 and 3.10) we select to bisect it. Cluster

*C * is
bisected into clusters 1

*C * and

*C *.

**Figure 3.6: **BIC as splitting Criterion of a cluster

We use the BIC to determine if the bisection is acceptable. It can be seen in
Figure 3.6 that we have computed two distinct BIC scores. One for the parent cluster
and another for the two children clusters. We compare these scores to decide if we
split the initial cluster. It is shown in Figure 3.6 that the BIC score of the parent
cluster is less than BIC score of the generated cluster structure. Thus, we accept the

**3.5 BIC-Means **
In this section, all techniques presented in the previous sections are integrated in a
new proposed algorithm, which we call

* "BIC-Means"*. It is a partitional clustering
method which structures the resulting clusters as a hierarchical binary tree by
recursively applying the Incremental K-Means algorithm presented in section 3.3.2.
Moreover, a significant modification in our proposed final algorithm as compared
with the basic Bisecting approach is the use of a stopping criterion in order to stop
bisecting the clusters. Instead of continuing the algorithm until each leaf cluster
contains one document, BIC-Means uses a strategy for terminating the divisive
procedure. The BIC plays the most important role towards this goal.
NIKOLAOS HOURDAKIS
CHAPTER 3. CLUSTERING ALGORITHMS IMPLEMENTED
In the following, we propose a strategy for terminating the divisive procedure
in BIC-Means, when meaningful cluster are reached. Let Incremental K-Means
method, as is described in section 3.3.2, be repeatedly applied in a data set which
contains

*n * documents. When this process has been executed

*m *−1 times, a hierarchy
of

*m * leaf clusters is obtained, where

*m *<

*n *.
As mentioned in section 3.4, the BIC score is applied locally as the splitting
criterion of a leaf cluster. It measures the improvement of a cluster when it is split. If
the BIC score of the two new clusters is less than the BIC score of their parent node
we do not accept the split. In such cases, the proposed strategy defines that we keep
the parent cluster as is and we do not select it as a candidate cluster to split in the next
iteration of the algorithm. Consequently, the BIC-Means terminates when there is no
separable cluster according to the BIC function, instead of terminating at meaningless
singleton clusters.
Overall, BIC-Means produces a hierarchical clustering solution and combines
all the following ideas:
1. Bisecting clustering approach to build a hierarchy of clusters effectively.
2. Incremental K-Means as the proposed partitional method to bisect the
selected leaf cluster at each bisecting step.
3. A termination criterion for preventing clustering from over-splitting using
the Bayesian Information Criterion (BIC).
Step-by-step, the proposed BIC-Means algorithm is presented in Figure 3.7:
TECHNICAL UNIVERSITY OF CRETE

** Input: **K=2 in Incremental K-Means method, S: (

*d *,

*d *,.,

*d *) document
collection, BIC formula.

**Output: **A hierarchy of clusters (consist of meaningful leaf clusters)

**Step 1. **Treat all the documents as one initial cluster.

**Step 2. **Pick a leaf cluster C (or initial) to split from the list of leaf clusters.

Choose the cluster with the least overall similarity which is the
average pairwise similarity between all documents in the cluster.

**Step 3. **Bisecting Step: Use Incremental K-Means, as described in section

3.3.2 to split cluster C into two sub-clusters, 1

*C * and

*C *.

**Step 4. **Calculate two BIC scores for the two distinct models. One for the

initial cluster

*C * and another for the two resulting clusters 1

*C *. We define two possible cases:
♦ If the BIC score of the parent cluster is less than the BIC score
of the new cluster structure: We accept the split and add the
two generated clusters to the list of leaf clusters (candidate
clusters to split).
♦ Otherwise: we keep the cluster

*C * as it is and do not select it
as a candidate cluster to split in a next iteration of the BIC-
Means method. In other words, we remove cluster

*C * from the
list of leaf clusters.

**Step 5. **Repeat steps 2, 3 and 4, until there is no leaf cluster in the

hierarchy which is separable according to the BIC score. Then, the
BIC-Means algorithm terminates.

**Figure 3.7: **The proposed BIC-Means algorithm

Figure 3.8 demonstrates an example of BIC-Means. We assume that via a
sequence of repeated bisections on a document collection, a hierarchy of clusters has
been obtained. Figure 3.8 illustrates the last level of the hierarchy, where are the leaf
clusters. We apply the pre-defined strategy on the four leaf clusters which are
appeared in Figure 3.8 to indicate when the BIC-Means algorithm terminates
according to our proposed methodology.

*C *,

*C *,

*C *,

*C * denote the four leaf clusters in
the initial hierarchy and highlight them orange.
NIKOLAOS HOURDAKIS
CHAPTER 3. CLUSTERING ALGORITHMS IMPLEMENTED

*C *: According
to BIC is not bisected

*C *: According
to BIC is bisected
… Sequentially split the remaining leaf
clusters, until no one
is separable. Then, the algorithm terminates.

**Figure 3.8: **Algorithm for terminating the BIC-Means method

According to our Bisecting approach, we pick a leaf cluster to split from the
list of leaf clusters. Let

*C * be the cluster with the least overall similarity. We bisect it
and assume that its BIC score is greater than the BIC score of the two resulting
clusters. Thus, as we described in section 3.4, we do not split the cluster

*C * and
additionally do not select it for further bisections. Also,

*C * is removed from the list of
leaf clusters and is highlighted gray.
TECHNICAL UNIVERSITY OF CRETE
We continue by selecting the next cluster for splitting. Let 3

*C * has the least
overall similarity. We partition it into two sub-clusters 5

*C * and consider that
the BIC score is determining that 3

*C * can be split. Consequently, the list of leaf
clusters consists of

*C *,

*C *,

*C *. For short, as the remaining leaf clusters are
concerned, we bisect them sequentially. For each one, we assume that it can not be
partitioned if we compare its BIC score to the BIC score of the corresponding children
clusters. Therefore, there is no separable leaf cluster in the hierarchy and as step 4 of
our proposed method indicates, the BIC-Means algorithm terminates.
NIKOLAOS HOURDAKIS
CHAPTER 3. CLUSTERING ALGORITHMS IMPLEMENTED
TECHNICAL UNIVERSITY OF CRETE

**Chapter 4 **
**Experimental Results **
We carried out two different sets of experiments. The first set of experiments focuses
on the evaluation of the clustering quality of all algorithms presented in chapter 3. F-
Measure was used to measure the overall "goodness" of the generated clusters.
Clustering techniques have been tested on OHSUMED
two standard text corpora widely available on the Web. Both corpora offer a pre-
defined categorization of its content into clusters which can be used to measuring the
clustering quality of the implemented clustering algorithms. The results demonstrate
that our proposed BIC-Means algorithm performs at least as good as other state of the
art clustering techniques.
The second set of experiments focuses on measuring the effectiveness and
efficiency of a cluster-based information retrieval system. Having established the
quality of document clustering algorithms, we applied the suggested BIC-Means on
OHSUMED (a very large document collection with 233445 medical articles from
Medline) in order to create a hierarchy of clusters. For the evaluations, we applied a
subset of 61 queries of the original OHSUMED query set developed by Hersh et al.
es were compiled by human experts. We
matched each query against the leaf clusters of the hierarchy and the clusters were
ranked based on their similarity to the query. We evaluated several cluster-based
retrieval strategies and compare them against retrieval results by exhaustive search on
MeSHngs) is a taxonomic hierarchy (ontology) of medical
and biomedical terms (or concepts) suggested by the U.S National Library of
CHAPTER 4. EXPERIMENTAL RESULTS
Medicine (NLM)and searching of journal articles in
MEDLINE are produced by the NLM. MeSH is
widely used in indexing and cataloging by libraries and other institutions around the
world. NLM has adopted the Extensible Markup Language (XML)e description
language for MeSH. The MeSH vocabulary file is available in XML (Bray) format.
There exist 23880 main headings, termed descriptors in 2006 MeSH edition.
Moreover, MeSH descriptors are organized in a logical "tree" structure. There are 16
subtrees (taxonomies) or branches in the MeSH ontology (see Figure 4.1), of ISA kind
of relationship between nodes (concepts) in each subtree. Within each sub-category,
descriptors are arrayed hierarchically from most general (e.g. "chemicals and drugs")
to most specific (e.g. "aspirin") in up to eleven hierarchical levels. Each MeSH
descriptor appears in at least one place in the subtree and may appear in several places
in the hierarchy. MeSH concepts correspond to MeSH objects which are described
with terms of several properties [see section A.1 in Appendix A]. The most important

**MeSH Headings (MH): **MeSH Headings or descriptors are a collection of terms for

primary themes or topics contained in literature. They are used in MEDLINE as
the indexing terms for documents. Every journal article is indexed with 10-12
headings. Its use indicates the topic discussed by the work cited.

**Qualifiers or Subheadings: **In addition to the descriptor's hierarchy, MeSH contains

a small number of standard qualifiers, which can be added to descriptors to
narrow down the topic. There are 83 qualifiers in 2006 MeSH ontology.
Qualifiers afford a convenient means of grouping together those citations which
are concerned with a particular

**Entry Terms: **These terms are used as pointers to the MeSH Headings. Entry

vocabulary has been thought of as synonyms or very similar terms of the main
Heading. Entry terms, sometimes called "See cross references", indicate that
information related to one term will be found under a different term. Moreover,
the set of entry terms that points to a MeSH Heading are the terms that indicate
the concept introduced by
3http://www.nlm.nih.gov
5http://www.w3.org/XML
TECHNICAL UNIVERSITY OF CRETE

**MeSH Tree Number: **In the MeSH taxonomy, each MeSH Heading is characterized

by its MeSH tree number (or code name) indicating the exact position of the
term in the MeSH tree taxonomy. For example C is the code name of the
"Diseases" subtree and the term "Slow Virus Diseases" has a tree number
C02.839 meaning that this MeSH Heading belongs to C subtree (see Figure 2.1).

**MeSH Scope Note: **This short piece of free text provides a type of definition in which

the meaning of the MeSH Heading is circumscribed.
Names of descriptors reflect the broad meaning of the concepts involved. The
hierarchical relationships must be intellectually accessible to users of MeSH (e.g.,
clinician, librarian, and indexer). An indexer must be able to assign a given MeSH
Heading to an article and a clinician must be able to find a specific MeSH Heading in
the tree hierarchy.

**1 . Anatomy [A] **

2 . Organisms [B]

3 . Diseases [C]
o Virus Diseases [C02]
¾ Slow Virus Diseases [C02.839] +

**4 . Chemicals and Drugs [D] **

5 . Analytical, Diagnostic and Therapeutic Techniques and Equipment [E]

6 . Psychiatry and Psychology [F]

7 . Biological Sciences [G]

8 . Physical Sciences [H]

9 . Anthropology, Education, Sociology and Social Phenomena [I]
**1 0 . Technology and Food and Beverages [J] **

1 1 . Humanities [K]

1 2 . Information Science [L]

1 3 . Persons [M]

1 4 . Health Care [N]

1 5 . Publication Characteristics [V]

1 6 . Geographic Locations [Z]
**Figure 4.1: **MeSH Tree Structures 2006

**4.2 Document Collections **
For evaluating the quality of clustering algorithms document clustering results are
compared against manually and pre-defined categorization of the corpus. To reduce
NIKOLAOS HOURDAKIS
CHAPTER 4. EXPERIMENTAL RESULTS
the risk that our results might be valid only on a particular corpus, we experimentally
evaluated the performance of the implemented clustering algorithms on two different
data sets: the Reuters-21578
orization exists for both corpora. We
created two subsets of each data set to perform the document clustering experiments.
The entire OHSUMED collection was also used in cluster-based retrieval
experiments. Issues related to Reuters-21578 and OHSUMED along with the
description of their subsets are discussed below.

**4.2.1 Reuters-21578 **
Reuters-21578 is a commonly used document collection for text categorization tasks.
It consists of 21578 newswire articles from the Reuters news service obtained in 1987
ain is broad enough to be realistic and the content of the news is
understandable for non-experts. Reuters-21578 is freely available and is distributed in
22 files. The files are in SGML format. Each of the first 21 files contains 1000
documents, while the last contain 578 documents. All Reuters-21578 documents have
more information than the simple article reference. The structure of a Reuter's
document can be found in Appendix A at section A.2.1. The most commonly used
attributes in a Reuter's article are the title, the abstract and the topic.
Reuters-21578 collection comprises an "a priori" categorization of documents.
They were annotated and indexed with categories by personnel from Reuters Ltd. and
Carnegie Group, Inc. in 1987. The topic field is used to classify each document in a
pre-define category. Documents have been categorized into 135 distinct topics
(categories). Each article may be labeled with none, one or with many pre-defined
topics. The lack of a label indicates that the human annotator could find an adequate
topic. In our experiment we used the most commonly used split of Reuter's
documents, the so-called "Mod-Apte" where the 21578 documents are separated into
9603 training documents, 3299 test documents and 8676 unused documents.
To experimentally evaluate the implemented clustering algorithms we formed
two subsets of Reuters-21578. For both subsets we have selected articles that belong
TECHNICAL UNIVERSITY OF CRETE
to exactly one of 135 topics (categories). Additionally, documents with an empty
body field were also discarded.
The first subset, which we call

*reuters1*, contains documents in which the
value of LEWISSPLIT attribute is "TEST" and attribute TOPIC = "YES" according
to "Mod-Apte" split. Each article classified with a single topic. R

*euters1* contains
2583 documents into 59 categories. In this categorization, categories with extremely
few documents (less than 10) have been discarded. Thus, "outlier categories" are
ignored in the evaluation. The resulting subset consists of 2442 documents which
have been classified in 24 classes (categories). The distribution of documents per
topic is shown in Table 4.1. At each cell we note the name of the topic and the
number of documents that are contained in the corresponding category.

**Reuters1 – 2442 documents (Category: No. of Documents) **
money-supply: 28

**Table 4.1: ***reuters1* - Category Distribution

We call

*reuters2 *the second subset of Reuters-21578. It is larger than

*reuters1*
containing 9120 documents into 66 distinct classes (categories). The only difference
between the two subsets is the value of LEWISSPLIT attribute. In

*reuters2* this value
can be set either "TEST" or "TRAIN". The other settings are the same as in

*reuters1*.
The categories which contain less than 31 documents were discarded from this subset
as well. Thus,

*reuters2* contains 8712 documents into 24 classes. Category
distribution is shown in Table 4.2. In both subsets the majority of the documents have
been labeled with "earn" topic.
NIKOLAOS HOURDAKIS
CHAPTER 4. EXPERIMENTAL RESULTS

**Reuters2 – 8712 documents (Category: No. of Documents) **
money-supply: 151

**Table 4.2: ***reuters2* - Category Distribution

**4.2.2 OHSUMED **
OHSUMED document collection was compiled by William Hersh et al
Oregon Health Sciences University. It is a clinically oriented subset of Medline.
Medline is the bibliographic database of the U.S. National Library of Medicine
(NLM). It contains more that 15 million references (version 2006) to journal articles
in life sciences, medicine and bio-medicine. OHSUMED consists of 348566 Medline
documents from 270 medical journals taken between the years 1987-1991. 233445 of
the references contain abstracts and can be downloaded from
ftp://medir.ohsu.edu/pub/OHSUMED. OHSUMED has become an evaluation
benchmark in text categorization and
OHSUMED stores a rich set of metadata associated with each article. The
structure of an OHSUMED document can be found in Appendix A at section A.2.2.
Publications in OHSUMED are manually indexed by NLM using MeSH Headings
(MH), with typically 10-12 descriptors assigned to each reference. Title (TI), abstract
(AB) and MeSH Headings (MH) are the most commonly used fields of OHSUMED
references. We used these fields in our document clustering evaluation.
To evaluate implemented clustering algorithms pre-classified sets of
documents are needed. For this reason, two OHSUMED subsets were formed.
We assume that OHSUMED documents belong to categories related to the
MeSH Headings that are manually assigned to them. The produced subsets which we
call

*ohsumed1* and

*ohsumed2* contain documents from the risk factors, tomography,
prognosis, pregnancy, receptors, molecular sequence data, in-vitro, DNA, carcinoma,
TECHNICAL UNIVERSITY OF CRETE
and antibodies categories. This OHSUMED categorization has also been used for
document clustering evaluationss differ in the number of
documents they contained. Ohsumed1 consists of 32230 documents classified in 10
categories (classes). Ohsumed2 contains 10902 documents into 10 classes and was
produced from Medline documents of the year 1990. Category distribution of both
subsets is shown in Table 4.3:

**Ohsumed1 – 32230 Documents **
**Ohsumed1 – 10902 Documents **
**Category **
**No. docs in this cat.**
**Category **
**No. docs in this cat. **
Molecular Sequence
Molecular Sequence

**Table 4.3: **Ohsumed1 & Ohsumed2 – Category Distribution

The entire OHSUMED collection was used in our information retrieval
experiments. The basic reason for this choice is that OHSUMED is a domain specific
collection. A set of 106 queries have also been defined on OHSUMED along with the
set of documents which are relevant to each query. Apart from the original
OHSUMED query set developed by Hersh et al, a subset of 63 queries were used in
TREC-9ec Retrieval Conference) IR experiments.
NIKOLAOS HOURDAKIS
CHAPTER 4. EXPERIMENTAL RESULTS

**4.3 MeSH-Based Document Representation in OHSUMED **
An important consideration for document clustering is the representation of
documents. Traditionally, documents are represented by extracting individual words
from text (abstract, title). In OHSUMED, each document is represented by abstract,
title and MeSH terms (MeSH Headings) fields. MeSH is a control vocabulary offering
a hierarchical categorization of medical concepts. In OHSUMED (the same every
document in Medline) each document has been indexed manually by a set of MeSH
One of the goals in our experiments was to explore the MeSH terms as
features for document representation. A summary of MeSH is given in section 2.6. In
a part of our experiments, instead of obtaining the term collection of a document from
single word terms in title and abstract, MeSH terms were extracted and used to
represent the document. They were extracted from title and abstract fields. In this
MeSH term collection we added the MeSH terms accompanying each document. The
use of MeSH terms is important for two reasons. First, they are assigned to
OHSUMED references by trained indexers, thus many issues involved with natural
language processing may be avoided. Second, they are multi-word representations
corresponding to medical concepts and as such they are directly comprehensive by
A MeSH term is often consisted of two or more words. For example,
"abdominal pain" is a MeSH term. It is consisted of the words "abdominal" and
"pain". An issue that needs special attention here is how MeSH terms can be extracted
from OHSUMED documents.
For this, we check if a word combined with its next one that come across in
the document consists a MeSH term. If they do, then we check both of them with the
next one if they consist a MeSH term, and so on. If they do not, then a) if a MeSH
term was found until then, we keep the term and continue checking words after this
term, b) if a MeSH term was not found until then, we keep the word as is and continue
checking with the others. For example,
"Abdominal pain in children"
¾ Stopwords to remove: in
check: abdominal? NO
TECHNICAL UNIVERSITY OF CRETE
check: abdominal pain? YES
check: abdominal pain children? NO (END of text)
¾ Found MeSH term? YES (keep term)
¾ Continue checking after MeSH term
check: children? NO (END of text)
¾ Found MeSH term? NO (keep word)
Checked text: "abdominal pain children"

**4.4 Evaluation Method **
One of the most important issues in document clustering experiments is to find an
algorithm-independent measure to evaluate the quality of the clustering result. As
presented in section 2.3, several measures have been proposed in the literature. They
include "entropy", "purity", "overall similarity", "F-Measure" and more.
In this study, F-Measure was used to evaluate the quality of the generated
clusters. We examine how closely the clusters produced by each clustering algorithm
match the set of categories previously assigned to the documents. This requires the
preparation of the data sets so that at each document is assigned a single topic label.
The category distribution for the two subsets of Reuters-21578 was shown in Table
4.1 and Table 4.2. The categories assigned to the documents of two OHSUMED
subsets were presented in Table 4.3. Each table shows the topic labels and the number
of documents that belong to the specific category.
In section 2.3.3, we presented in detail how F-Measure is computed given a set
of generated clusters and a pre-defined categorization of the documents. The overall
F-Measure for a clustering solution is computed according to Equation 2.15. A perfect
clustering solution will be one that leads to clusters which contain documents solely
from a single category (class). In such case the F-Measure score will be one. In
general, the higher the F-Measure values, the better the clustering result is.
We continue giving a simple example of evaluating F-Measure on a cluster
hierarchy. Figure 4.2 illustrates a hierarchical clustering solution. We assume that

*A B*,

*C*,

*D * and

*E * are five documents which constitute a small data set. Suppose, we
apply on this collection the Bisecting Incremental K-Means algorithm as described in
section 3.3.3. Each leaf cluster at the bottom of the hierarchy contains one document.
NIKOLAOS HOURDAKIS
CHAPTER 4. EXPERIMENTAL RESULTS
We assume that documents

*B*,

*C * and

*D * are in fact members of a real class

*T *. Thus,
the number of documents in the category

*T * is

*T *= 3.
We want to find which cluster in the tree hierarchy corresponds to

*T *. To find
this cluster we traverse the hierarchy of the clusters, calculating precision, recall and
F-Measure for each cluster with respect to topic

*T *. Initially, we meet the root cluster
at the top, which contains all documents (

*C *= 5). There are three common documents
in root cluster and category

*T *. Thus, we calculate precision, recall and F-Measure
according to Equations 2.12, 2.13 and 2.14.
Pr

*ecision*(

*Root *_

*Cluster*,

*T *) = 3/ 5
Re

*call*(

*Root *_

*Cluster*,

*T *) = 3/ 3

*F *−

*Measure *= 0.75

**Figure 4.2: **A representative evaluation example

We apply the same computation to each cluster in the tree hierarchy. The
highest F-Measure is 0.85 and is obtained in cluster 1. Cluster 1 contains the

*A B*,

*C * and

*D *. Therefore, we consider the cluster 1 to be the cluster

*C *
corresponding to category

*T * and 0.85 is the final F-Measure for category

*T *. The
overall F-Measure, as given by Equation 2.15 is used to indicate the quality of the
whole hierarchy. It is the weighted average of the F-Measures for each category

*T *.
TECHNICAL UNIVERSITY OF CRETE

**4.5 Document Clustering Experiments **
In this section we present our first set of experiments evaluating the quality of the
clustering solutions produced by the clustering algorithms implemented in this thesis.
Specifically, we evaluate and compare results obtained by the K-Means, Incremental
K-Means and Bisecting Incremental K-Means methods. These are results obtained
from documents represented by simple (single words) terms. For OHSUMED
documents we also experimented with documents represented by MeSH terms (multi-

**4.5.1 Experimental Setup **
The main features of the four document collections were used in our experiments are
summarized in Table 4.4.

**No of Doc. **
**No of Classes **
reuters1 Reuters-21578 2442
reuters2 Reuters-21578 8712
ohsumed1 OHSUMED-233445 32230 10
ohsumed2 OHSUMED-233445 10902

**Table 4.4: **Summary of the data sets

All clustering methods were implemented on top of Lucenee section A.3
in Appendix A) which is a Java-based open source toolkit for text indexing. In both
data sets the documents sets were indexed by the Lucene utility. Reuters-21578
documents were indexed by title, body and topic fields. Additionally, we created a
field with all distinct terms in title, body and topic. Reuters-21578 documents were
indexed by this field as well. OHSUMED documents were indexed by title, abstract
and MeSH terms (MeSH Headings) fields. Similarly to Reuters, one more field was
indexed consisted of the distinct terms in title, abstract and MeSH field. In case of
experiments in section 4.4.3, OHSUMED documents were represented only by MeSH
terms extracted from title, abstract and MeSH terms fields.
NIKOLAOS HOURDAKIS
CHAPTER 4. EXPERIMENTAL RESULTS
The fields of each document were syntactically analyzed and reduced into
separate term vectors (or MeSH vectors in case of MeSH-based representation).Each
term in this vector was represented by its weight. The

*tf *−

*idf * weighting scheme was
used for computing the weight of each term. Each term vector was normalized by
document length so that it is of unit length. On two data sets, we used a stop-list to
remove common words (stop-words) (i.e. insignificant words like ‘a', ‘the', ‘and',
‘or' F-Measure was used as a measure of cluster "goodness". Additionally, we
discuss the results in terms of clustering time required by each algorithm.
As mentioned in chapter 2, the main disadvantage of partitional clustering
methods is that their performance is sensitive to the selection of the initial cluster
centroids (i.e., clustering the same set of documents more than once with the same
parameter values will generate a different clustering result). This is the reason why
multiple trials are needed. Consequently, we carried out ten separate runs for each
document clustering evaluation. The experimental results on partitional algorithms
reported in this section correspond to the average F-Measure over ten runs.
All algorithms are implemented in Java programming language, and
experiments were run on a PC with a Pentium 4 3.2GHz processor, with 2GB RAM,

**4.5.2 Evaluation and Comparison of our K-Means, Incremental K-**
**Means and Bisecting Incremental K-Means algorithms **
Experimental results on K-Means, its variant Incremental K-Means and the Bisecting
Incremental K-Means method are presented below.

**Evaluation and comparison of K-Means and Incremental K-means **
First in our experiments we evaluated the effectiveness and efficiency of K-Means
and Incremental K-Means clustering algorithms in terms of F-Measure and clustering
time. In parallel, we examined for each method how the vector representation of a
document can affect the clustering quality. We used reuters1 (see Table 4.1) test
corpus and assumed three different ways that a document can be represented:
♦ By the terms from BODY field of Reuters-21578 texts,
TECHNICAL UNIVERSITY OF CRETE
♦ By the terms from TITLE field in a first vector, the terms from BODY field in
a second vector and the TOPIC in a third vector and
♦ By all distinct terms from body, title and topic in a single vector.
For example, in the second case, when we want to compute the similarity between
two documents we compute it separately for each field and then we sum the three
computed values.
We set K=24, as 24 are the categories of reuters1 subset. As far as K-Means is
concerned, we set the parameter ITER (number of iterations) of the algorithm equal to
6. For Incremental K-Means ITER was set to 4. As described in section 3.3.2, this
value determines the number of iterations in K-Means and Incremental K-Means
techniques. The resulting F-Measure values for the various document vector
representations are shown in Figure 4.3. We call this experiment 1a.

**K-Means - Incremental K-Means**
**Different Document Vectors -Reuters1 collection **
Inremental K-Means

**al**

tr 0.6

** (10 **0.5

Body, Title, Topic

**Figure 4.3: **Experiment 1a – Experiments varying document vector representations

Incremental K-Means performs significantly better than basic K-Means
algorithm. We can see that independently of the type of the document representation
Incremental K-Means outperforms the standard implementation of K-Means method
by 20-32%. The performance of Incremental K-Means method fluctuates between
62% and 80%, whereas F-Measure values for K-Means are between 34% and 47%.
The results indicate that the continuously center adjustment and the last re-assignment
NIKOLAOS HOURDAKIS
CHAPTER 4. EXPERIMENTAL RESULTS
of the documents to clusters, (as suggested in section 3.3.2), produce better clustering
results than the naive K-Means procedure.
Regarding different document vector representations, Figure 4.2 illustrates that
for both algorithms the best F-Measure values are obtained in case of documents are
represented by three distinct vectors (body, title and topic), instead of a single unified
vector. This observation was expected due to the structure of the reuters1 test set. The
likelihood that a document will be assigned to the correct cluster increases when the
topic field is included in the vector. Notice that, in Reuters-21578 the topic of each
document is used to classify it in a pre-defined category. Also because the topic
determines the class that a Reuters document belongs to, having the topic in a separate
document vector would be unfair to the other two cases presented in Figure 4.3 (this
would increase the performance of algorithm drastically). For this reason we decided
to use topic only as part of a single vector, together with all other fields and not as a
separate vector. In the following each document in Reuters-21578 is represented by a
single vector formed by all distinct terms from title, body ant topic.
We indicated that Incremental K-Means is much more effective than regular
K-Means in terms of F-Measure. In the second experiment (experiment 1b), we
evaluate the performance of Incremental K-Means under different values of the
number ITER of iterations. The number of iterations of continuous center adjustment
examined is 1, 2, 3, and 4. Similarly to experiment 1a, we used reuters1 subset and set
K equal to 24, as 24 is the number of the hand-labeled classes in this set. Figure
4.4.indicates the quality of the clustering solutions produced by Incremental K-Means
algorithm setting different number of iterations
TECHNICAL UNIVERSITY OF CRETE

**Incremental K-Means - Reuters1**
**Number of Iterations of Center adjustment**
**tria**

10 0.5

**re( **0.4

**-Measu**

F 0.2

**Avg **0.1

**1 iteration**
**2 iterations**
**3 iterations**
**4 iterations**
**Number of Iterations**
**Figure 4.4: **Experiment 1b – Examine the number of iterations of continuous center

As we can see in Figure 4.4, F-Measure is relatively independent on the
number of iterations. Incremental K-Means method showed a stabilized accuracy at
about 68% regardless of the number of iterations. The most significant observation is
that it is sufficient only a single iteration of center adjustment to produce equally good
partitions. This is an important conclusion, because we can efficiently reduce the
clustering time of our Incremental K-Means technique. It is obvious that quadruple
time could be required in case of 4 iterations as compared to a single iteration. While
this time comparison may not be noticeable for small data sets like reuters1, it
becomes much more significant for clustering on large document collections. Notice
ined the effects of the number of iterations of centroid adjustment
in clustering quality. They also suggested that multiple iterations are not necessary.

**Evaluation of Bisecting Incremental K-Means algorithm **
Given the good performance of Incremental K-Means algorithm, we then examined
its performance against our proposed Bisecting Incremental K-Means clustering
technique. We compared it to K-Means and Incremental K-Means methods.
NIKOLAOS HOURDAKIS
CHAPTER 4. EXPERIMENTAL RESULTS
Prior to the evaluation of our Bisecting algorithm we run the following
experiment. We investigated the behaviour of our hierarchical algorithm under
different values of K in Incremental K-Means method. As described in section 3.3.3,
repeated applications of Incremental K-Means algorithm produces a hierarchy of
clusters. Different cluster hierarchies are produced with different values of K. Notice
that different K affects the number of clusters that a given cluster is split and therefore
the higher the value of K the lower the depth of the hierarchy. The larger the value of
K, the broader and shallower is the resulting hierarchy.
To conduct this evaluation (experiment 1c), we set the K equal to 2, 10 and 25
and used the reuters1 test set. Incremental K-Means method terminated as each leaf
cluster contained a single document. According to the results in experiment 1b (see
Figure 4.4), at each bisecting step the parameter ITER in Incremental K-Means
technique was set to 1. F-Measure scores for the various values of K are shown in
Figure 4.5, whereas the Figure 4.6 reports a comparison of the Incremental K-Means
with different K (various-secting clustering at each step) in terms of clustering time
(experiment 1d).

**Comparison of Various-Secting Incremental KMeans on Reuters1**
**25 - Secting**
**10 - Secting**
**Figure 4.5:** Experiment 1c - Various Secting Incremental K-Means – F-Measure

As we can see in Figure 4.5, the F-Measure of the generated cluster hierarchy
increases as the value of K decreases. Notice that F-Measure is rather independent on
TECHNICAL UNIVERSITY OF CRETE
K (increases slightly for K=2). The best F-Measure score was obtained in Bisecting
Incremental K-Means method. However, there are no significant differences in the
effectiveness of clustering results using one of the three values of K.

**A Comparison of Various-Secting Increm. K-Means - Clustering Time**
** Incremental K-Means: 1 Iteration**
**ime**

T

g

n 150

**lust **100

**g C**

v

A
**25 - Secting**
**10 - Secting**
**Figure 4.6:** Experiment 1d - Various Secting Increm. K-Means – Clustering time

As far as the clustering time is concerned, Figure 4.6 shows that the time
(minutes) required building a cluster hierarchy increases with K. Thus, bisecting is the
approach which requires the less clustering time.
We conclude that results in experiment 1c and 1d confirmed our initial
decision to apply our proposed methodology on Bisecting Incremental K-Means
algorithm, instead of other K-Secting techniques

**. **
The main goal of the following document clustering experiment (experiment
1e) is to evaluate Bisecting Incremental K-Means algorithm and compare its
performance against K-Means and Incremental K-Means. In this experiment
(experiment 1e) we used reuters1 and ohsumed1 (see Table 4.4). As far as Bisecting
Incremental K-Means is concerned, the experiments were done by using the same
parameter values discussed in experiment 1c regarding the number ITER of iterations
at each bisecting step and the terminating procedure. In K-Means and Incremental K-
Means algorithms, the number of iterations was set equal to 6 and 1 respectively.
NIKOLAOS HOURDAKIS
CHAPTER 4. EXPERIMENTAL RESULTS
Figure 4.7 shows the results from the comparison of the three clustering methods in
terms of clustering quality.

**Comparison of K-Means, Incremental K-Means and Bisecting**
**Ohsumed1 - Reuters1 data sets**
**10 trials) **0.5

**vg F-**

A 0.1

Incremental K-Means
Bisecting Increm. K-Means

**Algorithm s**
**Figure 4.7:** Experiment 1e – Comparison of K-Means, Incremental K-Means and

Bisecting Incremental K-Means
Figure 4.7 illustrates that our Bisecting Incremental K-Means algorithm
achieves significantly better F-Measure than the other two clustering methods on both
data sets (reuters1 and ohsumed1). Specifically, our Bisecting algorithm outperformed
the basic K-Means and our Incremental version by 42% and 12% respectively on
reuters1 and by 28% and 13% respectively on ohsumed1.
As we can see in Figure 4.7, the resulting F-Measure values for K-Means and
Incremental K-Means on ohsumed1 are consistent with those obtained on reuters1 and
presented in Figure 4.3. We observe that on both data sets the Incremental method
performs noticeably better than the standard K-Means algorithm.
Finally, experimental results in Figure 4.7 indicated that for each one of the
three evaluated algorithms the F-Measure score was less on ohsumed1 as compared to
the corresponding value obtained on reuters1. Thus, we could make the
supplementary conclusion that in document clustering evaluation the OHSUMED
collection gives lower F-Measure values as compared to Reuters-21578 data set.
m this observation.
TECHNICAL UNIVERSITY OF CRETE
Summarizing, Bisecting Incremental K-Means algorithm performs
significantly better than K-Means and its variant Incremental K-Means on both
reuters1 and ohsumed1. Incremental K-Means outperforms basic K-Means by 17-30%
in terms of F-Measure. Incremental K-Means also requires only a single iteration for
center adjustment to produce a good clustering solution. Finally, in experiment 1c we
observed that our version of Bisecting Incremental K-Means performs better than the
other Various-Secting algorithms.

**4.5.3 Evaluation of MeSH based Representation on Clustering **
We showed that Bisecting Incremental K-Means method outperforms the other two
partitional clustering techniques on both data sets. In this experiment, we evaluated
how the clustering quality is affected by the way the documents are represented.
We examined the performance of Bisecting Incremental K-Means method
using vector representation of documents consisting of MeSH terms. Then, we
compared these results with those obtained by representation with single word terms.
The latter approach was evaluated in subsection 4.5.2 on reuters1 and ohsumed1. In
this experiment, we used the ohsumed2 data set which contains 10902 documents. We
selected an OHSUMED subset, as it is a medical corpus which contains articles from
Medline published. As described in section 4.2.2, 10-12 MeSH terms are assigned to
each OHSUMED document by human indexers and constitute a specific field.
MeSH terms are extracted from the title and the abstract field of each Medline
reference using the technique described in section 4.3. Then, we added the existing
(within each document) MeSH terms to obtain a vector of MeSH terms for each
OHSUMED document.
To conduct this experiment (experiment 2a), we used ohsumed2 subset (see
table 4.3). The number of parameter ITER in Incremental K-Means method was set
equal to 1 and the divisive procedure terminated when each leaf cluster contained a
single document. Figure 4.8 shows the F-Measures scores obtained by using MeSH
and single word terms to represent the OHSUMED documents. In terms of clustering
time the evaluation is shown in Figure 4.9. We called this experiment 2b.
NIKOLAOS HOURDAKIS
CHAPTER 4. EXPERIMENTAL RESULTS

**Bisecting Incremental K-Means- OHSUMED2 **
**MeSH terms Vs Single Word Terms Representation**
**ria**

t

0 0.5

Single Word Terms
Extracted MeSH Terms

**Figure 4.8: **Experiment 2a – F-Measure corresponding to Bisecting Incremental K-

Means. Document representation with single word terms and MeSH terms
We observe that our Bisecting Incremental K-Means method yields better F-
Measure when the OHSUMED documents represented by MeSH terms rather than by
single word terms. Figure 4.8 indicates an 8% increase in F-Measure.

**Bisecting Incremental K-Means - Ohsumed2**
** MeSH-based Vs Single word Terms Representation**
Single Word Terms
Extracted MeSH Terms

**Figure 4.9: **Experiment 2b – The effects of document representation on clustering

quality in terms of Clustering Time
TECHNICAL UNIVERSITY OF CRETE
In terms of clustering time, experimental results in Figure 4.9 show that our
version of Bisecting algorithm needs much less execution time when the document
vectors contain only MeSH terms. On ohsumed2 the clustering time for the MeSH-
based representation of documents was 14 minutes whereas for single word terms
representation the clustering hierarchy was obtained in 97.6 minutes. Therefore,
significant decrease in execution time was observed. This happened due to the size of
the document vectors. In case of the document representation by single word terms
the size of each document vector is about 80-100 terms, while in case of MeSH based
representation each vector contains about 20 distinct MeSH terms.
Summarizing, the MeSH-based representation of documents as compared to
single word terms representation improves the performance of our Bisecting
Incremental K-Means algorithm. The results showed that the F-Measure increases
while the clustering time decreases notably. Moreover, MeSH terms form a more
meaningful representation for documents and clusters. The set of MeSH terms
contained in each document specifies well the subject of the document. In case of
clusters the centroid is consisted of MeSH terms and can satisfactorily gives the
semantic content of the cluster.

**4.5.4 Evaluation of BIC-Means - Experiments on BIC **
In section 3.5, we proposed the BIC-Means, a hierarchical clustering algorithm based
on Bisecting Incremental K-Means method. This set of experiments focused on
evaluating the quality of the hierarchical clustering solution produced by BIC-Means.
We examined the use of BIC and evaluated how the clustering quality is affected by
the proposed technique for terminating the divisive procedure.
The performance of BIC-Means was evaluated in terms of clustering quality
and clustering time. The values obtained in this experiment were compared to the
corresponding results of Bisecting Incremental K-Means method where the procedure
terminates as each leaf cluster contains a single document. To conduct this evaluation
the document collections were selected are ohsumed2, reuters1 and reuters2. With
regard to ohsumed2, we used vector representation of documents based on MeSH
terms. Experiments in subsection 4.5.3 showed that this representation outperforms
the representation by single word terms. At each bisecting step the parameter ITER
NIKOLAOS HOURDAKIS
CHAPTER 4. EXPERIMENTAL RESULTS
The resulting F-Measures for BIC-Means and Bisecting Incremental K-Means
are presented in Figure 4.10 (experiment 3a). For each of the three test corpora the
corresponding scores are compared. The comparison in terms of clustering time
(experiment 3b) is shown in Figure 4.11.

**Comparison of BIC-Means and Bisecting Incremental K-Means**
Bisecting Incremental K-Means

**ria**

t

0 0.6

**Figure 4.10: **Experiment 3a – Comparison of BIC-Means and Bisecting Incremental

K-Means on clustering quality
As we can see in Figure 4.10, in ohsumed2 the proposed BIC-Means
algorithm achieved similar F-Measure value as Bisecting Incremental K-Means
method. Results on reuters1 and reuters2 indicated that BIC-Means performed slightly
worse as compared to initial Bisecting approach. In terms of F-Measure, for reuters1
the decrease in clustering quality was 14%, whereas in reuters2 collection was 8%.
Thus, we observe that BIC-Means does not yield better F-Measures values
than Bisecting Incremental K-Means. However, these results were prospective. From
the beginning BIC-Means was not expected to improve the clustering quality of our
basic Bisecting technique as the last is exhaustive producing the entire clustering
hierarchy (terminating in singleton clusters) while BIC-Means was introduced here as
a means for non-exhaustive clustering aiming at terminating at rather meaningful
clusters. However, the performance sacrifices compared to Bisecting Incremental K-
Means is negligible. As described in section 3.5, BIC-Means expands the
TECHNICAL UNIVERSITY OF CRETE
functionality of Bisecting Incremental K-Means method. It uses BIC as the splitting
criterion of a leaf cluster and then, a strategy is applied to terminate the divisive
We can conclude that the basic advantage of BIC-Means is the automatic way
for terminating the Bisecting technique rather than executing the algorithm until each
leaf cluster contains a single document. Figure 4.10 indicates that on the three test
corpus F-Measure values of BIC-Means decreased slightly or were the same as
compared to the corresponding values of Bisecting Incremental K-Means. Only on
reuters1 is observed a high decrease in F-Measure score. This can be explained as
follows. First, the BIC which is used as the splitting criterion of a cluster needs a large
collection in order its application to be more effective. In our case, reuters1 contains
2442 documents. As a result, the use of BIC in the specific collection produced a
small number of clusters and thereby F-Measure score was decreased as compared to
initial Bisecting algorithm. Contrary to Reuters, on ohsumed2 BIC-Means achieves F-
Measure value similar to this obtained by our initial Bisecting approach. This is
because OHSUMED is fairly big data set and the hierarchy obtained by BIC-Means
was quite deep due to the many bisections done.

**Comparison of BIC-Means and Bisecting Incremental K-Means **
Bisecting Incremental K-Means

** Time ( **100

**Figure 4.11: **Experiment 3b – Comparison of BIC-Means and Bisecting Incremental

K-Means on clustering time
NIKOLAOS HOURDAKIS
CHAPTER 4. EXPERIMENTAL RESULTS
In terms of clustering time, the experimental results in Figure 4.11 indicate
that BIC-Means runs much faster than Bisecting Incremental K-Means method on
both document collections. For the three test collections were used in this set of
experiments it took much more time for basic Bisecting technique than proposed BIC-
Means algorithm to build a hierarchy of clusters. On reuters1 the clustering time of
BIC-Means was about 9 times less than that of initial Bisecting method (3 - 31.5
minutes). On reuters2 the clustering time of BIC-Means was about 3 times less as
compared to the basic Bisecting approach (58 - 167 minutes). Finally, on ohsumed2 it
took 9.5 minutes for BIC-Means to produce the cluster hierarchy, whereas Bisecting
Incremental K-Means required 14 minutes.
Regarding reuters1 the too much difference in clustering time can be explained
due to the small number of bisections applied on this subset. We discussed this fact
earlier in this subsection. Thus, the algorithm terminated much more quickly and had
a small decrease in clustering quality as compared to basic Bisecting method. For
ohsumed2 the clustering time was short because only 20-25 MeSH terms were
contained at document vectors.
Summarizing, BIC-Means is a hierarchical clustering approach which
incorporates a strategy for terminating the divisive procedure. Its main advantage is
that requires significantly less time to run compared to Bisecting Incremental K-
Means method. Thus, it is an appropriate algorithm for clustering very large document
collections since it does not execute the procedure exhaustively. Finally, the automatic
way that BIC-Means uses to stop the algorithm keeps F-Measure scores at the same
levels or causes a slight decrease in clustering quality as shown in Figure 4.10.

**4.5.5 Summary of Document Clustering Experimental Results **
In this section we presented our experiments and results on document clustering. This
evaluation revealed strengths and weakness of the different clustering algorithms
implemented in this study. First in our experiments we evaluated and compared the
clustering quality of K-Means and Incremental K-Means. F-Measures scores were
computed for different vector representations of documents on reuters1. The results
showed that Incremental K-Means yielded noticeably better F-Measure values than K-
Means. Additionally, we showed that Incremental K-Means needs only a single
iteration of center adjustment to produce a good clustering partition.
TECHNICAL UNIVERSITY OF CRETE
Then, on reuters1 and ohsumed1 we compared the Bisecting Incremental K-
Means with basic K-Means and Incremental K-Means. The results indicated that
Bisecting Incremental K-Means performs much better than the two other techniques
on both data sets.
We continued our evaluation by examining the performance of Bisecting
Incremental K-Means method using MeSH-based representation of documents on
ohsumed2. We compared these results with those obtained by using single word terms
to represent a document. The comparison showed that MeSH-based representation
improves significantly the performance of Bisecting Incremental K-Means algorithm
in terms of F-Measure and clustering time.
Finally, we evaluated (on three data sets) the quality of the hierarchical
clustering solution produced by BIC-Means algorithm. BIC-Means incorporates a
strategy to stop the divisive procedure. We computed F-Measure scores and clustering
time and then compared them to the corresponding values obtained from Bisecting
Incremental K-Means method which is executed exhaustively. Experimental results
indicated that BIC-Means requires much less time to build a cluster hierarchy as
compared to initial Bisecting approach (see Figure 4.11). This is important in case of
large document collections. In terms of F-Measure, BIC-Means achieves the same or
slightly decreased values as compared to Bisecting Incremental K-Means algorithm.

**4.6 Retrieval using Document Clusters **
In the following we demonstrate that it is possible to apply clustering to reduce the
size of the search (and therefore retrieval response times) on large data sets. We
propose several cluster-based retrieval strategies and evaluated their performance. In
parallel, we examined the use of MeSH terms in document, cluster and query vector
The majority of the document retrieval systems which have been described in
the literature match the query against documents in the entire collection. They do an
exhaustive search (document-based retrieval). Similarity scores between the query
and each document are computed and the documents are then ranked in order of
decreasing similarity with the query. However, the computation of similarities
between user's request and all the documents is time consuming due to the exhaustive
NIKOLAOS HOURDAKIS
CHAPTER 4. EXPERIMENTAL RESULTS
search is done. For this reason an alternative approach is required, mostly for retrieval
on large document collections.
Below, we examined how this goal could be achieved by incorporating of
document clustering into the information retrieval process (cluster-based retrieval).
Cluster-based retrieval incorporates the application of a clustering technique on a
document collection in order to group documents into clusters, matches the query with
a representative representation of each cluster and then ranks clusters based on their
similarity to the query. We search the documents which are contained in the N top-
ranked clusters and not all the documents exhaustively. This is the general approach
of the examined retrieval strategies based on clusters. We evaluated the efficiency and
effectiveness of cluster-based retrieval as compared to exhaustive retrieval method.
A numbve been proposed in the literature on
applying clustering to improve retrieval results. Some experimental results
have shown that cluster-based retrieval using static clustering outperforms retrieval by
indicated that exhaustive retrieval is
generally more effective.
In most experiments the size of document collections used was small. This is
due to the time and space performance of hierarchical clustering approaches. There
are no conclusive results on large data sets. In this study, we examined how cluster-
based retrieval can perform across collection of realistic size. Experimental results in
subsection 4.5.4 showed that BIC-Means can be applied on large document
collections. As described in the following subsection, in our evaluation in addition to
documents, the queries contained only MeSH terms. We examined how MeSH-based
document and query representation affect cluster-based retrieval.
uggested that the associations between
documents contain information about the relevance of documents to user's requests.
They formulated and examined the cluster hypothesis. "Closely associated documents
tend to belong to the same clusters and are expected to be relevant to the same
queries". Correspondingly, dissimilar documents are unlikely to be relevant to the
TECHNICAL UNIVERSITY OF CRETE

**4.6.1 Cluster-based Retrieval on OHSUMED using MeSH **
We examine the cluster-based retrieval on OHSUMED which contains 233445
Medline articles. All documents include abstract. In order to build the document
vectors we extracted MeSH terms from title, abstract and MeSH terms fields. The
MeSH term extraction technique was presented in section 4.3. We chose the MeSH
based representation due to the clustering results presented in subsection 4.5.3. They
indicated that MeSH terms improve the performance of our Bisecting algorithm in
terms of F-Measure and clustering time. Additionally, using MeSH terms much less
time is required to compute similarities between the documents or clusters and the
query due to the small size of document or cluster term vectors. OHSUMED
documents were indexed by the Lucene utility. The weights of all MeSH terms in
OHSUMED documents are computed by

*tf *−

*idf *.
The experiments required that documents be first organized into clusters. We
applied our proposed BIC-Means algorithm on entire OHSUMED and a static
hierarchy of clusters was produced. Each cluster was represented by the centroid
vector which is the vector obtained by averaging the weights of the various terms in
To examine the retrieval performance a test collection of 106 queries was
used. A group of novice physicians generated these queries using Medline. Each
document has been judged by physicians as relevant, possibly relevant or not relevant
to a query. In our experiment we consider the possibly relevant documents as relevant.
Each OHSUMED query contains patient and topic information, in the format:
We present an example of a query:
55 yo female, postmenopausal
does estrogen replacement therapy cause breast cancer
In our evaluation the MeSH terms are extracted from each query in order to
represent it. We used the extraction technique presented in section 4.3. The reason for
this extraction was the MeSH based-representation of OHSUMED documents. Each
NIKOLAOS HOURDAKIS
CHAPTER 4. EXPERIMENTAL RESULTS
MeSH term had a unique participation at each query. After this process the above
query was converted as follows:
"Breast_neoplasms
The words which are connected with "_" constitute a MeSH term.
In our experiments we ignored some queries from the OHSUMED query set
due to three reasons. First, several queries do not contain MESH terms so that to
extract them and represent the corresponding query. Second, a query does not have
relevant documents in the judged pool. Finally, some queries were removed from the
set because in an initial exhaustive search done no relevant documents were retrieved
for these queries. As a result we removed 45 queries. The final query set consisted of
61 queries. Each query contained between 1 and 6 MeSH terms. Section A.4.1 of the
Appendix A shows the 61 OHSUMED queries while section A.4.2 illustrates their
corresponding MeSH-based representation. Apart from the original OHSUMED query
set developed by Hersh et al, a sub-set of 63 queries were used in TREC-9
Retrieval Conference) IR experiments. Similarly to original queries, relevance
judgements provided by NIST (National Institute of Standards and Technology)
determine OHSUMED relevant documents to each query. We observed that 40 of the
61 queries used in our retrieval experiments are contained in TREC-9 query set.
Vector Space Model (VSM) was used for retrieval of documents in
OHSUMED. This state-of-the-art method uses the classic dot product between
centroids of clusters and queries as the matching function. The retrieval system was
built upon Lucene. Notice that in addition to text indexing, Lucene is a full-featured
text search engine library in Java. All retrieval strategies were implemented on top of
Lucene. The weight of each query term was initialized to 1, because a MeSH term can
be contained only once in a query. Each query retrieved the 100 highest ranked
answers due to the tendency of users to examine only the top-ranked documents
retrieved by the system.
As presented in the following subsection we examined several search
strategies. In all cases in order to find the clusters that best match a query we searched
the bottom-level clusters (leaf clusters). Experimental resu
indicated that this method instead of searching all the clusters of the hierarchy gives
the best retrieval results. The 633 leaf clusters of our produced hierarchy are non-
TECHNICAL UNIVERSITY OF CRETE
singleton clusters due to the stopping strategy we use in BIC-Means algorithm. They
were scanned and the most relevant to the query were retrieved according to the
corresponding search strategy.
Efficiency and effectiveness are usually the measures used for the evaluation
of an IR system. The former measure looks at the time and space requirements of the
algorithms used by the system. It checks operations such indexing and searching in
terms of functionality. On the other hand, the effectiveness of an IR system addresses
the quality of the retrieval results. The measures used to examine the effectiveness of
our retrieval system are recall (R) (the ratio of relevant documents that are retrieved)
and precision (P) (the ratio of retrieved documents that are relevant). We computed
averaged precision which is the value of precision averaged over the 61 queries and
averaged recall which is the value of recall averaged over the 61 queries.

**4.6.2 Experimental Results: Precision/Recall and Evaluation **
As cluster hierarchy has been built, a search for the clusters that best match the query
was done. We introduced several retrieval strategies which are based on the bottom-
level clusters of the hierarchy (leaf clusters). Each search strategy incorporates
different criteria in order to match the query against leaf clusters and retrieve them.
We evaluated the effectiveness and efficiency of each of these search strategies and
compared the results with the retrieval results obtained by exhaustive search on
The results of each method are represented by a precision/recall curve. Each
point on a curve is the average precision and recall over all queries. As mentioned we
selected the 100 highest ranked answers for each query, so the precision/recall plot of
each method contains exactly 100 points representing the average precision and recall
over the 61 queries. Precision and recall values are computed from each answer set
after each answer. The top-left points of a precision/recall curve corresponds to the
precision/recall values for the best answer or best match while, the bottom right point
corresponds to the precision/recall values for the entire answer set. A method is better
than another if it achieves better precision and recall.
First in our experiments we performed an exhaustive search (document-based
retrieval) on all OHSUMED documents. Similarities between each query and each
document were computed. Then, the documents were ranked and a list of the top 100
NIKOLAOS HOURDAKIS
CHAPTER 4. EXPERIMENTAL RESULTS
documents was produced. The generated precision/recall curve compared with all the
cluster-based retrieval strategies are suggested below.

**Experiment 1: Retrieve and Search the N highest Ranked Clusters **
We start describing the most simple of the implemented cluster-based retrieval
strategies. Each leaf cluster was represented as a vector of MeSH terms. We computed
the similarity between the query and each leaf cluster. The clusters were ranked based
on their similarity to the query. The issue examined in this strategy was how many
best match clusters to select in order to search the documents of that clusters. We
evaluated seven cases. To select 1, 3, 10, 30, 50, 100 and 150 of the top-ranked leaf
clusters. We chose these values as considered to be representative in indicating the
behaviour of the retrieval system. We called these retrieval strategies top_1Cluster,
top_2Clusters, top_10Clusters, top_30Clusters, top_50Clusters, top_100Clusters and
top_150Clusters.
In each case, the documents of the selected clusters were collected. Thus, a
new document collection was produced. It was a very small subset of the initial data
set. Any document in this subset was considered more likely to be relevant to the
query than documents from clusters ranked lower and were not contained in the
selected list of clusters (1, 3, 10, 30, 50, 100 or 150). Then, we computed the
similarity between the queries and the documents of the produced collection. The
documents were order by decreasing similarity. We selected the 100 highest ranked
documents for each query to evaluate precision and recall.
Figure 4.12 shows the averaged precision and recall values obtained by these
retrieval experiments. For each evaluation (1, 3, 10, 30, 50, 100 and 150 top-ranked
clusters) we present a precision/recall curve. We compare these curves with the
retrieval result by exhaustive search (document-based retrieval).
As we can see in Figure 4.12, document-based retrieval performed better than
all the examined cases of cluster-based retrieval. The best results of cluster-based
retrievals obtained as the number of the selected clusters was 150 and 100. We
observe in Figure 4.12 that top_100Clusters and top_150Clusters strategies achieve
almost the same precision and recall. Notice that performance improves with N (in
this experiment N=150 and N=100 achieve better precision and recall). The reason for
this behavior is that more relevant documents are revealed even within less similar
TECHNICAL UNIVERSITY OF CRETE
clusters and their number of similar documents increases with N. Notice though, that
as N approaches K (total number of clusters), the cluster-based retrieval approaches
exhaustive search. The overall performance of cluster-based retrieval depends on
whether top-ranked clusters contain relevant documents.
Exhaustive Search

**Figure 4.12**: Precision-recall diagrams of exhaustive search on OHSUMED and

cluster-based retrieval strategy using the n top-ranked clusters for retrievals
The better performance of document-based retrieval is reasonable due to the
smaller number of documents contained in the top 150 ranked clusters and the other
cases of top clusters. We counted that in top_150Clusters experiment the averaged
number of documents searched over the 61 queries was only 88806, while the
corresponding number in top_100Clusters experiment was 67648. On the contrary, in
document-based retrieval were exhaustively searched 233445 articles. As a result, in
case of cluster-based retrieval experiments the number of similarity computations
between the query and the documents was significantly decreased. On the other hand,
we observe that retrieval by exhaustive search as compared to top_150Clusters and
top_100Clusters retrieval strategies achieved about up to 5% better precision and up
to 5% better recall. We conclude that top_100Clusters and top_150Clusters retrieval
methods improve noticeably the computation efficiency of the retrieval while the
NIKOLAOS HOURDAKIS
CHAPTER 4. EXPERIMENTAL RESULTS
effectiveness was slightly decreased as compared to retrieval results by exhaustive
search on OHSUMED documents.

**Experiment 2: Use the 20 Highest Weighted MeSH terms of the **
**Centroid and Search the N Top-Ranked Clusters **
The first retrieval experiment used all the centroid terms to compute the similarity
scores between the clusters and the query. The second set of experiments examined
the effectiveness of cluster-based retrieval using the 20 highest weighted MeSH terms
of the centroid. The centroid terms were sorted by decreasing frequency and the top
20 MeSH terms were selected to represent the cluster. Then, the clusters were ranked
by decreasing similarity with the query. In this experiment we evaluated four cases.
To select the 10, 50, 100 or 150 of highest ranked clusters. We called these retrieval
strategies 20Terms-top_10clusters, 20Terms-top_50clusters, 20Terms-
top_100clusters and 20Terms-top_150clusters. Similarly to first set of retrieval
experiments, we computed the similarity between the queries and the documents
contained in the top 10, 50, 100 or 150 ranked clusters. We used the cosine similarity
function to match each query against documents of top retrieved clusters. A ranked
document list was produced for each experiment. We selected the 100 highest ranked
documents to evaluate the retrieval process.
Figure 4.13 illustrates the precision-recall curves for the methods tested in
these retrieval experiments. We compare them with the document-based retrieval
(exhaustive search) on OHSUMED and the top_150Clusters retrieval strategy
presented in Experiment 1 (first set of retrieval experiments).
Figure 4.13 indicates that document retrieval by exhaustive search on
OHSUMED is more effective than the search based on leaf clusters and use the 20
highest weighted MeSH terms of the centroid. Comparing the precision-recall
diagrams obtained in the first set of experiments (Experiment 1) with them illustrated
in Figure 4.13 we observe that the use of the 20 highest weighted MeSH terms of
cluster's centroid did not improve the cluster-based retrieval results on OHSUMED.
More specifically, Figure 4.13 shows that top_150clusters retrieval method (uses all
the MeSH terms of the centroid) performs better than 20Terms-top_150Clusters (uses
the 20 highest weighted MeSH terms of the centroid).
TECHNICAL UNIVERSITY OF CRETE
Exhaustive Search

**Figure 4.13**: Precision-recall diagram of exhaustive search on documents and search

based on leaf clusters using the 20 highest weighted centroid terms and the N top
ranked clusters for retrievals on OHSUMED
Also, we observe that the efficiency of the retrieval does not noticeably
increase with the number of clusters searched. 20Terms-top_150Clusters retrieval
strategy performs slightly better than 20Terms-top_100Cluster and 20Terms-
top_50Cluster methods. This may occur because for some queries there were not 100
or 150 clusters that contained one or more of the query terms in their centroid vectors.
This conclusion can be confirmed by examining the number of documents searched
over the 61 queries. In case of 20Terms-top_50Clusters retrieval strategy 33607
documents were searched whereas for 20Terms-top_100Clusters and 20Terms-
top_150Clusters the searched documents were 46786 and 54991 correspondingly. We
observe that while the number of retrieved clusters were doubled or trebled, the
searched documents were not increased significantly.

**Experiment **
**Retrieve the Clusters with all Query Terms in **
**Centroids and Search them **
The last set of experiments evaluated the performance of cluster-based retrieval using
an alternative strategy for selecting the leaf clusters that best match the query. For a
specific query we examined only the leaf clusters which contained all the MeSH
NIKOLAOS HOURDAKIS
CHAPTER 4. EXPERIMENTAL RESULTS
query terms in their centroid vectors. Then, the retrieved clusters were ranked
according to their similarity to the query. Regarding the number of finally selected
ranked clusters we examined three cases. To select all the retrieved clusters, the top
50, the top 30 or the top 15 from the ranked list. We called these experiments
AllQinCen_AllClusters, AllQinCen_Top_50Clusters AllQinCen_Top_30Clusters and
AllQinCen_Top_15Clusters. For each case a ranked list of documents was produced
by computing the cosine similarity between the documents of the selected clusters and
the query. To conduct each experiment we selected the 100 top-ranked documents
In Figure 4.14, we present the precision-recall diagram for the third set of
experiments. We compare the results with the curve produced by exhaustive search on
OHSUMED documents.
Analyzing the results in Figure 4.14 we observe that retrieval based on leaf
clusters of the hierarchy that contained all the query terms in their centroids is almost
as effective as the retrieval done by exhaustive search on OHSUMED. Document-
based retrieval achieved just up to 2% better precision and up to 2% better recall than
AllQinCen_AllClusters retrieval strategy.
Exhaustive Search

**Figure 4.14:** Precision-recall curves using the leaf clusters which contains all the

query terms in their centroids and a precision/recall curve produced by exhaustive
TECHNICAL UNIVERSITY OF CRETE
The other three examined retrieval methods (AllQinCen_50Clusters,
AllQinCen_30Clusters and AllQinCen_15Clusters) performed slightly worse than
AllQinCen_AllClusters and document-based retrieval strategy.
Additionally, we evaluated the efficiency of our proposed retrieval strategies
in terms of required similarity computations between the documents and the query. In
large document collections the computational overhead matching all documents with
the query is a major drawback in the retrieval process. Figure 4.15 shows for each
retrieval strategy the average number of documents searched over the 61 queries.
Regarding the exhaustive search on OHSUMED, 233445 documents were compared
to the query to produce the ranked document list. As far as AllQinCen_AllClusters
retrieval strategy is concerned, the corresponding average number of documents over
all queries was 71649, while for AllQinCen_50Clusters, AllQinCen_30Clusters and
AllQinCen_15Clusters retrieval methods were 46262, 34759 and 21606 respectively.

**"Avg Num be r of Docum ents se arche d ove r the 61 querie s"**
**Retrie val Strategy: Re trieve the cluste rs w hich contain all the M e SH Query **
**Term s in the ir Centroid.**
** D**

F

O 105000
**VSM AllClus ters**
**Top_50Clus ters Top_30Cluste rs Top_15Cluste rs**
**Figure 4.15**: The average number of searched documents over the 61 queries for the

four retrieval strategies examined in this set of experiments
Summarizing, Figure 4.15 indicate that the three proposed cluster-based
retrieval strategies achieved a significant decrease in time and space requirements as
compared to retrieval by exhaustive search. Mostly, AllQinCen_AllClusters retrieval
method not only saves a huge amount of computation but does so without significant
NIKOLAOS HOURDAKIS
CHAPTER 4. EXPERIMENTAL RESULTS
loss in precision and recall. We observe that among all the cluster-based information
retrieval strategies suggested in this section the best results are obtained in case of
AllQinCen_AllClusters method. This method is as effective as document-based
retrieval on OHSUMED and much more efficient.
Figure 4.16 presents a summary of the best proposed cluster-based retrieval
strategy (AllQinCen_AllClusters).

**Input: ** Bottom Level Clusters of the hierarchy (leaf clusters), Query q,

Document d (MeSH-based representation), Cosine Similarity

**Output: **Documents ordered by decreasing similarity with the query.

**1. **MeSH terms extraction from query using the extraction technique

presented in section 4.3.

**2. **Match the query against the leaf clusters which contain all the MeSH

query terms in their centroid vector. Use cosine similarity function.

**3. **Rank clusters by decreasing similarity with the query.

**4. **Match the query against documents in all retrieved clusters using cosine

similarity function.

**5. **Return a ranked list of documents to the user. (by decreasing similarity to

**Figure 4.16**: "AllQinCen_AllClusters" cluster-based retrieval method

TECHNICAL UNIVERSITY OF CRETE

**Chapter 5 **
**Conclusions **
We present a short summary of the research conducted in this thesis and provide
possible directions for future research.

**5.1 Summary **
The main objective of this thesis was to develop a highly efficient algorithm for
clustering large document collections. We focused on partitional clustering algorithms
mainly due to their low time complexity (i.e. linear on the number of documents) as
opposed to hierarchical clustering methods which have quadratic time complexity.
Therefore partitional techniques are well-suited for clustering large document
Initially, we focused on the standard K-Means clustering approach. We
implemented several variants of the original K-Means and we proposed a new variant,
the so-called "Incremental K-Means". Incremental K-Means differs from basic K-
Means in the way the centroids are updated during each clustering iteration. In K-
Means new centroids are computed after each iteration (after all documents have been
examined and assigned to clusters). Incremental K-Means updates centroids after a
document is assigned to a cluster.
Due to the very large size of document collections and the tremendous
explosion of electronic information available on the internet, there is an increased
need for effective and efficient clustering algorithms that would aim in reasonable
time even on such large document collections and create clusters that correspond to
real classes. However, both K-Means and Incremental K-Means produce a flat
partition of the data while a construction of a hierarchy of clusters using traditional
hierarchical clustering methods is computational prohibitive.
CHAPTER 5. CONCLUSIONS
In the following we examined the so-called Bisecting Incremental K-Means
which produces a hierarchical clustering solution by recursively applying the
Incremental K-Means on a document collection: All documents are initially
partitioned into two clusters. Then, the algorithm iteratively selects and bisects each
one of the bottom-level clusters until singleton leaf clusters are reached. Bisecting
Incremental K-Means can be thought as a divisive hierarchical clustering approach.
The so-obtained clusters are structured as a hierarchical binary tree. The run time of
the algorithm is

*O*(

*n *log(

*n*)) where n is the number of documents.
The main drawback of the Bisecting Incremental K-Means algorithm was that
terminates when each leaf cluster contains a single document. This is because there is
no prior knowledge on the desired number of clusters and moreover there is not a
criterion for stopping bisections before singleton clusters are reached. In case of large
document collections terminating at singleton clusters is time-consuming and the
clustering result does not correspond to real classes (mainly at leaf levels and close to
the meaningless leaf clusters).
To prevent over-splitting of clusters we proposed a strategy based on the
Bayesian Information Criterion (BIC) (introduced earlier in the liter
the divisive procedure. We use BIC to perform a splitting test at each leaf cluster in
order to decide whether a cluster should split or not. The BIC score is computed to
measure the improvement of a cluster when it is split. If the BIC score of the produced
cluster structure is less than BIC score of the parent cluster we do not split the initial
cluster. We terminate the divisive procedure when there is no separable leaf cluster
according to the BIC function
"BIC-Means", a novel hierarchical clustering algorithm, is the main
contribution of this thesis. Building upon Bisecting Incremental K-Means and BIC,
BIC-Means combines the advantages of all these ideas. Specifically, BIC-Means has
the following characteristics:
1. It is a Bisecting clustering approach which can be used to build a hierarchy of
clusters effectively.
2. It incorporates Incremental K-Means as the partitional method for bisecting
the selected leaf cluster at each bisecting step. Incremental K-Means
efficiently updates cluster's centroids.
TECHNICAL UNIVERSITY OF CRETE
3. It uses BIC as the splitting criterion of a cluster and proposes a strategy to stop
the divisive procedure based on the Bayesian Information Criterion (BIC).
As a result, BIC-Means produces clusters which are more meaningful as
compared to the singleton clusters of Bisecting Incremental K-Means. Overall the
proposed algorithm combines the strengths of partitional and hierarchical clustering
We run several sets of experiments. In the first set, we focused on the
evaluation of the document clustering algorithms proposed in this thesis. All methods
were tested using standard document collections (such as Reuters-21578 and
OHSUMED). F-Measure was used to measure the overall "goodness" of the
generated clusters. We examined how good the clusters produced by each clustering
method match the set of categories (or classes) assigned to the documents (by human
Experimental results on Reuters-2at the proposed
Incremental K-Means yielded noticeably better F-Measure than the standard K-
Means. Additionally, we showed that Incremental K-Means needs only a single
iteration of center adjustment to produce a good clustering partition. Then, we
examined the performance of Bisecting Incremental K-Means. The results indicated
that our Bisecting approach performs significantly better than Incremental K-Means
in terms of F-Measure on both data sets. We continued our experiments by examining
the performance of Bisecting Incremental K-Means method using vector
representation of OHSUMED documents consisting of MeSH terms. We compared
these results with those obtained by representation with single word terms. The results
indicated that Bisecting Incremental K-Means yields significantly better F-Measure
when the OHS0UMED documents are represented by MeSH terms rather than by
single word terms.
Then, we evaluated the proposed BIC-Means algorithm in terms of clustering
quality and clustering time and compared it with Bisecting Incremental K-Means.
Experimental results on both data sets showed that a main advantage of BIC-Means is
that requires significantly less time to build a cluster hierarchy than Bisecting
Incremental K-Means method (the algorithm does not have to reach at singleton
clusters at the leafs). In terms of F-Measure, BIC-Means achieved approximately the
same performance with Bisecting Incremental K-Means. Notice though that BIC-
NIKOLAOS HOURDAKIS
CHAPTER 5. CONCLUSIONS
Means was not expected to improve the clustering quality of our initial Bisecting
technique (the exhaustive approach). It was introduced as an algorithm that
incorporates a criterion for terminating the divisive procedure and for preventing from
reaching meaningless leaf clusters. Therefore, BIC-Means is more suited than its
competitors for clustering very large document collections effectively. This is due to
not only its low computational requirements, but also comparable performance.
Notice that, BIC-Means produces meaningful leaf clusters.
The second set of experiments focused on examining the effectiveness and
efficiency of a cluster-based information retrieval system. We applied the proposed
BIC-Means on OHSUMED (a very large document collection with 233445 medical
articles from Medline) in order to create a hierarchy of clusters. We demonstrated that
it is possible to apply clustering to reduce the size of the search (and therefore
retrieval response times) on large data sets such OHSUMED.
The search strategy relied on searching for the clusters that best match the
query. We tested several variants of the above idea. All searched clusters at the leaf
level of the hierarchy (intermediate clusters need not be searched as they contain
documents which are also combined by the leaf clusters). Each search strategy
incorporates different criteria for matching the query against leaf clusters. We
evaluated the cluster-based retrieval strategies and compared them against retrieval
results by exhaustive search on OHSUMED. In parallel, we examined the retrieval
strategies using MeSH terms in document and cluster representation. These are more
compact than single word representations and produce better clustering solutions on
medical data sets.
The experimental results indicated that among all cluster-based retrieval
strategies proposed in this thesis the best results are obtained in case we examined
only the leaf clusters which contained all the MeSH terms of the query in their
centroid vectors (we searched the documents which were contained in the retrieved
clusters). The best proposed cluster-based retrieval strategy searched only 30% of all
OHSUMED documents as opposed to the sequential one which matches all
documents (one by one) with the query. Experiments also demonstrated that this
strategy is almost as effective as the retrieval by exhaustive search on OHSUMED.
Summarizing, this cluster-based retrieval method runs faster without significant loss
in precision and recall.
TECHNICAL UNIVERSITY OF CRETE

**5.2 Future Work **
We present some open issues for future work in the following sub-sections.

**5.2.1 Additional Document Clustering and Retrieval Evaluation **
In this work, we experimentally evaluated our proposed document clustering
algorithms on two document collections (OHSUMED and Reuters). We plan to
extend our evaluation using other general or application specific data sets.
Furthermore, we would like to compare our experimental results with results have
reported by other hierarchical and partitional clustering algorithms proposed in the
In addition to document clustering experiments we proposed several cluster-
based retrieval strategies to improve retrieval by exhaustive search on OHSUMED. It
would be interesting to investigate additional cluster-based retrieval strategies. First,
"top-down" strategy proposes that the search begins from the root of the tree and
moves down the tree following the path of maximum similarity. Second, we would
like to examine the "bottom-up" strategy. The search starts from a bottom-level
cluster towards the root of the tree.

**5.2.2 Medline Clustering and Browsing **
In this thesis we applied the proposed BIC-Means algorithm on entire OHSUMED
(subset of Medline) to produce a hierarchy of clusters. In the future, we plan to apply
BIC-Means on the Medline database. Medline contains more that 15 million
references (version 2006) to journal articles in life sciences, medicine and bio-
medicine. Due to the huge size of Medline, it would be a challenging task for us to
organize this enormous amount of documents into meaningful clusters which contain
related documents. Thus, hierarchical clustering of Medline could be used to improve
the effectiveness and efficiency of document retrieval. The users will be able to locate
quickly and accurately relevant information. Moreover, the produced clustering result
will provide effective and intuitive browsing, navigation and summarization of the
millions Medline documents.
NIKOLAOS HOURDAKIS
CHAPTER 5. CONCLUSIONS

**5.2.3 Clustering Dynamic Document Collections **
Modern information systems have vast amount of un-organized data that change
dynamically. Consider for example the flow of information that arrives incrementally
on news wires message systems (Reuters, Marketwatch, Yahoo, etc) or a document
collection that varies over time as new documents arrive continuously, and they need
to be inserted in the collection. Clustering centroids are updated continuously and
after a while clustering has to be recomputed.
Static clustering algorithms, such BIC-Means, generate a fixed number of
clusters. We plan to develop the dynamic version of BIC-Means. It might
incrementally compute clusters of similar documents, supporting both insertions and
deletions. As a new document is inserted or deleted from the corpus the hierarchy of
clusters is re-organized. This process would incorporate either new split on leaf
clusters or merges of existing leaf clusters.
Summarizing, the dynamic clustering algorithm will keep dynamic corpora or
databases organized. So far, dynamic versions of clustering algorithms have not been
examined adequately in the literature. Dynamic clustering can be applied in research
areas, such peer to peer systems and sensor networks, as well.

**5.2.4 Semantic Similarity Methods in Document Clustering **
In this study, the similarity between two documents is computed according to the
Vector Space Model (Vs the cosine of the inner product between their
document vectors. VSM relates documents that use identical terminology. However,
plain lexicographic analysis and matching between terms is not generally sufficient to
determine if two terms are similar and consequently whether two documents are
similar. The lack of common terms in two documents does not necessarily mean that
the documents are not related. Two terms can be semantically similar (e.g., can be
synonyms or have similar meaning) although they are lexically different terms in the
documents. Therefore, computing document similarity by word-based classical
information retrieval models (e.g., VSM, Probabilistic, Boolean) is not so effective.
For example, VSM will not recognize synonyms or semantically similar terms (e.g.,
"car", "automobile").
In order to take advantage of semantically similar terms, we plan to integrate
semantic knowledge into proposed document clustering algorithms. Several methods
for determining semantic similarity between terms have been proposed in the
TECHNICAL UNIVERSITY OF CRETE
literature, most of them using ontologies such as WordNet
selection of ontology depends on the application domain. In case of natural language
terms semantic similarity can be implemented and evaluated using WordNet as the
underlying reference ontology. WordNet is a controlled vocabulary and thesaurus
offering a taxonomic hierarchy of natural language terms developed at Princeton
University. In case of medical terms semantic similarity can be computed using the
MeSH ontology which contains medical and biomedical terms. As we used
OHSUMED in many document clustering experiments, MeSH ontology will be
appropriate for computing semantic similarity between medical terms and
consequently between OHSUMED documents.
Regarding our cluster-based information retrieval experiments, documents that
contained related information but their context was described by other terms, were not
returned to the user. For example, let's say that some documents use the term "ache"
instead of "pain". Although the two terms are synonyms, if the user's query contains
just the term "ache", documents that use "pain" instead, won't be returned.
For this, it would be interesting to investigate cluster-based retrieval methods
capable for discovering semantic similarities between documents and queries. In our
retrieval experiments on OHSUMED, retrieval by semantic similarity could be
applied by using MeSH as the underlying reference ontology and by associating terms
using semantic similarity me
NIKOLAOS HOURDAKIS
CHAPTER 5. CONCLUSIONS
TECHNICAL UNIVERSITY OF CRETE

**References **
[1] Charu C. Aggarwal, Stephen C. Gates, and Philip S. Yu

*. On the merits of *
*building categorization systems by supervised clustering.* In Proc. of the Fifth
ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining,
[2] Ricardo Baeza-Yates and Berthier Ribeiro-Neto.

*Modern Information Retrieval*.
Addison Wesley Longman, 1999.
[3] Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler (2000).

*Extensible *
*Markup Language (XML) 1.0 (Second Edition) *W3C Recommendation, October
2000

*. *Available at: http://www.w3.org/TR/2000/REC-xml-20001006/
[4] D. Chickering, D. Heckerman, and C. Meek.

*A Bayesian Approach to Learning *
*Bayesian Networks with Local Structure*. In Proceedings of Thirteenth
Conference on Uncertainty in Artificial Intelligence, pages 80–89. Morgan
[5] W. B. Croft.

*A model of cluster searching based on classification.* *Information *
*Systems*, Vol. 5, pp. 189-195, 1980.
[6] Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey,

*Scatter/Gather: A* *Cluster-based Approach to Browsing Large Document*
*Collections*. SIGIR ‘92, pages 318 – 329, 1992
[7] E. Diday and J. C. Simon.

*Clustering analysis. In Digital Pattern Recognition, *
K. S. Fu, Ed. Springer-Verlag, Secaucus, NJ, pages 47–94, 1976
[8] C. Ding, X. He.

*Cluster merging and splitting in hierarchical clustering *
*algorithms.* In IEEE International Conference on Data Mining (ICDM'02),
[9] C. Ding, X. He, H. Zha, M. Gu, and H. Simon.

*A min-max cut algorithm for *
*graph partitioning and data clustering.* Proc. IEEE Int'l Conf. Data Mining,
pages 107-114, 2001.
[10] R.O. Duda, P.E. Hart, and D.G. Stork.

*Pattern Classification*. John Wiley &
[11] W. Douglas Johnston Stuart J. Nelson and Betsy L. Humphreys. Relationships in
Medical Subject Headings (MeSH). In

*National Library of Medicine, Bethesda,*
*MD, USA*, 2002.
[12] B. S. Duran and P. L. Odell.

*Cluster* *Analysis: A Survey*. Springer-Verlag, New-
[13] V. Faber.

*Clustering and the Continuous k-Means Algorithm*, Los Alamos
Science, November 22, 1994.
[14] Benjamin C. M. Fung, Ke Wang, Martin Ester.

*Hierarchical Document *
*Clustering Using Frequent Itemsets.* Proceedings of the 2003 SIAM
International Conference on Data Mining. T
[15] N. Fuhr. Probabilistic M

*odels in Information Retrieval*. The Computer Journal,
vol. 35, no. 3, pages 243--255, 1992.
[16] S. Guha, R. Rastogi, and K. Shim.

*CURE: An efficient clustering algorithm for *
*large databases.* In Proceedings of the Int'l Conference on Management of Data
(SIGMOD'98) (Seattle, WA). ACM Press, June 1998.
[17] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim.

*ROCK: a robust clustering *
*algorithm for categorical attributes.* In Proc. of the 15th Int'l Conf. on Data
[18] A. El-Hamdouchi and P. Willett.

*Comparison of Hierarchic Agglomerative *
*Clustering Methods for Document Retrieval*. The Computer Journal 32(3): 220-
[19] Jiawei Han and Michelle Kamber.

*Data Mining: Concepts and Techniques.*
Morgan Kaufmann, 2001.
[20] J.A. Hartigan and M.A. Wong.

*A K-Means clustering algorithm*. In applied
statistics, pages 208-220
TECHNICAL UNIVERSITY OF CRETE
[21] W. Hersh, C. Buckley, T.J. Leone, and D. Hickam.

*OHSUMED: An interactive *
*retrieval evaluation and new large test collection for research.* In SIGIR-94,
pages 192–201, 1994.
[22] A. Hliautakis. S

*emantic Similarity Measures in MeSH Ontology and their *
*application to Information Retrieval on Medline. *Technical Report TR-TUC-
ISL-02-2005, September 2005. Available on the WWW at URL
[23] J. E. Jackson.

*A User's Guide To Principal Components*. John Wiley & Sons,
[24] A.K. Jain and R. C. Dubes.

*Algorithms for Clustering Data*. Prentice Hall, 1988.
[25] A.K. Jain., M.N. Murty, and P.J Flynn,

*Data Clustering: A Review*. ACM
Computing Surveys, 31(3), pp.264-323, 1999.
[26] N. Jardine and C. J. van Rijsbergen.

*The use of hierarchic clustering in *
*information retrieval. Information Storage and Retrieval*, 7:217–240, 1971.
[27] W. Douglas Johnston, Stuart J. Nelson and Betsy L. Humphreys.

*Relationships *
*in Medical Subject Headings (MeSH). *In National Library of Medicine,
Bethesda, MD, USA, 2002.
[28] G. Karypis, E.H. Han, and V. Kumar.

*Chameleon: A hierarchical clustering *
*algorithm using dynamic modeling.* *IEEE* *Computer*, 32(8):68–75, 1999.
[29] R. Kass and L. Wasserman.

*A reference Bayesian test for nested hypotheses and *
*its relationship to the Schwarz criterion.* Journal of the American Statistical
Association, 90, 773-795, 1995.
[30] L. Kaufman and P.J. Rousseeuw.

*Finding Groups in Data: an Introduction to *
*Cluster Analysis*. John Wiley & Sons, 1990.
[31] B. King.

*Step-wise clustering procedures*. Journal of the American Statistical
Association, 69:86–101, 1967.
NIKOLAOS HOURDAKIS
[32] Bjornar Larsen and Chinatsu Aone.

*Fast and effective text mining using linear-*
*time document clustering.* In

*Proc. of the* *Fifth ACM SIGKDD Int'l Conference *
*on Knowledge* *Discovery and Data Mining*, pages 16–22, 1999.
[33] C. Leacock and M. Chodorow.

*Combining Local Context and WordNet *
*Similarity for Word Sense Identification in WordNet.* In Fellbaum, C., ed.: An
Electronic Lexical Database. MIT Press (1998) 265–283
[34] R.C.T. Lee. Clustering

*analysis and its applications.* In J.T. Toum, editor,

*Advances in Information Systems Science*. Plenum Press, New York, 1981.
[35] D. D. Lewis.

*Reuters-21578 text categorization test collection distribution 1.0.*
http://www.research.att.com/lewis, 1999.
[36] Y, Li, Z.A. Bandar, and D. McLean.

*An Approach for Measuring Semantic *
*Similarity between Words Using Multiple Information Sources*. IEEE Trans. on
Knowledge and Data Engineering 15(4) (2003) 871–882
[37] D. Lin.

*Principle-Based Parsing Without Overgeneration*. In: Annual Meeting of
the Association for Computational Linguistics (ACL'93), Columbus, Ohio
(1993) 112–120
[38] R. Michalski, R. E. Stepp, and E. Diday.

*A recent advance in data analysis: *
*Clustering objects into classes characterized by conjunctive concepts.* In
Progress in Pattern Recognition, Vol. 1, L. Kanal and A. Rosenfeld, Eds. North-
Holland Publishing Co., Amsterdam, The Netherlands, 1981
[39] G. Nagy.

*State of the art in pattern recognition*. Proc. IEEE 56, 836-862, 1968.
[40] R. Ng and J. Han.

*Efficient and effective clustering method for spatial data *
*mining.* In Proc. of the 20th VLDB Conference, pages 144–155, Santiago, Chile,
[41] T. Nomoto and Y. Matsumoto.

*A New Approach to Unsupervised Text *
*Summarization.* Proceedings of the 24th ACM SIGIR Conference on Research
and Development in Information Retrieval (SIGIR'01), pages 26–34, 2001.
TECHNICAL UNIVERSITY OF CRETE
[42] D. Pelleg and A. Moore.

*X-means: Extending K-means with Efficient Estimation *
*of the Number of Clusters.* In Seventeenth International Conference on Machine
[43] Qi Quifen, Gao Qigang and Michael Shepherd.

*Accessing Tacit Knowledge in *
*the Pediatric Pain e-Mail Archives.* Proceedings of the 38th Hawaii International
Conference on System Sciences – 2005.
[44] R. Rada, H. Mili, E. Bicknell, and M. Blettner.

*Development and Application of *
*a Metric on Semantic Nets.* IEEE Trans.on Systems, Man, and Cybernetics 19(1)
[45] O. Resnik.

*Semantic Similarity in a Taxonomy: An Information-Based Measure *
*and its Application to Problems of Ambiguity and Natural Language*. Journal of
Artificial Intelligence Research 11 (1999) 95–130
[46] Van Rijsbergen, C.J. & Croft, W. B.

*Document clustering: An evaluation of *
*some experiments with the Cranfield 1400 collection. *Information Processing

* & *
Management, 11, pp. 171-182, 1975.
[47] Van Rijsbergen, C.J.

*Information Retrieval *(2nd ed.). London: Butterworths,
[48] S. Robertson.

*The Probability Ranking Principle in IR.* Journal of
Documentation 33, pages 294-304, 1977.
[49] M. Rodriguez and M. Egenhofer.

*Determining Semantic Similarity Among Entity *
*Classes from Different Ontologies.* IEEE Trans. on Knowledge and Data
Engineering 15(2) (2003) 442–456
[50] G. Salton.

*Automatic Text Processing: The Transformation, Analysis, and *
*Retrieval of Information by Computer*. Addison-Wesley, 1989.
[51] G. Salton.

*Introduction to Modern Information Retrieval.* McGraw-Hill, 1983.
[52] G. Schwarz, Estimating

*the dimension of a model.* Ann. Statistics 6, 461-464,
NIKOLAOS HOURDAKIS
[53] R. Sibson.

*SLINK: An optimally efficient algorithm for the single link cluster *
*method.* Computer Journal, 16:30-34, 1973.
[54] P. H. Sneath and R. R. Sokal.

*Numerical Taxonomy. *Freeman, London, UK,
[55] M.Steinbach, G. Karypis, and V. Kumar.

*A comparison of document clustering *
*techniques.* In KDD Workshop on Text Mining, 2000.
[56] A. Tversky.

*Features of Similarity*. Psychological Review 84(4) (1977) 327–352
[57] G. Varelas.

*Semantic Similarity Methods in WordNet and Their Application to *
*Information Retrieval on the Web.* Technical Report TR-TUC-ISL-01-2005.
Department of Electronic and Computer Engineering, Chania, Crete, Greece,
2005. Available on the WWW at URL
[58] E. M. Voorhees, E.M.

*The cluster hypothesis revisited.* In SIGIR 1985, pp.188-
[59] Wikipedia: http://en.wikipedia.org/wiki/MeSH
[60] Y.Yang.

*An evaluation of statistical approaches to text categorization.* Journal
of Information Retrieval, 1(1/2):67-88, 1999.
[61] Y.Yang and J. Pedersen.

*A comparative study on feature selection in text *
*categorization.* In J. D. H. Fisher, editor, The Fourteenth International
Conference on Machine Learning (ICML'97), pages 412-420. Morgan
[62] Oren Zamir, Oren Etzioni, Omid Madani and Richard M. Karp,

*Fast and *
*Intuitive Clustering of Web Documents*, KDD '97, Pages 287-290, 1997.
[63] Ying Zhao and George Karypis.

*Criterion functions for document clustering: *
*Experiments and analysis.* Technical Report TR#01–40, Department of
Computer Science, University of Minnesota, Minneapolis, MN, 2001. Available
on the WWW at

*http://cs.umn.edu/˜karypis/publications*.
TECHNICAL UNIVERSITY OF CRETE
[64] Y. Zhao and G. Karypis.

*Soft Clustering Criterion Functions for Partitional *
*Document Clustering*. Technical Report TR-04-022, Department of Computer
Science, University of Minnesota, Minneapolis, 2004. Available on the WWW
at URL

*http://www.cs.umn.edu/˜karypis*.
[65] Ying Zhao, George Karypis, Jack G. Conrad and Khalid Al-Kofahi.

*Effective *
*Document Clustering for Large Heterogeneous Law Firm Collections.* In
Proceedings of ICAIL, 2005.
[66] Y. Zhao and G. Karypis.

*Evaluation of hierarchical clustering algorithms for *
*document datasets.* In Proc. of Int'l. Conf. on Information and Knowledge
Management, pages 515–524, 2002.
[67] K. Wang, S. Zhou, Y. He, "

*Hierarchical classification of real life documents", *
SIAM International Conference of Data Mining (SDM'01, Chicago, United
[68] P. Willett, "

*Recent Trends in Hierarchic Document Clustering: A Critical *
*Review"*, Information Processing & Management, 24(5), pp. 577-597, 1988.
NIKOLAOS HOURDAKIS
TECHNICAL UNIVERSITY OF CRETE

**Appendix A **
**Appendix **
**A.1 MeSH DTD File **
<!-- MESH DTD files for descriptors desc2006.dtd -->
<!ENTITY % DescriptorReference "(DescriptorUI, DescriptorName)">
<!ENTITY % normal.date "(Year, Month, Day)">
<!ENTITY % ConceptReference" (ConceptUI, ConceptName,
ConceptUMLSUI?)">
<!ENTITY % QualifierReference "(QualifierUI, QualifierName)">
<!ENTITY % TermReference "(TermUI, String)">
<!ELEMENT DescriptorRecordSet (DescriptorRecord*)>
<!ELEMENT DescriptorRecord (%DescriptorReference;,
DateEstablished?,
ActiveMeSHYearList,
AllowableQualifiersList?,
PublicMeSHNote?,
PreviousIndexingList?,
EntryCombinationList?,
SeeRelatedList?,
TreeNumberList?,
RecordOriginatorsList,
ConceptList) >
<!ATTLIST DescriptorRecord DescriptorClass (1 2 3 4) "1">
<!ELEMENT ActiveMeSHYearList (Year+)>
<!ELEMENT AllowableQualifiersList (AllowableQualifier+) >
<!ELEMENT AllowableQualifier (QualifierReferredTo,Abbreviation )>
<!ELEMENT Annotation (#PCDATA)>
<!ELEMENT ConsiderAlso (#PCDATA) >
<!ELEMENT Day (#PCDATA)>
<!ELEMENT DescriptorUI (#PCDATA) >
<!ELEMENT DescriptorName (String) >
<!ELEMENT DateCreated (%normal.date;) >
<!ELEMENT DateRevised (%normal.date;) >
<!ELEMENT DateEstablished (%normal.date;) >
<!ELEMENT DescriptorReferredTo (%DescriptorReference;) >
<!ELEMENT EntryCombinationList (EntryCombination+) >
<!ELEMENT EntryCombination (ECIN,
<!ELEMENT ECIN (DescriptorReferredTo,QualifierReferredTo) >
<!ELEMENT ECOUT (DescriptorReferredTo,QualifierReferredTo? ) >
<!ELEMENT HistoryNote (#PCDATA)>
<!ELEMENT Month (#PCDATA)>
<!ELEMENT OnlineNote (#PCDATA)>
<!ELEMENT PublicMeSHNote (#PCDATA)>
<!ELEMENT PreviousIndexingList (PreviousIndexing)+>
<!ELEMENT PreviousIndexing (#PCDATA) >
<!ELEMENT RecordOriginatorsList (RecordOriginator,
RecordMaintainer?,
RecordAuthorizer? )>
<!ELEMENT RecordOriginator (#PCDATA)>
<!ELEMENT RecordMaintainer (#PCDATA)>
TECHNICAL UNIVERSITY OF CRETE
<!ELEMENT RecordAuthorizer (#PCDATA)>
<!ELEMENT RunningHead (#PCDATA)>
<!ELEMENT QualifierReferredTo (%QualifierReference;) >
<!ELEMENT QualifierUI (#PCDATA) >
<!ELEMENT QualifierName (String) >
<!ELEMENT Year (#PCDATA) >
<!ELEMENT SeeRelatedList (SeeRelatedDescriptor+)>
<!ELEMENT SeeRelatedDescriptor (DescriptorReferredTo)>
<!ELEMENT TreeNumberList (TreeNumber)+>
<!ELEMENT TreeNumber (#PCDATA)>
<!ELEMENT ConceptList (Concept+) >
<!ELEMENT Concept (%ConceptReference;,
RegistryNumber?,
SemanticTypeList?,
PharmacologicalActionList?,
RelatedRegistryNumberList?,
ConceptRelationList?,
<!ATTLIST Concept PreferredConceptYN (Y N) #REQUIRED >
<!ELEMENT ConceptUI (#PCDATA)>
<!ELEMENT ConceptName (String)>
<!ELEMENT ConceptRelationList (ConceptRelation+) >
<!ELEMENT ConceptRelation (Concept1UI,
RelationAttribute?)>
<!ATTLIST ConceptRelation RelationName (NRW BRD REL) #IMPLIED >
<!ELEMENT Concept1UI (#PCDATA)>
<!ELEMENT Concept2UI (#PCDATA)>
<!ELEMENT ConceptUMLSUI (#PCDATA)>
<!ELEMENT CASN1Name (#PCDATA)>
<!ELEMENT PharmacologicalActionList (PharmacologicalAction+)>
NIKOLAOS HOURDAKIS
<!ELEMENT PharmacologicalAction (DescriptorReferredTo) >
<!ELEMENT RegistryNumber (#PCDATA)>
<!ELEMENT RelatedRegistryNumberList (RelatedRegistryNumber+)>
<!ELEMENT RelatedRegistryNumber (#PCDATA)>
<!ELEMENT RelationAttribute (#PCDATA)>
<!ELEMENT ScopeNote (#PCDATA)>
<!ELEMENT SemanticTypeList (SemanticType+)>
<!ELEMENT SemanticType (SemanticTypeUI, SemanticTypeName) >
<!ELEMENT SemanticTypeUI (#PCDATA)>
<!ELEMENT SemanticTypeName (#PCDATA)>
<!ELEMENT TermList (Term+)>
<!ELEMENT Term (%TermReference;,
ThesaurusIDlist?)>
<!ATTLIST Term ConceptPreferredTermYN (Y N) #REQUIRED
IsPermutedTermYN (Y N) #REQUIRED
LexicalTag (ABB ABX ACR ACX EPO LAB NAM NON TRD)
PrintFlagYN (Y N) #REQUIRED
RecordPreferredTermYN (Y N) #REQUIRED>
<!ELEMENT TermUI (#PCDATA)>
<!ELEMENT String (#PCDATA)>
<!ELEMENT Abbreviation (#PCDATA)>
<!ELEMENT SortVersion (#PCDATA)>
<!ELEMENT EntryVersion (#PCDATA)>
<!ELEMENT ThesaurusIDlist (ThesaurusID+)>
<!ELEMENT ThesaurusID (#PCDATA)>
TECHNICAL UNIVERSITY OF CRETE

**A.2 Document Collections **
**A.2.1 Reuters-21578 **
Figure A.1 presents a Reuters-21578 document to demonstrate its structure and its
attributes. The Reuters's document tags are bolded.

**<REUTERS TOPICS="YES" LEWISSPLIT="TEST" **
**<DATE>** 2-JUN-1987 10:54:40.06

**</DATE>**
d f BC-ORION-BROADCAST-<OBGI 06-02 0079

**</UNKNOWN>**
**<TITLE>**ORION BROADCAST <OBGI.O> BUYS FORD <F>

**<DATELINE>** DENVER, June 2 -

**</DATELINE><BODY>**Orion Broadcast

Group Inc said its majority-owned Orion Financial Services Corp subsidiary has
agreed to purchase FN Realty Services Inc from Ford Motor Co for 1,200,000 to
1,500,000 dlrs in cash and notes.
It said closing is expected within 45 days after receipt of regulatory approvals.
FN provides loan collection, accounting, data processing and administrative
services to the real estate industry.

** </BODY></TEXT> **
**Figure A.1: **A Reuters-21578 document

NIKOLAOS HOURDAKIS

**A.2.2 OHSUMED **
Figure A.2 illustrates an OHSUMED document and Figure A.3 presents the field
(attribute) definitions.
Am J Emerg Med 8703; 4(6):514-5
Abdominal Injuries/ET; Accidents, Occupational; Accidents, Traffic/*; Adult;
Amputation; Blood Transfusion/*; Case Report; Female; Fractures/ET; Human;
Pelvic Bones/IN; Shock, Hemorrhagic/ET/*TH; Wounds, Nonpenetrating/*CO.
Massive transfusion without major complications after trauma.
JOURNAL ARTICLE.
A case of massive degloving injury of the trunk, with open pelvic fracture, and
evisceration of abdominal contents from blunt trauma is presented. The most
significant aspect of this case was the transfusion of 173 units of packed cells and 176
units of fresh frozen plasma in the first thirty hours. The patient ultimately recovered
and returned to work.
Brotman S; Lamonica C; Cowley RA.

**Figure A.2**: An OHSUMED document

(important note: documents should be processed in this
MEDLINE identifier (UI) (<DOCNO> used for relevance judgements)
Human-assigned MeSH terms (MH)
Publication type (PT)
TECHNICAL UNIVERSITY OF CRETE

**Figure A.3:** Field Definitions

**A.3 Lucene **
Lucene is a full-featured (Java-based) open source toolkit for text indexing and
searching. It is easy to use, flexible, and powerful - a model of good object-oriented
software architecture. Powerful abstractions and useful concrete implementations
make Lucene very flexible, and allow new users to get up and running quickly and
painlessly. We use Lucene in order to perform various operations needed by the
document clustering and retrieval experiments we conduct as part of this work
(indexing, searching etc). Lucene is freely available at

*http://lucene.apache.org.*
**A.4 Retrieval on OHSUMED – Evaluation Queries **
**A.4.1 61 Original OHSUMED Queries **
For the retrieval evaluation we used a subset of 61 queries of the original OHSUMED
query set. We present them below.

**.I 2 **

.B

60 yo male with disseminated intravascular coagulation

.W

pathophysiology and treatment of disseminated intravascular coagulation

**.I 3 **

.B

prolonged prothrombin time

.W

anticardiolipin and lupus anticoagulants, pathophysiology, epidemiology,

complications

** **

.I 5

.B

58 yo with cancer and hypercalcemia

.W

effectiveness of etidronate in treating hypercalcemia of malignancy

NIKOLAOS HOURDAKIS

** **

.I 6

.B

55 yo female,postmenopausal

.W

does estrogen replacement therapy cause breast cancer

**.I 9 **

.B

30 year old with fever, lymphadenopathy, neurologic changes and rash

.W

t-cell lyphoma associated with autoimmune symptoms

**.I 10 **

.B

57yo male with hypercalcemia secondary to carcinoma

.W

effectiveness of gallium therapy for hypercalcemia

**.I 12 **

.B

30 y old female suvivor of satanic cult

.W

descriptions of injuries associated with cult activities

**.I 14 **

.B

35 y o male with aids and pancytopenia

.W

pancytopenia in aids, workup and etiolog

** **

.I 16

.B

chronic fatigue syndrome

.W

chronic fatigue syndrome, managment and treatment

** **

.I 17

.B

29 yo female 3 months pregnant

.W

Rh isoimmunization, review topics

**.I 18 **

.B

endocarditis

.W

endocarditis, duration of antimicrobial therapy

**.I 19 **
TECHNICAL UNIVERSITY OF CRETE
.B

18 yo pregnant woman with hyperthroidism

.W

use of beta-blockers for thyrotoxicosis during pregnancy

**.I 20 **

.B

cerebral palsy with depression

.W

relationship of cerebral palsy and depression

**.I 22 **

.B

35 yo with advanced metastatic breast cancer

.W

chemotherapy advanced for advanced metastatic breast cancer

** **

.I 25

.B

49 yo B male with hypotension, hypokalemia, and low aldosterone.

.W

isolated hypoaldosteronism, syndromes where hypoaldosteronism and hypokalemia

occur concurrently

** **

.I 29

.B

24 y o female g1 p0 9 months pregnant with thrombocytopenia

.W

thrombocytopenia in pregnancy, etiology and management

** **

.I 30

.B

63 y o male with acute renal failure probably 2nd to aminoglycosides/contrast dye

.W

acute tubular necrosis due to aminoglycosides, contrast dye, outcome and treatment

** **

.I 31

.B

45 yo wf , chronic lower extremity pain

.W

chronic pain management, review article, use of tricyclic antidepressants

** **

.I 32

.B

40 y o male with cocaine withdrawal

.W

cocaine withdrawal management

** **

.I 33

.B

NIKOLAOS HOURDAKIS
67 yo wm with hemiballismus

.W

carotid endarterectomy, when to perform

** **

.I 35

.B

42 YO WITH HEPATOCELLULAR CARCINOMA

.W

RISK FACTORS and TREATMENT for HEPATOCELLULAR CARCINOMA

** **

.I 37

.B

FATIGUE

.W

FIBROMYALGIA/FIBROSITIS, DIAGNOSIS AND TREATMENT

** **

.I 38

.B

DIABETIC GASTROPARESIS

.W

DIABETIC GASTROPARESIS, TREATMENT

** **

.I 39

.B

35 Y O WITH GASTROENTERITIS

.W

VIRAL GASTROENTERITIS, CURRENT MANAGEMENT

** **

.I 41

.B

46 Y0 NEW ASCITES

.W

ASCITES, DIFFERENTIAL diagnosis and work-up

** **

.I 42

.B

31yo female with downs syndrome

.W

keratoconus, treatment options

**.I 43 **

.B

55 yo male with back pain

.W

back pain, information on diagnosis and treatment

** **

.I 46

.B

64 yo black male

.W

TECHNICAL UNIVERSITY OF CRETE
occult blood sceening, need for routine screening

** **

.I 48

.B

35 year old with peripheral neuropathy and edema

.W

which peripheral neuropathies have associated edema

** **

.I 50

.B

62 year old with stroke and systolic hypertension

.W

isolated systolic hypertension, shep study

** **

.I 52

.B

74 yo man with post-radiation pericardial effusion and near tamponade

.W

indications for and success of pericardial windows and pericardectomies

** **

.I 54

.B

older male

.W

angiotensin converting enzyme inhibitors, review article

** **

.I 55

.B

24 y. o. w. f. s/p DVT currently on coumadin

.W

course of anticoagulation with coumadin

** **

.I 57

.B

22 yo with fever, leukocytosis, increased intracranial pressure, and central herniation

.W

cerebral edema secondary to infection, diagnosis and treatment

** **

.I 58

.B

65 yo female with a breast mass

.W

diagnostic and therapeutic work up of breast mass

** **

.I 60

.B

28 yr old male with endocarditis

.W

treatment of endocarditis with oral antibiotics

NIKOLAOS HOURDAKIS

**.I 62 **

.B

26 y o female with bulimia

.W

evaluation for complications and management of bulimia

** **

.I 63

.B

migraine

.W

treatment of migraine headaches with beta blockers and calcium channel blockers

** **

.I 64

.B

30 y o with hypothermia

.W

prevention, risk factors, pathophysiology of hypothermia

** **

.I 66

.B

35 female with pickwickian syndrome

.W

complications of prolonged progesterone

** **

.I 69

.B

70 y o female with left lower quadrant pain

.W

diverticulitis, differential diagnosis and management

**.I 71 **

.B

27 yo with cystic fibrosis and renal failure

.W

cystic fibrosis and renal failure, effect of long term repeated use of aminoglycosides

** **

.I 73

.B

23 YO male with alcolol abuse here for TIPS procedure.

.W

portal hypertension and varices, management with TIPS procedure

** **

.I 74

.B

43 y o female with fevers, increased CPK

.W

neuroleptic malignant syndrome, differential diagnosis, treatment

** **

.I 75

.B

TECHNICAL UNIVERSITY OF CRETE
carcinoid tumors of the liver and pancreas

.W

carcinoid tumors of the liver and pancreas, research, treatments

** **

.I 77

.B

30 y o with dehydration, hyperthermia

.W

heat exhaustion, management and pathophysiology

** **

.I 79

.B

25 y o female with anorexia/bulimia

.W

complications and management of anorexia and bulimia

** **

.I 81

.B

48 y o with culture negative endocarditis suspected

.W

culture negative endocarditis, organisms, diagnosis, treatment

** **

.I 82

.B

24 y o with HIV

.W

aids dementia, workup

** **

.I 83

.B

patient s/p renal transplant with fever

.W

infections in renal transplant patients

** **

.I 84

.B

50 year old with copd

.W

theophylline uses--chronic and acute asthma

** **

.I 88

.B

lung cancer

.W

lung cancer, radiation therapy

** **

.I 89

.B

60 year old with lung abscess

.W

NIKOLAOS HOURDAKIS
surgery vs. percutaneous drainage for lung abscess

** **

.I 90

.B

30 year old female with paroxysmal anaphylaxis

.W

Catamenorrheal Anaphylaxis

** **

.I 92

.B

66 year old male with Guillain-Barre syndrome

.W

Guillain-Barre syndrome, Sensitivity and specificity of nerve conduction velocity

tests

** **

.I 94

.B

23 YO W DYSURIA

.W

Urinary Tract Infection, CRITERIA FOR TREATMENT AND ADMISSION

** **

.I 96

.B

41 yo w f here for new visit healthy otherwise

.W

preventive health care for the adult patient

** **

.I 100

.B

32 yo schizophrenic patient with peripheral neuropathy

.W

association of neuroleptics and peripheral neuropathy

** **

.I 103

.B

50 yo woman with breakthrough vaginal bleeding while on estrogen and progesterone

therapy

.W

differential diagnosis of breakthrough vaginal bleeding while on estrogen and

progesterone therapy

** **

.I 105

.B

68 yo woman with anemia of chronic illness

.W

review of anemia of chronic illness

** **

.I 106

.B

42 yo w/HIV and diarrhea

TECHNICAL UNIVERSITY OF CRETE
.W HIV and the GI tract, recent reviews

**A.4.2 MeSH-Based Representations of the 61 OHSUMED Queries **
We present the corresponding MeSH-based representations of the 61 queries
presented in subsection A.4.1. The number at the left is the identifier of each query.
2. disseminated_intravascular_coagulation male therapeutics
prothrombin_time
anticoagulants epidemiology
hypercalcemia neoplasms etidronic_acid
breast_neoplasms female estrogen_replacement_therapy
exanthema fever t-lymphocytes lymphatic_diseases
gallium hypercalcemia carcinoma male
wounds_and_injuries female
14. acquired_immunodeficiency_syndrome pancytopenia male
fatigue_syndrome,_chronic therapeutics
rh_isoimmunization female
18. endocarditis
pregnancy thyrotoxicosis pregnant_women
cerebral_palsy depression
breast_neoplasms drug_therapy
hypoaldosteronism aldosterone male hypokalemia
pregnancy female thrombocytopenia
30. necrosis male aminoglycosides kidney_failure kidney_failure,_acute
antidepressive_agents,_tricyclic lower_extremity
dyskinesias endarterectomy,_carotid
carcinoma,_hepatocellular risk_factors therapeutics
fatigue diagnosis therapeutics
gastroparesis therapeutics
39. gastroenteritis
diagnosis,_differential ascites
NIKOLAOS HOURDAKIS
back_pain male diagnosis therapeutics world_health_organization prognosis
mass_screening male occult_blood
edema peripheral_nervous_system_diseases
hypertension cerebrovascular_accident
pericardiectomy pericardial_effusion
peptidyl-dipeptidase_a enzyme_inhibitors male
57. intracranial_pressure
leukocytosis fever diagnosis infection
work breast female therapeutics
male endocarditis anti-bacterial_agents therapeutics
bulimia female evaluation_studies
63. calcium_channels
calcium_channel_blockers migraine_disorders therapeutics
hypothermia risk_factors
66. obesity_hypoventilation_syndrome progesterone female
blood lupus vasculitis rectum
pain diagnosis,_differential female diverticulitis
cystic_fibrosis aminoglycosides kidney_failure
hypertension,_portal methods male varicose_veins
pancreas liver research carcinoid_tumor therapeutics
fever heat_exhaustion dehydration
bulimia female anorexia
culture diagnosis endocarditis therapeutics
hiv dementia acquired_immunodeficiency_syndrome
patients fever infection transplants
84. pulmonary_disease,_chronic_obstructive asthma theophylline
lung_neoplasms radiation
surgery lung_abscess drainage
female anaphylaxis
92. neural_conduction
sensitivity_and_specificity guillain-barre_syndrome male
94. urinary_tract
urinary_tract_infections therapeutics
100. antipsychotic_agents patients peripheral_nervous_system_diseases association
103. estrogens progesterone diagnosis,_differential hemorrhage women
TECHNICAL UNIVERSITY OF CRETE
105. chronic_disease anemia women
gastrointestinal_tract diarrhea
NIKOLAOS HOURDAKIS

Source: http://www.intelligence.tuc.gr/lib/downloadfile.php?id=268

Drosophila and Antioxidant Therapy F. Missirlis1, J.P. Phillips2, H. Jäckle3 and T.A. Rouault1 1Cell biology and Metabolism Branch, National Institute of Child Health and Human Development, Bethesda, Maryland, U.S.A. 2Molecular Biology and Genetics, University of Guelph, Ontario, Canada. 3Max-Planck-Institut für biophysikalische Chemie, Göttingen, Germany

Prescription drug Drug Classification No. 629 Harvoni tab. Dosage Forms and strength Each tablet contains Sofosbuvir ……….……………………………………………………………………………………… 400 mg Additives(tar colorant) : Yellow #5 Form A orange, diamond-shaped, film-coated tablet, debossed with "GSI" on one side