Deamo.prof.ufu.br
MILPRIT
∗: A Constraint-Based Algorithm for Mining
Temporal Relational Patterns
∗
Sandra de Amo1, Waldecir P. Junior1, Arnaud Giacometti2
1Faculdade de Computac¸˜ao – Universidade Federal de Uberlˆandia (UFU)
Campus Santa Mˆonica, Bloco B – Uberlˆandia – MG – Brazil
2Universit´e de Tours, Laboratoire d'Informatique,
2, Place Jean Jaur es 45000 Blois, France,
Abstract. In this paper, we consider a new kind of temporal pattern where bothinterval and punctual time representation are considered. These patterns, whichwe call temporal point-interval patterns aim at capturing how events takingplace during different time periods or at different time instants relate to eachother. The datasets where these kind of patterns may appear are temporal re-lational databases whose relations contain point or interval timestamps. Weuse a simple extension of Allen's Temporal Interval Logic as a formalism forspecifying these temporal patterns. We also present the algorithm MILPRIT∗for mining temporal point-interval patterns, which uses variants of the classi-cal level-wise search algorithms. Besides, MILPRIT∗ allows a broad spectrumof constraints to be incorporated into the mining process. An extensive set ofexperiments of MILPRIT∗ executed over synthetic and real data is presentedshowing its effectiveness for mining temporal relational patterns.
Keywords. Temporal Data Mining, Constraint-based Mining, Interval Temporal Logic.
1. Introduction and Motivation
The problem of discovering sequential patterns in temporal data has been extensivelystudied in the past years [Srikant and Agrawal 1996, Zaki 2001, Han et al. 2004] and itsimportance is fully justified by the great number of potential application domains wheremining sequential patterns appears as a crucial issue, such as financial market (evolutionof stock market shares quotations), retailing (evolution of clients purchases), medicine(evolution of patients symptoms), local weather forecast, telecommunication (sequencesof alarms output by network switches), etc. Most of these patterns are specified by for-malisms which are, in some extent, reducible to
Propositional Linear Temporal Logic,where time is represented by points in a straight line. For instance, let us consider aclassical sequential pattern of the form
s =
< {a, b}, {c, d} > (where
{a, b} and
{c,d
}are sets of items purchased by a client). This sequential pattern can be expressed inthe Propositional Linear Temporal Logic by the formula
Pa ∧ Pb ∧ ♦(
Pc ∧ Pd) wherefor each
i ∈ {a, b, c, d},
Pi is a propositional symbol standing for "client buys item
i".
The symbol ♦ stands for the temporal operator
sometimes in the future (for a compre-hensive survey on Linear Temporal Logic, see [Emerson 1990]). The need for a more
∗This research has been supported by CNPq-Brazil, Project 473309/2004-1
expressive kind of temporal patterns arises for instance, when modelling Unix-users be-haviour [Jacobs and Blockeel 2001]. Consider, for instance1, the following sequence ofcommands related to latex users:
ls, vi paper.tex, latex paper.tex, dvips paper.dvi, lprpaper.ps. This sequence of commands can be represented as a sequence of relational(or first-order) atoms of the form:
ls,
vi(
paper.tex),
latex(
paper.tex),
dvips(
paper.dvi),
lpr(
paper.ps). Within such a database of sequences of relational atoms, it would be pos-sible to discover that the relational sequence pattern
vi(
X),
latex(
X) is frequent. Noticethat this pattern tell us that the sequence of commands (predicates)
vi,
latex is frequentlyrequested by users. In order to specify this kind of sequential patterns, we need a morepowerful logical formalism, the First-Order Linear Temporal Logic, since the elementsinvolved in the patterns include both predicates and their parameters.
In all the above examples illustrating the propositional and relational settings of
the sequential pattern mining problem, the events are instantaneous, that is, the time whenthey happen is represented as a point (instant) in a straight line. The sequence of eventscorresponds to a sequence of instants when these events take place. However, there aremany situations where events have a certain
duration, and so, the underlying time is mea-sured in terms of
intervals instead of points. In this paper, we propose a new tempo-ral pattern involving a hybrid representation for time and a new algorithm (MILPRIT
∗)for mining them. These patterns, which we call
temporal point-interval patterns (or pi-patterns for short), aim at capturing how events taking place in time intervals or timeinstants relate to each other. For instance, (1) in a
medical application, we could be inter-ested in discovering if patients who take some medicine
X during a certain period of time
P , and in some moment
m in
P undergo a stomach surgery, will present the symptom
Z during a period of time beginning right after the surgery and finishing as soon as theystop taking the medicine
X; (2) in an
agricultural application, we could be interested indiscovering if the use of some organic fertilizer during a period of time has an effect onthe way a plant grows
during and
after the fertilizer application. The following exampleillustrates the medical application in more details.
Example 1 (Running Example) Let us consider the database schema R =
{Patient, Med,Symp, Hist} and the relational database instance
D over R illustrated in Figure 1. Noticethat the following behavior is verified by 50% of the patients (two patients out of four inthe
Patient relation verified it, namely, Paul and Sarah):
the patient takes penicillin duringa certain period of time e; during a period of time f following his taking the medicine, hefeels dizzy and undergoes a stomach surgery someday t during f.
Temporal pi-patterns are specified as a set of atomic first order formulae where
time is represented by an interval-time variable or by a point-time variable, together witha set of temporal predicates
{ before, meets, overlaps, during, starts, finishes }. Ourlogical formalism for specifying pi-patterns is based on Allen's First Order Interval Logic[Allen and Ferguson 1994]. For instance, the temporal pattern described in Example 1 isspecified by the triple (
K,
D,
T ) where:
K =
Patient(
x) (where
x is a registered patient),
D =
{Med(
x, penicillin, e),
Symp(
x, dizziness, f ),
Hist(
x, st.surgery, t)
} (representingthe events which take place in the pattern),
T =
{before(
e, f ),
during(
t, f )
} (representingthe relationships verified by the time parameters associated to the pattern events).
In most applications, the lack of user control for specifying the kind of patterns
1This example has been taken from [Lee and De Raedt 2002].
Pname Mname
Pname Sname
Pname Event
stomach surgery [5]
Charles tetracyclin [4,7]
Charles dizziness [5,6]
stomach surgery [9]
10/05/2000 15/11/2000
20/07/2001 06/09/2001 09/01/2002
23/08/2002 19/12/2002
Figure 1. A Temporal relational database instance
she would find interesting can lead to tedious and computationally costly post-processingphases. In this paper, we use a formalism based on regular expression in order to allowusers to specify the
shape of potentially interesting patterns. Our temporal mining methodMILPRIT
∗ is constraint-based: it incorporates constraints inside the mining process andin doing so, it accomplishes two goals: it saves a lot of calculation in the post-processingphase and it allows a non-negligent reduction in the search space of potential patterns. Thefollowing example illustrates the idea of using regular expressions to specify constraintson sequential patterns.
Example 2 (Running Example) Let us consider the same situation depicted in Example1. Users could be interested only in discovering pi-patterns where
patients take somemedicine (maybe more than one) and at the end present some symptom. Roughly speak-ing, we can say that these patterns fit into the regular expression format
Med∗Symp. In thiscase, it would be important that patterns representing situations where patients present alist of symptoms and
after that take some medicine, are not generated in the mining pro-cess, since such patterns do not fit in the given regular expression format.
Usually, in real world applications, it is important to allow different time gran-
ularities to be considered when evaluating pattern support, otherwise some interestingpatterns will not be mined. For instance, let us consider a database schema R =
{R(A,T),S(B,T)}, where
T is an interval-time attribute. If the values of the temporal attributesin the database are dates (day/month/year), the temporal predicate
meets(e,f) will be truein this database only if
e and
f are mapped into time intervals [
d1
, d2], [
d3
, d4] respec-tively, where
d1
, d2
, d3
, d4 are dates and
d2 =
d3. If there is very few pair of tuples(
R(
a, [
d1
, d2]),
S(
a, [
d2
, d4]) we are unable to mine patterns of the form
(K(x), {R(x,e),S(x,f)}, {meets(e,f)}). On the other hand, we can have a lot of pair of tuples (
R(
a, [
d1
, d2]),
S(
a, [
d3
, d4]) where
d2 and
d3 are different days but belong to the same week in the year.
In this case, if we relax the time granularity and consider
week to be the time unity,then
d2 and
d3 will be consider identical and hence, the pattern
(K(x), {R(x,e), S(x,f)},{meets(e,f)}) will be mined. The algorithm MILPRIT
∗ we present in this paper allowsusers to specify different time granularities when evaluating the pattern support.
Related Work. A lot of studies on constrained-based temporal mining, in a proposi-tional context, have been investigated in order to reduce the huge amount of patternswhich are generated in the relational mining context. [Srikant and Agrawal 1996] is a
pioneer work introducing constraints in sequential pattern mining in order to reduce ex-ecution time and the number of discovered patterns. In this work, the constraints (min-max, slide-window) are related to the way the support of patterns is counted and notto the patterns format. Recently, a lot of work on constrained-based temporal mininghave investigated formalisms to specify constraints over the pattern format. Regularexpressions have been adopted in many of these work as a very general mechanism tospecify user constraints. [Garofalakis et al. 1999] was a pioneer work in this field, in-troducing the SPIRIT family of algorithms for constraint-based sequential pattern min-ing. [Lorincza and Boulicaut 2003] and [Pei et al. 2002] extended this pioneer work, byproposing new and more efficient algorithms (RE-Hackle and Prefix-Growth respectively)for sequential pattern mining where regular expressions are used to specify useful patternsbeforehand. Such constraints are pushed inside the mining process instead of being ver-ified only in a post-processing phase. [Antunes and Oliveira 2005], following the linesof [Garofalakis et al. 1999], proposed a more general formalism, based on context-freegrammars, in order to specify constraints over sequential patterns.
In [de Amo and Furtado 2005], we tackled the constraint-based temporal min-
ing problem in the relational context, by proposing a formalism to specify con-straints over the temporal patterns we had introduced in [de Amo et al. 2004].
[Lee and De Raedt 2002], the algorithm SeqLog, based on Inductive Logic Program-ming, was introduced for discovering relational sequential patterns with constraints. In[Masson and Jacquenet 2003], the algorithm Spirit-Log, designed to mine relational tem-poral patterns was introduced. Spirit-Log extends SeqLog (where time is punctual), bypushing regular expression constraints inside the mining process. A comprehensive textabout constraint-based temporal
relational pattern mining following an inductive databaseapproach can be found in [Boulicaut et al. 2006]. In all these work, time is representedby
points in a straight line. The approach we propose in this paper is more general sinceit allows mining relational temporal patterns where time is represented by
points as wellas by
intervals. We notice that most real world applications involve both kind of timerepresentation.
In [Hoppner 2001], Allen's Propositional Interval Logic has been used for
the first time, to treat the problem of discovering association rules over time series[Allen and Ferguson 1994]. In [Lattner and Herzog 2005], an approach similar to oursis presented for temporal relational pattern mining where time is represented as intervals.
However, this approach is not constraint-based like ours. Moreover, no experimentalresults concerning the performance of the proposed method are provided, which makesdifficult a comparative analysis with our approach.
The issue of allowing user-specified different time granularities during sup-
port counting has been treated in the pioneer work of [Srikant and Agrawal 1996] and[Bettini et al. 1998], in the context of sequential pattern mining and time series. Here,patterns were specified in a formalism equivalent to propositional temporal logic. More-over, time attributes are
point-valued. We emphasize that our approach is more generalsince it is developed in a temporal
relational database context where time attributes maybe
point-valued as well as
interval-valued.
Our Contribution. We can summarize our main contributions as follows: (1) the in-troduction of a new temporal relational pattern allowing interval and point-based timerepresentation in relations, as well as a logical formalism to specify these patterns. (2) theintroduction of pattern constraints in the mining process, more precisely during the can-didate generation phase, (3) the possibility of considering different user-specified timegranularities during the support count phase, (4) the design and implementation of an al-gorithm (MILPRIT
∗) to mine such constrained temporal patterns and (5) an extensive setof experiments over synthetic and real datasets. This is an extended version of the paper[de Amo et al. 2007]. In that paper, we had only considered interval-based time represen-tation in relations and we had not treated user-specified time granularities. MILPRIT
∗ is ageneralization of the MILPRIT algorithm presented in [de Amo et al. 2007], designed tomining point-interval patterns with user-specified time granularities. The introduction ofpoint-based time representation required to consider a broader set of temporal predicatesin the generation phase. We took care to consider a minimal set of new temporal predi-cates in order to simplify the generation phase without losing completeness. MILPRIT
∗can be considered as a general algorithm for temporal relational pattern mining sinceit can be executed over a dataset of multiple tables with point-based and interval-basedtimestamps. It can be applied in any situation which can be modeled by a temporal re-lational database involving point-valued and interval-valued attributes. Medical applica-tions, for instance, are typical applications where both time representations (point-basedand interval-based) must be considered. In this paper, we focus on this kind of applicationwhen evaluating the efficiency of our algorithm (see Section 5.2).
Paper Organization. This paper is organized as follows. In Section 2, we formalize themain concepts related to the problem of mining pi-patterns. In Section 3, we propose aconstraint-based framework for mining pi-patterns and formalize all the concepts involvedin the presentation of the main algorithm (MILPRIT
∗). Section 4 is dedicated to the pre-sentation of MILPRIT
∗, an algorithm for mining pi-patterns with constraints. In Section5 we present and analyse the experimental results obtained by executing MILPRIT
∗ oversynthetic data as well as over real data related to a medical application. Finally in Section6, we present our concluding remarks and discuss some future research on the subject.
2. Problem Formalization
In this section, we introduce the main concepts related to the problem of discoveringtemporal point-interval patterns.
The Dataset. First of all, let us describe the datasets where the discovery procedurewill be executed later on. We assume a set of attributes symbols
A1
, A2
, ., called
dataattributes, and two special attribute symbols
Tit and
Tpt, called interval-time attribute andpoint-time attribute respectively. Let R =
{K, R1
, ., Rn} be a database schema such thateach relational schema
Ri is of the form
Ri(
Ai , ., Ai , T
, ., Ai , T
pt) or
Ri(
Ai
relational schema
K (called
the reference relation) is of the form
K(
A1
, ., Am). Theattributes appearing in
K (all data attributes) are called
reference attributes.
The values for data attributes
Ai are taken within a set Dom, called the
data do-
main. The attribute
Tit takes values in the set
I =
{[
i, j]
i, j ∈ N
, i < j}. The elements of
I are intervals [
i, j] where
i and
j are natural numbers mapping dates:
i is the starting dateand
j is the ending date of the interval [
i, j]. The attribute
Tpt takes values in the set N.
Elements of
Tpt are natural numbers mapping dates. Naturally, a
point can be viewed asan
interval [
i, j] where
i =
j. In our approach, in order to optimize the generation phaseof the algorithm MILPRIT
∗, we treat interval and points as distinct entities, emphasizingthat the starting and ending points of an interval are different.
A
temporal dataset over R is a set of temporal relations
{k, r1
, ., rn} where each
ri is a set of tuples (
a1
, ., an , e) over
R
i and
k is a set of tuples over
K . We will assume
that the relations
ri with interval-valued timestamp are
coalesced[B¨ohlen et al. 1996], thatis, if (
a1
, ., ak, e) and (
a1
, ., ak, f)
∈ ri, then
e ∩ f =
∅, i.e., intervals associated to asame tuple are maximal. The dataset depicted in Figure 1 is a temporal dataset with thefollowing schema:
Temporal pi-Patterns. Let R =
{K, R1
, ., Rn} be a database schema. Let us sup-pose four disjoints sets
V (data variables, denoted by
x, y, z, x1
, .),
C (data constants,denoted by
a, b, c, a1
, .),
Vit (interval-time variables, denoted by
e, f, g, e1
, .) and
Vpt(point-time variables, denoted by
t, s, t1
, s1
, .). Let us also assume the set of temporalpredicates
PT =
{before, meets, overlaps, during, finishes, starts}, as in Allen's IntervalLogic [Allen and Ferguson 1994]. A
data atom over the database schema R is an expres-sion of the form
R(
x1
, ., xn, e) where
R ∈ R, (
R 6=
K), each
xi is a data variable ora data constant and
e is an interval-time or point-time variable (depending on the timeattribute of
R). A
temporal atom is an expression of the form
r(
e, f ), where
r is oneof the temporal predicates in
PT and
e, f are interval-time or point-time variables. If
r ∈ {overlaps, meets}, then
e and
f are interval-time variables. For the temporal predi-cates
before, during, starts, finishes we may have different temporal atoms:
before(
e, f ),
before(
t, e),
before(
e, t),
before(
t, s),
during(
e, f ),
during(
t, e),
starts(
e, f ),
starts(
t, e),
finishes(
e, f ),
finishes(
t, e), where
e, f are interval-time variables and
t, s are point-timevariables. Figure 2 illustrates the semantics of the temporal predicates. For instance, theformula
before(
e, f ) is true if the variables
e and
f are evaluated as [4
, 6] and [8
, 9] respec-tively. The formula
before(
t, e) is true if
t is evaluated as 5 and
e as [6
, 7]. The formula
during(
e, f ) is true if the variables
e and
f are evaluated as [5
, 7], and [4
, 9] respectively.
Interval-time x Interval-time
Point-time x Point-time
Point-time x Interval-time
before(t,e) during(t,e) starts(t,e) finishes(t,e) before(e,t)
Figure 2. Semantics of Temporal Predicates
Definition 1 (Temporal pi-pattern) A
temporal pi-pattern (temporal point-interval pat-tern) is a triple (
K,
D,
T ) where (1)
D is a set of data atoms (2)
T is a set of temporalatoms and (3)
K is a special data atom
K(
x1
, ., xn) whose variables
x1
, ., xn appear in
D. The data atom
K is called the
reference query of the pi-pattern (
K,
D,
T ). It is essential
in the definition of a pi-pattern, since it specifies
what is counted. It plays the same rolein our setting as the
atom key in the Datalog patterns of [Dehaspe and Toivonnen 1999].
Example 3 (Running Example) Let
(
Patient(x)
, D, T ) where
D =
{Med(
x, y, e),
Symp(
x, z, f )
} and
T =
{before(
e, f )
}.
Intuitively, this temporal pattern translates the fact that a patient
x takes a medicine
y (notspecified) during a period of time
e and during a further period of time
f (after she hasstopped taking the medicine
y) presents a symptom
z (not specified). Note that we arealways interested in analyzing the behaviour of patients registered in the relation
Patient,even though relations
Med and
Symp may contain data related to other patients (not onlythose registered in
Patient). For this reason,
Patient(
x) is called
the reference query.
An
instantiation of an interval-time variable
e (resp. a point-time variable
t) is a
mapping
v associating an interval
v(
e) = [
i, j] to
e (resp. a point
i to
t), where
i and
j arenatural numbers (mapping dates). A
ground temporal atom is the temporal atom obtainedby instantiating its interval-time and point-time variables.
Definition 2 (Support and frequency of a pi-pattern) Let
D be a temporal databaseover the schema R =
{K, R1
, ., Rn} and
M = (
K(
x1
, ., xm),
{D1
, .Ds}, {T1
, ., Tl})be a temporal pi-pattern over R. The
support of
M with respect to
D, denoted by
sup(
M)(omitting the dataset
D, for the sake of simplifying the notation), is defined by
sup(
M) =
{u∈K u =
QM } where
Q
M is the
relational calculus query
∃y1
∃y2
.∃yk(
D1
∧ D2
∧ . . ∧
Ds ∧ T1
∧ . . ∧ Tl), and
y1
, ., yk are the data variables not appearing in the referencequery
K(
x1
, ., xm). The expression
u =
QM means that
u is an answer to the query
QM when executed over
D. Given 0
≤ α ≤ 1, we say that the temporal pi-pattern
M is
frequent with respect to
D and
α if
sup(
M)
≥ α.
Example 4 (Running Example) Let us consider the temporal pi-pattern
M = (
Pa-tient(
x),
{Med(
x, y, e),
Symp(
x, z, f )
},
{before(
e, f )
}). Let
D be the temporal databaseinstance illustrated in Figure 1. The relational query
QM can be specified in the relationalalgebra formalism by the expression Π
PName(
Patient o
n
Symp). The answer of
QM is
{(Paul), (Sarah)
}. Note that Charles is not an answer. Indeed, Charles has takentetracycline and was dizzy
during (and not
after) the time he was taking this medicine.
John is not an answer neither, because he has not taken any medicine at all. So, the supportof
M w.r.t.
D is 1.
In what follows, we suppose that the set of temporal atoms
T is
consistent, i.e.,
the temporal variables can be instantiated in a consistent way. Formally, a set of temporalatoms
T is
consistent if there is an instantiation
v of the set of temporal variables appear-ing in
T , such that the ground atoms obtained are simultaneously true. We denote this by
v =
T . For instance, the set of temporal atoms
{before(
e, f ),
before(
f, g),
before(
g, e)
}is not consistent, because the first two atoms imply that
e comes before
g and so, there isno way to instantiate the temporal variables
e, f and
g in order to make the three temporalformulae simultaneously true. We also suppose that the set
T is
complete, i.e., for eachpair of temporal variables
e, f appearing in
T it is possible to infer a
unique temporalrelationship between
e and
f . For instance, the set
{before(
e, f ),
meets(
f, g)
} is completebecause the relationship between
e and
g is completely determined (even though it doesnot appear explicitly in the set): the only possibility is
before(
e, g). On the other hand,
{during(
e, f ),
overlaps(
f, g)
} is not complete since the relationship between
e and
g is not
deterministically implied by the two relationships
during(
e, f ),
overlaps(
f, g): it could be(1)
before(
e, g), (2)
meets(
e, g), (3)
overlaps(
e, g), (4)
during(
e, g) or (5)
starts(
e, g). Thatis, for each one of the five temporal atoms (I) (where (I)
∈ {(1), (2), (3), (4) (5)
}) the set
{during(
e, f ),
overlaps(
f, g), (I)
} is consistent.
Representing temporal pi-patterns as sequences. The notion of consistency and com-pleteness of a set of temporal atoms is essential in our approach. As we will see next, givena consistent and complete set of temporal atoms, its temporal variables can be ordered ina natural way. This implies that a pi-pattern can be viewed as a
sequence of data atoms.
And so, we can naturally think about
regular expressions as a formalism for specifyingspecial sets of pi-patterns. Due to space limitation, we will only give an informal defini-tion of the order relation
<T over the set of temporal variables appearing in a pi-pattern.
The idea can be easily formalized.
Let us consider the domain
I for the interval-time variables and the domain N for
the point-time variables. We can view a natural number
n as an interval [
n, n]. So, weconsider the set
I∗ =
I ∪ N as being constituted of intervals [
i, j], with
i ≤ j. Let [
a, b]and [
c, d] in
I∗. We say that [
a, b]
< [
c, d] if and only if one and only one of the followingconditions is verified: (1)
b < d or (2)
b =
d and
a < c. It can easily be shown that
< is a(strict) total order over the set
I∗.
The (strict) ordering
< over the interval structure
I∗ naturally induces an ordering
<T over the set of temporal variables (interval-time and point-time) appearing in the set
of temporal predicates
T of a pi-pattern. The following example illustrates the idea: let
T=
{before(e,f), starts(f,g)}. Then the induced ordering over the set of temporal variables
{e, f, g} appearing in
T is
e <T f <T g. Indeed, for all instantiation
v of the temporal
variables,
v(
e) = [
ae, be]
, v(
f) = [
af , bf ]
, v(
g) = [
ag, bg], such that the predicates in
Tare simultaneously true, the ending points
be, bf , bg satisfy
be < bf < bg. That is, for allinstantiation
v such that
v =
T , we have
v(
e)
< v(
f )
< v(
g).
The following Theorem guarantees that the relation
<T is well-defined and also
an order relation.
Theorem 1 Let
T be a complete and consistent set of temporal atoms. Then (1) the def-inition of
<T does not depend on the particular instantiation
v of the temporal variables
appearing in
T and (2)
<T is a strict order relation over the set of temporal variables
appearing in
T .
Proof (Sketch). In order to simplify the presentation, we suppose that we have onlyinterval-time variables. The case where one has interval-time and point-time variables istreated in a similar way. (1) By the fact that
T is consistent, we can affirm that there existsa valuation
v of the temporal variables appearing in
T such that
v =
T . If
v is unique,then
<T is well-defined, since there is a unique way to define it through the valuation
v.
Now, let us suppose that such valuation is not unique, that is, let us suppose that
v and
v0are two valuations of temporal variables of
T such that
v =
T and
v0 =
T . From the factthat
T is complete, we can affirm without loss of generality that there exists a temporalpredicate
r such that
v =
r(e,f) and
v0 =
r(e,f). So,
v(
e)
< v(
f ) if and only if
r(e,f) iseither
before(e,f) or
during(e,f) or
meets(e,f) or
overlaps(e,f) or
finishes(e,f) or
starts(e,f).
In each one of these cases, from the fact that
v0 =
r(e,f), we conclude that
v0(
e)
< v0(
f ).
(2) In order to prove that
<T is a strict order over the temporal variables appearing in
T ,
we must show that it satisfies the irreflexivity and transitivity properties. It is obvious that
<T is irreflexive. Let
e <T f and
f <T g, where
e, f, g are temporal variables appearing
in
T . Let
r(
e, f ),
r0(
f, g) and
r00(
e, g) be the unique temporal relationship inferred from
T for the pair of variables (
e, f ), (
f, g) and (
e, g) respectively. We can guarantee this byusing the fact that
T is complete. As
T is consistent, there exists a valuation
v of thetemporal variables
e, f, g such that
v =
r(
e, f )
∧ r0(
f, g)
∧ r00(
e, g). Now, using the factthat
e <T f and
f <T g, and considering all the possibilities for the temporal predicates
r,
r0 and
r00 it is not difficult to show that
v(
e)
< v(
g) in all cases.
The induced order
<T over the temporal variables appearing in a temporal pi-
pattern
M allows us to view
M as a
sequence of data atoms. Indeed, let
M = (
K,
D,
T ),where
D =
{p1(
x1
, ., x1
, e
, ., xk , e
1)
, . . , pk(
xk
k)
}. Since we are assuming that
T is
complete and consistent, then, according to Theorem 1 its temporal variables are orderedby the order
<T . Let us suppose, without loss of generality, that
e1
<T e2
<T . . <T ek.
So, the set
D can be viewed as the
sequence < p1(
x1
, ., x1
, e
, ., xk, e
1)
, . . , pk(
xk
3. Temporal pi-Patterns with RestrictionsIn an application where the user has a previous idea of some specific characteristics ofthe patterns she is searching, she can be overwhelmed with uninteresting patterns. In thissection, we propose to use regular expressions as a formalism to specify a broad spectrumof constraints over pi-patterns in order to better satisfy user requirements as well as toreduce the search space of patterns.
A
mode over a temporal database R =
{K, R1
, ., Rk} is an expression of the
form
R(
u1
, ., us, #), where each
ui is a data variable or the (new) symbol
∗, # is a newsymbol, and
R(
A1
, ., An, T ) is one of the predicates
Ri ∈ R,
T ∈ {Tit, Tpt}. We saythat a data atom
R(
y1
, ., ys, e) is
in accord with the mode
R(
u1
, ., us, #) if for each
i = 1
, ., s we have: (1) If
ui is a variable then
yi =
ui; (2) If
ui =
∗, then
yi is avariable or a constant ; (3)
e is a interval-time variable if
T =
Tti or a point-time variableif
T =
Tpi. For instance,
Med(
x, ∗, #) is a mode and the atom
Med(
x, penicillin, e) isin accord with
Med(
x, ∗, #). On the other hand, the atom
Med(
y, penicillin, e) is not inaccord with the mode
Med(
x, z, #).
Let Σ be a set of modes over R =
{K, R1
, ., Rn}. In what follows, we will
consider regular languages (sets of strings) over the alphabet Σ. These languages arespecified by regular expressions. Given a regular expression
E over the alphabet Σ, wedenote by
W (
E) the set of strings (words) verifying
E. We denote by
P (
E) the set ofprefixes of strings in
W (
E). We will need this notation later.
Definition 3 (The Search Space) Let R =
{K, R1
, ., Rn} be a database schema, Σa set of modes over R and
E a regular expression over Σ. The
search space de-fined by
E is the set of temporal pi-patterns (
K(
x1
, ., xm),
D,
T ) such that
D =
<p1(
t1
, ., t1
, e
, ., tn , e
1)
, . . , pn(
tn
n)
> satisfies the following properties: (1) There ex-
ists a string
w1
w2
.wn ∈ W (
E), such that
pi is in accord with
wi for each
i = 1
, ., n;(2) For each data atom
pi(
xi , ., xi , e
i)
∈ D let
wi =
pi(
u1
, ., ul
i) be its associated
mode; if
uj =
∗ and
xj is a variable, then
xj has only one occurrence in
D (a data variablesymbol
xj can be used to replace a symbol
∗ in a mode only once). We denote by
W(
E)the search space specified by
E. We denote by
P(
E) the set of pi-patterns defined in
a similar way as
W(
E), by considering strings
w1
w2
.wn ∈ P (
E) instead of
W (
E) incondition (1). Patterns in
W(
E) are called
valid. Patterns in
P(
E) are called
prefix valid.
As we will see in the next section, in the generation phase of our algorithm, only
pi-patterns in
P(
E) (prefixes) will be generated. The reason why we generate prefix-validpatterns (p-valid patterns for simplify) rather than valid patterns will be clearer later.
Example 5 (Running Example) Let
E =
Med(
x, ∗, #)
Med(
x, ∗, #)
∗ Symp(
x, ∗, #).
The pi-pattern
M1 = (
Patient(
x),
{Med(
x, penicillin, e),
Med(
x, tetracyclin, f),
Symp(
x, diarrhea, g)
},
{before(
e, f ),
overlaps(
f, g)
}) belongs to
W(
E). The pattern
M2= (
Patient(
x),
{Med(
x, penicillin, e),
Med(
x, tetracyclin, f )
},
{before(
e, f )
}) belongs to
P(
E). The pattern
M3 = (
Patient(
x),
{Med(
x, y, e),
Symp(
x, y, f)
},
{before(
e, f)
}) doesnot belong to
P(
E), since property (2) of definition 3 is not verified. Intuitively, the reg-ular expression
E captures the temporal pi-patterns corresponding to a patient
x takingcertain medicines during successive periods and eventually presenting some symptom.
Our mining task is: Given a temporal database
D, a minimum support threshold
α,0
≤ α ≤ 1, and a regular expression
E over Σ, find all temporal pi-patterns in
W(
E)which are frequent with respect to
D and
α.
4. The Algorithm MILPRIT
∗In this section, we present the algorithm MILPRIT
∗ (Mining Interval Logic Patterns withRegular expressIons consTraints) which generalizes the idea of the SPIRIT algorithmintroduced in [Garofalakis et al. 1999]. In a high level, it follows the general Aprioristrategy of [Srikant and Agrawal 1996], working in passes, and each pass producing pat-terns more
specificic than those produced in the previous pass. The classical Aprioristrategy uses a
Pruning Phase, where generated patterns which are more specific thana non-frequent pattern
p are pruned. This pruning strategy relies on the AntimonotonicProperty: "
a frequent pattern cannot be more specific than a non-frequent pattern".
At a first glance, a way of conceiving a method for mining pi-patterns, following
the Apriori strategy would be: (1) at each pass
k, one combines frequent and valid patternsin the set
Fk−1 obtained in the previous pass in order to generate the set of candidates
Ck.
One takes care to produce only valid patterns in this combination, pushing the regularexpression
E in the generation process; (2) one prunes those patterns in
Ck containing a
subpattern which is valid and which is not in
k−1
F
i (this pattern surely is not frequent!);
(3) one scans the database in order to count the support of the remaining patterns. In thismethod, the number of generated patterns is in
direct proportion to the restrictiveness of
E. In the pruning phase, however, all patterns which are more specific than a pattern
satisfying E and not belonging to
Fk−1 should be pruned. In this case, the size of the setof pruned patterns is in
inverse proportion to the restrictiveness of the constraint
E. Suchtrade-offs are due to the fact that regular expression constraints are not anti-monotone,that is, if a pattern satisfies
E it may have some subpattern not satisfying
E2.
In order to find a suitable trade-off, we use a
relaxation of the constraint
E, that
is, a less restrictive one, the prefix constraint associated to E, according to which, onlyfrequent p-valid pi-patterns will be produced in the mining process. A post-processingphase will filter the valid patterns.
2For instance, the word
aba verifies the regular expression
E =
ab∗a but its subword
ba does not verify
E.
The general structure of the algorithm MILPRIT
∗ is depicted below:
Procedure MILPRIT
∗(
D,
α,
E,
δ)
D is a temporal dataset over R =
{K, R1
, ., Rn},
E is a regular expression over the alphabet of modesΣ over R,
α is a minimum support threshold and
δ is the time granularity (
day,
week,
fortnight,
month,
bimester,
trimester, semester, year). Let
M1 be the set of data atoms
Ri(
x1
, ., xn , e), where
e is a
interval-time or a point-time variable, depending on the sort of the temporal attribute appearing in
Ri.
begin1.
F :=
F1 =
{M ∈ M1 :
M is frequent and
M ∈ P(
E)
}2.
k := 23. repeat
4.
Ck := Gen(
Fk−1,
E) (
Generation Phase)5.
P :=
{M ∈ Ck :
∃N ∈ P(
E),such that
N 6∈ F and
M is more specific than
N}6.
Ck :=
Ck − P (
Pruning Phase)7.
Fk :=
{M ∈ Ck : sup(
M, D, δ)
≥ α} (
Validation Phase)8.
F :=
F ∪ Fk9.
k :=
k + 1
10. until
Fk−1 =
∅11.
F :=
{M ∈ F :
M ∈ W(
E)
} (
Post-Processing Phase)
The function sup(
M, D, δ) is defined as follows: First, each date appearing in
some tuple in
D is associated to the corresponding time value determined by the timegranularity
δ. For instance, if
δ =
month and date is 20
/09
/2007 and the minimum dateappearing in
D is 28
/07
/2006 then 20
/09
/2007 is mapped into 15, since date 20
/09
/2007occurs in the 15th month relatively to the first month July 2006. Let
D0 be the databaseobtained from
D after this transformation. The support sup(
M, D, δ) is obtained by cal-culating sup(
M, D0) as in Definition 2.
The Generation Phase. MILPRIT
∗ works in passes, each pass
k producing patterns of"
level"
k. The "level" of a temporal pi-pattern
M is measured in terms of the
refinementvector associated to
M as defined below.
Definition 4 The
refinement vector associated to
M = (
K,
D,
T ) is
v(
M) = (
n, c) where
n is the size of
D and
c is the number of constants appearing in
D. The
refinement level of
M, denoted by
l(
M), is defined as
n +
c where (
n, c) =
v(
M). For instance, let
M be thetemporal i-pattern (
Patient(
x),
{Med(
x, penicillin, e),
Symp(
x, z, f )
},
{before(
e, f )
}).
Then
v(
M) = (2
, 1) and
l(
M) = 3.
The procedure Gen(
Fk−1,
E) is designed to generate temporal pi-patterns whose
refinement level is
k, by
specializing the frequent temporal pi-patterns of
Fk−1 (whoserefinement level is
k − 1) in such a way that the patterns produced are in
P(
E). The coreof procedure Gen is the two
specialization operators (
ρE and
ρI) defined below.
Definition 5 (Extension) Let
E be a regular expression and
M
∈ P(
E),
M =
(
K(
x1
, ., xn),
D,
T ). The
extension operator ρE executed over
M is defined as fol-lows: (a) if
v(
M) = (
n, c) with
c > 0, then
ρE(
M) =
∅; (b) if
v(
M) = (
n, 0) then
ρE(
M) is the set of temporal pi-patterns
M0 = (
K(
x1
, ., xn)
, D0, T 0)
∈ P(
E) such that
D0 =
D∪{pn+1(
z1
, . . , zl, en+1)
} where
pn+1
∈ R,
z1
, ., zl are data variables and
en+1 isa interval-time or a point-time variable (according to the sort of the time attribute of
pn+1)
and
T 0 =
T ∪ {r1(
e1
, en+1)
, . . , rn(
en, en+1)
} where
ri are temporal predicates, and
T 0is complete and consistent. Clearly, for each
i = 1
, ., n the temporal predicates
ri aretaken according to the type of the time variables
ei, en+1 (point or interval) (see Figure 2).
For instance, if both
ei and
en+1 are point-time,
ri must be the temporal predicate
before.
The following example illustrates the operator
ρE:
Example 6 (Running Example) Let
E be the regular expression
Hist(x,*,#)∗Med(x,*,#)∗.
The pi-pattern
M = (
Patient(x),
{Hist(x,y,t1
)},
∅) belongs to
P(
E).
Let us consider the following pi-patterns:
M1 = (
Patient(
x),
{Hist(
x, y1
, t1),
Hist(
x, y2
, t2)
},
{before(
t1
, t2)
})
M2 = (
Patient(
x),
{Hist(
x, y1
, t1),
Hist(
x, y2
, t2),
Med(
x, z, e3)
},
{before(
t1
, t2),
meets(
t2
, e3)
})
M3 = (
Patient(
x),
{Hist(
x, y1
, t1),
Hist(
x, y2
, t2),
Med(
x, z, e3)
},
{before(
t1
, t2),
starts(
t1
, e3),
during(
t2
, e3)
})
M4 = (
Patient(
x),
{Hist(
x, y1
, t1),
Hist(
x, y2
, t2),
Med(
x, z1
, e3)
},
Med(
x, z2
, e4)
},
{before(
t1
, t2),
starts(
t1
, e3),
overlaps(
e3
, e4)
})
M1 is in
ρE(
M), but
M2 is not in
ρE(
M1), because
meets is not a valid relationship
between
t2 and
e3. The pattern
M3 is in
ρE(
M1), but
M4 is not in
ρE(
M3), because thereis no explicit nor implicit relationship between the temporal variables
e4 and
t2, that is,the set of temporal predicates in
M4 is not complete.
Definition 6 (Instantiation) Let
E be a regular expression and
M ∈ P(
E),
M =(
K(
x1
, ., xn),
D,
T ), where
D =
< D1
, ., Dm >. The
instantiation operator executedover
M produces the set
ρI(
M) of temporal pi-patterns
M0 = (
K(
x1
, ., xn)
, D0, T ),where
D0 is obtained by replacing some variable
yk (
yk 6=
xl for
l ∈ [1
, n]) occurring insome
Di =
p(
y1
, . . , yp, e) by a constant
c in such a way that if the string of modes cor-responding to
D is
w1
w2
.wm and
wi =
p(
u1
, . . , up, #), then: (a)
uk =
∗; (b) for every
l > k, if
ul =
∗, then
yl is not a constant; (c) for every
j > i, if
wj =
p0(
v1
, . . , vq, #),then for every
l ∈ [1
, q], if
vl =
∗ and
Dj =
p0(
z1
, . . , zq, e), then
zl is not a constant.
Intuitively, condition (b) states that if some variable in a data atom is instantiated
then no other variables on the left of this one in the atom can be instantiated in further stepsof the candidate generation process. Condition (c) states that if some variable in a dataatom
p is instantiated then no other variables appearing in data atoms coming after
p inthe pattern can be instantiated in the future. These conditions are necessary for assuringoptimality of the specialization operators, as stated in Lemma 2 below. The followingexample illustrates the operator
ρI:
Example 7 (Running Example) Let
E be the regular expression
E =
Med(x,*,#)∗Symp(x,*,#)∗. Consider the following pi-patterns in
P(
E):
M1 = (
Patient(
x),
{Med(
x, w, e1),
Symp(
x, y, e2)
},
{before(
e1
, e2)
})
M2 = (
Patient(
x),
{Med(
x, w, e1),
Symp(
x, dizziness, e2)
},
{before(
e1
, e2)
})
M3 = (
Patient(
x),
{Med(
x, penicillin, e1),
Symp(
x, y, e2),
},
{before(
e1
, e2)
})
M4 = (
Patient(
x),
{Med(
x, penicillin, e1),
Symp(
x, dizziness, e2)
},
{before(
e1
, e2)
})
According to definition 6,
M2 and
M3 are in
ρI(
M1) (i.e., are valid instantiations
of
M1),
M4
6∈ ρI(
M1),
M4
6∈ ρI(
M2),
M4
∈ ρI(
M3).
Two temporal pi-patterns
M and
M0 are said to be
equivalent if
M0 is obtained
from
M by renaming its variables.
Theorem 2 The specialization operator
ρ defined as
ρ(
M) =
ρI(
M)
∪ ρE(
M) satisfiesthe following properties:
• completeness: every pi-pattern
M ∈ P(
E) can be obtained by successively apply-
ing
ρ to some pi-pattern with the refinement level 1.
• optimality: suppose that the regular expression
E verifies the following condition:
for all distinct modes m1
and m2
appearing in E, associated to the same predicatep, the positions containing the symbol * are the same in boths modes. Under thiscondition, the instantiation operator
ρ is optimal, in the sense that a pi-patterncannot be obtained by specializing two non-equivalent pi-patterns.
Proof.
(1)
Completeness: Let
M = (
K, D,
T )
∈ P(
E) and
ν(
M) = (
n, c). We will showby induction on
n that for all
c ≥ 0,
M is obtained by successively applying
ρ to somepi-pattern
M1 whose refinement level is 1.
• Let
n = 1. Let us show by induction on
c that
M is obtained by successively
applying
ρ to some pi-pattern
M1 with refinement level 1. For
c = 0: the re-finement level of
M is 1 and
M =
ρ0(
M) (we apply
ρ "0 times" over
M). Let
c ≥ 1 and let us suppose that the result is true for
c − 1. Let
p(
t1
, ., tk, en)the last atom in
D where, for some
i ∈ {1
, ., k},
ti is a constant
∈ C. Let
m= max
{i 1
≤ i ≤ k such that
ti ∈ C}. Let
M0 be the pi-pattern obtained byreplacing
ti by a new variable
y in the atom
p(
t1
, ., tk, en). That is, we replacethe data atom
p(
t1
, ., ti, ., tk, en) by
p(
t1
, ., y, .tk, en). Then,
M is obtainedby applying
ρI over
M0. By the induction hypothesis, since
ν(
M0) = (
n, c − 1)then
M0 is obtained by successively applying the specialization operator
ρ over api-pattern
M1 with refinement level 1. Then,
M =
ρI(
M0) is also obtained in sucha way.
• Let us suppose the result is true for
n − 1 and let us show it is true for
n ≥ 2. We will show that by induction on
c ≥ 0. For
c = 0:
M =
< p1(
t1
, t1
, ., t1
, e
, tm, ., tm , e
is a variable, for all
1)
, ., pm(
tm
m)
>, and
tij
i, j. It is easy to see that
M is obtained by applying successively the operator
ρE to the pi-pattern
< p1(
t1
, t1
, ., t1
, e
1)
> with refinement level 1 (the proof is
achieved by induction on
m, using the fact that each prefix of
M is in
P(
E). Forthe case
c ≥ 1, we employ the same argument used in the case
n = 1 and
c ≥ 1.
(2)
Optimality: For lack of space the proof is given in the Appendix.
Figure 3 illustrates the completeness and optimality of
ρ.
M3 cannot be obtained
by specializing both
M1 and
M2, since they are non-equivalent pi-patterns. In fact
ρexecuted over
M1 does not produce
M3 since according to Definition 5(a), each time api-pattern is instantiated, it cannot be extended any more.
M5 cannot be obtained byspecializing both
M3 and
M4, since they are non-equivalent pi-patterns. In fact
ρ executedover
M4 does not produce
M5 since according to Definition 6(c), each time a position ina pi-pattern is instantiated no positions at its left can be instantiated in future steps.
The Validation Phase. The support counting of the generated pi-patterns is a rathertechnical procedure and its details are omitted here. Roughly speaking, an equivalence
M =
Patient
M =
Patient
M = (
Patient(
x), {
Med(
x,y,e ),
Symp(
x,z,e )},
T )
Level 3
M = (
Patient(
x), {
Med(
x,penicillin,e ),
Symp x,z,e
)},
T )
M = (
Patient(
x), {
Med(
x,y,e ),
Symp(
x,dizziness,e )},
T )
M = (
Patient(
x), {
Med(
x,penicillin,e ),
Symp x,dizziness,e
Figure 3. Candidate Generation: executing the specialization operators level by level.
relation over the set of generated patterns is defined, depending on the sequence of modesappearing in the patterns. A SQL query
Q(
C) is associated to each equivalence class
C.
Thus, to each equivalente class
C, a database
DC is associated. An important point toremark here is that the database
DC has a unique relation, namely the answer of query
Q(
C) executed over the input database
D. Thus, in order to evaluate the support of thepatterns in class
C, the relation
DC is scanned once at each pass. A hash-tree structure isused for optimizing support counting in each equivalence class. The construction of thesehash-trees have been inspired on the ones used in [Srikant and Agrawal 1996].
The following result is a corollary of Theorem 2 which guarantees the soundness
and completeness of MILPRIT
∗:
Corollary 1 (Soundness and Completeness) Let
P be the set of temporal pi-patternsreturned by the algorithm MILPRIT
∗. Then (1) all patterns in
P are frequent and satisfythe constraints enforced by the regular expression
E and (2) all frequent and valid patternsare in
P.
Proof. (1) If
P ∈ P then
P is frequent since, after the validation phase, only frequentpatterns are returned. Besides, step 11 of MILPRIT
∗ guarantees that only valid patterns(with respect to
E) are returned. (2) Let
P be a frequent and valid pattern. Then
P isp-valid. By Theorem 2
P is obtained by applying the specialization operator
ρ n timesover a pi-pattern
M in the refinement level 1. It is easy to prove, by induction on
n, that
P is obtained during the generation phase of MILPRIT
∗, in the refinement level
n + 1.
Since
P is frequent, it will be returned in step 7 of MILPRIT
∗. Since
P is valid it will bereturned in step 11 of MILPRIT
∗.
5. Experimental ResultsIn this section, we discuss the performance and scalability of algorithm MILPRIT
∗through a set of experiments using both synthetic and real-life datasets. Our experi-ments were performed on an IBM SuperWorkstation7043A-8R with 2.4GHz and 4GBof main memory, running Linux. The DBMS PostgreSQL has been used to handle theSQL queries at the validation phase, as discussed before.
5.1. Synthetic DataWe have developed a synthetic data generator which generates regular expressions andtemporal databases where pi-patterns satisfying a regular expression may appear. The
regular expressions generated have the format (
B∗B∗.B∗), where each
block B
form (
T1
.Tm), and each
Tj (called
term) has the form (
mj +.+
mj ), where
mj are
modes. Thus, our generator can produce several types of regular expressions, by varyingthe number of blocks (
k), the maximum number of terms in each block (
m) and themaximum number of modes appearing in the terms. For instance, one regular expressionwhich fits into this format is
E = (
Med(x,*,#)Symp(x,*,#))
∗ (
Med(x,*,#)Hist(x,*,#))
∗. Thisexpression has two blocks and two terms in each block. For each term there is only onemode which can be chosen. In Table 1, we list the parameters used in the synthetic datagenerator, allowing a great variety of temporal databases and regular expressions to beproduced. For the tests presented in this paper, we use the following notation for the inputdatasets:
Ix − P y − Mz − Us − Bt − T u − V v − Qw, where
x, y, z, s, t, u, v and
w arevalues for the input parameters
I, P, M, U, B, T, V and
Q, respectively. The remaininginput parameters (
S, D and
F ) are set as (
S =
U ,
D =
U and
F =
U ). The parameter
Q
refers to the number of valid pi-patterns which surely appear in the generated dataset.
Table 1 -
Parameters used for the synthetic data generation.
Parameters
Number of tables with intervals
Number of tables with points
Number of attributes per table (not temporal)
Number of tuples per table - in '000s
Size of the domain of reference attributes - in '000s
Size of the domain of non-reference attributes - in '000s
Size of the domain of temporal attributes - in '000s
Number of blocks of the regular expression
Number of terms per blocks of the regular expression
Number of modes per term of the regular expression
Minimum amount of pi-patterns - in '000s
Performance Analysis. Figure 4(a) shows the performance of MILPRIT
∗ with respectto support variation. The tests have been executed over the dataset
I2
− P 2
− M2
−U5
− B2
− T 2
− V 1
− Q1. As the minimum support is increased from 0.5% to 3%, thenumber of frequent patterns at each pass decreases, as expected. Thus, the execution timeof the algorithm decreases, since few candidates are potentially frequent for high values ofminimum support. Figures 4(b), (c) and (d) show how the algorithm performs with respectto variations in the regular expression. In these tests the minimum support has been set as2%. The datasets used in these tests were
I2
−P 2
−M2
−U5
−Bt−T 2
−V 1
−Q1, where
t ∈ {1
, 2
, 3
, 4
, 5
},
I2
− P 2
− M2
− U5
− B2
− T u − V 1
− Q1, where
u ∈ {1
, 2
, 3
, 4
, 5
}and
I2
− P 2
− M2
− U5
− B2
− T 2
− V v − Q1, where
v ∈ {1
, 2
, 3
, 4
, 5
}. As thevalues of parameters
B, T or
V (respectively the number of blocks, the number of termsin each block and the number of choices for modes in each term) increase, the numberof
long patterns potentially frequent increases as well. Thus, the execution time of thealgorithm increases too, since longer patterns have to been tested at the validation phase.
Notice that the number of attributes of the table which is scanned to evaluate the supportfor these long patterns increases as the size of the patterns increases.
Figure 5(a) shows the relationship between the number of generated and pruned
pi-patterns during the execution of MILPRIT
∗ over the dataset
I2
− P 2
− M2
− U5
−B2
− T 2
− V 1
− Q1 with minimum support 0.5%. Note that the number of pruned
pth's is quite expressive, and consequently, a smaller amount of pi-patterns are evaluatedat the validation phase. Figure 5(b) shows the execution time associated to each phase
P 2-M 2-U 5-B 2-N 2-V 1-Q 1
I2-P 2-M 2-U 5-B x-N 2-V 1-Q 1
tio 30000
Minimum Support (%)
Number of Blocks
I2-P 2-M 2-U 5-B 2-N x-
I2-P2-M2-U2-B2-T V 1-Q 1
I2-P 2-M 2-U 5-B 2-N 2-V x-Q 1
Number of Terms
Number of Modes
Figure 4. Performance results
(generation-pruning-validation) at each refinement level. The tests also have been exe-cuted over the dataset
I2
− P 2
− M2
− U5
− B2
− T 2
− V 1
− Q1 with minimumsupport 0.5%. The execution time at the validation phase, especially at higher levels, isquite high. This can be explained by the fact that calculating the support of long patternsat high refinement levels involves the execution of SQL queries with a high number of
joins.
I2-P 2-M 2-U 5-B 2-N 2-V 1-Q 1
I2-P 2-M 2-U 5-B 2-N 2-V 1-Q 1
tte 25000
Figure 5. (a) Number of generated and pruned pi-patterns, (b) Amount of time spent in
each phase at different levels
Scalability For the scalability experiments, the parameters related to the database havebeen changed. In these experiments, we used the default parameters for the regular ex-pression and fixed the minimum support as 2%. We first show how MILPRIT
∗ performsas the number of tuples per table (
U) increases. The datasets considered here were
I2
−P 2
− M2
− Us − B2
− T 2
− V 1
− Q1, where
s ∈ {20000
, 40000
, 60000
, 80000
, 100000
}.
Figure 6(a) shows the results of these tests. Clearly, the higher the number of tuples inthe database tables, the longer the execution time of the algorithm, since the validationphase (the support evaluation) is the most computationnally costly. It can be noticedthat MILPRIT
∗ scales almost linearly with respect to the number of tuples in the dataset.
Figure 6(b) shows how MILPRIT
∗ scales up as the number of attributes per table (
M) in-creases. We show the results for the datasets
I2
− P 2
− Mz − U5
− B2
− T 2
− V 1
− Q1,where
z ∈ {2
, 3
, 4
, 5
}. As the number of attributes in the base tables increases, the num-ber of attributes of the table which is scanned in the validation phase3 increases as well.
Consequently the execution time increases. Figure 6(c) shows how the execution timeincreases as the number of generated patterns increases. This experiment consisted inrunning the algorithm over datasets
I2
− P 2
− M2
− U5
− B2
− T 2
− V 1
− Qw, for
w ∈ {1000
, 1500
, 2000
, 2500
, 3000
, 3500
, 4000
}. In these three scalability tests, we cannotice that MILPRIT
∗ scales up almost linearly with respect to the corresponding param-eters (number of tuples per table, number of attributes per table and number of interestingpatterns appearing in the dataset).
I2-P 2-M 2-U x-B 2-N 2-V 1-Q 1
I2-P 2-M x-U 5-B 2-N 2-V 1-Q 1
Tuples - in '000
Nº of Attributes per Table
I2-P 2-M 2-U 5-B 2-N 2-V 1-Q x
Frequent Patterns - in '000
Figure 6. Scalability results
The
AMDI4 is part of a project in our laboratory, which aims at building a system support-ing an indexed atlas of digital mammograms. This system intends to be used by radiolo-gists as well as radiology students, in order to help them in breast cancer diagnosis. The
AMDI database 5 stores, for each patient, a series of digital mammograms, taken duringits lifetime. Besides image data, some important temporal data is stored, related to thepatient habits and life style: in which periods of time she smoked, or took contraceptivedrugs, for instance. The data used in our tests were supplied by Instituto Victorio Valeri deDiagn´osticos M´edicos, Ribeir˜ao Preto-SP, Brazil. For the time being, the
AMDI databasestores information about 1365 patients. In the future, as the system will be available on-line, it will be possible for radiologists to insert new patient's records and images. The
AMDI relational schema is R =
{Patient(
PName),
Contraceptive(
PName,
CName,
Tit),
3This table is obtained as the result of executing a SQL query over the temporal dataset.
4The
AMDI system is avaliable on www.lcc.ufu.br/amdi.
5The
AMDI database is avaliable on www.lsi.ufu.br.
Antidepressant(
PName,
AName,
Tit),
Hormone(
PName,
HName,
Tit),
Tobacco (
PName,
Amount,
Tit),
Drink(
PName,
Type,
Tit),
FirstPregnancy(
PName,
Tpt),
Menarche(
PName,
Tpt)
Menopause(
PName,
Tpt),
Birads(
PName,
Category,
Tpt)
}. The relation Birads storesinformation related to the BIRADS classification, standard published by ACR (AmericanCollege of Radiology)6. The attribute
Category may assume five different values, eachone corresponding to a mammogram class. MILPRIT
∗ has been executed over the
AMDIdatabase in order to find temporal pi-patterns relating the evolution of a breast cancertumor and the patient's habits and life style. For the experiment carried out over thisdatabase, we considered the time granularity
δ =
semester.
Some of the patterns found were:
P2 = (
Patient(x),
{(x,20,
e1),
Birads(
x, 2
, s2)
},
{before(
e1
, s2)
})
P3 = (
Patient(x),
{FirstPregnancy(
x,YES,
e1),
Menopause(x,YES,
f2),
Birads(
x, 2
, s3)
},
{before(
e1
, f2),
before(
e1
, s3),
before(
f2
, s3)
})
The pi-pattern
P1 can be interpreted as follows:
patients start taking the antide-
pressant fluoxetine as soon as they enter the menopause. This pi-pattern has been verifiedby 25 patients (support = 1.8%). The pi-pattern
P2 has been verified by 14 patients (sup-port = 1%) and can be interpreted as follows:
patients which smoke 20 cigarettes perday stop smoking after their mammogram is classified with Birads 2. The pi-pattern
P3can be interpreted as follows:
patients who have already had children presents mammo-grams with Birads 2 after entering the menopause, this pi-pattern has been verified in 177patients (support = 12%).
We notice that, for the time being, MILPRIT
∗ has not mined expressive patterns
relating patient's lifestyle and cancer occurrence as we expected in the beginning of theprojet. Clearly, this is due to the fact that in the current stage of the
AMDI database, withonly 1365 records, the Birads classification of the mammograms is almost unchanged.
Moreover, there are no cases of mammograms with Birads 5 in our database. We hopethat as the
AMDI database gets more populated and diversified, MILPRIT
∗ will be able todiscover more interesting patterns.
6. Conclusion and Further WorkIn this paper we introduced a first-order (or relational) temporal pattern mining frameworkwhere time can be viewed as
intervals as well as
points in a straight line. We also proposeda formalism to specify a broad class of constraints over such temporal patterns. We de-veloped and implemented the algorithm MILPRIT
∗ designed to mining temporal patternswith constraints. For the time being, our temporal pattern formulation does not include apattern where
several data atoms are related to the same temporal variable. For instance,we does not consider patterns like (
Patient(
x),
{Med(
x, penicillin, e),
Med(
x, xanax, f )
},
{equal(
e, f )
}).
Our future research focuses on three main targets: (1) the development of a sec-
ond algorithm for mining pi-patterns based on the technique PrefixSpan [Han et al. 2004]
6Breast Imaging Reporting and Data System BIRADS. American College of Radiology, 2004.
which presents a much better performance when compared to Apriori-based techniquesin the sequence pattern mining context, (2) the refinement of the method in order to in-corporate more general temporal patterns and (3) the development of a tool for temporalrelational pattern mining.
Allen, J. F. and Ferguson, G. (1994). Actions and events in interval temporal logic. Tech-
nical Report TR521.
Antunes, C. and Oliveira, A. L. (2005). Constraint relaxations for discovering unknown
sequential patterns. In
Knowledge Discovery in Inductive Databases, LNCS 3377,pages 11–32. Springer Berlin, Heidelberg.
Bettini, C., Wang, X. S., Jajodia, S., and Lin, J.-L. (1998). Discovering frequent event pat-
terns with multiple granularities in time sequences.
IEEE Transactions on KnowledgeData Engineering, 10(2):222–237.
B¨ohlen, M. H., Snodgrass, R. T., and Soo, M. D. (1996). Coalescing in temporal
databases. In
VLDB 1996, pages 180–191.
Boulicaut, J.-F., De Raedt, L., and (Eds.), H. M. (2006). Constraint-based mining and
inductive databases. In
European Workshop on Inductive Databases, volume 3848 ofLecture Notes in Artificial Intelligence.
de Amo, S. and Furtado, D. (2005). First-order temporal pattern mining with regular
expression constraints. In
20th Brazilian Symposium on Databases, pages 280–294.
de Amo, S., Giacometti, A., Furtado, D. A., and Laurent, D. (2004). An apriori-based
approach for first-order temporal pattern mining. In
19th Brazilian Symposium onDatabases, pages 48–62.
de Amo, S., Giacometti, A., and Pereira-Jr., W. (2007). Mining first-order temporal in-
terval patterns with regular expression constraints. In
9th International Conferenceon Data Warehousing and Knowledge Discovery (DaWaK 2007), LNCS 4654, pages459–469.
Dehaspe, L. and Toivonnen, H. (1999). Discovery of frequent datalog patterns.
Data
Mining and Knowledge Discovery, 3:7–36.
Emerson, E. A. (1990). Temporal and modal logic. In van Leeuwen, J., editor,
In
Handbook of Theoretical Computer Science, Volume B: Formal Models and Seman-tics, pages 995–1072. Elsevier Science Publishers.
Garofalakis, M., Rastogi, R., and Shim, K. (1999). Spirit: Sequential pattern mining with
regular expression constraints. In
The VLDB Journal, pages 223–234.
Han, J., Pei, J., Mortazavi-Asl, B., Pinto, H., and Dayal, U. (2004). Mining sequential pat-
terns by pattern-growth: The prefix-span approach.
IEEE Transactions on Knowledgeand Data Engineering, 16(11):1424 – 1440.
Hoppner, F. (2001). Discovery of temporal patterns – learning rules about the qualitative
behavior of time series. In
Proceedings of the 5th European Conference on Principlesand Practice of Knowledge Discovery in Databases, pages 192–203.
Jacobs, N. and Blockeel, H. (2001). From shell logs to shell scripts. In Rouveirol, C. and
Sebag, M., editors,
Proceedings of the 11th Int. Conf. on Inductive Logic Programming,volume 2157 of
LNAI, pages 80–90.
Lattner, A. and Herzog, O. (2005). Mining temporal patterns from relational data. In
LWA
2005, pages 184–189.
Lee, S. D. and De Raedt, L. (2002). Constraint based mining of first order sequences in
seqlog. In
KDID'02, pages 76–92.
Lorincza, H. A. and Boulicaut, J.-F. (2003). Mining frequent sequential patterns under
regular expressions: a highly adaptive strategy for pushing constraints. In
ICDM'03,pages 316–320.
Masson, C. and Jacquenet, F. (2003). Mining frequent logical sequences with SPIRIT-
LoG. In
Proceedings of the 12th International Conference on Inductive Logic Pro-gramming, volume 2583 of Lecture Notes in Computer Science, pages 166–181.
Springer.
Pei, J., Han, J., and Wang, W. (2002). Mining sequential patterns with constraints in large
databases. In
Proceedings of the 2002 ACM CIKM Conference, pages 18–25.
Srikant, R. and Agrawal, R. (1996). Mining sequential patterns: Generalizations and
performance improvements. In
EDBT'96, volume 1057, pages 3–17, Avignon, France.
Springer-Verlag.
Zaki, M. J. (2001). SPADE: An efficient algorithm for mining frequent sequences.
Ma-
chine Learning, 42(1-2):31–60.
Appendix: Proof of Theorem 2 -
Optimality.
Let
M = (
K, D, T )
∈ P(
E)
,M = (
K, D, T ) and D =
{p1(
t1
, ., t1
, e
, ., tn , e
1)
, ., pn(
tn
n)
} and let
ν(
M ) = (
n, c).
1. Case 1: if
c = 0 then
M must be generated by extension, that is,
M =
ρE(
M1),
for some pi-pattern
M1. Since the atoms in
D are ordered according to the partialorder introduced in Section 2, then
M0 = (
K,D0,
T 0)
∈ P(
E) with
• D0 =
{p1(
t1
, ., t1
, e
, ., tn−1
, e
1)
, ., pn−1(
tn−1
• T 0 =
{r(
ei, ej)
∈ T i, j < n}
2. Case 2: if
c > 0: Let us suppose that
M =
ρ(
M1) =
ρ(
M2). Notice that, since
c > 0, it is not possible that
M is obtained by extension from
M1 or
M2. Then,
M =
ρI(
M1) =
ρI(
M2). Let
M1 = (
K,D1,
T 1) and
M2 = (
K,D2,
T 2).
• D1 =
{p1(
x1
, ., x1
, e
, ., xn , e
1)
, ., pn(
xn
• D2 =
{p1(
y1
, ., y1
, e
, ., yn , e
1)
, ., pn(
yn
The strings of modes associated to
M1 and
M2 are respectively7:
• W 1 =
p1(
u1
, ., u1 )
, ., p
• W 2 =
p1(
v1
, ., v1 )
, ., p
(I) Since
M =
ρI(
M1) then there exists
i1
, j1 (uniquely determined) such that
ui1
j1=
∗,
xi1 is a variable and
ti1 is a constant. Moreover, for all
i, j, if
i 6=
i
j 6=
j1 then
xi =
ti .
(II) Since
M =
ρI(
M2) then there exists
i2
, j2 (uniquely determined) such that
ui2 =
∗,
xi2 is a variable and
ti2 is a constant. Moreover, for all
i, j, if
i 6=
i
j 6=
j2 then
xi =
ti .
We must consider three possibilities, namely:
(a)
i1 =
i2 and
j1
< j2: in that case, both
ti1 and
ti2 are constants (according
to points (I) and (II)). Moreover, since
M =
ρI(
M1) and
j1
< j2 then
M is obtained by instantiating a variable in position
j1 in the
i1-th atom.
But in this
i1-th atom, a variable in position
j2
> j1 is already instantiated.
Contradiction (see Definition 6).
(b)
i1
< i2: in that case
vi2 =
ui1 =
∗ and
xi1
, yi2 are constants. That is
impossible, since by Definition 6, a variable
xi1 cannot be instantiated if it
appears in the
i1-th atom and a variable
xi2 in the
i
2-th atom (
i2
> i1) has
already been instantiated before.
(c)
i1 =
i2 and
j1 =
j2: in that case we have
ui1 =
vi2 =
∗ and
xi1 ,
yi2 are
variables. So,
M1 and
M2 are equivalent (
M1 can be obtained from
M2 byrenaming its variable
xi1 as
yi2).
7For the sake of simplifying the notation, we omitted the symbol # at the end of the modes.
Source: http://www.deamo.prof.ufu.br/arquivos/IJDWM2008.pdf
Sexual Offender Treatment ISSN 1862-2941 Legal and Ethical issues when usingAntiandrogenic Pharmacotherapy with SexOffenders Karen HarrisonUniversity of the West of England - Bristol [Sexual Offender Treatment, Volume 3 (2008), Issue 2] The treatment of sex offenders and more specifically the treatment of high-risk sex offenders is asubject of great importance for practitioners, professionals, policymakers and the public at large.Whilst treatment is thought to largely centre upon cognitive-behavioural methods and otherpsychotherapy techniques, in more recent years the use of pharmacotherapy has also begun togain ground. Current debate often centres upon how effective such treatment is; with bothsupporters and opponents of its use existing. This article, however, does not specifically look atwhether pharmacotherapy as a method of treatment with sex offenders actually works, but ratherlooks at the legal and ethical issues surrounding its use. In particular it considers issues such aswhether the treatment should be voluntary or mandatory; whether it should indeed even beclassified as treatment or should instead be seen as punishment and finally whether it should beused with convicted offenders or made freely available to all.
ELECTROSPINNING OF POLY(VINYL ALCOHOL)/CHITOSAN FIBERS FOR WOUND DRESSING APPLICATIONS MS.SUNISA PANBOON A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE (MATERIALS SCIENCE) DEPARTMENT OF INDUSTRIAL CHEMISTRY GRADUATE COLLEGE KING MONGKUT'S INSTITUTE OF TECHNOLOGY NORTH BANGKOK ACADEMIC YEAR 2005 ISBN 974-19-0476-2