deamo.prof.ufu.br

Medicine Lakex

medicinelakex1.com

/d/deamo.prof.ufu.br1.html

Deamo.prof.ufu.br

MILPRIT∗: A Constraint-Based Algorithm for Mining Temporal Relational Patterns∗ Sandra de Amo1, Waldecir P. Junior1, Arnaud Giacometti2 1Faculdade de Computac¸˜ao – Universidade Federal de Uberlˆandia (UFU) Campus Santa Mˆonica, Bloco B – Uberlˆandia – MG – Brazil 2Universit´e de Tours, Laboratoire d'Informatique, 2, Place Jean Jaur es 45000 Blois, France, Abstract. In this paper, we consider a new kind of temporal pattern where bothinterval and punctual time representation are considered. These patterns, whichwe call temporal point-interval patterns aim at capturing how events takingplace during different time periods or at different time instants relate to eachother. The datasets where these kind of patterns may appear are temporal re-lational databases whose relations contain point or interval timestamps. Weuse a simple extension of Allen's Temporal Interval Logic as a formalism forspecifying these temporal patterns. We also present the algorithm MILPRIT∗for mining temporal point-interval patterns, which uses variants of the classi-cal level-wise search algorithms. Besides, MILPRIT∗ allows a broad spectrumof constraints to be incorporated into the mining process. An extensive set ofexperiments of MILPRIT∗ executed over synthetic and real data is presentedshowing its effectiveness for mining temporal relational patterns. Keywords. Temporal Data Mining, Constraint-based Mining, Interval Temporal Logic.
1. Introduction and Motivation The problem of discovering sequential patterns in temporal data has been extensivelystudied in the past years [Srikant and Agrawal 1996, Zaki 2001, Han et al. 2004] and itsimportance is fully justified by the great number of potential application domains wheremining sequential patterns appears as a crucial issue, such as financial market (evolutionof stock market shares quotations), retailing (evolution of clients purchases), medicine(evolution of patients symptoms), local weather forecast, telecommunication (sequencesof alarms output by network switches), etc. Most of these patterns are specified by for-malisms which are, in some extent, reducible to Propositional Linear Temporal Logic,where time is represented by points in a straight line. For instance, let us consider aclassical sequential pattern of the form s =< {a, b}, {c, d} > (where {a, b} and {c,d}are sets of items purchased by a client). This sequential pattern can be expressed inthe Propositional Linear Temporal Logic by the formula Pa ∧ Pb ∧ ♦(Pc ∧ Pd) wherefor each i ∈ {a, b, c, d}, Pi is a propositional symbol standing for "client buys item i".
The symbol ♦ stands for the temporal operator sometimes in the future (for a compre-hensive survey on Linear Temporal Logic, see [Emerson 1990]). The need for a more ∗This research has been supported by CNPq-Brazil, Project 473309/2004-1 expressive kind of temporal patterns arises for instance, when modelling Unix-users be-haviour [Jacobs and Blockeel 2001]. Consider, for instance1, the following sequence ofcommands related to latex users: ls, vi paper.tex, latex paper.tex, dvips paper.dvi, lprpaper.ps. This sequence of commands can be represented as a sequence of relational(or first-order) atoms of the form: ls, vi(paper.tex), latex(paper.tex), dvips(paper.dvi),lpr(paper.ps). Within such a database of sequences of relational atoms, it would be pos-sible to discover that the relational sequence pattern vi(X), latex(X) is frequent. Noticethat this pattern tell us that the sequence of commands (predicates) vi, latex is frequentlyrequested by users. In order to specify this kind of sequential patterns, we need a morepowerful logical formalism, the First-Order Linear Temporal Logic, since the elementsinvolved in the patterns include both predicates and their parameters.
In all the above examples illustrating the propositional and relational settings of the sequential pattern mining problem, the events are instantaneous, that is, the time whenthey happen is represented as a point (instant) in a straight line. The sequence of eventscorresponds to a sequence of instants when these events take place. However, there aremany situations where events have a certain duration, and so, the underlying time is mea-sured in terms of intervals instead of points. In this paper, we propose a new tempo-ral pattern involving a hybrid representation for time and a new algorithm (MILPRIT∗)for mining them. These patterns, which we call temporal point-interval patterns (or pi-patterns for short), aim at capturing how events taking place in time intervals or timeinstants relate to each other. For instance, (1) in a medical application, we could be inter-ested in discovering if patients who take some medicine X during a certain period of timeP , and in some moment m in P undergo a stomach surgery, will present the symptomZ during a period of time beginning right after the surgery and finishing as soon as theystop taking the medicine X; (2) in an agricultural application, we could be interested indiscovering if the use of some organic fertilizer during a period of time has an effect onthe way a plant grows during and after the fertilizer application. The following exampleillustrates the medical application in more details.
Example 1 (Running Example) Let us consider the database schema R = {Patient, Med,Symp, Hist} and the relational database instance D over R illustrated in Figure 1. Noticethat the following behavior is verified by 50% of the patients (two patients out of four inthe Patient relation verified it, namely, Paul and Sarah): the patient takes penicillin duringa certain period of time e; during a period of time f following his taking the medicine, hefeels dizzy and undergoes a stomach surgery someday t during f.
Temporal pi-patterns are specified as a set of atomic first order formulae where time is represented by an interval-time variable or by a point-time variable, together witha set of temporal predicates { before, meets, overlaps, during, starts, finishes }. Ourlogical formalism for specifying pi-patterns is based on Allen's First Order Interval Logic[Allen and Ferguson 1994]. For instance, the temporal pattern described in Example 1 isspecified by the triple (K,D,T ) where: K = Patient(x) (where x is a registered patient),D = {Med(x, penicillin, e), Symp(x, dizziness, f ), Hist(x, st.surgery, t)} (representingthe events which take place in the pattern), T = {before(e, f ), during(t, f )} (representingthe relationships verified by the time parameters associated to the pattern events).
In most applications, the lack of user control for specifying the kind of patterns 1This example has been taken from [Lee and De Raedt 2002].
Pname Mname
Pname Sname
Pname Event
stomach surgery [5] Charles tetracyclin [4,7] Charles dizziness [5,6] stomach surgery [9] 10/05/2000 15/11/2000 20/07/2001 06/09/2001 09/01/2002 23/08/2002 19/12/2002 Figure 1. A Temporal relational database instance
she would find interesting can lead to tedious and computationally costly post-processingphases. In this paper, we use a formalism based on regular expression in order to allowusers to specify the shape of potentially interesting patterns. Our temporal mining methodMILPRIT∗ is constraint-based: it incorporates constraints inside the mining process andin doing so, it accomplishes two goals: it saves a lot of calculation in the post-processingphase and it allows a non-negligent reduction in the search space of potential patterns. Thefollowing example illustrates the idea of using regular expressions to specify constraintson sequential patterns.
Example 2 (Running Example) Let us consider the same situation depicted in Example1. Users could be interested only in discovering pi-patterns where patients take somemedicine (maybe more than one) and at the end present some symptom. Roughly speak-ing, we can say that these patterns fit into the regular expression format Med∗Symp. In thiscase, it would be important that patterns representing situations where patients present alist of symptoms and after that take some medicine, are not generated in the mining pro-cess, since such patterns do not fit in the given regular expression format.
Usually, in real world applications, it is important to allow different time gran- ularities to be considered when evaluating pattern support, otherwise some interestingpatterns will not be mined. For instance, let us consider a database schema R = {R(A,T),S(B,T)}, where T is an interval-time attribute. If the values of the temporal attributesin the database are dates (day/month/year), the temporal predicate meets(e,f) will be truein this database only if e and f are mapped into time intervals [d1, d2], [d3, d4] respec-tively, where d1, d2, d3, d4 are dates and d2 = d3. If there is very few pair of tuples(R(a, [d1, d2]), S(a, [d2, d4]) we are unable to mine patterns of the form (K(x), {R(x,e),S(x,f)}, {meets(e,f)}). On the other hand, we can have a lot of pair of tuples (R(a, [d1, d2]),S(a, [d3, d4]) where d2 and d3 are different days but belong to the same week in the year.
In this case, if we relax the time granularity and consider week to be the time unity,then d2 and d3 will be consider identical and hence, the pattern (K(x), {R(x,e), S(x,f)},{meets(e,f)}) will be mined. The algorithm MILPRIT∗ we present in this paper allowsusers to specify different time granularities when evaluating the pattern support.
Related Work. A lot of studies on constrained-based temporal mining, in a proposi-tional context, have been investigated in order to reduce the huge amount of patternswhich are generated in the relational mining context. [Srikant and Agrawal 1996] is a pioneer work introducing constraints in sequential pattern mining in order to reduce ex-ecution time and the number of discovered patterns. In this work, the constraints (min-max, slide-window) are related to the way the support of patterns is counted and notto the patterns format. Recently, a lot of work on constrained-based temporal mininghave investigated formalisms to specify constraints over the pattern format. Regularexpressions have been adopted in many of these work as a very general mechanism tospecify user constraints. [Garofalakis et al. 1999] was a pioneer work in this field, in-troducing the SPIRIT family of algorithms for constraint-based sequential pattern min-ing. [Lorincza and Boulicaut 2003] and [Pei et al. 2002] extended this pioneer work, byproposing new and more efficient algorithms (RE-Hackle and Prefix-Growth respectively)for sequential pattern mining where regular expressions are used to specify useful patternsbeforehand. Such constraints are pushed inside the mining process instead of being ver-ified only in a post-processing phase. [Antunes and Oliveira 2005], following the linesof [Garofalakis et al. 1999], proposed a more general formalism, based on context-freegrammars, in order to specify constraints over sequential patterns.
In [de Amo and Furtado 2005], we tackled the constraint-based temporal min- ing problem in the relational context, by proposing a formalism to specify con-straints over the temporal patterns we had introduced in [de Amo et al. 2004].
[Lee and De Raedt 2002], the algorithm SeqLog, based on Inductive Logic Program-ming, was introduced for discovering relational sequential patterns with constraints. In[Masson and Jacquenet 2003], the algorithm Spirit-Log, designed to mine relational tem-poral patterns was introduced. Spirit-Log extends SeqLog (where time is punctual), bypushing regular expression constraints inside the mining process. A comprehensive textabout constraint-based temporal relational pattern mining following an inductive databaseapproach can be found in [Boulicaut et al. 2006]. In all these work, time is representedby points in a straight line. The approach we propose in this paper is more general sinceit allows mining relational temporal patterns where time is represented by points as wellas by intervals. We notice that most real world applications involve both kind of timerepresentation.
In [Hoppner 2001], Allen's Propositional Interval Logic has been used for the first time, to treat the problem of discovering association rules over time series[Allen and Ferguson 1994]. In [Lattner and Herzog 2005], an approach similar to oursis presented for temporal relational pattern mining where time is represented as intervals.
However, this approach is not constraint-based like ours. Moreover, no experimentalresults concerning the performance of the proposed method are provided, which makesdifficult a comparative analysis with our approach.
The issue of allowing user-specified different time granularities during sup- port counting has been treated in the pioneer work of [Srikant and Agrawal 1996] and[Bettini et al. 1998], in the context of sequential pattern mining and time series. Here,patterns were specified in a formalism equivalent to propositional temporal logic. More-over, time attributes are point-valued. We emphasize that our approach is more generalsince it is developed in a temporal relational database context where time attributes maybe point-valued as well as interval-valued.
Our Contribution. We can summarize our main contributions as follows: (1) the in-troduction of a new temporal relational pattern allowing interval and point-based timerepresentation in relations, as well as a logical formalism to specify these patterns. (2) theintroduction of pattern constraints in the mining process, more precisely during the can-didate generation phase, (3) the possibility of considering different user-specified timegranularities during the support count phase, (4) the design and implementation of an al-gorithm (MILPRIT∗) to mine such constrained temporal patterns and (5) an extensive setof experiments over synthetic and real datasets. This is an extended version of the paper[de Amo et al. 2007]. In that paper, we had only considered interval-based time represen-tation in relations and we had not treated user-specified time granularities. MILPRIT∗ is ageneralization of the MILPRIT algorithm presented in [de Amo et al. 2007], designed tomining point-interval patterns with user-specified time granularities. The introduction ofpoint-based time representation required to consider a broader set of temporal predicatesin the generation phase. We took care to consider a minimal set of new temporal predi-cates in order to simplify the generation phase without losing completeness. MILPRIT∗can be considered as a general algorithm for temporal relational pattern mining sinceit can be executed over a dataset of multiple tables with point-based and interval-basedtimestamps. It can be applied in any situation which can be modeled by a temporal re-lational database involving point-valued and interval-valued attributes. Medical applica-tions, for instance, are typical applications where both time representations (point-basedand interval-based) must be considered. In this paper, we focus on this kind of applicationwhen evaluating the efficiency of our algorithm (see Section 5.2).
Paper Organization. This paper is organized as follows. In Section 2, we formalize themain concepts related to the problem of mining pi-patterns. In Section 3, we propose aconstraint-based framework for mining pi-patterns and formalize all the concepts involvedin the presentation of the main algorithm (MILPRIT∗). Section 4 is dedicated to the pre-sentation of MILPRIT∗, an algorithm for mining pi-patterns with constraints. In Section5 we present and analyse the experimental results obtained by executing MILPRIT∗ oversynthetic data as well as over real data related to a medical application. Finally in Section6, we present our concluding remarks and discuss some future research on the subject.
2. Problem Formalization In this section, we introduce the main concepts related to the problem of discoveringtemporal point-interval patterns.
The Dataset. First of all, let us describe the datasets where the discovery procedurewill be executed later on. We assume a set of attributes symbols A1, A2, ., called dataattributes, and two special attribute symbols Tit and Tpt, called interval-time attribute andpoint-time attribute respectively. Let R = {K, R1, ., Rn} be a database schema such thateach relational schema Ri is of the form Ri(Ai , ., Ai , T , ., Ai , T pt) or Ri(Ai relational schema K (called the reference relation) is of the form K(A1, ., Am). Theattributes appearing in K (all data attributes) are called reference attributes.
The values for data attributes Ai are taken within a set Dom, called the data do- main. The attribute Tit takes values in the set I = {[i, j] i, j ∈ N, i < j}. The elements of I are intervals [i, j] where i and j are natural numbers mapping dates: i is the starting dateand j is the ending date of the interval [i, j]. The attribute Tpt takes values in the set N.
Elements of Tpt are natural numbers mapping dates. Naturally, a point can be viewed asan interval [i, j] where i = j. In our approach, in order to optimize the generation phaseof the algorithm MILPRIT∗, we treat interval and points as distinct entities, emphasizingthat the starting and ending points of an interval are different.
A temporal dataset over R is a set of temporal relations {k, r1, ., rn} where each ri is a set of tuples (a1, ., an , e) over R i and k is a set of tuples over K . We will assume that the relations ri with interval-valued timestamp are coalesced[B¨ohlen et al. 1996], thatis, if (a1, ., ak, e) and (a1, ., ak, f) ∈ ri, then e ∩ f = ∅, i.e., intervals associated to asame tuple are maximal. The dataset depicted in Figure 1 is a temporal dataset with thefollowing schema: Temporal pi-Patterns. Let R = {K, R1, ., Rn} be a database schema. Let us sup-pose four disjoints sets V (data variables, denoted by x, y, z, x1, .), C (data constants,denoted by a, b, c, a1, .), Vit (interval-time variables, denoted by e, f, g, e1, .) and Vpt(point-time variables, denoted by t, s, t1, s1, .). Let us also assume the set of temporalpredicates PT = {before, meets, overlaps, during, finishes, starts}, as in Allen's IntervalLogic [Allen and Ferguson 1994]. A data atom over the database schema R is an expres-sion of the form R(x1, ., xn, e) where R ∈ R, (R 6= K), each xi is a data variable ora data constant and e is an interval-time or point-time variable (depending on the timeattribute of R). A temporal atom is an expression of the form r(e, f ), where r is oneof the temporal predicates in PT and e, f are interval-time or point-time variables. Ifr ∈ {overlaps, meets}, then e and f are interval-time variables. For the temporal predi-cates before, during, starts, finishes we may have different temporal atoms: before(e, f ),before(t, e), before(e, t), before(t, s), during(e, f ), during(t, e), starts(e, f ), starts(t, e),finishes(e, f ), finishes(t, e), where e, f are interval-time variables and t, s are point-timevariables. Figure 2 illustrates the semantics of the temporal predicates. For instance, theformula before(e, f ) is true if the variables e and f are evaluated as [4, 6] and [8, 9] respec-tively. The formula before(t, e) is true if t is evaluated as 5 and e as [6, 7]. The formuladuring(e, f ) is true if the variables e and f are evaluated as [5, 7], and [4, 9] respectively.
Interval-time x Interval-time
Point-time x Point-time
Point-time x Interval-time
before(t,e) during(t,e) starts(t,e) finishes(t,e) before(e,t) Figure 2. Semantics of Temporal Predicates
Definition 1 (Temporal pi-pattern) A temporal pi-pattern (temporal point-interval pat-tern) is a triple (K,D,T ) where (1) D is a set of data atoms (2) T is a set of temporalatoms and (3) K is a special data atom K(x1, ., xn) whose variables x1, ., xn appear inD. The data atom K is called the reference query of the pi-pattern (K,D,T ). It is essential in the definition of a pi-pattern, since it specifies what is counted. It plays the same rolein our setting as the atom key in the Datalog patterns of [Dehaspe and Toivonnen 1999].
Example 3 (Running Example) Let (Patient(x), D, T ) where D = {Med(x, y, e), Symp(x, z, f )} and T = {before(e, f )}.
Intuitively, this temporal pattern translates the fact that a patient x takes a medicine y (notspecified) during a period of time e and during a further period of time f (after she hasstopped taking the medicine y) presents a symptom z (not specified). Note that we arealways interested in analyzing the behaviour of patients registered in the relation Patient,even though relations Med and Symp may contain data related to other patients (not onlythose registered in Patient). For this reason, Patient(x) is called the reference query.
An instantiation of an interval-time variable e (resp. a point-time variable t) is a mapping v associating an interval v(e) = [i, j] to e (resp. a point i to t), where i and j arenatural numbers (mapping dates). A ground temporal atom is the temporal atom obtainedby instantiating its interval-time and point-time variables.
Definition 2 (Support and frequency of a pi-pattern) Let D be a temporal databaseover the schema R = {K, R1, ., Rn} and M = (K(x1, ., xm),{D1, .Ds}, {T1, ., Tl})be a temporal pi-pattern over R. The support of M with respect to D, denoted by sup(M)(omitting the dataset D, for the sake of simplifying the notation), is defined by sup(M) = {u∈K u =QM } where Q M is the relational calculus query ∃y1∃y2.∃yk(D1 ∧ D2 ∧ . . ∧ Ds ∧ T1 ∧ . . ∧ Tl), and y1, ., yk are the data variables not appearing in the referencequery K(x1, ., xm). The expression u = QM means that u is an answer to the queryQM when executed over D. Given 0 ≤ α ≤ 1, we say that the temporal pi-pattern M isfrequent with respect to D and α if sup(M) ≥ α.
Example 4 (Running Example) Let us consider the temporal pi-pattern M = (Pa-tient(x),{Med(x, y, e), Symp(x, z, f )},{before(e, f )}). Let D be the temporal databaseinstance illustrated in Figure 1. The relational query QM can be specified in the relationalalgebra formalism by the expression ΠPName(Patient o n Symp). The answer of QM is {(Paul), (Sarah)}. Note that Charles is not an answer. Indeed, Charles has takentetracycline and was dizzy during (and not after) the time he was taking this medicine.
John is not an answer neither, because he has not taken any medicine at all. So, the supportof M w.r.t. D is 1.
In what follows, we suppose that the set of temporal atoms T is consistent, i.e., the temporal variables can be instantiated in a consistent way. Formally, a set of temporalatoms T is consistent if there is an instantiation v of the set of temporal variables appear-ing in T , such that the ground atoms obtained are simultaneously true. We denote this byv = T . For instance, the set of temporal atoms {before(e, f ), before(f, g), before(g, e)}is not consistent, because the first two atoms imply that e comes before g and so, there isno way to instantiate the temporal variables e, f and g in order to make the three temporalformulae simultaneously true. We also suppose that the set T is complete, i.e., for eachpair of temporal variables e, f appearing in T it is possible to infer a unique temporalrelationship between e and f . For instance, the set {before(e, f ),meets(f, g)} is completebecause the relationship between e and g is completely determined (even though it doesnot appear explicitly in the set): the only possibility is before(e, g). On the other hand,{during(e, f ), overlaps(f, g)} is not complete since the relationship between e and g is not deterministically implied by the two relationships during(e, f ), overlaps(f, g): it could be(1) before(e, g), (2) meets(e, g), (3) overlaps(e, g), (4) during(e, g) or (5) starts(e, g). Thatis, for each one of the five temporal atoms (I) (where (I) ∈ {(1), (2), (3), (4) (5)}) the set{during(e, f ), overlaps(f, g), (I)} is consistent.
Representing temporal pi-patterns as sequences. The notion of consistency and com-pleteness of a set of temporal atoms is essential in our approach. As we will see next, givena consistent and complete set of temporal atoms, its temporal variables can be ordered ina natural way. This implies that a pi-pattern can be viewed as a sequence of data atoms.
And so, we can naturally think about regular expressions as a formalism for specifyingspecial sets of pi-patterns. Due to space limitation, we will only give an informal defini-tion of the order relation <T over the set of temporal variables appearing in a pi-pattern.
The idea can be easily formalized.
Let us consider the domain I for the interval-time variables and the domain N for the point-time variables. We can view a natural number n as an interval [n, n]. So, weconsider the set I∗ = I ∪ N as being constituted of intervals [i, j], with i ≤ j. Let [a, b]and [c, d] in I∗. We say that [a, b] < [c, d] if and only if one and only one of the followingconditions is verified: (1) b < d or (2) b = d and a < c. It can easily be shown that < is a(strict) total order over the set I∗.
The (strict) ordering < over the interval structure I∗ naturally induces an ordering <T over the set of temporal variables (interval-time and point-time) appearing in the set of temporal predicates T of a pi-pattern. The following example illustrates the idea: let T= {before(e,f), starts(f,g)}. Then the induced ordering over the set of temporal variables{e, f, g} appearing in T is e <T f <T g. Indeed, for all instantiation v of the temporal variables, v(e) = [ae, be], v(f) = [af , bf ], v(g) = [ag, bg], such that the predicates in Tare simultaneously true, the ending points be, bf , bg satisfy be < bf < bg. That is, for allinstantiation v such that v = T , we have v(e) < v(f ) < v(g).
The following Theorem guarantees that the relation <T is well-defined and also an order relation.
Theorem 1 Let T be a complete and consistent set of temporal atoms. Then (1) the def-inition of <T does not depend on the particular instantiation v of the temporal variables appearing in T and (2) <T is a strict order relation over the set of temporal variables appearing in T .
Proof (Sketch). In order to simplify the presentation, we suppose that we have onlyinterval-time variables. The case where one has interval-time and point-time variables istreated in a similar way. (1) By the fact that T is consistent, we can affirm that there existsa valuation v of the temporal variables appearing in T such that v = T . If v is unique,then <T is well-defined, since there is a unique way to define it through the valuation v.
Now, let us suppose that such valuation is not unique, that is, let us suppose that v and v0are two valuations of temporal variables of T such that v = T and v0 = T . From the factthat T is complete, we can affirm without loss of generality that there exists a temporalpredicate r such that v = r(e,f) and v0 = r(e,f). So, v(e) < v(f ) if and only if r(e,f) iseither before(e,f) or during(e,f) or meets(e,f) or overlaps(e,f) or finishes(e,f) or starts(e,f).
In each one of these cases, from the fact that v0 = r(e,f), we conclude that v0(e) < v0(f ).
(2) In order to prove that <T is a strict order over the temporal variables appearing in T , we must show that it satisfies the irreflexivity and transitivity properties. It is obvious that<T is irreflexive. Let e <T f and f <T g, where e, f, g are temporal variables appearing in T . Let r(e, f ), r0(f, g) and r00(e, g) be the unique temporal relationship inferred fromT for the pair of variables (e, f ), (f, g) and (e, g) respectively. We can guarantee this byusing the fact that T is complete. As T is consistent, there exists a valuation v of thetemporal variables e, f, g such that v = r(e, f ) ∧ r0(f, g) ∧ r00(e, g). Now, using the factthat e <T f and f <T g, and considering all the possibilities for the temporal predicates r, r0 and r00 it is not difficult to show that v(e) < v(g) in all cases.
The induced order <T over the temporal variables appearing in a temporal pi- pattern M allows us to view M as a sequence of data atoms. Indeed, let M = (K, D, T ),where D = {p1(x1, ., x1 , e , ., xk , e 1), . . , pk(xk k)}. Since we are assuming that T is complete and consistent, then, according to Theorem 1 its temporal variables are orderedby the order <T . Let us suppose, without loss of generality, that e1 <T e2 <T . . <T ek.
So, the set D can be viewed as the sequence < p1(x1, ., x1 , e , ., xk, e 1), . . , pk(xk 3. Temporal pi-Patterns with RestrictionsIn an application where the user has a previous idea of some specific characteristics ofthe patterns she is searching, she can be overwhelmed with uninteresting patterns. In thissection, we propose to use regular expressions as a formalism to specify a broad spectrumof constraints over pi-patterns in order to better satisfy user requirements as well as toreduce the search space of patterns.
A mode over a temporal database R = {K, R1, ., Rk} is an expression of the form R(u1, ., us, #), where each ui is a data variable or the (new) symbol ∗, # is a newsymbol, and R(A1, ., An, T ) is one of the predicates Ri ∈ R, T ∈ {Tit, Tpt}. We saythat a data atom R(y1, ., ys, e) is in accord with the mode R(u1, ., us, #) if for eachi = 1, ., s we have: (1) If ui is a variable then yi = ui; (2) If ui = ∗, then yi is avariable or a constant ; (3) e is a interval-time variable if T = Tti or a point-time variableif T = Tpi. For instance, Med(x, ∗, #) is a mode and the atom Med(x, penicillin, e) isin accord with Med(x, ∗, #). On the other hand, the atom Med(y, penicillin, e) is not inaccord with the mode Med(x, z, #).
Let Σ be a set of modes over R = {K, R1, ., Rn}. In what follows, we will consider regular languages (sets of strings) over the alphabet Σ. These languages arespecified by regular expressions. Given a regular expression E over the alphabet Σ, wedenote by W (E) the set of strings (words) verifying E. We denote by P (E) the set ofprefixes of strings in W (E). We will need this notation later.
Definition 3 (The Search Space) Let R = {K, R1, ., Rn} be a database schema, Σa set of modes over R and E a regular expression over Σ. The search space de-fined by E is the set of temporal pi-patterns (K(x1, ., xm),D,T ) such that D = <p1(t1, ., t1 , e , ., tn , e 1), . . , pn(tn n) > satisfies the following properties: (1) There ex- ists a string w1w2.wn ∈ W (E), such that pi is in accord with wi for each i = 1, ., n;(2) For each data atom pi(xi , ., xi , e i) ∈ D let wi = pi(u1, ., ul i) be its associated mode; if uj = ∗ and xj is a variable, then xj has only one occurrence in D (a data variablesymbol xj can be used to replace a symbol ∗ in a mode only once). We denote by W(E)the search space specified by E. We denote by P(E) the set of pi-patterns defined in a similar way as W(E), by considering strings w1w2.wn ∈ P (E) instead of W (E) incondition (1). Patterns in W(E) are called valid. Patterns in P(E) are called prefix valid.
As we will see in the next section, in the generation phase of our algorithm, only pi-patterns in P(E) (prefixes) will be generated. The reason why we generate prefix-validpatterns (p-valid patterns for simplify) rather than valid patterns will be clearer later.
Example 5 (Running Example) Let E = Med(x, ∗, #) Med(x, ∗, #)∗ Symp(x, ∗, #).
The pi-pattern M1 = (Patient(x), {Med(x, penicillin, e), Med(x, tetracyclin, f),Symp(x, diarrhea, g)}, {before(e, f ), overlaps(f, g)}) belongs to W(E). The pattern M2= (Patient(x), {Med(x, penicillin, e), Med(x, tetracyclin, f )}, {before(e, f )}) belongs toP(E). The pattern M3 = (Patient(x), {Med(x, y, e), Symp(x, y, f)}, {before(e, f)}) doesnot belong to P(E), since property (2) of definition 3 is not verified. Intuitively, the reg-ular expression E captures the temporal pi-patterns corresponding to a patient x takingcertain medicines during successive periods and eventually presenting some symptom.
Our mining task is: Given a temporal database D, a minimum support threshold α,0 ≤ α ≤ 1, and a regular expression E over Σ, find all temporal pi-patterns in W(E)which are frequent with respect to D and α.
4. The Algorithm MILPRIT∗In this section, we present the algorithm MILPRIT∗ (Mining Interval Logic Patterns withRegular expressIons consTraints) which generalizes the idea of the SPIRIT algorithmintroduced in [Garofalakis et al. 1999]. In a high level, it follows the general Aprioristrategy of [Srikant and Agrawal 1996], working in passes, and each pass producing pat-terns more specificic than those produced in the previous pass. The classical Aprioristrategy uses a Pruning Phase, where generated patterns which are more specific thana non-frequent pattern p are pruned. This pruning strategy relies on the AntimonotonicProperty: "a frequent pattern cannot be more specific than a non-frequent pattern".
At a first glance, a way of conceiving a method for mining pi-patterns, following the Apriori strategy would be: (1) at each pass k, one combines frequent and valid patternsin the set Fk−1 obtained in the previous pass in order to generate the set of candidates Ck.
One takes care to produce only valid patterns in this combination, pushing the regularexpression E in the generation process; (2) one prunes those patterns in Ck containing a subpattern which is valid and which is not in k−1 F i (this pattern surely is not frequent!); (3) one scans the database in order to count the support of the remaining patterns. In thismethod, the number of generated patterns is in direct proportion to the restrictiveness ofE. In the pruning phase, however, all patterns which are more specific than a patternsatisfying E and not belonging to Fk−1 should be pruned. In this case, the size of the setof pruned patterns is in inverse proportion to the restrictiveness of the constraint E. Suchtrade-offs are due to the fact that regular expression constraints are not anti-monotone,that is, if a pattern satisfies E it may have some subpattern not satisfying E2.
In order to find a suitable trade-off, we use a relaxation of the constraint E, that is, a less restrictive one, the prefix constraint associated to E, according to which, onlyfrequent p-valid pi-patterns will be produced in the mining process. A post-processingphase will filter the valid patterns.
2For instance, the word aba verifies the regular expression E = ab∗a but its subword ba does not verify E.
The general structure of the algorithm MILPRIT∗ is depicted below: Procedure MILPRIT∗(D,α,E,δ)D is a temporal dataset over R = {K, R1, ., Rn}, E is a regular expression over the alphabet of modesΣ over R, α is a minimum support threshold and δ is the time granularity (day, week, fortnight, month,bimester, trimester, semester, year). Let M1 be the set of data atoms Ri(x1, ., xn , e), where e is a interval-time or a point-time variable, depending on the sort of the temporal attribute appearing in Ri.
begin1. F := F1 = {M ∈ M1 : M is frequent and M ∈ P(E)}2. k := 23. repeat 4. Ck := Gen(Fk−1,E) (Generation Phase)5. P := {M ∈ Ck : ∃N ∈ P(E),such that N 6∈ F and M is more specific than N}6. Ck := Ck − P (Pruning Phase)7. Fk := {M ∈ Ck : sup(M, D, δ) ≥ α} (Validation Phase)8. F := F ∪ Fk9. k := k + 1 10. until Fk−1 = ∅11. F := {M ∈ F : M ∈ W(E) } (Post-Processing Phase) The function sup(M, D, δ) is defined as follows: First, each date appearing in some tuple in D is associated to the corresponding time value determined by the timegranularity δ. For instance, if δ = month and date is 20/09/2007 and the minimum dateappearing in D is 28/07/2006 then 20/09/2007 is mapped into 15, since date 20/09/2007occurs in the 15th month relatively to the first month July 2006. Let D0 be the databaseobtained from D after this transformation. The support sup(M, D, δ) is obtained by cal-culating sup(M, D0) as in Definition 2.
The Generation Phase. MILPRIT∗ works in passes, each pass k producing patterns of"level" k. The "level" of a temporal pi-pattern M is measured in terms of the refinementvector associated to M as defined below.
Definition 4 The refinement vector associated to M = (K, D, T ) is v(M) = (n, c) wheren is the size of D and c is the number of constants appearing in D. The refinement level ofM, denoted by l(M), is defined as n + c where (n, c) = v(M). For instance, let M be thetemporal i-pattern (Patient(x), {Med(x, penicillin, e), Symp(x, z, f )}, {before(e, f )}).
Then v(M) = (2, 1) and l(M) = 3.
The procedure Gen(Fk−1,E) is designed to generate temporal pi-patterns whose refinement level is k, by specializing the frequent temporal pi-patterns of Fk−1 (whoserefinement level is k − 1) in such a way that the patterns produced are in P(E). The coreof procedure Gen is the two specialization operators (ρE and ρI) defined below.
Definition 5 (Extension) Let E be a regular expression and M ∈ P(E), M = (K(x1, ., xn), D, T ). The extension operator ρE executed over M is defined as fol-lows: (a) if v(M) = (n, c) with c > 0, then ρE(M) = ∅; (b) if v(M) = (n, 0) thenρE(M) is the set of temporal pi-patterns M0 = (K(x1, ., xn), D0, T 0) ∈ P(E) such thatD0 = D∪{pn+1(z1, . . , zl, en+1)} where pn+1 ∈ R, z1, ., zl are data variables and en+1 isa interval-time or a point-time variable (according to the sort of the time attribute of pn+1) and T 0 = T ∪ {r1(e1, en+1), . . , rn(en, en+1)} where ri are temporal predicates, and T 0is complete and consistent. Clearly, for each i = 1, ., n the temporal predicates ri aretaken according to the type of the time variables ei, en+1 (point or interval) (see Figure 2).
For instance, if both ei and en+1 are point-time, ri must be the temporal predicate before.
The following example illustrates the operator ρE: Example 6 (Running Example) Let E be the regular expression Hist(x,*,#)∗Med(x,*,#)∗.
The pi-pattern M = (Patient(x), {Hist(x,y,t1)}, ∅) belongs to P(E).
Let us consider the following pi-patterns: M1 = (Patient(x), {Hist(x, y1, t1), Hist(x, y2, t2)}, {before(t1, t2)})M2 = (Patient(x), {Hist(x, y1, t1), Hist(x, y2, t2), Med(x, z, e3)}, {before(t1, t2), meets(t2, e3)})M3 = (Patient(x), {Hist(x, y1, t1), Hist(x, y2, t2), Med(x, z, e3)}, {before(t1, t2), starts(t1, e3), during(t2, e3)})M4 = (Patient(x), {Hist(x, y1, t1), Hist(x, y2, t2), Med(x, z1, e3)}, Med(x, z2, e4)}, {before(t1, t2), starts(t1, e3),overlaps(e3, e4)}) M1 is in ρE(M), but M2 is not in ρE(M1), because meets is not a valid relationship between t2 and e3. The pattern M3 is in ρE(M1), but M4 is not in ρE(M3), because thereis no explicit nor implicit relationship between the temporal variables e4 and t2, that is,the set of temporal predicates in M4 is not complete.
Definition 6 (Instantiation) Let E be a regular expression and M ∈ P(E), M =(K(x1, ., xn), D, T ), where D = < D1, ., Dm >. The instantiation operator executedover M produces the set ρI(M) of temporal pi-patterns M0 = (K(x1, ., xn), D0, T ),where D0 is obtained by replacing some variable yk (yk 6= xl for l ∈ [1, n]) occurring insome Di = p(y1, . . , yp, e) by a constant c in such a way that if the string of modes cor-responding to D is w1w2.wm and wi = p(u1, . . , up, #), then: (a) uk = ∗; (b) for everyl > k, if ul = ∗, then yl is not a constant; (c) for every j > i, if wj = p0(v1, . . , vq, #),then for every l ∈ [1, q], if vl = ∗ and Dj = p0(z1, . . , zq, e), then zl is not a constant.
Intuitively, condition (b) states that if some variable in a data atom is instantiated then no other variables on the left of this one in the atom can be instantiated in further stepsof the candidate generation process. Condition (c) states that if some variable in a dataatom p is instantiated then no other variables appearing in data atoms coming after p inthe pattern can be instantiated in the future. These conditions are necessary for assuringoptimality of the specialization operators, as stated in Lemma 2 below. The followingexample illustrates the operator ρI: Example 7 (Running Example) Let E be the regular expression E = Med(x,*,#)∗Symp(x,*,#)∗. Consider the following pi-patterns in P(E): M1 = (Patient(x), {Med(x, w, e1), Symp(x, y, e2)}, {before(e1, e2)})M2 = (Patient(x), {Med(x, w, e1), Symp(x, dizziness, e2)}, {before(e1, e2)})M3 = (Patient(x), {Med(x, penicillin, e1), Symp(x, y, e2),}, {before(e1, e2)})M4 = (Patient(x), {Med(x, penicillin, e1), Symp(x, dizziness, e2)}, {before(e1, e2)}) According to definition 6, M2 and M3 are in ρI(M1) (i.e., are valid instantiations of M1), M4 6∈ ρI(M1), M4 6∈ ρI(M2), M4 ∈ ρI(M3).
Two temporal pi-patterns M and M0 are said to be equivalent if M0 is obtained from M by renaming its variables.
Theorem 2 The specialization operator ρ defined as ρ(M) = ρI(M) ∪ ρE(M) satisfiesthe following properties: • completeness: every pi-pattern M ∈ P(E) can be obtained by successively apply- ing ρ to some pi-pattern with the refinement level 1.
• optimality: suppose that the regular expression E verifies the following condition: for all distinct modes m1 and m2 appearing in E, associated to the same predicatep, the positions containing the symbol * are the same in boths modes. Under thiscondition, the instantiation operator ρ is optimal, in the sense that a pi-patterncannot be obtained by specializing two non-equivalent pi-patterns.
Proof.
(1) Completeness: Let M = (K, D, T ) ∈ P(E) and ν(M) = (n, c). We will showby induction on n that for all c ≥ 0, M is obtained by successively applying ρ to somepi-pattern M1 whose refinement level is 1.
• Let n = 1. Let us show by induction on c that M is obtained by successively applying ρ to some pi-pattern M1 with refinement level 1. For c = 0: the re-finement level of M is 1 and M = ρ0(M) (we apply ρ "0 times" over M). Letc ≥ 1 and let us suppose that the result is true for c − 1. Let p(t1, ., tk, en)the last atom in D where, for some i ∈ {1, ., k}, ti is a constant ∈ C. Let m= max{i 1 ≤ i ≤ k such that ti ∈ C}. Let M0 be the pi-pattern obtained byreplacing ti by a new variable y in the atom p(t1, ., tk, en). That is, we replacethe data atom p(t1, ., ti, ., tk, en) by p(t1, ., y, .tk, en). Then, M is obtainedby applying ρI over M0. By the induction hypothesis, since ν(M0) = (n, c − 1)then M0 is obtained by successively applying the specialization operator ρ over api-pattern M1 with refinement level 1. Then, M = ρI(M0) is also obtained in sucha way.
• Let us suppose the result is true for n − 1 and let us show it is true for n ≥ 2. We will show that by induction on c ≥ 0. For c = 0: M =< p1(t1, t1, ., t1 , e , tm, ., tm , e is a variable, for all 1), ., pm(tm m) >, and tij i, j. It is easy to see that M is obtained by applying successively the operatorρE to the pi-pattern < p1(t1, t1, ., t1 , e 1) > with refinement level 1 (the proof is achieved by induction on m, using the fact that each prefix of M is in P(E). Forthe case c ≥ 1, we employ the same argument used in the case n = 1 and c ≥ 1.
(2) Optimality: For lack of space the proof is given in the Appendix.
Figure 3 illustrates the completeness and optimality of ρ. M3 cannot be obtained by specializing both M1 and M2, since they are non-equivalent pi-patterns. In fact ρexecuted over M1 does not produce M3 since according to Definition 5(a), each time api-pattern is instantiated, it cannot be extended any more. M5 cannot be obtained byspecializing both M3 and M4, since they are non-equivalent pi-patterns. In fact ρ executedover M4 does not produce M5 since according to Definition 6(c), each time a position ina pi-pattern is instantiated no positions at its left can be instantiated in future steps.
The Validation Phase. The support counting of the generated pi-patterns is a rathertechnical procedure and its details are omitted here. Roughly speaking, an equivalence M = Patient M = Patient M = (Patient(x), {Med(x,y,e ),Symp(x,z,e )},T ) Level 3 M = (Patient(x), {Med(x,penicillin,e ),Symp x,z,e )},T ) M = (Patient(x), {Med(x,y,e ),Symp(x,dizziness,e )},T ) M = (Patient(x), {Med(x,penicillin,e ),Symp x,dizziness,e Figure 3. Candidate Generation: executing the specialization operators level by level.
relation over the set of generated patterns is defined, depending on the sequence of modesappearing in the patterns. A SQL query Q(C) is associated to each equivalence class C.
Thus, to each equivalente class C, a database DC is associated. An important point toremark here is that the database DC has a unique relation, namely the answer of queryQ(C) executed over the input database D. Thus, in order to evaluate the support of thepatterns in class C, the relation DC is scanned once at each pass. A hash-tree structure isused for optimizing support counting in each equivalence class. The construction of thesehash-trees have been inspired on the ones used in [Srikant and Agrawal 1996].
The following result is a corollary of Theorem 2 which guarantees the soundness and completeness of MILPRIT∗: Corollary 1 (Soundness and Completeness) Let P be the set of temporal pi-patternsreturned by the algorithm MILPRIT∗. Then (1) all patterns in P are frequent and satisfythe constraints enforced by the regular expression E and (2) all frequent and valid patternsare in P.
Proof. (1) If P ∈ P then P is frequent since, after the validation phase, only frequentpatterns are returned. Besides, step 11 of MILPRIT∗ guarantees that only valid patterns(with respect to E) are returned. (2) Let P be a frequent and valid pattern. Then P isp-valid. By Theorem 2 P is obtained by applying the specialization operator ρ n timesover a pi-pattern M in the refinement level 1. It is easy to prove, by induction on n, thatP is obtained during the generation phase of MILPRIT∗, in the refinement level n + 1.
Since P is frequent, it will be returned in step 7 of MILPRIT∗. Since P is valid it will bereturned in step 11 of MILPRIT∗.
5. Experimental ResultsIn this section, we discuss the performance and scalability of algorithm MILPRIT∗through a set of experiments using both synthetic and real-life datasets. Our experi-ments were performed on an IBM SuperWorkstation7043A-8R with 2.4GHz and 4GBof main memory, running Linux. The DBMS PostgreSQL has been used to handle theSQL queries at the validation phase, as discussed before.
5.1. Synthetic DataWe have developed a synthetic data generator which generates regular expressions andtemporal databases where pi-patterns satisfying a regular expression may appear. The regular expressions generated have the format (B∗B∗.B∗), where each block B form (T1.Tm), and each Tj (called term) has the form (mj +.+ mj ), where mj are modes. Thus, our generator can produce several types of regular expressions, by varyingthe number of blocks (k), the maximum number of terms in each block (m) and themaximum number of modes appearing in the terms. For instance, one regular expressionwhich fits into this format is E = (Med(x,*,#)Symp(x,*,#))∗ (Med(x,*,#)Hist(x,*,#))∗. Thisexpression has two blocks and two terms in each block. For each term there is only onemode which can be chosen. In Table 1, we list the parameters used in the synthetic datagenerator, allowing a great variety of temporal databases and regular expressions to beproduced. For the tests presented in this paper, we use the following notation for the inputdatasets: Ix − P y − Mz − Us − Bt − T u − V v − Qw, where x, y, z, s, t, u, v and w arevalues for the input parameters I, P, M, U, B, T, V and Q, respectively. The remaininginput parameters (S, D and F ) are set as (S = U , D = U and F = U ). The parameter Q refers to the number of valid pi-patterns which surely appear in the generated dataset.
Table 1 - Parameters used for the synthetic data generation.
Parameters Number of tables with intervals Number of tables with points Number of attributes per table (not temporal) Number of tuples per table - in '000s Size of the domain of reference attributes - in '000s Size of the domain of non-reference attributes - in '000s Size of the domain of temporal attributes - in '000s Number of blocks of the regular expression Number of terms per blocks of the regular expression Number of modes per term of the regular expression Minimum amount of pi-patterns - in '000s Performance Analysis. Figure 4(a) shows the performance of MILPRIT∗ with respectto support variation. The tests have been executed over the dataset I2 − P 2 − M2 −U5 − B2 − T 2 − V 1 − Q1. As the minimum support is increased from 0.5% to 3%, thenumber of frequent patterns at each pass decreases, as expected. Thus, the execution timeof the algorithm decreases, since few candidates are potentially frequent for high values ofminimum support. Figures 4(b), (c) and (d) show how the algorithm performs with respectto variations in the regular expression. In these tests the minimum support has been set as2%. The datasets used in these tests were I2−P 2−M2−U5−Bt−T 2−V 1−Q1, wheret ∈ {1, 2, 3, 4, 5}, I2 − P 2 − M2 − U5 − B2 − T u − V 1 − Q1, where u ∈ {1, 2, 3, 4, 5}and I2 − P 2 − M2 − U5 − B2 − T 2 − V v − Q1, where v ∈ {1, 2, 3, 4, 5}. As thevalues of parameters B, T or V (respectively the number of blocks, the number of termsin each block and the number of choices for modes in each term) increase, the numberof long patterns potentially frequent increases as well. Thus, the execution time of thealgorithm increases too, since longer patterns have to been tested at the validation phase.
Notice that the number of attributes of the table which is scanned to evaluate the supportfor these long patterns increases as the size of the patterns increases.
Figure 5(a) shows the relationship between the number of generated and pruned pi-patterns during the execution of MILPRIT∗ over the dataset I2 − P 2 − M2 − U5 −B2 − T 2 − V 1 − Q1 with minimum support 0.5%. Note that the number of prunedpth's is quite expressive, and consequently, a smaller amount of pi-patterns are evaluatedat the validation phase. Figure 5(b) shows the execution time associated to each phase P 2-M 2-U 5-B 2-N 2-V 1-Q 1
I2-P 2-M 2-U 5-B x-N 2-V 1-Q 1
tio 30000
Minimum Support (%)
Number of Blocks
I2-P 2-M 2-U 5-B 2-N x-
I2-P2-M2-U2-B2-T V 1-Q 1
I2-P 2-M 2-U 5-B 2-N 2-V x-Q 1
Number of Terms
Number of Modes
Figure 4. Performance results
(generation-pruning-validation) at each refinement level. The tests also have been exe-cuted over the dataset I2 − P 2 − M2 − U5 − B2 − T 2 − V 1 − Q1 with minimumsupport 0.5%. The execution time at the validation phase, especially at higher levels, isquite high. This can be explained by the fact that calculating the support of long patternsat high refinement levels involves the execution of SQL queries with a high number ofjoins.
I2-P 2-M 2-U 5-B 2-N 2-V 1-Q 1
I2-P 2-M 2-U 5-B 2-N 2-V 1-Q 1
tte 25000
Figure 5. (a) Number of generated and pruned pi-patterns, (b) Amount of time spent in
each phase at different levels
Scalability For the scalability experiments, the parameters related to the database havebeen changed. In these experiments, we used the default parameters for the regular ex-pression and fixed the minimum support as 2%. We first show how MILPRIT∗ performsas the number of tuples per table (U) increases. The datasets considered here were I2 −P 2 − M2 − Us − B2 − T 2 − V 1 − Q1, where s ∈ {20000, 40000, 60000, 80000, 100000}.
Figure 6(a) shows the results of these tests. Clearly, the higher the number of tuples inthe database tables, the longer the execution time of the algorithm, since the validationphase (the support evaluation) is the most computationnally costly. It can be noticedthat MILPRIT∗ scales almost linearly with respect to the number of tuples in the dataset.
Figure 6(b) shows how MILPRIT∗ scales up as the number of attributes per table (M) in-creases. We show the results for the datasets I2 − P 2 − Mz − U5 − B2 − T 2 − V 1 − Q1,where z ∈ {2, 3, 4, 5}. As the number of attributes in the base tables increases, the num-ber of attributes of the table which is scanned in the validation phase3 increases as well.
Consequently the execution time increases. Figure 6(c) shows how the execution timeincreases as the number of generated patterns increases. This experiment consisted inrunning the algorithm over datasets I2 − P 2 − M2 − U5 − B2 − T 2 − V 1 − Qw, forw ∈ {1000, 1500, 2000, 2500, 3000, 3500, 4000}. In these three scalability tests, we cannotice that MILPRIT∗ scales up almost linearly with respect to the corresponding param-eters (number of tuples per table, number of attributes per table and number of interestingpatterns appearing in the dataset).
I2-P 2-M 2-U x-B 2-N 2-V 1-Q 1
I2-P 2-M x-U 5-B 2-N 2-V 1-Q 1
Tuples - in '000
Nº of Attributes per Table
I2-P 2-M 2-U 5-B 2-N 2-V 1-Q x
Frequent Patterns - in '000
Figure 6. Scalability results
The AMDI4 is part of a project in our laboratory, which aims at building a system support-ing an indexed atlas of digital mammograms. This system intends to be used by radiolo-gists as well as radiology students, in order to help them in breast cancer diagnosis. TheAMDI database 5 stores, for each patient, a series of digital mammograms, taken duringits lifetime. Besides image data, some important temporal data is stored, related to thepatient habits and life style: in which periods of time she smoked, or took contraceptivedrugs, for instance. The data used in our tests were supplied by Instituto Victorio Valeri deDiagn´osticos M´edicos, Ribeir˜ao Preto-SP, Brazil. For the time being, the AMDI databasestores information about 1365 patients. In the future, as the system will be available on-line, it will be possible for radiologists to insert new patient's records and images. TheAMDI relational schema is R = {Patient(PName), Contraceptive(PName, CName, Tit), 3This table is obtained as the result of executing a SQL query over the temporal dataset.
4The AMDI system is avaliable on www.lcc.ufu.br/amdi.
5The AMDI database is avaliable on www.lsi.ufu.br.
Antidepressant(PName, AName, Tit), Hormone(PName, HName, Tit), Tobacco (PName,Amount, Tit), Drink(PName, Type, Tit), FirstPregnancy(PName, Tpt), Menarche(PName,Tpt) Menopause(PName, Tpt), Birads(PName,Category, Tpt)}. The relation Birads storesinformation related to the BIRADS classification, standard published by ACR (AmericanCollege of Radiology)6. The attribute Category may assume five different values, eachone corresponding to a mammogram class. MILPRIT∗ has been executed over the AMDIdatabase in order to find temporal pi-patterns relating the evolution of a breast cancertumor and the patient's habits and life style. For the experiment carried out over thisdatabase, we considered the time granularity δ = semester.
Some of the patterns found were: P2 = (Patient(x), {(x,20,e1), Birads(x, 2, s2)}, {before(e1, s2)}) P3 = (Patient(x), {FirstPregnancy(x,YES,e1), Menopause(x,YES,f2), Birads(x, 2, s3)},{before(e1, f2), before(e1, s3), before(f2, s3)}) The pi-pattern P1 can be interpreted as follows: patients start taking the antide- pressant fluoxetine as soon as they enter the menopause. This pi-pattern has been verifiedby 25 patients (support = 1.8%). The pi-pattern P2 has been verified by 14 patients (sup-port = 1%) and can be interpreted as follows: patients which smoke 20 cigarettes perday stop smoking after their mammogram is classified with Birads 2. The pi-pattern P3can be interpreted as follows: patients who have already had children presents mammo-grams with Birads 2 after entering the menopause, this pi-pattern has been verified in 177patients (support = 12%).
We notice that, for the time being, MILPRIT∗ has not mined expressive patterns relating patient's lifestyle and cancer occurrence as we expected in the beginning of theprojet. Clearly, this is due to the fact that in the current stage of the AMDI database, withonly 1365 records, the Birads classification of the mammograms is almost unchanged.
Moreover, there are no cases of mammograms with Birads 5 in our database. We hopethat as the AMDI database gets more populated and diversified, MILPRIT∗ will be able todiscover more interesting patterns.
6. Conclusion and Further WorkIn this paper we introduced a first-order (or relational) temporal pattern mining frameworkwhere time can be viewed as intervals as well as points in a straight line. We also proposeda formalism to specify a broad class of constraints over such temporal patterns. We de-veloped and implemented the algorithm MILPRIT∗ designed to mining temporal patternswith constraints. For the time being, our temporal pattern formulation does not include apattern where several data atoms are related to the same temporal variable. For instance,we does not consider patterns like (Patient(x),{Med(x, penicillin, e), Med(x, xanax, f )},{equal(e, f )}).
Our future research focuses on three main targets: (1) the development of a sec- ond algorithm for mining pi-patterns based on the technique PrefixSpan [Han et al. 2004] 6Breast Imaging Reporting and Data System BIRADS. American College of Radiology, 2004.
which presents a much better performance when compared to Apriori-based techniquesin the sequence pattern mining context, (2) the refinement of the method in order to in-corporate more general temporal patterns and (3) the development of a tool for temporalrelational pattern mining.
Allen, J. F. and Ferguson, G. (1994). Actions and events in interval temporal logic. Tech- nical Report TR521.
Antunes, C. and Oliveira, A. L. (2005). Constraint relaxations for discovering unknown sequential patterns. In Knowledge Discovery in Inductive Databases, LNCS 3377,pages 11–32. Springer Berlin, Heidelberg.
Bettini, C., Wang, X. S., Jajodia, S., and Lin, J.-L. (1998). Discovering frequent event pat- terns with multiple granularities in time sequences. IEEE Transactions on KnowledgeData Engineering, 10(2):222–237.
B¨ohlen, M. H., Snodgrass, R. T., and Soo, M. D. (1996). Coalescing in temporal databases. In VLDB 1996, pages 180–191.
Boulicaut, J.-F., De Raedt, L., and (Eds.), H. M. (2006). Constraint-based mining and inductive databases. In European Workshop on Inductive Databases, volume 3848 ofLecture Notes in Artificial Intelligence.
de Amo, S. and Furtado, D. (2005). First-order temporal pattern mining with regular expression constraints. In 20th Brazilian Symposium on Databases, pages 280–294.
de Amo, S., Giacometti, A., Furtado, D. A., and Laurent, D. (2004). An apriori-based approach for first-order temporal pattern mining. In 19th Brazilian Symposium onDatabases, pages 48–62.
de Amo, S., Giacometti, A., and Pereira-Jr., W. (2007). Mining first-order temporal in- terval patterns with regular expression constraints. In 9th International Conferenceon Data Warehousing and Knowledge Discovery (DaWaK 2007), LNCS 4654, pages459–469.
Dehaspe, L. and Toivonnen, H. (1999). Discovery of frequent datalog patterns. Data Mining and Knowledge Discovery, 3:7–36.
Emerson, E. A. (1990). Temporal and modal logic. In van Leeuwen, J., editor, In Handbook of Theoretical Computer Science, Volume B: Formal Models and Seman-tics, pages 995–1072. Elsevier Science Publishers.
Garofalakis, M., Rastogi, R., and Shim, K. (1999). Spirit: Sequential pattern mining with regular expression constraints. In The VLDB Journal, pages 223–234.
Han, J., Pei, J., Mortazavi-Asl, B., Pinto, H., and Dayal, U. (2004). Mining sequential pat- terns by pattern-growth: The prefix-span approach. IEEE Transactions on Knowledgeand Data Engineering, 16(11):1424 – 1440.
Hoppner, F. (2001). Discovery of temporal patterns – learning rules about the qualitative behavior of time series. In Proceedings of the 5th European Conference on Principlesand Practice of Knowledge Discovery in Databases, pages 192–203.
Jacobs, N. and Blockeel, H. (2001). From shell logs to shell scripts. In Rouveirol, C. and Sebag, M., editors, Proceedings of the 11th Int. Conf. on Inductive Logic Programming,volume 2157 of LNAI, pages 80–90.
Lattner, A. and Herzog, O. (2005). Mining temporal patterns from relational data. In LWA 2005, pages 184–189.
Lee, S. D. and De Raedt, L. (2002). Constraint based mining of first order sequences in seqlog. In KDID'02, pages 76–92.
Lorincza, H. A. and Boulicaut, J.-F. (2003). Mining frequent sequential patterns under regular expressions: a highly adaptive strategy for pushing constraints. In ICDM'03,pages 316–320.
Masson, C. and Jacquenet, F. (2003). Mining frequent logical sequences with SPIRIT- LoG. In Proceedings of the 12th International Conference on Inductive Logic Pro-gramming, volume 2583 of Lecture Notes in Computer Science, pages 166–181.
Springer.
Pei, J., Han, J., and Wang, W. (2002). Mining sequential patterns with constraints in large databases. In Proceedings of the 2002 ACM CIKM Conference, pages 18–25.
Srikant, R. and Agrawal, R. (1996). Mining sequential patterns: Generalizations and performance improvements. In EDBT'96, volume 1057, pages 3–17, Avignon, France.
Springer-Verlag.
Zaki, M. J. (2001). SPADE: An efficient algorithm for mining frequent sequences. Ma- chine Learning, 42(1-2):31–60.
Appendix: Proof of Theorem 2 - Optimality.
Let M = (K, D, T ) ∈ P(E),M = (K, D, T ) and D = {p1(t1, ., t1 , e , ., tn , e 1), ., pn(tn n)} and let ν(M ) = (n, c).
1. Case 1: if c = 0 then M must be generated by extension, that is, M = ρE(M1), for some pi-pattern M1. Since the atoms in D are ordered according to the partialorder introduced in Section 2, then M0 = (K,D0,T 0) ∈ P(E) with • D0 = {p1(t1, ., t1 , e , ., tn−1 , e 1), ., pn−1(tn−1 • T 0 = {r(ei, ej) ∈ T i, j < n} 2. Case 2: if c > 0: Let us suppose that M = ρ(M1) = ρ(M2). Notice that, since c > 0, it is not possible that M is obtained by extension from M1 or M2. Then,M = ρI(M1) = ρI(M2). Let M1 = (K,D1, T 1) and M2 = (K,D2, T 2).
• D1 = {p1(x1, ., x1 , e , ., xn , e 1), ., pn(xn • D2 = {p1(y1, ., y1 , e , ., yn , e 1), ., pn(yn The strings of modes associated to M1 and M2 are respectively7: • W 1 = p1(u1, ., u1 ), ., p • W 2 = p1(v1, ., v1 ), ., p (I) Since M = ρI(M1) then there exists i1, j1 (uniquely determined) such that ui1j1= ∗, xi1 is a variable and ti1 is a constant. Moreover, for all i, j, if i 6= i j 6= j1 then xi = ti .
(II) Since M = ρI(M2) then there exists i2, j2 (uniquely determined) such thatui2 = ∗, xi2 is a variable and ti2 is a constant. Moreover, for all i, j, if i 6= i j 6= j2 then xi = ti .
We must consider three possibilities, namely: (a) i1 = i2 and j1 < j2: in that case, both ti1 and ti2 are constants (according to points (I) and (II)). Moreover, since M = ρI(M1) and j1 < j2 thenM is obtained by instantiating a variable in position j1 in the i1-th atom.
But in this i1-th atom, a variable in position j2 > j1 is already instantiated.
Contradiction (see Definition 6).
(b) i1 < i2: in that case vi2 = ui1 = ∗ and xi1 , yi2 are constants. That is impossible, since by Definition 6, a variable xi1 cannot be instantiated if it appears in the i1-th atom and a variable xi2 in the i 2-th atom (i2 > i1) has already been instantiated before.
(c) i1 = i2 and j1 = j2: in that case we have ui1 = vi2 = ∗ and xi1 , yi2 are variables. So, M1 and M2 are equivalent (M1 can be obtained from M2 byrenaming its variable xi1 as yi2).
7For the sake of simplifying the notation, we omitted the symbol # at the end of the modes.

Source: http://www.deamo.prof.ufu.br/arquivos/IJDWM2008.pdf

Sexual offender treatment: karen harrison

Sexual Offender Treatment ISSN 1862-2941 Legal and Ethical issues when usingAntiandrogenic Pharmacotherapy with SexOffenders Karen HarrisonUniversity of the West of England - Bristol [Sexual Offender Treatment, Volume 3 (2008), Issue 2] The treatment of sex offenders and more specifically the treatment of high-risk sex offenders is asubject of great importance for practitioners, professionals, policymakers and the public at large.Whilst treatment is thought to largely centre upon cognitive-behavioural methods and otherpsychotherapy techniques, in more recent years the use of pharmacotherapy has also begun togain ground. Current debate often centres upon how effective such treatment is; with bothsupporters and opponents of its use existing. This article, however, does not specifically look atwhether pharmacotherapy as a method of treatment with sex offenders actually works, but ratherlooks at the legal and ethical issues surrounding its use. In particular it considers issues such aswhether the treatment should be voluntary or mandatory; whether it should indeed even beclassified as treatment or should instead be seen as punishment and finally whether it should beused with convicted offenders or made freely available to all.

Cover.doc

ELECTROSPINNING OF POLY(VINYL ALCOHOL)/CHITOSAN FIBERS FOR WOUND DRESSING APPLICATIONS MS.SUNISA PANBOON A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE (MATERIALS SCIENCE) DEPARTMENT OF INDUSTRIAL CHEMISTRY GRADUATE COLLEGE KING MONGKUT'S INSTITUTE OF TECHNOLOGY NORTH BANGKOK ACADEMIC YEAR 2005 ISBN 974-19-0476-2