2.1 Mathematical Models in Artificial Intelligence

Mathematical models are ways of expressing the understanding of the world in terms of mathematical relationships. A mathematical model is a set of rules applied to estimate unknown information using what is known. It is a mathematical version of a physical system, which is both sufficiently realistic to provide a reliable picture of that system and simple enough to work with. The model provides a context in which the system may be studied with clarity, and enables a sufficiently precise description and prediction, simply by eliminating the nonessential [30]. Mathematical models can be implemented by computer programs. In particular, in the field of Artificial Intelligence (AI), mathematical models are used to simulate or produce intelligent behaviour. AI was defined by Minsky as: '... the science of making machines do things that would require intelligence if done by men' [62] (p. 23).

In many real-life situations, experts need to perform their tasks in the context of uncertain information. Uncertainty can evolve from various sources. Firstly, the task might be performed with incomplete information. Secondly, there might be uncertainty inherent in the domain. In addition, the terms involved can be vague and not always deterministic. Finally, there might be uncertainty in the observations made [82] (chapter 14), [50] (chapter 1). Whatever the reason, since the majority of real-world phenomena in which one would be interested are uncertain (or random), it is natural to introduce statistical models, in which Probability Theory serves as the main mathematical tool.
2.1.1 Probability Theory

When modelling simple systems it is possible to resort to so-called IF-THEN rules, which state that if the value of variable A is ‘a’ then the value of variable B is ‘b’. Such systems are “categorical”, i.e. reasoning in them is performed using classic logical inference. However, reasoning with uncertainty introduces additional complexity, to the extent that categorical “IF-THEN” definitions are insufficient to describe the system, either because not all variables are known, or because incorporating all variables in the decision rule makes it too complicated and, hence, not useful. These factors require the omission and/or approximation of information within the model, thereby introducing yet another degree of uncertainty.
Thus, in systems that involve uncertainty, the categorical approach is clearly impractical. Even if A is ‘a’, there is uncertainty whether B is ‘b’, and guessing it can be futile. Nevertheless, the likelihood of the event “the value of variable B is ‘b’ ” can be assessed. For this reason, experts in sciences such as physics, genetics and economics have chosen to base their models of uncertainty on probability. By contrast, the use of probability in AI was not widely accepted and was considered inadequate for practical reasons. This was mainly due to the fact that it appeared to require the enumeration of an exponential number of possibilities (the full joint probability distribution) as a knowledge base, and thus required a vast amount of computation for reasoning. As a result, a variety of alternative methods were developed and classified as Non-Numerical techniques [72], such as Non-Monotonic Logic [61] and Default Logic [77], and Numerical techniques based on various calculi, other than probability calculus, such as Belief Functions (Dempster-Shafer) [86], Fuzzy Logic [112] and Certainty Factors [36].
Pearl argues that in order to model intelligent behaviour one needs to model common-sense reasoning, so as to reach decisions based on incomplete or uncertain information [72]. Probability Theory provides methods for elucidating the way belief should change as information is obtained. Likewise, the reasoning process that characterises human intelligence can be viewed primarily as the assessment of the assumptions under which one should hold a belief, and those under which the belief should change. Taken together, Probability Theory seems the natural solution for dealing with uncertain reasoning. Hence, instead of looking for alternative approaches to modelling uncertainty, researchers are searching for computational solutions to the practical problems involved in the use of Probability Theory.

The Mathematical Basis of Probability Theory

Let Ω be a set. Each subset A ⊆ Ω is called an event, each element of Ω is called an elementary event and Ω itself is called the sample space. It is helpful to think of Ω as the set of all possible outcomes of an experiment. For example, consider the experiment of tossing a single coin, in which case Ω = {“head”, “tail”}. A repeated experiment can be represented as a cross product of the sample spaces of a single experiment, e.g. if two coins are tossed then the sample space is Ω = {(“head”, “head”), (“head”, “tail”), (“tail”, “head”), (“tail”, “tail”)}. A different experiment can be observing the type and colour of a car. In this case the sample space is the cross product of two sets: the first is the set of all possible car types and the second is the set of all possible car colours.
In this thesis I will discuss only finite and discrete spaces. Thus, Ω is always assumed to be a finite space. Events are represented by capital letters such as A, B, C; elementary events are represented by lower-case letters (i.e. a, b, c).
A probability function P is a function defined on the set of events. It assigns to every event a numerical value in [0,1]. This function must satisfy the axioms of probability [81]:
A1. For every event A, 0 ≤ P(A) ≤ 1.
A2. P(Ω) = 1 (that is, the probability of the entire space is one).
A3. For every integer n, if the n events A_{1}, …, A_{n} are mutually exclusive (i.e. A_{i} ∩ A_{j} = ∅ for every i ≠ j) then P(A_{1} ∪ … ∪ A_{n}) = P(A_{1}) + … + P(A_{n}).
From axioms A1 and A2 it follows that for every event A, P(A) ≤ P(Ω) = 1. We denote the complement of A in Ω by ¬A, that is, ¬A = Ω\A. Since A ∩ ¬A = ∅ and A ∪ ¬A = Ω, it follows from axioms A2 and A3 that

P(¬A) = 1 − P(A).    (1)
According to our assumptions, the probability function assigns a value to each elementary event in the sample space. Therefore, the probability of an event A = {a_{1}, …, a_{k}} in Ω must satisfy P(A) = P(a_{1}) + … + P(a_{k}).
If A and B are events in Ω then the joint probability of A and B, denoted by P(A,B), is the probability that A and B will both occur, i.e. P(A,B) = P(A ∩ B).
If B_{1}, …, B_{n} are events in Ω such that B_{1} ∪ … ∪ B_{n} = Ω and B_{i} ∩ B_{j} = ∅ for every i ≠ j, then given the joint probabilities P(A,B_{i}) the probability of A can be calculated through the law of total probability:

P(A) = P(A,B_{1}) + … + P(A,B_{n}).    (2)
Conditional probability is defined as the probability of an event A given some additional information or evidence B. If A and B are events, and if P(B) > 0, then P(A|B) is the probability that A occurs given that B occurs, that is

P(A|B) = P(A,B) / P(B).    (3)
Bayes’ Rule

Using the notion of conditional probability (equation (3) above), note that P(A,B) = P(B)P(A|B), which, combined with symmetry, yields P(A,B) = P(B,A) = P(A)P(B|A). Hence,

P(A|B) = P(B|A)P(A) / P(B).    (4)
This is known as Bayes’ Rule.
Events A and B are said to be independent if P(A,B) = P(A)P(B). Thus, A and B are independent if and only if P(A) = P(A|B) or, equivalently, P(B) = P(B|A).
Note that if the probability of A is known then, according to (1), so is P(¬A). Therefore, to compute P(B) it suffices to know P(B|A) and P(B|¬A). Indeed, according to (2):

P(B) = P(B|A)P(A) + P(B|¬A)P(¬A).    (5)
In particular, when combining (5) with Bayes’ rule (4), it follows that:

P(A|B) = P(B|A)P(A) / (P(B|A)P(A) + P(B|¬A)P(¬A)).    (6)
Note that (6) enables one to calculate the conditional probability of A given B by using the probability of B given A. This can be interpreted as the ability to use known information for calculating unknown information. For example, assume that one attempts to design a system for medical diagnoses, which includes diseases (hypotheses) and symptoms (evidence). A very simple model could be the following: let Ω be the set of all people, Dis be the set of people who have the flu, and Symp be the set of people who have a cough. The probabilities that 1) someone has the flu, 2) someone is coughing, given that this person is sick with the flu, and 3) someone is coughing, given that this person is not sick with the flu, can be assumed to be known. Indeed, this information could be found in the literature or be obtained from experts. However, when a particular patient visits the clinic, the symptoms are observed and the presence or absence of disease has to be inferred. Thus, according to (6):

P(Dis|Symp) = P(Symp|Dis)P(Dis) / (P(Symp|Dis)P(Dis) + P(Symp|¬Dis)P(¬Dis)).
The right-hand side of the equation can then be computed.
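As a sketch, this computation can be carried out numerically. The probabilities below are illustrative assumptions chosen for the example, not real epidemiological data:

```python
# Illustrative inputs for the flu example (assumed values):
#   p_dis               -- P(Dis):        prior probability of having the flu
#   p_symp_given_dis    -- P(Symp|Dis):   probability of a cough given the flu
#   p_symp_given_no_dis -- P(Symp|~Dis):  probability of a cough without the flu
p_dis = 0.05
p_symp_given_dis = 0.90
p_symp_given_no_dis = 0.20

# Law of total probability (5): P(Symp)
p_symp = p_symp_given_dis * p_dis + p_symp_given_no_dis * (1 - p_dis)

# Bayes' rule (6): P(Dis|Symp)
p_dis_given_symp = p_symp_given_dis * p_dis / p_symp
print(round(p_dis_given_symp, 3))
```

With these assumed numbers the posterior probability of flu given a cough is roughly 0.19, illustrating how a modest prior is updated by the observed symptom.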
In the context of artificial intelligence, equation (6) enables a system to base judgments and decisions on diagnostic reasoning (given some evidence, infer a hypothesis), predictive reasoning (given a hypothesis, what evidence should be expected) or any mixed reasoning. In general, when designing an expert system based on probability theory, Bayes’ Rule ensures that the experts need to provide the probabilities in one direction only, and the system can calculate the probabilities for any direction.
Viewing probability

There are two main approaches to viewing probability, both of which utilise Probability Theory. In the “Frequentist Approach”, the probability P of an uncertain event A is defined as the frequency of that event based on previous observations. This approach is suitable as long as there is accurate information about past instances of the event. The purpose of inference in this approach is to find the change in the probability of events once evidence is observed.
The “Bayesian Approach” is suitable for situations in which information about past instances does not exist. This approach suggests assigning a subjective degree of belief, based on prior knowledge, as the probability measure. In this view, Probability Theory can be defined as the study of how knowledge affects belief. Using the Bayesian (‘belief’) view of probability does not mean that statistics is completely ignored: the ‘belief’ is based on information from previous observations, when available. In this approach probabilities have a subjective interpretation and measure a ‘degree of belief’. Therefore, the purpose of inference is to find the change of belief once additional evidence is observed.
There is more than one way to define a probability function. The Frequentist assigns the probability of an elementary event as the proportion of occurrences of that event relative to the number of experiments. For example, consider a coin-tossing experiment. The sample space includes the two elementary events “head” and “tail”. If the experiment is conducted X times and the result was “head” in X/2 of them, then the probability of the elementary event “head” is P(“head”) = ½. By contrast, the Bayesian analyst will define a probability function based on his/her degree of belief in the occurrence of elementary events. In this case, the probability of ½ may be assigned if the coin is assumed to be fair.
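The frequentist definition can be illustrated with a small simulation. The sketch below estimates P(“head”) as an observed proportion; the fair coin and the sample size are assumptions made purely for illustration:

```python
import random

random.seed(0)  # fixed seed so the simulated experiment is reproducible

# Frequentist estimate: P("head") is taken to be the proportion of heads
# observed over many repetitions of the (simulated, assumed-fair) coin toss.
n = 100_000
heads = sum(random.random() < 0.5 for _ in range(n))
estimate = heads / n
print(estimate)  # close to 0.5, the value a Bayesian would assign a priori
```

As the number of repetitions grows, the observed proportion converges on the value the Bayesian analyst assigns from the fairness assumption alone.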
Random Variables

Throughout this thesis I will restrict the discussion to systems that can be described by a finite set of variables, defined on a finite sample space and taking values in finite domains.
Let Ω be a set and let P be a probability function on Ω. A random variable X is a function from Ω to a set D_{X}. The probability distribution of X is the function F_{X}, given by F_{X}(x) = P({ω ∈ Ω : X(ω) = x}) for every x ∈ D_{X}.
From here on, we denote the event {ω ∈ Ω : X(ω) = x} by (X = x).
The probability of an event A is usually denoted by P(A), which denotes a number. The probability distribution of a random variable X is denoted by P(X), which denotes a function. To avoid confusion I have chosen to denote a probability distribution by F and the probability distribution of a random variable X by F_{X}, the notation used in the literature on probability theory [81].
Given two random variables X_{1} and X_{2}, which map Ω into D_{X_{1}} and D_{X_{2}} respectively, the joint probability distribution (JPD) of X_{1} and X_{2} is the function F_{X_{1},X_{2}}, given by F_{X_{1},X_{2}}(x_{1},x_{2}) = P((X_{1} = x_{1}) ∩ (X_{2} = x_{2})).
The “level sets” of a random variable, called the states or values of the variable, form a complete and mutually exclusive partition of Ω; that is, their union is Ω and each two “level sets” are disjoint. Hence, the law of total probability can be formulated using the joint distribution of two random variables. Indeed, for any X_{1} and X_{2}, and every x_{1} ∈ D_{X_{1}}:

F_{X_{1}}(x_{1}) = Σ_{x_{2} ∈ D_{X_{2}}} F_{X_{1},X_{2}}(x_{1},x_{2}).
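This summation can be sketched on a small joint distribution table. The joint probabilities below are illustrative values over two binary variables:

```python
# Illustrative joint distribution F(X1, X2) over two binary variables,
# stored as a dictionary keyed by (x1, x2) value pairs; entries sum to 1.
joint = {
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.20, (1, 1): 0.40,
}

# Law of total probability: F_X1(x1) = sum over x2 of F(x1, x2)
f_x1 = {x1: sum(p for (a, _), p in joint.items() if a == x1) for x1 in (0, 1)}
print(f_x1)  # the marginal distribution of X1
```

Marginalising out X_{2} in this way recovers F_{X_{1}} (here 0.4 and 0.6) without any further input: the joint table already contains the marginals.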
Moreover, one can define the conditional probability distribution of X_{1} given X_{2} as the function:

F_{X_{1}|X_{2}}(x_{1}|x_{2}) = F_{X_{1},X_{2}}(x_{1},x_{2}) / F_{X_{2}}(x_{2}).
Hence, applying Bayes’ rule, it follows that F_{X_{1}|X_{2}}(x_{1}|x_{2}) = F_{X_{2}|X_{1}}(x_{2}|x_{1})F_{X_{1}}(x_{1}) / F_{X_{2}}(x_{2}) for every x_{1}, x_{2}. In a similar way, we define that X_{1} is independent of X_{3} given X_{2} if F_{X_{1}|X_{2},X_{3}}(x_{1}|x_{2},x_{3}) = F_{X_{1}|X_{2}}(x_{1}|x_{2}) for every x_{1}, x_{2}, x_{3}.
If X_{1} is indeed independent of X_{3} given X_{2}, this is denoted by IndF(X_{1}, X_{3} | X_{2}).
From here on I will use the following notations: as before, the capital letters A, B, C represent events. The capital letters X, Y, Z represent random variables. Bold capital letters X, Y, Z, represent sets of random variables.
The definition of conditional independence of X_{1} and X_{3} given X_{2} (see above) can be extended to the case where X, Y, Z are sets of random variables. In this case IndF(X, Y | Z) denotes the fact that F(X | Y, Z) = F(X | Z) for every assignment of values to X, Y and Z. To simplify notation, this can be denoted as P(X | Y, Z) = P(X | Z).
The Chain Rule: decomposition of the joint distribution

The definition of conditional distribution leads to a decomposition of the joint distribution, namely

F_{X_{1},X_{2}}(x_{1},x_{2}) = F_{X_{1}}(x_{1}) F_{X_{2}|X_{1}}(x_{2}|x_{1}),

which is called the Chain Rule. This rule gives an alternative way of representing the JPD F_{X_{1},X_{2}}. Instead of specifying all the values of F_{X_{1},X_{2}}, a specification can take the form of two separate tables, which contain the values of F_{X_{1}} and of F_{X_{2}|X_{1}}.
The Chain Rule can be extended to the general case of n variables:

F(x_{1}, …, x_{n}) = F(x_{1}) F(x_{2}|x_{1}) F(x_{3}|x_{1},x_{2}) ⋯ F(x_{n}|x_{1}, …, x_{n−1}).
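The decomposition can be illustrated numerically. The sketch below takes an arbitrary, illustrative joint distribution over three binary variables, derives the marginal and conditional tables from it, and checks that their chain-rule product recovers every joint value:

```python
import itertools

# An arbitrary illustrative joint distribution over three binary variables;
# the eight probabilities sum to 1.
vals = [0.02, 0.08, 0.10, 0.20, 0.05, 0.15, 0.25, 0.15]
joint = dict(zip(itertools.product((0, 1), repeat=3), vals))

def marg(*idx):
    """Marginal distribution over the variables at positions idx."""
    out = {}
    for bits, p in joint.items():
        key = tuple(bits[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

m1, m12 = marg(0), marg(0, 1)

# Chain rule: F(x1,x2,x3) = F(x1) * F(x2|x1) * F(x3|x1,x2)
for (x1, x2, x3), p in joint.items():
    chain = m1[(x1,)] * (m12[(x1, x2)] / m1[(x1,)]) * (p / m12[(x1, x2)])
    assert abs(chain - p) < 1e-12
```

The check passes for every entry, confirming that the conditional tables carry exactly the same information as the full joint table.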
Joint Probability Distribution and Factorisation. When dealing with a statistical model, one has to obtain the knowledge base for the system to work on (the JPD of the model variables), and to calculate probabilities of events conditioned on new information [65](p.175).
The information captured by the JPD consists of the probabilities of every possible combination of the values of all the variables. Once the JPD is defined, using it together with the axioms, definitions and results mentioned above enables one to answer any probabilistic question regarding the variables and their relations in light of new knowledge.
Unfortunately, the use of the JPD raises practical, size-related problems: the amount of information is clearly exponential in the number of variables. Even for a system that involves 30 binary-valued variables, one needs to specify and store 2^{30} (1,073,741,824) values. It is not feasible to elicit all these numbers; moreover, storing the information is impractical. Finally, the computational cost can be enormous [65] (p. 175). These problems make the JPD too large to enable a useful decision (section 2.1.1).
Fortunately, when modelling almost any system, there are independency assumptions (mainly conditional independencies) that can be made in order to simplify the model. These assumptions can be made either because they are true or because they are close enough to being true. The next consideration is how these assumptions can be exploited in order to save elicitation effort, storage space and computational cost.
The Chain Rule formula provides a modular means of calculating the full JPD by using Local Probability Models (LPMs). However, using only the Chain Rule does not simplify the computation, and the number of independent parameters to be elicited and stored stays the same. Nevertheless, if the independency assumptions are used, a simpler LPM can be obtained, saving elicitation, storage space and computational cost.
For example, if the system includes n binary variables and the assumption that X_{2}, …, X_{n} are conditionally independent given X_{1}, then

F(x_{1}, x_{2}, …, x_{n}) = F(x_{1}) F(x_{2}|x_{1}) ⋯ F(x_{n}|x_{1})

and instead of specifying 2^{n} − 1 parameters one needs to specify only 2n − 1 parameters (for n = 30: 1,073,741,823 vs. 59).

Exploiting independencies

Exploiting independencies is obviously very useful. It may sometimes make the difference between a domain that can or cannot be modelled, or between a solution to a problem being tractable or not. Therefore it is important to explore ways to specify as many existing independencies as possible.
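The parameter counts above can be verified with simple arithmetic:

```python
# Independent parameters for n binary variables (illustrative arithmetic).
def full_jpd_params(n: int) -> int:
    # Full JPD: one probability per joint assignment, minus the sum-to-one
    # constraint.
    return 2**n - 1

def factored_params(n: int) -> int:
    # Under the assumption that X2..Xn are conditionally independent
    # given X1: 1 parameter for F(X1) plus 2 for each F(Xi|X1), i = 2..n.
    return 1 + 2 * (n - 1)   # = 2n - 1

print(full_jpd_params(30), factored_params(30))  # 1073741823 vs 59
```

The same independence assumption thus reduces the specification from over a billion numbers to fifty-nine.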
A straightforward approach to exploiting all independency assumptions is to keep a list of all independencies in the form IndF(X, Y | Z) = true, where X, Y and Z are disjoint subsets of variables. Since the size of this list is exponential in the number of variables, this solution requires a storage space that is unrealistic even for a small number of variables [8].
A different approach suggests keeping only one initial set of assumptions. In order to maintain consistency with the axioms of probability, the relation IndF must be constrained. This means that once some independency assumptions are made, other independency assumptions must follow. These constraints are called the semi-graphoid axioms [72] (p. 84) and they are:

Symmetry: IndF(X, Y | Z) <=> IndF(Y, X | Z)
Decomposition: IndF(X, Y∪W | Z) => IndF(X, Y | Z) & IndF(X, W | Z)
Weak union: IndF(X, Y∪W | Z) => IndF(X, Y | Z∪W)
Contraction: IndF(X, Y | Z) & IndF(X, W | Z∪Y) => IndF(X, Y∪W | Z)

If one has an initial set of assumptions, these properties can be used to derive the new independency assumptions that must hold. Alternatively, given a new assumption, one can test whether it can be derived from the initial set. Unfortunately, the computational complexity in this case can be too high to be practical [72] (p. 89). Instead, a solution for defining a set of initial independency assumptions, together with a simple way to infer or check new assumptions, can be found in the form of a graph structure in Graph Theory.

2.1.2 Graph Theory

Graphical models represent a JPD by integrating Probability Theory and Graph Theory. Explicitly, they represent dependencies, independencies and conditional independency assumptions within a given set of random variables in a graphical fashion.
Definition: A graph is a pair (V,E), where V is a finite set and E is a binary relation on V [103]. Informally, a graph is comprised of a finite set of nodes (vertices) and a set of arcs (edges), each of which connects a pair of nodes. Two nodes that are connected by an arc are said to be adjacent.
Definitions (based on Graph Theory)

If two nodes connected by an arc are ordered (X, Y), then the arc has a direction assigned to it, i.e. the arc leads specifically from node X to node Y and not in the other direction. In this case the arc is directed. If all arcs are directed, the graph is called a directed graph. For a directed graph, a trail is a sequence of consecutive nodes where the direction of the arcs can interchange. If the arcs on a trail all point in the same direction, the trail is defined as a path. A cycle is a path that starts and ends at the same node. A directed acyclic graph (DAG) is a directed graph that has no cycles.

In a DAG, two adjacent nodes are said to have a parent/child relationship, and the arc points from the parent to the child. An extension of the parent/child relationship is the ancestor/descendant relationship. If X_{1} is the parent of X_{2} and X_{2} is the parent of X_{3} and so on up to X_{n}, then X_{1}, X_{2}, …, X_{n−1} are all ancestors of X_{n}, and X_{n}, X_{n−1}, …, X_{2} are descendants of X_{1} [9, 65] (chapter 10). The set of all parents of X is denoted by Parents(X), the set of all children of X is denoted Children(X) and the set of all nodes that are not descendants of X is denoted by NonDescendants(X). The set of all ancestors of X in a graph G is denoted Ancestors(X). The subgraph of G containing only the ancestors of a node X and the arcs between them is denoted G_{Anc}(X). This definition can be extended to a set of nodes X = {X_{1}, …, X_{n}} in the following way: G_{Anc}(X) = G_{Anc}(X_{1}) ∪ … ∪ G_{Anc}(X_{n}). Note that the union of graphs is the graph containing all the nodes and arcs of the individual graphs.

A triplet of nodes X, Y, Z is called a head-to-head structure (or a V-structure) if there is an arc from X to Z and from Y to Z and there is no arc between X and Y (Figure 2.2(a) is a head-to-head structure while Figure 2.2(b) is not).
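The head-to-head condition can be checked programmatically. The following sketch assumes a DAG stored as a mapping from each node to its set of parents; both the representation and the three-node example graph are illustrative choices, not ones prescribed by the text:

```python
# Illustrative DAG X -> Z <- Y, stored as a mapping node -> set of parents.
parents = {"X": set(), "Y": set(), "Z": {"X", "Y"}}

def adjacent(parents, a, b):
    """Two nodes are adjacent if an arc connects them in either direction."""
    return a in parents[b] or b in parents[a]

def v_structures(parents):
    """Return triples (a, c, b) with arcs a -> c <- b and a, b non-adjacent."""
    found = []
    for c in sorted(parents):
        ps = sorted(parents[c])
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                if not adjacent(parents, ps[i], ps[j]):
                    found.append((ps[i], c, ps[j]))
    return found

print(v_structures(parents))  # [('X', 'Z', 'Y')]
```

On this example the only V-structure found is at Z, matching Figure 2.2(a); adding an arc between X and Y would remove it, matching Figure 2.2(b).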
In this case, Z is said to have a head-to-head structure.

Figure 2.2: Two directed graphs. In both structures there are two incoming arcs to Z; however, only (a) is a head-to-head structure.

2.1.3 Graphical Models

If a system is described by a set of variables U and their interactions, then in a graphical representation of this system the nodes correspond to the variables in U and an arc between two nodes represents a possible direct interaction between the variables represented by these nodes. If the interaction between the variables in U is given by a JPD F on U, then a graph G represents F if the topology of G reflects only independency assumptions that hold in F. The graph is a qualitative representation of a JPD.
Herein, when a graph represents a set of variables, I make no distinction between the variables and the nodes, and the same notations are used (capital letter X, Y, Z for nodes and bold capital letters X, Y, Z for sets of nodes).
Advantages of a graphical representation of a JPD

As mentioned above, independencies are important properties of a distribution, since they can be exploited for reducing the computational cost of inference and are important in understanding the behaviour of a system. Therefore, an explicit representation of independencies by a graph structure is useful from both perspectives. From a computational perspective, it provides a data structure that can be used by efficient algorithms. From a human perspective, it provides a clear interface to describe complicated systems. Hence, the graph should represent independencies that are in the JPD so that they can be exploited, but should not represent independencies that are not there. However, if all independency and conditional independency assumptions were to be considered when constructing the graph, one would face the problem of handling all relations IndF(X, Y | Z) where X, Y and Z are disjoint subsets of the variables, as mentioned above. To avoid this, two definitions are needed:

1. A separation criterion by which all independencies are read from the graph topology, i.e. given sets of nodes X, Y, Z in graph G, whether X and Y are separated by Z. For graph G to represent a JPD F, whenever X is separated from Y by Z in G, X must be independent of Y given Z in the JPD F.
2. A set of initial assumptions that the graph structure must satisfy in order to represent a given JPD.
The graph can be constructed according to definition 2 and all (conditional) independency assumptions can be read using definition 1.
The separation criterion used to read independencies from the topology of the graph and the set of initial assumptions depend on the kind of graph being used.
Types of graphical models

There are two main kinds of graphical models: those represented by undirected graphs, called Markov Networks, and those represented by directed graphs, called Bayesian Networks (BNs) [72]. Both types of models consist of two parts: a qualitative part, which is a graph, and a quantitative part. The qualitative part (the graph) depicts the direct relationships between the variables. The quantitative part (the parameters) encodes the strength of the relationships. If the graph contains both undirected and directed arcs it is called a Chain Graph [27].
Since the main focus of this thesis is BNs, herein I shall focus on this type of graphical model.
