## 2.3 Knowledge Engineering for BN Construction

In this section I summarise the literature on KE, the process that the approach proposed in this thesis is designed to support. I describe two aspects of the KE process, namely knowledge elicitation from experts and learning the structure from data.
The construction of a BN falls under the more general discipline of KE for mathematical modelling. KE of a BN is the actual process of designing and structuring the BN model.
Both parts of a BN model, i.e. the qualitative (structure) and quantitative (parameters), can be learned through elicitation of knowledge from experts, by an automatic algorithm, or through a combination of the two. In learnt networks, the final network is selected using statistical measures in a process called **model selection**. In elicited networks the final network is selected using a process called **model assessment**, where the experts need to decide whether the model is satisfactory.
KE through elicitation of knowledge from experts is a process of construction of a model that captures the essentials of the experts’ perceptions of the world. It is not a process of extraction of a pre-existing structure [44, 58]. Helsper and van der Gaag [44] propose to capture the knowledge of a domain into an ontology, in which the structural assumptions of the domain are explicitly represented in some way.
Whether the structural assumptions are represented explicitly by an ontology or not, the process of designing a BN through elicitation of knowledge from experts involves the following [18, 26, 44, 68]:
- Specifying the qualitative part, i.e. a set of graphical modelling decisions based on structural assumptions, consisting of:
  - Identification of the variables that are of importance and their possible values.
  - Identification of the relationships between the variables discerned and their expression in a graphical structure.
- Specifying the quantitative part, i.e. a set of parameters for each node.
- Assessment of the network.
In principle, the tasks of identifying the structure, identifying the probabilities and assessing the network can be performed successively. However, in its initial steps, the purpose of the process is to learn and to plan. To this end, the process of KE iterates over these tasks, with a very simple prototype of the model developed in early cycles. The iterations within the KE process give the DEs time to gain knowledge about BN technology and what it enables, and give the BN expert time to gain understanding of the domain. In addition, this iteration is required because there is a great deal of interplay between the different tasks: every decision regarding one task can affect the others. Only when the experts are satisfied with the model as an operational version do future cycles produce new versions [26, 58].
### 2.3.1 Graphical elicitation
**Structural assumptions and graphical modelling** The first step in building any mathematical model of a system is to model the domain, that is:

- To identify the concepts in the system to be modelled. These concepts are the elements in the system that the expert wants to measure, control, manipulate or reason about.
- To identify the relationships between the different parts. These are called the structural assumptions of the domain because they are used to form the structure of the model, and can be explicitly represented by an ontology [44].
The next step is to translate these structural assumptions into the specific syntax of the mathematical model. In BN syntax, one has to translate these assumptions into graphical modelling decisions according to the structural assumptions of the domain. These include: choosing the relevant concepts, translating them into variables, and deciding on their possible values and the relationships between them. Occasionally, structural assumptions and graphical modelling decisions can be confused with each other (section 5.2.3).
**The variables** The purpose of a model is to reason about events that have not been observed. These events are organized into a set of *query variables* (or *hypothesis variables*). The expert can provide the model with information that is organized into a set of *observable variables*. The inclusion of *mediating variables* can be convenient, reducing the number of probabilities, avoiding over-confidence in diagnosis [50, 69, 111] or imposing some relationship between two variables (e.g. mutual exclusiveness). It is important to note that this categorical separation between the different types of variables is logical and has no effect on the inference process: once the network is constructed, any node can be used as a query, observation or mediating node regardless of its original classification. The next step is to decide on the variables’ states. Every variable is a collection of an exhaustive set of mutually exclusive events. Therefore, for each variable, the right partition should be chosen to include all events of interest.
**The structure** After the variables are identified, the expert needs to characterise the relationships between them. As described above, the procedure for constructing the structure using the definition of the Markov condition depends on the order of the variables and is therefore impractical. In general, for the purpose of eliciting the graphical model, the formal notion of directional conditional independency can be replaced by the informal notions of causation and influence [72].
**The cause-and-effect approach in BNs** The ‘cause-and-effect’ approach is the most widely used in the construction of the network. This approach is very helpful because humans are good at making judgments concerning immediate cause-and-effect relationships among variables, and therefore the task of determining the structure by identifying the direct causes is natural [38, 50, 65, 72]. This approach also enables the application of local thinking in the construction process. The expert determines the direct causes of each node, and even though the decisions are made locally, their sum is complete and consistent. In addition, causal relationships often assert conditional independencies and therefore tend to produce a sparse structure [72]. Thus, the most important principle in viewing a BN structure is that one can regard an arc from A to B as indicating that A ‘causes’ B.
### 2.3.2 Graph assessment

Elicited structures are assessed by how well their topology represents the independency assumptions of the domain, namely, how well the representation meets the experts’ expectations. It is well known that the d-separation criterion (section 2.2.1) can be used to check independency assumptions [44, 57], [72, p. 117]. However, there are problems in the actual implementation of existing methods, as shown below:
A list of all d-separation situations can be produced by the BN expert for reviewing by the DEs.
__Problem:__ the size of such a list is exponential in the number of nodes.
Pearl [72, p. 118] suggests that the procedure of testing d-separation can be handled by visual inspection of the graph.
__Problem:__ while this may be the case for BN experts when the network at hand is small, it is far from being obvious to someone who is not a BN expert [58, 96].
There are oracles that implement d-separation; an example of such an implementation is Wimberly’s applet [108]. The oracle expects a DAG D, a set of nodes **X**, a set of nodes **Y** and a set of nodes **Z**, and then answers yes or no to the question “are **X** and **Y** d-separated by **Z** in D?”. The user does not have to work the answer out by inspection, as the oracle provides it.
__Problem:__ telling the user that two nodes are d-separated gives some information that is, again, meaningless unless the user is a BN expert.
A problem that is common to all these methods is that the DE may not know how to assess the structure, where to start and what should be examined.
Another approach that is reported in the literature is to assess dependency assumptions by presenting the user with **case descriptions** to help them access the relevant knowledge [96]. A case description describes a situation from the domain and ends with a question regarding the flow of information in the described situation. These case descriptions comprise sentences that describe the variables of interest and the relationships between them as represented by the graph structure. They present a question for the DE to affirm or deny, all in ‘a story style’ and in the domain terminology. This approach can obviously be used for assessing independency assumptions. Using this method, the BN expert chooses the relationships to be investigated.
__Problem:__ this approach is very time consuming [44], since every situation needs a story tailored by the BN expert in order to be assessed, and therefore any change requires a new tailored story.
Thus, although it is widely accepted that it is important to review and assess the independency assumptions that are represented by the topology of the graph, possibly using the d-separation criterion, the question of how to (efficiently) assess the graph topology is still open.
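To make the d-separation criterion concrete, the standard moralisation test can be sketched in a few lines of code: **X** and **Y** are d-separated by **Z** exactly when they are disconnected in the moral graph of the ancestral set of **X** ∪ **Y** ∪ **Z** once the nodes of **Z** are removed. The following minimal Python sketch illustrates this; the `dag` dictionary format is an assumption for illustration, not the interface of any of the tools cited above:

```python
from itertools import combinations

def ancestors(dag, nodes):
    """All ancestors of `nodes`, including the nodes themselves."""
    parents = {v: set() for v in dag["nodes"]}
    for u, v in dag["edges"]:
        parents[v].add(u)
    result, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in result:
            result.add(n)
            stack.extend(parents[n])
    return result

def d_separated(dag, xs, ys, zs):
    """Moralisation test: X and Y are d-separated by Z iff they are
    disconnected in the moral graph of the ancestral set of X|Y|Z,
    after the nodes in Z are removed."""
    keep = ancestors(dag, set(xs) | set(ys) | set(zs))
    parents = {v: set() for v in keep}
    for u, v in dag["edges"]:
        if u in keep and v in keep:
            parents[v].add(u)
    # Build the undirected moral graph: parent-child edges plus
    # 'marrying' edges between co-parents of a common child.
    adj = {v: set() for v in keep}
    for v, ps in parents.items():
        for p in ps:
            adj[v].add(p); adj[p].add(v)
        for a, b in combinations(ps, 2):
            adj[a].add(b); adj[b].add(a)
    # Remove Z and test reachability from X to Y.
    live = keep - set(zs)
    seen, stack = set(), [x for x in xs if x in live]
    while stack:
        n = stack.pop()
        if n in seen or n not in live:
            continue
        seen.add(n)
        stack.extend(adj[n])
    return not (seen & set(ys))
```

On the classic v-structure A → C ← B, this correctly reports A and B as d-separated marginally but d-connected once C is observed.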
### 2.3.3 Parameters elicitation

Parameters in BNs are the probabilities of the states of a node given its parents. There are various excellent sources of probabilities, but each has some disadvantages:

- **Data collection:** The data may contain biases arising from the strategies used for their collection. The data do not always match the variables and values that are of interest for the model, and in many situations contain missing values.
- **Literature:** Almost never covers everything. The conditional probabilities may be given in the opposite direction of what is required by the model, and some probabilities are not found at all. In addition, a discrepancy between the characteristics of the population from which the information was collected and those of the population for which the model is developed can make these probabilities completely irrelevant.
- **Human expert knowledge:** Bias and calibration problems with eliciting probabilities from experts have been described [23, 52].
To overcome these problems, a large number of methods for supporting the assessment of probabilities have been developed [79, 107]. These methods can essentially be categorised into: (i) a direct approach, in which the experts are asked a question like “What is your belief regarding the probability that an event A will occur?” This approach can be supported by visual and verbal methods such as the probability scale and the probability wheel. (ii) An indirect approach, in which the two main methods are the ‘betting’ and the ‘lottery’ approaches [78, 101]. This approach is not always intuitive and easy for the experts to use [98]. To partially overcome the limitations of these approaches, other methods have been developed in which (i) experts provide both qualitative and quantitative information [25, 95], or (ii) experts make pair-wise comparisons between events [83].
In general, the main problem with knowledge elicitation from experts is that it is time-consuming and, therefore, considered the most daunting task in the construction of a BN [26]. To reduce this burden, the expert can apply Local Model Structure assumptions for a variable and its parents. The **Local Model Structure** is an additional structure in the domain, which is reflected by the parameters. In this case, only some (or one) of the probabilities are assessed directly, and a rule for computing the rest is provided instead of direct assessment of all probabilities. Such rules are the noisy-OR and noisy-AND and their generalisations [40, 72, 88], or symmetric independence [37]. These rules make use of further causal assumptions in the domain; they are additional structural assumptions of the domain [58]. However, since they are reflected only through the numbers, they are beyond the scope of this thesis. In this thesis I shall refer to structural assumptions as only those that are reflected by the graph. Another way to reduce this burden is to apply changes to the graph structure, such as divorcing parents [70] and omitting weak links [54].
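As an illustration of how a local model structure reduces the elicitation burden, the leaky noisy-OR rule builds the full CPT of a binary effect from one ‘link’ probability per parent plus a leak term, instead of one probability per parent configuration. A minimal sketch (the function name and dictionary layout are illustrative assumptions):

```python
def noisy_or_cpt(leak, link_probs):
    """P(effect = true | parent configuration) under a leaky noisy-OR.
    link_probs[i] is the probability that parent i alone causes the
    effect; leak is the probability the effect occurs with no cause."""
    n = len(link_probs)
    cpt = {}
    for config in range(2 ** n):
        present = tuple((config >> i) & 1 for i in range(n))
        # The effect is absent only if the leak and every active
        # parent all fail to trigger it.
        p_none = 1 - leak
        for on, p in zip(present, link_probs):
            if on:
                p_none *= (1 - p)
        cpt[present] = 1 - p_none
    return cpt
```

With n binary parents, direct elicitation requires 2^n assessments, whereas the noisy-OR rule requires only n + 1.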
### 2.3.4 Parameters assessment

In general, BNs are sensitive to the relative size of the numbers in each of the CPTs, i.e., the relative sizes of P(X|Y) and P(X|¬Y) are of significance, especially if one of the numbers is an order of magnitude larger than the other. There is evidence of inaccuracies in the parameters of both BNs that are insensitive [75] and BNs that are sensitive [14]. Therefore, how inaccuracies in the numbers affect the model output probably depends on the domain and the model. There are several methods, collectively called **sensitivity analyses**, that analyse model behaviour to study the effects of inaccuracies on the model output:
In **Sensitivity analysis**, one or more of the network’s parameters are varied systematically while all other parameters are kept fixed, and the effect on the parameter(s) of interest is examined [17]. Efficient computational methods for applying sensitivity analysis to BNs have been developed [17]. These techniques can be used to identify the parameters that require a higher level of accuracy.
In **Constraint enforcement analysis**, the analysis identifies parameter changes needed to enforce a constraint on the difference or the ratio between two nodes of interest [14].
Other forms of sensitivity analysis examine the behaviour of the model with respect to observations. These include **What-if analysis**, in which the states of some variables are varied systematically and nodes of interest are examined [22], and **Conflict analysis**, which measures conflicting evidence [51]. In **Importance analysis** (e.g. Netica [66]), the most influential nodes with respect to a chosen node of interest are selected.
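The idea of one-way sensitivity analysis can be sketched on a hypothetical two-node network Disease → Test: the test’s sensitivity parameter is swept while the prior and specificity stay fixed, and the posterior P(disease | positive test) is the output of interest. This toy example is mine, for illustration only, and is not the efficient computational scheme of [17]:

```python
def posterior_disease(prior, sens, spec):
    """P(disease | positive test) for the two-node network D -> T,
    computed directly by Bayes' rule."""
    p_pos = sens * prior + (1 - spec) * (1 - prior)
    return sens * prior / p_pos

def one_way_sweep(values, prior=0.01, spec=0.95):
    """Vary one parameter (the test's sensitivity) while all other
    parameters are fixed, recording the posterior of interest."""
    return [(v, posterior_disease(prior, v, spec)) for v in values]
```

Plotting such a sweep shows how flat or steep the output is in that parameter; steep regions mark parameters that require accurate assessment.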
### 2.3.5 Model Assessment

To date, there is no established methodology for the KE process of BNs as a whole, nor is there a methodology for the evaluation process in particular [44, 68]. Nevertheless, there are some case reports concerning the KE and evaluation processes. The purpose of the assessment is to examine how adequately the model captures the world one is trying to model. Therefore, elicited models are assessed by how well their predictions on particular case scenarios meet the experts’ expectations.
One approach is the __Case-Based__ assessment, in which the answer of the network (i.e. the posterior probabilities) is tested in individual case scenarios [58, 68]. The line of reasoning of the network can also be tested, i.e. is the correct explanation provided for an ‘entered’ case? This line of reasoning is determined by the net posteriors and the flow of information in the net structure; that is, through which trails the observed information changed the posteriors of the hypothesis nodes. This can be done using explanation methods for BNs (see section 2.5) [26].
Another approach is the __General-View__ assessment, in which the network is tested on a large number of cases. The results obtained from the network (the posterior probabilities) are compared with the experts’ expectations for each case, and the output of this assessment is represented as a table that summarises the differences [68, 97].
Both approaches are intended to draw attention to indications of incorrect topology or probabilities in the model.

### 2.3.6 Problems arising during the KE process

The argument that the causative approach to the construction of the graph is very intuitive and therefore easy to apply is far too simplistic [58]. While causal structures tend to work well and the graphs tend to be sparse, problems may still exist.
- Some relationships are not causal. The probabilistic dependencies and conditional independencies are fundamental considerations in the construction of the network, rather than the perceived causal relationships. Moreover, when constructing a BN, the expert needs to acknowledge all statistical correlations because the graph should satisfy the Markov condition. However, not all correlations between variables can be given a causal interpretation [44] and some may have an overlapping meaning [106]. These relationships may be inadvertently ignored.
- The causal relationships of the domain are not always known, and therefore the causal mechanism of the domain may not always be clear.
- The implications of the BN structure are not always clear to the DE. In other words, the concept of a graph as a representation of dependencies-independencies relationships is far from being easy to explain or understand [57, 58].
After overcoming these problems and constructing the graph intuitively using the causal approach, the issue of trade-offs needs to be dealt with. The causal relationships might be hampered by this process of trade-offs, and the intuitive approach may no longer be sufficient.
**Trade-offs** A desired mathematical model should have three main characteristics: simplicity, accuracy and explanatory power [13].
- The model should follow Occam’s Razor, which says: “one should not increase, beyond what is necessary, the number of entities required to explain anything”. A model should be as simple as possible, as long as it works.
- The model must have good diagnostic and predictive power. It must agree with the majority, though not necessarily with all, of the observations.
- The model should have explanatory power. Producing the right output is not enough; the model should also be able to explain the output by giving a reasonable answer to the question “Why?”. In other words, the model should have a reasonable line of reasoning.
If a balance is achieved between these different characteristics, the model is a so-called **Good Model**. In reality, these requirements lead to inherent trade-offs in model construction. In addition, practical considerations need to be addressed. The model should be:

- Practically tractable (i.e. it gives an answer in a reasonable time). A model with less accuracy, but one that can give a prediction in a reasonable time, can be much more useful.
- Practically inexpensive (i.e. construction and maintenance efforts are kept as low as possible). Once all variables, their partitions and the relationships between them have been defined, the numerical parameters for the model need to be assessed. If there are thousands of parameters, each of which needs to be assessed individually, the model may be infeasible for practical reasons.
These considerations can impose changes on the network structure, making it subject to considerations other than intuitive causal relationships. This is a **Practical Model**. Nevertheless, no matter what interpretation is given to the network, its structure always represents the dependencies-independencies relationships in the domain, which should be correct (subject to the trade-off approximations mentioned).
**Implementation of Trade-offs within the process of designing a BN**
From the above it is clear that trade-offs in the construction of a BN can have a great impact on decisions regarding the network structure and *vice versa*.
**A simple model** The guideline for the simplest model is that, whatever criterion is chosen, the simplest model among all working models should be preferred. In BNs, a simpler model is usually the one that minimises probability elicitation. It can be a model with a simpler structure, one that contains fewer nodes, fewer states per node and fewer arcs. On the other hand, a simpler model can also contain more nodes and states but fewer arcs per node, thereby increasing conditional independencies and reducing the number of probabilities. It can also be a model that contains more structural assumptions in the local model structures.
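The effect of such structural choices on the elicitation burden can be made concrete by counting the free parameters a structure implies, a quantity worth tracking when comparing candidate structures. A small sketch, under the usual assumption that each CPT row over k states has k − 1 free entries (the input format is illustrative):

```python
def free_parameters(states, parents):
    """Number of independent probabilities to elicit for a BN, given
    each node's state count and an optional parent list per node."""
    total = 0
    for node, k in states.items():
        rows = 1  # one CPT row per configuration of the parents
        for p in parents.get(node, []):
            rows *= states[p]
        total += rows * (k - 1)
    return total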
**An accurate model** Although part of the accuracy is determined by the probabilities and the use of sensitivity analysis can help to increase accuracy through the numbers, the accuracy of the model is also determined by the structure. The numbers cannot overcome the conditional independencies that are encoded in the graph structure and therefore need to be represented correctly or the accuracy of the network will decrease. Also, if the structure includes too many details, its accuracy can be decreased as information can be spread over too many variables and states.
**A model with explanatory power** The explanatory power is essential in some domains, particularly when using the model as knowledge base for tutoring purposes and in decision-support systems (e.g. medicine) [92]. The probabilities quantify the influence of one variable on another but it is the structure that encodes whether one variable influences another under any circumstances of observed knowledge. Therefore, it is the structure that plays the pivotal role in the explanatory power of the model.
**A feasible model** As part of its structure, the model needs to contain a reasonable number of variables. The state space cannot be too big and the connectivity of the network cannot be too high. In order to allow the model to be constructed and maintained practically, the expert needs to minimize the number of probabilities to be elicited by applying local structural assumptions [58] or by applying changes to the network structure, i.e. reducing the number of nodes or states [102], increasing the number of independency-assumptions [54], [42], or adding nodes, states and arcs to reduce the number of parents of nodes [70].
When applying changes to the structure of the model in order to get the desired characteristics, examining the network structure for errors can be useful. Although this process can be done by testing the network as a whole, it is obviously practical to test it before eliciting the numbers. Since the structure is no longer subject to causal relationships only (if at all) the testing should explore the relationships that are encoded by the structure, i.e. the dependencies independencies relationships in the domain. This is the problem this thesis addresses.
The process of constructing a BN through knowledge elicitation from experts is time consuming, involves experts and is expensive. Therefore, a lot of work has been done on automated algorithms for learning BNs from data. In the next section I shall summarise some of the main methods. ### 2.3.7Automated Construction of a Network When comprehensive data are available, both parts of the model can be constructed automatically through learning algorithms, yielding a learnt network, which can be helpful in gaining understanding of the domain [38]. However, understanding the structure of the learnt network and what it represents is not simple and obvious. An investigation of the structure can be important and helpful in examining whether the learnt structure represents the experts’ knowledge of the domain and how, as well testing the limitations of the structure, what it represents and what it does not. In the following section I will discuss the background of the automated construction of the network.
**Parameters** Once the structure of the model is known, learning the appropriate probabilities from the data is based on studying subsets of the data that satisfy various conditions [38] and utilize standard procedures in statistics [10].
**Structure** There are essentially two main approaches for learning the structure. **Identification Methods **are based on independencies tests. The algorithm studies the dependencies and independencies among the domain variables and then constructs a graph that represents these relationships as close as possible [73] [87]. **Scoring and Search Methods **are based on a scoring metric, a function that combines the model, the data and possibly prior knowledge to yield a number that measures the agreement between the model, the data and the prior knowledge. These methods typically use heuristic **Search Methods** to explore the space of possible models in order to find models to score. These methods vary in the way they score the model and the search methods they use.
Many of the scoring methods are based on calculating the probability of the structure given the data. The structure with the highest probability is selected. A different approach applies the Occam’s Razor principle in conjunction with the highest probability [60]. Other approaches approximate this probability. These methods find a local maximum probability as a function of the physical probabilities associated with the structure. Such a maximum will always choose the most complex structure and therefore the score includes a penalty based on the complexity of the Net. Some examples for such methods are the A information Criterion (AIC) [3], The Bayesian Information Criterion (BIC) [38] and the Minimum Description Length (MDL) [80] which is based on the Minimum Message Length (MML) approach [100].
These methods require the assignment of an initial probability to all possible structures. In general, the user constructs a prior network structure for the domain, which serves as the best structure to represent the domain. The prior probability of a structure is then computed as a function of the number of arcs that are not the same in the two structures.
Structures that statistically score high represent the domain adequately but may not necessarily have adequate explanatory power and do not always represent the experts’ knowledge of the domain. These discrepancies may evolve from the following reasons:
1) Discrepancies that evolve from the algorithms
**The algorithms search for statistical patterns whereas model designers are searching for logical patterns**. The learning-algorithms are searching for patterns (i.e. statistical relationships) in the data. However, designers of models wish to capture logical patterns in a model thus focussing only on significant and meaningful patterns. The statistical model and the logical model do not always match [31]. For example, in the user-modelling domain, in addition to the statistical relations the designer is interested in patterns of user-behaviour. Most algorithms use heuristic search methods for structure-selection and thus contain randomness. A structure scores ‘high’ because statistically it gives ‘good’ results. Among all structures that score high some may represent the domain better than others but the selection is random. The algorithm is not capable of determining whether the selected structures are logically ‘good’ (have explanatory power).
2) Discrepancies that evolve from the data
**The data do not always represent what model designers are trying to model**. The existing data are usually collected for various purposes using different collection strategies and therefore may contain undesirable biases [26]. For the purpose of learning-algorithms, the data are usually split randomly into a training-set and a testing-set. In this process important information may be lost from the training-set and the data may not represent the domain adequately anymore.
To partially overcome these problems, the most important feature of these methods is that they allow **additional prior knowledge **to be taken into consideration. This knowledge can be incorporated into the prior structure, given by assigning different penalties to various nodes for different combinations of parents [11] or by asserting that some arcs in the prior structure must be presented and the priors of structures that do not satisfy the constraints are set to zero. The ability to use prior information is one of the strengths of these methods as it enables the mixture of subjective experts’ knowledge with objective statistical measures on which the resultant structure is based. |