Mixing ICI and CSI Models for More Efficient Probabilistic Inference

Michael Roher, University of Guelph, 2020. Advisor: Dr. Yang Xiang.

Bayesian Networks (BNs) concisely represent probabilistic knowledge of uncertain environments by encoding causal dependencies and exploiting conditional independencies between variables. The strength of each variable's dependence on its parents is quantified by a conditional probability table (CPT). However, these CPTs suffer from exponential growth in the number of parents. To address the exponential growth, various local models have been introduced for representational savings and further inference efficiency. Some exploit context-specific independence (CSI), which concisely encodes duplicated probabilities. Others exploit independence of causal influence (ICI), which encodes causal relationships between variables. Existing techniques apply only ICI or only CSI in a BN, so exploiting one model sacrifices the savings yielded by the other. We develop an exact inference framework for BNs modelled with both: we apply Non-Impeding Noisy-AND Trees for ICI, and CPT-trees for CSI. The experimental evaluation demonstrates a significant inference efficiency gain beyond what is attainable by exploiting only one type of model.


Introduction

1.1 Overview
Uncertainty is ubiquitous in the real world. Whether a doctor is diagnosing a patient, a robot vacuum cleaner is sweeping a room, or a gambler is playing a card game in a casino, decisions are routinely made without complete knowledge of an uncertain environment. In the context of artificial intelligence, a common method of representing uncertain knowledge is Bayesian probability theory, which models the subjective belief of an agent as the probability of an event, given the knowledge that another event has occurred.
Consider the example of a doctor diagnosing whether or not a patient has a certain disease. The doctor will assess multiple factors, some of which are observed, while others are unknown. The observed factors may include the patient's medical history, blood tests, and any pre-existing conditions. However, these tests are not perfect: there may be false positives, or symptoms that are imperceptible through these tests. Thus, the doctor may represent the patient's likelihood of suffering from the disease as a probability, conditioned on the medical history, test results and pre-existing conditions. One method of representing the agent's subjective knowledge of the environment is through a joint probability distribution (JPD). Given a set of variables in an environment, the JPD consists of a table specifying every instantiation of all variables in the set. Each instantiation has a corresponding probability specifying the likelihood of the instantiation occurring. In small environments, a JPD may be sufficient for inference. But as the number of variables increases, the size of the JPD grows exponentially, quickly leading to models requiring billions of instantiations.
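To make the growth concrete, the size of a JPD is the product of the variables' domain sizes, so even thirty binary variables already demand over a billion entries:

$$\prod_{i=1}^{n} |\mathrm{dom}(m_i)| = 2^{n} \ \text{for } n \text{ binary variables}, \qquad 2^{30} = 1{,}073{,}741{,}824.$$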
The exponential growth of a JPD is not simply a representational issue. It leads to computational inefficiencies during inference and can ultimately lead to intractability in large environments. One method of curbing the exponential growth is to apply a model that exploits conditional independencies between variables. For example, the likelihood of a patient suffering from a disease may depend on a certain genotype, which may itself be affected by the parents' genotypes. Once the patient's genotype is identified, however, the parents' genotypes are no longer relevant.
A JPD does not take advantage of the independencies between variables. This was addressed by Pearl and others, who introduced Bayesian networks (BNs) [12] to model the structure in the environment. A BN consists of two components: a directed acyclic graph that encodes the dependencies and exploits the conditional independencies between variables, and a conditional probability table (CPT) for each node in the graph, quantifying the strength of the node's dependence on its parent variables. This reduces the number of parameters required from exponential in the number of variables, as in a JPD, to linear in the number of variables (assuming bounded family sizes).
However, the CPTs are still exponential in the number of parents a node has in the graph. Previous research has noted that a CPT can be expressed more efficiently by replacing it with an alternative structure. Many of these alternative structures can be grouped into two classes: Independence of Causal Influence (ICI) Encodes the relationship between a variable and its parents more efficiently. For example, consider a patient recovering from a headache by taking medicine and increasing water intake. An efficient ICI model would represent the patient's recovery through each treatment individually. That is, it would require the probability of a patient recovering from the headache by only taking medicine, and the probability of a patient recovering from the headache by only increasing their water intake. Operations on these events allow the probabilities of all other combinations of treatments (i.e., both treatments, or neither treatment) to be computed. These operations reduce the number of instantiations required to specify the model from exponential to linear in the number of treatments. Models in this class include Noisy-OR [12], Noisy-MAX [8], DeMorgan [10] and Non-Impeding Noisy-AND Trees (NAT or NIN-AND) [21]. Further details of ICI models are available in Section 2.7.
Context-Specific Independence (CSI) Encodes duplicate values in probability tables more efficiently. For example, consider a patient's recovery from surgery that is dependent on whether or not they receive physiotherapy, and the skill of the physiotherapist. If the patient completes physiotherapy, then the likelihood of recovery increases. The magnitude of the increase is dependent on the skill of the physiotherapist. A highly skilled physiotherapist will result in a significant increase to the likelihood of recovery, whereas a less skilled physiotherapist will result in a small increase to the likelihood of recovery. On the other hand, if the patient declines physiotherapy, then the likelihood of recovery decreases.
In the most naive form, this would require four instantiations. However, it is observed that when the patient declines physiotherapy, the likelihood of recovery decreases, regardless of the skill of the physiotherapist. A CSI model would encode this model efficiently by exploiting the fact that the variable encoding the physiotherapist's skill is redundant when the patient declines physiotherapy.
This would reduce the number of instantiations to three. Models in the CSI class include default tables [3], CPT-trees [3], rule-based CSI [14], and algebraic decision diagrams [4]. Further details of CSI models are available in Section 2.8.
Unfortunately, inference methods for Bayesian networks modelled with CPTs are not directly compatible when alternative structures are used. Most naively, the alternative structures may be expanded into exponentially sized CPTs in order to execute these inference methods. This is undesirable as it discards all savings the alternative structure provides. Instead, significant research has been directed towards identifying techniques that prepare a BN modelled with alternative structures for efficient inference. The specific techniques vary depending on the alternative structure.
For instance, some models may have special conversion methods that expand the alternative structure to a probability table while maintaining computational savings.
To our knowledge, all previous research has focused on replacing each probability table in a BN with one class of alternative structures. This restricts the BN to the same class for all variables. Since these alternative structures apply on a per-variable basis, it is plausible they can coexist in the same environment. Restricting ourselves to one class relinquishes the opportunity to exploit the other class in the same BN.
Moreover, if a variable is not well-suited for the alternative structure, then it must be approximated by the alternative structure, or represented by a probability table.

The purpose of this thesis is to develop an inference framework for BNs modelled with both ICI and CSI alternative structures. When both exist in a Bayesian network, we apply NAT models for ICI and CPT-tree models for CSI. Each alternative structure is then compiled into a probability table through special conversion methods that preserve computational savings. The result is an efficient BN where each variable is quantified by a tabular representation. This efficient BN is then compatible with any typical BN inference method.

Contributions
In this thesis, we make four main contributions:

1. We propose a framework to exploit both NAT and CSI local models in probabilistic inference. We evaluate the efficiency of this framework on Lazy Propagation, an inference technique shown to attain a two orders of magnitude speedup on very sparse Bayesian networks [11].

2. We empirically demonstrate that one class of alternative structures cannot be efficiently and exactly encoded by the other class, thereby validating the necessity of this research.

3. We generalize and formalize the CPT-tree network transformation algorithm by specifying a comprehensive algorithm suite. This advances the initial idea presented by Boutilier et al. [3].

4. We establish the existence of CSI in real-world BNs. These results, in conjunction with previous research showing the existence of ICI in real-world BNs [22], demonstrate the coexistence of both ICI and CSI in the real world.

Thesis Layout
The remainder of this thesis is laid out as follows. Chapter 2 is a summary of the background knowledge underlying this thesis. Chapter 3 empirically demonstrates that NAT models and CSI models cannot efficiently and exactly encode each other.

Background
In this chapter, we present an overview of the background material underlying this thesis. The chapter is organized as follows: Sections 2.1 to 2.3 review graph theory fundamentals, two common interpretations of probability theory, and potentials, respectively. Section 2.4 reviews joint probability distributions and an accompanying inference method. Section 2.5 reviews Bayesian networks and two accompanying inference methods. Section 2.6 reviews accuracy metrics used to evaluate the closeness of an approximation to its ground truth. Finally, we review independence of causal influence in Section 2.7 and context-specific independence in Section 2.8, each of which may be exploited for further representational and inference savings.

Graph Theory Fundamentals
A graph G = (N, E) is a mathematical structure representing a non-empty set of nodes N connected by a set of edges E. Edges can be undirected or directed. An undirected edge {n_i, n_j}, where n_i, n_j ∈ N, is a symmetric connection between the two endpoints. A directed edge (n_i, n_j), where n_i, n_j ∈ N, is an asymmetric connection between the two endpoints. We note the notation is intentionally different: braces represent an unordered pair, while parentheses represent an ordered pair. If an edge (n_i, n_j) is directed from n_i to n_j, the node n_i is said to be a parent (or source) of n_j, and n_j is said to be a child (or target) of n_i. Two nodes are adjacent if they are connected by an edge. A graph with all directed edges is a directed graph, a graph with all undirected edges is an undirected graph, and otherwise the graph is a hybrid graph. A walk in a directed graph G = (N, E) is a sequence of nodes (n_0, n_1, n_2, . . . , n_k) with a corresponding sequence of edges ((n_0, n_1), (n_1, n_2), . . . , (n_{k-1}, n_k)) such that each node n_i in the sequence is in N and each edge (n_i, n_{i+1}) in the sequence is in E. We denote n_0 and n_k as the start and end of the walk, respectively. A path is a walk where each node in the node sequence occurs only once. A cycle is a walk that starts and ends at the same node (i.e., n_0 = n_k) with no other repeated nodes. A graph is said to be cyclic if it contains at least one cycle, and acyclic otherwise. A graph is connected if there is a path between every pair of nodes. A tree is an acyclic connected graph that has exactly one path between every pair of nodes.
A clique of an undirected graph G is a fully-connected subgraph within G. For example, the set of nodes {t, v, y, z, w} in Figure 2.1 (a) is a clique of size 5.

Interpretations of Probability
In partially observable and uncertain environments, an agent will not have complete knowledge of the environment. A common approach to operate under uncertainty is through probability theory. Two interpretations of probability include frequentist and Bayesian.
Frequentist probability interprets probability as the frequency of an event occurring. It is objective, and the frequentist probability of an event α converges as the total number of trials approaches infinity:

$$P(\alpha) = \lim_{n \to \infty} \frac{t}{n},$$

where t represents the number of trials in which the event α occurs, and n represents the total number of trials. This is commonly approximated by P(α) ≈ t/n. However, when an event can only occur once, or it is impractical to repeat the event in the real world, the frequentist probability of the event is undefined. For example, the probability of the next global pandemic occurring cannot be observed by repeating experiments.
Bayesian probability interprets probability as a subjective value indicating one's degree of belief in the event occurring. The Bayesian interpretation is capable of representing unrepeatable events. For example, the probability of the next global pandemic occurring may be specified by an infectious disease expert's degree of belief.
Throughout this thesis, we will assume the Bayesian interpretation.

Potentials
In this section, we introduce potentials, which are used in subsequent methods. A potential φ(M) over a set of variables M is a function mapping each instantiation in val(M) to a non-negative real number. Potentials do not necessarily comply with the laws of probability: they may hold values outside of [0, 1]. Every CPT is a potential, but not every potential is a CPT.
If M_1 and M_2 are sets of variables with corresponding potentials φ_1(M_1) and φ_2(M_2), then their product is the potential over M_1 ∪ M_2 specified by

$$\phi(M_1 \cup M_2) = \phi_1(M_1)\,\phi_2(M_2),$$

where each instantiation of M_1 ∪ M_2 is assigned the product of the consistent entries of φ_1 and φ_2. Potentials also support marginalization to sum out a variable x ∈ M:

$$\phi(M \setminus \{x\}) = \sum_{x} \phi(M).$$
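The following is a minimal sketch of these two operations, assuming potentials are stored as dictionaries keyed by instantiation tuples; the Potential class and its representation are illustrative, not the thesis's implementation.

```python
from itertools import product as cartesian

class Potential:
    """A potential: maps instantiations of a variable set to non-negative reals."""
    def __init__(self, variables, domains, table):
        self.variables = list(variables)   # ordered variable names
        self.domains = dict(domains)       # variable name -> list of values
        self.table = dict(table)           # instantiation tuple -> float

    def multiply(self, other):
        """Pointwise product over the union of the two variable sets."""
        union = self.variables + [v for v in other.variables
                                  if v not in self.variables]
        doms = {**self.domains, **other.domains}
        table = {}
        for inst in cartesian(*(doms[v] for v in union)):
            assign = dict(zip(union, inst))
            a = self.table[tuple(assign[v] for v in self.variables)]
            b = other.table[tuple(assign[v] for v in other.variables)]
            table[inst] = a * b            # consistent entries are multiplied
        return Potential(union, doms, table)

    def marginalize(self, var):
        """Sum out `var`, yielding a potential over the remaining variables."""
        keep = [v for v in self.variables if v != var]
        idx = self.variables.index(var)
        table = {}
        for inst, value in self.table.items():
            reduced = inst[:idx] + inst[idx + 1:]
            table[reduced] = table.get(reduced, 0.0) + value
        return Potential(keep, {v: self.domains[v] for v in keep}, table)
```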

Joint Probability Distributions
To represent uncertain knowledge in an environment, one may choose to use a joint probability distribution (JPD). A JPD encodes a probability for each instantiation of a set of variables M. An instantiation of M represents an assignment of value(s) to each variable in M. Figure 2.2 presents a JPD for 7 variables u, v, t, w, x, y, z. Each variable is binary, with each value of the form: the variable's letter followed by an index (e.g., {t_0, t_1}). The probability for a given instantiation is obtained by locating the instantiation in the JPD. For instance, if t, u, v, w, x, y, z hold t_0, u_0, v_0, w_0, x_0, y_0, z_0, respectively, then we obtain the probability of this event as P(t = t_0, u = u_0, v = v_0, w = w_0, x = x_0, y = y_0, z = z_0) = 0.0020. However, a critical limitation of JPDs is that the number of probabilities in a JPD is exponential in the number of variables in the environment: the example in Figure 2.2 already requires 2^7 = 128 probabilities.

Inference By JPD
A common operation on uncertain knowledge representations is inference: the process of determining the probability of an event, given a set of observations. To demonstrate inference by JPD, suppose we would like to know the posterior distribution P(z | x = x_0) from the JPD in Figure 2.2. The method begins by first updating the JPD to P(t, u, v, w, y, z | x = x_0) by the following product rule [20]:

$$P(t, u, v, w, y, z \mid x = x_0) = \frac{P(t, u, v, w, y, z, x = x_0)}{P(x = x_0)}.$$

The distribution in the numerator, P(t, u, v, w, y, z, x = x_0), is computed by setting the probability of all instantiations inconsistent with the observation x = x_0 to 0. In other words, given an instantiation (t', u', v', w', y', z', x'), if x' ≠ x_0, its corresponding probability is set to 0. The denominator is subsequently computed as the sum of the remaining terms. We then divide each remaining term by this sum, resulting in the distribution P(t, u, v, w, y, z | x = x_0).
We then apply the marginalization operation to sum-out all non-query variables.
In this example, the exponential nature of the JPD requires maintaining a potential over 7 variables, or 2^7 = 128 instantiations. For further details on inference by JPD, refer to [20].
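A sketch of this procedure, under the same dictionary representation assumed above (the function name and structure are ours, not the thesis's):

```python
def posterior(jpd, variables, domains, query, evidence):
    """Compute P(query | evidence) from a full JPD.

    jpd: dict mapping full instantiation tuples to probabilities;
    variables: ordered variable names; domains: name -> list of values;
    query: a variable name; evidence: dict of observed name -> value.
    """
    index = {v: i for i, v in enumerate(variables)}
    # Keep only instantiations consistent with the evidence (others become 0).
    consistent = {inst: p for inst, p in jpd.items()
                  if all(inst[index[v]] == val for v, val in evidence.items())}
    norm = sum(consistent.values())            # the denominator, P(evidence)
    dist = {val: 0.0 for val in domains[query]}
    for inst, p in consistent.items():         # sum out all non-query variables
        dist[inst[index[query]]] += p / norm
    return dist
```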

Representation
To improve on the exponential explosion of JPDs, Pearl and others [12] proposed Bayesian networks (BNs), a more efficient representation of uncertain knowledge. Before introducing Bayesian networks [12,20], it is first necessary to define conditional independence.

Conditional Independence Let A, B and Z be disjoint subsets of variables. A and B are conditionally independent given Z, denoted I(A, Z, B), iff

$$P(A \mid B, Z) = P(A \mid Z) \quad \text{whenever } P(B, Z) > 0.$$

Conditional independence can be exploited by observing that it is unlikely that every variable directly affects every other variable. Instead, it is likely that only a subset of the variables in the environment directly affects a given variable. This is the key idea that allows Bayesian networks to factor JPDs into a more efficient representation.
Formally, a Bayesian network (M, G, P) is a triplet specified in terms of the following: • M is a set of variables.
• G is a directed acyclic graph whose nodes correspond one-to-one to members of M . Each variable in the graph is conditionally independent of its nondescendants given its parents.
• P is a set of conditional probability distributions (CPDs), one for each variable m_i ∈ M, specifying the distribution of m_i given its parents π(m_i): P(m_i | π(m_i)).

The set of CPDs for a variable, one per instantiation of its parents, is said to be a conditional probability table (CPT).
Intuitively, a Bayesian network can be viewed as a graph specifying the dependence structure of the variables, with the conditional distributions specifying the strength of each dependence. An example BN is presented in Figure 2.3. Since a BN is a factored JPD, the JPD for a BN can be retrieved by combining the CPDs according to the chain rule.
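For the example network of Figure 2.3, whose factors reappear in the elimination of Section 2.5.2, the chain rule gives:

$$P(t,u,v,w,x,y,z) = P(t)\,P(u)\,P(x)\,P(v \mid t)\,P(w \mid t,u)\,P(y \mid v)\,P(z \mid y,w,x).$$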
The BN uses a DAG in conjunction with multiple, smaller CPTs to efficiently represent JPDs. Each CPT has a size exponential in the family size, O(d^f), where d represents the largest domain size in the family and f represents the family size.

Inference by Variable Elimination
In Section 2.4.1, we demonstrated that a JPD supports inference through a potential's marginalization and product operations applied to the joint distribution. This suffers from intractability, as the JPD is exponential in the number of variables. Since a BN factorizes a JPD into smaller CPTs by exploiting conditional independence, a valid yet naive approach to inference on a BN would be to convert the BN into a JPD and perform inference on the resulting JPD.

Suppose we would like to compute the prior P(z) from the distribution shown in Figure 2.3. For brevity, we omit the introduction of evidence; refer to [5] for details.
One method of improving on this approach is variable elimination. Similar to inference by JPD, variable elimination is an inference technique that consists of successively summing-out all non-query variables to construct a marginal distribution over the remaining query variables. The key insight, however, is that variables can be marginalized out while keeping the original distribution and all successive distributions in some factored form [4]. This is achieved by rewriting in terms of conditional independencies.
$$P(z) = \sum_{t,u,v,w,x,y} P(z \mid y, w, x)\,P(x)\,P(t)\,P(v \mid t)\,P(y \mid v)\,P(w \mid t, u)\,P(u).$$

We factor this expression by re-arranging terms and pushing the summations inside the products. This allows the summations to be performed as early as possible and the products as late as possible.
Summing out the variables one at a time in this manner produces a sequence of intermediate potentials φ_1, φ_2, . . . , and the savings of variable elimination can be observed from their sizes. The largest potentials maintained were φ_3(w, t, u, y) and φ_5(z, y, w, x), each over 4 variables with 2^4 = 16 instantiations. This is a significant improvement over the inference-by-JPD approach, which would have required a potential over 7 variables with 2^7 = 128 instantiations.
However, variable elimination suffers from a key limitation: it can answer only one query at a time. Improvements to variable elimination integrate alternative data structures that cache calculations, avoiding re-computation across different queries on the same evidence [4]. However, if the evidence changes, the cache must be discarded and the variable elimination process restarted. This limits the generality of the inference algorithm.
While some alternative methods discussed in this thesis incorporate variable elimination, we do not make use of it due to the aforementioned limitation.
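Although this thesis does not use variable elimination, a compact sketch of the general algorithm clarifies the factored computation above; it assumes the Potential class sketched in Section 2.3, and the function name is ours.

```python
def variable_elimination(potentials, elim_order):
    """Sum out each variable in elim_order from a list of Potential objects.

    Assumes each eliminated variable appears in at least one potential.
    """
    potentials = list(potentials)
    for var in elim_order:
        # Gather only the potentials that mention the variable being eliminated.
        related = [p for p in potentials if var in p.variables]
        potentials = [p for p in potentials if var not in p.variables]
        prod = related[0]
        for p in related[1:]:              # multiply the related potentials...
            prod = prod.multiply(p)
        potentials.append(prod.marginalize(var))  # ...then sum the variable out
    result = potentials[0]
    for p in potentials[1:]:               # combine whatever remains
        result = result.multiply(p)
    return result                          # proportional to the query marginal
```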

Dependency Structure Compilation
One method of computing reusable, efficient inference queries in a BN is to convert the DAG of the BN into a junction tree (JT). The process is demonstrated below on the DAG pictured in Figure 2.3. The first step is to moralize the DAG: every pair of parents of each node is connected by an undirected edge, and all directed edges are then replaced by undirected edges, yielding the moralized graph in Figure 2.4 (a). Next, we convert the moralized graph into a triangulated graph G_T by breaking up all cycles that are longer than 3 nodes in length. Cycles are broken up by adding edges such that every cycle of 4 or more nodes contains a chord, i.e., an edge connecting two non-consecutive nodes of the cycle. An example of triangulation is presented in Figure 2.4 (b).
The purpose of triangulation is to guarantee the existence of a junction tree.
Formally, a junction tree [20] is a triplet (M, Ω, E), specified in terms of: • M is a non-empty set of variables.
• Ω is a set of clusters such that all variables are contained in at least one cluster.
• E is the set of unordered edges connecting the clusters. Each edge is labelled by the intersection of the two clusters it connects. Two clusters Q_1 and Q_2 in Ω may be connected iff Q_1 ≠ Q_2 and their intersection contains at least one variable (Q_1 ∩ Q_2 ≠ ∅). The intersection Q_1 ∩ Q_2 must be contained in every cluster on the path between Q_1 and Q_2.
In the context of junction trees, we refer to a clique as a clique from the triangulated graph, and use cluster and node interchangeably to denote a node in the junction tree.
To generate the junction tree, we begin with an empty graph G_JT. A node is added to G_JT for each clique in G_T that is not fully contained within a larger clique. Each node is labelled by the variables in its clique. Clusters are connected such that, if two clusters have any variable(s) in common, the shared variables are included in every cluster on the path between them.
Between adjacent clusters, each edge (separator) is labelled with the intersection of the two clusters. The resulting junction tree is shown in Figure 2.4 (c).
The advantage of a junction tree representation is it may be used for multiple inference queries and can incorporate different observations, as long as the topology (structure) of the BN does not change. Once the junction tree is compiled, we can then perform inference.
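A sketch of the two graph transformations, using plain adjacency sets; the minimum-degree elimination heuristic below is a common choice and is our assumption, as the thesis does not name its triangulation heuristic.

```python
def moralize(parents):
    """Moralize a DAG: connect ('marry') each node's parents pairwise, then
    drop all edge directions.  parents: node -> set of parents; every node
    is assumed to appear as a key."""
    adj = {n: set() for n in parents}
    for child, pars in parents.items():
        for p in pars:
            adj[child].add(p)              # undirected copy of each edge
            adj[p].add(child)
        for p in pars:                     # marry every pair of co-parents
            for q in pars:
                if p != q:
                    adj[p].add(q)
    return adj

def triangulate(adj):
    """Add fill-in edges so every cycle of length 4 or more has a chord,
    by eliminating a minimum-degree node at each step."""
    work = {n: set(nb) for n, nb in adj.items()}
    tri = {n: set(nb) for n, nb in adj.items()}
    remaining = set(work)
    while remaining:
        n = min(remaining, key=lambda v: len(work[v] & remaining))
        nbrs = work[n] & remaining
        for p in nbrs:                     # make the neighbourhood a clique
            for q in nbrs:
                if p != q and q not in work[p]:
                    work[p].add(q); work[q].add(p)
                    tri[p].add(q); tri[q].add(p)
        remaining.remove(n)
    return tri
```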

Inference by Message Passing
Message passing is an inference technique that passes messages over the separators between adjacent clusters in the junction tree [1]. The technique is based on the concept of consistency.

Consistency [20] Let G_JT = (M, Ω, E) be a junction tree representation. Let Q_1 and Q_2 be two adjacent clusters in Ω with separator S, and let their associated potentials be φ(Q_1), φ(Q_2) and φ(S), respectively. Clusters Q_1 and Q_2 are said to be consistent if, for some constants k_1 and k_2,

$$\sum_{Q_1 \setminus S} \phi(Q_1) = k_1\, \phi(S) = k_2 \sum_{Q_2 \setminus S} \phi(Q_2).$$

The junction tree G_JT is said to be locally consistent if every pair of adjacent clusters is consistent. It is said to be globally consistent if, for any two clusters (not necessarily adjacent) Q_1 and Q_3, it holds for some constant k that

$$\sum_{Q_1 \setminus Q_3} \phi(Q_1) = k \sum_{Q_3 \setminus Q_1} \phi(Q_3).$$

Informally, for a junction tree to be locally consistent, the separator marginals of each pair of adjacent clusters must differ by only a scalar multiple. Global consistency is a stricter variation requiring the shared-variable marginals of every pair of clusters, adjacent or non-adjacent, to differ by only a scalar multiple. For junction trees, local consistency implies global consistency.
Inference by junction tree passes messages over separators with the objective of making the junction tree locally consistent. In order to make any two adjacent clusters consistent, the algorithm applies the absorption operation to update the separator and adjacent node's potentials.

Algorithm 2.1: Absorption
Let Q 1 and Q 2 be adjacent clusters with separator S in a junction tree.
Let their associated potentials be φ(Q_1), φ(Q_2) and φ(S), respectively. Q_1 absorbs from Q_2 by performing the following updates:

$$\phi(S)^{*} = \sum_{Q_2 \setminus S} \phi(Q_2), \qquad \phi(Q_1)^{*} = \phi(Q_1)\,\frac{\phi(S)^{*}}{\phi(S)}.$$

With absorption introduced, we can now summarize the inference-by-junction-tree algorithm [11,20]. The initialization step of the algorithm incorporates the CPDs into the junction tree. This involves converting the CPDs into potentials and assigning each resulting potential to a cluster that contains all variables in the potential. Multiple potentials may be assigned to the same cluster. Once all potentials have been assigned, we set each cluster's potential to the product of all potentials assigned to it.

Next, we update the cluster potentials for each observation. This is attained by setting the values of all entries inconsistent with the evidence to 0. We then arbitrarily select one cluster as the root node R. CollectEvidence is then recursively invoked on R to pass messages inward, from the leaves to the root. When CollectEvidence is invoked on a generic clique C_i, it invokes CollectEvidence on all other adjacent cliques {C_1, . . . , C_m}. Once these cliques have finished collecting evidence, C_i absorbs from each of them.

Algorithm 2.2: CollectEvidence
Let Q be a cluster in a junction tree G JT .
A caller is either an adjacent cluster or the junction tree G JT itself.
1. Cluster Q invokes CollectEvidence on each adjacent cluster, except the caller.
2. After each invoked cluster has completed, Q absorbs from it.
Once CollectEvidence is complete, we distribute the updated evidence outwardly from the root to the leaves by calling the DistributeEvidence algorithm.
When invoked on a clique C i from an adjacent clique C j , the algorithm C i absorbs from C j and then invokes DistributeEvidence on all other adjacent cliques.

Algorithm 2.3: DistributeEvidence
Let Q be a cluster in a junction tree G JT .
A caller is either an adjacent cluster or the junction tree G JT itself.
1. If caller is a cluster, Q absorbs from it.
2. Cluster Q invokes DistributeEvidence on each adjacent cluster, except the caller.

Treewidth The treewidth of a junction tree is the size of its largest cluster minus 1.
For example, the treewidth of the junction tree in Figure 2.5 is 3.
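The following sketch ties Algorithms 2.1-2.3 together. The Potential operations are those sketched in Section 2.3, division is pointwise with 0/0 taken as 0, and the tree interface (neighbours, potentials, separators) is a hypothetical stand-in for whatever structure holds the compiled junction tree.

```python
def divide(num, den):
    """Pointwise division of two potentials over the same ordered variables;
    0/0 is taken as 0, as is conventional for junction tree updates."""
    table = {inst: (0.0 if v == 0.0 else v / den.table[inst])
             for inst, v in num.table.items()}
    return Potential(num.variables, num.domains, table)

def absorb(tree, q1, q2):
    """Q1 absorbs from Q2 over their separator S (Algorithm 2.1)."""
    s = tree.separator(q1, q2)                 # holds s.variables and s.phi
    new_s = tree.potential(q2)
    for var in [v for v in new_s.variables if v not in s.variables]:
        new_s = new_s.marginalize(var)         # project phi(Q2) onto S
    ratio = divide(new_s, s.phi)               # phi(S)* / phi(S)
    tree.set_potential(q1, tree.potential(q1).multiply(ratio))
    s.phi = new_s                              # the separator keeps phi(S)*

def collect_evidence(tree, q, caller=None):
    """Algorithm 2.2: messages flow inward, from the leaves toward the root."""
    for r in tree.neighbours(q):
        if r is not caller:
            collect_evidence(tree, r, caller=q)
            absorb(tree, q, r)                 # Q absorbs from each invoked cluster

def distribute_evidence(tree, q, caller=None):
    """Algorithm 2.3: messages flow outward, from the root toward the leaves."""
    for r in tree.neighbours(q):
        if r is not caller:
            absorb(tree, r, q)                 # the neighbour absorbs from Q
            distribute_evidence(tree, r, caller=q)
```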

Lazy Propagation
Lazy propagation [11] is an improvement over traditional message passing. The key insight is that the potentials assigned to a cluster need not be multiplied during the initialization step. Instead, the multiplication can be deferred until it is required. The result of deferring the multiplications is fewer calculations and faster inference, at the cost of occupying more space. For further information, refer to [11].

Accuracy Metrics
A common method of evaluating the similarity between an approximation and its ground truth is by accuracy metrics. In this thesis, we make use of two accuracy metrics: Euclidean distance and Kullback-Leibler distance.

Euclidean Distance
Euclidean distance (ED) computes the straight-line distance between two vectors. Since we are comparing CPTs, a CPT can be interpreted as a set of indexed CPDs, where each state in the variable's domain represents a dimension in vector space. Given a ground truth CPT P_GT, let P_A represent a CPT that approximates P_GT, let m represent the number of CPDs in the CPT, and let n represent the number of probabilities in each CPD. The ED can be calculated as follows:

$$ED(P_{GT}, P_A) = \sqrt{\frac{1}{m\,n}\sum_{i=1}^{m}\sum_{j=1}^{n}\big(P_{GT}(i,j) - P_A(i,j)\big)^{2}},$$

where P(i, j) denotes the j-th probability of the i-th CPD. The result indicates the similarity of the CPTs. The value is bounded by [0, 1], where 0 indicates the CPTs are identical and values approaching 1 indicate the CPTs are entirely different. ED treats a difference of a given size the same regardless of the magnitude of the probabilities involved.

Kullback-Leibler Divergence
Kullback-Leibler divergence (also called Kullback-Leibler distance, or KL distance) [9] computes the distance of the approximating probability distribution P_A from the ground truth probability distribution P_GT. The metric captures the randomness of P_A and magnifies the impact of large differences. It is non-symmetric: the distance of the ground truth from the approximation is not equal to the distance of the approximation from the ground truth.

Given a ground truth CPT P_GT, let P_A represent a CPT that approximates P_GT, let m represent the number of CPDs in the CPT, and let n represent the number of probabilities in each CPD. The KL distance can be calculated as follows:

$$KL(P_{GT}, P_A) = \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{n} P_{GT}(i,j)\,\log\frac{P_{GT}(i,j)}{P_A(i,j)}.$$

The result indicates the similarity of the CPTs. The KL distance is bounded within [0, ∞), where a larger value indicates a greater difference between the distributions: a value of 0 indicates the distributions are identical, and values of 1 or more indicate a large difference.
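A sketch of both metrics under the averaged forms reconstructed above (the exact normalization constants are our assumption, as the original formulas did not survive extraction):

```python
import math

def euclidean_distance(p_gt, p_a):
    """Normalized Euclidean distance between two CPTs, each an m x n
    list of lists (m CPDs of n probabilities).  Result lies in [0, 1]."""
    m, n = len(p_gt), len(p_gt[0])
    total = sum((p_gt[i][j] - p_a[i][j]) ** 2
                for i in range(m) for j in range(n))
    return math.sqrt(total / (m * n))

def kl_distance(p_gt, p_a):
    """Average KL divergence of the approximation P_A from the ground
    truth P_GT.  Result lies in [0, infinity); 0 means identical."""
    total = 0.0
    for gt_row, a_row in zip(p_gt, p_a):
        for p, q in zip(gt_row, a_row):
            if p > 0.0:                    # terms with P_GT = 0 contribute 0
                total += p * math.log(p / q)
    return total / len(p_gt)
```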

Independence of Causal Influence Models
A tabular CPT ignores any relationships between parent and child variables, resulting in space complexity exponential in the number of parents. One way of addressing the exponential complexity is by making use of independence of causal influence (ICI) models, also known as causal independence models. These models represent a child variable as dependent on its parent variables, such that the parent variables are causes and the child variable is an effect. The key insight of causal models is that they are capable of encoding each cause occurring independently to produce the effect. In order to compose a causal independence model, we first need to introduce the causal event.
Causal Event An event representing a set of active causes either succeeding or failing to produce an active effect.
To express a causal event, we follow the notation of Xiang [21]. An active binary cause c_i = true is denoted c_i^+, while an inactive binary cause is denoted c_i^-. An active binary effect is denoted e^+, while an inactive binary effect is denoted e^-. A causal event in which multiple causes successfully produce the effect is denoted e^+ ← c_1^+, c_2^+, . . . , c_k^+, while e^+ ↚ c_1^+, c_2^+, . . . , c_k^+ denotes that the causes failed to produce the effect e. A multi-valued cause is denoted c_i^j, where i ≥ 1 represents the cause index and j ≥ 0 represents the intensity of the cause. A multi-valued effect is denoted e^j, where j represents the intensity of the effect. The syntax e ≥ e_i (or, conversely, e ≤ e_i) represents all effect values with an intensity greater (less) than or equal to e_i. The probability of a causal event is denoted P(e^+ ← c_1^+, c_2^+, . . . , c_k^+). The interaction between causes that produce a common effect may be characterized as reinforcing or undermining.
Reinforcing Interaction [21] Causes which produce a common effect reinforce each other if collectively they are at least as effective as when only some of them are active. Let c_1 and c_2 be two causes that produce an effect e; they reinforce each other iff

$$P(e^+ \leftarrow c_1^+, c_2^+) \ge \max\big(P(e^+ \leftarrow c_1^+),\; P(e^+ \leftarrow c_2^+)\big).$$
For instance, let the effect be a diagnosis of lung cancer, and causes of a diagnosis be smoking cigarettes and exposure to asbestos. Individually, smoking a cigarette or exposing oneself to asbestos may increase one's chance of obtaining lung cancer, but the presence of both together, is more harmful and thus, more likely to increase the chance of obtaining lung cancer.
Reinforcing interactions need not occur only between individual variables. The causal interaction can also occur between sets of variables. This allows for recursive mixtures. Two or more sets of variables reinforce each other if they satisfy failure conjunction and failure independence.
Failure Conjunction [21] Let R_1, R_2, . . . , R_m be disjoint sets of causes. They satisfy failure conjunction iff

$$\big(e^+ \nleftarrow R_1, \ldots, R_m\big) = \big(e^+ \nleftarrow R_1\big) \wedge \cdots \wedge \big(e^+ \nleftarrow R_m\big).$$

This specifies that the failure of the group of causes is determined by the conjunction of the individual causal failure events. In other words, for the joint causal event to fail to produce the effect, every individual causal event must fail to produce the effect.

Failure Independence [21] Let R_1, R_2, . . . , R_m be disjoint sets of causes. They satisfy failure independence iff

$$P\big(e^+ \nleftarrow R_1, \ldots, R_m\big) = \prod_{i=1}^{m} P\big(e^+ \nleftarrow R_i\big).$$

This specifies that the probability of the group failure is the product of the individual failure probabilities. Hence, reinforcement between sets requires that each set fail individually, and states that the probability of the joint failure is the product of the individual failure probabilities.
By contrast, an interaction between causes may be characterized as undermining.
Undermining Interaction [21] Causes which produce a common effect undermine each other if collectively they are less effective than some of the causes acting individually. Let c_1 and c_2 be two causes that produce an effect e; they undermine each other iff

$$P(e^+ \leftarrow c_1^+, c_2^+) < \min\big(P(e^+ \leftarrow c_1^+),\; P(e^+ \leftarrow c_2^+)\big).$$
For example, let the effect be recovery from lung cancer, and the causes be two forms of treatments, which are known to inhibit each other. Individually each treatment may heal the lung cancer, but together, they counteract, reducing the likelihood of a person recovering from lung cancer.
Likewise, undermining interactions need not occur only between individual variables. The causal interaction can also occur between sets of variables. Two or more sets of variables undermine each other if they satisfy success conjunction and success independence.
Success Conjunction [21] Let R_1, R_2, . . . , R_m be disjoint sets of causes. They satisfy success conjunction iff

$$\big(e^+ \leftarrow R_1, \ldots, R_m\big) = \big(e^+ \leftarrow R_1\big) \wedge \cdots \wedge \big(e^+ \leftarrow R_m\big).$$

This specifies that the success of the group is determined by each individual set's success.

Success Independence [21] Let R_1, R_2, . . . , R_m be disjoint sets of causes. They satisfy success independence iff

$$P\big(e^+ \leftarrow R_1, \ldots, R_m\big) = \prod_{i=1}^{m} P\big(e^+ \leftarrow R_i\big).$$

This specifies that the probability of all causes jointly producing the effect is the product of each individual success probability.
The notion of reinforcing and undermining interactions occurring at the set level allows for a recursive mixture of interactions. For instance, two causal events e^+ ← c_1^+ and e^+ ← c_2^+ may reinforce each other, but together, as e^+ ← c_1^+, c_2^+, they may undermine a third causal event e^+ ← c_3^+.

Noisy-OR
Noisy-OR [12] is a causal independence model that encodes reinforcing interactions. It is restricted to binary variables, but was later generalized to the Noisy-MAX model, which can encode multi-valued variables [8]. Both models reduce the complexity of a tabular CPT to linear in the number of causes.

Formally, let C = {c_1, . . . , c_k} be a set of uncertain causes that produce an effect e. Let D ⊆ C be the subset of causes that actively produce e; D may or may not be equal to C. A Noisy-OR model represents the causal interactions among C by the following property:

$$P(e^+ \mid D) = 1 - \prod_{c_i \in D} \big(1 - P(e^+ \leftarrow c_i^+)\big).$$

However, neither the Noisy-OR nor the Noisy-MAX is able to model undermining interactions. This is a critical limitation that precludes the use of these models in this thesis.
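A minimal sketch of the Noisy-OR computation just defined; the helper name and the example numbers are illustrative, not from the thesis.

```python
def noisy_or(active_causes, single_causals):
    """P(e+ | D): the effect occurs unless every active cause fails
    independently.  single_causals[c] = P(e+ <- c+)."""
    failure = 1.0
    for c in active_causes:
        failure *= 1.0 - single_causals[c]
    return 1.0 - failure

# Only k single-causal probabilities specify all 2^k parent instantiations.
p = noisy_or({"medicine", "water"}, {"medicine": 0.6, "water": 0.3})
# p = 1 - (0.4 * 0.7) = 0.72
```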

Noisy-AND
The Noisy-AND is a causal independence model that encodes reinforcing interactions over binary variables.
Formally, let C = {c_1, . . . , c_k} be a set of uncertain causes that produce an effect e. Let D ⊆ C be the subset of causes that actively produce e; D may or may not be equal to C. A Noisy-AND model represents the causal interactions among C by the following property:

$$P(e^+ \mid D) = \begin{cases} \prod_{c_i \in C} P(e^+ \leftarrow c_i^+) & \text{if } D = C, \\ 0 & \text{otherwise.} \end{cases}$$

In plain language, if all causes are active, then the probability of the active effect is the product of the single-causal probabilities. If any cause is inactive, then the probability of the active effect is zero. The behaviour of the Noisy-AND gate is said to be impeding, since a single inactive cause prohibits an active effect. This prevents the Noisy-AND model from encoding undermining interactions.

Non-Impeding Noisy-AND Trees
The previously discussed causal independence models encode reinforcing interactions and reduce the number of parameters to linear in the number of causes. However, each of these models has a significant limitation: it cannot encode undermining interactions.
Non-impeding Noisy-AND Tree (NIN-AND Tree, or NAT) is a causal independence model that can represent both reinforcement and undermining relationships.
The model represents the interactions graphically, using a recursive mixture of two types of NIN-AND gates: dual NIN-AND gates and direct NIN-AND gates. Both gate types accept causal events as inputs and produce an output causal event, whose probability is specified by the product of the input events' probabilities.

Dual Gate
The dual gate operates similarly to a Noisy-OR gate by modelling reinforcing interactions. It operates on causal failure events and implements the failure conjunction and failure independence properties: the failure conjunction is expressed graphically by the AND gate, and the failure independence is expressed by the lack of connection between the causal failure input events.

Direct Gate

The direct gate operates similarly to a Noisy-AND gate, with two key differences. First, the direct gate models undermining interactions, whereas the Noisy-AND models reinforcement. More specifically, the direct gate operates on causal success events and implements the success conjunction and success independence properties. The success conjunction is expressed graphically by the AND gate, and the success independence is expressed by the lack of connection between the causal success input events. Second, the direct gate is non-impeding: the presence of an inactive cause does not force the output event of the direct gate to be inactive as well. As an example of a direct gate, consider Figure 2.7, which has n causal successes as inputs, e^+ ← c_1^+, . . . , e^+ ← c_n^+, and produces the output causal event e^+ ← c_1^+, . . . , c_n^+.

Recursive Combinations of NAT Gates
A recursive mixture of dual and direct gates can express complex relationships between the causes that produce an effect. This is achieved by connecting the output of one gate to the input of another gate. A connection between gates may be negated, denoted by a white dot, to negate the output of the upper gate before it is passed as input to the lower gate. In Figure 2.8, the upper dual gate accepts the causal events e^+ ← c_1^+ and e^+ ← c_3^+ and outputs the causal event e^+ ← c_1^+, c_3^+. The lower direct gate accepts as inputs the negation of the causal event from the dual gate above (e^+ ↚ c_1^+, c_3^+) along with the single-causal event e^+ ← c_2^+. The lower direct gate outputs the causal event e^+ ← c_1^+, c_2^+, c_3^+. To demonstrate how the probability of a causal event is retrieved from a NIN-AND tree, consider single-causal probabilities specified for the example NAT topology in Figure 2.8. We compute the probability of the event produced by the upper dual gate as the product of the negated single-causal probabilities (those of e^+ ← c_1^+ and e^+ ← c_3^+), and then compute the probability of the event produced by the lower direct gate as the product of the negated dual-gate output and the single-causal probability of e^+ ← c_2^+.
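To make the computation concrete, the sketch below evaluates the Figure 2.8 topology; since the thesis's single-causal values did not survive extraction, the numbers here are illustrative placeholders.

```python
def dual_gate(failure_probs):
    """Dual NIN-AND gate: failure conjunction plus failure independence
    make the output failure probability the product of input failures."""
    out = 1.0
    for p in failure_probs:
        out *= p
    return out                               # P(e+ <-/- combined causes)

def direct_gate(success_probs):
    """Direct NIN-AND gate: success conjunction plus success independence
    make the output success probability the product of input successes."""
    out = 1.0
    for p in success_probs:
        out *= p
    return out                               # P(e+ <- combined causes)

# Illustrative single-causals: P(e+ <- c1+), P(e+ <- c2+), P(e+ <- c3+).
p1, p2, p3 = 0.85, 0.70, 0.80
# Upper dual gate: c1 and c3 reinforce; inputs are the negated single-causals.
fail_13 = dual_gate([1 - p1, 1 - p3])        # P(e+ <-/- c1+, c3+) = 0.03
# Lower direct gate: the negated dual-gate output undermines with c2.
p_all = direct_gate([1 - fail_13, p2])       # P(e+ <- c1+, c2+, c3+) = 0.679
```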

NAT-Modelled Bayesian Networks
A NAT-modelled Bayesian network is a Bayesian network where all local distributions are modelled by NATs instead of tabular CPTs. By contrast, a normal Bayesian network is a Bayesian network where all local distributions are modelled by tabular CPTs.
Inference methods designed for normal Bayesian networks cannot be applied to NAT-modelled Bayesian networks without normalization. Normalization is the process by which each NAT model is expanded into a tabular CPT. This results in a local structure with a size exponential in the family size, thereby discarding all space savings of the NAT model.

Inference by Multiplicative Factorization
Multiplicative factorization is an alternative inference method that preserves some of the savings yielded by the NAT model. It was originally designed by Takikawa and D'Ambrosio [17] for the Noisy-OR and Noisy-MAX causal independence models, before being extended to NIN-AND trees by Xiang and Jin [24]. The method works by factorizing a NIN-AND gate in a NAT model into a hybrid graph. The graph consists of one node c_i per cause in the NAT, a node e for the effect, and one auxiliary node a_j for each active value in the effect domain dom(e). The graph is connected as follows: each cause node c_i is connected to each auxiliary node a_j by an undirected edge, and each auxiliary node is connected to the effect node e by a directed edge. Each undirected edge is assigned a potential f(a_j, c_i), and the effect node e is assigned a potential f(e, a_1, . . . , a_m). The hybrid graph can subsequently be compiled into a (lazy) junction tree for inference. Refer to [24] for further details on multiplicative factorization of NAT models.
Multiplicative factorization improves on normalization by preserving some of the causal independencies. It has been shown to improve inference time by up to two orders of magnitude in certain networks [24]. However, if the effect domain becomes large, the fully-connected nature of the auxiliary variables will result in an exponential explosion, thereby limiting multiplicative factorization to smaller effect domains.

Inference by De-causalization
De-causalization is an alternative NAT inference method that aims to avoid the exponential explosion of multiplicative factorization. De-causalization improves on normalization by preserving some of the causal independencies, and on multiplicative factorization by eliminating the exponential explosion. It has been shown to improve inference efficiency by up to two orders of magnitude in sparse networks [25].

Existence of NAT Models in Real-World BNs
It is necessary to confirm the existence of NAT models in real-world BNs. If their existence is not established, then NAT models are restricted to synthetic or expert-specified BNs. Thus, it is critical to establish that NAT models exist in real-world BNs that were not originally designed to be NAT-modelled. The previous work of Xiang and Baird [22] positively identified the existence of NAT models by experiment.

In this section, we summarize their approach and findings. Their experiment began by sourcing 8 real-world BNs from a popular BN repository, bnlearn. On each BN, they applied NAT compression to convert certain tabular CPTs into NAT models. Tabular CPTs with fewer than 2 parents, or whose distributions are deterministic, were not compressed. The resultant NAT-modelled BN is denoted an NMBN.

The BNs selected are summarized in Table 2. The results are presented in Figure 2.11. The previous work observed that the distances are reasonably small and that the posterior marginals are reasonably accurate, given that 30-50% of the families are NAT-modelled. Notably, they also discovered that the posterior error was smaller than the NAT compression error, indicating that compression errors are weakened, rather than exacerbated, by inference.

Context-Specific Independence
Causal independence models are not the only way to address the exponential complexity of tabular CPTs. An alternative approach is context-specific independence (CSI), which exploits relationships between the child variable and some of its parent variables.
Prior to introducing CSI, it is first necessary to introduce the notion of a context.
Context Let n be a generic node in a Bayesian network. A context is an assignment of value(s) to a subset of n's parents π(n).
Consider the variable z in Figure 2.3, whose parents are π(z) = {y, w, x}: an assignment such as y = y_1 to a subset of these parents is a context. Given a context, a variable may become independent of some of its remaining parents; such a context-specific independence is denoted, for example, I_c(z; w, x | y = y_1) when z is independent of w and x in the context y = y_1. In the following sections, we discuss the various representations of CSI and a compatible inference method for each.

Default Tables & Normalization
Default tables are a CSI model that improves on tabular CPTs by explicitly representing only a subset of the instantiations of the parent variables. Formally, a default table specifies a CPD for each instantiation in this subset, along with a single default CPD shared by all remaining instantiations. Because only one shared CPD can be represented, a default table captures a limited form of CSI, and inference requires normalizing the table into a full tabular CPT. Hence, these limitations preclude the use of default tables in this thesis.

Rule-based Representation & Variable Elimination
A rule-based CPT is a set of rules of the form α | Cxt : P(α | Cxt), where α is a variable, Cxt is a context, and P(α | Cxt) is the CPD specified by its parameters.

The context is encoded as a logical sentence in which the assignments in the context are conjoined. Each instantiation must be covered by a rule, but many instantiations can be covered by the same rule.

CSI can be exploited by applying operations on the sentences, which simplify the rules. One such operation is the combination operation, which consolidates rules that correspond to identical CPDs. Let α | Cxt_1 : P(α | Cxt_1) and α | Cxt_2 : P(α | Cxt_2) be two rules in the same rule base such that P(α | Cxt_1) = P(α | Cxt_2). The combination operation replaces the two rules with a new rule holding the intersection of Cxt_1 and Cxt_2:

$$\alpha \mid Cxt_1 \cap Cxt_2 : P(\alpha \mid Cxt_1).$$

Additional operations include the split operation, and extensions of a potential's product and marginalization operations to rules. The details of these operations are omitted due to space considerations; refer to [13] for further details. The marginalization and product operations allow variable elimination to be extended to a set of rules [13]. Unfortunately, variable elimination appears to be the only inference method defined for the rule-based representation. As stated in the introduction to variable elimination on CPTs (Section 2.5.2), this thesis avoids its use due to its inability to incorporate new evidence at runtime without re-compilation.

CPT-tree & Network Transformation
A CPT-tree is a CSI model introduced by Boutilier et al. [3], which uses a tree structure to specify the CPT of a variable x conditioned on its parents π(x). The tree is directed from the root (the node with no parents) to the leaves (nodes with no children). The tree structure restricts each node to at most one parent, resulting in a single path from the root to each node in the tree.

Each non-leaf node is a variable in π(x). Each outgoing edge from a node is labelled by a value that the variable holds. The path from the root node to a leaf node encodes a context, and each leaf node encodes a CPD of x given that context.

[Figure 2.14: CPT-tree for P(z | y, w, x).]

We note that a CPT-tree is not a decision tree: while the semantics and visual appearance are similar, a CPT-tree exists in an uncertain environment, whereas a decision tree exists in a deterministic environment.

The CPT-tree for the tabular CPT P(z | y, w, x) is shown in Figure 2.14. The CSI interactions are observable when paths do not include all variables in π(x). Specifically, the CSI interaction I_c(z; w, x | y = y_1) is encoded by the right-most branch from the root node y. The other CSI interaction, I_c(z; x | y = y_0, w = w_0), is encoded by the left-most branch. Each CPD can be retrieved from the label of a leaf node. For instance, the CPD P(z | y = y_0, w = w_1, x = x_1) is retrieved by following the left branch from the root (y = y_0), then the right branch (w = w_1), and finally the right child (x = x_1). Hence, a CPT-tree can be normalized to an exponentially sized tabular CPT by iterating through each instantiation of the parent variables and retrieving the appropriate CPD.
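A sketch of CPT-tree lookup and normalization; the tree mirrors the structure described above for P(z | y, w, x), but the CPD values are illustrative placeholders, as Figure 2.14's numbers are not in the text.

```python
from itertools import product as cartesian

class Leaf:
    def __init__(self, cpd):
        self.cpd = cpd                       # e.g. {"z0": 0.9, "z1": 0.1}

class Node:
    def __init__(self, var, children):
        self.var = var                       # a parent variable name
        self.children = children             # value -> Leaf or Node

def lookup(tree, assignment):
    """Follow the parent assignment from the root down to a leaf CPD."""
    while isinstance(tree, Node):
        tree = tree.children[assignment[tree.var]]
    return tree.cpd

def normalize(tree, parent_domains):
    """Expand a CPT-tree into a full (exponential) tabular CPT."""
    parents = list(parent_domains)
    return {inst: lookup(tree, dict(zip(parents, inst)))
            for inst in cartesian(*(parent_domains[p] for p in parents))}

# P(z | y, w, x) with I_c(z; w, x | y=y1) and I_c(z; x | y=y0, w=w0).
tree = Node("y", {
    "y0": Node("w", {
        "w0": Leaf({"z0": 0.9, "z1": 0.1}),
        "w1": Node("x", {"x0": Leaf({"z0": 0.6, "z1": 0.4}),
                         "x1": Leaf({"z0": 0.2, "z1": 0.8})}),
    }),
    "y1": Leaf({"z0": 0.5, "z1": 0.5}),
})
full_cpt = normalize(tree, {"y": ["y0", "y1"], "w": ["w0", "w1"],
                            "x": ["x0", "x1"]})   # 8 rows from 4 leaves
```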
We refer to BNs where some families are modelled by CPT-trees as CPT-tree-modelled BNs. Various inference methods support CPT-tree-modelled BNs, including network transformation and clustering [3], cutset conditioning [3], and variable elimination [13]. In this thesis, we focus on network transformation and clustering. The initial idea of this algorithm was proposed by Boutilier et al. [3], but they did not present a formal algorithm. A contribution of this thesis, presented in Chapter 4, is to formalize an algorithm suite for this method. We omit further discussion of the method until that chapter.

Chapter 3 Orthogonality of NAT & CSI Models
This chapter confirms the necessity of the novel contribution introduced in our work. The structure of the chapter is as follows.

General Information on Orthogonality
Two models are orthogonal if neither model is able to efficiently and exactly encode the other. Otherwise, the models are said to be not orthogonal. To illustrate the concept of orthogonality, consider the following two examples.
i) Rational and irrational numbers are orthogonal representations of real numbers. The reason is that a rational number cannot represent an irrational number, and conversely, an irrational number cannot represent a rational number. Consider the irrational number π: there do not exist two integers a and b such that a/b = π. Likewise, there does not exist any irrational number that can represent the rational number 3/2. Hence, the models are orthogonal, since neither representation can exactly encode the other.
ii) The Noisy-OR and NIN-AND models are not orthogonal, since the Noisy-OR model is a special case of the NIN-AND model. The rationale can be demonstrated in two parts. First, there exists a NIN-AND model that cannot be encoded by a Noisy-OR. This follows from fundamentals: a NIN-AND model encodes both reinforcing and undermining interactions, while a Noisy-OR model encodes only reinforcing interactions. Hence, any NIN-AND model containing an undermining interaction cannot be expressed by a Noisy-OR. Second, we must demonstrate that a NIN-AND model can efficiently and exactly encode every Noisy-OR model. While a formal justification exists in [23], for simplicity we present a sufficiently general example. Consider a BN family with a child variable e that is dependent on three parents c_1, c_2 and c_3. All variables are binary, since the Noisy-OR is restricted to binary variables, and all causes reinforce each other. In Figure 3.1, we show that the CPT generated by the Noisy-OR is equivalent to the CPT specified by a dual NIN-AND gate. Hence, the models are not orthogonal: there exist NIN-AND models that cannot be encoded by Noisy-OR models, and every Noisy-OR model is a special case of the dual NIN-AND gate.

Before testing for the orthogonality of the NAT and CSI models, it is first necessary to demonstrate how one model may be converted into the other. In this chapter, we make use of the CPT-tree CSI representation; however, the conversion can be trivially applied to any CSI representation.
The following two sections demonstrate the conversion in both directions: Section 3.2 shows the conversion from a CSI model to a NAT model. Section 3.3 shows the conversion from a NAT model to a CSI model.

Converting from a CSI Model to a NAT Model
We demonstrate the conversion from a CSI model to a NAT model by example.
Since there does not exist a direct conversion method between NAT and CSI models, we must use an intermediary representation to facilitate the conversion. The intermediary representation must support conversion from a CPT-tree to the intermediary representation, and from the intermediary representation to a NAT model.
Thus, a tabular CPT is selected as the intermediary representation since it satisfies both properties.
Suppose we have the CPT-tree in Figure 3.2, specifying the CPT of a family with the child variable z and three parents w, x, y. All variables are binary. The CPT-tree admits one CSI interaction, I_C(z; w | y, x = x_1), which states that when x = x_1, z is contextually independent of w for each value of y.

Normalization of a CPT-tree to a Tabular CPT
The next step is to normalize the CPT-tree to a tabular CPT. The resultant tabular CPT is presented in Figure 3.3. We note the values encoded in the resultant CPT are exactly identical to the values expressed by the CPT-tree.

Compression of a CSI CPT to a NAT Model
We can now apply compression on the tabular CPT to convert it into a NAT model. Let the 0th index of a variable's value be the inactive state (e.g., w_0 = w^-) and the 1st index of a variable's value be the active state (e.g., w_1 = w^+). The resultant NAT model is shown in Figure 3.4, with the following single-causals: P(z^+ ← x^+) = 0.6, P(z^+ ← y^+) = 0.8, and P(z^+ ← w^+) = 0.1. Overall, this process allows for the conversion of an existing CPT-tree into a NAT model. We defer judging the quality of the conversion for this CPT-tree, since a more comprehensive evaluation is presented in Section 3.4. In the next section, we discuss the inverse: converting a NAT model to a CSI model.

Converting from a NAT Model to a CSI Model
In this section, we outline a method that estimates the number of parameters required when a NAT model is encoded as a CSI model.

Normalization of a NAT Model to a Tabular CPT
Suppose we would like to convert the NAT in Figure 3.5 (a) to a CSI model.
The NAT model consists of 3 binary causes x, y, w that produce an effect e. All three causes undermine each other. Let the 0 th index of a variable's value be the inactive state (e.g., w 0 = w − ) and the 1 st index of a variable's value (e.g., w 1 = w + ) be the active state.
Since there does not exist a direct conversion method, we will use a tabular CPT as an intermediary representation. Thus, the first step is to normalize the NAT model into a tabular CPT. The resultant CPT is presented in Figure 3.5 (b). We note the values encoded in the resultant CPT are identical to the values expressed by the NAT model.

Clustering a CPT to Estimate Number of CPT-Tree Parameters
Clustering is the process of grouping similar objects together, such that objects within a cluster are more similar to each other than to objects in other clusters. Based on this idea, the algorithm Cluster takes a binary-valued NAT CPT T and a distance bound δ. The algorithm groups the values in the CPT into a set of clusters Ψ such that the following conditions hold: 1. For each cluster Q ∈ Ψ and each pair of values p, q ∈ Q, |p − q| ≤ δ. This condition specifies that the inner cluster distance (the distance between the minimum and maximum values) within each cluster is upper bounded by δ.
2. For each two clusters, Q, R ∈ Ψ, let min Q , max Q , min R , max R be extreme values in Q and R, respectively. Either max Q < min R or max R < min Q . This condition specifies that the clusters are ordered in ascending order, such that a cluster with smaller member values is positioned before a cluster with larger member values.
3. For clusters Q, R ∈ Ψ where max Q < min R , we have min R − max Q > δ. This condition specifies that the distance between any two clusters must be greater than δ.
The algorithm Cluster satisfies these conditions. To demonstrate the clustering algorithm, consider the binary-valued CPT that specifies P(z | x, y, w), shown in Figure 3.5 (b). The clustering algorithm's parameter T is specified as P(z = z_1 | x, y, w), where z_1 ∈ dom(z) is arbitrarily selected. The values are shown in Figure 3.6 (panel a). The distance bound δ is specified to be 0.02.

The algorithm begins by sorting T in ascending order (panel b). We then initialize an empty set of clusters Ψ and create a new cluster Q containing the first element of the sorted values. Each subsequent value is added to Q while the inner cluster distance remains within δ; otherwise, Q is added to Ψ and a new cluster Q is created for the value. Once all values have been clustered, it is possible that the last cluster is non-empty and has not been added to Ψ. Hence, once all iterations are complete, we add Q to Ψ on line 11. Figure 3.7 shows the resulting clustered CPT.
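A greedy sketch of Cluster for the binary case; it enforces condition 1 directly, while the thesis's full algorithm also guarantees conditions 2 and 3.

```python
def cluster(values, delta):
    """Group CPT values into clusters whose inner distance is at most delta.

    values: list of (instantiation, probability) pairs, e.g.
            (("x0", "y1", "w1"), 0.836); delta: the distance bound."""
    ordered = sorted(values, key=lambda pair: pair[1])   # sort ascending
    clusters, current = [], [ordered[0]]
    for inst, v in ordered[1:]:
        if v - current[0][1] <= delta:       # condition 1: inner distance
            current.append((inst, v))
        else:                                # otherwise open a new cluster
            clusters.append(current)
            current = [(inst, v)]
    clusters.append(current)                 # add the final cluster (line 11)
    return clusters
```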
A cluster is exploitable by CSI only if its member instantiations can be characterized by a common partial assignment (context) over the parent variables. When the instantiations in a cluster share no such assignment, the duplicated values cannot be encoded by a single context; thus, the cluster would not be exploitable by CSI. The number of clusters obtained by the clustering algorithm is consequently a lower bound on the number of parameters when the NAT CPT is approximated by a CSI model. We resolve this issue by splitting such clusters into the largest compatible sub-clusters possible.

Splitting Clusters to Ensure Exploitability
The algorithm Split takes a set of clusters Ψ that may or may not be exploitable, and returns a set of exploitable clusters Ψ . The set of input clusters Ψ is typically obtained from the Cluster algorithm. The Split algorithm works as follows. On line 1, we initialize a queue λ holding all clusters in Ψ. The order of clusters inserted into the queue is immaterial. On line 2, we initialize an empty set of clusters Ψ to hold all exploitable clusters.
On line 4, we enter the main loop that continues looping until the queue is empty.
On line 5, we pop the front of the queue to obtain the current cluster Q i that will be tested. On lines 6 and 7, we check if a splitting is unnecessary. This consists of two tests: (i) if the cluster Q i contains 1 instantiation, then it is said to be exploitable. We demonstrate the algorithm on the clustered CPT shown in Figure 3.7. We begin by initializing λ to the set of clusters: {(x = x 0 , y = y 1 , w = w 1 ), (x = x 1 , y = y 0 , w = w 1 ), (x = x 1 , y = y 1 , w = w 0 )}, We initialize the exploitable clusters set Ψ to an empty set. For the 1st iteration, we pop the following cluster: We then test if the cluster contains 1 instantiation. Since this test passes, we add Q i to the set of exploitable clusters Ψ and proceed to the next iteration.
The 2nd iteration pops the following cluster from the queue: Since it also contains 1 instantiation, it is added to the set of exploitable clusters Ψ .
The 3rd iteration pops the following cluster from the queue: {(x = x_0, y = y_1, w = w_1), (x = x_1, y = y_0, w = w_1), (x = x_1, y = y_1, w = w_0)}. We then test if the cluster contains 1 instantiation. This test fails since the cluster contains 3 instantiations. We then test if the entire cluster is exploitable.
The intersection of all assignments in Q_i is empty. Hence, the conditional on line 7 fails: the cluster is not exploitable and must be split. We then identify the most frequent single assignment held by the instantiations. In this case, there is a 3-way tie, with three assignments each matching 2 instantiations. We settle the tie by arbitrarily selecting one assignment, x_1. We note that it is critical to select a single maximum-frequency assignment, not a conjunction of several; selecting multiple assignments will not maximize the size of the splits and will result in sub-optimal clusters. For example, selecting x_1 by itself matches 2 instantiations, but selecting x_1 ∧ w_1 matches only 1 instantiation.
Next, we partition the cluster into two segments R and S. The segment R contains all instantiations where x = x_1: {(x = x_1, y = y_0, w = w_1), (x = x_1, y = y_1, w = w_0)}. The segment S contains all instantiations where x ≠ x_1: {(x = x_0, y = y_1, w = w_1)}. The segment R is now exploitable and can be added to the set of exploitable clusters Ψ′, while the segment S may not be exploitable and is appended to the queue λ for further testing and splitting if needed.
The 4th iteration pops the following cluster from the queue: {(x = x_0, y = y_0, w = w_1), (x = x_0, y = y_1, w = w_0), (x = x_1, y = y_0, w = w_0)}.

This cluster fails the tests on lines 6 and 7: it is not a single instantiation, and the intersection of its assignments is empty. We then proceed to identify the most frequent single assignment. In this case, there is again a 3-way tie, with three assignments each matching 2 instantiations. We settle the tie by arbitrarily selecting one assignment, x_0. Next, we partition the cluster into two segments R and S. The segment R contains all instantiations where x = x_0: {(x = x_0, y = y_0, w = w_1), (x = x_0, y = y_1, w = w_0)}. The segment S contains all instantiations where x ≠ x_0: {(x = x_1, y = y_0, w = w_0)}. The segment R is now exploitable and can be added to the set of exploitable clusters Ψ′, while the segment S may not be exploitable and is appended to the queue λ for further testing and splitting if needed.
The 5th iteration pops the next cluster from the queue. It contains 1 instantiation and thus is added directly to the set of exploitable clusters Ψ′. The 6th iteration likewise pops a cluster of one instantiation, which is added to the set of exploitable clusters Ψ′. The clusters popped in the 5th and 6th iterations are exactly the segments S appended to the queue by the R/S partitioning performed in the 3rd and 4th iterations, respectively.
Figure 3.8 shows the final clustered CPT, in which every cluster is exploitable. The CPT now contains 6 clusters, an increase of 2 (printed in red) relative to the initial clustering shown in Figure 3.7.
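The splitting loop just demonstrated can be sketched in Java as follows. This is an illustrative reconstruction rather than the thesis implementation: instantiations are represented as variable-to-value maps, and the helper names sharedAssignments and mostFrequentAssignment are our own.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal sketch of Split: each cluster is a list of instantiations,
 *  and each instantiation maps a parent variable to one of its values. */
public final class SplitSketch {

    public static List<List<Map<String, Integer>>> split(List<List<Map<String, Integer>>> psi) {
        Deque<List<Map<String, Integer>>> queue = new ArrayDeque<>(psi);   // line 1: lambda
        List<List<Map<String, Integer>>> exploitable = new ArrayList<>(); // line 2: Psi'

        while (!queue.isEmpty()) {                                         // main loop
            List<Map<String, Integer>> q = queue.poll();
            // Test (i): a single instantiation is trivially exploitable.
            // Test (ii): a non-empty shared assignment means the cluster is
            // expressible as a single context, hence exploitable.
            if (q.size() == 1 || !sharedAssignments(q).isEmpty()) {
                exploitable.add(q);
                continue;
            }
            // Identify the most frequent single (variable = value) assignment.
            Map.Entry<String, Integer> best = mostFrequentAssignment(q);

            List<Map<String, Integer>> r = new ArrayList<>(); // matches best
            List<Map<String, Integer>> s = new ArrayList<>(); // the complement
            for (Map<String, Integer> inst : q) {
                (best.getValue().equals(inst.get(best.getKey())) ? r : s).add(inst);
            }
            exploitable.add(r); // R shares the assignment, hence exploitable
            queue.add(s);       // S may need further splitting
        }
        return exploitable;
    }

    /** Assignments common to every instantiation in the cluster. */
    static Map<String, Integer> sharedAssignments(List<Map<String, Integer>> q) {
        Map<String, Integer> shared = new HashMap<>(q.get(0));
        for (Map<String, Integer> inst : q) {
            shared.entrySet().removeIf(e -> !e.getValue().equals(inst.get(e.getKey())));
        }
        return shared;
    }

    /** The single assignment occurring in the most instantiations; ties broken arbitrarily. */
    static Map.Entry<String, Integer> mostFrequentAssignment(List<Map<String, Integer>> q) {
        Map<Map.Entry<String, Integer>, Integer> counts = new HashMap<>();
        for (Map<String, Integer> inst : q) {
            for (Map.Entry<String, Integer> a : inst.entrySet()) {
                counts.merge(Map.entry(a.getKey(), a.getValue()), 1, Integer::sum);
            }
        }
        return counts.entrySet().stream()
                .max(Map.Entry.comparingByValue()).orElseThrow().getKey();
    }
}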

Extending Clustering to Multi-valued Variables
The clustering algorithm outlined above is restricted to binary variables. In this section, we discuss how the clustering approach may be extended to multi-valued variables. This is achieved by building the test up from binary variables to ternary variables, and finally to the general case. While we did not perform experiments with this extension due to time constraints, the general case was implemented in Java to verify correctness.

Binary Case
Consider the CPT P(x | y, z) with variables x, y, z over the domains dom(x) = {x_0, x_1}, dom(y) = {y_0, y_1}, dom(z) = {z_0, z_1}, respectively. Suppose we would like to know whether the CPT encodes the CSI interaction I_C(x; z | y = y_0). This can be determined by performing the following test:

P(x = x_0 | y = y_0, z = z_0) = P(x = x_0 | y = y_0, z = z_1)

In plain language, we must test that P(x_0 | y_0, z) is the same for both possible values of z: z = z_0 and z = z_1. Testing P(x = x_1 | y = y_0, z) is redundant, given that we have established x_0 and that the CPDs must sum to 1.

Ternary Case

For ternary variables, the number of tests required to establish CSI increases, making the conditions for CSI more stringent. In other words, testing for CSI with binary variables requires only 1 condition to be met, but testing for CSI with ternary variables requires 6 conditions to be met. Note, we do not have to test P(x = x_2 | y = y_0, z = z_i), since all CPDs sum to 1. To demonstrate the ternary case, consider the example CPT shown in Figure 3.10.
In the above CPT, we can see that the CSI interaction holds, as P(x = x_0 | y = y_0, z = z_i) = 0.3 and P(x = x_1 | y = y_0, z = z_i) = 0.5 for i = 0, 1. It is also observed that testing P(x = x_2 | y = y_0, z = z_i) = 0.2 is redundant, given that we have already established x_0 and x_1 and know that the CPDs must sum to 1.

General Case
To generalize the test for context-specific independence, let the domain of x be of size m, and let the domain of z be of size n. We then have m × n tests of the form:

P(x = x_0 | y = y_0, z = z_0) = · · · = P(x = x_0 | y = y_0, z = z_{n−1})
P(x = x_1 | y = y_0, z = z_0) = · · · = P(x = x_1 | y = y_0, z = z_{n−1})
...
P(x = x_{m−1} | y = y_0, z = z_0) = · · · = P(x = x_{m−1} | y = y_0, z = z_{n−1})
In order for CSI to hold, the above expression must hold for all values of x, z from 0 to m − 1 and 0 to n − 1, respectively.
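A hedged Java sketch of this general test is shown below. The CPT layout (one row per value of x, one column per value of z, all under the fixed context y = y_0) and the tolerance parameter are illustrative choices of the sketch.

/** Minimal sketch of the general CSI test I_C(x; z | y = y0):
 *  cpt[i][j] holds P(x = x_i | y = y0, z = z_j). */
public final class CsiTestSketch {

    public static boolean holds(double[][] cpt, double epsilon) {
        int m = cpt.length;       // |dom(x)|
        int n = cpt[0].length;    // |dom(z)|
        // Each row must be constant across z. The last value of x is implied
        // by the rows above it, because each CPD sums to 1.
        for (int i = 0; i < m - 1; i++) {
            for (int j = 1; j < n; j++) {
                if (Math.abs(cpt[i][j] - cpt[i][0]) > epsilon) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Ternary x, binary z, mirroring the values of the example in Figure 3.10.
        double[][] cpt = { {0.3, 0.3}, {0.5, 0.5}, {0.2, 0.2} };
        System.out.println(holds(cpt, 1e-9)); // true
    }
}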

Evaluating Expression of CSI Models as NAT Models
Theoretically, if the NAT model subsumed the CSI representation, then it should be able to exactly model every CSI CPT while requiring the same or fewer parameters than a CSI representation. Hence, in this section, we conduct an experiment to find counterexamples, demonstrating by contradiction that there exist CSI models which cannot be encoded exactly by the NAT model.

Pre-processing: Adding CSI to an Existing CPT
In this section, we outline the method used to randomly generate CPT-trees that exhibit the same CSI interaction but encode different CPTs over the same BN family topology. This method is needed because we wish to identify repeatable counterexamples of CSI models that cannot be encoded by a NAT model. Hence, it is necessary to demonstrate that some CSI interaction(s) are not expressible by a NAT model over a sufficiently wide range of CPTs.
Unfortunately, naively generating random CPT-trees offers no guarantee that they encode the same CSI interaction. This can be mitigated by adding CSI interactions to a randomly generated CPT, denoted P*, over the same variables. Initially, P* would not encode any CSI interactions. Our method modifies the CPDs in P* affected by the CSI interaction, while leaving the non-affected CPDs unchanged. This results in a randomly generated CPT exhibiting the same CSI interaction, but with different values.
Consider the randomly generated binary-valued CPT P(z | x, y, w) in the left pane of Figure 3.11 and the CSI interaction I_C(z; w | y, x = x_1) from Section 3.2. To generate a new CPT P* from the CPT P and the CSI interaction I, we must assign the same probability for each distinct combination of (y, z) when x = x_1.
We begin by duplicating P as P*. For each distinct (y, z) combination, denoted (y′, z′), we retrieve a probability ρ = P(z = z′ | x = x_1, y = y′, w), where w is arbitrarily assigned. Then, for each instantiation in P* that satisfies x = x_1, y = y′, z = z′, we replace the probability corresponding to that instantiation with ρ. This ensures that all instantiations of a (y′, z′) combination under the context x = x_1 have the same value. The right pane of Figure 3.11 shows the resultant CPT. Each combination is printed in a different colour. For the (y = y_0, z = z_0) combination printed in red, we assigned a value of 0.6. For the (y = y_0, z = z_1) combination printed in green, we assigned a value of 0.4. For the (y = y_1, z = z_0) combination printed in purple, we assigned a value of 0.7. Lastly, for the (y = y_1, z = z_1) combination printed in orange, we assigned a value of 0.3.
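The pre-processing step can be sketched in Java as follows for the binary example above. The four-dimensional array layout p[x][y][w][z] and the choice of w_0 as the arbitrarily assigned w are illustrative assumptions, not the thesis implementation.

/** Minimal sketch of imposing the CSI interaction I_C(z; w | y, x = x1)
 *  on a random binary CPT P(z | x, y, w), stored as p[x][y][w][z]. */
public final class AddCsiSketch {

    public static double[][][][] addCsi(double[][][][] p) {
        double[][][][] pStar = deepCopy(p);   // duplicate P as P*
        final int x1 = 1;                     // the context x = x1
        for (int y = 0; y < 2; y++) {
            for (int z = 0; z < 2; z++) {
                double rho = p[x1][y][0][z];  // retrieve rho with w fixed to w0
                for (int w = 0; w < 2; w++) {
                    pStar[x1][y][w][z] = rho; // duplicate rho across all w
                }
            }
        }
        return pStar;                         // CPDs with x = x0 are unchanged
    }

    static double[][][][] deepCopy(double[][][][] p) {
        double[][][][] c = new double[2][2][2][2];
        for (int x = 0; x < 2; x++)
            for (int y = 0; y < 2; y++)
                for (int w = 0; w < 2; w++)
                    c[x][y][w] = p[x][y][w].clone();
        return c;
    }
}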

Experimental Setup
The objective of this experiment is to identify CSI models that cannot be encoded by NAT models. We simulated a batch of 100 randomly generated CPTs. Each CPT has a child variable v dependent on 5 parent variables q, r, s, t, u. All variables have a domain size of 5.
We specified 3 CSI interactions to evaluate, each of which imparts a varying amount of duplication in a CPT. The CSI interactions are as follows: 1. I_C(v; t, u | q = q_1, r = r_2, s ∈ {s_3, s_4}): v is contextually independent of t and u when q = q_1, r = r_2, and s ∈ {s_3, s_4}.
2. I_C(v; r, s, t, u | q = q_1): v is contextually independent of r, s, t, u when q = q_1.
For each CPT P in the batch, we created three copies P_1, P_2, P_3. Using the pre-processing outlined in Section 3.4.1, CSI interaction 1 is added to P_1, CSI interaction 2 is added to P_2, and CSI interaction 3 is added to P_3. Thus, we create 4 × 100 = 400 total source CPTs. We then compress each CPT into a NAT model. The accuracy of the compressions was evaluated and compared per CSI interaction. Results are presented in the next section.

Experimental Results
The results of this experiment are presented in Table 3. It is observed that CSI CPTs take more than 30 times the space of the resultant NAT models. These representational savings are significant, but come at the price of an approximation error. While the approximation error decreases as the number of CSI parameters decreases, these results suggest that NAT models generally cannot encode CSI CPTs exactly.

Evaluating Expression of NAT CPTs as CSI Models
Conversely, we now empirically demonstrate that NAT models cannot be efficiently and exactly encoded as CSI models. The approach of this section is similar to the above: we conduct an experiment to identify NAT CPT counterexamples that cannot be encoded by a CSI CPT.

Experimental Setup
The experiment is conducted on 100 generated NAT CPTs, each over a family of 5 parent variables. All variables are binary, giving 32 parameters per tabular CPT. Each NAT CPT is clustered with a distance bound of δ = 0.02 and split as needed.

Experimental Results
The results are plotted in Figure 3.12. Each bar counts the number of NAT CPTs that produced a particular number of clusters. It can be observed that all CPT-trees need at least 17 parameters, while the NAT CPT requires only 5 parameters. Since we weakened the requirements for a CPT to exhibit CSI, there is now a modelling error associated with the CSI representation. Each cluster can be identified by a centroid, computed as the mean of the cluster's member values. To evaluate the error, we expand the clusters into a tabular CPT where each instantiation is specified by the centroid of the cluster containing it. We then calculate the Euclidean distance between the source CPT and the CPT generated by expanding the clustering results. In Figure 3.12, each bar is labelled by its average modelling error on top and the standard deviation below it.
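The error evaluation can be sketched in Java as follows; representing a clustering as lists of CPT indices is an illustrative assumption of the sketch.

import java.util.List;

/** Minimal sketch of the modelling-error evaluation: expand each cluster
 *  to its centroid and take the Euclidean distance to the source CPT. */
public final class ClusterErrorSketch {

    /** clusters.get(c) lists the indices of the CPT entries in cluster c. */
    public static double modellingError(double[] sourceCpt, List<List<Integer>> clusters) {
        double[] approx = new double[sourceCpt.length];
        for (List<Integer> cluster : clusters) {
            double centroid = cluster.stream()
                    .mapToDouble(i -> sourceCpt[i]).average().orElseThrow();
            for (int i : cluster) {
                approx[i] = centroid;   // every member is replaced by the centroid
            }
        }
        double sum = 0.0;
        for (int i = 0; i < sourceCpt.length; i++) {
            double d = sourceCpt[i] - approx[i];
            sum += d * d;
        }
        return Math.sqrt(sum);          // Euclidean distance
    }
}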
Overall, the CSI representation is not able to encode the NAT CPT exactly and efficiently. If exactness is required, we demonstrated by definition that a CSI representation cannot efficiently encode a NAT CPT with no duplicated probabilities. If exactness is not required, we applied clustering to determine the number of parameters required to express the NAT CPT as a CPT-tree. The clustering results demonstrated that the CSI representation not only introduces an error, but also requires a greater number of parameters to encode the CPT. These findings, in combination with Section 3.4, suggest that the NAT and CSI representations are orthogonal.

Chapter 4 Formalizing CPT-tree Transformation
In this chapter, a CPT-tree transformation algorithm is designed. The idea of CSI transformation was first introduced by Boutilier et al. [3] through a simple binary example. However, to the best of our knowledge, no general algorithm has been formalized. We aim to fill this gap by generalizing network transformation to multi-valued variables and formalizing the process through a suite of algorithms.
The chapter is laid out as follows: Section 4.1 extends CPT-tree arcs to be set-valued, and Section 4.2 introduces the transformation algorithm suite.

Set-valued CPT-tree Edges
In this section, we extend the CPT-tree representation to support set-valued edges. Recall that, in a CPT-tree as specified by [3], each outgoing edge is labelled by a single value that the variable holds, and a path from the root to each leaf labels a single context in the CPT.

In this work, we generalize the arcs of CPT-trees to support set-valued edges. The set-valued notation allows each outgoing edge from a node n to be labelled by a subset of the values in dom(n), as illustrated by the left-most edge connecting u to v in Figure 4.2. A byproduct of introducing the set-valued notation is that a path from the root to a leaf no longer encodes a single context. Instead, the path from the root to a leaf encodes a set of contexts. In Figure 4.2, every path from the root node to a leaf node through the set-valued edge specifies 2 contexts. For example, the leftmost leaf node, labelled z(0.8), is reached by 2 contexts, one of which is (u = u_0, v = v_0, w = w_0).

It is noted that the single-value notation is a special case of the set-valued notation, which expresses single values by singletons. For instance, the single-value notation would label the rightmost edge connecting u to v as u_1, whereas the set-valued notation would express the same edge label as {u_1}. For readability, we omit braces in figures where there is no ambiguity.
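As an illustration, the following minimal Java sketch shows one way a CPT-tree node with set-valued outgoing edges could be represented; the class and member names are our own, not part of the thesis.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Minimal sketch of a CPT-tree node with set-valued outgoing edges.
 *  A singleton edge label recovers the single-value notation. */
public final class CptTreeNode {
    final String variable;        // e.g. "u"; null for a leaf
    final Double leafParameter;   // e.g. 0.8 for the leaf z(0.8); null otherwise
    // Maps each value subset, e.g. {u1, u2}, to the corresponding child.
    final Map<Set<String>, CptTreeNode> edges = new LinkedHashMap<>();

    CptTreeNode(String variable) { this.variable = variable; this.leafParameter = null; }
    CptTreeNode(double parameter) { this.variable = null; this.leafParameter = parameter; }

    boolean isLeaf() { return leafParameter != null; }

    /** Follow the unique outgoing edge whose label set contains the value. */
    CptTreeNode child(String value) {
        return edges.entrySet().stream()
                .filter(e -> e.getKey().contains(value))
                .map(Map.Entry::getValue)
                .findFirst().orElseThrow();
    }
}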

Algorithm Suite
Network transformation is a method that transfers the structure of a CPT-tree to a BN segment while preserving the context-specific independencies encoded in the CPT-tree. This is achieved by replacing the child variable x of a BN family with a new structure composed entirely of auxiliary variables, with the exception of x and its parents π(x).
In this section, we introduce the suite of algorithms for CSI transformation. The suite of algorithms consists of three algorithms: generation of the BN segment to encode the CSI interaction, assignment of CPTs, and generation of switch CPTs. Each algorithm is discussed individually in Sections 4.2.1, 4.2.2, and 4.2.3, respectively.

Generate BN Segment to Encode CSI Interaction
The first algorithm SetDagSeg accepts a CPT-tree T over a variable x and parents π(x) and generates a BN segment with a single leaf x that encodes the CSI interaction.
The transformation is applied from the top of the CPT-tree to the bottom. Let each node of the CPT-tree be assigned a level, corresponding to the depth of the node from the root: the root is level 0, the children of the root are level 1, the grandchildren of the root are level 2, and so forth. The transformation is thus driven through the CPT-tree from level 0 onwards, iterating over each node t in T at level L together with its path, path(t).

We demonstrate SetDagSeg through an example using the BN and CPT-tree pictured in Figure 4.3. Variables b, r, s are ternary while the variable q is binary. The algorithm begins with an empty graph G consisting of the nodes x and π(x), with no connections between any of the nodes.

For the first level L = 0 in the CPT-tree T, the only node at that level is the root node q (t in the algorithm). The algorithm identifies a node labelled b_{} (v in the algorithm). An arc is then added from node q to the node b_{}. We then check if each child of q is a leaf. Since this test is negative, we continue processing the iteration. We can now identify q as a multiplexer node, since it has children and not all of its children are leaves. We then partition the domain of q into two segments, since the node q in the CPT-tree has two children.

The second level L = 1 in the CPT-tree T consists of two nodes. We discuss each node individually.
The left-most node at the second level, s, is a special node because all of its children are leaves (ACAL, for "all children are leaves"). Decomposing the node would not yield any savings, as there are no further CSI interactions. The iteration for the left-most node s begins by identifying node b_{q=q_0} as node v and adding an arc from s to b_{q=q_0}. We then check if each child of s in the CPT-tree is a leaf. Since this test is positive, the algorithm skips the decomposition and continues to the next node.

The right-most node at the second level of the CPT-tree is also node s. This iteration begins by identifying its parent in the CPT-tree (v) as b_{q=q_1}. It then adds an arc from s to b_{q=q_1}. Since node s in the CPT-tree has one non-leaf child, it fails the test on line 10 and we continue processing the iteration. We then partition the ternary domain of s into two segments, {s_0} and {s_1, s_2}, by the value sets assigned to the outgoing edges from s. We then create two new nodes in the BN segment with the labels b_{q=q_1, s=s_0} and b_{q=q_1, s∈{s_1,s_2}}, respectively. Each newly introduced node is added as a parent of b_{q=q_1}.

In summary, we can classify the auxiliary nodes introduced by the SetDagSeg algorithm into the following three types: • ACAL Nodes: These nodes are created when the CPT-tree node has all leaf children. They are added on line 14 and never processed afterwards due to the test on line 7. Hence, they remain roots (e.g., b_{q=q_1, s∈{s_1,s_2}}).
• Multiplexer Nodes: These nodes are created when the CPT-tree node has at least one non-leaf child. They are processed on line 7 as v and by the for loop on line 13 (e.g., b_{} and b_{q=q_1}).
• Outer Nodes: These nodes are created when the CPT-tree node is a leaf. They are the remaining nodes: processed on line 7 as v, they pass the test on line 7 and skip the for loop on line 13 (e.g., b_{q=q_0} and b_{q=q_1, s=s_0}).
Each type of node has a different method to assign the CPT. We discuss each of these in detail next.

Assignment of CPTs to Generated BN Segment
Given a child variable x, its parents π(x), a CPT-tree T and a transformed BN segment G, the second algorithm AssignCpt iterates through each non-parent node in G and assigns the appropriate CPT based on the type of node. The excerpt below shows how multiplexer nodes are handled, followed by the algorithm's return:

11 denote the unique parent of v from π(x) by y;
12 call SetSwitchCpt(v, y, T, G) and assign the CPT returned to v;
...
return G;
We demonstrate the AssignCpt algorithm by applying it to the BN segment pictured in Figure 4.7.

For Outer Nodes, the CPT is retrieved directly from the CPT-tree. For instance, in Figure 4.7, the CPT of the node b_{q=q_1, s∈{s_1,s_2}} is assigned by following the path (q = q_1, s ∈ {s_1, s_2}) in the CPT-tree to the leaf node. The label of the leaf node represents a CPD encoded by its parameters. Figure 4.8 presents the CPD for the node b_{q=q_1, s∈{s_1,s_2}}.

For ACAL nodes, the CPT is assembled from the CPT-tree node's children.
Consider the node b_{q=q_0} in Figure 4.7. Following the path q = q_0 in the CPT-tree leads to the node whose leaf children supply the parameters of the assembled CPT.

For Multiplexer nodes, the first parent y is from π(x) and is identified on line 11 of AssignCpt. The other parents are all auxiliary, and the CPT for node v is deterministic.
In the next section, we discuss how the CPT is generated.

Generate Multiplexer CPT
The last algorithm generates the multiplexer CPT. Given a multiplexer node v, a parent y that v switches on, a CPT-tree T and a transformed BN segment G, the SetSwitchCpt algorithm assigns to v a deterministic CPT that switches on the value of y.
We first demonstrate the deterministic distribution by way of an example. Consider the CPT for the multiplexer node v = b_{} with its parent y = q. In plain language, for a given configuration (q, b_{q=q_0}, b_{q=q_1}), we match the value q_i assigned to q with the auxiliary variable whose path includes the assignment q = q_i.
We take the value of that auxiliary variable as the observed value (probability 1), with the other states of b being unobserved (probability 0). Algorithm 4.3 formalizes this for all configurations of a generic multiplexer node v and parent y. The resulting multiplexer CPT for b_{} is shown in Figure 4.10.
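The switching semantics can be sketched in Java as follows. The selector array, which records which auxiliary parent covers each value of y, is a device of this sketch rather than part of Algorithm 4.3.

/** Minimal sketch of a deterministic multiplexer CPT entry: the value of y
 *  selects one auxiliary parent, and v deterministically copies its value. */
public final class MultiplexerSketch {

    /** P(v = vVal | y = yVal, auxiliary values) under the switching semantics;
     *  selector[i] is the index of the auxiliary whose path covers y = y_i. */
    public static double switchProb(int vVal, int yVal, int[] auxVals, int[] selector) {
        int selected = selector[yVal];                   // auxiliary matched to y = y_i
        return auxVals[selected] == vVal ? 1.0 : 0.0;    // copy its value; all else is 0
    }

    public static void main(String[] args) {
        // b_{} switching on binary q with auxiliaries b_{q=q0} and b_{q=q1}:
        int[] selector = {0, 1};                         // q = q_i selects auxiliary i
        // P(b_{} = 1 | q = q1, b_{q=q0} = 0, b_{q=q1} = 1) = 1.0
        System.out.println(switchProb(1, 1, new int[]{0, 1}, selector));
    }
}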
We now aim to demonstrate that P(z | u, v) is equivalent to the corresponding marginal of the transformed BN segment. The chain rule for two probabilistic events i and j states that P(i, j) = P(j | i) × P(i). We can apply this rule to separate the joint into multiple factors. Using contextual independence, we know that z_{u=u_0} is dependent only on v, and that z_{u=u_1} is contextually independent of u and v. We also know that the variable u cannot hold both u = u_0 and u = u_1 at once. Thus, the auxiliary variables z_{u=u_0} and z_{u=u_1} are independent of each other. This allows us to omit the terms that have no effect on the resulting probabilities.
Given a configuration of (z_{}, u, v), we can now compute the probability of that configuration by expanding the sum and then substituting the value of each variable into the sum.
We can substitute z_{} with z_0, u with u_1 and v with v_0 in the above equation, and then replace each term with the values from the CPTs shown in Figure 4.11, panels d through f.
Note that the multiplexer CPT will zero out the entries that include z_{u=u_0} = z_0.
Repeating this process for each (z_{}, u, v) configuration yields the same CPT as the original BN. Hence, the resulting marginals are identical.

General Information on Variable Duplications
The efficiency of the CSI transformation algorithm depends on the number of variable duplications that occur in the CPT-tree. Before demonstrating the dependence, it is first necessary to specify how the number of variable duplications in a CPT-tree is counted.

Number of Duplicate Variables
Given a CPT-tree for a child node z and the set of parents π(z), we specify the number of duplicated variables as the number of non-leaf nodes in the CPT-tree minus the number of parents |π(z)|.
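As a small illustration, the sketch below counts duplicated variables according to this definition, reusing the illustrative CptTreeNode type sketched in Section 4.1; both are sketches, not the thesis implementation.

/** Minimal sketch of counting duplicated variables in a CPT-tree:
 *  the number of non-leaf nodes minus |pi(z)|. */
public final class DuplicationSketch {

    public static int duplicatedVariables(CptTreeNode root, int numParents) {
        return countNonLeaves(root) - numParents;
    }

    static int countNonLeaves(CptTreeNode node) {
        if (node.isLeaf()) return 0;
        int count = 1;                              // count this non-leaf node
        for (CptTreeNode child : node.edges.values()) {
            count += countNonLeaves(child);         // recurse into each child
        }
        return count;
    }
}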
In Figure 4.12, we present 3 possible CPT-trees for the same BN family P (z|x, w, y).

Chapter 5 Mixed NAT-CSI Bayesian Networks
This chapter outlines a BN representation modelled with both NAT and CSI models, and an accompanying compilation method to support efficient inference. The chapter is structured as follows: Section 5.1 details the BN representation composed of both local models. Section 5.2 discusses the method to support efficient inference.

Representation
Causal independence models and CSI models can each be exploited to improve space and inference efficiency in BNs. To our knowledge, no prior study has considered inference on BNs that take advantage of both simultaneously. Combining the models raises several issues, which we address below.

First, we note that causal independence models and CSI models both apply to individual families in BNs. Thus, it is plausible that the models can coexist in the same environment, and consequently in the same BN. For example, a patient's recovery from surgery may depend on whether or not they use physiotherapy, the skill of the physiotherapist, and their use of medicine. The patient's use of medicine may in turn depend on three medicines, which may counteract one another. The recovery-from-surgery variable may be modelled by a CSI model, while the variable indicating the use of medicine may be modelled by a causal independence model. Hence, it is reasonable that a CSI-modelled variable (recovery from surgery) is dependent on a NAT-modelled variable (use of medicine). The coexistence may also occur in reverse (i.e., a NAT-modelled variable dependent on a CSI-modelled variable), or as two conditionally independent variables in an environment.
Second, a suitable representation is needed for each type of local model. In this thesis, we adopt NAT models as our causal independence model and CPT-trees as our CSI model. The NAT model was selected due to its ability to encode both reinforcing and undermining interactions. The CPT-tree model was selected due to its wider support of inference methods. To avoid digressing, refer to Chapter 2 for details. We define a BN modelled with both NAT models and CPT-trees as a mixed NAT-CSI BN.
Mixed NAT-CSI Bayesian Network (MNCBN). An MNCBN is a BN (M, G, P) specified in terms of the following: • M is a set of variables.
• G is a directed acyclic graph whose nodes correspond one-to-one to members of M . Each variable in the graph is conditionally independent of its non-descendants given its parents.
• P is a set of CPTs partitioned into the triplet (TC, NM, CT), where TC is a set of tabular CPTs, NM is a set of NAT models, and CT is a set of CPT-trees.
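For concreteness, the triplet can be sketched as a simple data structure. The following Java record is illustrative only, with placeholder types standing in for the NAT-model and CPT-tree representations.

import java.util.Map;
import java.util.Set;

/** Minimal sketch of the MNCBN triple (M, G, P) with P partitioned
 *  into (TC, NM, CT); element types are placeholders. */
public record Mncbn(
        Set<String> variables,                 // M
        Map<String, Set<String>> parents,      // G, as parent sets of a DAG
        Map<String, double[]> tabularCpts,     // TC: tabular CPTs
        Map<String, Object> natModels,         // NM: NAT models (placeholder type)
        Map<String, Object> cptTrees) {        // CT: CPT-trees (placeholder type)
}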
An example MNCBN is shown in Figure 5.1.

Third, an MNCBN does not support the use of typical BN inference algorithms due to the presence of the non-tabular local models, nor does it support the use of processing designed for one local model on the other, due to the orthogonality of the local models. Thus, it is necessary to identify a novel approach that supports inference on MNCBNs. We introduce such an inference framework below.

Inference Framework
In this section, we outline a framework that prepares MNCBNs for inference, select an inference method, and demonstrate the framework on an example MNCBN.

One thing to note is that evidence can only be specified on the observed variables.
The auxiliary variables introduced by both de-causalization and network transformation are unobservable. Hence, observations may only be entered on the nodes that were in the original MNCBN, prior to the conversion into a standard BN.

Producing Lazy Junction Trees from DTBNs
In this work, we make use of the Lazy Propagation algorithm due to its ability to efficiently compute posterior probabilities of all observable variables at inference runtime. Lazy propagation can be directly applied to a DTBN with no further changes.
Refer to Sections 2.5.4 and 2.5.5 for further details on the lazy propagation algorithm.

Framework Demonstration
Consider the MNCBN shown in Figure 5.1. We convert it into a DTBN by de-causalizing node g and network-transforming node h. The resulting DTBN is shown in Figure 5.2.
Nodes prefixed with x, y, or q are auxiliary nodes introduced by the de-causalization. Once the MNCBN is converted into a DTBN, the framework compiles the DTBN in Figure 5.2 into a lazy junction tree in order to perform lazy propagation inference.

Coexistence of NAT & CSI Models in Real-World BNs
The first experiment aimed to confirm the coexistence of NAT and CSI local models in real-world BNs. If one (or both) of the models does not exist in the real world, then the usefulness of this research is limited to synthetic MNCBNs. If both models exist, then the research can be applied in practice. The existence of NAT models in 8 real-world BNs was positively identified, with reasonable inference errors, by [22], which we summarized in Section 2.7.6. In this section, we demonstrate the existence of CSI models in real-world BNs.

Experimental Setup
This was achieved by testing for the existence of CSI in 2 of the 8 real-world BNs from the NAT modelling study. The BNs selected were Andes, which models a physics tutoring system, and Win95pts, which models a printer diagnostics system [16]. The BNs were selected because they are entirely binary-valued; the other 6 BNs from the NAT modelling study were multi-valued. The Andes BN was modified to remove 3 isolated nodes (nodes with neither parents nor children). We denote the modified BN as Andes−.
We apply the clustering approach discussed in Section 3.3 to all families in each BN that have 2 or more parents. Recall that the clustering algorithm groups probabilities into clusters based on a distance bound δ: each cluster has a maximum inner-cluster distance of δ, and any two clusters have an inter-cluster distance greater than δ. For this experiment, we use a distance bound of δ = 0.02.

Experimental Results
The clustering results are presented in

Experimental Results
[Figure: boxplots of mean inference runtimes for the N+N, D+N, N+T, and D+T conversion methods, across k = 0, 2, 4, 7, 10 and densities d = 5% and d = 10%.] The D+T approach introduced by this thesis is highlighted in red. Runtimes were obtained using a desktop with a 2.9 GHz clock speed.
It can be observed that N+N is the slowest in all (k, d) combinations, and that both D+N and N+T improve on the N+N approach. However, the relative performance between D+N and N+T is indiscernible. In 4 of the 10 (k, d) combinations, namely (0, 5%), (7, 10%), (10, 5%), and (10, 10%), D+N has a greater mean inference runtime than N+T. In the remaining 6 combinations, N+T has a greater mean inference runtime than D+N. This could be partly due to the presence of normalization in both conversion methods.
Both de-causalization and network transformation tend to result in compiled structures with a smaller maximum number of parents and treewidth. Applying normalization on local models results in compiled structures with a larger maximum number of parents and treewidth. Since inference efficiency is bounded by the maximum number of parents and treewidth, it is possible that the de-causalized or transformed families admit some efficiency savings, but the remaining inefficient normalized families set the bounds on the inference efficiency. It follows that evaluating the relative performance between the D+N and N+T conversion methods may amount to comparing the normalization, rather than the de-causalization and transformation approaches. The third experiment eliminates the confounding variable of normalization to investigate the relative gain from the alternative models.
Moreover, D+T is on average two orders of magnitude faster than the alternatives, which clearly demonstrates the computational advantage obtained by exploiting both NAT and CSI models in MNCBNs. Based on the same logic as above, the speedup of the D+T conversion method suggests that removing normalization relaxes the bounds on inference efficiency.

Experimental Setup
The objective of the third experiment was to directly compare de-causalization and network transformation without the added noise of normalization. In this experiment, we generated BNs in two steps. First, we generated DAGs with 200 variables each (binary or ternary). The largest number of parents per node is 12, and each DAG has at least 4 such families. We generated 300 distinct DAGs, one for each combination of the following three parameters:
• Number of variable duplications (k): 0, 2, 4, 7, 10
• Density beyond being singly connected (d): 5%, 10%
• BN topology: 30 randomly generated topologies
Second, a pair of Bayesian networks is created from each DAG: a NAT-modelled Bayesian network (NMBN) and a CPT-tree modelled Bayesian network (CMBN).
The NMBN is generated by modelling all families with 2 or more parents with NAT models. The CMBN is generated by modelling all families with 2 or more parents with CPT-trees. Families with fewer than 2 parents are left as tabular CPTs. Hence, the pair of Bayesian networks share the same DAG, but differ in their JPDs.
Each NMBN is de-causalized and each CMBN is network transformed. Each resultant BN is compiled for inference by lazy propagation. Ten inference runs are performed on each BN with random observations over 20 randomly selected variables.

Experimental Results

Figure 6.2 contains two panels that compare the log_10 inference runtimes of the NMBNs against the CMBNs. The left panel shows the inference runtimes with a density beyond singly connected of 5%, while the right panel shows a density of 10%. In both panels (d = 5% and d = 10%), the inference runtimes of NMBNs are the lowest, even compared to the most efficient CMBNs, which consist of CPT-trees with no duplicated variables (k = 0). This is observed in Figure 6.2 as the NMBN (blue box) lying below the leftmost CMBN (black box) in both panels. We note the difference in efficiency between CMBNs and NMBNs may be explained by two limitations of the network transformation approach.
First, the number of parents of a transformation's multiplexer node is always greater than the number of parents of a de-causalized node. Consider a CPT-tree transformed BN segment with one multiplexer node m that switches on the parent node n. Assuming all arcs in the source CPT-tree are single-valued (i.e., one parent value per arc), the multiplexer node m will have |dom(n)| + 1 parents: one parent to encode the original variable n, plus one parent to encode each value of dom(n). In comparison, every node in a de-causalized BN segment is guaranteed to have at most 2 parents [25]. A greater maximum number of parents generally decreases inference efficiency.
Second, a de-causalized BN segment is guaranteed to be loop free. By contrast, a CPT-tree with duplicated variables will induce loops, which raises the treewidth of the transformed structure (Section 4.4). A higher treewidth generally decreases inference efficiency. Hence, the greater maximum number of parents and the larger treewidth suggest that NAT modelled Bayesian networks are generally more efficient than CPT-tree modelled networks.
Furthermore, we note that larger k values in Figure 6.2 correspond to longer inference runtimes. This confirms the expected results discussed in Section 4.4. While a larger d value also corresponds to longer inference runtimes, increasing the density from 5% to 10% beyond singly connected has relatively less of an impact than increasing the k value.
In summary, we confirmed the coexistence of NAT and CSI models in real-world BNs. Next, we demonstrated that the D+T conversion method, which exploits both local models, is two orders of magnitude faster than all other conversion methods. A possible explanation for the speedup is that the D+N and N+T conversion methods, which exploit only one local model, are subject to normalization on the other local model. It is plausible that the exploitation of one local model admits some efficiency savings, but the normalized families set the effective bound on inference efficiency. Lastly, we evaluated the relative performance between the de-causalization and network transformation approaches, where we observed the inference runtimes of NAT-modelled BNs to be the lowest, even relative to the most efficient CPT-trees with no duplicated variables.

Chapter 7 Conclusion
In this chapter, we summarize the key contributions of this thesis and offer some areas for future research.

Summary of Contributions
This thesis has introduced MNCBNs, a new representation for exploiting both CSI and causal independence in the same environment. The representation makes use of NAT models as its causal independence model and CPT-trees as its CSI model. An inference framework designed for this representation facilitates efficient inference by combining de-causalization of the NAT models and network transformation of the CPT-tree models in the MNCBN. This avoids the previous requirement of normalizing one model into an exponentially sized tabular CPT. The inference framework has been shown to be both exact and efficient, resulting in a two-order-of-magnitude speedup for inference tasks on low-density networks.
We demonstrated the necessity of this research by confirming that neither model was able to efficiently and exactly encode the other model. This indicates that an inference approach designed for one model cannot be applied to the other while maintaining exactness and efficiency.
We extended the work of Boutilier et al. [3] by formalizing the network transformation algorithm suite. This facilitates the conversion of a CPT-tree into a BN segment modelled with tabular CPTs that preserves context-specific independencies. This BN segment is then compatible with many standard BN algorithms.
We have explored the coexistence of NAT and CSI models in real-world BNs.
This experiment combined previous work that demonstrated the existence of NAT models in 8 real-world BNs with our study identifying CSI in 2 real-world BNs, positively confirming the coexistence in real-world BNs. Our study applied the clustering algorithm to the real-world BNs to identify how many groups of values would be needed to express each CPT. We found that both BNs expressed a significant amount of CSI, allowing for a concise expression by a CSI representation. Other work has also identified CSI in real-world datasets. Collectively, these studies confirm the coexistence of NAT and CSI in real-world BNs.
Some material forming the core of this thesis has been published and presented at the 2020 Canadian Artificial Intelligence conference [15].

Future Work
The first area of future work could explore the compression of tabular CPTs into CPT-trees. In this work, we made use of a clustering algorithm to estimate the number of parameters needed to specify a CPT-tree. Future work could extend these findings by learning a CPT-tree topology from a tabular CPT. It is likely the topology could be identified by extending common decision tree learning algorithms to CPT-trees. It is noted that some work [18] has been conducted in this area, but its learning is restricted to binary trees with multi-valued arcs. It was demonstrated in Section 4.4 that this restriction results in a greater number of variable duplications than a CPT-tree that tests each attribute individually. Hence, a CPT-tree learning algorithm supporting multi-valued trees is desired.
Building on the first area, the second area of future work could explore learning MNCBNs from raw data. This would likely require first identifying the dependence structure (the directed acyclic graph), then identifying the optimal local model for each family. One possible way to identify the optimal local model for each family would be to compress each tabular CPT into both a NAT model and a CSI model.
The hypothetical algorithm would then evaluate the number of parameters and approximation error of the NAT and CSI models against the tabular CPT, and assign the model which best fits the family.
The final area of future work could be to evaluate the ability of CSI to model multi-valued NAT CPTs. In this work, we conducted experiments over binary NAT CPTs and outlined an approach to extend clustering to larger domains. It is predicted that clustering multi-valued NAT CPTs will identify less CSI due to the stricter conditions that must be met for CSI to exist, relative to the clustering of binary NAT CPTs. Further experimental evaluation over multi-valued NAT CPTs would substantiate the prediction.