1

Causal Faithfulness and Simplicity

Peter Spirtes, Jiji Zhang

2

Faithfulness comes in several flavors and is a kind of principle that selects simpler (in a certain sense) over more complicated models.

We show how to weaken the assumption of standard faithfulness so that it needs to be applied in fewer circumstances.

We show how to weaken the assumption of strong (ε-)faithfulness so that it does not prohibit the existence of weak edges.

We show how to modify the causal search algorithms so that they make fewer mind changes as the sample size grows.

Goals

3

Example of SGS Algorithm

True Graph: X → Z ← Y, Z → W

W = aZ + εW

Z = bX + cY + εZ

X = εX

Y = εY

[Figure: the graph over X, Y, Z, W, repeated in four panels.]

IP(W,X|Z) = 0

IP(W,Y|Z) = 0

IP(X,Y|∅) = 0

4

S1. Form the complete undirected graph H on the given set of variables V.

S2. For each pair of variables X and Y in V, search for a subset S of V\{X, Y} such that X and Y are independent conditional on S. Remove the edge between X and Y in H iff such a set is found.

S3. Let K be the graph resulting from S2. For each unshielded triple <X, Y, Z> (i.e., X and Y are adjacent, Y and Z are adjacent, but X and Z are not adjacent), if X and Z are independent conditional on some subset of V\{X, Z} that does not contain Y, then orient the triple as a collider: X → Y ← Z.

S4. Execute the entailed orientation rules.

SGS algorithm
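
Steps S1 to S3 can be sketched in a few lines, assuming an independence oracle in place of statistical tests. The function name `sgs_skeleton_and_colliders`, the `oracle` helper, and the `FACTS` table are hypothetical names introduced for illustration; the facts listed are the conditional independencies entailed by a graph X → Z ← Y, Z → W. Step S4 (the entailed orientation rules) is not implemented here.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of items as a frozenset."""
    items = sorted(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def sgs_skeleton_and_colliders(variables, indep):
    """Run S1-S3 with an oracle indep(x, y, S) -> bool (S4 is omitted)."""
    # S1: complete undirected graph.
    edges = {frozenset(pair) for pair in combinations(variables, 2)}
    # S2: remove an edge iff some separating set exists.
    for x, y in combinations(variables, 2):
        others = set(variables) - {x, y}
        if any(indep(x, y, s) for s in subsets(others)):
            edges.discard(frozenset((x, y)))
    # S3: orient unshielded triples <x, mid, z> as colliders when x and z
    # are independent given some set that does not contain mid.
    colliders = set()
    for mid in variables:
        neighbours = sorted(v for v in variables if frozenset((v, mid)) in edges)
        for x, z in combinations(neighbours, 2):
            if frozenset((x, z)) in edges:
                continue  # shielded triple
            others = set(variables) - {x, z, mid}
            if any(indep(x, z, s) for s in subsets(others)):
                colliders.add((x, mid, z))
    return edges, colliders


# Oracle for X -> Z <- Y, Z -> W (assumed example, listed by hand).
FACTS = {
    (frozenset("XY"), frozenset()),
    (frozenset("WX"), frozenset("Z")),
    (frozenset("WX"), frozenset("YZ")),
    (frozenset("WY"), frozenset("Z")),
    (frozenset("WY"), frozenset("XZ")),
}


def oracle(x, y, s):
    return (frozenset((x, y)), frozenset(s)) in FACTS


edges, colliders = sgs_skeleton_and_colliders(["W", "X", "Y", "Z"], oracle)
```

With this oracle the sketch keeps the edges X – Z, Y – Z and Z – W and marks the triple <X, Z, Y> as a collider. The search over all subsets is exponential, which is acceptable only for a toy example.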

5

Causal Markov Assumption: For a set of variables for which there are no unmeasured common causes, each variable is independent of its non-effects conditional on its direct causes.

Non-obvious equivalent formulation: If IG(X,Y|Z) in causal DAG G with no unmeasured common causes, then IP(X,Y|Z) = 0.

If IP(X,Y|Z) = 0 then IG(X,Y|Z) in causal DAG G. This is the converse of the Causal Markov Assumption.

If IP(X,Y|Z) is a rational function of parameters, then violations have Lebesgue measure 0.

Causal Faithfulness Assumption
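
A small numerical illustration of a faithfulness violation by path cancellation, under assumptions not in the slides: a linear model X = εX, Y = aX + εY, Z = bY + cX + εZ with independent unit-variance errors, for which Cov(X, Z) = ab + c. The function name `cov_x_z` is hypothetical.

```python
def cov_x_z(a, b, c):
    """Cov(X, Z) in X = eX, Y = a*X + eY, Z = b*Y + c*X + eZ.

    Assumes independent errors with unit variance, so the covariance is the
    sum over the two directed paths X -> Y -> Z (a*b) and X -> Z (c).
    """
    return a * b + c


a, b = 0.5, 0.8
generic = cov_x_z(a, b, 0.3)        # X and Z are correlated
cancelled = cov_x_z(a, b, -a * b)   # the two paths cancel exactly
```

In the second case X and Z are uncorrelated although X is a direct cause of Z, so the independence does not correspond to any d-separation in the graph. The cancelling parameter values lie on the surface c = -ab, which is the kind of measure-zero set the slide refers to.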

6

Reduction of Underdetermination: If I(A,B|∅) then prefer A → C ← B to A → C → B.

Computational Efficiency: If A – C – B and I(A,B|∅) then there is no need to check I(A,B|C).

Statistical Efficiency: The Markov equivalence class can be found without testing independence conditional on any set larger than the maximum degree of any variable in the true causal graph.

Three Faces of Faithfulness

7

If causal sufficiency, Causal Markov and Causal Faithfulness Assumptions, then there exist pointwise consistent estimators of the Markov equivalence class:
SGS
PC
GES (Gaussian, multinomial)

If we assume just the Causal Markov Assumption and causal sufficiency, there are no pointwise consistent estimators of the Markov equivalence class:
Gaussian
Multinomial
Unrestricted

Faithfulness Assumptions and Pointwise Consistency

8

If causal sufficiency, Causal Markov and Causal Faithfulness Assumptions, then there is no uniformly consistent estimator of the Markov equivalence class:
Gaussian
Multinomial
Unrestricted

Faithfulness Assumptions and Uniform Consistency

9

(A4: ε-faithfulness) The partial correlations between X(i) and X(j) given {X(r); r ∈ k} for some set k ⊆ {1,…,pn}\{i,j} are denoted by rn;i,j|k. Their absolute values are bounded from below and above:

Kalisch and Bühlmann Assumptions

10

Kalisch and Bühlmann Theorem

11

Uhler et al.: (A4) tends to be violated fairly often, if the parameter values are assigned randomly, and ε is not very small.

There are two ways to get very small partial correlations – almost cancellations and very weak edges.

(A4) forbids both – it entails that there are no very weak edges.

Kalisch and Bühlmann Assumptions

12

[Figure: four graphs over X, Y, Z, W.]

Discontinuities in Limiting Output

13

Behavior as Sample Size Grows

[Figure: output graphs over X, Y, Z, W for four sample sizes (Small, Medium-, Medium+, Large), each shown with the independence relations IP(W,{X,Y}|Z), IP(W,{X,Y}|∅), IP(X,Y|∅), and IP(W,Z|∅).]

14

Desired Behavior as Sample Size Grows

[Figure: desired output graphs over X, Y, Z, W for four sample sizes (Small, Medium-, Medium+, Large), each shown with the independence relations IP(W,{X,Y}|Z), IP(W,{X,Y}|∅), IP(X,Y|∅), and IP(W,Z|∅).]

15

X → Y → Z → W X – Y – Z – W X – Y – Z → W

IP(X,Z|Y) IP(X,Z|Y) IP(X,Z|Y)

IP(Y,W|{X,Z}) IP(Y,W|{X,Z}) IP(Y,W|{X,Z})

IP(X,W|∅) IP(X,W|∅)

True Graph Small Sample Large Sample

Behavior as Sample Size Grows

16

X → Y → Z → W X – Y – Z – W X – Y – Z → W

IP(X,Z|Y) IP(X,Z|Y) IP(X,Z|Y)

IP(Y,W|{X,Z}) IP(Y,W|{X,Z}) IP(Y,W|{X,Z})

IP(X,W|∅) IP(X,W|∅)

True Graph Small Sample Large Sample

Desired Behavior as Sample Size Grows

17

[Figure: graph over X, Y, Z, W.]

True Graph

18

Behavior as Sample Size Grows

[Figure: output graphs over X, Y, Z, W for four sample sizes (Small, Medium-, Medium+, Large), each shown with the independence relations IP(W,{X,Y}|Z), IP(W,{X,Y}|∅), IP(X,Y|∅), and IP(W,Z|∅).]

19

S3*. Let K be the undirected graph resulting from S2. For each unshielded triple <X, Y, Z>:

If X and Z are not independent conditional on any subset of V\{X, Z} that contains Y, then orient the triple as a collider: X → Y ← Z.

If X and Z are not independent conditional on any subset of V\{X, Z} that does not contain Y, then mark the triple as a non-collider.

Otherwise, mark the triple as ambiguous (or unfaithful).

CSGS
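
The three-way decision in S3* can be sketched as follows, again assuming an independence oracle rather than statistical tests. The function name `classify_triple` and the example oracles are hypothetical.

```python
from itertools import combinations


def classify_triple(x, y, z, variables, indep):
    """Classify an unshielded triple <x, y, z> as in S3*.

    indep(x, z, S) is an assumed oracle returning True when x and z are
    independent conditional on the frozenset S.
    """
    others = sorted(set(variables) - {x, z})
    found_with_y = False
    found_without_y = False
    for size in range(len(others) + 1):
        for combo in combinations(others, size):
            if indep(x, z, frozenset(combo)):
                if y in combo:
                    found_with_y = True
                else:
                    found_without_y = True
    if not found_with_y and not found_without_y:
        raise ValueError("x and z are never independent: the triple is shielded")
    if not found_with_y:
        return "collider"
    if not found_without_y:
        return "non-collider"
    return "ambiguous"


V = ["X", "Y", "Z"]
# X -> Y <- Z: X and Z are independent only given the empty set.
as_collider = classify_triple("X", "Y", "Z", V, lambda a, b, s: s == frozenset())
# X -> Y -> Z: X and Z are independent only given {Y}.
as_chain = classify_triple("X", "Y", "Z", V, lambda a, b, s: s == frozenset("Y"))
# Independent both marginally and given {Y}: a detectable faithfulness failure.
as_ambiguous = classify_triple("X", "Y", "Z", V, lambda a, b, s: True)
```

The difference from S3 is that both kinds of separating set are searched for, so a triple whose independencies point both ways is reported as ambiguous instead of being oriented.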

20

Adjacency – If X – Y in the causal DAG then IP(X,Y|Z) ≠ 0 for any Z.

Assumptions about which independencies

21

Triangle – For any three variables that form a triangle in causal DAG G:

If Z is a non-collider on the path <X, Z, Y>, then X and Y are not independent conditional on any subset of V\{X, Y} that does not contain Z;

If Z is a collider on the path <X, Z, Y>, then X and Y are not independent conditional on any subset of V\{X, Y} that contains Z.

Suppose X → Y ← Z and IP(X,Z|Y) = 0. This is faithful to X → Y → Z. Such a violation cannot be detected, so its absence must be assumed.

Assumptions about which independencies

22

Graph over X, Y, Z:
¬I(X,Z|∅)
¬I(X,Y|Z)
¬I(Y,Z|∅)

Graph over X, Y, Z, W:
¬I(X,Z|∅) ¬I(X,Z|W) ¬I(X,Z|Y,W)
¬I(Y,Z|∅) ¬I(Y,Z|W) ¬I(Y,Z|X,W)
¬I(X,Y|Z) ¬I(X,Y|W) ¬I(X,Y|Z,W)
¬I(X,W|∅) ¬I(X,W|Z) ¬I(X,W|Y)
¬I(Y,W|∅) ¬I(Y,W|X) ¬I(Y,W|Z)
¬I(Z,W|∅) ¬I(Z,W|X) ¬I(Z,W|Y)

Triangle-Faithfulness

23

The population distribution is not Markov to any proper subDAG of the true causal DAG.

Causal Minimality is entailed by the manipulation definition of causation if the distribution is positive.

There is a weaker kind of causal minimality, P-minimality: the population distribution is not Markov to any DAG that entails a proper superset of the conditional independence relations.

Is this sufficient for the correctness of VCSGS?

Causal Minimality

24

X → Y → Z → W X – Y – Z – W X – Y – Z – W

IP(X,Z|Y) IP(X,Z|Y) IP(X,Z|Y)

IP(Y,W|{X,Z}) IP(Y,W|{X,Z}) IP(Y,W|{X,Z})

IP(X,W|∅) IP(X,W|∅)

True Graph Small Sample Large Sample

Example of VCSGS

25

X → Y → Z → W X – Y – Z – W X – Y – Z → W

IP(X,Z|Y) IP(X,Z|Y) IP(X,Z|Y)

IP(Y,W|{X,Z}) IP(Y,W|{X,Z}) IP(Y,W|{X,Z})

IP(X,W|∅) IP(X,W|∅)

True Graph Small Sample Large Sample

Example of VCSGS

26

X → Y → Z → W X – Y – Z – W X – Y – Z → W

IP(X,Z|Y) IP(X,Z|Y) IP(X,Z|Y)

IP(Y,W|{X,Z}) IP(Y,W|{X,Z}) IP(Y,W|{X,Z})

IP(X,W|∅) IP(X,W|∅)

True Graph Small Sample Large Sample

Example of VCSGS

27

V1. Form the complete undirected graph H on the given set of variables V.

V2. For each pair of variables X and Y in V, search for a subset S of V\{X, Y} such that X and Y are independent conditional on S. Remove the edge between X and Y in H and mark the pair <X, Y> as ‘apparently non-adjacent’, if and only if such a set is found.

V3. Let K be the graph resulting from V2. For each apparently unshielded triple <X, Y, Z> (i.e., X and Y are adjacent, Y and Z are adjacent, but X and Z are apparently non-adjacent):

If X and Z are not independent conditional on any subset of V\{X, Z} that contains Y, then orient the triple as a collider: X → Y ← Z.

If X and Z are not independent conditional on any subset of V\{X, Z} that does not contain Y, then mark the triple as a non-collider.

Otherwise, mark the triple as ambiguous (or unfaithful), and mark the pair <X, Z> as ‘definitely non-adjacent’.

VCSGS algorithm

28

V4. Execute the same orientation rules as in S4, until none of them applies.

V5. Let M be the graph resulting from V4. For each consistent disambiguation of the ambiguous triples in M (i.e., each disambiguation that leads to a pattern), test whether each vertex V in the resulting pattern satisfies the Markov condition. If V and W satisfy the Markov condition in every pattern, then mark the ‘apparently non-adjacent’ <V,W> pair as ‘definitely non-adjacent’.

VCSGS algorithm

29

Inclusion Relations for given P

Faithfulness

Adjacency-Faithfulness

Triangle-Faithfulness

P-Minimality

30

If Triangle Faithfulness Assumption, Causal Minimality Assumption, and Causal Markov Assumption, then VCSGS is a consistent estimator of the extended Markov equivalence class.

Is it complete?

Faithfulness Assumptions and Pointwise Consistency

31

V5*. Let M be the graph resulting from V4. For each consistent disambiguation of the ambiguous triples in M (i.e., each disambiguation that leads to a pattern), test whether each vertex V in the resulting pattern satisfies the Markov condition. If V and W satisfy the Markov condition in some pattern, then mark the ‘apparently non-adjacent’ <V,W> pair as ‘definitely non-adjacent’.

Conjecture

32

Assumption NVV(J):

Assumption UBC(C):

Our Assumptions

33

Given a set of variables V, suppose the true causal model over V is M = <P,G>, where P is a Gaussian distribution over V, and G is a DAG with vertices V. For any three variables X, Y, Z that form a triangle in G (i.e., each pair of vertices is adjacent):

If Y is a non-collider on the path <X, Y, Z>, then |r(X, Z|W)| ≥ k |eM(X – Z)| for all W ⊆ V that do not contain Y; and

If Y is a collider on the path <X, Y, Z>, then |r(X, Z|W)| ≥ k |eM(X – Z)| for all W ⊆ V that do contain Y.

k-Triangle-Faithfulness Assumption
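
The inequalities above can be checked directly for a small model. The sketch below assumes a three-variable linear Gaussian model X = εX, Y = aX + εY, Z = bY + cX + εZ with independent unit-variance errors, so that V = {X, Y, Z} and each path through the triangle has exactly one relevant conditioning set. The function names and parameter values are hypothetical.

```python
import math


def correlations(a, b, c):
    """Pairwise correlations implied by X = eX, Y = a*X + eY, Z = b*Y + c*X + eZ."""
    var_y = a * a + 1.0
    var_z = b * b * var_y + c * c + 2.0 * a * b * c + 1.0
    r_xy = a / math.sqrt(var_y)
    r_xz = (a * b + c) / math.sqrt(var_z)
    r_yz = (b * var_y + c * a) / math.sqrt(var_y * var_z)
    return r_xy, r_xz, r_yz


def k_triangle_faithful(a, b, c, k):
    """Check the three triangle inequalities for the triangle X, Y, Z."""
    r_xy, r_xz, r_yz = correlations(a, b, c)
    # Partial correlation of X and Y given Z (first-order recursion formula).
    r_xy_given_z = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))
    return all([
        abs(r_xz) >= k * abs(c),          # Y non-collider on <X, Y, Z>, W = {}
        abs(r_yz) >= k * abs(b),          # X non-collider on <Y, X, Z>, W = {}
        abs(r_xy_given_z) >= k * abs(a),  # Z collider on <X, Z, Y>, W = {Z}
    ])


weak_edge_ok = k_triangle_faithful(a=0.01, b=0.8, c=0.8, k=0.1)
near_cancellation_ok = k_triangle_faithful(a=0.8, b=0.8, c=-0.6399, k=0.1)
```

Because each bound scales with the coefficient of the edge in question, the model with a very weak X → Y edge passes the check, while the model in which the paths X → Y → Z and X → Z almost cancel fails it.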

34

S3* (sample version). Let K be the undirected graph resulting from the adjacency phase. For each unshielded triple <X, Y, Z>:

If there is a set W not containing Y such that the test of r(X, Z|W) = 0 returns 0 (i.e., accepts the hypothesis), and for every set U that contains Y, the test of |r(X,Z|U)| = 0 returns 1 (i.e., rejects the hypothesis), and the test of |r(X,Z|U) – r(X,Z|W)| L returns 0 (i.e., accepts the hypothesis), then orient the triple as a collider: X → Y ← Z.

If there is a set W containing Y such that the test of r(X, Z|W) = 0 returns 0 (i.e., accepts the hypothesis), and for every set U that does not contain Y, the test of |r(X,Z|U)| = 0 returns 1 (i.e., rejects the hypothesis), and the test of |r(X,Z|U) – r(X,Z|W)| L returns 0 (i.e., accepts the hypothesis), then mark the triple as a non-collider.

Otherwise, mark the triple as ambiguous.

VCSGS (Sample version)
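
The slides do not say which test of zero partial correlation is used. One common choice for Gaussian data is Fisher's z-transform, sketched below with the same 0/1 return convention as the step above (0 = accept, 1 = reject). The function name, the default significance level, and the example values are assumptions.

```python
import math
from statistics import NormalDist


def zero_partial_corr_test(r, n, cond_size, alpha=0.05):
    """Fisher z-test of H0: partial correlation = 0.

    r is the sample partial correlation, n the sample size, and cond_size the
    number of variables conditioned on. Returns 1 if H0 is rejected, else 0.
    """
    z = 0.5 * math.log((1.0 + r) / (1.0 - r))
    statistic = math.sqrt(n - cond_size - 3) * abs(z)
    critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return 1 if statistic > critical else 0


rejects = zero_partial_corr_test(r=0.5, n=100, cond_size=1)
accepts = zero_partial_corr_test(r=0.05, n=100, cond_size=1)
```

At a fixed sample size a weak but nonzero partial correlation is accepted as zero, which is why the output of a search based on such tests can change as the sample grows.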

35

Say that CSGS(L, n, M) errs if it contains (i) an adjacency not in GM; or (ii) a marked non-collider not in GM, or (iii) an orientation not in GM.

Theorem: Given causal sufficiency of the measured variables V, the Causal Markov, k-Triangle-Faithfulness, NVV(J), and UBC(C) Assumptions, the CSGS algorithm is uniformly consistent in the sense that

Uniform Consistency

36

For each vertex Z:
  If every vertex not adjacent to Z is not confirmed to be non-adjacent to Z, return ‘Unknown’ for every edge containing Z;
  else:
    For every non-adjacent pair <Y, Z> in EP(G), let the estimate be 0.
    For each vertex Z such that all of the edges containing Z are oriented in EP(G), if Y is a parent of Z in EP(G), let the estimate be the sample regression coefficient of Y in the regression of Z on its parents in EP(G).

Estimation Algorithm
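
The regression step can be sketched in plain Python, assuming zero-mean data so that no intercept is needed. The names `regression_coefficients` and `estimate_into`, the boolean flag `nonadjacencies_confirmed`, and the toy data are hypothetical; the flag stands in for the confirmation check described above.

```python
def regression_coefficients(parent_columns, child):
    """Least-squares coefficients of child on its parents (no intercept)."""
    p = len(parent_columns)
    # Normal equations: (P^T P) beta = P^T child.
    gram = [[sum(u * v for u, v in zip(parent_columns[i], parent_columns[j]))
             for j in range(p)] for i in range(p)]
    rhs = [sum(u * v for u, v in zip(parent_columns[i], child)) for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda row: abs(gram[row][col]))
        gram[col], gram[pivot] = gram[pivot], gram[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for row in range(col + 1, p):
            factor = gram[row][col] / gram[col][col]
            for k in range(col, p):
                gram[row][k] -= factor * gram[col][k]
            rhs[row] -= factor * rhs[col]
    # Back substitution.
    beta = [0.0] * p
    for i in reversed(range(p)):
        tail = sum(gram[i][j] * beta[j] for j in range(i + 1, p))
        beta[i] = (rhs[i] - tail) / gram[i][i]
    return beta


def estimate_into(child, parents, nonadjacencies_confirmed, data):
    """Edge estimates for parent -> child, or 'Unknown' when unconfirmed."""
    if not nonadjacencies_confirmed:
        return {parent: "Unknown" for parent in parents}
    beta = regression_coefficients([data[parent] for parent in parents], data[child])
    return dict(zip(parents, beta))


# Toy zero-mean data generated exactly as Z = 2*Y1 - Y2 (no noise).
data = {
    "Y1": [1.0, -1.0, 2.0, -2.0],
    "Y2": [1.0, 1.0, -1.0, -1.0],
    "Z": [1.0, -3.0, 5.0, -3.0],
}
estimates = estimate_into("Z", ["Y1", "Y2"], True, data)
unknown = estimate_into("Z", ["Y1", "Y2"], False, data)
```

On this noise-free data the estimates recover the generating coefficients 2 and -1; with real samples the same call would return the usual least-squares estimates.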

37

Let M1 be an output of the Estimation Algorithm, and M2 be a causal model. We define the structural coefficient distance, d[M1,M2], between M1 and M2 to be

where by convention if = “Unknown”.

Structural Coefficient Distance

38

E1. Run the CSGS algorithm on an i.i.d. sample of size n from PM.

E2. Let the output from E1 be CSGS(L, n, M). Apply step V5 in the VCSGS algorithm (from section 3), using tests of zero partial correlations and record which non-adjacencies are confirmed.

E3. Apply the Estimation Algorithm to CSGS(L, n, M), the confirmed non-adjacencies, and the sample of size n.

Edge Estimation Algorithm

39

Given causal sufficiency of the measured variables V, the Causal Markov, k-Triangle-Faithfulness, NVV(J), and UBC(C) Assumptions, the Edge Estimation I algorithm is uniformly consistent in the sense that for every > 0

For a large enough and dense enough graph, this still allows for the possibility of large manipulation errors (due to many small edge errors).

Uniform Consistency

40

Breaking the Markov Equivalence Class

      X1         X2        X3
X1    1.0
X2    0.01       1.0
X3    0.7877781  0.612157  1.0

41

Breaking the Markov Equivalence Class

If k > 0.014, then the k-Triangle-Faithfulness Assumption is violated for models M2 and M3, but not for M1.

If 0.008 < k < 0.014, then the k-Triangle-Faithfulness Assumption is violated for model M3, but not for M1 or M2.

42

E1. Run Edge Estimation Algorithm I.

E2. Set ForbiddenOrientations = {}.

E3. For each maximal clique in CSGS(L, n, M) such that if a vertex in the clique is not adjacent to some vertex not in the clique, it is definitely non-adjacent:

(i) For each possible orientation O of all of the unoriented edges in the maximal clique:
  Apply the orientation O to each of the unoriented edges.
  Apply Meek's orientation rules.
  If application of the rules produces a cycle or a new unshielded collider, add O to ForbiddenOrientations.
  Add O to ForbiddenOrientations if for any Y and W such that Y is a non-collider on the path <X, Y, Z>, and W ⊆ V and does contain Y

Edge Estimation Algorithm II

43

E4. For each unoriented edge X – Y in CSGS(L, n, M), if there is only one orientation X → Y that does not occur in ForbiddenOrientations, and Y is definitely non-adjacent to every vertex that it is not adjacent to, orient the edge as X → Y.

E5. For each vertex V such that some edge containing V in CSGS(L, n, M) is not oriented, if there is only one orientation of all of the edges containing V that is not in ForbiddenOrientations, and V is definitely non-adjacent to every vertex that it is not adjacent to, let the estimate of each edge be the sample regression coefficient of V on its parents in the non-forbidden orientation.

Edge Estimation Algorithm II

44

Theorem: Given causal sufficiency of the measured variables V, the Causal Markov, k-Triangle-Faithfulness, NVV(J), and UBC(C) Assumptions, the Edge Estimation II algorithm is uniformly consistent in the sense that for every > 0

where O(L,n,M) is the graphical output of the Edge Estimation II algorithm, and is the output of the Edge Estimation II algorithm.

Uniform Consistency

45

We weakened the assumption of faithfulness so that fewer inferences from conditional independence to d-separation need to be made.

We strengthened the assumption so that it allows one to make inferences from “almost independence” in a probability distribution to d-separation in a causal graph, allowing for the existence of uniformly consistent estimation algorithms.

Conclusion

46

We changed the concept of correctness to allow for missing weak edges and for saying “don’t know” about some features of Markov equivalence classes.

The new simplicity assumption broke up the Markov equivalence class in the sense that it considers some models in a Markov equivalence class simpler than other models in the same Markov equivalence class.

This allowed for uniformly consistent estimates of linear coefficients in a causal model, as well as causal structure.

Conclusion

47

Can we get similar results for:
PC
FCI
non-linear models
increasing numbers of variables and vertex degree and decreasing k (analogous to Kalisch and Bühlmann)?

If parameter values are randomly assigned, how often is k-triangle faithfulness violated as a function of:
sample size
clique size
parameter distribution
k

Open Questions

48

Kalisch, M., and P. Bühlmann (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research 8, 613–636.

Spirtes, P., and Zhang, J. (forthcoming) A Uniformly Consistent Estimator of Causal Effects Under The k-Triangle-Faithfulness Assumption, Statistical Science.

Spirtes, P., and Zhang, J. (submitted) Three Faces of Faithfulness, Synthese.

References