Machine Teaching and its Applications


Transcript of Machine Teaching and its Applications

Page 1: Machine Teaching and its Applications

Jerry Zhu

University of Wisconsin-Madison

Jan. 8, 2018

Page 2: Introduction

Page 3: Machine teaching

Given a target model θ∗ and a learner A, find the best training set D so that

A(D) ≈ θ∗

Page 4: Passive learning, active learning, teaching

Page 5: Passive learning

with large probability, |θ − θ∗| = O(n⁻¹)

Page 6: Active learning

|θ − θ∗| = O(2⁻ⁿ)
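To see where the O(2⁻ⁿ) rate comes from: each label query can halve the interval known to contain θ∗. A minimal sketch, assuming a noiseless 1-D threshold classifier on [0, 1] (an illustrative setup, not code from the talk):

```python
# Binary-search active learning on a noiseless 1-D threshold classifier.
# Each query halves the uncertainty interval: |theta - theta*| = O(2^-n).

def active_learn_threshold(oracle, n_queries):
    lo, hi = 0.0, 1.0
    for _ in range(n_queries):
        x = (lo + hi) / 2.0      # always query the most informative point
        if oracle(x) == 1:       # +1 means x lies at or above theta*
            hi = x
        else:
            lo = x
    return (lo + hi) / 2.0

theta_star = 0.371               # hypothetical target threshold
oracle = lambda x: 1 if x >= theta_star else -1
theta_hat = active_learn_threshold(oracle, n_queries=20)
print(abs(theta_hat - theta_star))   # about 2^-21
```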

Page 7: Machine teaching

∀ε > 0, n = 2 (two teaching examples suffice, for any accuracy ε)
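And the teaching version: the teacher already knows θ∗, so two examples placed ε apart around it pin the learner down immediately. A sketch assuming a hypothetical consistent learner that returns the midpoint of the innermost oppositely-labeled pair:

```python
# Teaching the 1-D threshold learner with n = 2 items, for any eps > 0.
# Assumes the learner outputs the midpoint of the closest (-1, +1) pair.

def teach(theta_star, eps):
    # two items straddling theta*, eps/2 away on each side
    return [(theta_star - eps / 2, -1), (theta_star + eps / 2, +1)]

def learner(D):
    neg = max(x for x, y in D if y == -1)
    pos = min(x for x, y in D if y == +1)
    return (neg + pos) / 2.0

D = teach(theta_star=0.371, eps=1e-9)
print(abs(learner(D) - 0.371))   # <= eps/2, with only two examples
```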

Page 8: Another example: teaching hard-margin SVM

TD = 2 vs. VC = d + 1 (teaching dimension vs. VC dimension)

Page 9: Machine learning vs. machine teaching

- learning (D given, learn θ):
  θ = argmin_θ Σ_{(x,y)∈D} ℓ(x, y, θ) + λ‖θ‖²
- teaching (θ∗ given, learn D):
  min_{D,θ} ‖θ − θ∗‖² + η‖D‖₀
  s.t. θ = argmin_θ Σ_{(x,y)∈D} ℓ(x, y, θ) + λ‖θ‖²
- D not i.i.d.
- synthetic or pool-based
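For intuition, here is a brute-force sketch of the pool-based variant: search small subsets of a candidate pool for the D minimizing ‖θ(D) − θ∗‖² + η|D|, with a one-parameter ridge-regression learner so the lower level has a closed form. The data, pool, and constants are all made up:

```python
# Pool-based machine teaching by exhaustive search (toy scale only).
import itertools
import numpy as np

def learner(D, lam=0.1):
    # 1-D ridge regression through the origin:
    # argmin_theta sum (y - theta*x)^2 + lam*theta^2 = x.y / (x.x + lam)
    x = np.array([p[0] for p in D]); y = np.array([p[1] for p in D])
    return float(x @ y / (x @ x + lam))

pool = [(x, 0.8 * x) for x in np.linspace(-1, 1, 9)]   # candidate items
theta_star, eta = 0.75, 0.01

best = min(
    (S for k in range(1, 4) for S in itertools.combinations(pool, k)),
    key=lambda D: (learner(D) - theta_star) ** 2 + eta * len(D),
)
print(best, learner(best))   # the teacher trades accuracy against |D|
```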

Page 10: Why bother if we already know θ∗?

teach·ing /ˈteCHiNG/ noun

1. education

2. controlling

3. shaping

4. persuasion

5. influence maximization

6. attacking

7. poisoning

Page 11: The coding view

- message = θ∗
- decoder = learning algorithm A
- language = D

[Diagram: the teacher encodes θ∗ as a training set D ∈ A⁻¹(θ∗); the learner decodes with θ∗ = A(D) ∈ Θ.]

Page 12: Machine teaching generic form

min_{D,θ} TeachingRisk(θ) + η TeachingCost(D)
s.t. θ = MachineLearning(D)

Page 13: Fascinating things I will not discuss today

- probing graybox learners
- teaching by features, pairwise comparisons
- learner anticipates teaching
- reward shaping, reinforcement learning, optimal control

Page 14: Machine learning debugging

Page 15: Harry Potter toy example

Page 16: Labels y contain historical bias

+ : hired by the Ministry of Magic
o : not hired

[Scatter plot: education vs. magical heritage, both axes 0 to 1.]

Page 17: Trusted items (x, y)

- expensive
- insufficient to learn

[Scatter plot: education vs. magical heritage, with the trusted items marked.]

Page 18: Idea

[Scatter plot: education vs. magical heritage.]

Flip training labels and re-train model to agree with trusted items.

Ψ(θ) := [θ(x) = y]

Page 19: Not our goal: only to learn a better model

min_{θ∈Θ} ℓ(X, Y, θ) + λ‖θ‖
s.t. Ψ(θ) = true

Page 20: Our goal: to find bugs and learn a better model

min_{Y′,θ} ‖Y − Y′‖
s.t. Ψ(θ) = true
     θ = argmin_{θ∈Θ} ℓ(X, Y′, θ) + λ‖θ‖

Page 21: Solving combinatorial, bilevel optimization (Stackelberg game)

step 1. label to probability simplex: y′_i → δ_i ∈ ∆

step 2. counting to probability mass: ‖Y′ − Y‖ → (1/n) Σ_{i=1}^{n} (1 − δ_{i,y_i})

step 3. soften the postcondition (on the m trusted items): θ(X) = Y → (1/m) Σ_{i=1}^{m} ℓ(x_i, y_i, θ)

Page 22: Continuous now, but still bilevel

argmin_{δ∈∆ⁿ, θ} (1/m) Σ_{i=1}^{m} ℓ(x_i, y_i, θ) + γ (1/n) Σ_{i=1}^{n} (1 − δ_{i,y_i})
s.t. θ = argmin_θ (1/n) Σ_{i=1}^{n} Σ_{j=1}^{k} δ_{ij} ℓ(x_i, j, θ) + λ‖θ‖²

Page 23: Removing the lower level problem

θ = argmin_θ (1/n) Σ_{i=1}^{n} Σ_{j=1}^{k} δ_{ij} ℓ(x_i, j, θ) + λ‖θ‖²

step 4. the KKT condition:
(1/n) Σ_{i=1}^{n} Σ_{j=1}^{k} δ_{ij} ∇_θ ℓ(x_i, j, θ) + 2λθ = 0

step 5. plug the implicit function θ(δ) into the upper level problem:
argmin_δ (1/m) Σ_{i=1}^{m} ℓ(x_i, y_i, θ(δ)) + γ (1/n) Σ_{i=1}^{n} (1 − δ_{i,y_i})

step 6. compute the gradient ∇_δ with the implicit function theorem
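Steps 4-6 in miniature, with the lower level simplified to 1-D weighted ridge regression so that θ(δ) and its gradient have closed forms (per-item trust weights δ_i ∈ [0, 1] stand in for the label simplex; everything here is a hypothetical stand-in for the general recipe, not the released implementation):

```python
# Implicit-function-theorem gradient for a tiny bilevel debugger.
import numpy as np

x = np.array([0.1, 0.5, 0.9]); y = np.array([0.0, 1.0, 0.0])  # suspect data
xt, yt = 0.5, 1.0                                             # trusted item
lam, gamma, lr, n = 0.1, 0.05, 0.5, 3

def theta(delta):
    # KKT solution of: argmin (1/n) sum delta_i (y_i - theta x_i)^2 + lam theta^2
    return delta @ (x * y) / (delta @ (x * x) + n * lam)

def dtheta_ddelta(delta):
    # implicit function theorem applied to the KKT condition above
    den = delta @ (x * x) + n * lam
    return (x * y * den - (delta @ (x * y)) * x * x) / den ** 2

delta = np.ones(n)
for _ in range(100):  # step 6: gradient descent on the upper objective
    th = theta(delta)
    grad = 2 * (th * xt - yt) * xt * dtheta_ddelta(delta) - gamma
    delta = np.clip(delta - lr * grad, 0.0, 1.0)

print(delta)  # items whose delta_i is driven toward 0 are suspected bugs
```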

Page 24: Software available.

Page 25: Harry Potter Toy Example

[Scatter plots (education vs. magical heritage) of the data, the bugs flagged by our debugger, and the bugs flagged by the influence function; plus a precision-recall curve comparing DUTI, INF, NN (nearest-neighbor label noise detection), and an LND oracle, by average PR.]

Page 26: Adversarial Attacks

Page 27: Level 1 attack: test item (x, y) manipulation

min_{x̃} ‖x̃ − x‖_p
s.t. θ(x̃) ≠ y.

Model θ fixed.
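For a fixed linear classifier and p = 2, this problem has a closed form: project x onto the decision boundary and nudge just past it. A sketch under that linear assumption (the weights and the test item are invented):

```python
# Minimal 2-norm test-time attack on a fixed linear model
# theta(x) = sign(w @ x + b).
import numpy as np

def level1_attack(x, w, b, overshoot=1e-6):
    margin = w @ x + b
    # projection of x onto the hyperplane w @ x + b = 0, plus a nudge
    return x - (margin / (w @ w)) * w * (1 + overshoot)

w, b = np.array([1.0, -2.0]), 0.5
x = np.array([2.0, 0.3])             # classified +1
x_adv = level1_attack(x, w, b)
print(np.sign(w @ x_adv + b))        # flipped to -1
print(np.linalg.norm(x_adv - x))     # the smallest possible ||x' - x||_2
```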

Page 28: Level 2 attack: training set poisoning

min_D ‖D₀ − D‖_p
s.t. Ψ(A(D))

e.g. Ψ(θ) := [θ(x + ε) = y′]

Page 29: Level 2 attack on regression

Lake Mendota, Wisconsin (x, y)

[Scatter plot: ice days (40-140) vs. year (1900-2000).]

Page 30: Level 2 attack on regression

min_{δ,β} ‖δ‖_p
s.t. β₁ ≥ 0
     β = argmin_β ‖(y + δ) − Xβ‖²

[Two plots of ice days vs. year: minimizing ‖δ‖₂² spreads small changes over many points; minimizing ‖δ‖₁ concentrates large changes on a few points. Both move the fitted slope from β₁ = −0.1 (original) to β₁ = 0.]

[Mei, Z 15a]
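Since the least-squares estimate is linear in the training targets, the 2-norm version of this attack reduces to projecting onto a single linear constraint: pick the minimum-norm δ with a·(y + δ) = 0, where a is the row of the pseudoinverse that produces the slope. A sketch on synthetic stand-in data (not the actual Lake Mendota series):

```python
# 2-norm training-set poisoning of ordinary least squares.
import numpy as np

years = np.arange(1900, 2001, dtype=float)
X = np.column_stack([np.ones_like(years), (years - years.mean()) / 50.0])
y = 100 - 10 * X[:, 1] + np.random.default_rng(0).normal(0, 8, len(years))

a = np.linalg.pinv(X)[1]           # maps targets y to the slope beta_1
print("original slope:", a @ y)    # negative trend
delta = -(a @ y) / (a @ a) * a     # min ||delta||_2 s.t. a @ (y+delta) = 0
print("poisoned slope:", a @ (y + delta))   # ~ 0, trend erased
print("attack norm:", np.linalg.norm(delta))
```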

Page 31: Level 2 attack on latent Dirichlet allocation

[Mei, Z 15b]

Page 32: Guess the classification task

Ready?

Page 33: Guess the classification task (1)

Page 34: Guess the classification task (2)

Page 35: Guess the classification task (3)

+ The Angels won their home opener against the Brewers today before 33,000+ at Anaheim Stadium, 3-1 on a 3-hitter by Mark Langston.

+ I'm *very* interested in finding out how I might be able to get two tickets for the All Star game in Baltimore this year.

+ I know there's been a lot of talk about Jack Morris' horrible start, but what about Dennis Martinez. Last I checked he's 0-3 with 6+ ERA. ...

- Where are all the Bruins fans??? Good point - there haven't even been any recent posts about Ulf!

- I agree thouroughly!! Screw the damn contractual agreements! Show the exciting hockey game. They will lose fans of ESPN

- TV Coverage - NHL to blame! Give this guy a drug test, and some Ridalin whale you are at it.

...

Page 36: Did you get it right? (1)

gun vs. phone

Page 37: Did you get it right? (2)

woman vs. man

Page 38: Did you get it right? (3)

20Newsgroups soc.religion.christian vs. alt.atheism

+ : T H E  W I T N E S S  &  P R O O F  O F :
  : J E S U S  C H R I S T ' S  R E S U R R E C T I O N :
  : F R O M  T H E  D E A D :

+ I've heard it said that the accounts we have of Christs life and ministry in the Gospels were actually written many years after

- An Introduction to Atheism by mathew <[email protected]>

- Computers are an excellent example ... of evolution without "a" creator.

Page 39: Camouflage attack

Social engineering against Eve

Page 40: Camouflage attack

Alice knows

- S (e.g. women, men)
- C (e.g. 7, 1)
- A
- Eve's inspection function MMD (maximum mean discrepancy)

finds

argmin_{D⊆C} Σ_{(x,y)∈S} ℓ(A(D), x, y)
s.t. MMD(D, C) ≤ α
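Eve's statistic is easy to compute. A sketch of a (biased) squared-MMD estimate with an RBF kernel on made-up data, showing that a subset of the cover pool C passes inspection while off-distribution data does not:

```python
# Squared maximum mean discrepancy with an RBF kernel.
import numpy as np

def mmd2(D, C, sigma=1.0):
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    # biased V-statistic estimate of MMD^2(D, C)
    return k(D, D).mean() + k(C, C).mean() - 2 * k(D, C).mean()

rng = np.random.default_rng(0)
C = rng.normal(0, 1, (200, 5))                  # cover pool
D_like = C[rng.choice(200, 50, replace=False)]  # subset of the pool
D_odd = rng.normal(2, 1, (50, 5))               # off-distribution data
print(mmd2(D_like, C), mmd2(D_odd, C))          # small vs. large
```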

Page 41: Test set error

[Plot: 0/1 test error vs. attack budget (0 to 200,000); legend: baseline1, baseline2, 20, 100, 500.]

(Gun vs. Phone) camouflaged as (5 vs. 2)

Page 42: Enhance human learning

Page 43: “Hedging”

1. Find D∗ to maximize accuracy on cognitive model A
2. Give humans D∗
   - either human performance is improved
   - or cognitive model A is revised

Page 44: Human learning example 1 [Patil et al. 2014]

[1-D stimulus axis from 0 to 1 with target threshold θ∗ = 0.5; y = −1 to its left, y = 1 to its right.]

A = kernel density estimator

human trained on | human test accuracy
random items     | 69.8%
D∗               | 72.5%

(statistically significant)

Page 45: Human learning example 2 [Sen et al., in preparation]

[Stimulus panels: Lewis; space-filling.]

A = neural network

human trained on | human test error
random           | 28.6%
expert           | 28.1%
D∗               | 25.1%

(statistically significant)

Page 46: Human learning example 3 [Nosofsky & Sanders, Psychonomics 2017]

A = Generalized Context Model (GCM)

human trained on | human accuracy
random           | 67.2%
coverage         | 71.2%
D∗               | 69.3%

D∗ not better on humans (experts revising the model)

Page 47: Super Teaching


Page 49: Super teaching example 1

Let D ∼iid U(0, 1), A(D) = SVM.

whole training set: O(n⁻¹)
most symmetrical pair: O(n⁻²)
(not training set reduction)
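A sketch of the construction, assuming the usual reading of this example: labels come from the true threshold 1/2, the 1-D "SVM" returns the midpoint of the innermost oppositely-labeled pair, and the teacher selects the pair whose midpoint is closest to 1/2:

```python
# Super teaching: a chosen pair beats the whole i.i.d. training set.
import numpy as np

rng = np.random.default_rng(1)
D = rng.uniform(0, 1, 1000)
neg, pos = D[D < 0.5], D[D >= 0.5]

whole = (neg.max() + pos.min()) / 2        # SVM trained on all of D
mids = (neg[:, None] + pos[None, :]) / 2   # midpoints of every +/- pair
i, j = np.unravel_index(np.abs(mids - 0.5).argmin(), mids.shape)
taught = mids[i, j]                        # the most symmetrical pair

print(abs(whole - 0.5))    # ~ 1/n
print(abs(taught - 0.5))   # ~ 1/n^2, even though S is a subset of D
```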

Page 50: Super teaching example 2

Let D ∼iid N(0, 1), A(D) = (1/|D|) Σ_{x∈D} x.

Theorem: Fix k. For n sufficiently large, with large probability

min_{S⊂D, |S|=k} |A(S)| ≤ k^{k−ε} √k · n^{−k+1/2+2ε} · |A(D)|
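A brute-force sanity check of the theorem's flavor (a hypothetical demo; k = 2 keeps the subset search cheap):

```python
# Among n draws from N(0,1), some size-k subset has a far smaller
# sample mean than the full data set.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 2
D = rng.normal(0, 1, n)

full = abs(D.mean())    # ~ n^{-1/2}
best = min(abs(np.mean(S)) for S in itertools.combinations(D, k))
print(full, best)       # the chosen subset is orders of magnitude closer to 0
```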

Page 51: Thank you

- email me for "Machine Teaching Tutorial"
- http://www.cs.wisc.edu/~jerryzhu/machineteaching/
- Collaborators:
  - Security: Scott Alfeld, Paul Barford
  - HCI: Saleema Amershi, Bilge Mutlu, Jina Suh
  - Programming language: Aws Albarghouthi, Loris D'Antoni, Shalini Ghosh
  - Machine learning: Ran Gilad-Bachrach, Manuel Lopes, Yuzhe Ma, Christopher Meek, Shike Mei, Robert Nowak, Gorune Ohannessian, Philippe Rigollet, Ayon Sen, Patrice Simard, Ara Vartanian, Xuezhou Zhang
  - Optimization: Ji Liu, Stephen Wright
  - Psychology: Bradley Love, Robert Nosofsky, Martina Rau, Tim Rogers


Page 53: Yet another example: teach Gaussian density

TD = d + 1: tetrahedron vertices

Page 54: Proposed bugs

- flipping them makes the re-trained model agree with trusted items
- given to experts to interpret

[Scatter plot: education vs. magical heritage with the proposed label flips marked.]

Page 55: The ML pipeline

data (X, Y) → learner ℓ → parameters λ → model θ

θ = argmin_{θ∈Θ} ℓ(X, Y, θ) + λ‖θ‖

Page 56: Postconditions

Ψ(θ)

Examples:

- "the learned model must correctly predict an important item (x, y)":
  θ(x) = y
- "the learned model must satisfy individual fairness":
  ∀x, x′: |p(y = 1 | x, θ) − p(y = 1 | x′, θ)| ≤ L‖x − x′‖

Page 57: Bug Assumptions

- Ψ would be satisfied if we were to train through the "clean pipeline"
- bugs are changes to the clean pipeline
- Ψ is violated on the dirty pipeline

Page 58: Debugging formulation

min_{Y′} ‖Y′ − Y‖
s.t. θ(X) = Y (on the trusted items)
     θ = argmin_{θ∈Θ} (1/n) Σ_{i=1}^{n} ℓ(x_i, y′_i, θ) + λ‖θ‖²

- bilevel optimization (Stackelberg game)
- combinatorial

Page 59: Another special case: bug in regularization weight

[Plot: pass/fail vs. score (0-100), with a logistic regression fit.]

Page 60: Postcondition violated

Ψ(θ): Individual fairness (Lipschitz condition)

∀x, x′: |p(y = 1 | x, θ) − p(y = 1 | x′, θ)| ≤ L‖x − x′‖

Page 61: Bug assumption

Learner's regularization weight λ = 0.001 was inappropriate

θ = argmin_{θ∈Θ} ℓ(X, Y, θ) + λ‖θ‖²

Page 62: Debugging formulation

min_{λ′, θ} (λ′ − λ)²
s.t. Ψ(θ) = true
     θ = argmin_{θ∈Θ} ℓ(X, Y, θ) + λ′‖θ‖²
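One way to search for λ′ in this special case: for logistic regression, p(y = 1 | x, θ) is Lipschitz in x with constant ‖w‖/4, so the fairness postcondition can be checked directly and λ′ grown until it holds. A sketch with a toy gradient-descent learner and invented data (the talk's debugger solves the problem exactly; this only shows the idea):

```python
# Grow lambda' until the Lipschitz fairness postcondition holds.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 2))
y = (X[:, 0] + 0.5 * rng.normal(0, 1, 200) > 0).astype(float)

def train(lam, steps=2000, lr=0.1):
    w = np.zeros(2)
    for _ in range(steps):  # gradient descent on regularized logistic loss
        p = 1 / (1 + np.exp(-(X @ w)))
        w -= lr * (X.T @ (p - y) / len(y) + 2 * lam * w)
    return w

L = 0.25       # required Lipschitz constant in Psi
lam = 0.001    # the buggy value: ||w||/4 > L, so Psi fails
while np.linalg.norm(train(lam)) / 4 > L:
    lam *= 2   # smallest power-of-two repair
print("suggested lambda':", lam)
```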

Page 63: Suggested bug

λ = 0.001 → λ′ = 121

[Plot: pass/fail vs. score (0-100) with the model re-trained at λ′.]

Page 64: Guaranteed defense?

Let A(D₀)(x) = y.

The attacker can use the debugging formulation:

D₁ := argmin_D ‖D₀ − D‖_p
s.t. Ψ₁(A(D)) := [A(D)(x) ≠ y]

The defender can use the debugging formulation, too:

D₂ := argmin_D ‖D₁ − D‖_p
s.t. Ψ₂(A(D)) := [A(D)(x) = y]

When does D₂ = D₀?