Monte Carlo Tree Search for Scheduling Activity Recognition
Source: web.engr.oregonstate.edu/~sinisa/talks/iccv13_AOG_presentation.pdf

Transcript

• Selection: select state $S_{0:t}$.
• Simulation: update the expected utility, given the simulated policy's reward $r(\Pi_{sim})$.
• Backpropagation: propagate the expected utility back to the root node by updating all the nodes on the path using:

$$Q'(S_{0:t}, a_t) \leftarrow Q'(S_{0:t}, a_t) + C\, r(\Pi_{sim}), \quad \forall\, S_{0:t} \in \Pi_{sim},\ t = 1, 2, \dots, B$$
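A minimal Python sketch of this backpropagation step, assuming $Q'$ values live in a dictionary keyed by (state, action) pairs; the names `q_table`, `path`, and the chosen value of the constant `C` are illustrative, not from the poster.

```python
# Sketch of the backpropagation update above: every (state, action) node
# on the path from root to leaf gets C * r(Pi_sim) added to its Q' value.
# All names here are illustrative placeholders.

def backpropagate(q_table, path, reward, C=1.0):
    """Propagate the simulation reward back to the root.

    q_table: dict mapping (state, action) -> Q' value
    path:    list of (state, action) pairs visited from root to leaf,
             i.e. (S_{0:t}, a_t) for t = 1..B
    reward:  r(Pi_sim), the indicator reward of the simulated policy
    C:       scaling constant from the update rule
    """
    for state, action in path:
        q_table[(state, action)] = q_table.get((state, action), 0.0) + C * reward


# Usage: after one selection + simulation pass, update all nodes on the path.
q = {}
path = [(("s0",), "a0"), (("s0", "a0"), "a1")]
backpropagate(q, path, reward=1.0, C=0.5)
print(q)  # {(('s0',), 'a0'): 0.5, (('s0', 'a0'), 'a1'): 0.5}
```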

Results

References

[4] M. R. Amer, D. Xie, M. Zhao, S. Todorovic, S.-C. Zhu. "Cost-sensitive top-down/bottom-up inference for multi-scale activity recognition." ECCV 2012.
[6] W. Choi, S. Savarese. "A unified framework for multi-target tracking and collective activity recognition." ECCV 2012.
[7] W. Choi, K. Shahid, S. Savarese. "What are they doing?: Collective activity classification using spatio-temporal relationship among people." ICCV Workshops 2009.
[9] S. Khamis, V. Morariu, L. S. Davis. "Combining per-frame and per-track cues for multi-person action recognition." ECCV 2012.

Acknowledgment
NSF IIS 1018490, ONR MURI N00014-10-1-0933, DARPA MSEE FA 8650-11-1-7149.

Monte Carlo Tree Search for Scheduling Activity Recognition
Mohamed R. Amer, Sinisa Todorovic, Alan Fern (Oregon State University, Corvallis, OR, USA)
Song-Chun Zhu (University of California, Los Angeles, CA, USA)

Problem Statement
Given: a large, high-resolution video with multiple activities happening at the same time, and a query.
Goal: perform cost-sensitive parsing to answer the query.

Approach

Activity Parsing as a Search Problem

Motivation
Querying surveillance videos requires running many detectors of:
• group activities,
• individual actions,
• objects.
Running all the detectors is inefficient, as they may provide little to no information for answering the query.

Parsing spatiotemporal And-Or Graphs can be defined in terms of (α, β, γ, ω):
• α: detecting objects, individual actions, and group activities directly from video features.
• β: predicting an activity from its detected parts by bottom-up composition.
• γ: top-down prediction of an activity using its context.
• ω: predicting an activity at a given time based on tracking across the video.

$$\log p(\boldsymbol{pg}_l^{\tau}) \propto w_\alpha^T \phi(x_l^{\tau}, y, h) + w_\gamma^T \phi(x_{l-}^{\tau}, y, h) + \sum_{i,j} w_\beta^T \phi(x_i^{l+,\tau}, x_j^{l+,\tau}, y, h) + w_\omega^T \phi(x_l^{\tau}, x_l^{\tau-}, y, h)$$

where α is the detector of the activity, γ the context of the activity, β the relations between parts of the activities, and ω the tracking of the activity.
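As an illustration of this scoring, a minimal Python sketch that sums the four linear terms $w_p^T \phi_p$; the feature vectors, weights, and dimensionality are hypothetical placeholders, since the poster does not specify how $\phi$ is computed.

```python
import numpy as np

# Illustrative sketch of the parse-graph score: log p(pg) is proportional
# to a sum of linear terms, one per inference process (alpha, beta, gamma,
# omega). The feature vectors below are random placeholders; in the actual
# system they would be computed from video features, detected parts,
# context, and tracks.

rng = np.random.default_rng(0)
D = 8  # feature dimensionality (assumed)

# One learned weight vector per process type (placeholders).
w = {p: rng.normal(size=D) for p in ("alpha", "beta", "gamma", "omega")}

def log_posterior(features):
    """Sum w_p^T phi over the available process terms.

    features: dict mapping process name -> list of feature vectors phi
              (beta may contribute several pairwise terms, hence a list).
    """
    return sum(w[p] @ phi for p, phis in features.items() for phi in phis)

# Example: one term each for alpha, gamma, omega, two pairwise beta terms.
features = {
    "alpha": [rng.normal(size=D)],
    "gamma": [rng.normal(size=D)],
    "beta":  [rng.normal(size=D), rng.normal(size=D)],
    "omega": [rng.normal(size=D)],
}
print(log_posterior(features))
```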

Figure: the three search phases (Selection, Simulation, Backpropagation).

Learning

Figures: classification accuracy on the Collective Activity dataset, the New Collective Activity dataset, and the UCLA Courtyard dataset, each under a limited budget and an infinite budget.

Legend:
• [6]: tracking detections.
• (V3), (QL), [4]: parse graph with no tracking.
• (V2), [9]: parse graph with tracking the query.
• (V1): parse graph with tracking all (new, from [4]).
Action: $a_t \in \{(\text{process}, \text{detector}, \text{spatiotemporal volume})\}$,
State: $S_{0:t} = s_0(a_0, a_1, \dots, a_{t-1})$,
Policy: $\Pi = \{(s_0, a_{0:t}) : t = 1, 2, \dots, B\}$, where $B$ is the inference budget,
Reward: $r(\Pi) = \mathbb{I}[\log p(\boldsymbol{pg}(\Pi)) > \theta]$, where $\theta$ is an input threshold.
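A minimal sketch of these MDP ingredients in Python; the field types and names (`Action`, `State`, `reward`) are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Tuple

# Sketch of the MDP formulation above. An action names which inference
# process to run, with which detector, on which spatiotemporal volume;
# a state is the initial state plus the action history; the reward is an
# indicator that the parse-graph score clears the threshold theta.

@dataclass(frozen=True)
class Action:
    process: str             # one of "alpha", "beta", "gamma", "omega"
    detector: str            # e.g. a specific object or action detector
    volume: Tuple[int, ...]  # a spatiotemporal volume of the video

@dataclass(frozen=True)
class State:
    history: Tuple[Action, ...]  # S_{0:t} = s_0 plus (a_0, ..., a_{t-1})

def reward(log_pg_score: float, theta: float) -> int:
    """r(Pi) = I[log p(pg(Pi)) > theta]."""
    return int(log_pg_score > theta)

# Usage: a toy state after one action, and the indicator reward.
a0 = Action("alpha", "pedestrian-detector", (0, 0, 64, 64))
s = State(history=(a0,))
print(reward(-1.2, theta=-2.0))  # 1 -> accept
```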

Cost-sensitive inference for a policy $\Pi$ is conducted by taking the set of actions $a_{0:B}$ that maximizes the utility $Q$ for states $S_{0:B}$:

$$\Pi = \arg\max_{a} Q(S_{0:t}, a_t), \quad \forall\, t = 1, 2, \dots, B,$$

accept if $r(\Pi) = 1$; reject if $r(\Pi) = 0$.
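A sketch of this budgeted policy extraction, assuming the Q-table produced by the search above; `candidate_actions` and `parse_score` are hypothetical helpers standing in for the detector pool and $\log p(\boldsymbol{pg}(\Pi))$.

```python
# Sketch of policy extraction under a budget B: at each step take the
# action with maximum utility Q(S_{0:t}, a_t), then accept or reject the
# resulting parse by the indicator reward. The Q-table is assumed to come
# from the tree search; `parse_score` is a hypothetical stand-in for
# log p(pg(Pi)).

def extract_policy(q_table, candidate_actions, parse_score, B, theta):
    state = ()   # s_0: empty action history
    policy = []
    for _ in range(B):
        # argmax_a Q(S_{0:t}, a_t) over the remaining candidate actions
        action = max(candidate_actions,
                     key=lambda a: q_table.get((state, a), 0.0))
        policy.append(action)
        candidate_actions = [a for a in candidate_actions if a != action]
        state = state + (action,)
    accepted = parse_score(policy) > theta  # r(Pi) = 1 -> accept
    return policy, accepted

# Usage with toy values:
q = {((), "d1"): 0.9, (("d1",), "d2"): 0.4}
policy, ok = extract_policy(q, ["d1", "d2"], lambda p: 1.0, B=2, theta=0.5)
print(policy, ok)  # ['d1', 'd2'] True
```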
