ε-Optimal Minimum-Delay/Area Zero-Skew Clock Tree Wire-Sizing in Pseudo-Polynomial Time

Post on 09-Jan-2016



1

ε-Optimal Minimum-Delay/Area Zero-Skew Clock Tree Wire-Sizing in Pseudo-Polynomial Time

Jeng-Liang Tsai

Tsung-Hao Chen

Charlie Chung-Ping Chen (National Taiwan University)

University of Wisconsin-Madison
http://vlsi.ece.wisc.edu

2

Outline

Background
• Motivation and contribution
• Literature overview

ClockTune algorithm
• Problem formulation
• ClockTune algorithm overview
• Optimality and complexity analysis

Experimental results
• Runtime, memory usage, and optimality
• Power/Delay trade-off
• Incremental refinement

3

Motivation

Clock skew incurs a cycle-time penalty
• Start with a zero-skew clock tree
• Minimizing clock delay reduces system-level skew (Kuh, et al. [DAC ’90])

The clock tree is power-hungry (30% in Intel McKinley, 0.18 µm / 1 GHz / 130 W)
• P = f·C·V²
• Minimize switching capacitance (wiring area)

Stability affects design convergence
• Allow incremental refinement to accommodate local changes

Interconnect delay dominates total delay
• Wire-sizing is effective in reducing interconnect delay

4

Motivation

Non-convex zero-skew constraints
• No known algorithm solves the zero-skew wire-sizing problem optimally with polynomial runtime

Hence, a good clock tree wire-sizing algorithm should
• Minimize delay and power
• Guarantee optimality and runtime
• Have good stability

5

Contribution

• First ε-optimal algorithm for the clock min-delay/power zero-skew wire-sizing optimization problem
• Provides the complete (sampled) solution set of delay/power/area trade-off information for design planning
• Efficient pseudo-polynomial runtime (6170-branch clock tree in 6 minutes, within 1% of optimal)
• Runtime vs. optimality trade-off
• Incremental clock re-balancing to speed up design convergence

6

Literature Overview

“Reliable non-zero skew clock tree using wire width optimization”, Pillage, et al. [DAC ’93]
• Iteratively optimizes skew and delay using adjoint sensitivity analysis
• Aimed at reliable clock trees under process variation

Deferred Merging Embedding (DME) algorithm, Kahng, et al. [TCAD ’92]
• Bottom-up merging segment construction, top-down embedding

Integrated Deferred Merging Embedding (IDME) algorithm, Wong, et al. [ISPD ’00]
• Handles simultaneous routing, buffer insertion, and wire-sizing
• Merging segment set (MSS): a set of line samples of a merging region
• No optimality guarantee
• The size of the MSS grows exponentially

“Process variation aware clock tree routing”, Lu, et al. [ISPD ’03]
• Based on DME/BST

7

Outline

Background
• Motivation and contribution
• Literature overview

ClockTune algorithm
• Problem formulation
• ClockTune algorithm overview
• Optimality and complexity analysis

Experimental results
• Runtime, memory usage, and optimality
• Power/Delay trade-off
• Incremental refinement

8

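Problem formulation

min-ZSWS (Zero Skew Wire Sizing) problem
• Given a clock routing T_v:

\[
\begin{aligned}
\text{minimize}\quad & \alpha_1\,\mathrm{area}(T_v(w)) + \alpha_2\,\mathrm{delay}(T_v(w)) \\
\text{s.t.}\quad & \mathrm{area}(T_v(w)) \le \mathrm{Area}_{\mathrm{Max}} \\
& \mathrm{delay}(T_v(w)) \le \mathrm{Delay}_{\mathrm{Max}} \\
& \mathrm{delay}(P_i, w) = \mathrm{delay}(P_j, w) \quad \forall\, i, j \quad \text{(zero-skew constraints)} \\
& w_m \le w \le w_M
\end{aligned}
\]

where P_i, P_j are the paths from v to leaf nodes i and j, and the subscripts 1 and 2 mark the weights on the area and delay terms (written here as α).

Zero-skew constraints are non-convex constraints
• No known algorithm solves the problem optimally in polynomial runtime

[Figure: clock tree T_v rooted at v, with paths P_i and P_j to leaf nodes i and j]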
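As a rough illustration of the zero-skew constraint delay(P_i, w) = delay(P_j, w), the sketch below evaluates the skew between two root-to-sink wires and solves for the width that cancels it. The slides do not name a delay model on this slide, so the Elmore model, the function names, and the sample lengths and load are assumptions; r0 and c0 use the values quoted in the experimental setup, with units assumed.

```python
R0 = 0.03    # assumed sheet resistance (ohm per square)
C0 = 2e-16   # assumed area capacitance (F per um^2)

def path_delay(length, width, load_cap):
    """Elmore delay (s) of one wire (length, width in um) driving load_cap (F)."""
    resistance = R0 * length / width
    capacitance = C0 * length * width
    return resistance * (capacitance / 2.0 + load_cap)

def skew(wires):
    """Largest minus smallest delay over wires given as (length, width, load_cap)."""
    delays = [path_delay(*wire) for wire in wires]
    return max(delays) - min(delays)

def balancing_width(target_delay, length, load_cap):
    """Width that makes the wire's Elmore delay equal target_delay, or None."""
    fixed = R0 * C0 * length ** 2 / 2.0   # width-independent part of the delay
    if target_delay <= fixed:
        return None                        # no positive width can reach the target
    return R0 * length * load_cap / (target_delay - fixed)

# Hypothetical example: path i is 1000 um at width 2 um, path j is 900 um.
load = 5e-14
d_i = path_delay(1000.0, 2.0, load)
w_j = balancing_width(d_i, 900.0, load)
print(w_j, skew([(1000.0, 2.0, load), (900.0, w_j, load)]))
```

Under this model the width enters the delay through a 1/w term, so the equality constraint is not affine in w, which is consistent with the non-convexity noted on the slide.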

9

DC region approach

Clock delay (D) and wiring capacitance (C) are the top concerns
• Define f : R^N → R^2 such that f_Y(w) = Delay(T_v(w)) and f_X(w) = Capacitance(T_v(w))
• DC region of node v: the projection of the feasible region onto R^2 under f
• Choose a d-c pair from the DC region on R^2

[Figure: for a tree T_v with wire widths w1 to w6, f : R^6 → R^2 maps the feasible region to the DC region in the C-D plane]
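The projection can be made concrete with a brute-force sketch: enumerate sampled widths for a tiny tree, keep the zero-skew combinations, and record their (capacitance, delay) pairs. Everything here is an assumption for illustration (a trunk feeding two equal branches, the Elmore delay model, the lengths, the load, and the four width samples); it is not the construction used by ClockTune.

```python
from itertools import product

R0, C0 = 0.03, 2e-16            # assumed sheet resistance (ohm/sq), area capacitance (F/um^2)
WIDTHS = [1.0, 2.0, 3.0, 4.0]   # sampled widths between wm = 1 um and wM = 4 um
TRUNK, BRANCH, LOAD = 2000.0, 1000.0, 5e-14   # hypothetical lengths (um) and sink load (F)

def wire(length, width):
    """Return (resistance, capacitance) of a wire segment."""
    return R0 * length / width, C0 * length * width

def f(w):
    """Map widths w = (trunk, left, right) to (capacitance, delay, skew)."""
    r_t, c_t = wire(TRUNK, w[0])
    branches = [wire(BRANCH, wi) for wi in w[1:]]
    down = sum(c for _, c in branches) + 2 * LOAD      # load seen below the trunk
    trunk_delay = r_t * (c_t / 2.0 + down)
    delays = [trunk_delay + r * (c / 2.0 + LOAD) for r, c in branches]
    return c_t + down, max(delays), max(delays) - min(delays)

def dc_region(tol=1e-15):
    """Sampled DC region: (capacitance, delay, widths) of every zero-skew sample."""
    points = []
    for w in product(WIDTHS, repeat=3):
        cap, delay, skew = f(w)
        if skew <= tol:
            points.append((cap, delay, w))
    return points

region = dc_region()
fastest = min(region, key=lambda p: p[1])
smallest = min(region, key=lambda p: p[0])
print(len(region), fastest[2], smallest[2])
```

Choosing a d-c pair then amounts to picking one element of `region`, for example the minimum-delay or the minimum-capacitance point.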

10

ClockTune algorithm overview

• Phase 1: bottom-up construction of DC regions for every node
• Phase 2: top-down embedding after the delay/power trade-off

[Figure: (a) bottom-up construction and (b) top-down embedding on an example tree, with a DC region plotted on C and D axes at each node]
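The two phases can be sketched as a toy dynamic program. This is not the ClockTune implementation: it assumes the Elmore delay model, keeps only the minimum capacitance per delay sample rather than a full region, and rounds each delay to the nearest sample instead of discarding off-sample embeddings. The node layout (plain dictionaries), names, and constants are all hypothetical.

```python
R0, C0 = 0.03, 2e-16            # assumed sheet resistance (ohm/sq), area capacitance (F/um^2)
WIDTHS = [1.0, 2.0, 3.0, 4.0]   # sampled wire widths (um)
STEP = 1e-13                    # spacing of the delay samples (s)

def edge_delay(length, width, down_cap):
    """Elmore delay of one edge driving down_cap."""
    r = R0 * length / width
    c = C0 * length * width
    return r * (c / 2.0 + down_cap)

def bottom_up(node):
    """Phase 1: fill node['table'], mapping delay sample -> (capacitance, choices).

    A sink is {'load': F}; an internal node is {'children': [(subtree, length), ...]}.
    """
    if 'children' not in node:
        node['table'] = {0: (node['load'], [])}
        return
    branches = []
    for child, length in node['children']:
        bottom_up(child)
        best = {}
        for k_child, (cap_child, _) in child['table'].items():
            for w in WIDTHS:
                d = k_child * STEP + edge_delay(length, w, cap_child)
                k = round(d / STEP)                    # snap to the nearest delay sample
                cap = cap_child + C0 * length * w
                if k not in best or cap < best[k][0]:
                    best[k] = (cap, (k_child, w))
        branches.append(best)
    # Zero skew: every branch must reach the same delay sample.
    common = set(branches[0]).intersection(*branches[1:])
    node['table'] = {
        k: (sum(b[k][0] for b in branches), [b[k][1] for b in branches])
        for k in common
    }

def top_down(node, k, out):
    """Phase 2: follow the stored choices from delay sample k, collecting (length, width)."""
    if 'children' not in node:
        return
    _, choices = node['table'][k]
    for (child, length), (k_child, w) in zip(node['children'], choices):
        out.append((length, w))
        top_down(child, k_child, out)

tree = {'children': [({'load': 5e-14}, 1000.0), ({'load': 5e-14}, 1000.0)]}
bottom_up(tree)
fast_widths, small_widths = [], []
top_down(tree, min(tree['table']), fast_widths)                                      # minimum delay
top_down(tree, min(tree['table'], key=lambda k: tree['table'][k][0]), small_widths)  # minimum capacitance
print(fast_widths, small_widths)
```

The root table plays the role of a sampled DC region: the trade-off is decided once at the root, and the top-down pass only replays the stored choices.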

11

Optimality analysis

Embeddings that do not fall on the delay samples are omitted
• Propagated error
• Delay sampling error
• Wire width sampling error (detailed in the paper)

[Figure: C-D plane showing the DC region, the DC region built from children information, and the sampled DC region, with error components labelled w, d, and p]

12

Optimality analysis

Error is bounded
• d: delay sampling resolution
• w: wire width sampling resolution
• k (and a second constant): constants related to l, r0, c0, wm, wM, …

Generally speaking, the error is reduced by about half when the resolution is doubled

[Figure: error versus resolution, alongside the DC region and the sampled DC region in the C-D plane]

13

Optimality/runtime trade-off

Controlling the sampling resolution trades off optimality against runtime and memory

[Figure: Minimum delay vs. optimal delay. Error (0.0% to 2.0%) against number of samples (128, 256, 512, 1024) for benchmarks r1 to r5]

[Figure: Runtime (min, 0 to 120) against number of nodes (0 to 4000) for p, q = 128, 256, 512, 1024]

14

Complexity analysis

Runtime
• Bottom-up phase takes O(n p max(p, q))
• Top-down phase takes O(n p)
• Overall: O(n p max(p, q))

Memory
• O(n p)

where
n: number of nodes of the clock tree
p: number of delay samples taken at each node
q: number of wire width samples taken at each level-2 node
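To see how these bounds scale, the following sketch simply evaluates the stated expressions with all constant factors dropped. The function names are hypothetical, and the returned numbers are unitless growth terms rather than predicted runtimes or byte counts.

```python
def bottom_up_cost(n, p, q):
    """Growth term of the bottom-up phase: n * p * max(p, q)."""
    return n * p * max(p, q)

def top_down_cost(n, p):
    """Growth term of the top-down phase: n * p."""
    return n * p

def memory_cost(n, p):
    """Growth term of the memory usage: n * p."""
    return n * p

n, p, q = 6170, 256, 256
print(bottom_up_cost(2 * n, p, q) / bottom_up_cost(n, p, q))      # doubling n
print(bottom_up_cost(n, 2 * p, 2 * q) / bottom_up_cost(n, p, q))  # doubling p and q
print(memory_cost(n, 2 * p) / memory_cost(n, p))                  # memory when p doubles
```

With p = q, doubling the resolution multiplies the bottom-up term by four and the memory term by two, while both terms stay linear in n.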

15

Outline

Background
• Motivation and contribution
• Related works
• Problem formulation

ClockTune algorithm
• Design space projection
• Algorithm overview
• Optimality and complexity analysis

Experimental results
• Runtime, memory usage, and optimality
• Power/Delay trade-off
• Incremental refinement

16

Experimental setup

• ClockTune is implemented in C++ and run on a 533 MHz Pentium III PC with 128 MB of memory
• Benchmarks r1 to r5 from Tsay et al. [ICCAD ’91]
• Initial routing generated by the BB+DME algorithm with minimum wire width w = 1 µm
• ClockTune uses wm = 1 µm, wM = 4 µm
• p: number of delay samples taken at every node
• q: number of wire width samples taken at every level-2 node
• r0 = 0.03 Ω, c0 = 2×10⁻¹⁶ F/µm²

17

Runtime and memory usage

• Runtime and memory usage are linear in problem size when p and q are fixed
• Within 1% of optimal when p = q = 256 (runtime under 6 minutes, memory up to about 64 MB)

p, q = 256   # sink nodes   # branches   Runtime (s)   Memory (MB)   Optimality error
r1           267            527          24.1          6.0           0.38%
r2           598            1185         61.0          12.5          0.71%
r3           862            1710         100.0         14.4          0.46%
r4           1903           3787         202.4         38.0          0.57%
r5           3101           6170         339.2         64.0          0.93%
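The linearity claim can be checked directly against the table: if runtime and memory grow linearly with tree size, the per-branch figures should stay roughly constant. The short script below uses only the numbers reported above; the variable and function names are hypothetical.

```python
ROWS = [
    # name, sinks, branches, runtime (s), memory (MB), optimality error (%)
    ("r1", 267, 527, 24.1, 6.0, 0.38),
    ("r2", 598, 1185, 61.0, 12.5, 0.71),
    ("r3", 862, 1710, 100.0, 14.4, 0.46),
    ("r4", 1903, 3787, 202.4, 38.0, 0.57),
    ("r5", 3101, 6170, 339.2, 64.0, 0.93),
]

def per_branch(rows):
    """Return {benchmark: (milliseconds per branch, kilobytes per branch)}."""
    return {
        name: (1000.0 * runtime / branches, 1024.0 * memory / branches)
        for name, _sinks, branches, runtime, memory, _err in rows
    }

costs = per_branch(ROWS)
for name, (ms, kb) in costs.items():
    print(f"{name}: {ms:.1f} ms/branch, {kb:.1f} KB/branch")
print("worst optimality error (%):", max(row[5] for row in ROWS))
```

Across a more than tenfold range of tree sizes, the per-branch runtime stays between about 46 and 58 ms and the per-branch memory between about 9 and 12 KB.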

[Figure: Runtime (min, 0 to 120) against number of nodes (0 to 4000) for p, q = 128, 256, 512, 1024]

[Figure: Memory usage (MB, 0 to 90) against number of nodes (0 to 4000) for p, q = 128, 256, 512, 1024]

18

Optimality results

• Optimality error is below 1% with p = q = 256
• Error is reduced by about half when the resolution is doubled

[Figure: Minimum delay vs. optimal delay. Error (0.0% to 2.0%) against number of samples (128, 256, 512, 1024) for benchmarks r1 to r5]

[Figure: Minimum area vs. optimal area. Error (0.0% to 0.8%) against number of samples (128, 256, 512, 1024) for benchmarks r1 to r5]

19

Power/Delay trade-off

[Figure: delay against capacitance for benchmark r5, spanning the minimum-power point to the minimum-delay point]

• Capacitance: 0.2~1.1 nF
• Delay: 5~150 ns
• 15:1 delay:power trade-off

20

Incremental refinement

DC region captures the design space
• Enables incremental refinement

[Figure: DC regions plotted on C and D axes, with one location marked X]

21

Conclusion & Future Work

We provide a zero-skew clock tree wire-sizing algorithm which
• Minimizes delay and area ε-optimally
• Guarantees pseudo-polynomial runtime and memory usage
• Provides delay/power trade-off information to designers
• Speeds up design convergence by allowing clock tree re-balancing with minimum changes

Future work
• Better delay model
• Buffer insertion/sizing capability

22

Thank you!