ECEN 689 Special Topics in Data Science for Communications...

Nick Duffield Department of Electrical & Computer Engineering Texas A&M University Lecture 15 ECEN 689 Special Topics in Data Science for Communications Networks


Page 1: ECEN 689 Special Topics in Data Science for Communications ...cesg.tamu.edu/wp-content/uploads/2014/09/ECEN689-15.pdf · Special Topics in Data Science for Communications Networks

Nick Duffield Department of Electrical & Computer Engineering

Texas A&M University

Lecture 15

ECEN 689 Special Topics in Data Science for Communications Networks


Finding Heavy Hitters

•  Heavy hitters = frequent items by weight
•  Set of keyed weights
   –  (k_i, x_i), i = 1, 2, …, n
   –  Total weight X = Σ_{i=1}^{n} x_i
   –  Key aggregates X(k) = Σ_{i: k_i = k} x_i
•  ϕ-Heavy Hitter (ϕ-HH)
   –  Any key whose aggregate is at least a fraction ϕ of the total
   –  ϕ-HH = {k: X(k) ≥ ϕX}
•  Challenges
   –  Fast, small space, close approximation
   –  Example: using count-min sketch
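The definition can be checked with an exact computation. The sketch below is a minimal illustration, not the approximate streaming method the slide alludes to: it keeps one counter per key, which is exactly the space cost that sketches are meant to avoid. The function name `heavy_hitters` and the sample records are hypothetical.

```python
from collections import defaultdict


def heavy_hitters(records, phi):
    """Return {key: X(k)} for every key with X(k) >= phi * X (exact, one counter per key)."""
    totals = defaultdict(int)
    for key, weight in records:
        totals[key] += weight            # X(k): sum of the weights carrying key k
    total = sum(totals.values())         # X: total weight
    return {k: x for k, x in totals.items() if x >= phi * total}


# Hypothetical keyed weights (key, bytes); not taken from the lecture.
records = [("a", 40), ("b", 5), ("a", 20), ("c", 25), ("b", 10)]
hh = heavy_hitters(records, phi=0.25)    # X = 100, so the threshold is 25
```

Here key "a" (aggregate 60) and key "c" (aggregate 25) qualify, while "b" (aggregate 15) does not.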


Finding Heavy Hitters in Internet Traffic

•  Traffic anomalies are common
   –  DDoS attacks, flash crowds, worms, failures
•  Traffic anomalies are complicated
   –  Multi-dimensional → may involve multiple header fields
      •  E.g. src IP 1.2.3.4 AND port 1214 (KaZaA)
      •  Looking at individual fields separately is not enough!
   –  Hierarchical → evident only at specific granularities
      •  E.g. 1.2.3.4/32, 1.2.3.0/24, 1.2.0.0/16, 1.0.0.0/8
      •  Looking at fixed aggregation levels is not enough!


Challenges for traffic anomaly detection

•  Immense data volume (esp. during attacks)
   –  Prohibitive to inspect all traffic in detail
•  Multi-dimensional, hierarchical traffic anomalies
   –  Prohibitive to monitor all possible combinations of different aggregation levels on all header fields
•  Sampling (packet level or flow level)
   –  May wash out some details
•  False alarms
   –  Too many alarms = info “snow” → simply get ignored
•  Root cause analysis
   –  What do anomalies really mean?


Implementation Considerations

•  Offline vs. streaming
   –  Offline: can use multiple passes for better accuracy
   –  Streaming: need to find HH prefixes adaptively
•  Sampling vs. exact
   –  E.g. sampled NetFlow records
   –  Need to accommodate estimation inaccuracy in HH thresholding


Looking at traffic aggregates

•  Aggregating on individual packet header fields gives useful results, but
   –  Traffic reports are not always at the right granularity (e.g. individual IP address, subnet, etc.)
   –  Cannot show aggregates defined over multiple fields (e.g. which network uses which application)
•  The traffic analysis tool should automatically find aggregates over the right fields at the right granularity

Rank  Destination IP         Traffic
1     jeff.dorm.bigU.edu     11.9%
2     tracy.dorm.bigU.edu    3.12%
3     risc.cs.bigU.edu       2.83%

Most traffic goes to the dorms …

Rank  Destination network    Traffic
1     library.bigU.edu       27.5%
2     cs.bigU.edu            18.1%
3     dorm.bigU.edu          17.8%

What apps are used?

Rank  Source port            Traffic
1     Web                    42.1%
2     Kazaa                  6.7%
3     Ssh                    6.3%

Where does the traffic come from? ……

[Figure: one report per header field: Src. IP, Src. net, Src. port, Dest. IP, Dest. net, Dest. port, Protocol]

Which network uses web and which one kazaa?


Ideal traffic report

Traffic aggregate                                             Traffic   Interpretation
Web traffic                                                   42.1%     Web is the dominant application
Web traffic to library.bigU.edu                               26.7%     The library is a heavy user of web
Web traffic from www.catsplayingpiano.com                     13.4%     Flash crowd!
ICMP traffic from sloppynet.badU.edu to jeff.dorm.bigU.edu    11.9%     This is a Denial of Service attack !!


Hierarchical Heavy Hitter

•  Input stream S of keyed weights {(k_i, x_i)}, total weight X
•  Keys k_i drawn from a hierarchical domain D of height h
•  Let p denote a prefix in the domain hierarchy
•  Let D(p) be the set of keys in D that are descendants of p
•  X(p) = total weight in S with keys in D(p)
   –  X(p) = Σ_{i: k_i ∈ D(p)} x_i
•  ϕ-Hierarchical Heavy Hitter (ϕ-HHH)
   –  Any prefix p for which X(p) is at least a fraction ϕ of the total weight
   –  ϕ-HHH = {p ∈ D: X(p) ≥ ϕX}
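As a rough illustration of the definition, the sketch below computes X(p) exactly for IPv4 prefixes and keeps those at or above ϕX. It uses the eight keyed weights of the one-dimensional example from this lecture (ϕ = 1/5, total 500). Restricting the hierarchy to prefix lengths /28 through /32 is an assumption made to match that example, and the function names are hypothetical.

```python
import ipaddress
from collections import defaultdict
from fractions import Fraction


def prefix_totals(records, min_len=28, max_len=32):
    """X(p) for every prefix p of length min_len..max_len that covers some key."""
    totals = defaultdict(int)
    for addr, weight in records:
        for length in range(min_len, max_len + 1):
            # strict=False masks the host bits, giving the covering prefix
            p = ipaddress.ip_network(f"{addr}/{length}", strict=False)
            totals[p] += weight
    return totals


def hhh(records, phi, min_len=28, max_len=32):
    """All prefixes whose aggregate X(p) is at least phi * X."""
    totals = prefix_totals(records, min_len, max_len)
    total = sum(w for _, w in records)
    return {str(p): x for p, x in totals.items() if x >= phi * total}


# Keyed weights of the one-dimensional example (address, weight).
records = [("10.0.0.2", 15), ("10.0.0.3", 35), ("10.0.0.4", 30), ("10.0.0.5", 40),
           ("10.0.0.8", 160), ("10.0.0.9", 110), ("10.0.0.10", 35), ("10.0.0.14", 75)]
result = hhh(records, phi=Fraction(1, 5))   # threshold = 100
```

Seven prefixes qualify, including every ancestor of 10.0.0.8 and 10.0.0.9, which is the redundancy that a compressed report is meant to remove.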


One-dimensional example

ϕ = 1/5; total weight = 500; threshold = 100

Level  Prefix          Weight  ≥ threshold
/32    10.0.0.2        15
/32    10.0.0.3        35
/32    10.0.0.4        30
/32    10.0.0.5        40
/32    10.0.0.8        160     yes
/32    10.0.0.9        110     yes
/32    10.0.0.10       35
/32    10.0.0.14       75
/31    10.0.0.2/31     50
/31    10.0.0.4/31     70
/31    10.0.0.8/31     270     yes
/31    10.0.0.10/31    35
/31    10.0.0.14/31    75
/30    10.0.0.0/30     50
/30    10.0.0.4/30     70
/30    10.0.0.8/30     305     yes
/30    10.0.0.12/30    75
/29    10.0.0.0/29     120     yes
/29    10.0.0.8/29     380     yes
/28    10.0.0.0/28     500     yes

Figure labels: AI Lab, 2nd floor, CS Dept



(Compressed) Hierarchical Heavy Hitter

•  Previous setup
   –  Input stream S of keyed weights {(k_i, x_i)}, total weight X
   –  Keys k_i drawn from a hierarchical domain D of height h
   –  Prefix p; descendants D(p); X(p) = total weight under p
•  Define the compressed ϕ-Hierarchical Heavy Hitter (ϕ-HHH’) inductively
   –  At the lowest level in the hierarchy, prefix = key
      •  ϕ-HHH’ = ϕ-HH on the key set
   –  At any higher level
      •  Y(p) = X(p) – Σ {Y(q): q a descendant of p that is a ϕ-HHH’}
      •  p is a ϕ-HHH’ if Y(p) ≥ ϕX


Unidimensional report example

[Figure: hierarchy restricted to the prefixes at or above the threshold: 10.0.0.8 (160), 10.0.0.9 (110), 10.0.0.8/31 (270), 10.0.0.8/30 (305), 10.0.0.0/29 (120), 10.0.0.8/29 (380), 10.0.0.0/28 (500)]

Compression
•  10.0.0.8/30: 305 − 270 < 100, omitted
•  10.0.0.8/29: 380 − 270 ≥ 100, reported

Source IP      Traffic
10.0.0.0/29    120
10.0.0.8/29    380
10.0.0.8       160
10.0.0.9       110

Rule: omit clusters with traffic within error T of more specific clusters in the report
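The report above can be reproduced with a bottom-up pass. The sketch below is one possible reading of the inductive definition, assuming the discount for a prefix p runs over every descendant of p that has already been selected; that reading is the one that yields the four rows of this report. The /28 to /32 range and all names are assumptions for illustration.

```python
import ipaddress
from fractions import Fraction


def compressed_hhh(records, phi, min_len=28, max_len=32):
    """Return {prefix: (X(p), Y(p))} for prefixes whose discounted weight Y(p) >= phi * X."""
    total = sum(w for _, w in records)
    threshold = phi * total
    x = {}                                   # X(p) for every covering prefix
    for addr, w in records:
        for length in range(min_len, max_len + 1):
            p = ipaddress.ip_network(f"{addr}/{length}", strict=False)
            x[p] = x.get(p, 0) + w
    selected = {}                            # prefix -> Y(p), selected prefixes only
    # Longest prefixes first, so descendants are decided before their ancestors.
    for p in sorted(x, key=lambda n: n.prefixlen, reverse=True):
        discount = sum(y for q, y in selected.items() if q.subnet_of(p))
        y = x[p] - discount
        if y >= threshold:
            selected[p] = y
    return {str(p): (x[p], y) for p, y in selected.items()}


records = [("10.0.0.2", 15), ("10.0.0.3", 35), ("10.0.0.4", 30), ("10.0.0.5", 40),
           ("10.0.0.8", 160), ("10.0.0.9", 110), ("10.0.0.10", 35), ("10.0.0.14", 75)]
report = compressed_hhh(records, phi=Fraction(1, 5))
```

For 10.0.0.8/29 the discounted weight is 380 − 270 = 110, so it stays in the report with its full aggregate of 380, while 10.0.0.8/31, 10.0.0.8/30 and 10.0.0.0/28 drop out.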


Multidimensional structure

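[Figure: two hierarchies
   Source net:  All traffic → US, EU;  US → CA, NY;  EU → GB, DE
   Application: All traffic → Web, Mail]

•  Nodes (clusters) have multiple parents (figure: the cluster "US Web" sits under both US and Web)
•  Nodes (clusters) overlap (figure labels: US, Web, CA)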


Frequent Itemset Mining in Hierarchies

•  Consider flow record = basket
•  Key fields = items
   –  Basket = {SrcIP, DstIP, SrcPrt, DstPrt, Proto, …}
•  Vanilla frequent itemset mining (FIM) has no recognition of hierarchies
   –  An HHH whose weight is distributed over the addresses under a prefix is not recognized
•  Challenge: how to make FIM hierarchy-aware?


Hierarchical expansion

•  Expand IP addresses as the full set of prefixes
•  32-bit IP address → 25 prefixes of length 8 through 32
   –  Prefixes shorter than 8 bits have not been assigned
•  Full expansion
   –  Replace the IP addresses in each “basket” by the list of 25 prefixes
   –  Apply frequent itemset mining to find frequent prefixes
•  Drawbacks
   –  Full expansion is rarely necessary in practice
   –  Most heavy hitters are not 32-bit addresses
   –  If a prefix is known not to be a heavy hitter, there is no need to expand it further
      •  Downward closure property
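Full expansion is easy to sketch with the standard `ipaddress` module. The example below is illustrative only: it expands each address into its 25 prefixes and counts them with unit weight per flow, standing in for a real frequent itemset miner. The function names and the three sample addresses are hypothetical.

```python
import ipaddress
from collections import Counter


def expand_prefixes(addr, min_len=8, max_len=32):
    """All prefixes of length min_len..max_len covering addr (25 by default)."""
    return [str(ipaddress.ip_network(f"{addr}/{length}", strict=False))
            for length in range(min_len, max_len + 1)]


def frequent_prefixes(addresses, min_count):
    """Count every expanded prefix and keep those seen at least min_count times."""
    counts = Counter(p for a in addresses for p in expand_prefixes(a))
    return {p: c for p, c in counts.items() if c >= min_count}


prefixes = expand_prefixes("10.0.0.9")
frequent = frequent_prefixes(["10.0.0.9", "10.0.0.8", "10.0.1.1"], min_count=2)
```

Every address costs 25 counter updates here, even though the /32 prefixes in this sample are never frequent, which is the inefficiency listed under Drawbacks.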


Progressive Expansion

•  Don’t expand all prefixes initially
•  A prefix of length k is explored only if its parent of length k−1 is frequent
   –  Based on the downward closure property
•  Drawback
   –  Computation cost: may need to expand prefixes as many as 25 times
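A level-wise search makes the downward closure argument concrete. The sketch below is a simplified stand-in for a real miner, assuming unit weight per flow and source addresses only: at each length it counts prefixes only for addresses whose parent prefix was frequent, and stops when a level has no frequent prefix. Names and sample data are hypothetical.

```python
import ipaddress
from collections import Counter


def progressive_expansion(addresses, min_count, min_len=8, max_len=32):
    """Explore length-k prefixes only under frequent length-(k-1) parents."""
    frequent = {}
    candidates = list(addresses)          # addresses still under a frequent parent
    for length in range(min_len, max_len + 1):
        counts = Counter(
            ipaddress.ip_network(f"{a}/{length}", strict=False) for a in candidates
        )
        level = {p: c for p, c in counts.items() if c >= min_count}
        if not level:
            break                          # nothing frequent: no deeper level can be
        frequent.update({str(p): c for p, c in level.items()})
        # Drop addresses whose prefix at this level is not frequent.
        candidates = [a for a in candidates
                      if ipaddress.ip_network(f"{a}/{length}", strict=False) in level]
    return frequent


addresses = ["10.0.0.8", "10.0.0.9", "10.0.1.1", "192.168.1.1"]
frequent = progressive_expansion(addresses, min_count=2)
```

Each loop iteration corresponds to one pass over the remaining data, so up to 25 passes may be needed, which is the computation cost noted above.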


k-by-k Progressive Expansion

•  Expand prefixes in length blocks of k
•  8, 9, …, 8+k−1; 8+k, 8+k+1, …, 8+2k−1; etc.
•  Each block
   –  FIM for heavy hitters
•  Trade-off
   –  Exploration of non-HHH prefixes
      •  Descendants of non-HHH prefixes in the same block
   –  Reduced number of passes
      •  FIM done jointly on all levels within a block
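The trade-off can be seen in a small sketch. Assuming unit weight per flow and a plain counter in place of a real miner, the hypothetical function below counts all prefix lengths of a block in one pass and prunes candidates only at the last length of each block, so some descendants of infrequent prefixes are still counted.

```python
import ipaddress
from collections import Counter


def net(addr, length):
    """Covering prefix of the given length."""
    return ipaddress.ip_network(f"{addr}/{length}", strict=False)


def k_by_k_expansion(addresses, min_count, k, min_len=8, max_len=32):
    """Return (frequent prefixes, number of passes) using blocks of k prefix lengths."""
    lengths = list(range(min_len, max_len + 1))
    blocks = [lengths[i:i + k] for i in range(0, len(lengths), k)]
    frequent, passes = {}, 0
    candidates = list(addresses)
    for block in blocks:
        if not candidates:
            break
        passes += 1                                   # one pass per block
        counts = Counter(net(a, n) for a in candidates for n in block)
        hits = {p: c for p, c in counts.items() if c >= min_count}
        frequent.update({str(p): c for p, c in hits.items()})
        # Prune only at the block boundary (longest length in the block).
        candidates = [a for a in candidates if net(a, block[-1]) in hits]
    return frequent, passes


addresses = ["10.0.0.8", "10.0.0.9", "10.0.1.1", "192.168.1.1"]
frequent, passes = k_by_k_expansion(addresses, min_count=2, k=5)
```

With k = 5 the 25 lengths fall into 5 blocks, so this sample needs 5 passes; the price is that 10.0.1.1 is still counted at /25 through /27 even though its /24 is not frequent.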


Multidimensional Case

•  Key fields: SrcIP, DstIP, SrcPrt, DstPrt, Proto
•  Expand SrcIP, DstIP as before
•  SrcPrt, DstPrt
   –  Aggregate by range
      •  well known: 0-1023
      •  registered: 1024-49151
      •  dynamic: > 49151
   –  Aggregate by application type
      •  web, p2p, real-time, gaming, …
•  Previous FIM notions apply
   –  Is a rule interesting?
      •  Conditional vs. unconditional probabilities for 2-dimensional and higher-dimensional sets of key field values
      •  Compare Pr(Port A | Prefix B) with Pr(Port A)
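The interest test can be sketched as a comparison of two empirical frequencies. The example below assumes flows are given as (source address, destination port) pairs counted with unit weight; the flow list, the function names, and the choice of the "well-known" class are hypothetical. It also assumes at least one flow falls in the prefix.

```python
import ipaddress
from fractions import Fraction


def port_class(port):
    """Aggregate a port number into the three ranges listed on the slide."""
    if port <= 1023:
        return "well-known"
    if port <= 49151:
        return "registered"
    return "dynamic"


def conditional_vs_unconditional(flows, prefix, cls):
    """Return (Pr(port class | src in prefix), Pr(port class)) as exact fractions."""
    network = ipaddress.ip_network(prefix)
    in_prefix = [port for ip, port in flows if ipaddress.ip_address(ip) in network]
    cond = Fraction(sum(port_class(p) == cls for p in in_prefix), len(in_prefix))
    uncond = Fraction(sum(port_class(p) == cls for _, p in flows), len(flows))
    return cond, uncond


# Hypothetical flows: (source IP, destination port).
flows = [("10.0.0.1", 80), ("10.0.0.2", 443), ("10.0.0.3", 51000),
         ("192.168.1.1", 51000), ("192.168.1.2", 6881), ("192.168.1.3", 52000)]
cond, uncond = conditional_vs_unconditional(flows, "10.0.0.0/24", "well-known")
```

In this sample the conditional frequency (2/3) is twice the unconditional one (1/3), so the pair (prefix 10.0.0.0/24, well-known ports) would count as more interesting than either field alone suggests.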


Limitations of FIM approach

•  Multiple passes over the data
•  Online detection of HHH?
   –  Can we dynamically attribute traffic to HH clusters?
   –  Difficulty: can’t use multiple passes to refine prefixes


HHH / Cluster detection

•  Input
   –  <src_ip, dst_ip, src_port, dst_port, proto>
   –  Bytes (we can also use other metrics)
•  Output
   –  All traffic clusters with volume above (epsilon * total_volume)
      •  (cluster ID, estimated volume)
   –  Traffic clusters: defined using combinations of IP prefixes, port ranges, and protocol


Standard Tries

•  The standard trie for a set of strings S is an ordered tree such that:
   –  Each node but the root is labeled with a character
   –  The children of a node are alphabetically ordered
   –  The paths from the root to the external nodes yield the strings of S
•  Example: standard trie for the set of strings S = { bear, bell, bid, bull, buy, sell, stock, stop }

(root)
├─ b
│  ├─ e
│  │  ├─ a ─ r
│  │  └─ l ─ l
│  ├─ i ─ d
│  └─ u
│     ├─ l ─ l
│     └─ y
└─ s
   ├─ e ─ l ─ l
   └─ t ─ o
         ├─ c ─ k
         └─ p


Analysis of Standard Tries

•  A standard trie uses O(n) space and supports searches, insertions and deletions in time O(dm), where:
   –  n = total size of the strings in S
   –  m = size of the string parameter of the operation
   –  d = size of the alphabet



Word Matching with a Trie

•  We insert the words of the text into a trie
•  Each leaf stores the occurrences of the associated word in the text

Text (character positions 0 to 88):
   see a bear? sell stock! see a bull? buy stock! bid stock! bid stock! hear the bell? stop!

Word    Starting positions
bear    6
bell    78
bid     47, 58
bull    30
buy     36
hear    69
see     0, 24
sell    12
stock   17, 40, 51, 62
stop    84
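Word matching with a trie can be sketched with nested dictionaries. In the minimal example below, each node is a dict keyed by character and the hypothetical sentinel key "$" holds the list of starting positions of the word ending at that node. The tokenizer (lowercase letters only) is an assumption, and the sketch indexes every word, including short ones such as "a" and "the".

```python
import re


def build_word_trie(text):
    """Insert every word of text into a dict-based trie, recording start positions."""
    root = {}
    for match in re.finditer(r"[a-z]+", text):
        node = root
        for ch in match.group():
            node = node.setdefault(ch, {})           # follow or create the child
        node.setdefault("$", []).append(match.start())
    return root


def occurrences(root, word):
    """Starting positions of word in the indexed text, or [] if absent."""
    node = root
    for ch in word:
        if ch not in node:
            return []
        node = node[ch]
    return node.get("$", [])


text = ("see a bear? sell stock! see a bull? buy stock! "
        "bid stock! bid stock! hear the bell? stop!")
trie = build_word_trie(text)
stock_positions = occurrences(trie, "stock")
```

A lookup walks one node per character of the query, so its cost depends on the word length and not on the length of the text.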


Compressed Tries

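•  A compressed trie has internal nodes of degree at least two
•  It is obtained from the standard trie by compressing chains of “redundant” nodes

Compressed trie for S = { bear, bell, bid, bull, buy, sell, stock, stop }:

(root)
├─ b
│  ├─ e
│  │  ├─ ar
│  │  └─ ll
│  ├─ id
│  └─ u
│     ├─ ll
│     └─ y
└─ s
   ├─ ell
   └─ to
      ├─ ck
      └─ p

[Figure: the standard trie for S is shown alongside for comparison]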
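Compression can be sketched as a recursive pass over a dict-based standard trie. The example below is one simple way to do it, not a space-optimized implementation: any chain of nodes that have a single child and do not end a word is merged into one edge label. The sentinel key "$" marking the end of a word and the function names are hypothetical.

```python
END = "$"   # hypothetical end-of-word marker


def build_trie(words):
    """Standard trie as nested dicts, one character per edge."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def compress(node):
    """Merge chains of single-child, non-terminal nodes into multi-character labels."""
    out = {}
    for label, child in node.items():
        if label == END:
            out[END] = True
            continue
        # Extend the label while the child is a redundant link in a chain.
        while len(child) == 1 and END not in child:
            next_label, grandchild = next(iter(child.items()))
            label += next_label
            child = grandchild
        out[label] = compress(child)
    return out


S = ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]
compressed = compress(build_trie(S))
```

For the set S of the slides, the result has the edges b, e, ar, ll, id, u, ll, y under b and ell, to, ck, p under s, matching the compressed trie shown above.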


Using Tries for HHH detection

•  Node = prefix
•  Count bytes in short prefixes
•  If a prefix becomes a HH
   –  Break out into longer descendant prefixes
   –  Count only bytes in descendant prefixes
•  Repeat as necessary
•  Some post-processing required to allocate the initial byte counts in short prefixes


Dynamic Drilldown via 1-D Trie

•  At most 1 update per flow
•  Split level when adding new bytes causes bucket >= Tsplit
•  Invariant: traffic trapped at any interior node < Tsplit

[Figure: a field split into four stages, each an array of buckets indexed 0 to 255: stage 1 (first octet), stage 2 (second octet), stage 3 (third octet), stage 4 (last octet)]


1-D Trie Data Structure

[Figure: tree of bucket arrays, each indexed 0 to 255, with deeper arrays attached to the buckets that have been split]

•  Reconstruct interior nodes (aggregates) by summing up the children
•  Reconstruct missed value by summing up traffic trapped at ancestors
•  Amortize the update cost
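The drilldown and the reconstruction step can be sketched together. The code below is an illustrative reading of the slides, not the published implementation: it assumes IPv4 keys split by octet, and it assumes that when an update would bring a non-leaf node to Tsplit, the node is left unchanged and the bytes go to a newly created child instead. Class and method names are hypothetical.

```python
class Node:
    def __init__(self):
        self.volume = 0        # traffic trapped at this prefix
        self.children = {}     # next octet -> Node


class DrilldownTrie:
    def __init__(self, t_split, depth=4):
        self.t_split = t_split
        self.depth = depth     # 4 octets for IPv4
        self.root = Node()     # container for the first-octet nodes

    def update(self, addr, value):
        octets = [int(o) for o in addr.split(".")]
        node = self.root.children.setdefault(octets[0], Node())
        level = 1
        # Longest prefix match: descend while a child exists on the key's path.
        while level < self.depth and octets[level] in node.children:
            node = node.children[octets[level]]
            level += 1
        # Split: never let a non-leaf node reach t_split; push the bytes deeper.
        while level < self.depth and node.volume + value >= self.t_split:
            node.children[octets[level]] = Node()
            node = node.children[octets[level]]
            level += 1
        node.volume += value   # the single counter update for this flow

    def estimates(self):
        """Map prefix (tuple of octets) -> (lower bound, upper bound) on its volume."""
        out = {}

        def walk(node, path, trapped_above):
            subtotal = node.volume
            for octet, child in node.children.items():
                subtotal += walk(child, path + (octet,), trapped_above + node.volume)
            # Lower: own plus descendants. Upper: add traffic trapped at ancestors.
            out[path] = (subtotal, subtotal + trapped_above)
            return subtotal

        for octet, child in self.root.children.items():
            walk(child, (octet,), 0)
        return out


trie = DrilldownTrie(t_split=100)
for _ in range(5):
    trie.update("10.0.0.8", 60)
trie.update("10.0.0.9", 30)
est = trie.estimates()
```

In this run 10.0.0.8 actually sent 300 bytes; the structure reports a lower bound of 120 and an upper bound of 330, and the gap is the traffic trapped at its three ancestors, each of which stays below Tsplit.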


1-D Trie Performance

•  Update cost
   –  1 lookup + 1 update
•  Memory
   –  At most 1/Tsplit internal nodes at each level
•  Accuracy: for any given T > d*Tsplit
   –  Captures all flows with metric >= T
   –  Captures no flow with metric < T - d*Tsplit


Extending 1-D Trie to 2-D: Products

Update(k1, k2, value)

•  In each dimension, find the deepest interior node (prefix): (p1, p2)
   –  Can be done using longest prefix matching (LPM)
•  Update a hash table using key (p1, p2)

   p1 = f(k1)
   p2 = f(k2)
   totalBytes{ p1, p2 } += value
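The update rule can be sketched as follows. In this simplified example the two prefix sets are fixed, hypothetical lists standing in for the interior nodes of the two 1-D tries, and the longest prefix match is a linear scan rather than a trie lookup. Each list contains 0.0.0.0/0 so that a match always exists.

```python
import ipaddress
from collections import defaultdict


def lpm(prefixes, addr):
    """f(k): the longest prefix in prefixes that contains addr (linear scan)."""
    ip = ipaddress.ip_address(addr)
    return max((p for p in prefixes if ip in p), key=lambda p: p.prefixlen)


# Hypothetical prefix sets, one per dimension.
src_prefixes = [ipaddress.ip_network(p)
                for p in ("0.0.0.0/0", "10.0.0.0/8", "10.0.0.0/24")]
dst_prefixes = [ipaddress.ip_network(p)
                for p in ("0.0.0.0/0", "192.168.0.0/16")]

total_bytes = defaultdict(int)      # hash table keyed by (p1, p2)


def update(k1, k2, value):
    p1 = lpm(src_prefixes, k1)
    p2 = lpm(dst_prefixes, k2)
    total_bytes[(str(p1), str(p2))] += value


update("10.0.0.5", "192.168.1.1", 100)
update("10.0.0.7", "192.168.9.9", 50)
update("10.1.2.3", "8.8.8.8", 20)
```

The first two flows land in the same (10.0.0.0/24, 192.168.0.0/16) bucket, so the table holds two entries after three updates.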


Cross-Producting Performance

•  Update cost
   –  2 × (1-D update cost) + 1 hash table update
•  Memory
   –  Hash table size bounded by (d/Tsplit)^2
   –  In practice, generally much smaller
•  Accuracy: for any given T > d*Tsplit
   –  Captures all flows with metric >= T
   –  Captures no flow with metric < T - d*Tsplit


Attribution of traffic

•  Bin traffic in shorter prefixes
•  After threshold T is reached, bin traffic into longer prefixes
•  How to allocate the unsplit traffic in the shorter prefix?
•  Two methods
   1.  Lower bound: ignore unsplit traffic
   2.  Proportional splitting of T amongst longer prefixes
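The two methods differ only in what is added to the counts of the longer prefixes. The sketch below assumes the unsplit traffic of the shorter prefix is known as a single number and is shared in proportion to the bytes each longer prefix received after the split; the function name and the figures are hypothetical.

```python
def attribute(trapped, children, method="proportional"):
    """Allocate the parent's unsplit traffic to its longer prefixes.

    trapped:  bytes counted at the shorter prefix before it was split
    children: {longer prefix: bytes counted after the split}
    """
    total = sum(children.values())
    if method == "lower_bound" or total == 0:
        return dict(children)                    # ignore the unsplit traffic
    # Share the unsplit traffic in proportion to each child's own count.
    return {p: v + trapped * v / total for p, v in children.items()}


children = {"10.0.0.0/25": 300, "10.0.0.128/25": 100}
lower = attribute(100, children, method="lower_bound")
proportional = attribute(100, children)
```

With 100 unsplit bytes and a 3:1 ratio between the children, proportional splitting assigns 75 and 25 extra bytes, while the lower bound leaves both counts unchanged.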


A Sampling of the extensive literature

•  Automatically Inferring Patterns of Resource Consumption in Network Traffic, Cristian Estan, Stefan Savage, George Varghese, SIGCOMM 2003
•  Finding Hierarchical Heavy Hitters in Data Streams, Graham Cormode, Flip Korn, S. Muthukrishnan, Divesh Srivastava, VLDB 2003
•  Online Identification of Hierarchical Heavy Hitters: Algorithms, Evaluation, and Applications, Yin Zhang, Sumeet Singh, Subhabrata Sen, Nick Duffield, Carsten Lund, IMC 2004
•  Addressing Practical Challenges for Anomaly Detection in Backbone Networks, Ignasi Paredes-Oliva, Ph.D. thesis, UPC Barcelona, 2013
•  (Credit to the various authors for some slides)