Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when...

23
© 2004 Ινστιτούτο Πληροφορικής & Τηλεπικοινωνιών, ΕΚΕΦΕ "∆ημόκριτος" ΕΚΕΦΕ «∆ημόκριτος» Ινστιτούτο Πληροφορικής & Τηλεπικοινωνιών Bridging self-regulation and content-filtering Internet Content Filtering Group Internet Content Filtering Group Software and Knowledge Engineering Lab Software and Knowledge Engineering Lab http://www. http://www. iit iit . . demokritos demokritos . . gr gr / / skel skel /i /i - - config config / /

Transcript of Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when...

Page 1: Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when self-or 3 rd party labeling party labeling absent / not trusted – resulted in

© 2004 Ινστιτούτο Πληροφορικής & Τηλεπικοινωνιών, ΕΚΕΦΕ "∆ηµόκριτος"

ΕΚΕΦΕ «∆ηµόκριτος»Ινστιτούτο Πληροφορικής & Τηλεπικοινωνιών

Bridging self-regulation and content-filtering

Internet Content Filtering GroupInternet Content Filtering GroupSoftware and Knowledge Engineering LabSoftware and Knowledge Engineering Labhttp://www.http://www.iitiit..demokritosdemokritos..grgr//skelskel/i/i--configconfig//

Page 2: Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when self-or 3 rd party labeling party labeling absent / not trusted – resulted in

Research activities of Research activities of ii--configconfig

Language TechnologyLanguage Technology

Image UnderstandingImage Understanding

Knowledge Discovery in DataKnowledge Discovery in Data

User ModelingUser Modeling

Multimedia Information ProcessingMultimedia Information Processing

Page 3: Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when self-or 3 rd party labeling party labeling absent / not trusted – resulted in

Core technologies and applications

TechnologiesTechnologies: : Information extractionInformation extractionInformation filteringInformation filtering

Filter generatorFilter generator::Web filteringWeb filteringSpam filteringSpam filtering (e(e--mail)mail)Classifying financial newsClassifying financial news

Page 4: Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when self-or 3 rd party labeling party labeling absent / not trusted – resulted in

Filtering digital contentOriginal docs (e.g. e-mail messages, news from e-

agencies, web pages)

category 1

(e.g. complaints, financial news)

category 2

(e.g. tech support, sports news)

category n

Page 5: Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when self-or 3 rd party labeling party labeling absent / not trusted – resulted in

Internet Content

~~98% 98% “safe”“safe”UncontrolledUncontrolledTotal and direct accessTotal and direct accessLack of provenance (relative to absolute)Lack of provenance (relative to absolute)High volatilityHigh volatility

Page 6: Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when self-or 3 rd party labeling party labeling absent / not trusted – resulted in

Unsafe content

Illegal Illegal Pedophiles, Nazism (DE)Pedophiles, Nazism (DE)

OffensiveOffensivePornography, Racism, ViolencePornography, Racism, Violence

UndesiredUndesiredOnline gambling Online gambling Day trading sitesDay trading sites

Page 7: Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when self-or 3 rd party labeling party labeling absent / not trusted – resulted in

Where and how to filter

SelfSelf--regulationregulationFiltering at the sourceFiltering at the sourceFiltering during distributionFiltering during distributionFiltering at the last mileFiltering at the last mile

Page 8: Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when self-or 3 rd party labeling party labeling absent / not trusted – resulted in

Self-regulation

SelfSelf--labeling by content authors labeling by content authors –– producersproducersBrowsers block according to user settingsBrowsers block according to user settings

ICRA v.1.0/RSACi : (n 3 s 4 v 0 l 4)ICRA v.1.0/RSACi : (n 3 s 4 v 0 l 4)

New, more expressive vocabulary incl. context : New, more expressive vocabulary incl. context : ICRA v. 2.0 (ICRA v. 2.0 (http://www.http://www.icraicra.org/.org/faqfaq/decode/)/decode/)

Page 9: Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when self-or 3 rd party labeling party labeling absent / not trusted – resulted in

Filtering at the source – distribution

Literally impossible due to network structure, Literally impossible due to network structure, lack of provenance and routing method (cf. lack of provenance and routing method (cf. legal case against Yahoo! France) legal case against Yahoo! France)

Page 10: Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when self-or 3 rd party labeling party labeling absent / not trusted – resulted in

Filtering at the last-mile (‘consumer’)

ListList--based solutionsbased solutionsunderblockingunderblocking

Shallow keyword matching solutionsShallow keyword matching solutionsoverblockingoverblocking

Page 11: Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when self-or 3 rd party labeling party labeling absent / not trusted – resulted in

FilterXFilterX: Web page filtering: Web page filtering

FilterX is a Web proxy server that filters pornographic content on the Web. Having been trained with suitable examples, FilterX operates in real time.

• Combining natural language processing image analysis and Web structure, FilterXanalyses all the information available on theHTTP stream, not just the URL or title.

• Using machine learning, FilterX considers the actual contribution of textual, structural and pictorial features.

•Creating a multimedia representation model, for each document FilterX achieves practically zero overblocking.

Page 12: Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when self-or 3 rd party labeling party labeling absent / not trusted – resulted in

Last-mile applications of FilterX

SelfSelf--regulation and filtersregulation and filtersSIFT: Use of filters when selfSIFT: Use of filters when self-- or 3or 3rdrd party labeling party labeling absent / not trusted absent / not trusted –– resulted in resulted in ICRAplusICRAplus, a free , a free platform bridging selfplatform bridging self--regulation with filtering regulation with filtering softwaresoftware

Protection of young studentsProtection of young studentsSCOFI: Different content access according to student SCOFI: Different content access according to student age via smartcardage via smartcard

Page 13: Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when self-or 3 rd party labeling party labeling absent / not trusted – resulted in

Proxy

INTERNET

Proxy

Proxy

USER

TCP/IP

HTTP

label bureau protocol(HTTP)

PICS-Rules Based Filter Web-basedUser Interface

TCP/IP

ICRAfilter

FILTER 1

Label Bureauinterface

Adaptation Module

BROWSER

HTTP

HTTP

Proxy

FILTER 2

Label Bureauinterface

Adaptation Module

HTTP

label bureau protocol(HTTP)

Securiitymechanisms

Transparent proxymechanism

Protocol Blockingmechanism

Traffic Interceprtion Module

Proxy

FILTER n

Label Bureauinterface

Adaptation Module

HTTP

label bureau protocol(HTTP)

CACHING SYSTEM

SIFT Platform

Communication through Communication through W3C standards (HTTP, Label W3C standards (HTTP, Label Bureau)Bureau)

Easy to implement public Easy to implement public APIAPI

Literally ANY filter can Literally ANY filter can become modulebecome module

High security High security -- DSigDSig

Will run on Win OSWill run on Win OS

http://www.http://www.icraicra.org/.org/icraplusicraplus//

Page 14: Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when self-or 3 rd party labeling party labeling absent / not trusted – resulted in

In the absence of authorIn the absence of author--provided ICRA labels, the requested provided ICRA labels, the requested page can still be blocked by copage can still be blocked by co--operating filtersoperating filters

Public API can help filter vendors wrap their existing Public API can help filter vendors wrap their existing software into an software into an ICRAplus ICRAplus module in no time.module in no time.

Users can override blocks temporarily or permanently Users can override blocks temporarily or permanently

Page 15: Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when self-or 3 rd party labeling party labeling absent / not trusted – resulted in

Filters can be combined per user profile in either a Filters can be combined per user profile in either a simple manner, with predetermined, factory settingssimple manner, with predetermined, factory settings

or….or….

Page 16: Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when self-or 3 rd party labeling party labeling absent / not trusted – resulted in

..users can have full control on the filtering procedure, incl...users can have full control on the filtering procedure, incl.filter priority, filter weights and tie resolution! filter priority, filter weights and tie resolution!

More info and downloads at http://www.More info and downloads at http://www.icraicra.org/.org/icraplusicraplus//

Page 17: Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when self-or 3 rd party labeling party labeling absent / not trusted – resulted in

Packetfilter

Client (PC) SCOFI Server

sproxydSCOFI/SGO

Proxy

sproxydSCOFI/SGO

Proxy

Web BrowserWeb Browser8080/ tcp

SGOClientSGO

Client7751/ udp

SmartCard

SGOServerSGO

Server

Net 1(classroom)

Net 2(ISP)

Image Server

HTTP-request

Image ClassifierIm age

Classifier

AccessconfigAccess

config

squixFilterix enhanced

Squid proxy server

squixFilterix enhanced

Squid proxy server Internet8081/ tcp

echp/ authhandshake

HTTP-response

sadSCOFI/SGO

Auth dem on

sadSCOFI/SGO

Auth demon

Auth reques t

echp/ authrequest/ response

SmartCardSmartCard based based authentication and age authentication and age profileprofile

Different levels of Different levels of filteringfiltering

High security High security -- DsigDsig

External image analysisExternal image analysis

Page 18: Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when self-or 3 rd party labeling party labeling absent / not trusted – resulted in

FilterX revisited

Trained on selfTrained on self--proclaimed porn sitesproclaimed porn sitesCreation of pageCreation of page--level representation (multimedia and level representation (multimedia and structural) structural) Turn “noise” to our advantage by using it as featureTurn “noise” to our advantage by using it as featureModels per language + language identifierModels per language + language identifierEvaluated using multiEvaluated using multi--fold crossfold cross--validation (before validation (before SIFT & SCOFI)SIFT & SCOFI)Can pass the decision to the user for thresholdingCan pass the decision to the user for thresholding

Page 19: Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when self-or 3 rd party labeling party labeling absent / not trusted – resulted in

FilterX Results

Filtering obscene Web pages

70%

75%

80%

85%

90%

95%

100%

5 25 45 65 85 105 125 145 165 185 205 225 245

number of retained attributes

obscene precision

obscene recall

Page 20: Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when self-or 3 rd party labeling party labeling absent / not trusted – resulted in

A case for email

Instead of learning what is spam, learn what is legitInstead of learning what is spam, learn what is legitUser profile/model based on user’s Inbox + spamUser profile/model based on user’s Inbox + spamTurn “noise” to our advantage by using it as featureTurn “noise” to our advantage by using it as featureModels per language + language identifierModels per language + language identifierEvaluated in house + multiEvaluated in house + multi--fold crossfold cross--validationvalidation

Page 21: Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when self-or 3 rd party labeling party labeling absent / not trusted – resulted in

Spam filter results

Page 22: Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when self-or 3 rd party labeling party labeling absent / not trusted – resulted in

Food for thought

Decide where to put the bias: overblocking vs. Decide where to put the bias: overblocking vs. underblockingunderblockingMost of the harmful content Most of the harmful content wants wants to be foundto be foundPossibility of “hostile” users! (enforced s/w!)Possibility of “hostile” users! (enforced s/w!)Very hard to detect Very hard to detect intentionintention of the content authorof the content authorReverse the problem of filtering by centering on the Reverse the problem of filtering by centering on the effect on the usereffect on the userTechnology only to solve problems it has created!Technology only to solve problems it has created!

Page 23: Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when self-or 3 rd party labeling party labeling absent / not trusted – resulted in

More info

Konstantinos ChandrinosKonstantinos Chandrinoskostel@[email protected]

Internet Content Filtering Group (iInternet Content Filtering Group (i--configconfig))NCSR “Demokritos”NCSR “Demokritos”

http://www.http://www.iitiit..demokritosdemokritos..grgr//skelskel/i/i--configconfig//