Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when...

Post on 17-Aug-2020

8 views 0 download

Transcript of Bridging self-regulation and content-filtering · regulation and filters SIFT: Use of filters when...

© 2004 Ινστιτούτο Πληροφορικής & Τηλεπικοινωνιών, ΕΚΕΦΕ "∆ηµόκριτος"

ΕΚΕΦΕ «∆ηµόκριτος»Ινστιτούτο Πληροφορικής & Τηλεπικοινωνιών

Bridging self-regulation and content-filtering

Internet Content Filtering GroupInternet Content Filtering GroupSoftware and Knowledge Engineering LabSoftware and Knowledge Engineering Labhttp://www.http://www.iitiit..demokritosdemokritos..grgr//skelskel/i/i--configconfig//

Research activities of Research activities of ii--configconfig

Language TechnologyLanguage Technology

Image UnderstandingImage Understanding

Knowledge Discovery in DataKnowledge Discovery in Data

User ModelingUser Modeling

Multimedia Information ProcessingMultimedia Information Processing

Core technologies and applications

TechnologiesTechnologies: : Information extractionInformation extractionInformation filteringInformation filtering

Filter generatorFilter generator::Web filteringWeb filteringSpam filteringSpam filtering (e(e--mail)mail)Classifying financial newsClassifying financial news

Filtering digital contentOriginal docs (e.g. e-mail messages, news from e-

agencies, web pages)

category 1

(e.g. complaints, financial news)

category 2

(e.g. tech support, sports news)

category n

Internet Content

~~98% 98% “safe”“safe”UncontrolledUncontrolledTotal and direct accessTotal and direct accessLack of provenance (relative to absolute)Lack of provenance (relative to absolute)High volatilityHigh volatility

Unsafe content

Illegal Illegal Pedophiles, Nazism (DE)Pedophiles, Nazism (DE)

OffensiveOffensivePornography, Racism, ViolencePornography, Racism, Violence

UndesiredUndesiredOnline gambling Online gambling Day trading sitesDay trading sites

Where and how to filter

SelfSelf--regulationregulationFiltering at the sourceFiltering at the sourceFiltering during distributionFiltering during distributionFiltering at the last mileFiltering at the last mile

Self-regulation

SelfSelf--labeling by content authors labeling by content authors –– producersproducersBrowsers block according to user settingsBrowsers block according to user settings

ICRA v.1.0/RSACi : (n 3 s 4 v 0 l 4)ICRA v.1.0/RSACi : (n 3 s 4 v 0 l 4)

New, more expressive vocabulary incl. context : New, more expressive vocabulary incl. context : ICRA v. 2.0 (ICRA v. 2.0 (http://www.http://www.icraicra.org/.org/faqfaq/decode/)/decode/)

Filtering at the source – distribution

Literally impossible due to network structure, Literally impossible due to network structure, lack of provenance and routing method (cf. lack of provenance and routing method (cf. legal case against Yahoo! France) legal case against Yahoo! France)

Filtering at the last-mile (‘consumer’)

ListList--based solutionsbased solutionsunderblockingunderblocking

Shallow keyword matching solutionsShallow keyword matching solutionsoverblockingoverblocking

FilterXFilterX: Web page filtering: Web page filtering

FilterX is a Web proxy server that filters pornographic content on the Web. Having been trained with suitable examples, FilterX operates in real time.

• Combining natural language processing image analysis and Web structure, FilterXanalyses all the information available on theHTTP stream, not just the URL or title.

• Using machine learning, FilterX considers the actual contribution of textual, structural and pictorial features.

•Creating a multimedia representation model, for each document FilterX achieves practically zero overblocking.

Last-mile applications of FilterX

SelfSelf--regulation and filtersregulation and filtersSIFT: Use of filters when selfSIFT: Use of filters when self-- or 3or 3rdrd party labeling party labeling absent / not trusted absent / not trusted –– resulted in resulted in ICRAplusICRAplus, a free , a free platform bridging selfplatform bridging self--regulation with filtering regulation with filtering softwaresoftware

Protection of young studentsProtection of young studentsSCOFI: Different content access according to student SCOFI: Different content access according to student age via smartcardage via smartcard

Proxy

INTERNET

Proxy

Proxy

USER

TCP/IP

HTTP

label bureau protocol(HTTP)

PICS-Rules Based Filter Web-basedUser Interface

TCP/IP

ICRAfilter

FILTER 1

Label Bureauinterface

Adaptation Module

BROWSER

HTTP

HTTP

Proxy

FILTER 2

Label Bureauinterface

Adaptation Module

HTTP

label bureau protocol(HTTP)

Securiitymechanisms

Transparent proxymechanism

Protocol Blockingmechanism

Traffic Interceprtion Module

Proxy

FILTER n

Label Bureauinterface

Adaptation Module

HTTP

label bureau protocol(HTTP)

CACHING SYSTEM

SIFT Platform

Communication through Communication through W3C standards (HTTP, Label W3C standards (HTTP, Label Bureau)Bureau)

Easy to implement public Easy to implement public APIAPI

Literally ANY filter can Literally ANY filter can become modulebecome module

High security High security -- DSigDSig

Will run on Win OSWill run on Win OS

http://www.http://www.icraicra.org/.org/icraplusicraplus//

In the absence of authorIn the absence of author--provided ICRA labels, the requested provided ICRA labels, the requested page can still be blocked by copage can still be blocked by co--operating filtersoperating filters

Public API can help filter vendors wrap their existing Public API can help filter vendors wrap their existing software into an software into an ICRAplus ICRAplus module in no time.module in no time.

Users can override blocks temporarily or permanently Users can override blocks temporarily or permanently

Filters can be combined per user profile in either a Filters can be combined per user profile in either a simple manner, with predetermined, factory settingssimple manner, with predetermined, factory settings

or….or….

..users can have full control on the filtering procedure, incl...users can have full control on the filtering procedure, incl.filter priority, filter weights and tie resolution! filter priority, filter weights and tie resolution!

More info and downloads at http://www.More info and downloads at http://www.icraicra.org/.org/icraplusicraplus//

Packetfilter

Client (PC) SCOFI Server

sproxydSCOFI/SGO

Proxy

sproxydSCOFI/SGO

Proxy

Web BrowserWeb Browser8080/ tcp

SGOClientSGO

Client7751/ udp

SmartCard

SGOServerSGO

Server

Net 1(classroom)

Net 2(ISP)

Image Server

HTTP-request

Image ClassifierIm age

Classifier

AccessconfigAccess

config

squixFilterix enhanced

Squid proxy server

squixFilterix enhanced

Squid proxy server Internet8081/ tcp

echp/ authhandshake

HTTP-response

sadSCOFI/SGO

Auth dem on

sadSCOFI/SGO

Auth demon

Auth reques t

echp/ authrequest/ response

SmartCardSmartCard based based authentication and age authentication and age profileprofile

Different levels of Different levels of filteringfiltering

High security High security -- DsigDsig

External image analysisExternal image analysis

FilterX revisited

Trained on selfTrained on self--proclaimed porn sitesproclaimed porn sitesCreation of pageCreation of page--level representation (multimedia and level representation (multimedia and structural) structural) Turn “noise” to our advantage by using it as featureTurn “noise” to our advantage by using it as featureModels per language + language identifierModels per language + language identifierEvaluated using multiEvaluated using multi--fold crossfold cross--validation (before validation (before SIFT & SCOFI)SIFT & SCOFI)Can pass the decision to the user for thresholdingCan pass the decision to the user for thresholding

FilterX Results

Filtering obscene Web pages

70%

75%

80%

85%

90%

95%

100%

5 25 45 65 85 105 125 145 165 185 205 225 245

number of retained attributes

obscene precision

obscene recall

A case for email

Instead of learning what is spam, learn what is legitInstead of learning what is spam, learn what is legitUser profile/model based on user’s Inbox + spamUser profile/model based on user’s Inbox + spamTurn “noise” to our advantage by using it as featureTurn “noise” to our advantage by using it as featureModels per language + language identifierModels per language + language identifierEvaluated in house + multiEvaluated in house + multi--fold crossfold cross--validationvalidation

Spam filter results

Food for thought

Decide where to put the bias: overblocking vs. Decide where to put the bias: overblocking vs. underblockingunderblockingMost of the harmful content Most of the harmful content wants wants to be foundto be foundPossibility of “hostile” users! (enforced s/w!)Possibility of “hostile” users! (enforced s/w!)Very hard to detect Very hard to detect intentionintention of the content authorof the content authorReverse the problem of filtering by centering on the Reverse the problem of filtering by centering on the effect on the usereffect on the userTechnology only to solve problems it has created!Technology only to solve problems it has created!

More info

Konstantinos ChandrinosKonstantinos Chandrinoskostel@kostel@iitiit..demokritosdemokritos..grgr

Internet Content Filtering Group (iInternet Content Filtering Group (i--configconfig))NCSR “Demokritos”NCSR “Demokritos”

http://www.http://www.iitiit..demokritosdemokritos..grgr//skelskel/i/i--configconfig//