EPL660: Information Retrieval and Search Engines, Lab 3

University of Cyprus, Department of Computer Science. EPL660: Information Retrieval and Search Engines, Lab 3. Pavlos Antoniou, Office: B109, ΘΕΕ01

Apache Solr

• Popular, fast, open-source search platform built on Apache Lucene, from the Apache Lucene project
• Written in Java; runs as a full-text search server in either standalone or distributed (SolrCloud) mode
• Solr uses the Lucene Java search library at its core for full-text indexing and search

Apache Solr Features

• XML/HTTP and JSON APIs

• Hit highlighting

• Faceted Search and Filtering

• Near real-time indexing

• Database integration

• Rich document (e.g., Word, PDF) handling

• Geospatial Search

• Fast Incremental Updates and Index Replication

• Caching

• Replication

• Web administration interface, etc.

Apache Solr vs Apache Lucene

• The relationship between Solr and Lucene is like that of a car and its engine: you can't drive an engine, but you can drive a car.
• Lucene is a library that you cannot use as-is, whereas Solr is a complete application that you can use out of the box.
• Unlike Lucene, Solr is a server application. Older releases shipped as a web application (WAR) that could be deployed in a servlet container, e.g. Jetty, Tomcat, Resin, etc.; since Solr 5 it runs as a standalone server on a bundled Jetty.
– with the WAR packaging, a single archive file was enough to deploy the application on a server
• Solr can be installed and used easily by non-programmers, whereas Lucene requires programming skills.

When to use Lucene?

• You need to embed search functionality into, for example, a desktop application
• You have highly customized requirements that need low-level access to the Lucene API classes
– Here Solr may be more of a hindrance than a help, since it is an extra layer of indirection.

SolrCloud

• Apache Solr includes the ability to set up a cluster of Solr servers that combines fault tolerance and high availability: SolrCloud
• SolrCloud allows for distributed search and indexing
• SolrCloud features:
– Central configuration for the entire cluster
– Automatic load balancing and fail-over for queries
– ZooKeeper integration for cluster coordination and configuration

SolrCloud Concepts

• A Cluster is made up of one or more Solr Nodes, which are running instances of the Solr server process

SolrCloud Concepts

• A Cluster can host multiple Collections of Solr Documents
• A Collection can be partitioned into multiple Shards (pieces), each of which contains a subset of the Documents in the Collection
• Each Shard can be replicated (Leader & Replicas)

SolrCloud Concepts

• The number of Shards that a Collection has determines:
– The theoretical limit on the number of Documents that the Collection can reasonably contain.
• The number of Replicas that each Shard has determines:
– The level of redundancy built into the Collection, and how fault tolerant the Cluster can be if some Nodes become unavailable.
– The theoretical limit on the number of concurrent search requests that can be processed under heavy load.
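
As a rough illustration of why shard count and replica count play different roles, the Python sketch below routes documents to shards and places replicas on nodes. It is a toy model and not Solr's implementation: the CRC32-modulo routing, the round-robin placement, and all function names are assumptions made for this example, and the node names merely echo the ports used later in this lab.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard index for a document id (simplified: CRC32 modulo shard count)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def assign_replicas(num_shards, replication_factor, nodes):
    """Place every copy of every shard on the nodes in round-robin order (hypothetical policy)."""
    placement = {}
    slot = 0
    for shard in range(num_shards):
        placement[shard] = []
        for _ in range(replication_factor):
            placement[shard].append(nodes[slot % len(nodes)])
            slot += 1
    return placement


def surviving_shards(placement, failed_nodes):
    """Return the shards that still have at least one copy on a live node."""
    return [
        shard
        for shard, hosts in placement.items()
        if any(host not in failed_nodes for host in hosts)
    ]


nodes = ["node1:8983", "node2:7574"]
replicated = assign_replicas(2, 2, nodes)      # 2 shards, 2 copies of each
unreplicated = assign_replicas(2, 1, nodes)    # 2 shards, 1 copy of each

# With two copies per shard, losing one node still leaves every shard available.
print(surviving_shards(replicated, {"node2:7574"}))    # [0, 1]
# With one copy per shard, losing a node loses the shard it hosted.
print(surviving_shards(unreplicated, {"node2:7574"}))  # [0]
```

More shards spread the documents over more pieces, while more replicas add redundancy and extra copies that can answer queries.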

Getting Started

• Download Apache Solr (current version 8.2.0) from https://www.apache.org/dyn/closer.lua/lucene/solr/8.2.0/solr-8.2.0.tgz (or the zip for Windows)
• Extract the archive and go to the solr directory
• Open a terminal and type:
    bin/solr start -e cloud -noprompt
• This will start up a SolrCloud cluster with embedded ZooKeeper (cloud management service) on the local workstation with 2 nodes
– The first node listens on port 8983 and the second on port 7574
• You can check that Solr is running by loading http://localhost:8983/solr/ in your web browser.

Solr web interface

SolrCloud

• Preview collections on tab
– One collection created automatically → gettingstarted
– Collection is partitioned into 2 shards
• First node stores 2 leader shards / second stores 2 replicas
• Solr server is up and running, with one collection but no data indexed
• Important configuration files: solrconfig.xml, managed-schema (or schema.xml)
– solr-dir/server/solr/configsets/_default/conf/solrconfig.xml
– solr-dir/server/solr/configsets/_default/conf/managed-schema (schema.xml in classic setups)

DataDir and DirectoryFactory

• DataDir: location for index data
– Solr stores its index data in a directory called /data under the core's instance directory
– use the <dataDir> parameter in solrconfig.xml to change it
• DirectoryFactory:
– solr.StandardDirectoryFactory: filesystem-based directory factory
– solr.SimpleFSDirectoryFactory: index files on the local filesystem; problems with multiple threads
– solr.NIOFSDirectoryFactory: scales with many threads; does not work well on MS Windows platforms
– solr.MMapDirectoryFactory: uses virtual memory and disk; not for real-time searching
– solr.NRTCachingDirectoryFactory: some chunks in RAM; for real-time searching
– solr.RAMDirectoryFactory: whole index in memory; does not work with replication
– solr.HdfsDirectoryFactory: index in HDFS (desirable if using Hadoop)
https://lucene.apache.org/solr/guide/8_1/datadir-and-directoryfactory-in-solrconfig.html

How Solr Sees the World

• Document: basic unit of information
– a set of data that describes something
• E.g. a document about a person might contain the person's name, biography, favorite color, and shoe size
– documents are composed of fields, which are more specific pieces of information
• E.g. "first_name":"Pavlos", "shoe_size":42
– fields can contain different types of data
• first_name → text, shoe_size → number
• The user defines the type of each field
• The field type tells Solr how to interpret the field and how it can be queried
– When a document is added to a collection, Solr takes the values from the document's fields and adds them to the index
– Queries consult the index and return the matching documents
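
The idea of documents, fields, and an index that queries consult can be sketched in a few lines of Python. This is a toy exact-match index and not how Lucene actually stores data; the second document and the helper names (`build_index`, `search`) are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

documents = [
    {"id": "1", "first_name": "Pavlos", "shoe_size": 42},
    {"id": "2", "first_name": "Maria", "shoe_size": 38},  # hypothetical second document
]


def build_index(docs):
    """Map each (field, value) pair to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc in docs:
        for field, value in doc.items():
            if field != "id":
                index[(field, value)].add(doc["id"])
    return index


def search(index, field, value):
    """Consult the index (not the documents) and return the matching ids, sorted."""
    return sorted(index.get((field, value), set()))


index = build_index(documents)
print(search(index, "first_name", "Pavlos"))  # ['1']
print(search(index, "shoe_size", 38))         # ['2']
print(search(index, "first_name", "pavlos"))  # [] because exact matching is case-sensitive
```

The last query returns nothing because the toy index matches values exactly, which is the kind of problem field analysis is meant to address.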

Field Analysis Process

• How does Solr process document fields when building an index?
– Example: biography field in a person document
    "biography": "He received his Ph.D. from Department of Computer Science of the University of Cyprus, in 2012"
– Index every word of the biography in order to quickly find people whose lives have had anything to do with university, or computer. Any issues?
• What if the biography contains a lot of common words you don't really care about, like "he", "the", "a", "to", "for", "is" (stop words)?
• What if the biography contains the word "University" and a user queries for "university"?
• Solution: field analysis

Field Analysis Process

• For each field, you can tell Solr:
– how to break the text apart into words (tokenization)
• E.g. split at whitespace, commas, etc.
– to remove stop words (filtering)
– to normalize words to lower case
– to remove accent marks
Read more here: Understanding Analyzers, Tokenizers, and Filters
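
The analysis steps listed above can be approximated with a short Python pipeline. This is only a sketch of the concept: Solr's real tokenizers and filters are configured in the schema and behave differently in detail (for example around "Ph.D."), and the stop-word list here is a small illustrative set, not Solr's.

```python
import re
import unicodedata

# Illustrative stop words only; a real analyzer ships with its own list.
STOP_WORDS = {"he", "the", "a", "to", "for", "is", "of"}


def tokenize(text):
    """Split the text on runs of non-word characters (whitespace, commas, periods, ...)."""
    return [token for token in re.split(r"\W+", text) if token]


def strip_accents(token):
    """Remove accent marks by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def analyze(text):
    """Tokenize, lower-case, strip accents, then drop stop words."""
    tokens = [strip_accents(token.lower()) for token in tokenize(text)]
    return [token for token in tokens if token not in STOP_WORDS]


biography = (
    "He received his Ph.D. from Department of Computer Science "
    "of the University of Cyprus, in 2012"
)
print(analyze(biography))
# ['received', 'his', 'ph', 'd', 'from', 'department', 'computer',
#  'science', 'university', 'cyprus', 'in', '2012']
```

Because the same function is applied to both the indexed text and the query, a query for "University" and a query for "university" produce the same token.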

Schema files and manipulation

• Solr stores details about the field types and fields it is expected to understand in a schema file:
– managed-schema is the name of the schema file Solr uses by default. It supports making schema changes at runtime via the Schema API (over HTTP) or Schemaless Mode, which avoids hand editing of the managed schema file
– schema.xml is the traditional name for a schema file that can be edited manually by users who use the ClassicIndexSchemaFactory (before Solr 6)
– If you are using SolrCloud, you may not be able to find any file by these names on the local filesystem. You will only be able to see the schema through the Schema API (if enabled) or through the Solr Admin UI's Cloud Screens

Field Analysis

• Schema defines
– The kind of fields available for indexing
– The type of analysis to be applied when indexing or querying each field
– Available field types such as float, long, double, date, text
• Explore the schema using the Schema tab (see next slide)
– Example: choose the "*_txt" field to see how Solr treats field names ending in _txt

Field Analysis

Schema tab

Indexed fields are fields that pass through the analysis phase and are added to the index so that queries can search or sort on them.

Stored fields are fields whose original text is stored in the index so that queries can retrieve it.

Field Analysis

• Go to the Analysis tab (see next slide) to see how a text value is broken down into words by index-time and query-time analysis
– Field Value (Index): He received his Ph.D. from Department of Computer Science of the University of Cyprus, in 2012
– Analyse Fieldname / FieldType: text_en

Field Analysis

[Screenshot: Analysis tab, showing where to insert the text to analyze]

Field Analysis

The word "of" has been "stopped" (removed as a stop word)

Indexing XML Data

• Solr includes a simple command line tool for POSTing various types of content to a Solr server
– bin/post on UNIX, different usage on Windows
• Let's first index two XML files
– UNIX: stay in the solr directory
    bin/post -c gettingstarted example/exampledocs/solr.xml example/exampledocs/monitor.xml
– Windows: go to the example/exampledocs directory
    java -Dc=gettingstarted -jar post.jar solr.xml monitor.xml
• You have now indexed two documents in Solr
• Browse the indexed documents at
– http://localhost:8983/solr/gettingstarted/browse

Collection browsing

Collection querying

Querying Data via Solr Admin UI

• Solr can be queried via REST clients, curl, wget, Chrome Postman, etc., as well as via native clients available for many programming languages.
• The Solr Admin UI includes a query builder interface
– In the Admin interface choose the gettingstarted collection
– In the "Query" tab click the button to display results

RequestHandlers are specified in solrconfig.xml:

    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="echoParams">explicit</str>
        <int name="rows">10</int>
      </lst>
    </requestHandler>

    <initParams path="/update/**,/query,/select,/tvrh,/elevate,/spell,/browse">
      <lst name="defaults">
        <str name="df">_text_</str>
      </lst>
    </initParams>

Search for anything

Default search field: _text_

Querying Data via Solr Admin UI

– Enter "solr" in the "q" text box to search for "solr" in the index
• Why are no results returned?
– The default field for searching the word solr is _text_, and that field does not contain solr
– Change df to name and press the button again
– Results can also be previewed in the browser:
    http://localhost:8983/solr/gettingstarted/select?q=solr&df=name (response in JSON format)
    http://localhost:8983/solr/gettingstarted/select?q=solr&df=name&wt=xml (response in XML format)

Querying Data via Solr Admin UI

RESTful URL for querying Solr. It can be used when querying Solr from custom apps.

Querying Data

• Index all .xml documents in example/exampledocs
    UNIX: bin/post -c gettingstarted example/exampledocs/*.xml
    Windows: java -Dc=gettingstarted -jar post.jar *.xml
• ...and now you can search for all sorts of things using the default Solr Query Syntax (a superset of the Lucene query syntax)...
– video
– name:*Video*
– address_s:*ist*
– +video +price:[* TO 400]
• docs having video in searchable fields and price up to 400
– -address_s:*
• docs that do not have an address_s field

Updating Data

• Although solr.xml has been POSTed to the server twice, the query "q": solr still finds a single document:
    { "numFound": 1, "start": 0, … "docs": [ { "id": "SOLR1000", …
– Why?
• This is because the example schema.xml specifies a "uniqueKey" field called "id".
• Whenever you POST commands to Solr to add a document with the same value for the uniqueKey as an existing document, Solr automatically replaces the existing document for you.

Updating Data

• You can see that this has happened by looking at the values for numDocs and maxDoc in the "CORE"/searcher section of the statistics page...
• http://localhost:8983/solr/index.html#/gettingstarted/plugins?entry=searcher&type=core
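
A toy model may help make the difference between numDocs and maxDoc concrete. The sketch below assumes, as is commonly described for Lucene-based indexes, that replacing a document marks the old copy as deleted rather than removing it right away, so maxDoc can exceed numDocs until old segments are merged. The `ToyIndex` class and the `name` value are hypothetical; only the id SOLR1000 comes from the example above.

```python
class ToyIndex:
    """Append-only store that mimics replace-by-uniqueKey semantics."""

    def __init__(self, unique_key="id"):
        self.unique_key = unique_key
        self.slots = []   # every document ever written, including superseded copies
        self.live = {}    # uniqueKey value -> position of the current version

    def add(self, doc):
        """Add a document; an existing document with the same key is superseded."""
        key = doc[self.unique_key]
        self.live[key] = len(self.slots)
        self.slots.append(doc)

    @property
    def num_docs(self):
        """Number of live (searchable) documents."""
        return len(self.live)

    @property
    def max_doc(self):
        """Number of document slots, counting superseded copies not yet purged."""
        return len(self.slots)


index = ToyIndex()
index.add({"id": "SOLR1000", "name": "placeholder name"})
index.add({"id": "SOLR1000", "name": "placeholder name"})  # same file POSTed again
print(index.num_docs, index.max_doc)  # 1 2
```

One live document with two slots is the pattern to look for on the statistics page after POSTing the same file twice.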

Deleting Data

• You can delete data by POSTing a delete command to the update URL and specifying the value of the document's unique key field, or a query that matches multiple documents
    java -Dc=gettingstarted -Ddata=args -Dcommit=false -jar post.jar "<delete><id>SP2514N</id></delete>"
• Delete documents that match a specific query
    java -Dc=gettingstarted -Dcommit=false -Ddata=args -jar post.jar "<delete><query>name:*DDR*</query></delete>"

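When delete commands are generated from a program rather than typed by hand, it is safer to build the XML with a library so that special characters are escaped. The Python sketch below only constructs the two message shapes shown above; the function names are hypothetical and nothing is sent to Solr.

```python
import xml.etree.ElementTree as ET


def delete_by_id(doc_id):
    """Build a delete message that targets one document by its unique key."""
    root = ET.Element("delete")
    ET.SubElement(root, "id").text = doc_id
    return ET.tostring(root, encoding="unicode")


def delete_by_query(query):
    """Build a delete message that targets every document matching a query."""
    root = ET.Element("delete")
    ET.SubElement(root, "query").text = query
    return ET.tostring(root, encoding="unicode")


print(delete_by_id("SP2514N"))        # <delete><id>SP2514N</id></delete>
print(delete_by_query("name:*DDR*"))  # <delete><query>name:*DDR*</query></delete>
```
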
Querying Data via REST API

• Searches are done via HTTP GET on the select URL with the query string in the q parameter.
– http://localhost:8983/solr/gettingstarted/select?q=solr&df=name
• You can pass a number of optional request parameters to the request handler to control what information is returned.
– use the "fl" parameter to control which stored fields are returned, and whether the relevancy score is returned:
• q=video&fl=name,id (return only name and id fields)
• q=video&fl=name,id,score (return relevancy score as well)
• q=video&fl=*,score (return all stored fields, as well as relevancy score)
• q=video&sort=address_s desc&fl=name,id,price (add sort specification: sort by address_s descending)
• q=video&wt=json (return response in JSON format)
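
When these URLs are built from a custom app, the parameter values need to be URL-encoded; in particular a literal "+" in a query must be escaped, because an unescaped "+" is read as a space. The following Python sketch shows one way to do this with the standard library. The helper name `select_url` is hypothetical, and the base URL assumes the local node from this lab.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr"  # assumes the first node started earlier in this lab


def select_url(collection, q, **params):
    """Build a /select URL with q first, followed by any optional parameters."""
    query = {"q": q}
    query.update(params)
    return f"{BASE}/{collection}/select?{urlencode(query)}"


print(select_url("gettingstarted", "solr", df="name"))
# http://localhost:8983/solr/gettingstarted/select?q=solr&df=name

print(select_url("gettingstarted", "video", sort="address_s desc", fl="name,id,price"))
# ...select?q=video&sort=address_s+desc&fl=name%2Cid%2Cprice

print(select_url("gettingstarted", "+video +price:[* TO 400]"))
# ...select?q=%2Bvideo+%2Bprice%3A%5B%2A+TO+400%5D
```

The encoded forms look different from the readable examples above, but they decode to the same parameter values on the server side.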

Sorting

• Solr provides a simple method to sort on one or more indexed fields. Use the "sort" parameter to specify "field direction" pairs, separated by commas if there is more than one sort field:
– q=video&sort=price desc
– q=video&sort=price asc
– q=video&sort=inStock asc, price desc
• "score" can also be used as a field name when specifying a sort:
– q=video&sort=score desc
– q=video&sort=inStock asc, score desc
• Complex functions may also be used to sort results:
– q=video&sort=div(popularity,add(price,1)) desc
• If no sort is specified, the default is score desc, which returns the matches with the highest relevancy first
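
To see what a multi-field sort and a function sort mean for the result order, the ordering can be reproduced in plain Python on a few made-up documents. The data below is invented for illustration, the helper `multi_sort` is hypothetical, and the sketch assumes that false sorts before true for a boolean field in ascending order.

```python
docs = [  # invented sample documents
    {"id": "A", "inStock": True, "price": 300.0, "popularity": 6},
    {"id": "B", "inStock": False, "price": 100.0, "popularity": 7},
    {"id": "C", "inStock": True, "price": 450.0, "popularity": 1},
]


def multi_sort(rows, spec):
    """Apply (field, direction) pairs; later pairs only break ties of earlier ones."""
    result = list(rows)
    # Sort by the least significant key first; list.sort is stable, so earlier
    # orderings survive among rows that tie on the more significant key.
    for field, direction in reversed(spec):
        result.sort(key=lambda d, f=field: d[f], reverse=(direction == "desc"))
    return result


# Equivalent of sort=inStock asc, price desc
by_stock_then_price = multi_sort(docs, [("inStock", "asc"), ("price", "desc")])
print([d["id"] for d in by_stock_then_price])  # ['B', 'C', 'A']

# Equivalent of sort=div(popularity,add(price,1)) desc
by_function = sorted(docs, key=lambda d: d["popularity"] / (d["price"] + 1), reverse=True)
print([d["id"] for d in by_function])  # ['B', 'A', 'C']
```
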

Indexing “Rich” Data

• Index local "rich" files including HTML, PDF, Microsoft Office formats (such as MS Word), plain text and many other formats found in /docs
– UNIX: bin/post -c gettingstarted docs/

Index Data

• There are many other ways to import your data into Solr. One can:
– Import records from a database using the Data Import Handler (DIH)
• see tutorial here for MySQL or SQL Server database import
– Load a CSV file (comma-separated values), including those exported by Excel or MySQL.
– POST JSON documents
– Index binary documents such as Word and PDF with Solr Cell (ExtractingRequestHandler).
– Use SolrJ for Java or other Solr clients to programmatically create documents to send to Solr.
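
As one example of the "POST JSON documents" option, the Python sketch below prepares an HTTP request that sends a JSON array of documents to a collection's update URL. It is a minimal sketch under stated assumptions: the helper name and the sample document are hypothetical, the base URL assumes the local node from this lab, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_update_request(collection, docs, base="http://localhost:8983/solr"):
    """Prepare (but do not send) a POST of JSON documents to the update URL."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        f"{base}/{collection}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_update_request(
    "gettingstarted",
    [{"id": "demo-1", "name": "Hypothetical document"}],
)
print(request.get_method(), request.full_url)
# POST http://localhost:8983/solr/gettingstarted/update?commit=true

# To actually send it (requires the Solr instance from this lab to be running):
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Because the document carries an "id", sending it twice would replace the earlier copy rather than create a duplicate.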

Stopping SolrCloud

• Stop SolrCloud nodes

– bin/solr stop -all

• Delete Solr home for nodes (if needed):

– rm -rf example/cloud/node1

– rm -rf example/cloud/node2

Useful Links

• http://lucene.apache.org/solr/index.html

• http://lucene.apache.org/solr/quickstart.html

• http://wiki.apache.org/solr/SolrResources

– Next Week: ElasticSearch