EPL660: Information Retrieval and Search Engines – Lab 3


  • University of Cyprus

    Department of Computer Science

    EPL660: Information Retrieval and Search Engines – Lab 3

    Pavlos Antoniou

    Office: B109, ΘΕΕ01

  • Apache Solr

    • Popular, fast, open-source search platform built on
    Apache Lucene, developed by the Apache Lucene project

    • Written in Java; runs as a full-text search server in
    either standalone or distributed (SolrCloud) mode

    • Uses the Lucene Java search library at its core for
    full-text indexing and search

  • Apache Solr Features

    • XML/HTTP and JSON APIs

    • Hit highlighting

    • Faceted Search and Filtering

    • Near real-time indexing

    • Database integration

    • Rich document (e.g., Word, PDF) handling

    • Geospatial Search

    • Fast Incremental Updates and Index Replication

    • Caching

    • Replication

    • Web administration interface, etc.

  • Apache Solr vs Apache Lucene

    • The relationship between Solr and Lucene is that of a
    car and its engine: you can't drive an engine, but you
    can drive a car.

    • Lucene is a library which you can't use as-is, whereas
    Solr is a complete application which you can use out of
    the box.

    • Unlike Lucene, Solr is a web application (WAR) which can
    be deployed in any servlet container, e.g. Jetty, Tomcat,
    Resin, etc.

    – a single JAR file is needed to deploy the application on a server

    • Solr can be installed and used easily by non-programmers;
    Lucene requires programming skills.

  • When to use Lucene?

    • You need to embed search functionality into, for
    example, a desktop application

    • You have highly customized requirements that need
    low-level access to the Lucene API classes

    – Solr may be more a hindrance than a help, since it is an
    extra layer of indirection.

  • SolrCloud

    • Apache Solr includes the ability to set up a cluster of
    Solr servers that combines fault tolerance and high
    availability: SolrCloud

    • SolrCloud allows for distributed search and indexing

    • SolrCloud features:

    – Central configuration for the entire cluster

    – Automatic load balancing and fail-over for queries

    – ZooKeeper integration for cluster coordination and
    configuration

  • SolrCloud Concepts

    • A Cluster is made up of one or more Solr Nodes, which
    are running instances of the Solr server process

    • A Cluster can host multiple Collections of Solr
    Documents

    • A Collection can be partitioned into multiple Shards
    (pieces), each of which contains a subset of the
    Documents in the Collection

    • Each Shard can be replicated (Leader & Replicas)

    • The number of Shards that a Collection has determines:

    – The theoretical limit to the number of Documents that
    the Collection can reasonably contain.

    • The number of Replicas that each Shard has determines:

    – The level of redundancy built into the Collection and
    how fault tolerant the Cluster can be in the event that
    some Nodes become unavailable.

    – The theoretical limit on the number of concurrent search
    requests that can be processed under heavy load.
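    Sharding can be illustrated with a toy router. Solr's default
    compositeId router assigns a document to a shard by hashing its
    unique key (Solr uses MurmurHash3 over a hash range; the sketch
    below substitutes MD5 as a simplified stand-in, and assumes 2
    shards as in the lab's gettingstarted collection):

    ```python
    import hashlib

    def shard_for(doc_id: str, num_shards: int) -> int:
        """Map a document id to a shard by hashing it.
        Simplified stand-in for Solr's compositeId routing."""
        digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_shards

    # Distribute some document ids across 2 shards; the same id
    # always lands on the same shard, so queries can be fanned out
    # and results merged deterministically.
    assignments = {doc_id: shard_for(doc_id, 2)
                   for doc_id in ["doc1", "doc2", "doc3"]}
    print(assignments)
    ```

    Replicas then copy each shard to other nodes, which is what buys
    the redundancy and extra query throughput described above.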

  • Getting Started

    • Download Apache Solr (current version 8.2.0) from
    https://www.apache.org/dyn/closer.lua/lucene/solr/8.2.0/solr-8.2.0.tgz
    (or the zip for Windows)

    • Extract the archive and go to the solr directory

    • Open a terminal and type:

    bin/solr start -e cloud -noprompt

    • This starts up a SolrCloud cluster with embedded
    ZooKeeper (cloud management service) on the local
    workstation with 2 nodes

    – First node listens on port 8983, second on port 7574

    • You can see that Solr is running by loading
    http://localhost:8983/solr/ in your web browser.


  • Solr web interface

  • SolrCloud

    • Preview collections on the Cloud tab

    – One collection created automatically → gettingstarted

    – The collection is partitioned into 2 shards

    • First node stores 2 leader shards; second stores 2 replicas

    • The Solr server is up and running, with one collection
    but no data indexed

    • Important configuration files: solrconfig.xml,
    managed-schema (or schema.xml)

    – solr-dir/server/solr/configsets/_default/conf/solrconfig.xml

    – solr-dir/server/solr/configsets/_default/conf/managed-schema
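    With the cluster running, documents can be indexed by POSTing
    JSON to the collection's /update handler. A minimal sketch that
    only builds the request (actually sending it requires the running
    Solr instance from the previous step; the field names are
    illustrative):

    ```python
    import json

    # Illustrative documents; "id" is Solr's default unique key,
    # the "title" field is made up for the example.
    docs = [
        {"id": "1", "title": "Introduction to Information Retrieval"},
        {"id": "2", "title": "Search Engines in Practice"},
    ]

    # Solr's JSON update format accepts a bare array of documents.
    payload = json.dumps(docs)

    # Update handler of the collection created by the cloud example;
    # commit=true makes the documents immediately searchable.
    url = "http://localhost:8983/solr/gettingstarted/update?commit=true"

    print(url)
    print(payload)
    ```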

  • DataDir and DirectoryFactory

    • DataDir: location for index data

    – Solr stores its index data in a directory called /data
    under the core's instance directory

    – use the <dataDir> parameter in solrconfig.xml to modify it

    • DirectoryFactory:

    – solr.StandardDirectoryFactory: filesystem-based directory factory

    – solr.SimpleFSDirectoryFactory: index file on local filesystem, problem with multiple threads

    – solr.NIOFSDirectoryFactory: scales with many threads, does not work well on MS Windows platforms

    – solr.MMapDirectoryFactory: uses virtual memory and disk, not for real-time searching

    – solr.NRTCachingDirectoryFactory: some chunks in RAM, for real-time searching

    – solr.RAMDirectoryFactory: whole index in memory, does not work for replication

    – solr.HdfsDirectoryFactory: index in HDFS (desirable if using Hadoop)

    https://lucene.apache.org/solr/guide/8_1/datadir-and-directoryfactory-in-solrconfig.html
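    For reference, the two settings might look like this in
    solrconfig.xml (the path and the factory choice below are only
    examples, not defaults to copy):

    ```xml
    <!-- Custom index location instead of the core's /data directory -->
    <dataDir>/var/solr/data/mycore</dataDir>

    <!-- Factory supporting near-real-time search -->
    <directoryFactory name="DirectoryFactory"
                      class="solr.NRTCachingDirectoryFactory"/>
    ```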

  • How Solr Sees the World

    • Document: the basic unit of information

    – a set of data that describes something

    • E.g. a document about a person might contain the
    person's name, biography, favorite color, and shoe size

    – documents are expected to be composed of fields, which
    are more specific pieces of information

    • E.g. "first_name":"Pavlos", "shoe_size":42

    – fields can contain different types of data

    • first_name → text, shoe_size → number

    • The user defines the type of each field

    • The field type tells Solr how to interpret the field and
    how it can be queried

    – When a document is added to a collection, Solr takes the
    values from the document's fields and adds them to the index

    – Queries consult the index and return matching documents
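    The last two points can be illustrated with a toy inverted
    index (a simplified sketch of the kind of structure Lucene
    maintains for Solr, not the actual implementation):

    ```python
    from collections import defaultdict

    # Toy documents with an illustrative "biography" field.
    docs = {
        1: "received his PhD from the University of Cyprus",
        2: "studied computer science at the University of Athens",
    }

    # Indexing: map each term to the set of documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)

    # Querying: consult the index instead of scanning every document.
    def search(term: str) -> set:
        return index.get(term.lower(), set())

    print(search("university"))  # -> {1, 2}
    ```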

  • Field Analysis Process

    • How does Solr process document fields when building an
    index?

    – Example: biography field in a person document

    "biography": "He received his Ph.D. from the Department of
    Computer Science of the University of Cyprus, in 2012"

    – Index every word of the biography in order to quickly
    find people whose lives have had anything to do with
    "university" or "computer". Any issues?

    • What if the biography contains a lot of common words you
    don't really care about, like "he", "the", "a", "to",
    "for", "is" (stop words)?

    • What if the biography contains the word "University" and
    a user queries for "university"?

    • Solution: field analysis

  • Field Analysis Process

    • For each field, you can tell Solr:

    – how to break the text apart into words (tokenization)

    • E.g. split at whitespace, commas, etc.

    – to remove stop words (filtering)

    – to apply lower-case normalization

    – to remove accent marks

    Read more here: Understanding Analyzers, Tokenizers, and Filters
    https://lucene.apache.org/solr/guide/7_2/understanding-analyzers-tokenizers-and-filters.html#understanding-analyzers-tokenizers-and-filters
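    These steps can be sketched as a small analysis chain (a
    simplified illustration of what a Solr analyzer does, not
    Solr's own code; the stop-word list is abbreviated):

    ```python
    import re
    import unicodedata

    STOP_WORDS = {"he", "the", "a", "to", "for", "is", "of", "from", "in", "his"}

    def analyze(text: str) -> list:
        # Tokenization: split on any run of non-word characters.
        tokens = re.split(r"[^\w]+", text)
        # Lower-case normalization.
        tokens = [t.lower() for t in tokens if t]
        # Accent removal: decompose characters, drop combining marks.
        tokens = [
            "".join(c for c in unicodedata.normalize("NFD", t)
                    if not unicodedata.combining(c))
            for t in tokens
        ]
        # Stop-word filtering.
        return [t for t in tokens if t not in STOP_WORDS]

    print(analyze("He received his Ph.D. from the University of Cyprus, in 2012"))
    ```

    After this chain, "University" and a query for "university" hit
    the same indexed term, and common words no longer bloat the index.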

  • Schema files and manipulation

    • Solr stores details about the field types and fields it
    is expected to understand in a schema file:

    – managed-schema is the name of the schema file Solr uses
    by default; it supports making schema changes at runtime
    via the Schema API (over HTTP) or via Schemaless Mode,
    and hand editing of the managed schema file should be
    avoided

    – schema.xml is the traditional name for a schema file
    which can be edited manually by users who use the
    ClassicIndexSchemaFactory (before Solr 6)

    – If you are using SolrCloud you may not be able to find
    any file by these names on the local filesystem. You will
    only be able to see the schema through the Schema API (if
    enabled) or through the Solr Admin UI's Cloud Screens

    Read more:
    https://lucene.apache.org/solr/guide/7_2/overview-of-documents-fields-and-schema-design.html#solr-s-schema-file
    https://lucene.apache.org/solr/guide/7_2/schema-api.html#schema-api
    https://lucene.apache.org/solr/guide/7_2/schemaless-mode.html#schemaless-mode
    https://lucene.apache.org/solr/guide/7_2/schema-factory-definition-in-solrconfig.html#schema-factory-definition-in-solrconfig
    https://lucene.apache.org/solr/guide/7_2/cloud-screens.html#cloud-screens
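    A typical runtime schema change is adding a field through the
    Schema API. The sketch below only constructs the request body
    (the field name is an example; POSTing it to the collection's
    /schema endpoint requires the running cluster):

    ```python
    import json

    # Schema API "add-field" command; field name and type are
    # illustrative ("text_general" is a stock Solr field type).
    command = {
        "add-field": {
            "name": "biography",
            "type": "text_general",
            "stored": True,
        }
    }

    payload = json.dumps(command)
    url = "http://localhost:8983/solr/gettingstarted/schema"
    print(url)
    print(payload)
    ```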