Dev:APIsearch

Aus YaCyWiki
Wechseln zu: Navigation, Suche

The Remote Search Servlet

/yacy/search.html is a servlet which returns a property list as result, not a html page as the name would suggest. This servlet is the RWI remote search query interface and is used during DHT-driven peer-to-peer searches.

Operations performed when the servlet is called

Each peer in the YaCy peer-to-peer network stores so-called Reverse Word Indexes (RWIs) which had been transmitted to that peer during the Distributed Hash Table (DHT) remote transmission process, described in /yacy/transferRWI.html. This servlet returns results from that transmissions out of the local search index. The servlet performs additionally an 'index abstract' computation whenever more than two search terms are submitted and adds that index abstract to the search result. In addition, specific index abstracts can be retrieved through this servlet as well. Index abstracts represent compressed lists of url hash collections which are used by the searching peer to do remote result list conjunctions.

Properties in the http header

The caller must obey the http header rules.

Properties in the http post-arguments

Calls to the servlet can be made using a GET or POST operation. The properties are:

  • myseed=<the SimpleCoding of the own Peer Seed> (optional)
  • query=<a string of word hashes without delimiters, x*12 bytes for x words>
  • exclude=<a string of word hashes> - those shall not be within the search results
  • urls=<a string of url hashes that are preselected for the search> - no other than these urlhashes may be returned. This is used for 'secondary searches', done after remote conjunction of index abstracts
  • abstracts=<a string of word hashes> - for abstracts that shall be generated, or 'auto' (for maxcount-word), or (for none)
  • count=<integer, maximum number of wanted results>
  • time=<long, maximum waiting time>
  • maxdist=<integer, maximum distance between the words in the text>
  • prefer=<regular expression for urls that shall be selected from the search results to appear in the result list>
  • contentdom=<content domain> - one of 'text', 'image', 'audio', 'video', 'app' or 'all'
  • filter=<regular expression for urls that must match in the result url>

Result

The returned document contains a list of property lines. These properties are:

  • version=<YaCy version number of responding peer>
  • uptime=
  • <code>searchtime=<elapsed time for the search></code>
  • <code>references=<a comma-separated list of words appearing in the result, i.e. for a tag cloud></code>
  • <code>joincount=<number of documents matching the search word hashes></code>
  • <code>count=<number of result lines returned in 'resource' within this result></code>
  • <code>resource<c>=<result entry, a comma-separated set of result properties within '{' and '}'></code>; list of resource0 .. resource<count> properties where 0 <= <c> <= <count>.
  • <code>indexcount.<wordhash>=<result count for each word></code>; this is a list of properties for each word in the query attribute, counting the number of entries in the word-RWI
  • <code>indexabstract.<wordhash>=<index abstract for each word></code>; this is a list of properties for each word in the query attribute, representing a compressed list of url hashes inside the corresponding RWI.

Resource

Each <code>resource</code>-entry contains a set of properties for each url. These properties are:

  • <code>hash=<url-hash></code>
  • <code>url=<SimpleCoding of the url></code>
  • <code>descr=<SimpleCoding of the document description, i.e. the document title></code>
  • <code>author=<SimpleCoding of the document author></code>
  • <code>tags=<SimpleCoding of the document keywords or subject content></code>
  • <code>publisher=<SimpleCoding of the document publisher></code>
  • <code>lat=<latitude of a document-attached location></code>
  • <code>lon=<longitude of a document-attached location></code>
  • <code>mod=<last_modified date in yyyyMMdd></code>
  • <code>load=<load date in yyyyMMdd></code>
  • <code>fresh=<date until resource shall be considered as fresh in yyyyMMdd></code>
  • <code>referrer=<url-hash of one document referrer></code> there may be more referrers but this is the first referrer that the crawler caused to load this page initially
  • <code>md5=<the md5 of the raw source></code>
  • <code>size=<the size of the document in bytes></code>
  • <code>wc=<the number of words in the document></code>
  • <code>dt=<the content_type of the document></code> this is a mime-type
  • <code>flags=<an encoded bitfield with attributes about the document></code>
  • <code>lang=<the language used in the document></code>
  • <code>llocal=<the inbound link count></code> number of links from this document to documents within the same domain
  • <code>lother=<the outbound link count></code> number of links from this document to documents within another domain
  • <code>limage=<number of embedded images></code>
  • <code>laudio=<number of embedded links to audio documents></code>
  • <code>lvideo=<number of embedded links to video documents></code>
  • <code>lapp=<number of embedded links to downloadable applications></code>
  • <code>snippet=<SimpleCoding of the search result snippet, a text where the search word appears></code>

Index Abstract

An index abstract is a compressed list of url hashes. The compression is made by removal of the host-hash part of an url-hash. url-hashes are constructed by the pattern <code><overall-hash><host-hash></code>. Both parts of the hash are six bytes in length. The last six bytes of url-hashes are therefore identical, if the host of the urls are identical. A set of url-hashes is compressed by grouping of the url-hashes by the host and then these groups are written by a leading host-hash followed by the list of the remaining hash part. The index abstract has the form

indexabstract.<wordhash>  = '{', domainabstract.<wordhash>, (',', domainabstract.<wordhash>)* '}'
domainabstract.<wordhash> = host-hash, ':',  (overall-hash)*

Example Usage

These are some examples searches: