Dev:Segments

This page documents the new data structures for the next index file format. None of this is implemented yet!

Segments Architecture: a Vertical Index

The new YaCy index will use separate indexes for different document domains. A document domain can be technical (such as a partition by URL hash pattern) or defined by a user who wants to create a separate index that can also be tar/zipped and sent to somebody else. The segments correspond to the idea of a vertical index, and they do in fact form a vertical index in the sense used for performance enhancements in large indexing systems. The benefits are:

  • The sets of URLs referenced by the segments of the vertical index are disjoint.
  • A search over a vertical index is like a meta-search.
  • If the segments of the vertical index are distributed to other peers, the search time decreases. If they stay on the same peer, the additional IO increases the search time.
  • If the segments are organized in small and large groups, the small groups can be held in RAM, which compensates for the additional IO caused by using several segments.
  • The ability to hold specific segments completely in RAM will enable a 'very high performance option' for specific search cases, while the architecture also supports low-memory configurations.
  • Each segment represents a horizontal index (consisting of several files). Together, the vertical and the horizontal index form a matrix of index files, which should be small enough to be sent to other peers as they are. This will enable a high-performance RWI file distribution to other peers.

File Structure

The segments will be placed in DATA/INDEX/<network>/SEGMENTS (which we call the segments directory from now on). We will not keep the separate media indexes we had before, because they can be modelled using the segment structure. The segments directory contains one segment subdirectory for each horizontal index. There is no special naming scheme for the segments, but there will be a default directory for all indexes that are not assigned to a special segment ('common'), and a directory for indexes that were created using the proxy ('proxy').
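As an illustration only, the layout described above could be created as follows. This is a minimal sketch: the function names are hypothetical, and the network name 'freeworld' in the usage note is just an example value for <network>.

```python
from pathlib import Path

# Default segments named in the text; further segments may use any name.
DEFAULT_SEGMENTS = ("common", "proxy")


def segments_dir(data_root, network):
    """Return DATA/INDEX/<network>/SEGMENTS for a given DATA root."""
    return Path(data_root) / "INDEX" / network / "SEGMENTS"


def ensure_default_segments(data_root, network):
    """Create the 'common' and 'proxy' segment directories if missing."""
    base = segments_dir(data_root, network)
    created = []
    for name in DEFAULT_SEGMENTS:
        segment = base / name
        segment.mkdir(parents=True, exist_ok=True)
        created.append(segment)
    return created
```

Calling ensure_default_segments("DATA", "freeworld") would create DATA/INDEX/freeworld/SEGMENTS/common and DATA/INDEX/freeworld/SEGMENTS/proxy.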

The Segment: a Horizontal Index

Each segment will contain five parts:

  • a property file ('segment.properties') describing the segment. The properties will describe which URLs shall be stored in the segment, using patterns for the URL hash and/or the URL string. A priority number will put conflicting patterns into a defined order (for example, the catch-all segment 'common' will have a catch-all pattern and a low priority).
  • a directory ('references') containing databases to the reference-hashes in the index. This will be equal to the URL database as currently used in DATA/INDEX/<network>/TEXT/urls.*
  • a directory ('objects') containing a cache of the indexed documents. This will be equal to the HTCACHE directory, which now becomes an index-specific cache.
  • a directory ('rwi') containing the index files for the horizontal index.
  • a directory ('queues') containing the indexing queues of a crawl and the error database files. When a crawl is finished, the queues directory may be emptied. A crawl profile database may be kept as a record of the crawls that filled the segment.
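The pattern and priority mechanism of 'segment.properties' could work roughly like this. The sketch makes several assumptions that are not fixed by this page: patterns are regular expressions, an omitted pattern matches everything, and a higher number means a higher priority. The class and function names are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class SegmentRule:
    """Assumed content of one segment.properties file."""
    name: str
    priority: int             # assumption: higher number wins
    url_pattern: str = ".*"   # regex on the URL string (catch-all default)
    hash_pattern: str = ".*"  # regex on the URL hash (catch-all default)

    def matches(self, url, url_hash):
        return (re.fullmatch(self.url_pattern, url) is not None
                and re.fullmatch(self.hash_pattern, url_hash) is not None)


def select_segment(rules, url, url_hash):
    """Return the name of the highest-priority segment whose patterns match."""
    for rule in sorted(rules, key=lambda r: r.priority, reverse=True):
        if rule.matches(url, url_hash):
            return rule.name
    return None  # only reached if there is no catch-all segment
```

With a low-priority catch-all rule for 'common' in the list, every URL that no other segment claims ends up in 'common'.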

The 'rwi' subdirectory will contain a new data structure that provides the features requested most often in the past: less IO during index creation, and an option to delete parts of the index using a given time-out. The 'rwi' directory will contain the following files:

  • 64 subdirectories named 'cell_b64_<c>' where c is one of the literals of the b64 encoding as used for the word hashes. This cell directory contains files as currently used for the RWI RAM cache dump after SVN 5430.
  • a cell file will be written whenever the RAM cache assigned to that cell is full. When the index is opened, only the cell.idx file is read and its content held in RAM.
  • cell files will have a time stamp, so it is possible to delete cell files by date.
  • if the number of dumped cell files gets too large, several cell files can be merged into a new one. This should be possible without much IO, because the cell files can be read and written using streams. The RWI RAM cache dump procedure shows that this is very fast.
  • the cell file organization is partly available with the kelondroBLOBArray, the same class that holds the object files for the HTCACHE.
  • the permanent RAM cache flush that we have now to put indexes into the RICOLLECTION will no longer exist, which will reduce IO to almost zero.
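The cell naming and the deletion by date might look like the following sketch. It rests on assumptions: the first character of the word hash selects the cell, the b64 alphabet is the one given in B64_ALPHABET, and the file modification time stands in for the time stamp of a cell file (the actual time stamp format is not defined on this page). All names are hypothetical.

```python
import time
from pathlib import Path

# Assumed b64 alphabet of the word hashes (64 characters).
B64_ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                "abcdefghijklmnopqrstuvwxyz"
                "0123456789-_")


def cell_dir_name(word_hash):
    """Map a word hash to its cell directory (assumption: first character)."""
    c = word_hash[0]
    if c not in B64_ALPHABET:
        raise ValueError("not a b64 character: %r" % c)
    return "cell_b64_" + c


def expire_cell_files(cell_dir, max_age_seconds, now=None):
    """Delete all files in cell_dir older than max_age_seconds.

    Uses the file modification time as a stand-in for the cell time stamp.
    Returns the names of the deleted files.
    """
    now = time.time() if now is None else now
    removed = []
    for f in sorted(Path(cell_dir).iterdir()):
        if f.is_file() and now - f.stat().st_mtime > max_age_seconds:
            f.unlink()
            removed.append(f.name)
    return removed
```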

Migration

The current directory DATA/INDEX/<network>/TEXT/RICACHE contains the same files as a cell directory will contain. The migration to the new data structure can be done in the following steps:

  • switch off the RAM cache flush to RICOLLECTION. Instead, the RAM cache must be flushed to a dump file each time the cache is full.
  • implementation of a RAM cache meta object that can hold several RAM caches at the same time, of which only one is writable
  • implementation of a cache dump merge operation. This must be done each time the number of cache dumps gets too large. For example, when there are 10 dumps, the 3 smallest dumps will be merged into a new common dump that replaces these 3 dumps.
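The merge step could be sketched as below. For illustration a dump is modelled as an in-memory list of (term hash, references) pairs sorted by term hash, which is an assumption and not the real dump file format. The thresholds 10 and 3 are the example values from the list above, and the function names are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter

MAX_DUMPS = 10   # merge once this many dumps exist (example value)
MERGE_COUNT = 3  # number of smallest dumps merged at once (example value)


def merge_dumps(dumps):
    """Stream-merge dumps that are each sorted by term hash.

    References stored under the same term hash are concatenated.
    """
    merged = heapq.merge(*dumps, key=itemgetter(0))
    for term_hash, group in groupby(merged, key=itemgetter(0)):
        refs = []
        for _, r in group:
            refs.extend(r)
        yield term_hash, refs


def compact(dumps):
    """Replace the smallest dumps by one merged dump if there are too many."""
    if len(dumps) < MAX_DUMPS:
        return dumps
    by_size = sorted(dumps, key=len)
    smallest, rest = by_size[:MERGE_COUNT], by_size[MERGE_COUNT:]
    return rest + [list(merge_dumps(smallest))]
```

Because heapq.merge only looks at the head of each input, the same approach works on streams read from files, which is why the merge should need little IO beyond one sequential pass over the merged dumps.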

Once we reach this point, we will already have a better data structure than the one implemented in Lucene: Lucene also has a dump merge, but uses several files for the index and the index attributes, which we have combined into one (with less IO when reading it).

  • creation of the DATA/INDEX/<network>/SEGMENTS directory, then moving DATA/INDEX/<network>/TEXT/RICACHE to DATA/INDEX/<network>/SEGMENTS/common/rwi/row. The row directory is a special directory, which must be flushed into the cell directories in another step.
  • move DATA/INDEX/<network>/TEXT/urls.* to DATA/INDEX/<network>/SEGMENTS/common/references/
  • move DATA/HTCACHE to DATA/INDEX/<network>/SEGMENTS/common/objects
  • add a migration procedure to translate DATA/INDEX/<network>/TEXT/RICOLLECTION into DATA/INDEX/<network>/SEGMENTS/common/rwi/

At this point, we have just migrated the old data structure to the new one, but the next step will enable a more efficient RWI transmission to other peers:

  • replicate the segment data structure to artificial segments named vertical_e4_<h>, where <h> is a hex digit from 0 to F. These are 16 segments that distinguish the document domain using the URL hash: the value of the leading b64 literal (0 to 63) is divided by 4, which gives a single-character hex number. This partitioning of the index is already implemented in the new peer target computation. When we put the index into these 16 partitions and distribute them to 16 DHT positions (plus redundancy), we get a 16-fold search performance.
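The mapping from a URL hash to one of the 16 vertical segments can be written down in a few lines. The sketch assumes the b64 alphabet given in B64_ALPHABET and an integer division; the function name is hypothetical.

```python
# Assumed b64 alphabet of the URL hashes (64 characters, values 0..63).
B64_ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                "abcdefghijklmnopqrstuvwxyz"
                "0123456789-_")


def vertical_segment(url_hash):
    """Return the vertical_e4_<h> segment name for a URL hash."""
    value = B64_ALPHABET.index(url_hash[0])  # raises ValueError if not b64
    return "vertical_e4_" + format(value // 4, "X")  # 64 / 4 = 16 segments
```

Under these assumptions a URL hash starting with 'A' (value 0) belongs to vertical_e4_0 and one starting with '_' (value 63) belongs to vertical_e4_F.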