Dev:APICrawlProfileEditor

/CrawlProfileEditor_p.xml

A crawl profile is the set of parameters that defines a running crawl job: start URL, crawl depth and URL filters. The crawl profiles of a peer can be retrieved as XML:

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<crawlProfiles>
  <crawlProfile>
    <name>snippetLocalMedia</name>
    <status>active</status>
    <starturl />
    <depth>0</depth>
    <mustmatch>.*</mustmatch>
    <mustnotmatch />
    <crawlingIfOlder>02.01.2010 16:10:00</crawlingIfOlder>
    <crawlingDomFilterDepth>inactive</crawlingDomFilterDepth>
    <crawlingDomFilterContent />
    <crawlingDomMaxPages>unlimited</crawlingDomMaxPages>
    <withQuery>yes</withQuery>
    <storeCache>no</storeCache>
    <indexText>no</indexText>
    <indexMedia>no</indexMedia>
    <remoteIndexing>no</remoteIndexing>
  </crawlProfile>
  <crawlProfile>
    <name>/autoReCrawl/daily/http://www.acer-userforum.de/</name>
    <status>active</status>
    <starturl>http://www.acer-userforum.de/</starturl>
    <depth>3</depth>
    <mustmatch>.*</mustmatch>
    <mustnotmatch>.*memberlist.*|.*previous.*|.*next.*|.*p=.*</mustnotmatch>
    <crawlingIfOlder>31.01.2010 14:56:24</crawlingIfOlder>
    <crawlingDomFilterDepth>inactive</crawlingDomFilterDepth>
    <crawlingDomFilterContent />
    <crawlingDomMaxPages>unlimited</crawlingDomMaxPages>
    <withQuery>yes</withQuery>
    <storeCache>no</storeCache>
    <indexText>yes</indexText>
    <indexMedia>yes</indexMedia>
    <remoteIndexing>no</remoteIndexing>
  </crawlProfile>
</crawlProfiles>
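
The XML above can be turned into plain PHP arrays with the SimpleXML extension. The following is only a minimal sketch: the function name parseCrawlProfiles is made up for this page and is not part of YaCy, and the embedded sample is a shortened copy of the output shown above.

```php
<?php
// Hypothetical helper (not part of YaCy): convert the XML returned by
// CrawlProfileEditor_p.xml into a list of associative arrays.
function parseCrawlProfiles(string $xml): array
{
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        // Not well-formed XML: report "no profiles" instead of failing.
        return [];
    }
    $profiles = [];
    foreach ($doc->crawlProfile as $profile) {
        $row = [];
        foreach ($profile->children() as $field) {
            // Empty elements such as <starturl /> become empty strings.
            $row[$field->getName()] = (string) $field;
        }
        $profiles[] = $row;
    }
    return $profiles;
}

// Shortened sample in the same shape as the output above.
$sample = <<<XML
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<crawlProfiles>
  <crawlProfile>
    <name>snippetLocalMedia</name>
    <status>active</status>
    <starturl />
    <depth>0</depth>
  </crawlProfile>
</crawlProfiles>
XML;

$profiles = parseCrawlProfiles($sample);
echo $profiles[0]['name'], " depth=", $profiles[0]['depth'], "\n";
```

All values arrive as strings, so fields such as depth need an explicit cast before numeric use, and values such as unlimited or inactive have to be handled separately.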

This plain PHP example uses cURL to request the list of all crawl profiles a peer has loaded and prints them as an HTML table.

<?php
  // Placeholders: set your peer's address and its admin credentials ("user:password")
  $YaCyURL = "http://mypeer.tld:8080/";
  $appID = "admin:password";
  $command = "CrawlProfileEditor_p.xml";
  // open connection to peer
  $cu = $YaCyURL.$command;
  $queryServer = curl_init($cu);
  curl_setopt($queryServer, CURLOPT_HEADER, 0);
  curl_setopt($queryServer, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($queryServer, CURLOPT_USERPWD, $appID);
  $results = curl_exec($queryServer);
  curl_close($queryServer);
  // parse xml (SimpleXML stands in for xml2array(), which is not a PHP built-in)
  $xml = ($results !== false) ? simplexml_load_string($results) : false;
  // get items only
  $items = ($xml !== false) ? $xml->crawlProfile : array();
  // 'hash' does not appear in the sample output; its cell stays empty if the peer omits it
  $fields = array('hash', 'name', 'status', 'starturl', 'depth', 'mustmatch', 'mustnotmatch',
                  'crawlingIfOlder', 'crawlingDomFilterDepth', 'crawlingDomFilterContent',
                  'crawlingDomMaxPages', 'withQuery', 'storeCache', 'indexText',
                  'indexMedia', 'remoteIndexing');
  if (count($items) > 0)
  {
    echo "<h1>Crawl Profiles</h1>";
    echo "<table>";
    $tr = "aaaaaa";
    foreach ($items as $item)
    {
      // alternate the row background colour
      if ($tr == "ffffff") {$tr = "aaaaaa";} else {$tr = "ffffff";}
      echo "<tr bgcolor=\"#".$tr."\">";
      foreach ($fields as $field)
      {
        // escape values, since filters and URLs may contain HTML special characters
        echo "<td>".htmlspecialchars((string) $item->$field)."</td>";
      }
      echo "</tr>";
    }
    echo "</table>";
  }
?>
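
The mustmatch and mustnotmatch fields of a profile hold regular expressions. As a rough illustration of how such a filter pair could be evaluated on the client side, the sketch below checks a URL against both patterns. It rests on assumptions that this page does not confirm: that each pattern has to match the whole URL, that an empty mustnotmatch rejects nothing, and that the patterns are compatible with PCRE. The function name urlPassesFilters is hypothetical.

```php
<?php
// Hypothetical helper (not part of YaCy): would this URL pass a profile's filters?
// Assumptions: patterns must match the whole URL; an empty mustnotmatch blocks nothing.
function urlPassesFilters(string $url, string $mustMatch, string $mustNotMatch): bool
{
    // '~' is used as the delimiter so slashes in the patterns need no escaping;
    // a pattern that itself contains '~' would have to be escaped first.
    $accepted = preg_match('~^(?:' . $mustMatch . ')$~', $url) === 1;
    $blocked = $mustNotMatch !== ''
        && preg_match('~^(?:' . $mustNotMatch . ')$~', $url) === 1;
    return $accepted && !$blocked;
}

// Filter values taken from the second profile in the sample output.
$mustMatch = '.*';
$mustNotMatch = '.*memberlist.*|.*previous.*|.*next.*|.*p=.*';

var_dump(urlPassesFilters('http://www.acer-userforum.de/forum/index.html', $mustMatch, $mustNotMatch));
var_dump(urlPassesFilters('http://www.acer-userforum.de/memberlist.php', $mustMatch, $mustNotMatch));
```

With the sample filters, the first URL is accepted and the second is rejected because it contains memberlist.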