Holger's
Java API

com.antelmann.net
Class SampleCrawlerSetting

java.lang.Object
  extended by com.antelmann.net.SampleCrawlerSetting
All Implemented Interfaces:
CrawlerSetting, Serializable

public class SampleCrawlerSetting
extends Object
implements CrawlerSetting, Serializable

SampleCrawlerSetting is exactly what its name suggests: a sample CrawlerSetting. It is currently used by JSpider as the default CrawlerSetting.

Author:
Holger Antelmann
See Also:
JSpider, Serialized Form
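As a rough illustration of the default policy documented below (search all files up to 3 levels deep, on the current site only), here is a self-contained sketch of such a depth/host check. The class and method names in this sketch are made up for illustration and are not part of this API; only the depth limit and same-site restriction are taken from this page.

```java
/**
 * Illustrative stand-in mirroring the documented default policy of
 * SampleCrawlerSetting(): follow links up to 3 levels deep, on the
 * starting site only. SiteLimitSketch and follow() are hypothetical names.
 */
public class SiteLimitSketch {
    final int maxDepth;            // analogue of the public 'depth' field
    final boolean currentSiteOnly; // analogue of 'currentSiteOnly'

    SiteLimitSketch(int maxDepth, boolean currentSiteOnly) {
        this.maxDepth = maxDepth;
        this.currentSiteOnly = currentSiteOnly;
    }

    /** Rough analogue of followLinks(): should a link at this depth/host be expanded? */
    boolean follow(String host, String rootHost, int depth) {
        if (depth >= maxDepth) return false;
        return !currentSiteOnly || host.equalsIgnoreCase(rootHost);
    }

    public static void main(String[] args) {
        SiteLimitSketch s = new SiteLimitSketch(3, true); // the documented defaults
        System.out.println(s.follow("example.com", "example.com", 1)); // same site, shallow
        System.out.println(s.follow("other.org", "example.com", 1));   // different site
    }
}
```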

Field Summary
 boolean currentSiteOnly
           
static String[] defaultRestrictURLPattern
           
 int depth
           
 boolean includeHTMLCode
           
 String[] includeTextPattern
           
 String[] restrictURLPattern
           
 
Constructor Summary
SampleCrawlerSetting()
          searches all files up to 3 levels deep, on the current site only
SampleCrawlerSetting(int depth, boolean currentSiteOnly, String[] restrictURLPattern, String[] includeTextPattern, boolean includeHTMLCode)
           
SampleCrawlerSetting(int depth, String includeTextPattern)
           
 
Method Summary
 boolean followLinks(URL url, URL referer, int depth, List<URL> resultURLList, List<URL> closedURLList, List<Spider.URLWrapper> searchURLWrapperList)
          followLinks() determines whether the given URL should be searched for links to examine further at the next level.
 boolean isActive()
          if inactive, followLinks() always returns false
 boolean matchesCriteria(URL url, URL referer, int depth, List<URL> resultURLList, List<URL> closedURLList)
          This method decides whether the URL itself or its content qualifies for what this CrawlerSetting searches for; as this function is called on every URL encountered, it is also the place for any custom parsing this CrawlerSetting wants to do.
 void setActive(boolean flag)
           
 String toString()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

defaultRestrictURLPattern

public static final String[] defaultRestrictURLPattern

depth

public int depth

currentSiteOnly

public boolean currentSiteOnly

restrictURLPattern

public String[] restrictURLPattern

includeTextPattern

public String[] includeTextPattern

includeHTMLCode

public boolean includeHTMLCode
Constructor Detail

SampleCrawlerSetting

public SampleCrawlerSetting()
searches all files up to 3 levels deep, on the current site only


SampleCrawlerSetting

public SampleCrawlerSetting(int depth,
                            String includeTextPattern)

SampleCrawlerSetting

public SampleCrawlerSetting(int depth,
                            boolean currentSiteOnly,
                            String[] restrictURLPattern,
                            String[] includeTextPattern,
                            boolean includeHTMLCode)
Method Detail

setActive

public void setActive(boolean flag)

isActive

public boolean isActive()
if inactive, followLinks() always returns false


followLinks

public boolean followLinks(URL url,
                           URL referer,
                           int depth,
                           List<URL> resultURLList,
                           List<URL> closedURLList,
                           List<Spider.URLWrapper> searchURLWrapperList)
Description copied from interface: CrawlerSetting
followLinks() determines whether the given URL should be searched for links to examine further at the next level. The three List objects allow the CrawlerSetting to act on potential constraints, such as a maximum total number of nodes to examine (or any other custom check imaginable). The url may be any URL, including non-HTTP protocols (such as mailto: or ftp:) and image or media URLs.

Specified by:
followLinks in interface CrawlerSetting
Parameters:
url - the URL that is to be examined for its links
referer - url's referer URL
depth - distance from the original root URL where the search began
resultURLList - List of URLs that have already been found to match this CrawlerSetting's criteria
closedURLList - List of URLs that have already been found not to match the CrawlerSetting's criteria
searchURLWrapperList - List of Spider.URLWrapper objects already identified to be examined in the next level
See Also:
Spider.URLWrapper
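To show where a per-level decision like followLinks() fits, here is a self-contained sketch of a breadth-first crawl loop over an in-memory link graph. The graph, class, and method names are made up for illustration; only the idea of expanding a URL's links level by level, gated by a depth check, reflects the contract described above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Illustrative breadth-first expansion loop in the spirit of how a crawler
 * might consult a followLinks()-style check per level. CrawlLoopSketch and
 * crawl() are hypothetical names; the link graph is a stand-in for real URLs.
 */
public class CrawlLoopSketch {
    static List<String> crawl(Map<String, List<String>> links, String root, int maxDepth) {
        List<String> visited = new ArrayList<>();
        Deque<String[]> queue = new ArrayDeque<>(); // entries: {url, depth}
        Set<String> seen = new HashSet<>();
        seen.add(root);
        queue.add(new String[]{root, "0"});
        while (!queue.isEmpty()) {
            String[] entry = queue.poll();
            String url = entry[0];
            int depth = Integer.parseInt(entry[1]);
            visited.add(url);
            // analogue of followLinks(): only expand this URL's links while below maxDepth
            if (depth >= maxDepth) continue;
            for (String child : links.getOrDefault(url, List.of())) {
                if (seen.add(child)) queue.add(new String[]{child, String.valueOf(depth + 1)});
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("d"));
        System.out.println(crawl(graph, "a", 1)); // "d" is 2 levels away, so it is not reached
    }
}
```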

matchesCriteria

public boolean matchesCriteria(URL url,
                               URL referer,
                               int depth,
                               List<URL> resultURLList,
                               List<URL> closedURLList)
Description copied from interface: CrawlerSetting
This method decides whether the URL itself or its content qualifies for what this CrawlerSetting searches for; as this function is called on every URL encountered, it is also the place for any custom parsing this CrawlerSetting wants to do. The two List objects allow the CrawlerSetting to act on potential constraints, such as a maximum total number of nodes to examine (or any other custom check imaginable). Note that it is the responsibility of the calling object to ensure that this function isn't called multiple times on the same URL if that is not desired. The url may be any URL, including non-HTTP protocols (such as mailto: or ftp:) and image or media URLs.

Specified by:
matchesCriteria in interface CrawlerSetting
Parameters:
url - the URL in question to satisfy the criteria
referer - url's referer URL
depth - link distance from the original root URL where the search began
resultURLList - List of URLs that have already been found to match this CrawlerSetting's criteria
closedURLList - List of URLs that have already been found not to match the CrawlerSetting's criteria
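A matchesCriteria()-style content check might be sketched as follows. This page does not define the semantics of includeTextPattern, so case-insensitive substring matching is purely an assumption here, and the class and method names are made up for illustration.

```java
import java.util.List;

/**
 * Illustrative analogue of matchesCriteria(): a page qualifies when its text
 * contains every include pattern. Substring matching is an assumption; the
 * actual pattern semantics of SampleCrawlerSetting are not documented here.
 */
public class CriteriaSketch {
    static boolean matches(String pageText, List<String> includeTextPattern) {
        String lower = pageText.toLowerCase();
        for (String pattern : includeTextPattern) {
            if (!lower.contains(pattern.toLowerCase())) return false;
        }
        return true; // all patterns found (trivially true for an empty list)
    }

    public static void main(String[] args) {
        System.out.println(matches("Java crawler sample page", List.of("java", "crawler")));
        System.out.println(matches("unrelated page", List.of("java")));
    }
}
```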

toString

public String toString()
Overrides:
toString in class Object


(c) Holger Antelmann, since 2001 - all rights reserved (contact: info@antelmann.com)
See www.antelmann.com/developer for further details and available downloads.