com.antelmann.net
Class MediaCrawler
java.lang.Object
  java.lang.Thread
    com.antelmann.net.MediaCrawler
- All Implemented Interfaces:
- CrawlerSetting, Runnable
public class MediaCrawler
- extends Thread
- implements CrawlerSetting
MediaCrawler is a single thread that searches the web for files of a
given type.
- Since:
- 10/29/2002
- Author:
- Holger Antelmann
- See Also:
Spider
Nested Class Summary
 static interface  MediaCrawler.Handler
                   Used to handle the media files found during the search of the MediaCrawler.
Method Summary
 void        addHandler(MediaCrawler.Handler handler)
 boolean     followLinks(URL url, URL referer, int depth, List<URL> resultURLList, List<URL> closedURLList, List<Spider.URLWrapper> searchURLWrapperList)
             followLinks() determines whether the given URL is to be searched for its links to be examined further in the next level.
 URLCache[]  getFilesFound()
 boolean     matchesCriteria(URL url, URL referer, int depth, List<URL> resultURLList, List<URL> closedURLList)
             This method decides whether either the URL itself or its content qualifies for what this CrawlerSetting searches for; as this function is also called on every URL encountered, it is also the place for any custom parsing this CrawlerSetting wants to do.
 void        run()
Methods inherited from class java.lang.Thread
activeCount, checkAccess, clone, countStackFrames, currentThread, destroy, dumpStack, enumerate, getAllStackTraces, getContextClassLoader, getDefaultUncaughtExceptionHandler, getId, getName, getPriority, getStackTrace, getState, getThreadGroup, getUncaughtExceptionHandler, holdsLock, interrupt, interrupted, isAlive, isDaemon, isInterrupted, join, join, join, resume, setContextClassLoader, setDaemon, setDefaultUncaughtExceptionHandler, setName, setPriority, setUncaughtExceptionHandler, sleep, sleep, start, stop, stop, suspend, toString, yield |
MediaCrawler
public MediaCrawler(URL rootURL,
int depth,
String mediaExtension,
boolean currentSiteOnly,
String[] pattern)
MediaCrawler
public MediaCrawler(URL rootURL,
int depth,
String mediaExtension,
boolean currentSiteOnly,
MediaCrawler.Handler handler,
String[] pattern)
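Since MediaCrawler extends Thread, a caller constructs it, starts it, and joins it like any other thread. The following is only an illustrative sketch, not runnable on its own: it assumes the com.antelmann.net classes are on the classpath, and the root URL, extension, and pattern values are made-up placeholders.

```java
import com.antelmann.net.MediaCrawler;
import java.net.URL;

public class CrawlDemo {
    public static void main(String[] args) throws Exception {
        // Illustrative values only: search two levels deep for .mp3 files,
        // restricted to the starting site, matching the given pattern.
        URL root = new URL("http://example.com/");
        MediaCrawler crawler = new MediaCrawler(
                root, 2, "mp3", true, new String[] { "music" });
        crawler.start();   // MediaCrawler is a Thread; run() performs the crawl
        crawler.join();    // wait for the crawl to finish
        System.out.println(crawler.getFilesFound().length + " files found");
    }
}
```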
addHandler
public void addHandler(MediaCrawler.Handler handler)
run
public void run()
- Specified by:
- run in interface Runnable
- Overrides:
- run in class Thread
getFilesFound
public URLCache[] getFilesFound()
followLinks
public boolean followLinks(URL url,
URL referer,
int depth,
List<URL> resultURLList,
List<URL> closedURLList,
List<Spider.URLWrapper> searchURLWrapperList)
- Description copied from interface:
CrawlerSetting
- followLinks() determines whether the given URL is to be searched for
its links to be examined further in the next level.
The three List objects allow the CrawlerSetting to act on potential constraints
that may result from, e.g., a maximum number of total nodes to be examined
(or any other custom check imaginable).
The url may be any URL, including non-HTTP protocols
(such as mailto:, ftp:) and image or media URLs.
- Specified by:
followLinks in interface CrawlerSetting
- Parameters:
url - the URL that is to be examined for its links
referer - url's referer URL
depth - distance from the original root URL where the search began
resultURLList - List of URLs that have already been found to match this CrawlerSetting's criteria
closedURLList - List of URLs that have already been found not to match the CrawlerSetting's criteria
searchURLWrapperList - List of Spider.URLWrapper objects already identified to be examined in the next level
- See Also:
Spider.URLWrapper
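The list parameters exist so an implementation can enforce global limits on the crawl. The following self-contained sketch shows the kind of check a followLinks implementation might perform; the depth limit, node cap, and use of java.net.URI are assumptions for illustration, not details of this class.

```java
import java.net.URI;
import java.util.List;

public class FollowLinksSketch {
    static final int MAX_DEPTH = 3;     // assumed depth limit
    static final int MAX_NODES = 1000;  // assumed cap on total examined nodes

    /** Returns true if the URL should be searched for further links. */
    static boolean followLinks(URI url, URI referer, int depth,
                               List<URI> resultURLList,
                               List<URI> closedURLList) {
        // Stop descending once the maximum depth is reached.
        if (depth >= MAX_DEPTH) return false;
        // Honor a global cap on the number of nodes already examined.
        if (resultURLList.size() + closedURLList.size() >= MAX_NODES) return false;
        // Only HTTP(S) pages can be parsed for links; skip mailto:, ftp:, etc.
        String scheme = url.getScheme();
        return "http".equals(scheme) || "https".equals(scheme);
    }
}
```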
matchesCriteria
public boolean matchesCriteria(URL url,
URL referer,
int depth,
List<URL> resultURLList,
List<URL> closedURLList)
- Description copied from interface:
CrawlerSetting
- This method decides whether either the URL itself or its content qualifies
for what this CrawlerSetting searches for; as this function is also called on every
URL encountered, it is also the place for any custom parsing this CrawlerSetting
wants to do.
The two List objects allow the CrawlerSetting to act on potential constraints
that may result from, e.g., a maximum number of total nodes to be examined
(or any other custom check imaginable).
Note that it is the responsibility of the calling object to ensure that
this function isn't called multiple times on the same URL if that's not
desired.
The url may be any URL, including non-HTTP protocols
(such as mailto:, ftp:) and image or media URLs.
- Specified by:
matchesCriteria in interface CrawlerSetting
- Parameters:
url - the URL in question to satisfy the criteria
referer - url's referer URL
depth - link distance from the original root URL where the search began
resultURLList - List of URLs that have already been found to match this CrawlerSetting's criteria
closedURLList - List of URLs that have already been found not to match the CrawlerSetting's criteria
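For a media crawler, the natural criterion is the URL's file extension. The following self-contained sketch shows such a check; it is an assumption about how a media-extension filter might work, not MediaCrawler's actual implementation.

```java
import java.net.URI;
import java.util.Locale;

public class MatchesCriteriaSketch {
    /** Returns true if the URL's path ends with the given media extension. */
    static boolean matchesExtension(URI url, String mediaExtension) {
        String path = url.getPath();
        if (path == null) return false;  // opaque URLs (e.g. mailto:) have no path
        // Compare case-insensitively so "song.MP3" matches extension "mp3".
        return path.toLowerCase(Locale.ROOT)
                   .endsWith("." + mediaExtension.toLowerCase(Locale.ROOT));
    }
}
```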
(c) Holger Antelmann since 2001 - all rights reserved (contact: info@antelmann.com)
see www.antelmann.com/developer for further details and available downloads