Holger's Java API
java.lang.Object
  com.antelmann.net.Spider

public class Spider
Spider provides several useful methods for accessing
web content and parsing HTML, most of them based on a simple URL.
Note that because this class uses functionality from the javax.swing
package (although no GUIs are used in this class), non-terminating
javax.swing threads are created when the class is used.
That is, an application using this class without any other
javax.swing GUI components may end up with unwanted non-terminated
threads, possibly forcing a call such as System.exit(0)
to terminate an otherwise simple program.
Most methods are synchronized, so do not expect longer-running methods
(such as getting the links from a URL) to run simultaneously on the same Spider object.
See Also:
CrawlerSetting, URLCache

| Nested Class Summary | |
|---|---|
static class |
Spider.SMonitor
Deprecated. |
static class |
Spider.URLWrapper
wraps a java.net.URL and keeps a reference to its referrer |
| Constructor Summary | |
|---|---|
Spider()
convenience constructor that initializes the Spider with a null value as URL |
|
Spider(String urlString)
|
|
Spider(URL url)
constructs a Spider object based on the given URL |
|
Spider(URL url,
String user,
char[] password)
|
|
| Method Summary | |
|---|---|
int |
calculatePageWeight()
returns the page weight in bytes (= content length of the URL plus the sum of its embedded images) |
void |
clearAuthentication()
|
URL[] |
crawlWeb(CrawlerSetting crawler,
int numberOfURLsToFind,
Logger logger)
searches the web from the embedded URL (used as root) for URLs based on the criteria given in the crawler; search is performed breadth-first |
static URL[] |
crawlWeb(List<Spider.URLWrapper> searchList,
List<URL> resultList,
List<URL> closedList,
CrawlerSetting crawler,
int depth,
int numberOfURLsToFind,
Logger logger)
usually called by crawlWeb(URL root, CrawlerSetting crawler, Logger) |
String |
getAuthenticationUser()
|
URL[] |
getBrokenLinks()
Assuming the URL points to an HTML page, returns only those links that are not accessible. |
byte[] |
getBytes()
retrieves the raw content from the embedded URL. |
String |
getCharset()
|
int |
getConnectTimeout()
|
String |
getContentAsString()
uses default encoding |
String |
getContentAsString(String charset)
retrieves the entire content accessible through the embedded URL as a String. |
String |
getContentAsUTF8()
|
int |
getContentLength()
retrieves the content length from a URLConnection |
String |
getDomainName()
|
String |
getFileName()
returns only the last portion of URL.getPath() after the last '/' |
String |
getFullHeaderAsString()
|
String |
getHeaderValue(String name)
|
HTMLDocument |
getHTMLDocument()
returns an HTMLDocument object with the parsed content of the embedded URL for further examination |
HTMLDocument |
getHTMLDocument(Reader reader)
returns an HTMLDocument object with the parsed content from the given reader for further examination |
URL[] |
getImages(boolean allowDuplicates)
returns an array of images that are contained in the embedded URL |
URL[] |
getImages(Reader reader,
boolean allowDuplicates)
allows reading the content from a location other than the URL itself |
InputStream |
getInputStream()
obtains the InputStream (basic authentication is applied if previously set) |
InputStream |
getInputStreamUsingBasicAuthorization(String user,
char[] password)
obtains the InputStream using basic authorization mechanism if user is not null; the given authentication only applies to this call |
URL[] |
getLinks(boolean allowDuplicates)
returns an array containing URLs that the embedded URL links to; if the page is a frameset, the frame sources are returned. |
URL[] |
getLinks(boolean allowDuplicates,
String... protocol)
returns links filtered by the given protocol |
URL[] |
getLinks(Reader reader,
boolean allowDuplicates)
allows reading the content from a location other than the URL itself |
String |
getParameter(String key)
|
HashMap<String,String> |
getParameterMap()
|
Reader |
getReader()
This function constructs a reader appropriate for reading the content from the embedded URL. |
Reader |
getReader(String charsetName)
|
URL[] |
getRSSFeeds()
returns links to RSS feeds in the document |
URL[] |
getRSSFeeds(Reader reader)
parses the reader for RSS feeds |
SSLHelper |
getSSLHelper()
|
URL[] |
getStylesheets()
returns links to stylesheets in the document |
URL[] |
getStylesheets(Reader reader)
parses the reader for links to stylesheets |
String |
getTagText(HTML.Tag desiredTag,
String delimiter)
returns all text found in the given desiredTag delimited by the given delimiter |
String |
getTagText(Reader reader,
HTML.Tag desiredTag,
String delimiter)
allows reading the content from a location other than the URL itself |
String |
getTitle()
returns the title of the document |
URL |
getURL()
returns the embedded URL |
static URL |
getURLFromLink(String link,
URL context)
translates a relative URL to an absolute URL |
boolean |
includesPattern(String[] searchPattern,
boolean includeHTMLCode)
searches the content of the embedded URL for the presence of one of the searchPatterns given; returns true if one of the patterns was found |
boolean |
isAccessible()
actually connects to the embedded URL while executing |
boolean |
isHtmlPage()
checks the content type of the opened URLConnection |
long |
ping()
returns the time it takes to establish a live connection to the embedded URL, or -1 if the URL is unreachable. |
void |
saveURLtoFile(File file)
saves the content of the embedded URL to the given file |
static List<URL> |
searchWebFor(String[] searchPattern,
ArrayList<URL> searchList,
boolean includeHTMLCode,
int level,
boolean currentSiteOnly,
List<URL> excludeList,
List<URL> resultList,
String[] searchURLExclusionPatterns,
Monitor monitor)
Deprecated. |
static URL[] |
searchWebFor(String[] searchPattern,
URL entryPoint,
boolean includeHTMLCode,
int level,
boolean currentSiteOnly,
String[] searchURLExclusionPatterns,
Monitor monitor)
Deprecated. |
void |
setBasicAuthentication(String user,
char[] password)
|
static void |
setBasicAuthorization(URLConnection con,
String user,
char[] password)
enables basic authorization mechanism on the given URLConnection |
void |
setConnectTimeout(int timeout)
|
void |
setSSLHelper(SSLHelper helper)
|
void |
setURL(URL url)
sets the embedded URL |
String |
stripText()
a line break is put after each separate text occurrence |
String |
stripText(Reader reader,
String delimiter)
allows reading the content from a location other than the URL itself |
String |
stripText(String delimiter)
returns a String containing the text of all HTML tag types from the embedded URL |
File |
toFile()
|
String |
whois()
returns the registrant information from the Internic database; the embedded URL must use the host name and not the IP address |
static String |
whois(String domainName)
returns the registrant information from the Internic database |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Constructor Detail |
|---|
public Spider()
public Spider(String urlString)
throws MalformedURLException
MalformedURLException
public Spider(URL url)
public Spider(URL url,
String user,
char[] password)
| Method Detail |
|---|
public URL getURL()
public void setURL(URL url)
public SSLHelper getSSLHelper()
public void setSSLHelper(SSLHelper helper)
public int getConnectTimeout()
public void setConnectTimeout(int timeout)
public String getParameter(String key)
public HashMap<String,String> getParameterMap()
public File toFile()
public String getDomainName()
public String getFileName()
returns only the last portion of URL.getPath() after the last '/'
public String getHeaderValue(String name)
throws IOException
IOException
public String getFullHeaderAsString()
throws IOException
IOException
public String getCharset()
throws IOException
IOException
public long ping()
public void saveURLtoFile(File file)
throws IOException
IOException
public URL[] getLinks(boolean allowDuplicates,
String... protocol)
throws IOException
IOException
See Also: getLinks(boolean)
public URL[] getLinks(boolean allowDuplicates)
throws IOException
IOException
public URL[] getLinks(Reader reader,
boolean allowDuplicates)
throws IOException
IOException
See Also: getLinks(boolean)
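How such link extraction can work is easy to sketch with the JDK's own HTML parser, which this class apparently relies on. Everything below is illustrative (the class name HrefCollector is not part of this API, and Spider's actual implementation may differ):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HrefCollector {
    // Collects the raw href attribute values from an HTML stream,
    // roughly what getLinks(Reader, boolean) presumably does before
    // resolving them against the embedded URL.
    public static List<String> collectHrefs(Reader reader) throws IOException {
        List<String> hrefs = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        hrefs.add(href.toString());
                    }
                }
            }
        };
        new ParserDelegator().parse(reader, callback, true);
        return hrefs;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><a href=\"a.html\">A</a><a href=\"b.html\">B</a></body></html>";
        System.out.println(collectHrefs(new StringReader(html)));
        // [a.html, b.html]
    }
}
```

Note that this explains the class comment's caveat: instantiating the javax.swing HTML machinery is what can leave non-daemon threads behind.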
public URL[] getBrokenLinks()
throws IOException
IOException
public boolean isAccessible()
public URL[] getImages(boolean allowDuplicates)
throws IOException
IOException
public URL[] getImages(Reader reader,
boolean allowDuplicates)
throws IOException
IOException
See Also: getImages(boolean)
public static URL getURLFromLink(String link,
URL context)
throws MalformedURLException
MalformedURLException
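Resolving a relative link against its context is exactly what the two-argument java.net.URL constructor provides, which is presumably what this method wraps. A minimal sketch (the helper name resolveLink is hypothetical):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class LinkResolver {
    // Resolves a (possibly relative) link against a context URL,
    // mirroring what getURLFromLink(String, URL) presumably does.
    public static URL resolveLink(String link, URL context) throws MalformedURLException {
        return new URL(context, link); // java.net.URL performs the relative resolution
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://example.com/dir/index.html");
        System.out.println(resolveLink("page2.html", base));
        // http://example.com/dir/page2.html
    }
}
```

An absolute link passed as the spec simply overrides the context, so the same call works for both cases.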
public URL[] getStylesheets()
throws IOException
IOException
public URL[] getStylesheets(Reader reader)
throws IOException
IOException
public URL[] getRSSFeeds()
throws IOException
IOException
public URL[] getRSSFeeds(Reader reader)
throws IOException
IOException
public boolean includesPattern(String[] searchPattern,
boolean includeHTMLCode)
throws IOException
Parameters:
searchPattern - array of search patterns this function will look for
includeHTMLCode - if true, this function will search through all content of the URL, including HTML code; if false, it will only search through the text found
Throws:
IOException
public String getTitle()
throws IOException
IOException
public String getTagText(HTML.Tag desiredTag,
String delimiter)
throws IOException
IOException
See Also: getTagText(Reader, HTML.Tag, String)
public String getTagText(Reader reader,
HTML.Tag desiredTag,
String delimiter)
throws IOException
IOException
See Also: getTagText(HTML.Tag, String)
public String stripText()
throws IOException
IOException
public String stripText(String delimiter)
throws IOException
IOException
public String stripText(Reader reader,
String delimiter)
throws IOException
IOException
See Also: stripText(String)
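The described behavior (each separate text run joined by the delimiter) can be sketched with the JDK's own HTML parser. The class below is illustrative, not part of this API:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.StringJoiner;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class TextStripper {
    // Joins each separate text run in the HTML with the delimiter,
    // roughly the behavior described for stripText(Reader, String).
    public static String stripText(Reader reader, String delimiter) throws IOException {
        StringJoiner joiner = new StringJoiner(delimiter);
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                joiner.add(new String(data)); // one call per separate text occurrence
            }
        };
        new ParserDelegator().parse(reader, callback, true);
        return joiner.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><p>Hello</p><p>World</p></body></html>";
        System.out.println(stripText(new StringReader(html), "\n"));
    }
}
```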
public HTMLDocument getHTMLDocument()
throws IOException
IOException
public HTMLDocument getHTMLDocument(Reader reader)
throws IOException
IOException
public Reader getReader()
throws IOException
IOException
UnsupportedOperationException - if the given URL is of another
protocol than HTTP or FILE
public Reader getReader(String charsetName)
throws IOException
IOException
See Also: getReader()
public void clearAuthentication()
public String getAuthenticationUser()
public void setBasicAuthentication(String user,
char[] password)
public static void setBasicAuthorization(URLConnection con,
String user,
char[] password)
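Presumably this sets the standard HTTP Basic Authorization header on the connection. A minimal stdlib sketch of building that header value (the class and method names here are illustrative, not this API's internals):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuth {
    // Builds the value of the "Authorization" header that
    // setBasicAuthorization presumably sets on the URLConnection:
    // "Basic " followed by Base64("user:password").
    public static String basicAuthValue(String user, char[] password) {
        String credentials = user + ":" + new String(password);
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        // Applying it to a connection would look like:
        // con.setRequestProperty("Authorization", basicAuthValue(user, password));
        System.out.println(basicAuthValue("user", "secret".toCharArray()));
        // Basic dXNlcjpzZWNyZXQ=
    }
}
```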
public InputStream getInputStream()
throws IOException
IOException
public InputStream getInputStreamUsingBasicAuthorization(String user,
char[] password)
throws IOException
IOException
public byte[] getBytes()
throws IOException
IOException
public String getContentAsString()
throws IOException
IOException
See Also: getContentAsString(String)
public String getContentAsUTF8()
throws IOException
IOException
public String getContentAsString(String charset)
throws IOException
IOException
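Decoding with an explicit charset presumably boils down to wrapping the URL's InputStream in an InputStreamReader. A minimal, self-contained sketch (names are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ContentReader {
    // Decodes an InputStream into a String with an explicit charset,
    // roughly what getContentAsString(String) presumably does with
    // the stream from the embedded URL.
    public static String readAll(InputStream in, String charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, Charset.forName(charset))) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(utf8), "UTF-8"));
    }
}
```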
public int getContentLength()
throws IOException
IOException
public boolean isHtmlPage()
throws IOException
IOException
public int calculatePageWeight()
throws IOException
IOException
public String whois()
throws IOException
IOException
public static String whois(String domainName)
throws IOException
IOException
public URL[] crawlWeb(CrawlerSetting crawler,
int numberOfURLsToFind,
Logger logger)
Parameters:
crawler - criteria for crawling
numberOfURLsToFind - if >0, the search is stopped when the given number of URLs matching the crawler's criteria are found
logger - to log IOExceptions occurring while processing links
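The breadth-first contract described here (FIFO search list, closed list against revisits, cutoff once numberOfURLsToFind matches are collected) can be sketched independently of the Spider internals. Everything below is illustrative: the in-memory link graph and the Predicate stand in for real page fetches and the CrawlerSetting criteria.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

public class BreadthFirstCrawl {
    // Breadth-first search over a link graph: the search list is a FIFO
    // queue, the closed list prevents revisits, and the result list is
    // cut off once numberOfURLsToFind matches are collected (<= 0 means
    // no limit), mirroring the crawlWeb contract described above.
    public static List<String> crawl(Map<String, List<String>> links,
                                     String root,
                                     Predicate<String> accepts,
                                     int numberOfURLsToFind) {
        Deque<String> searchList = new ArrayDeque<>();
        Set<String> closedList = new HashSet<>();
        List<String> resultList = new ArrayList<>();
        searchList.add(root);
        closedList.add(root);
        while (!searchList.isEmpty()) {
            String url = searchList.poll();
            if (accepts.test(url)) {
                resultList.add(url);
                if (numberOfURLsToFind > 0 && resultList.size() >= numberOfURLsToFind) {
                    break;
                }
            }
            for (String next : links.getOrDefault(url, List.of())) {
                if (closedList.add(next)) {
                    searchList.add(next);
                }
            }
        }
        return resultList;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "root", List.of("a", "b"),
                "a", List.of("a1"),
                "b", List.of("b1"));
        // Accept only the depth-2 pages; BFS visits both depth-1 pages first.
        System.out.println(crawl(links, "root", u -> u.length() == 2, 0));
        // [a1, b1]
    }
}
```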
public static URL[] crawlWeb(List<Spider.URLWrapper> searchList,
List<URL> resultList,
List<URL> closedList,
CrawlerSetting crawler,
int depth,
int numberOfURLsToFind,
Logger logger)
Parameters:
searchList - List of Spider.URLWrapper objects containing nodes to be examined
resultList - List of URL objects
closedList - List of URL objects
crawler - criteria for crawling
depth - link distance from the root of the search
numberOfURLsToFind - if >0, the search is stopped when the given number of URLs matching the crawler's criteria are found
logger - to log IOExceptions occurring while processing links
See Also: crawlWeb(CrawlerSetting, int, Logger)
@Deprecated
public static URL[] searchWebFor(String[] searchPattern,
URL entryPoint,
boolean includeHTMLCode,
int level,
boolean currentSiteOnly,
String[] searchURLExclusionPatterns,
Monitor monitor)
Parameters:
searchPattern - an array containing String patterns to search for; wildcards are not supported
entryPoint - the URL from where to start the search
includeHTMLCode - if true, the search will include not only the text, but also the HTML code of a page
level - limits the depth of the search; only pages reachable with at most the given number of recursive links will be included
currentSiteOnly - if true, the search is limited to the host of the entryPoint
searchURLExclusionPatterns - if not null, an array of String patterns used to filter out unwanted URLs, i.e. if any of the patterns are present in the URL's path, that URL is disregarded; wildcards are not supported
monitor - see above for usage; may be null
See Also: crawlWeb(CrawlerSetting, int, Logger)
@Deprecated
public static List<URL> searchWebFor(String[] searchPattern,
ArrayList<URL> searchList,
boolean includeHTMLCode,
int level,
boolean currentSiteOnly,
List<URL> excludeList,
List<URL> resultList,
String[] searchURLExclusionPatterns,
Monitor monitor)
See Also: searchWebFor(String[], URL, boolean, int, boolean, String[], Monitor)