Holger's
Java API

com.antelmann.net
Class Spider

java.lang.Object
  extended by com.antelmann.net.Spider

public class Spider
extends Object

Spider provides several useful methods for accessing web content and parsing HTML, most of them based on a simple URL. Note that because this class uses functionality from the javax.swing package (although no GUIs are used in this class), non-terminating javax.swing threads are created when this class is used. An application that uses this class without any other javax.swing GUI components may therefore end up with unwanted non-terminated threads, possibly forcing a call to e.g. System.exit(0) to terminate a simple program. Most methods are synchronized, so don't expect longer-running methods (such as getting links from a URL) to run simultaneously on the same Spider object.

Author:
Holger Antelmann
See Also:
CrawlerSetting, URLCache

Nested Class Summary
static class Spider.SMonitor
          Deprecated.  
static class Spider.URLWrapper
          wraps a java.net.URL and keeps a reference to its referrer
 
Constructor Summary
Spider()
          convenience constructor that initializes the Spider with a null URL
Spider(String urlString)
           
Spider(URL url)
          constructs a Spider object based on the given URL
Spider(URL url, String user, char[] password)
           
 
Method Summary
 int calculatePageWeight()
          returns the page weight in bytes (= content length of the URL plus the sum of its embedded images)
 void clearAuthentication()
           
 URL[] crawlWeb(CrawlerSetting crawler, int numberOfURLsToFind, Logger logger)
          searches the web from the embedded URL (used as root) for URLs based on the criteria given in the crawler; search is performed breadth-first
static URL[] crawlWeb(List<Spider.URLWrapper> searchList, List<URL> resultList, List<URL> closedList, CrawlerSetting crawler, int depth, int numberOfURLsToFind, Logger logger)
          usually called by crawlWeb(CrawlerSetting, int, Logger)
 String getAuthenticationUser()
           
 URL[] getBrokenLinks()
          Assuming the URL points to an HTML page, only links that are not accessible are returned.
 byte[] getBytes()
          retrieves the raw content from the embedded URL.
 String getCharset()
           
 int getConnectTimeout()
           
 String getContentAsString()
          uses default encoding
 String getContentAsString(String charset)
          retrieves the entire content accessible through the embedded URL as a String.
 String getContentAsUTF8()
           
 int getContentLength()
          retrieves the content length from a URLConnection
 String getDomainName()
           
 String getFileName()
          returns only the last portion of URL.getPath() after the last '/'
 String getFullHeaderAsString()
           
 String getHeaderValue(String name)
           
 HTMLDocument getHTMLDocument()
          returns an HTMLDocument object with the parsed content of the embedded URL for further examination
 HTMLDocument getHTMLDocument(Reader reader)
          returns an HTMLDocument object with the parsed content from the given reader for further examination
 URL[] getImages(boolean allowDuplicates)
          returns an array of images that are contained in the embedded URL
 URL[] getImages(Reader reader, boolean allowDuplicates)
          allows the content to be read from a source other than the embedded URL itself
 InputStream getInputStream()
          obtains the InputStream (basic authentication is applied if previously set)
 InputStream getInputStreamUsingBasicAuthorization(String user, char[] password)
          obtains the InputStream using the basic authorization mechanism if user is not null; the given authentication applies only to this call
 URL[] getLinks(boolean allowDuplicates)
          returns an array containing URLs that the embedded URL links to; if the page is a frameset, the frame sources are returned.
 URL[] getLinks(boolean allowDuplicates, String... protocol)
          returns links filtered by the given protocol
 URL[] getLinks(Reader reader, boolean allowDuplicates)
          allows the content to be read from a source other than the embedded URL itself
 String getParameter(String key)
           
 HashMap<String,String> getParameterMap()
           
 Reader getReader()
          This function constructs a reader appropriate for reading the content from the embedded URL.
 Reader getReader(String charsetName)
           
 URL[] getRSSFeeds()
          returns links to RSS feeds in the document
 URL[] getRSSFeeds(Reader reader)
          parses the reader for RSS feeds
 SSLHelper getSSLHelper()
           
 URL[] getStylesheets()
          returns links to stylesheets in the document
 URL[] getStylesheets(Reader reader)
          parses the reader for links to stylesheets
 String getTagText(HTML.Tag desiredTag, String delimiter)
          returns all text found in the given desiredTag delimited by the given delimiter
 String getTagText(Reader reader, HTML.Tag desiredTag, String delimiter)
          allows the content to be read from a source other than the embedded URL itself
 String getTitle()
          returns the title of the document
 URL getURL()
          returns the embedded URL
static URL getURLFromLink(String link, URL context)
          translates a relative URL to an absolute URL
 boolean includesPattern(String[] searchPattern, boolean includeHTMLCode)
          searches the content of the embedded URL for the presence of one of the searchPatterns given; returns true if one of the patterns was found
 boolean isAccessible()
          actually connects to the embedded URL while executing
 boolean isHtmlPage()
          checks the content type of the opened URLConnection
 long ping()
          returns the time it takes to establish a live connection to the embedded URL; returns -1 only if the URL is unreachable
 void saveURLtoFile(File file)
          saves the content of the embedded URL to the given file
static List<URL> searchWebFor(String[] searchPattern, ArrayList<URL> searchList, boolean includeHTMLCode, int level, boolean currentSiteOnly, List<URL> excludeList, List<URL> resultList, String[] searchURLExclusionPatterns, Monitor monitor)
          Deprecated.  
static URL[] searchWebFor(String[] searchPattern, URL entryPoint, boolean includeHTMLCode, int level, boolean currentSiteOnly, String[] searchURLExclusionPatterns, Monitor monitor)
          Deprecated.  
 void setBasicAuthentication(String user, char[] password)
           
static void setBasicAuthorization(URLConnection con, String user, char[] password)
          enables the basic authorization mechanism on the given URLConnection
 void setConnectTimeout(int timeout)
           
 void setSSLHelper(SSLHelper helper)
           
 void setURL(URL url)
          sets the embedded URL
 String stripText()
          a line break is put after each separate text occurrence
 String stripText(Reader reader, String delimiter)
          allows the content to be read from a source other than the embedded URL itself
 String stripText(String delimiter)
          returns a String containing the text of all HTML tag types from the embedded URL
 File toFile()
           
 String whois()
          returns the registrant information from the Internic database; the embedded URL must use the host name and not the IP address
static String whois(String domainName)
          returns the registrant information from the Internic database
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Spider

public Spider()
convenience constructor that initializes the Spider with a null URL


Spider

public Spider(String urlString)
       throws MalformedURLException
Throws:
MalformedURLException

Spider

public Spider(URL url)
constructs a Spider object based on the given URL


Spider

public Spider(URL url,
              String user,
              char[] password)
Method Detail

getURL

public URL getURL()
returns the embedded URL


setURL

public void setURL(URL url)
sets the embedded URL


getSSLHelper

public SSLHelper getSSLHelper()

setSSLHelper

public void setSSLHelper(SSLHelper helper)

getConnectTimeout

public int getConnectTimeout()

setConnectTimeout

public void setConnectTimeout(int timeout)

getParameter

public String getParameter(String key)

getParameterMap

public HashMap<String,String> getParameterMap()

toFile

public File toFile()

getDomainName

public String getDomainName()

getFileName

public String getFileName()
returns only the last portion of URL.getPath() after the last '/'


getHeaderValue

public String getHeaderValue(String name)
                      throws IOException
Throws:
IOException

getFullHeaderAsString

public String getFullHeaderAsString()
                             throws IOException
Throws:
IOException

getCharset

public String getCharset()
                  throws IOException
Throws:
IOException

ping

public long ping()
returns the time it takes to establish a live connection to the embedded URL; returns -1 only if the URL is unreachable


saveURLtoFile

public void saveURLtoFile(File file)
                   throws IOException
saves the content of the embedded URL to the given file

Throws:
IOException

getLinks

public URL[] getLinks(boolean allowDuplicates,
                      String... protocol)
               throws IOException
returns links filtered by the given protocol

Throws:
IOException
See Also:
getLinks(boolean)

getLinks

public URL[] getLinks(boolean allowDuplicates)
               throws IOException
returns an array containing URLs that the embedded URL links to; if the page is a frameset, the frame sources are returned. If no links are present within the given URL, an empty array is returned

Throws:
IOException
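The link extraction presumably relies on the javax.swing HTML parser (which is also the source of the non-terminating Swing threads mentioned in the class description). A minimal sketch of that mechanism — class and method names here are illustrative, not the actual implementation:

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class LinkExtract {
    // Collects the HREF attribute of every <a> tag the parser reports;
    // Spider.getLinks presumably also resolves these against the page URL
    // and optionally filters duplicates.
    static List<String> hrefs(Reader reader) throws IOException {
        List<String> out = new ArrayList<>();
        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.A) {
                    Object href = a.getAttribute(HTML.Attribute.HREF);
                    if (href != null) out.add(href.toString());
                }
            }
        };
        new ParserDelegator().parse(reader, cb, true);
        return out;
    }
}
```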

getLinks

public URL[] getLinks(Reader reader,
                      boolean allowDuplicates)
               throws IOException
allows the content to be read from a source other than the embedded URL itself

Throws:
IOException
See Also:
getLinks(boolean)

getBrokenLinks

public URL[] getBrokenLinks()
                     throws IOException
Assuming the URL points to an HTML page, only links that are not accessible are returned. If all links are valid (or the page didn't contain links), an empty array is returned. Only links with 'http', 'ftp' or 'file' protocol are checked.

Throws:
IOException

isAccessible

public boolean isAccessible()
actually connects to the embedded URL while executing


getImages

public URL[] getImages(boolean allowDuplicates)
                throws IOException
returns an array of images that are contained in the embedded URL

Throws:
IOException

getImages

public URL[] getImages(Reader reader,
                       boolean allowDuplicates)
                throws IOException
allows the content to be read from a source other than the embedded URL itself

Throws:
IOException
See Also:
getImages(boolean)

getURLFromLink

public static URL getURLFromLink(String link,
                                 URL context)
                          throws MalformedURLException
translates a relative URL to an absolute URL

Throws:
MalformedURLException
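Relative-to-absolute resolution of this kind is exactly what java.net.URL's context constructor provides; getURLFromLink presumably wraps it (a sketch under that assumption, not the actual implementation):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class LinkResolve {
    // Resolve a (possibly relative) link against the context URL,
    // following the standard resolution rules built into java.net.URL.
    static URL fromLink(String link, URL context) throws MalformedURLException {
        return new URL(context, link);
    }
}
```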

getStylesheets

public URL[] getStylesheets()
                     throws IOException
returns links to stylesheets in the document

Throws:
IOException

getStylesheets

public URL[] getStylesheets(Reader reader)
                     throws IOException
parses the reader for links to stylesheets

Throws:
IOException

getRSSFeeds

public URL[] getRSSFeeds()
                  throws IOException
returns links to RSS feeds in the document

Throws:
IOException

getRSSFeeds

public URL[] getRSSFeeds(Reader reader)
                  throws IOException
parses the reader for RSS feeds

Throws:
IOException

includesPattern

public boolean includesPattern(String[] searchPattern,
                               boolean includeHTMLCode)
                        throws IOException
searches the content of the embedded URL for the presence of one of the searchPatterns given; returns true if one of the patterns was found

Parameters:
searchPattern - array of search patterns this function will look for
includeHTMLCode - if true, this function will search through all content of the URL, including HTML code; if false, it will only search through text found
Throws:
IOException
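The contract amounts to a plain pattern test over the retrieved content; assuming simple substring matching (no wildcards are mentioned), the core logic detached from the URL retrieval might look like this:

```java
public class PatternSearch {
    // True if any of the given plain-text patterns occurs in the content;
    // Spider additionally limits the search to visible text (rather than
    // the full HTML code) when includeHTMLCode is false.
    static boolean includesPattern(String content, String[] searchPattern) {
        for (String pattern : searchPattern) {
            if (content.contains(pattern)) return true;
        }
        return false;
    }
}
```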

getTitle

public String getTitle()
                throws IOException
returns the title of the document

Throws:
IOException

getTagText

public String getTagText(HTML.Tag desiredTag,
                         String delimiter)
                  throws IOException
returns all text found in the given desiredTag delimited by the given delimiter

Throws:
IOException
See Also:
getTagText(Reader, HTML.Tag, String)

getTagText

public String getTagText(Reader reader,
                         HTML.Tag desiredTag,
                         String delimiter)
                  throws IOException
allows the content to be read from a source other than the embedded URL itself

Throws:
IOException
See Also:
getTagText(HTML.Tag, String)

stripText

public String stripText()
                 throws IOException
a line break is put after each separate text occurrence

Throws:
IOException

stripText

public String stripText(String delimiter)
                 throws IOException
returns a String containing the text of all HTML tag types from the embedded URL

Throws:
IOException
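A plausible shape for the text stripping, again via the javax.swing HTML parser: every text run the parser reports is appended, separated by the given delimiter (an illustrative sketch, not the actual implementation):

```java
import java.io.IOException;
import java.io.Reader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class TextStrip {
    // Joins every text run found by the HTML parser with the delimiter;
    // the no-argument stripText() uses a line break as the delimiter.
    static String strip(Reader reader, String delimiter) throws IOException {
        StringBuilder sb = new StringBuilder();
        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (sb.length() > 0) sb.append(delimiter);
                sb.append(data);
            }
        };
        new ParserDelegator().parse(reader, cb, true);
        return sb.toString();
    }
}
```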

stripText

public String stripText(Reader reader,
                        String delimiter)
                 throws IOException
allows the content to be read from a source other than the embedded URL itself

Throws:
IOException
See Also:
stripText(String)

getHTMLDocument

public HTMLDocument getHTMLDocument()
                             throws IOException
returns an HTMLDocument object with the parsed content of the embedded URL for further examination

Throws:
IOException

getHTMLDocument

public HTMLDocument getHTMLDocument(Reader reader)
                             throws IOException
returns an HTMLDocument object with the parsed content from the given reader for further examination

Throws:
IOException

getReader

public Reader getReader()
                 throws IOException
This function constructs a reader appropriate for reading the content from the embedded URL. Currently, this function only supports HTTP, FTP and FILE protocol.

Throws:
IOException
UnsupportedOperationException - if the embedded URL uses a protocol other than HTTP, FTP, or FILE

getReader

public Reader getReader(String charsetName)
                 throws IOException
Throws:
IOException
See Also:
getReader()

clearAuthentication

public void clearAuthentication()

getAuthenticationUser

public String getAuthenticationUser()

setBasicAuthentication

public void setBasicAuthentication(String user,
                                   char[] password)

setBasicAuthorization

public static void setBasicAuthorization(URLConnection con,
                                         String user,
                                         char[] password)
enables the basic authorization mechanism on the given URLConnection
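Basic authorization on a URLConnection amounts to setting the Authorization request header with the Base64-encoded "user:password" pair (RFC 7617); a sketch of that mechanism — the actual method may differ in details such as the charset used:

```java
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuth {
    // Encode "user:password" in Base64 and prefix it with "Basic ".
    static String headerValue(String user, char[] password) {
        String pair = user + ":" + new String(password);
        return "Basic " + Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.ISO_8859_1));
    }

    // Apply the header to a not-yet-connected URLConnection.
    static void setBasicAuthorization(URLConnection con, String user, char[] password) {
        con.setRequestProperty("Authorization", headerValue(user, password));
    }
}
```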


getInputStream

public InputStream getInputStream()
                           throws IOException
obtains the InputStream (basic authentication is applied if previously set)

Throws:
IOException

getInputStreamUsingBasicAuthorization

public InputStream getInputStreamUsingBasicAuthorization(String user,
                                                         char[] password)
                                                  throws IOException
obtains the InputStream using the basic authorization mechanism if user is not null; the given authentication applies only to this call

Throws:
IOException

getBytes

public byte[] getBytes()
                throws IOException
retrieves the raw content from the embedded URL.

Throws:
IOException

getContentAsString

public String getContentAsString()
                          throws IOException
uses default encoding

Throws:
IOException
See Also:
getContentAsString(String)

getContentAsUTF8

public String getContentAsUTF8()
                        throws IOException
Throws:
IOException

getContentAsString

public String getContentAsString(String charset)
                          throws IOException
retrieves the entire content accessible through the embedded URL as a String. If the URL points to an HTML page, the full HTML code is returned. This method is not suitable for retrieving binary data as it uses a BufferedReader and also places platform specific line breaks between the lines read with readLine(). If the URL could not be accessed and an IOException was caught, null is returned.

Throws:
IOException
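The described behavior — line-by-line reading with platform-specific separators inserted, hence unsuitable for binary data — can be sketched over an arbitrary InputStream:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class ContentRead {
    // Read the stream line by line and re-join with the platform line
    // separator -- which is why the original line endings (and any binary
    // content) are not preserved.
    static String contentAsString(InputStream in, String charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line).append(System.lineSeparator());
        }
        return sb.toString();
    }
}
```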

getContentLength

public int getContentLength()
                     throws IOException
retrieves the content length from a URLConnection

Throws:
IOException

isHtmlPage

public boolean isHtmlPage()
                   throws IOException
checks the content type of the opened URLConnection

Throws:
IOException

calculatePageWeight

public int calculatePageWeight()
                        throws IOException
returns the page weight in bytes (= content length of the URL plus the sum of its embedded images)

Throws:
IOException

whois

public String whois()
             throws IOException
returns the registrant information from the Internic database; the embedded URL must use the host name and not the IP address

Throws:
IOException

whois

public static String whois(String domainName)
                    throws IOException
returns the registrant information from the Internic database

Throws:
IOException

crawlWeb

public URL[] crawlWeb(CrawlerSetting crawler,
                      int numberOfURLsToFind,
                      Logger logger)
searches the web from the embedded URL (used as root) for URLs based on the criteria given in the crawler; search is performed breadth-first

Parameters:
crawler - criteria for crawling
numberOfURLsToFind - if >0 the search is stopped when the given number of URLs are found to match the crawler's criteria
logger - to log IOExceptions occurring while processing links
Returns:
an array containing URLs found that satisfy the crawler's criteria as defined by the crawler
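The breadth-first traversal can be sketched generically; linksOf and accept below are hypothetical stand-ins for Spider.getLinks and the CrawlerSetting criteria, not the real API:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;

public class Crawl {
    // Breadth-first search from root: visit nodes level by level, keep a
    // closed set to avoid revisiting, and stop early once numberOfURLsToFind
    // matches have been collected (a value <= 0 means no limit).
    static <T> List<T> crawlWeb(T root, Function<T, List<T>> linksOf,
                                Predicate<T> accept, int numberOfURLsToFind) {
        List<T> results = new ArrayList<>();
        Set<T> closed = new HashSet<>();
        Deque<T> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            T node = queue.poll();
            if (!closed.add(node)) continue; // already examined
            if (accept.test(node)) {
                results.add(node);
                if (numberOfURLsToFind > 0 && results.size() >= numberOfURLsToFind) break;
            }
            queue.addAll(linksOf.apply(node));
        }
        return results;
    }
}
```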

crawlWeb

public static URL[] crawlWeb(List<Spider.URLWrapper> searchList,
                             List<URL> resultList,
                             List<URL> closedList,
                             CrawlerSetting crawler,
                             int depth,
                             int numberOfURLsToFind,
                             Logger logger)
usually called by crawlWeb(CrawlerSetting, int, Logger)

Parameters:
searchList - List of Spider.URLWrapper objects containing nodes to be examined
resultList - List of URL objects
closedList - List of URL objects
crawler - criteria for crawling
depth - link distance from the root of the search
numberOfURLsToFind - if >0 the search is stopped when the given number of URLs are found to match the crawler's criteria
logger - to log IOExceptions occurring while processing links
Returns:
an array containing URLs found that satisfy the criteria as defined by the crawler
See Also:
crawlWeb(CrawlerSetting, int, Logger)

searchWebFor

@Deprecated
public static URL[] searchWebFor(String[] searchPattern,
                                            URL entryPoint,
                                            boolean includeHTMLCode,
                                            int level,
                                            boolean currentSiteOnly,
                                            String[] searchURLExclusionPatterns,
                                            Monitor monitor)
Deprecated. 

This special web search function returns all URLs found that contain one of the desired search patterns, given the constraints of the other parameters. The search starts at the entryPoint and recursively follows the tree of links derived from that URL, as deep as suggested by the level parameter; the search is conducted in a breadth-first manner. For more flexible web searches, consider the use of a com.antelmann.net.CrawlerSetting.
If a monitor is present, it is used to provide feedback while the function is executing; monitor may be null, in which case no feedback is provided.

Parameters:
searchPattern - an array containing String patterns to search for; wildcards are not supported
entryPoint - the URL from where to start the search
includeHTMLCode - if true, the search will include not only the text, but also the HTML code of a page
level - limits the depth of the search; only pages that are reachable through no more than the given number of recursive links will be included
currentSiteOnly - if true, the search is limited to the host of the entryPoint
searchURLExclusionPatterns - if not null it contains an array of String patterns which will be used to filter out unwanted URLs, i.e. if any of the patterns are present in the URL's path, that URL will be disregarded; wildcards are not supported
monitor - see above for usage; may be null
See Also:
crawlWeb(CrawlerSetting, int, Logger)

searchWebFor

@Deprecated
public static List<URL> searchWebFor(String[] searchPattern,
                                                ArrayList<URL> searchList,
                                                boolean includeHTMLCode,
                                                int level,
                                                boolean currentSiteOnly,
                                                List<URL> excludeList,
                                                List<URL> resultList,
                                                String[] searchURLExclusionPatterns,
                                                Monitor monitor)
Deprecated. 

usually called by the other searchWebFor() function; all Lists contain URL objects

See Also:
searchWebFor(String[], URL, boolean, int, boolean, String[], Monitor)


(c) Holger Antelmann since 2001- all rights reserved (contact: info@antelmann.com)
see www.antelmann.com/developer for further details and available downloads