Holger's
Java API

com.antelmann.net
Class Spider

java.lang.Object
  extended by com.antelmann.net.Spider

public class Spider
extends Object

Spider provides several useful methods for accessing web content and parsing HTML, most of them based on a simple URL. Note that because this class uses functionality from the javax.swing package (although no GUIs are used in this class), non-terminating javax.swing threads are created when this class is used. An application that uses this class without any other javax.swing GUI components may therefore end up with unwanted non-terminated threads, possibly forcing a call to e.g. System.exit(0) to terminate a simple program. Most methods are synchronized, so don't expect longer-running methods (such as getting links from a URL) to run simultaneously on the same Spider object.

Author:
Holger Antelmann
See Also:
CrawlerSetting, URLCache

Nested Class Summary
static class Spider.SMonitor
          Deprecated.  
static class Spider.URLWrapper
          wraps a java.net.URL and keeps a reference to its referrer
 
Constructor Summary
Spider()
          convenience constructor that initializes the Spider with a null URL
Spider(String urlString)
           
Spider(URL url)
          constructs a Spider object based on the given URL
Spider(URL url, String user, char[] password)
           
 
Method Summary
 int calculatePageWeight()
          returns the page weight in bytes (= content length of the URL plus the sum of its embedded images)
 void clearAuthentication()
           
 URL[] crawlWeb(CrawlerSetting crawler, int numberOfURLsToFind, Logger logger)
          searches the web from the embedded URL (used as root) for URLs based on the criteria given in the crawler; search is performed breadth-first
static URL[] crawlWeb(List<Spider.URLWrapper> searchList, List<URL> resultList, List<URL> closedList, CrawlerSetting crawler, int depth, int numberOfURLsToFind, Logger logger)
          usually called by crawlWeb(CrawlerSetting, int, Logger)
 String getAuthenticationUser()
           
 URL[] getBrokenLinks()
          Assuming the URL points to an HTML page, only links that are not accessible are returned.
 byte[] getBytes()
          retrieves the raw content from the embedded URL.
 String getCharset()
           
 int getConnectTimeout()
           
 String getContentAsString()
          uses default encoding
 String getContentAsString(String charset)
          retrieves the entire content accessible through the embedded URL as a String.
 String getContentAsUTF8()
           
 int getContentLength()
          retrieves the content length from a URLConnection
 String getDomainName()
           
 String getFileName()
          returns only the last portion of URL.getPath() after the last '/'
 String getFullHeaderAsString()
           
 String getHeaderValue(String name)
           
 HTMLDocument getHTMLDocument()
          returns an HTMLDocument object with the parsed content of the embedded URL for further examination
 HTMLDocument getHTMLDocument(Reader reader)
          returns an HTMLDocument object with the parsed content from the given reader for further examination
 URL[] getImages(boolean allowDuplicates)
          returns an array of images that are contained in the embedded URL
 URL[] getImages(Reader reader, boolean allowDuplicates)
          allows the content to be read from a source other than the embedded URL itself
 InputStream getInputStream()
          obtains the InputStream (basic authentication is applied if previously set)
 InputStream getInputStreamUsingBasicAuthorization(String user, char[] password)
          obtains the InputStream using the basic authorization mechanism if user is not null; the given authentication applies only to this call
 URL[] getLinks(boolean allowDuplicates)
          returns an array containing URLs that the embedded URL links to; if the page is a frameset, the frame sources are returned.
 URL[] getLinks(boolean allowDuplicates, String... protocol)
          returns links filtered by the given protocol
 URL[] getLinks(Reader reader, boolean allowDuplicates)
          allows the content to be read from a source other than the embedded URL itself
 String getParameter(String key)
           
 HashMap<String,String> getParameterMap()
           
 Reader getReader()
          This function constructs a reader appropriate for reading the content from the embedded URL.
 Reader getReader(String charsetName)
           
 URL[] getRSSFeeds()
          returns links to RSS feeds in the document
 URL[] getRSSFeeds(Reader reader)
          parses the reader for RSS feeds
 SSLHelper getSSLHelper()
           
 URL[] getStylesheets()
          returns links to stylesheets in the document
 URL[] getStylesheets(Reader reader)
          parses the reader for links to stylesheets
 String getTagText(HTML.Tag desiredTag, String delimiter)
          returns all text found in the given desiredTag delimited by the given delimiter
 String getTagText(Reader reader, HTML.Tag desiredTag, String delimiter)
          allows the content to be read from a source other than the embedded URL itself
 String getTitle()
          returns the title of the document
 URL getURL()
          returns the embedded URL
static URL getURLFromLink(String link, URL context)
          translates a relative URL to an absolute URL
 boolean includesPattern(String[] searchPattern, boolean includeHTMLCode)
          searches the content of the embedded URL for the presence of one of the searchPatterns given; returns true if one of the patterns was found
 boolean isAccessible()
          actually connects to the embedded URL while executing
 boolean isHtmlPage()
          checks the content type of the opened URLConnection
 long ping()
          returns the time it takes to establish a live connection to the embedded URL; returns -1 only if the URL is unreachable
 void saveURLtoFile(File file)
          saves the content of the embedded URL to the given file
static List<URL> searchWebFor(String[] searchPattern, ArrayList<URL> searchList, boolean includeHTMLCode, int level, boolean currentSiteOnly, List<URL> excludeList, List<URL> resultList, String[] searchURLExclusionPatterns, Monitor monitor)
          Deprecated.  
static URL[] searchWebFor(String[] searchPattern, URL entryPoint, boolean includeHTMLCode, int level, boolean currentSiteOnly, String[] searchURLExclusionPatterns, Monitor monitor)
          Deprecated.  
 void setBasicAuthentication(String user, char[] password)
           
static void setBasicAuthorization(URLConnection con, String user, char[] password)
          enables the basic authorization mechanism on the given URLConnection
 void setConnectTimeout(int timeout)
           
 void setSSLHelper(SSLHelper helper)
           
 void setURL(URL url)
          sets the embedded URL
 String stripText()
          a line break is put after each separate text occurrence
 String stripText(Reader reader, String delimiter)
          allows the content to be read from a source other than the embedded URL itself
 String stripText(String delimiter)
          returns a String containing the text of all HTML tag types from the embedded URL
 File toFile()
           
 String whois()
          returns the registrant information from the Internic database; the embedded URL must use the host name and not the IP address
static String whois(String domainName)
          returns the registrant information from the Internic database
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Spider

public Spider()
convenience constructor that initializes the Spider with a null URL


Spider

public Spider(String urlString)
       throws MalformedURLException
Throws:
MalformedURLException

Spider

public Spider(URL url)
constructs a Spider object based on the given URL


Spider

public Spider(URL url,
              String user,
              char[] password)
Method Detail

getURL

public URL getURL()
returns the embedded URL


setURL

public void setURL(URL url)
sets the embedded URL


getSSLHelper

public SSLHelper getSSLHelper()

setSSLHelper

public void setSSLHelper(SSLHelper helper)

getConnectTimeout

public int getConnectTimeout()

setConnectTimeout

public void setConnectTimeout(int timeout)

getParameter

public String getParameter(String key)

getParameterMap

public HashMap<String,String> getParameterMap()

toFile

public File toFile()

getDomainName

public String getDomainName()

getFileName

public String getFileName()
returns only the last portion of URL.getPath() after the last '/'


getHeaderValue

public String getHeaderValue(String name)
                      throws IOException
Throws:
IOException

getFullHeaderAsString

public String getFullHeaderAsString()
                             throws IOException
Throws:
IOException

getCharset

public String getCharset()
                  throws IOException
Throws:
IOException

ping

public long ping()
returns the time it takes to establish a live connection to the embedded URL; returns -1 only if the URL is unreachable


saveURLtoFile

public void saveURLtoFile(File file)
                   throws IOException
saves the content of the embedded URL to the given file

Throws:
IOException

getLinks

public URL[] getLinks(boolean allowDuplicates,
                      String... protocol)
               throws IOException
returns links filtered by the given protocol

Throws:
IOException
See Also:
getLinks(boolean)

getLinks

public URL[] getLinks(boolean allowDuplicates)
               throws IOException
returns an array containing URLs that the embedded URL links to; if the page is a frameset, the frame sources are returned. If no links are present within the given URL, an empty array is returned

Throws:
IOException
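The link extraction presumably relies on the javax.swing HTML parser (which is also the source of the non-terminating Swing threads mentioned in the class description). A minimal sketch of that mechanism — class and method names here are illustrative, not the actual implementation:

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class LinkExtract {
    // Collects the HREF attribute of every <a> tag the parser reports;
    // Spider.getLinks presumably also resolves these against the page URL
    // and optionally filters duplicates.
    static List<String> hrefs(Reader reader) throws IOException {
        List<String> out = new ArrayList<>();
        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.A) {
                    Object href = a.getAttribute(HTML.Attribute.HREF);
                    if (href != null) out.add(href.toString());
                }
            }
        };
        new ParserDelegator().parse(reader, cb, true);
        return out;
    }
}
```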

getLinks

public URL[] getLinks(Reader reader,
                      boolean allowDuplicates)
               throws IOException
allows the content to be read from a source other than the embedded URL itself

Throws:
IOException
See Also:
getLinks(boolean)

getBrokenLinks

public URL[] getBrokenLinks()
                     throws IOException
Assuming the URL points to an HTML page, only links that are not accessible are returned. If all links are valid (or the page didn't contain links), an empty array is returned. Only links with 'http', 'ftp' or 'file' protocol are checked.

Throws:
IOException

isAccessible

public boolean isAccessible()
actually connects to the embedded URL while executing


getImages

public URL[] getImages(boolean allowDuplicates)
                throws IOException
returns an array of images that are contained in the embedded URL

Throws:
IOException

getImages

public URL[] getImages(Reader reader,
                       boolean allowDuplicates)
                throws IOException
allows the content to be read from a source other than the embedded URL itself

Throws:
IOException
See Also:
getImages(boolean)

getURLFromLink

public static URL getURLFromLink(String link,
                                 URL context)
                          throws MalformedURLException
translates a relative URL to an absolute URL

Throws:
MalformedURLException
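Relative-to-absolute resolution of this kind is exactly what java.net.URL's context constructor provides; getURLFromLink presumably wraps it (a sketch under that assumption, not the actual implementation):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class LinkResolve {
    // Resolve a (possibly relative) link against the context URL,
    // following the standard resolution rules built into java.net.URL.
    static URL fromLink(String link, URL context) throws MalformedURLException {
        return new URL(context, link);
    }
}
```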

getStylesheets

public URL[] getStylesheets()
                     throws IOException
returns links to stylesheets in the document

Throws:
IOException

getStylesheets

public URL[] getStylesheets(Reader reader)
                     throws IOException
parses the reader for links to stylesheets

Throws:
IOException

getRSSFeeds

public URL[] getRSSFeeds()
                  throws IOException
returns links to RSS feeds in the document

Throws:
IOException

getRSSFeeds

public URL[] getRSSFeeds(Reader reader)
                  throws IOException
parses the reader for RSS feeds

Throws:
IOException

includesPattern

public boolean includesPattern(String[] searchPattern,
                               boolean includeHTMLCode)
                        throws IOException
searches the content of the embedded URL for the presence of one of the searchPatterns given; returns true if one of the patterns was found

Parameters:
searchPattern - array of search patterns this function will look for
includeHTMLCode - if true, this function will search through all content of the URL, including HTML code; if false, it will only search through text found
Throws:
IOException
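The contract amounts to a plain pattern test over the retrieved content; assuming simple substring matching (no wildcards are mentioned), the core logic detached from the URL retrieval might look like this:

```java
public class PatternSearch {
    // True if any of the given plain-text patterns occurs in the content;
    // Spider additionally limits the search to visible text (rather than
    // the full HTML code) when includeHTMLCode is false.
    static boolean includesPattern(String content, String[] searchPattern) {
        for (String pattern : searchPattern) {
            if (content.contains(pattern)) return true;
        }
        return false;
    }
}
```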

getTitle

public String getTitle()
                throws IOException
returns the title of the document

Throws:
IOException

getTagText

public String getTagText(HTML.Tag desiredTag,
                         String delimiter)
                  throws IOException
returns all text found in the given desiredTag delimited by the given delimiter

Throws:
IOException
See Also:
getTagText(Reader, HTML.Tag, String)

getTagText

public String getTagText(Reader reader,
                         HTML.Tag desiredTag,
                         String delimiter)
                  throws IOException
allows the content to be read from a source other than the embedded URL itself

Throws:
IOException
See Also:
getTagText(HTML.Tag, String)

stripText

public String stripText()
                 throws IOException
a line break is put after each separate text occurrence

Throws:
IOException

stripText

public String stripText(String delimiter)
                 throws IOException
returns a String containing the text of all HTML tag types from the embedded URL

Throws:
IOException
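A plausible shape for the text stripping, again via the javax.swing HTML parser: every text run the parser reports is appended, separated by the given delimiter (an illustrative sketch, not the actual implementation):

```java
import java.io.IOException;
import java.io.Reader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class TextStrip {
    // Joins every text run found by the HTML parser with the delimiter;
    // the no-argument stripText() uses a line break as the delimiter.
    static String strip(Reader reader, String delimiter) throws IOException {
        StringBuilder sb = new StringBuilder();
        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (sb.length() > 0) sb.append(delimiter);
                sb.append(data);
            }
        };
        new ParserDelegator().parse(reader, cb, true);
        return sb.toString();
    }
}
```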

stripText

public String stripText(Reader reader,
                        String delimiter)
                 throws IOException
allows the content to be read from a source other than the embedded URL itself

Throws:
IOException
See Also:
stripText(String)

getHTMLDocument

public HTMLDocument getHTMLDocument()
                             throws IOException
returns an HTMLDocument object with the parsed content of the embedded URL for further examination

Throws:
IOException

getHTMLDocument

public HTMLDocument getHTMLDocument(Reader reader)
                             throws IOException
returns an HTMLDocument object with the parsed content from the given reader for further examination

Throws:
IOException

getReader

public Reader getReader()
                 throws IOException
This function constructs a reader appropriate for reading the content from the embedded URL. Currently, this function only supports HTTP, FTP and FILE protocol.

Throws:
IOException
UnsupportedOperationException - if the embedded URL uses a protocol other than HTTP, FTP, or FILE

getReader

public Reader getReader(String charsetName)
                 throws IOException
Throws:
IOException
See Also:
getReader()

clearAuthentication

public void clearAuthentication()

getAuthenticationUser

public String getAuthenticationUser()

setBasicAuthentication

public void setBasicAuthentication(String user,
                                   char[] password)

setBasicAuthorization

public static void setBasicAuthorization(URLConnection con,
                                         String user,
                                         char[] password)
enables the basic authorization mechanism on the given URLConnection
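Basic authorization on a URLConnection amounts to setting the Authorization request header with the Base64-encoded "user:password" pair (RFC 7617); a sketch of that mechanism — the actual method may differ in details such as the charset used:

```java
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuth {
    // Encode "user:password" in Base64 and prefix it with "Basic ".
    static String headerValue(String user, char[] password) {
        String pair = user + ":" + new String(password);
        return "Basic " + Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.ISO_8859_1));
    }

    // Apply the header to a not-yet-connected URLConnection.
    static void setBasicAuthorization(URLConnection con, String user, char[] password) {
        con.setRequestProperty("Authorization", headerValue(user, password));
    }
}
```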


getInputStream

public InputStream getInputStream()
                           throws IOException
obtains the InputStream (basic authentication is applied if previously set)

Throws:
IOException

getInputStreamUsingBasicAuthorization

public InputStream getInputStreamUsingBasicAuthorization(String user,
                                                         char[] password)
                                                  throws IOException
obtains the InputStream using the basic authorization mechanism if user is not null; the given authentication applies only to this call

Throws:
IOException

getBytes

public byte[] getBytes()
                throws IOException
retrieves the raw content from the embedded URL.

Throws:
IOException

getContentAsString

public String getContentAsString()
                          throws IOException
uses default encoding

Throws:
IOException
See Also:
getContentAsString(String)

getContentAsUTF8

public String getContentAsUTF8()
                        throws IOException
Throws:
IOException

getContentAsString

public String getContentAsString(String charset)
                          throws IOException
retrieves the entire content accessible through the embedded URL as a String. If the URL points to an HTML page, the full HTML code is returned. This method is not suitable for retrieving binary data as it uses a BufferedReader and also places platform specific line breaks between the lines read with readLine(). If the URL could not be accessed and an IOException was caught, null is returned.

Throws:
IOException
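The described behavior — line-by-line reading with platform-specific separators inserted, hence unsuitable for binary data — can be sketched over an arbitrary InputStream:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class ContentRead {
    // Read the stream line by line and re-join with the platform line
    // separator -- which is why the original line endings (and any binary
    // content) are not preserved.
    static String contentAsString(InputStream in, String charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line).append(System.lineSeparator());
        }
        return sb.toString();
    }
}
```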

getContentLength

public int getContentLength()
                     throws IOException
retrieves the content length from a URLConnection

Throws:
IOException

isHtmlPage

public boolean isHtmlPage()
                   throws IOException
checks the content type of the opened URLConnection

Throws:
IOException

calculatePageWeight

public int calculatePageWeight()
                        throws IOException
returns the page weight in bytes (= content length of the URL plus the sum of its embedded images)

Throws:
IOException

whois

public String whois()
             throws IOException
returns the registrant information from the Internic database; the embedded URL must use the host name and not the IP address

Throws:
IOException

whois

public static String whois(String domainName)
                    throws IOException
returns the registrant information from the Internic database

Throws:
IOException

crawlWeb

public URL[] crawlWeb(CrawlerSetting crawler,
                      int numberOfURLsToFind,
                      Logger logger)
searches the web from the embedded URL (used as root) for URLs based on the criteria given in the crawler; search is performed breadth-first

Parameters:
crawler - criteria for crawling
numberOfURLsToFind - if >0 the search is stopped when the given number of URLs are found to match the crawler's criteria
logger - to log IOExceptions occurring while processing links
Returns:
an array containing URLs found that satisfy the crawler's criteria as defined by the crawler
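The breadth-first traversal can be sketched generically; linksOf and accept below are hypothetical stand-ins for Spider.getLinks and the CrawlerSetting criteria, not the real API:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;

public class Crawl {
    // Breadth-first search from root: visit nodes level by level, keep a
    // closed set to avoid revisiting, and stop early once numberOfURLsToFind
    // matches have been collected (a value <= 0 means no limit).
    static <T> List<T> crawlWeb(T root, Function<T, List<T>> linksOf,
                                Predicate<T> accept, int numberOfURLsToFind) {
        List<T> results = new ArrayList<>();
        Set<T> closed = new HashSet<>();
        Deque<T> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            T node = queue.poll();
            if (!closed.add(node)) continue; // already examined
            if (accept.test(node)) {
                results.add(node);
                if (numberOfURLsToFind > 0 && results.size() >= numberOfURLsToFind) break;
            }
            queue.addAll(linksOf.apply(node));
        }
        return results;
    }
}
```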

crawlWeb

public static URL[] crawlWeb(List<Spider.URLWrapper> searchList,
                             List<URL> resultList,
                             List<URL> closedList,
                             CrawlerSetting crawler,
                             int depth,
                             int numberOfURLsToFind,
                             Logger logger)
usually called by crawlWeb(CrawlerSetting, int, Logger)

Parameters:
searchList - List of Spider.URLWrapper objects containing nodes to be examined
resultList - List of URL objects
closedList - List of URL objects
crawler - criteria for crawling
depth - link distance from the root of the search
numberOfURLsToFind - if >0 the search is stopped when the given number of URLs are found to match the crawler's criteria
logger - to log IOExceptions occurring while processing links
Returns:
an array containing URLs found that satisfy the criteria as defined by the crawler
See Also:
crawlWeb(CrawlerSetting, int, Logger)

searchWebFor

@Deprecated
public static URL[] searchWebFor(String[] searchPattern,
                                            URL entryPoint,
                                            boolean includeHTMLCode,
                                            int level,
                                            boolean currentSiteOnly,
                                            String[] searchURLExclusionPatterns,
                                            Monitor monitor)
Deprecated. 

This special web search function returns all URLs found that contain one of the desired search patterns, given the constraints of the other parameters. The search starts at the entryPoint and recursively follows the tree of links derived from that URL, as deep as suggested by the level parameter; the search is conducted in a breadth-first manner. For more flexible web searches, consider the use of a com.antelmann.net.CrawlerSetting.
If a monitor is present, it is used to provide feedback while the function is executing; monitor may be null, in which case no feedback is provided.

Parameters:
searchPattern - an array containing String patterns to search for; wildcards are not supported
entryPoint - the URL from where to start the search
includeHTMLCode - if true, the search will include not only the text, but also the HTML code of a page
level - limits the depth of the search; only pages that are reachable through no more than the given number of recursive links will be included
currentSiteOnly - if true, the search is limited to the host of the entryPoint
searchURLExclusionPatterns - if not null it contains an array of String patterns which will be used to filter out unwanted URLs, i.e. if any of the patterns are present in the URL's path, that URL will be disregarded; wildcards are not supported
monitor - see above for usage; may be null
See Also:
crawlWeb(CrawlerSetting, int, Logger)

searchWebFor

@Deprecated
public static List<URL> searchWebFor(String[] searchPattern,
                                                ArrayList<URL> searchList,
                                                boolean includeHTMLCode,
                                                int level,
                                                boolean currentSiteOnly,
                                                List<URL> excludeList,
                                                List<URL> resultList,
                                                String[] searchURLExclusionPatterns,
                                                Monitor monitor)
Deprecated. 

usually called by the other searchWebFor() function; all Lists contain URL objects

See Also:
searchWebFor(String[], URL, boolean, int, boolean, String[], Monitor)


(c) Holger Antelmann since 2001- all rights reserved (contact: info@antelmann.com)
see www.antelmann.com/developer for further details and available downloads