URLCache (Antelmann.com Java Packages)


com.antelmann.net
Class URLCache

java.lang.Object
  com.antelmann.net.URLCache

All Implemented Interfaces:: Refreshable, Serializable

public class URLCache
extends Object
implements Serializable, Refreshable

A wrapper around java.net.URL that caches a copy of the content and adds some additional functionality designed for HTML pages.

The URLCache cache can be made up-to-date by calling refresh().
The URLCache object also maintains the time when it was last updated successfully. The method lastUpdated() will return that time.

Most methods operate on the cached data to avoid reconnecting over the web for each operation. This allows several methods to be called on a URLCache object in sequence with reasonable performance.

If a content-accessing method is called before the data has been cached, the method forces a refresh and waits for it to complete. If the content cannot be refreshed due to an IOException, that exception is thrown immediately, and again by every subsequent method call that accesses the content.

If refresh() is called again later, either the content is cached successfully (and the initial exception becomes irrelevant) or the stored exception is replaced by the new one.

If - once the content is initially cached - subsequent attempts to refresh() the cache fail, all content-accessing methods will continue to use the cached data.
If you then want to find out why subsequent calls to refresh() failed, you will have to register a RefreshListener, as the callback method defined there includes the IOException in case of an unsuccessful refresh attempt.
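
The caching rules described above can be illustrated with a minimal sketch. It is not the URLCache implementation: the class `CacheSemanticsSketch`, its `Fetcher` interface and its synchronous `refresh()` are hypothetical stand-ins, and the fetcher is assumed never to return null.

```java
import java.io.IOException;

/** Hypothetical stand-in for URLCache; only the caching rules are modelled. */
class CacheSemanticsSketch {
    /** Hypothetical source of bytes that stands in for the network fetch. */
    interface Fetcher { byte[] fetch() throws IOException; }

    private final Fetcher fetcher;
    private byte[] cached;        // last successfully fetched content
    private IOException failure;  // remembered only while nothing is cached
    private boolean attempted;

    CacheSemanticsSketch(Fetcher fetcher) { this.fetcher = fetcher; }

    /** Synchronous for brevity; the documented class refreshes in a separate thread. */
    synchronized void refresh() {
        attempted = true;
        try {
            cached = fetcher.fetch();
            failure = null;                   // success makes an earlier exception irrelevant
        } catch (IOException e) {
            if (cached == null) failure = e;  // nothing cached yet: keep the newest failure
            // otherwise the older cached data stays in place
        }
    }

    synchronized byte[] getContent() throws IOException {
        if (!attempted) refresh();            // first access forces a refresh
        if (cached != null) return cached;
        throw failure;                        // rethrown on every access until a refresh succeeds
    }

    public static void main(String[] args) throws IOException {
        boolean[] online = {true};
        CacheSemanticsSketch cache = new CacheSemanticsSketch(() -> {
            if (!online[0]) throw new IOException("offline");
            return "hello".getBytes();
        });
        System.out.println(new String(cache.getContent()));
        online[0] = false;
        cache.refresh();                                      // fails, cached data is kept
        System.out.println(new String(cache.getContent()));
    }
}
```

In this sketch a failed refresh after a successful one is silent, which mirrors why the text above points to a RefreshListener for finding the cause.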

A special case arises when the content of the embedded URL is currently too large to fit into memory, i.e. caching is impossible although the URL is accessible. In that case, methods like getContentAsString() will throw an IOException with the message that the content is too large to fit into memory, while other methods (like saveContentToFile() or getInputStream()) will work directly off the online content. Even though many operations may be difficult with content that cannot be cached due to its size, saveContentToFile() can still be used safely regardless of the size of the content (although it may take quite a bit of time). The method tooLargeForCaching() indicates whether this is the case. Note that a later call to refresh() - after some memory has been freed up - may be able to cache the same object successfully.

In addition, this class maintains a static map serving as an application-wide cache for URLCache objects, which can be accessed using the put() and get() methods.

Currently, this implementation starts a new thread whenever refresh() is called. A future revision may reduce the performance overhead associated with this. The implementation is suited for large content on slow networks, where it makes sense to load the data for each URL simultaneously in separate threads.

Note that many methods assume that the underlying content is HTML data; if that is untrue for a specific object, these methods may return empty objects.

Note that the only data that is actually cached is a byte array that represents the content fetched from the URL and a header map; all other information (title, links, images, etc.) will be calculated each time based on the cached byte array.

Since:: 4/2/2002
Author:: Holger Antelmann
See Also:: Spider, Serialized Form

Nested Class Summary
`static interface`	`URLCache.RefreshListener` RefreshListener objects can register with URLCache objects to be notified when the URLCache object is refreshed

Constructor Summary
`URLCache(String spec)` constructs the URLCache object based on the spec denoting the absolute path of the URL and without refresh
`URLCache(URL url)` calls `URLCache(url, false)`
`URLCache(URL url, boolean refreshNow)` constructs the URLCache object based on the given URL.

Method Summary
`void`	`addRefreshListener(URLCache.RefreshListener listener)`
`int`	`bytesReadByCurrentRefresh()` returns the number of bytes currently read by the refresh thread; returns -1 if no refresh in progress
`void`	`clearCache()` interrupts any ongoing refresh process and clears the cache; subsequent calls to any content will force a new refresh
`boolean`	`containsRefreshListener(URLCache.RefreshListener listener)`
`boolean`	`equals(Object obj)` tests equality on whether the embedded URL is the same file
`byte[]`	`getContent()` returns the raw cached content.
`String`	`getContentAsString()`
`String`	`getContentAsString(String charsetName)`
`String`	`getContentEncoding()` returns the header value from the cached content
`String`	`getContentType()` returns the header value from the cached content
`String`	`getFileExtension()` returns the file type denoted by the path of the URL.
`String`	`getHeaderField(String fieldKey)` retrieves the first field value matching the fieldKey based on case-insensitive key search
`Map<?,?>`	`getHeaderFields()` returns a Map to the cached header fields
`HTMLDocument`	`getHTMLDocument()` returns a new HTMLDocument initialized with the cached content of this URLCache object.
`URL[]`	`getImages()` returns URLs to all unique images embedded in the cached HTML document
`InputStream`	`getInputStream()` returns an input stream from the cached content (suitable for binary data).
`long`	`getLastModified()` returns the header value from the cached content
`long`	`getLastRefreshTime()` returns the time taken by the last successful refresh; -1 is returned if content was never successfully refreshed.
`URL[]`	`getLinks()` returns URLs of all links from the cached HTML document.
`Reader`	`getReader()` returns a reader from the cached content (suitable for non-binary data).
`Reader`	`getReader(String charsetName)` returns a reader from the cached content by using the specified charset for decoding.
`int`	`getRealContentLength()` returns the actual length of the already cached data or -1 if the data is too large to fit into memory.
`URLCache.RefreshListener[]`	`getRefreshListener()`
`String`	`getTagText(HTML.Tag desiredTag, String delimiter)` returns all text from the HTML cache data that is found in the given tag.
`String`	`getTitle()` returns the HTML title of the cached document
`URL`	`getURL()` returns the underlying URL object.
`int`	`hashCode()` hashes based on the embedded URL
`boolean`	`isCached()` returns true only if the content has ever been successfully refreshed before
`boolean`	`isRefreshing()` returns true only if the cache is currently being refreshed
`boolean`	`isUpToDate()` checks whether the timestamp provided by the online content is no later than your last successful refresh.
`long`	`lastRefreshed()` returns the time when the last refresh() attempt was performed - whether or not successful.
`long`	`lastUpdated()` returns the time when this object was last refreshed successfully; 0 is returned if no refresh has been performed yet
`int`	`peekContentLength()` returns the `content-length` header field directly from the online data; the cache is neither affected nor used.
`void`	`refreshAndWait()` returns only after the refresh finished
`void`	`refreshContent()` updates the cached content asynchronously with a fresh copy directly from the web.
`void`	`removeRefreshListener(URLCache.RefreshListener listener)`
`void`	`saveContentToFile(File file)` calls `saveContentToFile(file, false)`
`void`	`saveContentToFile(File file, boolean streamDirectlyFromURL)` If the file might not fit into memory, `streamDirectlyFromURL` should be true to stream directly from the URL; otherwise this method simply writes the cache to the file.
`void`	`stopCurrentRefresh()` interrupts the currently ongoing refresh process (if any) and then returns; previously cached data and subsequent calls are unaffected
`String`	`stripText()` calls the other stripText() method with a line break as delimiter
`String`	`stripText(String delimiter)` returns a String containing the text of all HTML tag types, separated by the given delimiter
`boolean`	`tooLargeForCaching()` return value of true indicates that even though the content to the URL is accessible, the data is too large to be cached given the current memory.
`String`	`toString()`
`boolean`	`verifyContent()` checks whether the cached content equals the current live online content.
`void`	`waitForRefresh()` This method only returns after ensuring that a cached result from a previous refresh() is available (either the cached data or the cached IOException).

Methods inherited from class java.lang.Object
`clone, finalize, getClass, notify, notifyAll, wait, wait, wait`

Constructor Detail

URLCache

public URLCache(String spec)
         throws MalformedURLException

constructs the URLCache object based on the spec denoting the absolute path of the URL and without refresh

Throws:: MalformedURLException
See Also:: URLCache(URL, boolean)

URLCache

public URLCache(URL url)

calls URLCache(url, false)

See Also:: URLCache(URL, boolean)

URLCache

public URLCache(URL url,
                boolean refreshNow)

constructs the URLCache object based on the given URL. If refreshNow is true, an immediate call to refresh() is made in order to obtain a cached version; the refresh is performed asynchronously, i.e. the constructor returns immediately. A delay may then only be experienced when the first content accessing method is called.

See Also:: refreshContent()

Method Detail

refreshAndWait

public void refreshAndWait()

returns only after the refresh finished

See Also:: refreshContent()

refreshContent

public void refreshContent()

updates the cached content asynchronously with a fresh copy directly from the web.
This method returns immediately; the refresh is performed in a separate thread.

If you want to be notified when a URLCache object has completed a refresh call (whether or not the refresh was successful), you can register a RefreshListener. The call to the RefreshListener also reports whether the refresh succeeded; if it did not, the IOException that caused this latest refresh() to fail is included, too.

As only one refresh thread per instance is allowed at a time, a call to refresh() may cause listeners to be notified of a failure due to concurrent refresh() calls; the IOException included in the callback will reflect that. In fact, this method doesn't allow a subsequent refresh less than a second after the last refresh finished (in those cases, the method call is simply ignored). If a refresh attempt is unsuccessful, all previously cached data is maintained from the time that lastUpdated() indicates. The time of the last refresh attempt can be retrieved by calling lastRefreshed(); if the last call was successful, lastUpdated() and lastRefreshed() will return the same value.

Specified by:: refreshContent in interface Refreshable

See Also:: addRefreshListener(URLCache.RefreshListener), lastRefreshed(), lastUpdated(), isRefreshing(), refreshAndWait(), URLCache.RefreshListener
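
The asynchronous refresh with listener notification described for refreshContent() can be sketched as follows. The real signature of URLCache.RefreshListener is not shown on this page, so the `Listener` interface, the `Fetcher` interface and the class `AsyncRefreshSketch` are hypothetical; the one-second minimum interval between refreshes is not modelled.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical sketch: one refresh thread at a time, listeners told the outcome. */
class AsyncRefreshSketch {
    /** Hypothetical callback; cause is null when the refresh succeeded. */
    interface Listener { void refreshed(boolean success, IOException cause); }
    /** Hypothetical source of bytes that stands in for the network fetch. */
    interface Fetcher { byte[] fetch() throws IOException; }

    private final List<Listener> listeners = new CopyOnWriteArrayList<>();
    private final Fetcher fetcher;
    private volatile byte[] cached;
    private Thread worker;

    AsyncRefreshSketch(Fetcher fetcher) { this.fetcher = fetcher; }

    void addListener(Listener l) { listeners.add(l); }

    byte[] cachedContent() { return cached; }

    /** Starts a background refresh and returns immediately. */
    synchronized void refreshContent() {
        if (worker != null && worker.isAlive()) {
            // a concurrent call is reported to listeners as a failure
            notifyListeners(false, new IOException("refresh already in progress"));
            return;
        }
        worker = new Thread(() -> {
            try {
                cached = fetcher.fetch();
                notifyListeners(true, null);
            } catch (IOException e) {
                notifyListeners(false, e);   // previously cached data is kept
            }
        });
        worker.start();
    }

    /** Returns only after the background thread has finished. */
    void refreshAndWait() throws InterruptedException {
        Thread t;
        synchronized (this) {
            refreshContent();
            t = worker;
        }
        t.join();
    }

    private void notifyListeners(boolean success, IOException cause) {
        for (Listener l : listeners) l.refreshed(success, cause);
    }

    public static void main(String[] args) throws InterruptedException {
        AsyncRefreshSketch cache = new AsyncRefreshSketch(() -> "data".getBytes());
        cache.addListener((ok, cause) ->
                System.out.println(ok ? "refreshed" : "failed: " + cause));
        cache.refreshAndWait();
    }
}
```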

stopCurrentRefresh

public void stopCurrentRefresh()

interrupts the currently ongoing refresh process (if any) and then returns; previously cached data and subsequent calls are unaffected

See Also:: refreshContent()

clearCache

public void clearCache()

interrupts any ongoing refresh process and clears the cache; subsequent calls to any content will force a new refresh

See Also:: refreshContent()

waitForRefresh

public void waitForRefresh()

This method only returns after ensuring that a cached result from a previous refresh() is available (either the cached data or the cached IOException).

See Also:: refreshContent(), isCached()

getHeaderFields

public Map<?,?> getHeaderFields()
                         throws IOException

returns a Map to the cached header fields

Throws:: IOException - if the headers could not be cached
See Also:: URLConnection.getHeaderFields()

getHeaderField

public String getHeaderField(String fieldKey)
                      throws IOException

retrieves the first field value matching the fieldKey based on case-insensitive key search

Parameters:: fieldKey - if null, it returns the HTTP response if available
Throws:: IOException
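
A case-insensitive lookup of the first header value might look like the sketch below. It assumes the header map has the shape returned by URLConnection.getHeaderFields(), where the status line is stored under the null key; the class and method names are hypothetical and not part of URLCache.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a case-insensitive header lookup. */
class HeaderLookupSketch {
    /** Returns the first value whose key matches fieldKey ignoring case, or null. */
    static String firstHeaderValue(Map<String, List<String>> headers, String fieldKey) {
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            String key = e.getKey();
            // the null key carries the HTTP response line
            boolean match = (key == null) ? fieldKey == null : key.equalsIgnoreCase(fieldKey);
            if (match && e.getValue() != null && !e.getValue().isEmpty()) {
                return e.getValue().get(0);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, List<String>> headers = new HashMap<>();
        headers.put(null, Arrays.asList("HTTP/1.1 200 OK"));
        headers.put("Content-Type", Arrays.asList("text/html"));
        System.out.println(firstHeaderValue(headers, "content-type"));
        System.out.println(firstHeaderValue(headers, null));
    }
}
```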

getContentType

public String getContentType()
                      throws IOException

returns the header value from the cached content

Throws:: IOException

getContentEncoding

public String getContentEncoding()
                          throws IOException

returns the header value from the cached content

Throws:: IOException

getLastModified

public long getLastModified()
                     throws IOException

returns the header value from the cached content

Throws:: IOException

getContent

public byte[] getContent()
                  throws IOException

returns the raw cached content. null is returned only if content cannot be cached (due to memory limitations)

WARNING: Altering the returned array means altering the internal cache, which is in effect until the next successful refresh.

Throws:: IOException
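
The warning above is the usual consequence of returning an internal array by reference. The following sketch uses a hypothetical class, `SharedArraySketch`, to show the effect and how a caller can work on a copy instead.

```java
/** Hypothetical sketch: a getter that exposes its internal array, as getContent() does. */
class SharedArraySketch {
    private final byte[] cache;

    SharedArraySketch(byte[] initial) { this.cache = initial.clone(); }

    /** Returns the internal array itself, not a copy. */
    byte[] getContent() { return cache; }

    /** Caller-side defensive copy; null (content not cacheable) is passed through. */
    static byte[] copyOf(byte[] content) {
        return content == null ? null : content.clone();
    }

    public static void main(String[] args) {
        SharedArraySketch c = new SharedArraySketch(new byte[] {1, 2, 3});
        byte[] safe = copyOf(c.getContent());
        safe[0] = 42;                            // only the copy changes
        System.out.println(c.getContent()[0]);   // prints 1
        c.getContent()[0] = 42;                  // writes through to the cache
        System.out.println(c.getContent()[0]);   // prints 42
    }
}
```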

getContentAsString

public String getContentAsString()
                          throws IOException

Throws:: IOException

getContentAsString

public String getContentAsString(String charsetName)
                          throws IOException

Throws:: IOException

getReader

public Reader getReader()
                 throws IOException

returns a reader from the cached content (suitable for non-binary data). If the data is too large to be cached, this method returns a reader to the online content.

Throws:: IOException

getReader

public Reader getReader(String charsetName)
                 throws IOException

returns a reader from the cached content by using the specified charset for decoding. If the data is too large to be cached, this method returns a reader to the online content.

Throws:: IOException
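
Decoding cached bytes with an explicit charset can be sketched with standard JDK classes. The class `CharsetReaderSketch` and its methods are hypothetical and only show why the charset name matters; they are not the URLCache implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: a Reader over cached bytes using a named charset. */
class CharsetReaderSketch {
    /** Throws UnsupportedEncodingException (an IOException) for unknown charset names. */
    static Reader readerFor(byte[] cached, String charsetName) throws IOException {
        return new InputStreamReader(new ByteArrayInputStream(cached), charsetName);
    }

    /** Reads the whole content into a String using the given charset. */
    static String decode(byte[] cached, String charsetName) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = readerFor(cached, charsetName)) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(latin1, "ISO-8859-1"));  // decoded with the matching charset
        System.out.println(decode(latin1, "UTF-8"));       // wrong charset garbles the last character
    }
}
```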

getInputStream

public InputStream getInputStream()
                           throws IOException

returns an input stream from the cached content (suitable for binary data). Only if the data is too large to be cached does this method return a stream to the online content.

Throws:: IOException

getTitle

public String getTitle()
                throws IOException

returns the HTML title of the cached document

Throws:: IOException

getLinks

public URL[] getLinks()
               throws IOException

returns URLs of all links from the cached HTML document. Note that duplicate links will only be included once

Throws:: IOException
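
One way to collect unique links from HTML text is the JDK's Swing HTML parser. The sketch below is an assumption about how such extraction could work, not the URLCache code; the class `LinkExtractSketch` is hypothetical, and it returns a `Set<URI>` rather than the `URL[]` of the documented method.

```java
import java.io.IOException;
import java.io.StringReader;
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical sketch: unique, absolute link targets of an HTML document. */
class LinkExtractSketch {
    static Set<URI> links(String html, URI base) throws IOException {
        Set<URI> found = new LinkedHashSet<>();   // keeps order, drops duplicates
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.A) return;
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href == null) return;
                try {
                    found.add(base.resolve(href.toString()));  // make relative links absolute
                } catch (IllegalArgumentException ignored) {
                    // malformed href: skipped in this sketch
                }
            }
        };
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return found;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><a href=\"a.html\">one</a>"
                + "<a href=\"a.html\">again</a></body></html>";
        System.out.println(links(html, URI.create("http://example.com/dir/index.html")));
    }
}
```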

getImages

public URL[] getImages()
                throws IOException

returns URLs to all unique images embedded in the cached HTML document

Throws:: IOException

getHTMLDocument

public HTMLDocument getHTMLDocument()
                             throws IOException

returns a new HTMLDocument initialized with the cached content of this URLCache object. If - for some reason - a BadLocationException is caught, this method returns null.

Throws:: IOException

getTagText

public String getTagText(HTML.Tag desiredTag,
                         String delimiter)
                  throws IOException

returns all text from the HTML cache data that is found in the given tag.
The separate text sequences found are delimited by the given delimiter.

Throws:: IOException

stripText

public String stripText()
                 throws IOException

calls the other stripText() method with a line break as delimiter

Throws:: IOException

stripText

public String stripText(String delimiter)
                 throws IOException

returns a String containing the text of all HTML tag types, separated by the given delimiter

Throws:: IOException
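
Text stripping with a delimiter can be sketched with the same JDK parser callback mechanism. The class `TextStripSketch` is hypothetical; whether URLCache handles whitespace, scripts or titles in the same way is not stated on this page.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.StringJoiner;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical sketch: joins all text runs of an HTML document with a delimiter. */
class TextStripSketch {
    static String stripText(String html, String delimiter) throws IOException {
        StringJoiner out = new StringJoiner(delimiter);
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                out.add(new String(data));   // one entry per text run reported by the parser
            }
        };
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><p>Hello</p><p>World</p></body></html>";
        System.out.println(stripText(html, "\n"));
    }
}
```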

isRefreshing

public boolean isRefreshing()

returns true only if the cache is currently being refreshed

lastRefreshed

public long lastRefreshed()

returns the time when the last refresh() attempt was performed - whether or not successful. If the last attempt was successful, the returned value is identical to the return value of lastUpdated()

See Also:: lastUpdated()

lastUpdated

public long lastUpdated()

returns the time when this object was last refreshed successfully; 0 is returned if no refresh has been performed yet

See Also:: lastRefreshed()

getLastRefreshTime

public long getLastRefreshTime()

returns the time taken by the last successful refresh; -1 is returned if content was never successfully refreshed.

See Also:: refreshContent()

verifyContent

public boolean verifyContent()
                      throws IOException

checks whether the cached content equals the current live online content. This method will connect to the actual URL and only return once the cached data has been fully compared to the live content.

It is a bad idea to call this method to see whether a refresh() is needed, as a call to this method is just as expensive as refresh() itself, only that the latter returns immediately. If you just want to check the provided timestamp of the online content to see whether it is no later than your last successful refresh, use isUpToDate() instead.

Returns:: true if the cached content is equal to the live online content and false if the content is different
Throws:: IOException - if the connection to the live online content failed
See Also:: isUpToDate()
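
The cost of verifyContent() comes from reading the complete live content. A byte-by-byte comparison of cached data against a stream could be sketched as below; the class `VerifySketch` is hypothetical, and any InputStream stands in for the live connection.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical sketch: compares cached bytes with live content read from a stream. */
class VerifySketch {
    /** Stops at the first difference; the live content is never held in memory as a whole. */
    static boolean sameContent(byte[] cached, InputStream live) throws IOException {
        try (InputStream in = new BufferedInputStream(live)) {
            for (byte expected : cached) {
                int b = in.read();
                if (b == -1 || (byte) b != expected) return false;  // shorter or different
            }
            return in.read() == -1;   // live content must not be longer either
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] cached = "same".getBytes();
        System.out.println(sameContent(cached, new ByteArrayInputStream("same".getBytes())));
        System.out.println(sameContent(cached, new ByteArrayInputStream("same!".getBytes())));
    }
}
```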

isUpToDate

public boolean isUpToDate()
                   throws IOException

checks whether the timestamp provided by the online content is no later than your last successful refresh.

Note that the result may not be accurate if the header timestamp of the online content is incorrect or missing.

If you need to verify that the exact online content is in fact identical to the cached content, use verifyContent() instead.

Throws:: IOException
See Also:: verifyContent()

getURL

public URL getURL()

returns the underlying URL object.
Note that any operation on the URL directly cannot take advantage of the caching in URLCache.

addRefreshListener

public void addRefreshListener(URLCache.RefreshListener listener)

See Also:: removeRefreshListener(URLCache.RefreshListener), refreshContent()

removeRefreshListener

public void removeRefreshListener(URLCache.RefreshListener listener)

See Also:: addRefreshListener(URLCache.RefreshListener)

containsRefreshListener

public boolean containsRefreshListener(URLCache.RefreshListener listener)

getRefreshListener

public URLCache.RefreshListener[] getRefreshListener()

isCached

public boolean isCached()

returns true only if the content has ever been successfully refreshed before

bytesReadByCurrentRefresh

public int bytesReadByCurrentRefresh()

returns the number of bytes currently read by the refresh thread; returns -1 if no refresh in progress

getRealContentLength

public int getRealContentLength()
                         throws IOException

returns the actual length of the already cached data or -1 if the data is too large to fit into memory.

Note that before this method can return, this object will have already attempted to load the entire content into memory. If you want to avoid that and just want to peek at what the online content provides for its content length, use peekContentLength() instead.

Throws:: IOException - if the data cannot be accessed
See Also:: peekContentLength()

peekContentLength

public int peekContentLength()
                      throws IOException

returns the content-length header field directly from the online data; the cache is neither affected nor used.

You can use this method if you want to get information about the content length before you attempt to download or cache the content. If you need an always accurate content length, use getRealContentLength(), which will return the exact length of the cached content (after the entire content has been loaded, though).

Throws:: IOException
See Also:: getRealContentLength(), Spider.getContentLength()
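
Reading the advertised content length without downloading the body can be sketched with java.net.URLConnection. The class `PeekLengthSketch` is hypothetical; the example uses a temporary `file:` URL so that it runs without network access, on the assumption that an HTTP URL would be queried the same way.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch: asks the connection for its content-length header only. */
class PeekLengthSketch {
    /** Returns -1 if the length is unknown or does not fit into an int. */
    static int peekContentLength(URL url) throws IOException {
        URLConnection connection = url.openConnection();
        return connection.getContentLength();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("peek", ".bin");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, new byte[] {1, 2, 3, 4, 5});
        System.out.println(peekContentLength(tmp.toUri().toURL()));
    }
}
```

As the text above notes, the header value is only what the server advertises; it can be missing, which is why the length of the cached data remains the accurate figure.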

tooLargeForCaching

public boolean tooLargeForCaching()

return value of true indicates that even though the content to the URL is accessible, the data is too large to be cached given the current memory.
This method can only return true after a refresh attempt has failed due to the size of the content.

If this method returns true, you can still obtain an input stream or a reader to the data; these methods will then simply read from the online content. Also, you can still save the data to a file. Methods like getContentAsString(), however, will then throw an IOException stating that there is no memory available.

This method may also return true if memory is simply exhausted at a particular point in time and the URL content is actually not that large; in that case, you may try refreshing again after freeing up some memory, or you can check the online content length - if provided for the URL in question - with peekContentLength().

See Also:: peekContentLength()

getFileExtension

public String getFileExtension()

returns the file type denoted by the path of the URL. The extension is the String of those characters that follow the last 'dot' (".") in the file name in lowercase. If no extension is present, null is returned.
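
The extension rule above can be sketched as a small helper that works on a URL path string. The class and method names are hypothetical, and returning null for a name that ends in a dot is an assumption, since the text does not cover that case.

```java
import java.util.Locale;

/** Hypothetical sketch of the extension rule: text after the last dot, lowercased. */
class FileExtensionSketch {
    static String fileExtension(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);  // file name part only
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            return null;   // no dot, or nothing after it (assumption)
        }
        return name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(fileExtension("/docs/Index.HTML"));   // prints html
        System.out.println(fileExtension("/v1.2/readme"));       // prints null
    }
}
```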

saveContentToFile

public void saveContentToFile(File file)
                       throws IOException

calls saveContentToFile(file, false)

Throws:: IOException
See Also:: saveContentToFile(File, boolean)

saveContentToFile

public void saveContentToFile(File file,
                              boolean streamDirectlyFromURL)
                       throws IOException

If the file might not fit into memory, streamDirectlyFromURL should be true to stream directly from the URL; otherwise this method simply writes the cache to the file.

Throws:: IOException
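
The two ways of saving, from the cache or streamed from the source, can be sketched with java.nio.file.Files. The class `SaveSketch` and its `StreamSource` interface are hypothetical stand-ins for the URL connection; falling back to streaming when nothing is cached is an assumption of the sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch: write the cached bytes, or stream without buffering everything. */
class SaveSketch {
    /** Hypothetical stand-in for opening a stream on the URL. */
    interface StreamSource { InputStream open() throws IOException; }

    static void save(byte[] cached, StreamSource live, Path target, boolean streamDirectly)
            throws IOException {
        if (streamDirectly || cached == null) {
            // copies in chunks, so the content never has to fit into memory
            try (InputStream in = live.open()) {
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
        } else {
            Files.write(target, cached);   // simply writes the cache to the file
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("content", ".bin");
        target.toFile().deleteOnExit();
        byte[] online = "live content".getBytes();
        save(null, () -> new ByteArrayInputStream(online), target, true);
        System.out.println(Files.size(target));
    }
}
```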

equals

public boolean equals(Object obj)

tests equality on whether the embedded URL is the same file

Overrides:: equals in class Object

hashCode

public int hashCode()

hashes based on the embedded URL

Overrides:: hashCode in class Object

toString

public String toString()

Overrides:: toString in class Object
