Holger's Java API
java.lang.Object
  com.antelmann.net.URLCache
public class URLCache
A wrapper around java.net.URL that caches a copy of the content and adds some additional functionality designed for HTML pages.
The cache can be brought up to date by calling refreshContent().
The URLCache object also records the time of its last successful update; lastUpdated() returns that time.
Most methods operate on the cached data, so they do not need to reconnect over the web for each operation. Several methods can therefore be called on a URLCache object in sequence with reasonable performance.
If a content-accessing method is called before the data has been cached, the method forces a refresh and waits for it to complete. If that refresh fails with an IOException, the exception is thrown immediately, and again by every subsequent method call that accesses the content.
If refreshContent() is called again later, either the content is cached successfully (and the initial exception becomes irrelevant) or the stored exception is replaced by the new one.
If, once the content has been cached, later attempts to refresh the cache fail, all content-accessing methods still fall back to the cached data.
To find out why such a later refresh failed, register a RefreshListener; its callback receives the IOException of an unsuccessful refresh attempt.
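The rules above (first access forces a refresh, an initial failure is remembered, and a later failure never discards data that was already cached) can be sketched as a small state holder. This is only an illustration under stated assumptions: `ContentLoader` and `CachedContentSketch` are hypothetical names, the refresh is synchronous for brevity, and none of this is taken from the actual URLCache source.

```java
import java.io.IOException;

/** Hypothetical stand-in for whatever fetches the bytes; not part of URLCache. */
interface ContentLoader {
    byte[] load() throws IOException;
}

/** Illustrative only: mirrors the documented rules, not the real implementation. */
class CachedContentSketch {
    private final ContentLoader loader;
    private byte[] content;       // last successfully cached data, or null
    private IOException failure;  // remembered only while nothing is cached

    CachedContentSketch(ContentLoader loader) {
        this.loader = loader;
    }

    /** Synchronous refresh; a failure never discards data cached earlier. */
    synchronized boolean refresh() {
        try {
            byte[] loaded = loader.load();
            if (loaded == null) {
                throw new IOException("loader returned no data");
            }
            content = loaded;
            failure = null;       // success makes the initial exception irrelevant
            return true;
        } catch (IOException e) {
            if (content == null) {
                failure = e;      // nothing cached yet: keep the newest exception
            }
            return false;
        }
    }

    /** First access forces a refresh; afterwards the cached state is served. */
    synchronized byte[] getContent() throws IOException {
        if (content == null && failure == null) {
            refresh();
        }
        if (content != null) {
            return content;
        }
        throw failure;
    }
}
```

In this sketch a caller that only ever sees the cached bytes cannot tell that a later refresh failed, which is why the class documentation points to RefreshListener for that information.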
A special case arises when the content of the embedded URL is too large to fit into the memory currently available, i.e. caching is impossible although the URL is accessible. In that case, methods like getContentAsString() throw an IOException with a message saying that the content is too large to fit into memory, while other methods (like saveContentToFile() or getInputStream()) work directly off the online content.
Many operations are difficult with content that cannot be cached because of its size, but saveContentToFile() can still be used safely regardless of the size of the content (although it may take some time).
The method tooLargeForCaching() indicates whether this is the case. Note that a later call to refreshContent(), after some memory has been freed, may be able to cache the same object successfully.
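Saving works for content of any size because a stream can be copied in fixed-size chunks without ever holding the whole content in memory. The following is a minimal sketch of that general technique using only the JDK; `StreamToFileSketch` and `copyToFile` are hypothetical names, and the actual saveContentToFile() implementation may differ.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class StreamToFileSketch {
    /** Copies the stream in 8 KB chunks; returns the number of bytes written. */
    static long copyToFile(InputStream in, Path target) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        try (OutputStream out = Files.newOutputStream(target)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
            }
        }
        return total;
    }
}
```

For online content the input stream would typically come from `url.openStream()`; memory use stays at the buffer size no matter how large the resource is.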
In addition, this class maintains a static map serving as an application-wide cache for URLCache objects, which can be accessed using the put() and get() methods.
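The signatures of put() and get() are not shown on this page, so the following is only a sketch of what an application-wide registry of this kind can look like. `CacheRegistrySketch` is a hypothetical name, the entries are plain byte arrays instead of URLCache objects, and keying by the URL's string form is an assumption made here because `java.net.URL.equals()` and `hashCode()` may perform host name resolution.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class CacheRegistrySketch {
    // One shared, thread-safe map for the whole application.
    private static final Map<String, byte[]> REGISTRY = new ConcurrentHashMap<>();

    /** Stores an entry; returns the entry previously stored under that URL, or null. */
    static byte[] put(String url, byte[] entry) {
        return REGISTRY.put(url, entry);
    }

    /** Returns the entry stored under that URL, or null if there is none. */
    static byte[] get(String url) {
        return REGISTRY.get(url);
    }
}
```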
Currently, this implementation starts a new thread whenever refreshContent() is called. A future revision may reduce the performance overhead associated with this. The implementation is aimed at large content on slow networks, where it makes sense to load the data for each URL simultaneously in separate threads.
Note that many methods assume that the underlying content is HTML data; if that is untrue for a specific object, these methods may return empty objects.
Note that the only data that is actually cached is a byte array that represents the content fetched from the URL and a header map; all other information (title, links, images, etc.) will be calculated each time based on the cached byte array.
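To illustrate what "calculated each time based on the cached byte array" means in practice, the sketch below derives a title from raw bytes on every call and returns an empty string for non-HTML data. It is a simplification under stated assumptions: `TitleFromBytesSketch` is a hypothetical name, a regular expression stands in for a real HTML parser, and the ISO-8859-1 decoding is an arbitrary choice for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleFromBytesSketch {
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Re-parses the byte array on every call; returns "" when no title is found. */
    static String title(byte[] cachedContent) {
        String html = new String(cachedContent, StandardCharsets.ISO_8859_1);
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }
}
```

Nothing derived is stored, so repeated calls repeat the parsing work; callers that need the same value often may want to keep it in a local variable.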
See Also: Spider, Serialized Form

| Nested Class Summary | |
|---|---|
| static interface | URLCache.RefreshListener: RefreshListener objects can register with URLCache objects to be notified when the URLCache object is refreshed |

| Constructor Summary | |
|---|---|
| URLCache(String spec) | constructs the URLCache object based on the spec denoting the absolute path of the URL, without an initial refresh |
| URLCache(URL url) | calls URLCache(url, false) |
| URLCache(URL url, boolean refreshNow) | constructs the URLCache object based on the given URL. |

| Method Summary | |
|---|---|
| void | addRefreshListener(URLCache.RefreshListener listener) |
| int | bytesReadByCurrentRefresh(): returns the number of bytes read so far by the refresh thread; returns -1 if no refresh is in progress |
| void | clearCache(): interrupts any ongoing refresh process and clears the cache; subsequent calls to any content will force a new refresh |
| boolean | containsRefreshListener(URLCache.RefreshListener listener) |
| boolean | equals(Object obj): tests equality on whether the embedded URL is the same file |
| byte[] | getContent(): returns the raw cached content. |
| String | getContentAsString() |
| String | getContentAsString(String charsetName) |
| String | getContentEncoding(): returns the header value from the cached content |
| String | getContentType(): returns the header value from the cached content |
| String | getFileExtension(): returns the file type denoted by the path of the URL. |
| String | getHeaderField(String fieldKey): retrieves the first field value matching the fieldKey based on case-insensitive key search |
| Map<?,?> | getHeaderFields(): returns a Map of the cached header fields |
| HTMLDocument | getHTMLDocument(): returns a new HTMLDocument initialized with the cached content of this URLCache object. |
| URL[] | getImages(): returns URLs to all unique images embedded in the cached HTML document |
| InputStream | getInputStream(): returns an input stream from the cached content (suitable for binary data). |
| long | getLastModified(): returns the header value from the cached content |
| long | getLastRefreshTime(): returns the time taken by the last successful refresh; -1 is returned if content was never successfully refreshed. |
| URL[] | getLinks(): returns URLs of all links from the cached HTML document. |
| Reader | getReader(): returns a reader from the cached content (suitable for non-binary data). |
| Reader | getReader(String charsetName): returns a reader from the cached content by using the specified charset for decoding. |
| int | getRealContentLength(): returns the actual length of the already cached data or -1 if the data is too large to fit into memory. |
| URLCache.RefreshListener[] | getRefreshListener() |
| String | getTagText(HTML.Tag desiredTag, String delimiter): returns all text from the HTML cache data that is found in the given tag. |
| String | getTitle(): returns the HTML title of the cached document |
| URL | getURL(): returns the underlying URL object. |
| int | hashCode(): hashes based on the embedded URL |
| boolean | isCached(): returns true only if the content has ever been successfully refreshed before |
| boolean | isRefreshing(): returns true only if the cache is currently being refreshed |
| boolean | isUpToDate(): checks whether the timestamp provided by the online content is no later than your last successful refresh. |
| long | lastRefreshed(): returns the time when the last refresh attempt was performed, whether or not it was successful. |
| long | lastUpdated(): returns the time when this object was last refreshed successfully; 0 is returned if no refresh has been performed yet |
| int | peekContentLength(): returns the content-length header field directly from the online data; the cache is neither affected nor used. |
| void | refreshAndWait(): returns only after the refresh finished |
| void | refreshContent(): updates the cached content asynchronously with a fresh copy directly from the web. |
| void | removeRefreshListener(URLCache.RefreshListener listener) |
| void | saveContentToFile(File file): calls saveContentToFile(file, false) |
| void | saveContentToFile(File file, boolean streamDirectlyFromURL): If the file might not fit into memory, streamDirectlyFromURL should be true to stream directly from the URL; otherwise this method simply writes the cache to the file. |
| void | stopCurrentRefresh(): interrupts any ongoing refresh process and then returns; previously cached data and subsequent calls are unaffected |
| String | stripText(): calls the other stripText() method with a line break as delimiter |
| String | stripText(String delimiter): returns a String containing the text of all HTML tag types, separated by the given delimiter |
| boolean | tooLargeForCaching(): a return value of true indicates that even though the content of the URL is accessible, the data is too large to be cached given the current memory. |
| String | toString() |
| boolean | verifyContent(): checks whether the cached content equals the current live online content. |
| void | waitForRefresh(): This method only returns after ensuring that a cached result from a previous refresh is available (either the cached data or the cached IOException). |

| Methods inherited from class java.lang.Object |
|---|
| clone, finalize, getClass, notify, notifyAll, wait, wait, wait |

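| Constructor Detail |
|---|

public URLCache(String spec)
throws MalformedURLException
constructs the URLCache object based on the spec denoting the absolute path of the URL, without an initial refresh
Throws: MalformedURLException
See Also: URLCache(URL, boolean)

public URLCache(URL url)
calls URLCache(url, false)
See Also: URLCache(URL, boolean)

public URLCache(URL url, boolean refreshNow)
constructs the URLCache object based on the given URL. If refreshNow is true, a call to refreshContent() is made in order to obtain a cached version; the refresh is performed asynchronously, i.e. the constructor returns immediately. A delay may then only be experienced when the first content-accessing method is called.
See Also: refreshContent()

| Method Detail |
|---|
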
public void refreshAndWait()
returns only after the refresh finished
See Also: refreshContent()

public void refreshContent()
updates the cached content asynchronously with a fresh copy directly from the web.
If you want to be notified when a URLCache object has completed a refresh call (whether or not the refresh was successful), you can register a RefreshListener. The call to the RefreshListener also indicates whether the refresh succeeded; if it did not, the IOException that caused this latest refresh to fail is included as well.
As only one refresh thread per instance is allowed at a time, a call to refreshContent() may cause listeners to be notified of a failure due to concurrent refresh calls; the IOException included in the callback will reflect that.
In addition, this method does not allow another refresh less than a second after the last refresh finished (in that case, the method call is simply ignored).
If a refresh attempt is unsuccessful, all previously cached data is kept as of the time that lastUpdated() indicates. The time of the last refresh attempt can be retrieved by calling lastRefreshed(); if the last attempt was successful, lastUpdated() and lastRefreshed() return the same value.
Specified by: refreshContent in interface Refreshable
See Also: addRefreshListener(URLCache.RefreshListener), lastRefreshed(), lastUpdated(), isRefreshing(), refreshAndWait(), URLCache.RefreshListener

public void stopCurrentRefresh()
interrupts any ongoing refresh process and then returns; previously cached data and subsequent calls are unaffected
See Also: refreshContent()

public void clearCache()
interrupts any ongoing refresh process and clears the cache; subsequent calls to any content will force a new refresh
See Also: refreshContent()

public void waitForRefresh()
This method only returns after ensuring that a cached result from a previous refresh is available (either the cached data or the cached IOException).
See Also: refreshContent(), isCached()
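
The refresh rules described for refreshContent() (one background refresh at a time, listeners told about success or failure, calls within a second of the previous refresh ignored) can be sketched as follows. Everything named here is hypothetical: the real URLCache.RefreshListener callback is not shown on this page and may look different, and `FetchTask` merely stands in for the network access.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical callback shape; the real RefreshListener may differ. */
interface RefreshCallback {
    void refreshed(boolean success, IOException cause);
}

/** Hypothetical stand-in for fetching the content. */
interface FetchTask {
    void fetch() throws IOException;
}

class ThrottledRefreshSketch {
    private static final long MIN_GAP_MILLIS = 1000;

    private final List<RefreshCallback> listeners = new CopyOnWriteArrayList<>();
    private final FetchTask task;
    private boolean running;
    private long lastFinished = -1;  // -1: no refresh has finished yet

    ThrottledRefreshSketch(FetchTask task) {
        this.task = task;
    }

    void addListener(RefreshCallback listener) {
        listeners.add(listener);
    }

    /** Starts a background refresh; returns null when the call is ignored or rejected. */
    synchronized Thread refresh() {
        if (running) {
            // A concurrent call is reported to the listeners as a failure.
            notifyListeners(false, new IOException("refresh already in progress"));
            return null;
        }
        long now = System.currentTimeMillis();
        if (lastFinished >= 0 && now - lastFinished < MIN_GAP_MILLIS) {
            return null;  // too soon after the previous refresh: silently ignored
        }
        running = true;
        Thread worker = new Thread(this::runTask);
        worker.start();
        return worker;
    }

    private void runTask() {
        IOException cause = null;
        try {
            task.fetch();
        } catch (IOException e) {
            cause = e;
        }
        synchronized (this) {
            running = false;
            lastFinished = System.currentTimeMillis();
        }
        notifyListeners(cause == null, cause);
    }

    private void notifyListeners(boolean success, IOException cause) {
        for (RefreshCallback listener : listeners) {
            listener.refreshed(success, cause);
        }
    }
}
```

Returning the worker thread is a convenience of this sketch so that a caller can join it; the documented refreshContent() returns void and leaves waiting to refreshAndWait() or waitForRefresh().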
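
public Map<?,?> getHeaderFields()
throws IOException
returns a Map of the cached header fields
Throws: IOException - if the headers could not be cached
See Also: URLConnection.getHeaderFields()

public String getHeaderField(String fieldKey)
throws IOException
retrieves the first field value matching the fieldKey based on case-insensitive key search
Parameters: fieldKey - if null, it returns the HTTP response if available
Throws: IOException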
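
A case-insensitive lookup of this kind can be sketched over a header map shaped like the one returned by `URLConnection.getHeaderFields()`, where each key maps to a list of values and HTTP connections store the status line under the null key. `HeaderLookupSketch` and `firstValue` are hypothetical names; this is not the URLCache implementation.

```java
import java.util.List;
import java.util.Map;

class HeaderLookupSketch {
    /** Returns the first value whose key matches ignoring case, or null if none does. */
    static String firstValue(Map<String, List<String>> headers, String fieldKey) {
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            String key = entry.getKey();
            // The null key (status line) only matches a null fieldKey.
            boolean match = (key == null) ? fieldKey == null : key.equalsIgnoreCase(fieldKey);
            if (match && !entry.getValue().isEmpty()) {
                return entry.getValue().get(0);
            }
        }
        return null;
    }
}
```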
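
public String getContentType()
throws IOException
returns the header value from the cached content
Throws: IOException

public String getContentEncoding()
throws IOException
returns the header value from the cached content
Throws: IOException

public long getLastModified()
throws IOException
returns the header value from the cached content
Throws: IOException

public byte[] getContent()
throws IOException
returns the raw cached content.
WARNING: Altering the returned array means altering the internal cache; the change stays in effect until the next successful refresh.
Throws: IOException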
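
The warning on getContent() follows from handing out the internal array itself instead of a copy. The sketch below contrasts the two options with a hypothetical `ContentHolderSketch` class; callers who need to modify the bytes returned by getContent() can make such a copy themselves.

```java
import java.util.Arrays;

class ContentHolderSketch {
    private final byte[] cache;

    ContentHolderSketch(byte[] data) {
        this.cache = data.clone();
    }

    /** Hands out the internal array itself: cheap, but callers can alter the cache. */
    byte[] shared() {
        return cache;
    }

    /** Hands out a copy: costs one allocation, leaves the cache untouched. */
    byte[] copy() {
        return Arrays.copyOf(cache, cache.length);
    }
}
```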
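
public String getContentAsString()
throws IOException
Throws: IOException

public String getContentAsString(String charsetName)
throws IOException
Throws: IOException

public Reader getReader()
throws IOException
returns a reader from the cached content (suitable for non-binary data).
Throws: IOException

public Reader getReader(String charsetName)
throws IOException
returns a reader from the cached content by using the specified charset for decoding.
Throws: IOException

public InputStream getInputStream()
throws IOException
returns an input stream from the cached content (suitable for binary data).
Throws: IOException

public String getTitle()
throws IOException
returns the HTML title of the cached document
Throws: IOException

public URL[] getLinks()
throws IOException
returns URLs of all links from the cached HTML document.
Throws: IOException

public URL[] getImages()
throws IOException
returns URLs to all unique images embedded in the cached HTML document
Throws: IOException

public HTMLDocument getHTMLDocument()
throws IOException
returns a new HTMLDocument initialized with the cached content of this URLCache object.
Throws: IOException

public String getTagText(HTML.Tag desiredTag, String delimiter)
throws IOException
returns all text from the HTML cache data that is found in the given tag.
Throws: IOException

public String stripText()
throws IOException
calls the other stripText() method with a line break as delimiter
Throws: IOException

public String stripText(String delimiter)
throws IOException
returns a String containing the text of all HTML tag types, separated by the given delimiter
Throws: IOException

public boolean isRefreshing()
returns true only if the cache is currently being refreshed

public long lastRefreshed()
returns the time when the last refresh attempt was performed, whether or not it was successful.
See Also: lastUpdated()

public long lastUpdated()
returns the time when this object was last refreshed successfully; 0 is returned if no refresh has been performed yet
See Also: lastRefreshed()

public long getLastRefreshTime()
returns the time taken by the last successful refresh; -1 is returned if content was never successfully refreshed.
See Also: refreshContent()
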
public boolean verifyContent()
throws IOException
checks whether the cached content equals the current live online content.
It is a bad idea to call this method to find out whether a refresh is needed, as a call to this method is just as expensive as refreshContent() itself, except that the latter returns immediately. If you just want to check whether the timestamp provided by the online content is no later than your last successful refresh, use isUpToDate() instead.
Throws: IOException - if the connection to the live online content failed
See Also: isUpToDate()

public boolean isUpToDate()
throws IOException
checks whether the timestamp provided by the online content is no later than your last successful refresh.
Note that the result may not be accurate if the header timestamp of the online content is incorrect or missing. If you need to verify that the online content is in fact identical to the cached content, use verifyContent() instead.
Throws: IOException
See Also: verifyContent()

public URL getURL()
returns the underlying URL object.
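
The difference in cost between isUpToDate() and verifyContent() comes down to what has to be fetched: a single timestamp versus the complete live content. A minimal sketch of the two comparisons, with hypothetical names and with the network access left out, might look like this:

```java
import java.util.Arrays;

class FreshnessSketch {
    /** Cheap check: compares only the Last-Modified timestamp with the last successful refresh. */
    static boolean isUpToDate(long lastModifiedMillis, long lastUpdatedMillis) {
        return lastModifiedMillis <= lastUpdatedMillis;
    }

    /** Expensive check: needs the complete live content for a byte-wise comparison. */
    static boolean sameContent(byte[] cached, byte[] live) {
        return Arrays.equals(cached, live);
    }
}
```

Note that `URLConnection.getLastModified()` returns 0 when the header is missing, so a timestamp check like this one reports "up to date" in that case, which is one way the inaccuracy mentioned above can arise.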

public void addRefreshListener(URLCache.RefreshListener listener)
See Also: removeRefreshListener(URLCache.RefreshListener), refreshContent()

public void removeRefreshListener(URLCache.RefreshListener listener)
See Also: addRefreshListener(URLCache.RefreshListener)

public boolean containsRefreshListener(URLCache.RefreshListener listener)

public URLCache.RefreshListener[] getRefreshListener()

public boolean isCached()
returns true only if the content has ever been successfully refreshed before

public int bytesReadByCurrentRefresh()
returns the number of bytes read so far by the refresh thread; returns -1 if no refresh is in progress

public int getRealContentLength()
throws IOException
returns the actual length of the already cached data or -1 if the data is too large to fit into memory.
Note that before this method can return, this object will already have attempted to load the entire content into memory. If you want to avoid that and just peek at what the online content reports as its content length, use peekContentLength() instead.
Throws: IOException - if the data cannot be accessed
See Also: peekContentLength()

public int peekContentLength()
throws IOException
returns the content-length header field directly from the online data; the cache is neither affected nor used.
You can use this method if you want to get information about the content length before you attempt to download or cache the content. If you need a content length that is always accurate, use getRealContentLength(), which returns the exact length of the cached content (after the entire content has been loaded, though).
Throws: IOException
See Also: getRealContentLength(), Spider.getContentLength()

public boolean tooLargeForCaching()
a return value of true indicates that even though the content of the URL is accessible, the data is too large to be cached given the current memory.
If this method returns true, you can still obtain an input stream or a reader to the data; these methods will then simply read from the online content. Also, you can still save the data to a file. Methods like getContentAsString(), however, will then throw an IOException stating that there is no memory available.
This method may also return true if your memory is simply exhausted at a particular point in time and the URL content is actually not that large; in that case, you may try refreshing again after freeing up some memory, or you can check the online content length (if provided for the URL in question) with peekContentLength().
See Also: peekContentLength()

public String getFileExtension()
returns the file type denoted by the path of the URL.
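
Peeking at the advertised length without downloading the body, as peekContentLength() is described to do, can be approximated with the JDK alone. The sketch below is an assumption about one possible approach, not the library's code: `PeekLengthSketch` is a hypothetical name, and using an HTTP HEAD request is a choice made here for illustration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

class PeekLengthSketch {
    /** Returns the advertised content length, or -1 if none is provided. */
    static int peekContentLength(URL url) throws IOException {
        URLConnection connection = url.openConnection();
        if (connection instanceof HttpURLConnection) {
            // Ask for the headers only, so no body is transferred.
            ((HttpURLConnection) connection).setRequestMethod("HEAD");
        }
        try {
            return connection.getContentLength();
        } finally {
            if (connection instanceof HttpURLConnection) {
                ((HttpURLConnection) connection).disconnect();
            }
        }
    }
}
```

The value is only as reliable as the header the server sends, which is why the exact length is available only after the content has actually been loaded.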

public void saveContentToFile(File file)
throws IOException
calls saveContentToFile(file, false)
Throws: IOException
See Also: saveContentToFile(File, boolean)

public void saveContentToFile(File file, boolean streamDirectlyFromURL)
throws IOException
If the file might not fit into memory, streamDirectlyFromURL should be true to stream directly from the URL; otherwise this method simply writes the cache to the file.
Throws: IOException

public boolean equals(Object obj)
tests equality on whether the embedded URL is the same file
Overrides: equals in class Object

public int hashCode()
hashes based on the embedded URL
Overrides: hashCode in class Object

public String toString()
Overrides: toString in class Object