cache
A disk cache layer that stores each URL together with its HTML.
class crawlib.cache.CacheBackedDownloader(cache_dir=None, read_cache_first=False, alert_when_cache_missing=False, always_update_cache=False, cache_expire=None, cache_value_type_is_binary=None, cache_compress_level=6, **kwargs)
A URL content downloader backed by a disk cache.
Parameters:
- cache_dir – str, directory of the disk cache.
- read_cache_first – bool. If true, the downloader tries to read the binary content from the cache first.
- alert_when_cache_missing – bool. If true, a log message is displayed when the URL has not been seen in the cache.
- always_update_cache – bool. If true, the response content is always saved to the cache.
- cache_expire – int, number of seconds before a cache entry expires.
- cache_value_type_is_binary – bool
- cache_compress_level – int, compression level, 1-9; 9 is the highest.
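To illustrate how the flags above could interact, here is a minimal sketch that mimics them with an in-memory dict instead of a disk cache. It is not crawlib's implementation: the class name DictCachedDownloader, the fetch_func argument, the alerts list, and the example URL are all hypothetical, and the exact order of the checks is an assumption.

```python
import time


class DictCachedDownloader:
    """Toy stand-in that mimics the documented flags with an in-memory dict.

    Everything here (class name, fetch_func, the dict store) is hypothetical.
    """

    def __init__(self, fetch_func, read_cache_first=False,
                 alert_when_cache_missing=False, always_update_cache=False,
                 cache_expire=None):
        self.fetch_func = fetch_func
        self.read_cache_first = read_cache_first
        self.alert_when_cache_missing = alert_when_cache_missing
        self.always_update_cache = always_update_cache
        self.cache_expire = cache_expire
        self.store = {}   # url -> (content, expiry timestamp or None)
        self.alerts = []  # collected instead of logged, to keep the sketch simple

    def _read(self, url):
        entry = self.store.get(url)
        if entry is None:
            return None
        content, expires_at = entry
        if expires_at is not None and time.time() >= expires_at:
            del self.store[url]  # expired entries count as missing
            return None
        return content

    def get(self, url):
        if self.read_cache_first:
            cached = self._read(url)
            if cached is not None:
                return cached
            if self.alert_when_cache_missing:
                self.alerts.append("cache miss: " + url)
        content = self.fetch_func(url)
        if self.always_update_cache:
            expires_at = None
            if self.cache_expire is not None:
                expires_at = time.time() + self.cache_expire
            self.store[url] = (content, expires_at)
        return content


calls = []


def fake_fetch(url):
    """Pretend download that records each call."""
    calls.append(url)
    return "<html>" + url + "</html>"


dl = DictCachedDownloader(fake_fetch, read_cache_first=True,
                          alert_when_cache_missing=True,
                          always_update_cache=True, cache_expire=3600)
first = dl.get("https://example.com")   # miss: fetched, then cached
second = dl.get("https://example.com")  # hit: served from the dict
```

In this sketch the second call never reaches fake_fetch, and the single cache miss is recorded once in dl.alerts.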
should_we_update_cache(any_type_response, cache_cb, cache_consumed_flag)
Parameters:
- any_type_response – any response object.
- cache_cb – a callback function that takes any_type_response as input and returns a boolean indicating whether the cache should be updated.
- cache_consumed_flag – bool, True if the data has already been read from the cache.
Returns: bool.
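Notes:
- If cache_consumed_flag is True, the data has already been read from the cache, so storing it again is pointless.
- If self.always_update_cache is True, the cache update is forced. There is no need to worry about forcing an update after the cache has already been read, because cache_consumed_flag has been checked beforehand.
- If no cache_cb function is specified, the cache is not updated by default.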
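The rules above can be sketched as a standalone function. This is a minimal illustration under stated assumptions, not the library's actual code: always_update_cache is passed as an explicit argument instead of being read from self.always_update_cache, the checks are assumed to run in the order the rules are listed, and the response dict and is_ok callback are made up for the example.

```python
def should_we_update_cache(any_type_response, cache_cb, cache_consumed_flag,
                           always_update_cache=False):
    """Sketch of the decision rules; always_update_cache stands in for
    self.always_update_cache (an assumption made to keep this standalone)."""
    # Rule 1: the data came from the cache, so writing it back is pointless.
    if cache_consumed_flag:
        return False
    # Rule 2: forced update. Safe because rule 1 already ruled out a cache hit.
    if always_update_cache:
        return True
    # Rule 3: without a callback, default to not updating.
    if cache_cb is None:
        return False
    # Otherwise let the callback decide.
    return bool(cache_cb(any_type_response))


# Hypothetical response object and callback, for illustration only.
response = {"status": 200, "text": "<html></html>"}


def is_ok(resp):
    return resp["status"] == 200


decisions = [
    should_we_update_cache(response, is_ok, cache_consumed_flag=True),
    should_we_update_cache(response, None, False, always_update_cache=True),
    should_we_update_cache(response, None, False),
    should_we_update_cache(response, is_ok, False),
]
```

The four calls exercise, in order: a consumed cache, a forced update, a missing callback, and a callback that approves the update.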
class crawlib.cache.CompressedDisk(directory, compress_level=6, value_type_is_binary=False, **kwargs)
Serialization layer. The value must be of bytes or str type, and is compressed with zlib before being stored on disk.
- Key: str, the URL.
- Value: str or bytes, the HTML or binary content.
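The compress-before-store idea can be sketched with the standard zlib module alone. The helper names encode_value and decode_value are hypothetical, and the use of UTF-8 for str values is an assumption; the real CompressedDisk class may handle encoding differently.

```python
import zlib


def encode_value(value, compress_level=6):
    """Compress a str or bytes value; str is encoded as UTF-8 first (assumed)."""
    if isinstance(value, str):
        value = value.encode("utf-8")
    elif not isinstance(value, bytes):
        raise TypeError("value must be str or bytes")
    return zlib.compress(value, compress_level)


def decode_value(blob, value_type_is_binary=False):
    """Decompress a stored blob; return bytes if binary, else a UTF-8 str."""
    raw = zlib.decompress(blob)
    return raw if value_type_is_binary else raw.decode("utf-8")


# Made-up sample data: a repetitive HTML string and a short binary payload.
html = "<html><body>" + "hello " * 100 + "</body></html>"
blob = encode_value(html)
restored_text = decode_value(blob)
restored_bytes = decode_value(encode_value(b"\x00\x01\x02"),
                              value_type_is_binary=True)
```

Here value_type_is_binary only controls what the reader gets back: bytes when true, a decoded str otherwise.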
fetch(mode, filename, value, read)
Convert the fields mode, filename, and value from the Cache table to a Python value.
Returns: the corresponding Python value.
crawlib.cache.create_cache(directory, compress_level=6, value_type_is_binary=False, **kwargs)
Create an HTML cache. HTML strings are compressed automatically.
Parameters:
- directory – path of the cache directory.
- compress_level – 0 ~ 9; 9 is the slowest and produces the smallest output.
- kwargs – other arguments.
Returns: a diskcache.Cache instance.
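To give a feel for what compress_level trades off, the sketch below compresses the same made-up markup at several zlib levels and records the resulting sizes. It uses only the standard zlib module and does not call create_cache; the sample markup is invented, and actual ratios depend on the content being cached.

```python
import zlib

# Repetitive sample markup standing in for a downloaded page (made-up data).
html = ("<div class='item'><a href='/page'>link</a></div>\n" * 500).encode("utf-8")

# Compressed size in bytes for a few levels; level 0 stores without compressing.
sizes = {level: len(zlib.compress(html, level)) for level in (0, 1, 6, 9)}

# Every level must still round-trip to the original bytes.
round_trips = all(
    zlib.decompress(zlib.compress(html, level)) == html for level in sizes
)
```

Level 0 yields output slightly larger than the input because of framing overhead, while higher levels typically shrink repetitive HTML considerably at the cost of more CPU time.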