cache

A disk cache layer that stores a URL together with its HTML content.

class crawlib.cache.CacheBackedDownloader(cache_dir=None, read_cache_first=False, alert_when_cache_missing=False, always_update_cache=False, cache_expire=None, cache_value_type_is_binary=None, cache_compress_level=6, **kwargs)

Implements a disk-cache-backed downloader for URL content.

Parameters:
  • cache_dir – str, the diskcache directory.
  • read_cache_first – bool. If true, the downloader tries to read the binary content from the cache first.
  • alert_when_cache_missing – bool. If true, a log message is emitted when the URL is not found in the cache.
  • always_update_cache – bool. If true, the response content is always saved to the cache.
  • cache_expire – int, number of seconds before a cache entry expires.
  • cache_value_type_is_binary – bool
  • cache_compress_level – int, zlib compression level, 1-9; 9 gives the highest compression.
should_we_update_cache(any_type_response, cache_cb, cache_consumed_flag)
Parameters:
  • any_type_response – any response object.
  • cache_cb – a callback function that takes any_type_response as input and returns a boolean indicating whether the cache should be updated.
  • cache_consumed_flag – bool, True if the data has already been read from the cache.
Returns:

bool.

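Details

  1. If cache_consumed_flag is True, the data has already been read from the cache,
    so storing it again is pointless.
  2. If self.always_update_cache is True, the cache update is forced. There is no need to worry
    about forcing an update after the cache has already been read, because cache_consumed_flag was checked first.
  3. If no cache_cb function is specified, the cache is not updated by default.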
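
The three rules listed above can be sketched as a standalone function. This is only an illustration of the documented decision order, not the library's actual implementation: the name should_we_update_cache_sketch is hypothetical, always_update_cache is passed as an argument instead of being read from self, and falling through to cache_cb when one is supplied is an assumption based on the parameter description.

```python
from typing import Any, Callable, Optional


def should_we_update_cache_sketch(
    any_type_response: Any,
    cache_cb: Optional[Callable[[Any], bool]],
    cache_consumed_flag: bool,
    always_update_cache: bool = False,  # stand-in for self.always_update_cache
) -> bool:
    # Rule 1: the content came from the cache, so writing it back is pointless.
    if cache_consumed_flag:
        return False
    # Rule 2: a forced update; safe because rule 1 was checked first.
    if always_update_cache:
        return True
    # Rule 3: without a callback, the default is not to update.
    if cache_cb is None:
        return False
    # Assumption: otherwise the callback makes the decision.
    return bool(cache_cb(any_type_response))


# Hypothetical callback: only cache responses that look like HTML.
looks_like_html = lambda response: response.lstrip().startswith("<")
decision = should_we_update_cache_sketch("<html></html>", looks_like_html, False)
```

Checking cache_consumed_flag before always_update_cache is what keeps a forced update from rewriting data that was just read from the cache.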
class crawlib.cache.CompressedDisk(directory, compress_level=6, value_type_is_binary=False, **kwargs)

Serialization layer. The value must be of type bytes or str, and is compressed with zlib before being stored on disk.

  • Key: str, url.
  • Value: str or bytes, html or binary content.
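
A minimal sketch of the compression round trip described above, using only the standard zlib module. The helper names compress_value and decompress_value are hypothetical, and treating value_type_is_binary as "return bytes instead of decoding to str" (with UTF-8 as the text encoding) is an assumption, not confirmed behavior of CompressedDisk.

```python
import zlib
from typing import Union


def compress_value(value: Union[str, bytes], compress_level: int = 6) -> bytes:
    # Text is encoded first, because zlib only operates on bytes.
    if isinstance(value, str):
        value = value.encode("utf-8")
    return zlib.compress(value, compress_level)


def decompress_value(data: bytes, value_type_is_binary: bool = False) -> Union[str, bytes]:
    raw = zlib.decompress(data)
    # Assumption: binary values are returned as-is, text values are decoded.
    return raw if value_type_is_binary else raw.decode("utf-8")


html = "<html><body>" + "<p>hello</p>" * 200 + "</body></html>"
blob = compress_value(html)
restored = decompress_value(blob)
```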
fetch(mode, filename, value, read)

Convert fields mode, filename, and value from Cache table to value.

Parameters:
  • mode (int) – value mode raw, binary, text, or pickle
  • filename (str) – filename of corresponding value
  • value – database value
  • read (bool) – when True, return an open file handle
Returns:

corresponding Python value

get(key, raw)

Convert fields key and raw from Cache table to key.

Parameters:
  • key – database key to convert
  • raw (bool) – flag indicating raw database storage
Returns:

corresponding Python key

store(value, read, **kwargs)

Convert value to fields size, mode, filename, and value for Cache table.

Parameters:
  • value – value to convert
  • read (bool) – True when value is file-like object
  • key – key for item (default UNKNOWN)
Returns:

(size, mode, filename, value) tuple for Cache table

crawlib.cache.create_cache(directory, compress_level=6, value_type_is_binary=False, **kwargs)

Create an HTML cache. HTML strings are compressed automatically.

Parameters:
  • directory – path for the cache directory.
  • compress_level – int, 0 to 9; 9 is the slowest and produces the smallest output.
  • kwargs – other arguments.
Returns:

a diskcache.Cache()
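
To get a feel for the compress_level trade-off, the following sketch compares output sizes at a few zlib levels with the standard library only. The sample HTML is made up for illustration, and actual ratios depend on the content being cached.

```python
import zlib

# Made-up, highly repetitive sample page.
html = ("<div class='item'>hello world</div>\n" * 1000).encode("utf-8")

# Level 0 stores the data without compression; higher levels trade speed for size.
sizes = {level: len(zlib.compress(html, level)) for level in (0, 1, 6, 9)}

for level, size in sizes.items():
    print(f"level={level}: {size} bytes (input: {len(html)} bytes)")
```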