requests_downloader

URL content downloader middleware built on requests.

class crawlib.downloader.requests_downloader.RequestsDownloader(use_session=False, use_tor=False, tor_port=9050, cache_dir=None, read_cache_first=False, alert_when_cache_missing=False, always_update_cache=False, cache_expire=None, use_random_user_agent=True, **kwargs)

Feature-rich downloader for making HTTP requests.

Parameters:
  • use_session – bool, whether to use a session for requests.
  • use_tor – bool, whether to route requests through the Tor network. For Tor installation instructions, see https://www.torproject.org/docs/tor-doc-osx.html.en
  • tor_port – int, Tor port. Defaults to 9050.
  • cache_dir – str, disk cache directory.
  • read_cache_first – bool, if true, the downloader first tries to read the binary content from the cache.
  • alert_when_cache_missing – bool, if true, a log message is emitted when the URL is not found in the cache.
  • always_update_cache – bool, if true, the response content is always saved to the cache.
  • cache_expire – int, number of seconds before a cache entry expires.
  • use_random_user_agent – bool, if true, a random user agent is used for the HTTP request.
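
As a rough illustration of how these parameters combine, the sketch below configures a downloader that uses a session and a disk cache expiring after one day. The import path and keyword names come from the signature above; the cache path and the helper name make_downloader are hypothetical, and the import is deferred so the snippet loads even where crawlib is not installed.

```python
# Keyword arguments taken from the constructor signature above.
# The cache path is a hypothetical example value.
DOWNLOADER_OPTIONS = {
    "use_session": True,
    "use_tor": False,
    "cache_dir": "/tmp/crawlib_cache",
    "read_cache_first": True,
    "alert_when_cache_missing": True,
    "always_update_cache": False,
    "cache_expire": 24 * 3600,  # seconds
    "use_random_user_agent": True,
}


def make_downloader(options=None):
    """Build a RequestsDownloader; hypothetical helper, not part of crawlib."""
    # Imported lazily so this module can be loaded without crawlib installed.
    from crawlib.downloader.requests_downloader import RequestsDownloader

    return RequestsDownloader(**(options or DOWNLOADER_OPTIONS))
```

Calling make_downloader() requires crawlib to be installed.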

download(url, dst, params=None, cache_cb=None, overwrite=False, stream=False, minimal_size=-1, maximum_size=1152921504606846976, **kwargs)

Download binary content to destination.

Parameters:
  • url – URL of the binary content.
  • dst – path of the file to save to.
  • cache_cb – (optional) a function that takes a requests.Response as input and returns a bool indicating whether the cache should be updated.
  • overwrite – bool.
  • stream – bool, whether to read the data chunk by chunk rather than load everything into memory at once.
  • minimal_size – default -1; if the response content is smaller than minimal_size, the downloaded file is deleted.
  • maximum_size – default 1152921504606846976 (2**60 bytes, i.e. 1 EiB); if the response content is larger than maximum_size, the downloaded file is deleted.
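
The size rule described by minimal_size and maximum_size can be sketched as a standalone check. The helper below is a hypothetical illustration, not crawlib's implementation, and it assumes that sizes are measured in bytes on the saved file.

```python
import os
import tempfile


def enforce_size_bounds(path, minimal_size=-1, maximum_size=2 ** 60):
    """Delete `path` if its size in bytes falls outside the bounds.

    Returns True if the file was kept, False if it was deleted.
    Hypothetical helper; measuring the size in bytes is an assumption.
    """
    size = os.path.getsize(path)
    if size < minimal_size or size > maximum_size:
        os.remove(path)
        return False
    return True


# Demonstration with a 10-byte temporary file.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"0123456789")

# With the default bounds the file is kept.
kept_with_defaults = enforce_size_bounds(demo_path)
# With minimal_size=100 the 10-byte file is too small and is deleted.
kept_with_minimum = enforce_size_bounds(demo_path, minimal_size=100)
```

With the default minimal_size of -1 no file is ever rejected as too small, since a file size cannot be negative.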

get(url, params=None, cache_cb=None, **kwargs)

Make an HTTP GET request.

Parameters:
  • url
  • params
  • cache_cb – (optional) a function that takes a requests.Response as input and returns a bool indicating whether the cache should be updated.
  • cache_expire – (optional).
  • kwargs – optional arguments.
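
A cache_cb callback could look like the following sketch. The policy (cache only 200 responses with a non-empty body) and the function name are hypothetical; the stand-in objects only mimic the status_code and content attributes of requests.Response so the example runs without network access.

```python
from types import SimpleNamespace


def cache_only_successful(response):
    """Return True when the response is worth caching.

    Hypothetical policy: cache only 200 responses with a non-empty body.
    """
    return response.status_code == 200 and len(response.content) > 0


# Stand-ins exposing the two requests.Response attributes used above.
ok_response = SimpleNamespace(status_code=200, content=b"<html></html>")
error_response = SimpleNamespace(status_code=503, content=b"Service Unavailable")

should_cache_ok = cache_only_successful(ok_response)
should_cache_error = cache_only_successful(error_response)
```

Such a function would be passed as cache_cb=cache_only_successful.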

get_html(url, params=None, cache_cb=None, decoder_encoding=None, decoder_errors='strict', **kwargs)

Get the HTML of a URL.
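
The decoder_encoding and decoder_errors arguments suggest that the response bytes are decoded in the manner of Python's bytes.decode. The sketch below illustrates that step only; decode_html is a hypothetical helper, and the fallback to UTF-8 when decoder_encoding is None is an assumption of this sketch rather than documented crawlib behavior.

```python
def decode_html(raw, decoder_encoding=None, decoder_errors="strict"):
    """Decode raw response bytes into text.

    Hypothetical helper; the UTF-8 fallback is an assumption of this sketch.
    """
    encoding = decoder_encoding or "utf-8"
    return raw.decode(encoding, errors=decoder_errors)


# Valid UTF-8 decodes cleanly with the defaults.
page = decode_html("café".encode("utf-8"))
# The same word in Latin-1 needs an explicit encoding.
latin = decode_html(b"caf\xe9", decoder_encoding="latin-1")
# With errors="replace", undecodable bytes become U+FFFD instead of raising.
lenient = decode_html(b"caf\xe9", decoder_errors="replace")
```

With decoder_errors='strict', the Latin-1 bytes above would raise UnicodeDecodeError when decoded as UTF-8.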