requests_downloader
URL content downloader middleware using requests.
class crawlib.downloader.requests_downloader.RequestsDownloader(use_session=False, use_tor=False, tor_port=9050, cache_dir=None, read_cache_first=False, alert_when_cache_missing=False, always_update_cache=False, cache_expire=None, use_random_user_agent=True, **kwargs)
Feature-rich downloader for making HTTP requests.
Parameters: - use_session – bool, whether to use a session to communicate.
- use_tor – bool, whether to route requests through the Tor network. For Tor installation instructions, see https://www.torproject.org/docs/tor-doc-osx.html.en
- tor_port – int, the Tor port. Defaults to 9050.
- cache_dir – str, disk cache directory.
- read_cache_first – bool, if true, the downloader first tries to read the binary content from the cache.
- alert_when_cache_missing – bool, if true, a log message is emitted when the url is not found in the cache.
- always_update_cache – bool, if true, the response content is always saved to the cache.
- cache_expire – int, number of seconds before a cache entry expires.
- use_random_user_agent – bool, if true, a random user agent is used for the HTTP request.
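The way the cache flags above interact can be pictured with a toy in-memory model. This is only a sketch under stated assumptions, not crawlib's implementation: `fetch_with_cache`, the `fetch` callable, the dict-based `cache` and the `now` argument are all hypothetical names introduced for illustration, and the sketch does not model `cache_cb` or `alert_when_cache_missing`.

```python
import time


def fetch_with_cache(url, fetch, cache, read_cache_first=False,
                     always_update_cache=False, cache_expire=None, now=None):
    """Toy model of the cache flags (hypothetical, not crawlib code).

    `fetch` is any callable mapping a url to bytes; `cache` is a plain dict
    mapping url -> (content, expires_at).
    """
    now = time.time() if now is None else now

    # read_cache_first: serve from the cache if a fresh entry exists.
    if read_cache_first and url in cache:
        content, expires_at = cache[url]
        if expires_at is None or now < expires_at:
            return content, "cache"

    # Otherwise go to the network.
    content = fetch(url)

    # always_update_cache: store the content regardless of anything else.
    if always_update_cache:
        expires_at = None if cache_expire is None else now + cache_expire
        cache[url] = (content, expires_at)
    return content, "network"
```

In this model an entry stored with `cache_expire=10` at time 0 is served from the cache at time 5 and fetched again at time 10.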
download(url, dst, params=None, cache_cb=None, overwrite=False, stream=False, minimal_size=-1, maximum_size=1152921504606846976, **kwargs)
Download binary content to a destination file.
Parameters: - url – url of the binary content.
- dst – path of the ‘save_as’ file.
- cache_cb – (optional) a function that takes a requests.Response as input and returns a bool flag indicating whether the cache should be updated.
- overwrite – bool, whether to overwrite dst if it already exists.
- stream – bool, if true, read the data chunk by chunk instead of loading everything into memory at once.
- minimal_size – default -1; if the response content is smaller than minimal_size, the downloaded file is deleted.
- maximum_size – default 2**60 bytes (1 EiB); if the response content is larger than maximum_size, the downloaded file is deleted.
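The size bounds described above can be sketched as a check run after the file has been written. This is a minimal illustration, not crawlib's code: `keep_if_size_ok` is a hypothetical helper, and measuring the size with `os.path.getsize` is an assumption about how the check could be done.

```python
import os
import tempfile


def keep_if_size_ok(path, minimal_size=-1, maximum_size=2 ** 60):
    """Delete `path` and return False if its size is outside the bounds.

    Hypothetical helper; the defaults mirror the documented signature,
    where 2 ** 60 == 1152921504606846976.
    """
    size = os.path.getsize(path)
    if size < minimal_size or size > maximum_size:
        os.remove(path)
        return False
    return True


# Demo on a throwaway 10-byte file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 10)

# The file is smaller than minimal_size, so it is removed.
print(keep_if_size_ok(f.name, minimal_size=100))
```

With the defaults, no file is ever too small and only a file above 1 EiB would be rejected, so both checks are effectively off until set explicitly.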
get(url, params=None, cache_cb=None, **kwargs)
Make an HTTP GET request.
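Parameters: - url – the url to request.
- params – (optional) query parameters for the request.
- cache_cb – (optional) a function that takes a requests.Response as input and returns a bool flag indicating whether the cache should be updated.
- cache_expire – (optional) number of seconds before the cached response expires.
- kwargs – additional optional arguments.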
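A `cache_cb` callback, as described above, receives the response and returns a bool. The sketch below shows one possible policy. `cache_only_successful` is a hypothetical name, and `SimpleNamespace` objects stand in for `requests.Response` so the example runs without network access; only the `status_code` and `content` attributes of the real class are relied on.

```python
from types import SimpleNamespace


def cache_only_successful(response):
    """Hypothetical cache_cb: cache only non-empty 200 responses."""
    return response.status_code == 200 and len(response.content) > 0


# Stand-ins for requests.Response objects (assumption for the demo).
ok = SimpleNamespace(status_code=200, content=b"<html></html>")
missing = SimpleNamespace(status_code=404, content=b"not found")

print(cache_only_successful(ok))       # True
print(cache_only_successful(missing))  # False
```

Such a function would be passed as `cache_cb=cache_only_successful`, so that error pages do not end up in the cache.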
get_html(url, params=None, cache_cb=None, decoder_encoding=None, decoder_errors='strict', **kwargs)
Get the HTML of a url.
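The `decoder_encoding` and `decoder_errors` arguments are not described further here. Assuming they play the same role as the `encoding` and `errors` arguments of Python's `bytes.decode`, their effect can be sketched as follows. `decode_html` is a hypothetical helper, and the UTF-8 fallback for `decoder_encoding=None` is an assumption of this sketch, not documented behavior.

```python
def decode_html(raw, decoder_encoding=None, decoder_errors="strict"):
    """Hypothetical helper: decode raw bytes into text."""
    # Assumption: fall back to UTF-8 when no encoding is given.
    encoding = decoder_encoding or "utf-8"
    return raw.decode(encoding, errors=decoder_errors)


raw = "café".encode("latin-1")  # b'caf\xe9', which is not valid UTF-8

print(decode_html(raw, decoder_encoding="latin-1"))  # café
print(decode_html(raw, decoder_errors="ignore"))     # caf

try:
    decode_html(raw)  # 'strict' raises on the invalid byte
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)
```

Under this reading, 'strict' surfaces encoding problems immediately, while 'ignore' or 'replace' trade correctness for robustness on messy pages.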