spider
Integrates these modules:

- crawlib.pipeline.mongodb.orm
- crawlib.data_class
- crawlib.html_parser
- crawlib.logger
crawlib.spider.execute_one_to_many_job(parent_class=None, get_unfinished_kwargs=None, get_unfinished_limit=None, parser_func=None, parser_func_kwargs=None, build_url_func_kwargs=None, downloader_func=None, downloader_func_kwargs=None, post_process_response_func=None, post_process_response_func_kwargs=None, process_item_func_kwargs=None, logger=None, sleep_time=None)

A standard one-to-many crawling workflow.
Parameters:

- parent_class – the ORM class (see crawlib.pipeline.mongodb.orm) whose unfinished documents drive this job.
- get_unfinished_kwargs – other keyword arguments for the query that selects unfinished parent documents.
- get_unfinished_limit – maximum number of unfinished parent documents to process in one run.
- parser_func – html parser function.
- parser_func_kwargs – other keyword arguments for parser_func.
- build_url_func_kwargs – other keyword arguments for parent_class().build_url(**build_url_func_kwargs).
- downloader_func – a function that takes url as its first argument, makes the http request, and returns the response/html.
- downloader_func_kwargs – other keyword arguments for downloader_func.
- post_process_response_func – a callback function taking the response/html as its first argument. You can put any logic in it. For example, you can make it sleep if you detect that you got banned.
- post_process_response_func_kwargs – other keyword arguments for post_process_response_func.
- process_item_func_kwargs – other keyword arguments for ParseResult().process_item(**process_item_func_kwargs).
- logger – logger used to report crawl progress (see crawlib.logger).
- sleep_time – default 0; seconds to wait before making each request.
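The loop these parameters drive can be sketched independently of crawlib's internals. The following is an illustrative, self-contained sketch, not crawlib's actual implementation: the method name `get_unfinished` and the Fake* demo classes are assumptions chosen to mirror the parameter names above.

```python
import time


def execute_one_to_many_job(
    parent_class=None,
    get_unfinished_kwargs=None,
    get_unfinished_limit=None,
    parser_func=None,
    parser_func_kwargs=None,
    build_url_func_kwargs=None,
    downloader_func=None,
    downloader_func_kwargs=None,
    post_process_response_func=None,
    post_process_response_func_kwargs=None,
    process_item_func_kwargs=None,
    logger=None,
    sleep_time=None,
):
    """Sketch of a one-to-many crawl loop; method names are assumptions."""
    sleep_time = sleep_time or 0
    # 1. select parent documents that have not been crawled yet
    parents = parent_class.get_unfinished(
        limit=get_unfinished_limit, **(get_unfinished_kwargs or {})
    )
    for parent in parents:
        time.sleep(sleep_time)  # throttle before each request
        # 2. build the url for this parent and download it
        url = parent.build_url(**(build_url_func_kwargs or {}))
        html = downloader_func(url, **(downloader_func_kwargs or {}))
        # 3. optional hook, e.g. back off when you detect a ban
        if post_process_response_func is not None:
            post_process_response_func(
                html, **(post_process_response_func_kwargs or {})
            )
        # 4. parse the page and persist the resulting child items
        result = parser_func(html, **(parser_func_kwargs or {}))
        result.process_item(**(process_item_func_kwargs or {}))
        if logger is not None:
            logger.info("crawled %s" % url)


# --- minimal in-memory demo (no real http, no mongodb) ---
class FakeParent:
    store = []  # collected child items

    def __init__(self, pid):
        self.pid = pid

    @classmethod
    def get_unfinished(cls, limit=None, **kwargs):
        parents = [cls(1), cls(2), cls(3)]
        return parents[:limit] if limit else parents

    def build_url(self):
        return "https://example.com/parent/%d" % self.pid


class FakeParseResult:
    def __init__(self, items):
        self.items = items

    def process_item(self):
        FakeParent.store.extend(self.items)


execute_one_to_many_job(
    parent_class=FakeParent,
    get_unfinished_limit=2,
    parser_func=lambda html: FakeParseResult([html.upper()]),
    downloader_func=lambda url: "page of %s" % url,
)
print(FakeParent.store)
```

The `*_kwargs` parameters simply forward extra keyword arguments to each pluggable step, which is why every stage (query, url building, download, post-processing, parsing, item processing) can be customized without subclassing.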