spider

Integrates the following modules:

  • crawlib.pipeline.mongodb.orm
  • crawlib.data_class
  • crawlib.html_parser
  • crawlib.logger

crawlib.spider.execute_one_to_many_job(parent_class=None, get_unfinished_kwargs=None, get_unfinished_limit=None, parser_func=None, parser_func_kwargs=None, build_url_func_kwargs=None, downloader_func=None, downloader_func_kwargs=None, post_process_response_func=None, post_process_response_func_kwargs=None, process_item_func_kwargs=None, logger=None, sleep_time=None)

A standard one-to-many crawling workflow: each unfinished parent document is turned into a URL, downloaded, and parsed into many child items, which are then processed back through the pipeline (for example, one state page yielding many city pages). A usage sketch follows the parameter list.

Parameters:
  • parent_class – ORM entity class of the parent documents (the "one" side of the relationship).
  • get_unfinished_kwargs – other keyword arguments for the query that selects unfinished parent documents.
  • get_unfinished_limit – maximum number of unfinished parent documents to crawl in this run.
  • parser_func – HTML parser function.
  • parser_func_kwargs – other keyword arguments for parser_func
  • build_url_func_kwargs – other keyword arguments for parent_class().build_url(**build_url_func_kwargs)
  • downloader_func – a function that takes the URL as its first argument, makes the HTTP request, and returns the response/html.
  • downloader_func_kwargs – other keyword arguments for downloader_func
  • post_process_response_func – a callback function that takes the response/html as its first argument. You can put any logic in it; for example, sleep when you detect that you have been banned.
  • post_process_response_func_kwargs – other keyword arguments for post_process_response_func
  • process_item_func_kwargs – other keyword arguments for ParseResult().process_item(**process_item_func_kwargs)
  • logger – logger used to report crawl progress.
  • sleep_time – wait time in seconds before each request; defaults to 0.
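
The sketch below shows how the pieces fit together. It is only illustrative: StatePage, download, back_off, parse_state_page, and the example URL are hypothetical stand-ins (a real job would use an entity class built on crawlib.pipeline.mongodb.orm and a parser returning a crawlib.html_parser ParseResult); only execute_one_to_many_job and its keyword argument names come from the signature above.

    import time

    import requests

    from crawlib.spider import execute_one_to_many_job


    class StatePage:
        # Hypothetical parent entity; a real job would use an entity
        # class built on crawlib.pipeline.mongodb.orm.
        def build_url(self):
            return "https://example.com/state/ca"


    def download(url, timeout=10):
        # downloader_func: takes the URL as its first argument and
        # returns the response.
        return requests.get(url, timeout=timeout)


    def back_off(response, seconds=60):
        # post_process_response_func: sleep when the site starts
        # blocking us.
        if response.status_code == 429:
            time.sleep(seconds)


    def parse_state_page(response):
        # parser_func: extract the child (city) items from one state
        # page and return them as a ParseResult; body omitted here.
        ...


    execute_one_to_many_job(
        parent_class=StatePage,
        get_unfinished_limit=100,               # crawl at most 100 unfinished parents
        parser_func=parse_state_page,
        downloader_func=download,
        downloader_func_kwargs={"timeout": 10},
        post_process_response_func=back_off,
        post_process_response_func_kwargs={"seconds": 60},
        sleep_time=1,                           # pause 1 second before each request
    )

Note how downloader_func_kwargs and post_process_response_func_kwargs carry the extra keyword arguments (timeout, seconds) into the corresponding callbacks, while the first positional argument of each callback is supplied by the job itself.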