spider

Integrates the following modules:

  • crawlib.pipeline.mongodb.orm
  • crawlib.data_class
  • crawlib.html_parser
  • crawlib.logger

crawlib.spider.execute_one_to_many_job(parent_class=None, get_unfinished_kwargs=None, get_unfinished_limit=None, parser_func=None, parser_func_kwargs=None, build_url_func_kwargs=None, downloader_func=None, downloader_func_kwargs=None, post_process_response_func=None, post_process_response_func_kwargs=None, process_item_func_kwargs=None, logger=None, sleep_time=None)

A standard one-to-many crawling workflow: each unfinished parent document is turned into a URL, downloaded, and parsed into many child items, which are then processed back through the pipeline (for example, one state page yielding many city pages). A usage sketch follows the parameter list.

Parameters:
  • parent_class – ORM entity class of the parent documents (the "one" side of the relationship).
  • get_unfinished_kwargs – other keyword arguments for the query that selects unfinished parent documents.
  • get_unfinished_limit – maximum number of unfinished parent documents to crawl in this run.
  • parser_func – HTML parser function.
  • parser_func_kwargs – other keyword arguments for parser_func
  • build_url_func_kwargs – other keyword arguments for parent_class().build_url(**build_url_func_kwargs)
  • downloader_func – a function that takes the URL as its first argument, makes the HTTP request, and returns the response/html.
  • downloader_func_kwargs – other keyword arguments for downloader_func
  • post_process_response_func – a callback function that takes the response/html as its first argument. You can put any logic in it; for example, sleep when you detect that you have been banned.
  • post_process_response_func_kwargs – other keyword arguments for post_process_response_func
  • process_item_func_kwargs – other keyword arguments for ParseResult().process_item(**process_item_func_kwargs)
  • logger – logger used to report crawl progress.
  • sleep_time – wait time in seconds before each request; defaults to 0.
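
The sketch below shows how the pieces fit together. It is only illustrative: StatePage, download, back_off, parse_state_page, and the example URL are hypothetical stand-ins (a real job would use an entity class built on crawlib.pipeline.mongodb.orm and a parser returning a crawlib.html_parser ParseResult); only execute_one_to_many_job and its keyword argument names come from the signature above.

    import time

    import requests

    from crawlib.spider import execute_one_to_many_job


    class StatePage:
        # Hypothetical parent entity; a real job would use an entity
        # class built on crawlib.pipeline.mongodb.orm.
        def build_url(self):
            return "https://example.com/state/ca"


    def download(url, timeout=10):
        # downloader_func: takes the URL as its first argument and
        # returns the response.
        return requests.get(url, timeout=timeout)


    def back_off(response, seconds=60):
        # post_process_response_func: sleep when the site starts
        # blocking us.
        if response.status_code == 429:
            time.sleep(seconds)


    def parse_state_page(response):
        # parser_func: extract the child (city) items from one state
        # page and return them as a ParseResult; body omitted here.
        ...


    execute_one_to_many_job(
        parent_class=StatePage,
        get_unfinished_limit=100,               # crawl at most 100 unfinished parents
        parser_func=parse_state_page,
        downloader_func=download,
        downloader_func_kwargs={"timeout": 10},
        post_process_response_func=back_off,
        post_process_response_func_kwargs={"seconds": 60},
        sleep_time=1,                           # pause 1 second before each request
    )

Note how downloader_func_kwargs and post_process_response_func_kwargs carry the extra keyword arguments (timeout, seconds) into the corresponding callbacks, while the first positional argument of each callback is supplied by the job itself.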