decorator
Three popular libraries are widely used for making HTTP requests:
- requests: http://docs.python-requests.org/en/master/
- scrapy: https://doc.scrapy.org/en/latest/topics/request-response.html
- selenium: http://selenium-python.readthedocs.io/index.html
Two popular libraries are widely used for extracting data from HTML:
- beautifulsoup4: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- scrapy.selector: https://doc.scrapy.org/en/latest/topics/selectors.html
This module bridges the gap between the two groups.
crawlib.html_parser.decorator.auto_decode_and_soupify(encoding=None, errors='strict')
This decorator assumes that the decorated function takes three arguments, all passed by keyword:
- response: requests.Response or scrapy.http.Response
- html: HTML string
- soup: bs4.BeautifulSoup
- If soup is not available, it is automatically generated from html.
- If html is not available, it is automatically generated from response.
Usage:
    @auto_decode_and_soupify()
    def parse(response, html, soup):
        ...
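This decorator automatically detects the parameters named response, html, and soup in the decorated function and, when html or soup is not given, generates the expected value. The decorated function must declare all three parameters, and they must be passed as keyword arguments when the function is called.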
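The fallback chain described above (response to html, then html to soup) can be illustrated with a small stand-in decorator. This is only a sketch under stated assumptions, not crawlib's implementation: the names `fill_html_and_soup`, `make_soup`, `FakeResponse`, and `fake_soupify` are hypothetical, the response is assumed to expose raw bytes as `.content` (as `requests.Response` does), and the default encoding of `"utf-8"` is an assumption, since the behaviour of `encoding=None` in the real decorator is not documented here.

```python
import functools


def fill_html_and_soup(make_soup, encoding="utf-8", errors="strict"):
    """Hypothetical stand-in for auto_decode_and_soupify (not crawlib's code).

    make_soup is any callable that turns an HTML string into a parsed
    object, for example a wrapper around bs4.BeautifulSoup.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*, response=None, html=None, soup=None, **kwargs):
            # Keyword-only parameters enforce the "keyword syntax" rule.
            if html is None and response is not None:
                # Assumption: the response carries raw bytes in .content.
                html = response.content.decode(encoding, errors=errors)
            if soup is None and html is not None:
                soup = make_soup(html)
            return func(response=response, html=html, soup=soup, **kwargs)
        return wrapper
    return decorator


class FakeResponse:
    """Minimal stand-in for a response object, for illustration only."""
    content = b"<p>hello</p>"


def fake_soupify(html):
    """Stand-in parser so the sketch runs without third-party packages."""
    return {"parsed_from": html}


@fill_html_and_soup(make_soup=fake_soupify)
def parse(response, html, soup):
    return html, soup


# Only response is given: html and soup are both derived from it.
html, soup = parse(response=FakeResponse())

# Only html is given: soup is derived, response stays None.
html_only, soup_only = parse(html="<b>x</b>")
```

In real use, `make_soup` would presumably wrap `bs4.BeautifulSoup`; the stand-in parser is used here only to keep the example free of dependencies.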
crawlib.html_parser.decorator.soupify(html)
Convert an HTML string to a BeautifulSoup object. It works around an API change in bs4.3.
crawlib.html_parser.decorator.validate_implementation_for_auto_decode_and_soupify(func)
Validate that auto_decode_and_soupify() is applicable to this function. If it is not applicable, a NotImplementedError is raised.
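A check of this kind could be written with the standard library's `inspect.signature`. The sketch below is an assumption about how such validation might work, not crawlib's actual code; `validate_parse_signature`, `REQUIRED_PARAMS`, `parse_ok`, and `parse_missing_soup` are hypothetical names.

```python
import inspect

# The three parameter names the decorator expects, per the text above.
REQUIRED_PARAMS = ("response", "html", "soup")


def validate_parse_signature(func):
    """Hypothetical check: raise NotImplementedError if func lacks a parameter."""
    params = inspect.signature(func).parameters
    missing = [name for name in REQUIRED_PARAMS if name not in params]
    if missing:
        raise NotImplementedError(
            "%s() is missing required parameters: %s"
            % (func.__name__, ", ".join(missing))
        )
    return func


def parse_ok(response, html, soup):
    return soup


def parse_missing_soup(response, html):
    return html


validate_parse_signature(parse_ok)  # passes silently

try:
    validate_parse_signature(parse_missing_soup)
except NotImplementedError as exc:
    message = str(exc)
```

Running the check once at decoration time, rather than on every call, would surface a wrong signature as soon as the module is imported.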