decorator
There are three popular libraries widely used for making HTTP requests:
requests: http://docs.python-requests.org/en/master/
scrapy: https://doc.scrapy.org/en/latest/topics/request-response.html
selenium: http://selenium-python.readthedocs.io/index.html
And there are two popular libraries widely used for extracting data from HTML:
beautifulsoup4: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
scrapy.selector: https://doc.scrapy.org/en/latest/topics/selectors.html
This module bridges the gap between the two groups.
crawlib.html_parser.decorator.auto_decode_and_soupify(encoding=None, errors='strict')
This decorator assumes that the decorated function takes three arguments, passed in keyword syntax:
- response: requests.Response or scrapy.http.Response
- html: HTML string
- soup: bs4.BeautifulSoup
- if soup is not available, it will automatically be generated from html.
- if html is not available, it will automatically be generated from response.
Usage:
    @auto_decode_and_soupify()
    def parse(response, html, soup): ...
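This decorator automatically detects the function parameters named response, html, and soup, and generates the expected values automatically when html or soup is not given. A function decorated with it must declare all three of the parameters mentioned above, and they must be passed as keyword arguments when the function is called.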
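The fallback chain described above (response to html, then html to soup) can be illustrated with a minimal sketch. This is not crawlib's implementation: the name `fill_html_and_soup`, the `make_soup` parameter, and the UTF-8 default used when `encoding` is None are all assumptions made for illustration. The sketch relies only on the fact that requests.Response exposes raw bytes as `.content` and scrapy.http.Response exposes them as `.body`.

```python
import functools


def _default_make_soup(html):
    # Imported lazily so the sketch can be loaded without bs4 installed.
    from bs4 import BeautifulSoup
    return BeautifulSoup(html, "html.parser")


def fill_html_and_soup(encoding=None, errors="strict",
                       make_soup=_default_make_soup):
    """Hypothetical stand-in for auto_decode_and_soupify (illustration only)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*, response=None, html=None, soup=None, **kwargs):
            # Keyword-only arguments mirror the "keyword syntax" requirement.
            if html is None and response is not None:
                # requests.Response keeps the raw bytes in .content,
                # scrapy.http.Response keeps them in .body.
                raw = getattr(response, "content", None)
                if raw is None:
                    raw = getattr(response, "body", b"")
                # Assumption: fall back to UTF-8 when no encoding is given.
                html = raw.decode(encoding or "utf-8", errors=errors)
            if soup is None and html is not None:
                soup = make_soup(html)
            return func(response=response, html=html, soup=soup, **kwargs)
        return wrapper
    return decorator
```

With a decorator shaped like this, a call such as `parse(response=resp)` would receive a decoded `html` string and a parsed `soup`, while `parse(html=text)` would skip the decoding step and only build the soup.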
crawlib.html_parser.decorator.soupify(html)
Convert an HTML string to a BeautifulSoup object. It accounts for an API change in bs4.3.
crawlib.html_parser.decorator.validate_implementation_for_auto_decode_and_soupify(func)
Validate that auto_decode_and_soupify() is applicable to this function. If it is not applicable, a NotImplementedError will be raised.
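One way such a check could work is to inspect the function signature for the three required parameter names. The sketch below is an assumption about the approach, not crawlib's actual code, and the name `check_parse_signature` and the error message are hypothetical.

```python
import inspect

# The three parameter names the decorator expects, per the description above.
REQUIRED_PARAMS = ("response", "html", "soup")


def check_parse_signature(func):
    """Hypothetical check: raise NotImplementedError if a name is missing."""
    params = inspect.signature(func).parameters
    missing = [name for name in REQUIRED_PARAMS if name not in params]
    if missing:
        raise NotImplementedError(
            "%s() is missing parameter(s): %s"
            % (func.__name__, ", ".join(missing))
        )
    # Return the function unchanged so the check can also be used inline.
    return func
```

Running a check like this at decoration time would surface a misdeclared parse function when the module is imported, rather than on the first request.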