decorator

There are three popular libraries widely used for making HTTP requests:

And there are two popular libraries widely used for extracting data from HTML:

This module bridges the gap.

crawlib.html_parser.decorator.auto_decode_and_soupify(encoding=None, errors='strict')

This decorator assumes that the decorated function has three arguments, passed in keyword form:

  • response: requests.Response or scrapy.http.Response
  • html: html string
  • soup: bs4.BeautifulSoup
  1. if soup is not given, it is automatically generated from html.
  2. if html is not given, it is automatically generated from response (see the sketch below).
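
Conceptually, the fallback chain looks like the sketch below. The helper name, the decoding of a requests.Response body, and the default encoding are assumptions for illustration, not the decorator's literal code:

import bs4

def _fill_in_html_and_soup(response=None, html=None, soup=None,
                           encoding="utf-8", errors="strict"):
    # response -> html: decode the raw body (assuming a requests.Response)
    if html is None and response is not None:
        html = response.content.decode(encoding, errors=errors)
    # html -> soup: parse the decoded markup
    if soup is None and html is not None:
        soup = bs4.BeautifulSoup(html, "html.parser")
    return html, soup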

Usage:

@auto_decode_and_soupify()
def parse(response, html, soup):
    ...
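
For example, assuming the three parameters default to None (the URL below is only a placeholder), a decorated parser can be called with keyword arguments like this:

import requests

from crawlib.html_parser.decorator import auto_decode_and_soupify

@auto_decode_and_soupify()
def parse(response=None, html=None, soup=None):
    # soup is filled in by the decorator when only ``response`` is given
    return soup.title.text

res = requests.get("https://example.com")  # placeholder URL
print(parse(response=res))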

This decorator automatically detects the parameters named response, html, and soup in the function, and automatically generates the expected values when html or soup is not given. A function decorated by this decorator must have the three parameters mentioned above, and the arguments must be passed in keyword form when the function is called.

crawlib.html_parser.decorator.soupify(html)

Convert HTML to a bs4.BeautifulSoup object. It handles the API change introduced in bs4 4.3.
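
A minimal sketch of such a helper, not necessarily the library's exact implementation: newer bs4 releases warn when no parser is specified, so the parser is passed explicitly:

import bs4

def soupify(html):
    # Passing an explicit parser keeps behavior consistent across bs4 versions.
    return bs4.BeautifulSoup(html, features="html.parser")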

crawlib.html_parser.decorator.validate_implementation_for_auto_decode_and_soupify(func)

Validate that auto_decode_and_soupify() is applicable to this function. If it is not applicable, a NotImplementedError is raised.
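
A sketch of how such a check could be written with inspect, under the assumption that it only verifies the parameter names (the actual check may be stricter):

import inspect

def validate_implementation_for_auto_decode_and_soupify(func):
    # The decorated function must expose all three expected parameters.
    required = {"response", "html", "soup"}
    params = set(inspect.signature(func).parameters)
    missing = required - params
    if missing:
        raise NotImplementedError(
            "%s() must accept keyword arguments: %s"
            % (func.__name__, ", ".join(sorted(missing)))
        )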