decode

This module provide extended power to decode HTML you crawled.

class crawlib.decode.UrlSpecifiedDecoder

Designed for automatically decoding html from binary content of an url.

First, chardet.detect is very expensive in time. Second, usually each website (per domain) only use one encoding algorithm.

This class avoid perform chardet.detect twice on the same domain.

Parameters:domain_encoding_table – dict, key is root domain, and value is the domain’s default encoding.
decode(binary, url, encoding=None, errors='strict')

Decode binary to string.

Parameters:
  • binary – binary content of a http request.
  • url – endpoint of the request.
  • encoding – manually specify the encoding.
  • errors – errors handle method.
Returns:

str

crawlib.decode.smart_decode(binary, errors='strict')

Automatically find the right codec to decode binary data to string.

Parameters:
  • binary – binary data
  • errors – one of ‘strict’, ‘ignore’ and ‘replace’
Returns:

string