data_class

Every time a Request is handled, the DownloaderMiddleWare passes a Response back to us. We then parse the HTML in that Response with the methods defined in HtmlParser whose names start with parse. During parsing, the results are collected in a ParseResult instance. Besides holding:

  • the parameters that were passed into the parse call.
  • whether the parse succeeded and, if not, whether the failure came from a parsing
    error or from the Response itself being invalid, recorded as a status code.
  • the time at which the parse started.

it also carries one important attribute, ParseResult.item. This item is a subclass of Scrapy.Item and holds the actual data we are interested in.
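
A rough sketch of that flow is shown below. MovieItem and parse_movie_page are hypothetical names, and treating the HTTP status as the stored status code is an assumption; the keyword arguments match the documented ParseResult signature.

    import scrapy
    from crawlib.data_class import ParseResult

    class MovieItem(scrapy.Item):
        # hypothetical Scrapy.Item subclass carrying the data we care about
        title = scrapy.Field()

    def parse_movie_page(response, **params):
        # a parse_*-style method: extract data, wrap it in a ParseResult
        item = MovieItem(title=response.css("h1::text").get())
        # params and a status code are stored next to the extracted item;
        # using the HTTP status here is an assumption about the expected codes
        return ParseResult(params=params, item=item, status=response.status)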

class crawlib.data_class.ExtendedItem(*args, **kwargs)

An abstract data container class that holds data extracted from HTML.

process(parse_result, **kwargs)

Defines how this item is processed.

to_me_orm()

Take the data out and put it into the corresponding mongoengine ORM class.

to_sa_orm()

Take the data out and put it into the corresponding sqlalchemy ORM class.
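
A minimal sketch of a concrete subclass, assuming ExtendedItem can be used like a scrapy.Item and that the ORM mapping is supplied by overriding to_me_orm; the Movie and MovieItem classes are hypothetical, and the real mapping mechanism in crawlib may differ.

    import scrapy
    import mongoengine
    from crawlib.data_class import ExtendedItem

    class Movie(mongoengine.Document):
        # hypothetical mongoengine ORM class the item is written into
        title = mongoengine.StringField()

    class MovieItem(ExtendedItem):
        # hypothetical concrete item holding data extracted from HTML
        title = scrapy.Field()

        def to_me_orm(self):
            # take the extracted data out, put it into the mongoengine document
            return Movie(title=self["title"])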

class crawlib.data_class.OneToManyItem(*args, **kwargs)

An item representing the "many" side of a one-to-many relationship, bound to a parent mongoengine document.

Parameters:
  • parent_class – a mongoengine.Document class.
  • parent – instance of mongoengine.Document.
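
For illustration only, assuming the parent relation is passed through the constructor exactly as the parameters above describe; the Author class is hypothetical and any additional constructor arguments are omitted.

    import mongoengine
    from crawlib.data_class import OneToManyItem

    class Author(mongoengine.Document):
        # hypothetical parent document on the "one" side of the relation
        name = mongoengine.StringField()

    author = Author(name="Jane Doe")

    # item representing the "many" side, bound to its parent document
    item = OneToManyItem(parent_class=Author, parent=author)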

class crawlib.data_class.OneToManyMongoEngineItem(*args, **kwargs)
process(parse_result, **kwargs)

Defines how this item is processed.

class crawlib.data_class.OneToManyRdsItem(*args, **kwargs)
process(parse_result, engine, **kwargs)

Defines how this item is processed.
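
A hedged sketch of driving the relational variant, assuming process expects a SQLAlchemy engine and that the item attached to the ParseResult is a OneToManyRdsItem; the store helper and the connection URL are placeholders.

    from sqlalchemy import create_engine
    from crawlib.data_class import ParseResult

    engine = create_engine("sqlite:///:memory:")  # placeholder connection

    def store(parse_result: ParseResult) -> None:
        # assuming parse_result.item is a OneToManyRdsItem, hand it the engine;
        # the mongoengine variant would be called without the engine argument
        parse_result.item.process(parse_result, engine=engine)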

class crawlib.data_class.ParseResult(params=NOTHING, item=None, log=NOTHING, status=None, create_at=NOTHING)

A data container class that holds:

  • extracted item
  • request and response info
  • status info
  • time info
Parameters:
  • params – parser function parameters (all arguments passed to the parser function).
  • item – parsed data extracted from the HTML.
  • log – error dictionary.
  • status – int, crawl status code.
  • create_at – datetime, the time of the crawl.
is_finished()

Test whether the status should be marked as finished.

process_item(**kwargs)

Can be used as an item pipeline step in the Scrapy framework.
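
A minimal sketch of plugging this into a Scrapy item pipeline, assuming the spider yields ParseResult objects and that is_finished() returning True means the result is worth persisting; the pipeline class name is hypothetical.

    class ParseResultPipeline:
        # hypothetical Scrapy pipeline; spiders are assumed to yield ParseResult
        def process_item(self, item, spider):
            if item.is_finished():
                # delegate persistence / post-processing to the ParseResult itself
                item.process_item()
            return item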