data_class
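Extended
For each Request, DownloaderMiddleWare hands back a Response. The HTML in
that Response is then parsed by the methods defined in HtmlParser whose
names start with parse. The outcome of parsing is placed in a ParseResult
object, which holds:
- the parameters passed to the parser.
- whether parsing succeeded and, if it did not, whether the parser itself
failed or the Response was bad to begin with. The outcome is saved as a
status code.
- the time at which parsing started.
It also has an important attribute, ParseResult.item. This item is a
subclass of Scrapy.Item and holds the data we are actually interested in.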
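The flow described above can be sketched with a small stand-in. This is not crawlib's implementation: `SketchParseResult`, the three status constants, and `parse_title` are hypothetical names chosen for illustration, the numeric status values are assumptions, and a plain dict stands in for the Scrapy item.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

# Hypothetical status codes; crawlib's real values are not shown on this page.
STATUS_WRONG_PAGE = 20   # the response itself was bad
STATUS_PARSE_ERROR = 30  # the response looked fine, but parsing failed
STATUS_OK = 50           # parsing succeeded


@dataclass
class SketchParseResult:
    """Stand-in container with the same field names as ParseResult."""
    params: dict = field(default_factory=dict)
    item: Optional[dict] = None
    log: dict = field(default_factory=dict)
    status: Optional[int] = None
    create_at: datetime = field(default_factory=datetime.now)


def parse_title(html: str, **kwargs) -> SketchParseResult:
    """A parse_* style function: extract the <title> text from an HTML string."""
    result = SketchParseResult(params={"html": html, **kwargs})
    if not html.strip():
        # Nothing to parse: blame the response, not the parser.
        result.status = STATUS_WRONG_PAGE
        result.log["error"] = "empty response body"
        return result
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    if match is None:
        # The page arrived, but the expected element is missing.
        result.status = STATUS_PARSE_ERROR
        result.log["error"] = "no <title> element found"
        return result
    result.item = {"title": match.group(1).strip()}
    result.status = STATUS_OK
    return result
```

Keeping the status, the log, and the timestamp next to the item means a caller can decide later whether a page needs to be fetched again without re-parsing it.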
class crawlib.data_class.ExtendedItem(*args, **kwargs)
An abstract data container class that holds data extracted from HTML.
- process(parse_result, **kwargs): define how this item is processed.
- to_me_orm(): take the data out and put it into the corresponding mongoengine ORM class.
- to_sa_orm(): take the data out and put it into the corresponding sqlalchemy ORM class.
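A minimal sketch of the pattern an abstract item like this suggests, using only the standard library. Everything here is a stand-in rather than crawlib code: `SketchItem`, `MovieItem`, `MovieRow`, the single `to_orm` method (in place of `to_me_orm` / `to_sa_orm`), and the `store` keyword are all hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class MovieRow:
    """Hypothetical stand-in for an ORM class (mongoengine or sqlalchemy)."""
    id: int
    title: str


class SketchItem(ABC):
    """Stand-in for an abstract item: a container of parsed fields."""

    def __init__(self, **fields):
        self.fields = dict(fields)

    @abstractmethod
    def to_orm(self):
        """Take the data out and build the corresponding ORM-like object."""

    @abstractmethod
    def process(self, parse_result, **kwargs):
        """Decide what happens to the item once parsing is done."""


class MovieItem(SketchItem):
    def to_orm(self):
        return MovieRow(id=self.fields["id"], title=self.fields["title"])

    def process(self, parse_result, store=None, **kwargs):
        # 'store' is a hypothetical dict standing in for a database.
        row = self.to_orm()
        store[row.id] = row
        return row


store = {}
row = MovieItem(id=1, title="Alien").process(parse_result=None, store=store)
```

Separating the conversion step from `process` lets the same parsed fields be written to different backends without touching the parser.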
class crawlib.data_class.OneToManyItem(*args, **kwargs)
One To
Parameters:
- parent_class – a mongoengine.Document class.
- parent – instance of mongoengine.Document.
class crawlib.data_class.OneToManyMongoEngineItem(*args, **kwargs)
- process(parse_result, **kwargs): define how this item is processed.
class crawlib.data_class.OneToManyRdsItem(*args, **kwargs)
- process(parse_result, engine, **kwargs): define how this item is processed.
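This page does not say what the one-to-many `process` methods do internally. One plausible reading, sketched below as an assumption only, is that a parent page yields several child records, and processing stores the new children and then marks the parent as done. The sketch uses the standard-library `sqlite3` module in place of a sqlalchemy engine; the table layout, the column names, and `process_one_to_many` are hypothetical.

```python
import sqlite3


def process_one_to_many(conn, parent_id, child_ids):
    """Insert child rows found on a parent page, then mark the parent as done."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            # OR IGNORE skips children that were stored by an earlier run.
            "INSERT OR IGNORE INTO child (id, parent_id) VALUES (?, ?)",
            [(child_id, parent_id) for child_id in child_ids],
        )
        conn.execute(
            "UPDATE parent SET n_children = ?, finished = 1 WHERE id = ?",
            (len(child_ids), parent_id),
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE parent ("
    "id TEXT PRIMARY KEY, n_children INTEGER, finished INTEGER DEFAULT 0)"
)
conn.execute("CREATE TABLE child (id TEXT PRIMARY KEY, parent_id TEXT)")
conn.execute("INSERT INTO parent (id) VALUES ('state-ca')")

process_one_to_many(conn, "state-ca", ["city-la", "city-sf"])
process_one_to_many(conn, "state-ca", ["city-la", "city-sf"])  # repeating is harmless
```

Doing both writes in one transaction keeps the parent from being marked finished when its children failed to save.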
class crawlib.data_class.ParseResult(params=NOTHING, item=None, log=NOTHING, status=None, create_at=NOTHING)
A data container class that holds:
- extracted item
- request and response info
- status info
- time info
Parameters:
- params – all parameters of the parser function.
- item – parsed data extracted from the HTML.
- log – error dictionary.
- status – int, crawl status code.
- create_at – datetime, time of the crawl.
- is_finished(): test whether the status should be marked as finished.
- process_item(**kwargs): can be used in an item pipeline in the Scrapy framework.
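The two methods above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `FINISHED_THRESHOLD` and its value are hypothetical, and having `process_item` delegate to the item's `process(parse_result, **kwargs)` is a guess based on the method signatures listed on this page. `RecordingItem` is a hypothetical test double.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical rule: statuses at or above this value count as finished.
FINISHED_THRESHOLD = 50


@dataclass
class SketchParseResult:
    item: Any = None
    status: Optional[int] = None

    def is_finished(self) -> bool:
        """A missing status is treated as unfinished."""
        return self.status is not None and self.status >= FINISHED_THRESHOLD

    def process_item(self, **kwargs):
        """Hand this result to the item's own process method, if there is an item."""
        if self.item is None:
            return None
        return self.item.process(self, **kwargs)


class RecordingItem:
    """Hypothetical item that records the calls it receives."""

    def __init__(self):
        self.calls = []

    def process(self, parse_result, **kwargs):
        self.calls.append((parse_result.status, kwargs))
        return "processed"


item = RecordingItem()
result = SketchParseResult(item=item, status=60)
outcome = result.process_item(engine="db")
```

Extra keyword arguments, such as a database engine, pass through `process_item` unchanged, which is what would let one pipeline call serve items with different storage backends.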