data_class
Chinese documentation
Extended
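For each Request, the DownloaderMiddleWare hands back a Response. We then parse the HTML in that Response with the methods defined in HtmlParser whose names start with parse. During parsing, the results are placed in an instance of the ParseResult class. This class holds:
- The parameters that were passed to the parser.
- Whether the parse succeeded. If it did not, whether the failure happened during parsing or the Response itself was bad. The corresponding status code is then saved.
- The time at which the parse started.
It also has an important attribute, ParseResult.item. This item is an instance of a Scrapy.Item subclass and holds the data we are actually interested in.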
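The flow above can be sketched roughly as follows. This is a minimal stand-in rather than crawlib's implementation: `SketchParseResult`, `parse_title`, and the three `STATUS_*` values are hypothetical names chosen for illustration, and a plain dict takes the place of the Scrapy.Item subclass.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, Optional

# Hypothetical status codes; crawlib's real values are not shown here.
STATUS_OK = 0
STATUS_PARSE_ERROR = 1
STATUS_BAD_RESPONSE = 2


@dataclass
class SketchParseResult:
    """Stand-in container with the same five fields as ParseResult."""
    params: Dict[str, Any] = field(default_factory=dict)
    item: Optional[Dict[str, Any]] = None
    log: Dict[str, Any] = field(default_factory=dict)
    status: Optional[int] = None
    create_at: datetime = field(default_factory=datetime.now)


def parse_title(html: str, http_status: int = 200) -> SketchParseResult:
    """Parse the <title> of a page and record how the parse went."""
    result = SketchParseResult(params={"http_status": http_status})

    # Case 1: the response itself is bad, so there is nothing to parse.
    if http_status != 200:
        result.status = STATUS_BAD_RESPONSE
        result.log["error"] = "unexpected HTTP status %d" % http_status
        return result

    # Case 2: the response is fine but the parser could not find its data.
    match = re.search(r"<title>(.*?)</title>", html, re.S)
    if match is None:
        result.status = STATUS_PARSE_ERROR
        result.log["error"] = "no <title> element found"
        return result

    # Case 3: success, the extracted data goes into `item`.
    result.item = {"title": match.group(1).strip()}
    result.status = STATUS_OK
    return result
```

Calling `parse_title("<title>Home</title>")` in this sketch yields a result whose `item` is populated and whose `status` is `STATUS_OK`, while a non-200 `http_status` leaves `item` as `None` and records the reason in `log`.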
class crawlib.data_class.ExtendedItem(*args, **kwargs)
An abstract data container class that holds data extracted from HTML.
- process(parse_result, **kwargs)
  Defines how this item is processed.
- to_me_orm()
  Takes the data out and puts it into the corresponding mongoengine ORM class.
- to_sa_orm()
  Takes the data out and puts it into the corresponding sqlalchemy ORM class.
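As an illustration of the three hooks listed for ExtendedItem, the sketch below shows one way a subclass might fill them in. Everything here is an assumption made for the example: `SketchExtendedItem` and `MovieItem` are hypothetical classes, the `store` keyword is invented, and a plain dict stands in for a real sqlalchemy model.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class SketchExtendedItem(ABC):
    """Hypothetical stand-in for ExtendedItem; fields live in a dict."""

    def __init__(self, **fields: Any) -> None:
        self.fields: Dict[str, Any] = dict(fields)

    @abstractmethod
    def process(self, parse_result: Any, **kwargs: Any) -> None:
        """Decide what happens to this item once parsing is done."""

    def to_me_orm(self) -> Any:
        raise NotImplementedError

    def to_sa_orm(self) -> Any:
        raise NotImplementedError


class MovieItem(SketchExtendedItem):
    def to_sa_orm(self) -> Dict[str, Any]:
        # A real subclass would build a sqlalchemy model instance here;
        # a dict copy is used so the sketch needs no database.
        return dict(self.fields)

    def process(self, parse_result: Any, **kwargs: Any) -> None:
        # `store` is an invented keyword standing in for a database session.
        store: List[Dict[str, Any]] = kwargs["store"]
        store.append(self.to_sa_orm())
```

For example, `MovieItem(title="Alien").process(None, store=rows)` appends one record to the list `rows`.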
class crawlib.data_class.OneToManyItem(*args, **kwargs)
One To
Parameters:
- parent_class – a mongoengine.Document class.
- parent – an instance of mongoengine.Document.
class crawlib.data_class.OneToManyMongoEngineItem(*args, **kwargs)
- process(parse_result, **kwargs)
  Defines how this item is processed.
class crawlib.data_class.OneToManyRdsItem(*args, **kwargs)
- process(parse_result, engine, **kwargs)
  Defines how this item is processed.
class crawlib.data_class.ParseResult(params=NOTHING, item=None, log=NOTHING, status=None, create_at=NOTHING)
A data container class that holds:
- the extracted item
- request and response info
- status info
- time info
Parameters:
- params – all parameters of the parser function.
- item – the data parsed from the HTML.
- log – error dictionary.
- status – int, the crawl status code.
- create_at – datetime, the time of the crawl.
- is_finished()
  Tests whether the status should be marked as finished.
- process_item(**kwargs)
  Can be used for the item pipeline in the Scrapy framework.
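One plausible reading of these two methods is sketched below. It rests on assumptions not confirmed by this page: that process_item forwards its keyword arguments to the item's process method, and that "finished" corresponds to a particular status value. `MiniParseResult`, `EchoItem`, and `FINISHED_STATUS` are hypothetical names.

```python
from typing import Any, Optional

# Hypothetical: the real status value meaning "finished" is not shown here.
FINISHED_STATUS = 0


class EchoItem:
    """Toy item that records what it was asked to process."""

    def __init__(self) -> None:
        self.seen: Optional[tuple] = None

    def process(self, parse_result: Any, **kwargs: Any) -> None:
        self.seen = (parse_result, kwargs)


class MiniParseResult:
    """Stand-in carrying only the two fields these methods need."""

    def __init__(self, item: Any = None, status: Optional[int] = None) -> None:
        self.item = item
        self.status = status

    def is_finished(self) -> bool:
        # Assumed rule: finished means the status equals FINISHED_STATUS.
        return self.status == FINISHED_STATUS

    def process_item(self, **kwargs: Any) -> None:
        # Assumed behaviour: hand this result and the keyword arguments
        # (for example a database engine) to the item's own process().
        if self.item is not None:
            self.item.process(self, **kwargs)
```

Under this reading, a call such as `result.process_item(engine=some_engine)` would reach an item whose process method takes an `engine` argument, which matches the OneToManyRdsItem signature.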