selenium_downloader¶

URL content downloader middleware using Selenium.

class crawlib.downloader.selenium_downloader.BaseSeleliumDownloader(init_driver_func=<function <lambda>>, cache_dir=None, read_cache_first=False, alert_when_cache_missing=False, always_update_cache=False, cache_expire=None, testmode=False, **kwargs)¶

Implements common behavior for downloading URL content.

Note

The __init__(self, ...) method only saves the parameters. The actual WebDriver creation happens in BaseSeleliumDownloader.create_driver().
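The deferred creation described in this note can be illustrated with a minimal sketch. It is not crawlib's implementation: `LazyDriverHolder`, `fake_init_driver`, and the `headless` keyword are hypothetical names, and passing the saved keyword arguments on to `init_driver_func` is an assumption about how the pieces fit together.

```python
class LazyDriverHolder:
    """Hypothetical stand-in for the pattern; not the crawlib class."""

    def __init__(self, init_driver_func=lambda **kw: None, **kwargs):
        # Only remember the parameters; no browser is started here.
        self.init_driver_func = init_driver_func
        self.driver_kwargs = kwargs
        self.driver = None

    def create_driver(self):
        # Build the driver on the first call and reuse it afterwards.
        if self.driver is None:
            self.driver = self.init_driver_func(**self.driver_kwargs)
        return self.driver


created = []  # records every time a "driver" is actually built


def fake_init_driver(**kwargs):
    # Stands in for a real factory such as a Selenium WebDriver constructor.
    created.append(kwargs)
    return object()


holder = LazyDriverHolder(init_driver_func=fake_init_driver, headless=True)
```

With this shape, constructing the holder is cheap, and the cost of starting a browser is paid only when `create_driver()` is first called.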

Parameters:	testmode – bool, see `BaseSeleliumDownloader.use_testmode()`.

create_driver(**kwargs)¶: Create the WebDriver instance.

download(*args, **kwargs)¶

Warning

Not implemented. Python Selenium doesn’t support file and image downloading.

get_html(url, params=None, cache_cb=None, **kwargs)¶: Get the HTML of a URL.

use_testmode()¶

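Testing has special requirements. We want to use Selenium to fetch HTML from test URLs and then test the functions in html_parser against it. Along the way we want to cache the HTML, so that tests rerun within a short time use the cached copy. The browser is launched to fetch a page only when it is actually needed. We call this mode test mode (testmode).

Test mode: in test mode, the following settings apply:

The WebDriver object is not actually created (the browser is not opened) until a cache miss occurs.

Only on a cache miss is BaseSeleliumDownloader.driver created and initialized.
Always try to read data from the cache first.
Show a notice when the cache misses.
Always update the cache automatically.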
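The test-mode rules above can be sketched as a small cache-first fetcher. This is an illustration under stated assumptions, not crawlib's code: `CacheFirstFetcher` and `FakeDriver` are hypothetical, an in-memory dict stands in for `cache_dir`, and the fake driver only mimics Selenium's `get()` and `page_source`. The rules presumably correspond to the `read_cache_first`, `alert_when_cache_missing`, and `always_update_cache` constructor parameters, though the source does not say so explicitly.

```python
class FakeDriver:
    """Hypothetical driver that mimics Selenium's get() and page_source."""

    def __init__(self):
        self.calls = 0
        self.page_source = ""

    def get(self, url):
        self.calls += 1
        self.page_source = "<html>%s</html>" % url


class CacheFirstFetcher:
    """Hypothetical sketch of the test-mode flow."""

    def __init__(self, init_driver_func, cache=None):
        self.init_driver_func = init_driver_func
        self.cache = {} if cache is None else cache  # stand-in for cache_dir
        self.driver = None   # not created until the first cache miss
        self.messages = []   # collected cache-miss notices

    def get_html(self, url):
        # Always try the cache first.
        if url in self.cache:
            return self.cache[url]
        # Cache miss: record a notice.
        self.messages.append("cache miss: %s" % url)
        # Create the driver only now, on the first miss.
        if self.driver is None:
            self.driver = self.init_driver_func()
        self.driver.get(url)
        html = self.driver.page_source
        # Always update the cache after a real fetch.
        self.cache[url] = html
        return html


fetcher = CacheFirstFetcher(init_driver_func=FakeDriver)
first = fetcher.get_html("https://example.com")   # miss: driver is created
second = fetcher.get_html("https://example.com")  # hit: served from cache
```

In this sketch the second call never touches the driver, which is the behavior that makes repeated test runs fast.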

class crawlib.downloader.selenium_downloader.ChromeDownloader(chromedriver_executable_path, init_driver_func=<function <lambda>>, cache_dir=None, read_cache_first=False, alert_when_cache_missing=False, always_update_cache=False, cache_expire=None, testmode=False, **kwargs)¶: URL content downloader for the Chrome browser.
