# Scrapy-link-filter
![Python 3.6](https://img.shields.io/badge/python-3.6-blue.svg) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/ambv/black)
Spider Middleware that allows a [Scrapy Spider](https://scrapy.readthedocs.io/en/latest/topics/spiders.html) to filter requests.
Similar functionality already exists in the [CrawlSpider](https://scrapy.readthedocs.io/en/latest/topics/spiders.html#crawlspider) via Rules, and in the [RobotsTxtMiddleware](https://scrapy.readthedocs.io/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.robotstxt), but with a twist:
this middleware allows defining the rules dynamically, per spider, per job, or per request.
## Install
This project requires [Python 3.6+](https://www.python.org/) and [pip](https://pip.pypa.io/). Using a [virtual environment](https://virtualenv.pypa.io/) is strongly encouraged.
```sh
$ pip install git+https://github.com/croqaz/scrapy-link-filter
```
## Usage
For the middleware to be enabled as a Spider Middleware, it must be added in the project `settings.py`:
```py
SPIDER_MIDDLEWARES = {
    # maybe other Spider Middlewares ...
    # can go after DepthMiddleware: 900
    'scrapy_link_filter.middleware.LinkFilterMiddleware': 950,
}
```
Or, it can be enabled as a Downloader Middleware, in the project `settings.py`:
```py
DOWNLOADER_MIDDLEWARES = {
    # maybe other Downloader Middlewares ...
    # can go before RobotsTxtMiddleware: 100
    'scrapy_link_filter.middleware.LinkFilterMiddleware': 50,
}
```
The rules must be defined either in the spider instance, in a `spider.extract_rules` dict, or per request, in `request.meta['extract_rules']`.
Internally, the extract_rules dict is converted into a [LinkExtractor](https://docs.scrapy.org/en/latest/topics/link-extractors.html), which is used to match the requests.
Example of a specific allow filter:
```py
extract_rules = {"allow_domains": "example.com", "allow": "/en/items/"}
```
Or a specific deny filter:
```py
extract_rules = {
    "deny_domains": ["whatever.com", "ignore.me"],
    "deny": ["/privacy-policy/?$", "/about-?(us)?$"],
}
```
The allowed fields are:
* `allow_domains` and `deny_domains` - one or more domains that requests are limited to, or rejected from
* `allow` and `deny` - one or more sub-strings, or regular-expression patterns, that request URLs must match, or must not match
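To illustrate the intended semantics of these fields (the middleware itself delegates the actual matching to Scrapy's `LinkExtractor`), here is a rough stand-alone sketch; the helper name `url_allowed` is hypothetical and not part of this package:

```python
import re
from urllib.parse import urlparse


def url_allowed(url: str, rules: dict) -> bool:
    """Approximate allow/deny rule matching.

    Deny rules win over allow rules; an empty "allow" allows everything.
    """
    host = urlparse(url).netloc

    def as_list(value):
        # Each rule accepts either a single string or a list of strings
        return [value] if isinstance(value, str) else list(value or [])

    # Reject URLs on explicitly denied domains (including sub-domains)
    deny_domains = as_list(rules.get("deny_domains"))
    if any(host == d or host.endswith("." + d) for d in deny_domains):
        return False

    # If allowed domains are listed, the URL's domain must be one of them
    allow_domains = as_list(rules.get("allow_domains"))
    if allow_domains and not any(
        host == d or host.endswith("." + d) for d in allow_domains
    ):
        return False

    # Reject URLs matching any deny pattern
    if any(re.search(p, url) for p in as_list(rules.get("deny"))):
        return False

    # Accept only URLs matching an allow pattern, if any are given
    allow = as_list(rules.get("allow"))
    return not allow or any(re.search(p, url) for p in allow)


rules = {"allow_domains": "example.com", "allow": "/en/items/"}
print(url_allowed("http://example.com/en/items/1", rules))  # True
print(url_allowed("http://example.com/fr/items/1", rules))  # False
print(url_allowed("http://other.org/en/items/1", rules))    # False
```

This is only a mental model; for the exact behavior (canonicalization, domain matching edge cases), refer to the `LinkExtractor` documentation.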
-----
## License
[BSD3](LICENSE) © Cristi Constantin.