django-scraper
==============
**django-scraper** is a Django application which crawls and downloads online content following configurable instructions.
* Extract content from given websites/pages using XPath queries.
* Automatically browse and download content from related pages, up to a given depth.
* Extract metadata along with the main content.
* Apply content refinement rules and black words filtering.
* Store downloaded content and prevent duplicates.
* Support HTTP and HTTPS proxies.
Installation
------------
This application requires the following packages to be installed first:

* lxml
* requests

**django-scraper** itself can then be installed using `pip`:

    pip install django-scraper
Configuration
-------------
In order to use **django-scraper**, add it to `INSTALLED_APPS` in the `Django` settings:

    INSTALLED_APPS = (
        ...
        'scraper',
    )
There are also some important configuration values that should be added to the settings file:

    CRAW_ROOT = '/path/to/local/storage'
    PROXIES = {
        'http': 'sample_proxy:80',
        'https': 'sample_proxy:8080',
    }
Usage
-----
To start using the application, create a new `Source` object via the admin interface. There, enter the following information:
* `url` - URL to the start page of `source` (website, entry list,...)
* `name` - Name of the source
* `link xpath` - XPATH to links of main content page (entries, articles,...)
* `expand rules` - XPATH to URL values for the next scraping session, i.e. one depth level deeper
* `crawl depth` - Max depth of a scraping session; this relates to `expand rules`
* `content xpath` - XPATH to the target value of content page (article body,...)
* `content type` - Type of the current `source`
* `meta xpath` - Python dictionary mapping meta-data names to XPATHs, whose values will be extracted along with the main content
*Example:*
        {
            'title': '//h1[@class="title"]/text()',
            'keywords': '//meta[@name="keywords"]/@content',
        }
* `extra xpath` - XPATH to additional content that will be downloaded (PDF files, video clips,...)
* `refine rules` - List of regular expressions that will be applied to the content to remove redundant data. Each regex goes on a separate line.
*Example:*
        <div class="tags".*$
        <br/?>
* `black words` - Select a set of words; content containing any of these words will not be downloaded
* `active` - Determine if this `source` will run or not
* `download image` - Check this to download all images present inside the specified content
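The `content xpath`, `meta xpath`, and `refine rules` settings map naturally onto lxml, which the application depends on. Below is a minimal sketch of that extraction pipeline; the sample HTML and variable names are illustrative only, not part of django-scraper's internals:

```python
import re

from lxml import html

# A tiny illustrative page standing in for a downloaded article.
page = html.fromstring(
    '<html><body>'
    '<h1 class="title">Sample entry</h1>'
    '<div class="content">Body text<br/><div class="tags">tag1</div></div>'
    '</body></html>'
)

# Equivalent of `content xpath`: select the article body element.
content = html.tostring(page.xpath('//div[@class="content"]')[0]).decode()

# Equivalent of `meta xpath`: a dict of name -> XPath, evaluated per page.
meta_xpath = {'title': '//h1[@class="title"]/text()'}
meta = {name: page.xpath(xp) for name, xp in meta_xpath.items()}

# Equivalent of `refine rules`: regexes applied to the content
# to strip redundant markup (same patterns as the example above).
refine_rules = [r'<div class="tags".*$', r'<br/?>']
for rule in refine_rules:
    content = re.sub(rule, '', content)
```

After refinement, `content` keeps only the body text and `meta['title']` holds the extracted title list.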
After being saved, the `source` object can run a scraping session by calling its crawl() method:

    source_object.crawl()
or from the console, by running the management command `run_scraper`:

    python manage.py run_scraper
With this command, all active sources in the current Django instance will be processed consecutively.
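Because `run_scraper` processes every active source in one pass, it is convenient to schedule it periodically. A hypothetical crontab entry (the project path is a placeholder to adjust for your deployment):

```shell
# Run all active scraper sources every 6 hours
0 */6 * * * cd /path/to/project && python manage.py run_scraper
```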
--
*For further information, issues, or any questions regarding this, please email to me[at]zniper.net*