django-scraper
==============
**django-scraper** is a Django application which crawls and downloads online content following configurable instructions.
* Extract content from given websites/pages using XPath queries.
* Automatically browse and download content from related pages, up to a given depth.
* Extract metadata along with the main content.
* Apply content refinement rules and black words filtering.
* Store downloaded content and prevent duplicates.
* Support HTTP and HTTPS proxies.
Installation
------------
This application requires the following packages to be installed first:

* lxml
* requests

**django-scraper** itself can then be installed using `pip`:

    pip install django-scraper
Configuration
-------------
In order to use **django-scraper**, add it to `INSTALLED_APPS` in the `Django` settings:

    INSTALLED_APPS = (
        ...
        'scraper',
    )
There are also some important configuration values that should be added to the settings file:

    CRAW_ROOT = '/path/to/local/storage'
    PROXIES = {
        'http': 'sample_proxy:80',
        'https': 'sample_proxy:8080',
    }
Usage
-----
To start using the application, create a new `Source` object via the admin interface. There, enter the following information:
* `url` - URL to the start page of `source` (website, entry list,...)
* `name` - Name of the source
* `link xpath` - XPATH to links of main content page (entries, articles,...)
* `expand rules` - XPATH to URL values for the next scraping session, i.e. one depth level deeper
* `crawl depth` - Max depth of a scraping session; this relates to `expand rules`
* `content xpath` - XPATH to the target value of content page (article body,...)
* `content type` - Type of the current `source`
* `meta xpath` - Python dictionary mapping meta-data names to XPATHs, whose values will be extracted along with the main content
*Example:*
        {
            'title': '//h1[@class="title"]/text()',
            'keywords': '//meta[@name="keywords"]/@content',
        }
* `extra xpath` - XPATH to additional content that will be downloaded (PDF files, video clips,...)
* `refine rules` - List of regular expressions that will be applied to the content to remove redundant data. Each regex goes on a separate line.
*Example:*
        <div class="tags".*$
        <br/?>
* `black words` - Select a set of words; content containing any of these words will not be downloaded
* `active` - Determine if this `source` will run or not
* `download image` - Check this to download all images present inside the specified content
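The `content xpath`, `meta xpath`, and `refine rules` settings map naturally onto lxml, which the application depends on. Below is a minimal sketch of that extraction pipeline; the sample HTML and variable names are illustrative only, not part of django-scraper's internals:

```python
import re

from lxml import html

# A tiny illustrative page standing in for a downloaded article.
page = html.fromstring(
    '<html><body>'
    '<h1 class="title">Sample entry</h1>'
    '<div class="content">Body text<br/><div class="tags">tag1</div></div>'
    '</body></html>'
)

# Equivalent of `content xpath`: select the article body element.
content = html.tostring(page.xpath('//div[@class="content"]')[0]).decode()

# Equivalent of `meta xpath`: a dict of name -> XPath, evaluated per page.
meta_xpath = {'title': '//h1[@class="title"]/text()'}
meta = {name: page.xpath(xp) for name, xp in meta_xpath.items()}

# Equivalent of `refine rules`: regexes applied to the content
# to strip redundant markup (same patterns as the example above).
refine_rules = [r'<div class="tags".*$', r'<br/?>']
for rule in refine_rules:
    content = re.sub(rule, '', content)
```

After refinement, `content` keeps only the body text and `meta['title']` holds the extracted title list.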
After being saved, the `source` object can run a scraping session by calling its crawl() method:

    source_object.crawl()
or from the console, by running the management command `run_scraper`:

    python manage.py run_scraper
With this command, all active sources in the current Django instance will be processed consecutively.
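Because `run_scraper` processes every active source in one pass, it is convenient to schedule it periodically. A hypothetical crontab entry (the project path is a placeholder to adjust for your deployment):

```shell
# Run all active scraper sources every 6 hours
0 */6 * * * cd /path/to/project && python manage.py run_scraper
```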
--
*For further information, issues, or any questions regarding this, please email to me[at]zniper.net*