Define a JSON-format crawler rule and let Node.js crawl the required content according to that rule (.zip resource) - CSDN Library

35 files in total

ts: 16

json: 7

yml: 4

Crawler

python

Data collection

Points required: 5 | 103 views | Uploaded 2024-01-19 16:44:53 | 71KB ZIP


定义一个json格式的爬虫规则，Nodejs按照该规则爬取所需要的内容.zip (35 files)

SJT-code

tsconfig.cjs.json 203B

.editorconfig 210B

.vscode

settings.json 504B

launch.json 1KB

.prettierrc 100B

.github

release-drafter.yml 637B

workflows

code-format-test.yml 1KB

npm-package-version-check.yml 1KB

npm-publish-github-packages.yml 2KB

src

crawl.ts 2KB

types.ts 2KB

api.ts 8KB

result.ts 2KB

request.ts 427B

index.ts 87B

util.ts 2KB

LICENSE 1KB

.husky

pre-commit 69B

tsconfig.eslint.json 58B

tests

gitee.com

sum.test.ts 550B

bswtan.com

content.test.ts 1KB

detail.test.ts 4KB

search.test.ts 2KB

error.test.ts 2KB

completion.test.ts 1003B

json.test.ts 2KB

CHANGELOG.md 2KB

package.json 2KB

pnpm-lock.yaml 139KB

.eslintrc.json 641B

jest.config.ts 415B

.gitignore 26B

debugger.ts 2KB

tsconfig.json 881B

README.md 13KB

# spider-crawler

[![Lint & Test](https://github.com/wtto00/spider-crawler/actions/workflows/code-format-test.yml/badge.svg)](https://github.com/wtto00/spider-crawler/actions/workflows/code-format-test.yml) ![coverage](https://img.shields.io/codecov/c/github/wtto00/spider-crawler/main) [![downloads](https://img.shields.io/npm/dm/@wtto00/spider-crawler)](https://www.npmjs.com/package/@wtto00/spider-crawler) ![license](https://img.shields.io/github/license/wtto00/spider-crawler)

Define a crawler rule in JSON format, and Node.js crawls the required content according to that rule.

## Usage

```javascript
import { crawlFromUrl, crawlFromJson, crawlFromHtml } from '@wtto00/spider-crawler';

// Fetch a page over HTTP, then apply the rules to the response
crawlFromUrl(urlOptions).then((res) => {
  console.log(res);
});

// Apply the rules to a JSON string
crawlFromJson(jsonOptions).then((res) => {
  console.log(res);
});

// Apply the rules to an HTML string
crawlFromHtml(htmlOptions).then((res) => {
  console.log(res);
});
```

## Examples

#### crawlFromUrl

```javascript
const options = {
  url: 'https://marketplace.visualstudio.com/items?itemName=Orta.vscode-jest',
  rules: {
    name: {
      selector: '.ux-item-name',
      handlers: [{ method: 'text' }, { method: 'trim' }],
    },
    author: {
      selector: '.ux-item-publisher',
      handlers: [{ method: 'text' }, { method: 'trim' }],
    },
    installs: {
      selector: '.installs-text',
      // Handlers run in order, each receiving the previous handler's result
      handlers: [
        { method: 'text' },
        { method: 'substring', args: [0, -9] },
        { method: 'trim' },
        { method: 'replace', args: [',', ''] },
        { method: 'number' },
      ],
    },
    tags: {
      selector: '.meta-data-list-link',
      handlers: [
        {
          method: 'map',
          args: [
            {
              text: { handlers: [{ method: 'text' }] },
              link: { handlers: [{ method: 'attr', args: ['href'] }, { method: 'resolveUrl' }] },
            },
          ],
        },
      ],
    },
  },
};
crawlFromUrl(options).then((res) => {
  console.log(res);
});
// {"code":0,"message":"success","data":{"name":"Jest","author":"Orta","installs":1148080,"tags":null}}
```

#### crawlFromJson

```javascript
const options = {
  json: JSON.stringify({ test: '1' }),
  rules: {
    test: {
      selector: 'test',
      handlers: [{ method: 'number' }],
    },
  },
};
// The result is delivered through a promise, as in the Usage section
crawlFromJson(options).then((res) => {
  console.log(res);
});
// {"code":0,"message":"success","data":{"test":1}}
```

#### crawlFromHtml

```javascript
const options = {
  html: '<p class="test">content</p>',
  rules: {
    content: {
      selector: '.test',
      handlers: [{ method: 'text' }],
    },
  },
};
// The result is delivered through a promise, as in the Usage section
crawlFromHtml(options).then((res) => {
  console.log(res);
});
// {"code":0,"message":"success","data":{"content":"content"}}
```

## CrawlFromJson Options

| Field | Type            | Notes                           |
| ----- | --------------- | ------------------------------- |
| json  | string          | JSON string                     |
| rules | [Rules](#Rules) | Value extraction and processing rules |

## CrawlFromHtml Options

| Field   | Type            | Required | Notes                                                        |
| ------- | --------------- | -------- | ------------------------------------------------------------ |
| baseUrl | string          | No       | Base URL used when processing certain URL attributes in the HTML |
| html    | string          | Yes      | HTML string                                                  |
| rules   | [Rules](#Rules) | Yes      | Value extraction and processing rules                        |

## CrawlFromUrl Options

| Field   | Type                                                            | Notes           |
| ------- | --------------------------------------------------------------- | --------------- |
| url     | string                                                          | Request URL     |
| options | [RequestInit](https://www.npmjs.com/package/node-fetch#options) | Request options |
| rules   | [Rules](#Rules)                                                 | Crawler rules   |

### Rules

```typescript
type Rules = Record<string, Rule>;
```

#### Rule

| Field    | Type                  | Required | Notes |
| -------- | --------------------- | -------- | ----- |
| selector | string                | No       | [cheerio selector](https://github.com/cheeriojs/cheerio/wiki/Chinese-README#%E9%80%89%E6%8B%A9%E5%99%A8) |
| dataType | 'html'\|'json'        | No       | Whether `selector` is a [cheerio selector](https://github.com/cheeriojs/cheerio/wiki/Chinese-README#%E9%80%89%E6%8B%A9%E5%99%A8) or a [JSON selector](https://www.lodashjs.com/docs/lodash.at) |
| handlers | [Handler](#Handler)[] | Yes      | The list of methods applied to the elements the crawler extracted |

#### Handler

```typescript
interface Handler {
  method: Method;
  args?: Args;
}
```

#### Method & Args

The table below lists all available methods and their corresponding arguments.

| Method (`Method`) | Arguments (`args`) | Description |
| ----------------- | ------------------ | ----------- |
| prefix            | (string)           | Prepends a string to the start of the string |
| substring         | (number,?number)   | Extracts part of the string result |
| replace           | (string,string)    | Global string replacement |
| trim              | -                  | Removes leading and trailing whitespace |
| number            | -                  | Converts the string to a number; defaults to 0 if the conversion fails |
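To make the handler chain concrete, here is a minimal sketch of how the five string methods listed above could be applied in sequence. It is a hypothetical re-implementation for illustration only, not the library's source: the names `applyHandler` and `runHandlers` are invented here, and `substring` is assumed to follow `String.prototype.slice` semantics, because the `crawlFromUrl` example passes a negative end index (`[0, -9]`).

```typescript
// Hypothetical sketch; applyHandler and runHandlers are NOT part of @wtto00/spider-crawler.
type Value = string | number;

interface Handler {
  method: 'prefix' | 'substring' | 'replace' | 'trim' | 'number';
  args?: (string | number)[];
}

function applyHandler(value: Value, handler: Handler): Value {
  const text = String(value);
  const args = handler.args ?? [];
  switch (handler.method) {
    case 'prefix':
      // Prepend the given string
      return String(args[0] ?? '') + text;
    case 'substring':
      // Assumption: slice semantics, so a negative end counts from the end of the string
      return text.slice(Number(args[0] ?? 0), args[1] === undefined ? undefined : Number(args[1]));
    case 'replace': {
      // Replace every occurrence; split/join avoids regex escaping
      const search = String(args[0] ?? '');
      return search === '' ? text : text.split(search).join(String(args[1] ?? ''));
    }
    case 'trim':
      return text.trim();
    case 'number': {
      // Fall back to 0 when the text is not a valid number
      const n = Number(text);
      return Number.isNaN(n) ? 0 : n;
    }
  }
}

// Run the handlers in order, feeding each result into the next handler
function runHandlers(initial: Value, handlers: Handler[]): Value {
  return handlers.reduce(applyHandler, initial);
}

const installs = runHandlers(' 1,148,080 installs', [
  { method: 'substring', args: [0, -9] },
  { method: 'trim' },
  { method: 'replace', args: [',', ''] },
  { method: 'number' },
]);
console.log(installs); // 1148080
```

Modelling the chain as a reduction keeps each method independent of the others, which is one plausible reason a rule can be expressed as a plain JSON array of `{ method, args }` objects.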
