![](https://csdnimg.cn/release/download_crawler_static/88864606/bg1.jpg)
CSDN Search Engine
![](https://csdnimg.cn/release/download_crawler_static/88864606/bg2.jpg)
C
ontents
目录
爬虫
01
搜索算法
02
演示
04
前端 & 后端
03
![](https://csdnimg.cn/release/download_crawler_static/88864606/bg3.jpg)
爬 虫
Requests + BeautifulSoup
url: https://blog.csdn.net/diandianxiyu_geek/article/details/83657231
正则表达式匹配url
存储到mysql数据库
模拟登陆
Selenium 模拟点击 + send_keys
Cookies格式转换 + session登陆
20w+数据
![](https://csdnimg.cn/release/download_crawler_static/88864606/bg4.jpg)
常见开源全文搜索引擎
Lucene(pyLucene)
Elasticsearch
Whoosh
Sphinx
Nutch
Solr
![](https://csdnimg.cn/release/download_crawler_static/88864606/bg5.jpg)
搜索算法: whoosh
Jieba 中文分词解析器
from jieba.analyse.analyzer import ChineseAnalyzer
多字段索引模式 schema
schema = Schema(url=ID(stored=True), title=TEXT(stored=True),
nickname=TEXT(stored=True), readcount=TEXT(stored=True)
, text=TEXT(stored=True, analyzer=analyzer), time=DATETIME(stored=True))
多字段查询词解析器 MultifieldParser 可同时匹配标题、文章内容
parser = MultifieldParser(['title', 'text'], schema=ix.schema)
with ix.searcher(weighting=scoring.BM25F()) as searcher:
query = parser.parse(q)
results = searcher.search(query, limit=100)
搜索器(searcher)采用BM25F得分策略