# Crawlab
<p>
<a href="https://hub.docker.com/r/tikazyq/crawlab/builds" target="_blank">
<img src="https://github.com/crawlab-team/crawlab/workflows/Docker/badge.svg">
</a>
<a href="https://hub.docker.com/r/tikazyq/crawlab" target="_blank">
<img src="https://img.shields.io/docker/pulls/tikazyq/crawlab?label=pulls&logo=docker">
</a>
<a href="https://github.com/crawlab-team/crawlab/releases" target="_blank">
<img src="https://img.shields.io/github/release/crawlab-team/crawlab.svg?logo=github">
</a>
<a href="https://github.com/crawlab-team/crawlab/commits/master" target="_blank">
<img src="https://img.shields.io/github/last-commit/crawlab-team/crawlab.svg">
</a>
<a href="https://github.com/crawlab-team/crawlab/issues?q=is%3Aissue+is%3Aopen+label%3Abug" target="_blank">
<img src="https://img.shields.io/github/issues/crawlab-team/crawlab/bug.svg?label=bugs&color=red">
</a>
<a href="https://github.com/crawlab-team/crawlab/issues?q=is%3Aissue+is%3Aopen+label%3Aenhancement" target="_blank">
<img src="https://img.shields.io/github/issues/crawlab-team/crawlab/enhancement.svg?label=enhancements&color=cyan">
</a>
<a href="https://github.com/crawlab-team/crawlab/blob/master/LICENSE" target="_blank">
<img src="https://img.shields.io/github/license/crawlab-team/crawlab.svg">
</a>
</p>
[中文](https://github.com/crawlab-team/crawlab/blob/master/README-zh.md) | English
[Installation](#installation) | [Run](#run) | [Screenshot](#screenshot) | [Architecture](#architecture) | [Integration](#integration-with-other-frameworks) | [Compare](#comparison-with-other-frameworks) | [Community & Sponsorship](#community--sponsorship) | [CHANGELOG](https://github.com/crawlab-team/crawlab/blob/master/CHANGELOG.md) | [Disclaimer](https://github.com/crawlab-team/crawlab/blob/master/DISCLAIMER.md)
Crawlab is a Golang-based distributed web crawler management platform that supports multiple programming languages, including Python, NodeJS, Go, Java, and PHP, as well as various web crawler frameworks, including Scrapy, Puppeteer, and Selenium.
[Demo](http://crawlab.cn/demo) | [Documentation](http://docs.crawlab.cn)
## Installation
There are three installation methods:
1. [Docker](http://docs.crawlab.cn/Installation/Docker.html) (Recommended)
2. [Direct Deploy](http://docs.crawlab.cn/Installation/Direct.html) (to understand the internals)
3. [Kubernetes](http://docs.crawlab.cn/Installation/Kubernetes.html) (Multi-Node Deployment)
### Prerequisites (Docker)
- Docker 18.03+
- Redis 5.x+
- MongoDB 3.6+
- Docker Compose 1.24+ (optional but recommended)
### Prerequisites (Direct Deploy)
- Go 1.12+
- Node 8.12+
- Redis 5.x+
- MongoDB 3.6+
## Quick Start
Open a command-line prompt and execute the commands below. Make sure you have installed `docker-compose` in advance.
```bash
git clone https://github.com/crawlab-team/crawlab
cd crawlab
docker-compose up -d
```
Next, you can look into `docker-compose.yml` (which contains detailed configuration parameters) and the [Documentation (Chinese)](http://docs.crawlab.cn) for further information.
## Run
### Docker
Use `docker-compose` to start everything up with a single command. This way, you don't even have to configure the MongoDB and Redis databases yourself. Create a file named `docker-compose.yml` and paste in the configuration below.
```yaml
version: '3.3'
services:
  master:
    image: tikazyq/crawlab:latest
    container_name: master
    environment:
      CRAWLAB_SERVER_MASTER: "Y"
      CRAWLAB_MONGO_HOST: "mongo"
      CRAWLAB_REDIS_ADDRESS: "redis"
    ports:
      - "8080:8080"
    depends_on:
      - mongo
      - redis
  mongo:
    image: mongo:latest
    restart: always
    ports:
      - "27017:27017"
  redis:
    image: redis:latest
    restart: always
    ports:
      - "6379:6379"
```
Then execute the command below, and the Crawlab Master Node + MongoDB + Redis will start up. Open your browser and navigate to `http://localhost:8080` to see the UI.
```bash
docker-compose up
```
For details on Docker deployment, please refer to the [relevant documentation](https://tikazyq.github.io/crawlab-docs/Installation/Docker.html).
## Screenshot
#### Login
![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/login.png)
#### Home Page
![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/home.png)
#### Node List
![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/node-list.png)
#### Node Network
![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/node-network.png)
#### Spider List
![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/spider-list.png)
#### Spider Overview
![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/spider-overview.png)
#### Spider Analytics
![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/spider-analytics.png)
#### Spider File Edit
![](http://static-docs.crawlab.cn/file-edit.png)
#### Task Log
![](http://static-docs.crawlab.cn/task-log2.png)
#### Task Results
![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/task-results.png)
#### Cron Job
![](http://static-docs.crawlab.cn/schedule-v0.4.4.png)
#### Language Installation
![](http://static-docs.crawlab.cn/node-install-langs.png)
#### Dependency Installation
![](http://static-docs.crawlab.cn/node-install-dependencies.png)
#### Notifications
<img src="http://static-docs.crawlab.cn/notification-mobile.jpeg" height="480px">
## Architecture
The architecture of Crawlab consists of a Master Node, multiple Worker Nodes, and the Redis and MongoDB databases, which are mainly used for node communication and data storage.
![](http://static-docs.crawlab.cn/architecture.png)
The frontend app makes requests to the Master Node, which assigns tasks and deploys spiders through MongoDB and Redis. When a Worker Node receives a task, it executes the crawling task and stores the results in MongoDB. The architecture is much more concise than in versions before `v0.3.0`: the unnecessary Flower module, which provided node monitoring services, has been removed, and node monitoring is now handled by Redis.
### Master Node
The Master Node is the core of the Crawlab architecture. It is the central control system of Crawlab.
The Master Node offers the following services:
1. Crawling Task Coordination;
2. Worker Node Management and Communication;
3. Spider Deployment;
4. Frontend and API Services;
5. Task Execution (one can regard the Master Node as a Worker Node)
The Master Node communicates with the frontend app and sends crawling tasks to Worker Nodes. Meanwhile, the Master Node synchronizes (deploys) spiders to Worker Nodes via Redis and MongoDB GridFS.
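To make the task-dispatch flow concrete, here is a minimal master-side sketch, assuming a per-node Redis list acts as the task queue. The queue name and message schema are illustrative assumptions, not Crawlab's actual internal protocol.

```python
# Hypothetical master-side task dispatch via Redis (queue name and
# message schema are assumptions for illustration, not Crawlab internals).
import json
import uuid

import redis

r = redis.Redis(host="localhost", port=6379)

task = {
    "id": str(uuid.uuid4()),
    "spider_id": "my_spider",
    "node_id": "node-1",
    "cmd": "scrapy crawl quotes",
}

# Push the task onto a per-node queue that the target worker polls.
r.lpush("tasks:node-1", json.dumps(task))
```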
### Worker Node
Worker Nodes mainly execute crawling tasks, store results and logs, and communicate with the Master Node through Redis `PubSub`. By increasing the number of Worker Nodes, Crawlab can scale horizontally, and different crawling tasks can be assigned to different nodes for execution.
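Below is a companion sketch of the worker side under the same assumptions as the master-side snippet above: the worker pops a task from its queue, runs the spider command, and reports status back over Redis `PubSub` as described. The channel name and status schema are again hypothetical.

```python
# Hypothetical worker loop: pop a task, run it, report status via PubSub.
# Queue/channel names and the status schema are assumptions for illustration.
import json
import subprocess

import redis

r = redis.Redis(host="localhost", port=6379)

while True:
    # Block until a task arrives on this node's queue.
    _, raw = r.brpop("tasks:node-1")
    task = json.loads(raw)

    # Run the spider command and capture its exit status.
    result = subprocess.run(task["cmd"], shell=True)
    status = "finished" if result.returncode == 0 else "error"

    # Report back to the Master Node over Redis PubSub.
    r.publish("task-status", json.dumps({"id": task["id"], "status": status}))
```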
### MongoDB
MongoDB is the operational database of Crawlab. It stores data for nodes, spiders, tasks, schedules, etc. The MongoDB GridFS file system is the medium through which the Master Node stores spider files and synchronizes them to the Worker Nodes.
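As a sketch of the GridFS mechanism, the snippet below stores a packaged spider and fetches it back using `pymongo`; the database name and the zip packaging are assumptions for illustration, not Crawlab's exact internals.

```python
# Sketch: spider files stored in MongoDB GridFS and fetched back, as the
# synchronization medium described above. Names here are assumptions.
import gridfs
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["crawlab"]
fs = gridfs.GridFS(db)

# Master Node: store a packaged spider.
with open("my_spider.zip", "rb") as f:
    file_id = fs.put(f, filename="my_spider.zip")

# Worker Node: fetch the latest version of the same file and unpack locally.
with open("/tmp/my_spider.zip", "wb") as out:
    out.write(fs.get_last_version("my_spider.zip").read())
```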
### Redis
Redis is a very popular key-value database. It offers node communication services in Crawlab. For example, nodes execute `HSET` to write their info into a hash named `nodes` in Redis, and the Master Node identifies online nodes according to that hash.
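The snippet below sketches this registration pattern with `redis-py`: a node writes its info into the `nodes` hash with `HSET`, and the master scans the hash to decide which nodes are online. The field names and the online-timeout threshold are assumptions for illustration.

```python
# Sketch of the node-registration pattern described above, using redis-py.
# The per-node field layout and timeout value are assumptions.
import json
import time

import redis

r = redis.Redis(host="localhost", port=6379)

# A worker registers itself in the `nodes` hash (field names are hypothetical).
node_info = {"ip": "10.0.0.5", "mac": "aa:bb:cc:dd:ee:ff", "update_ts": int(time.time())}
r.hset("nodes", "node-1", json.dumps(node_info))

# The Master Node scans the hash and treats recently updated entries as online.
for node_id, raw in r.hgetall("nodes").items():
    info = json.loads(raw)
    online = int(time.time()) - info["update_ts"] < 60  # 60s timeout is an assumption
    print(node_id.decode(), "online" if online else "offline")
```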
### Frontend
The frontend is a SPA (single-page application) based on
[Vue-Element-Admin](https://github.com/PanJiaChen/vue-element-admin). It reuses many Element-UI components to build the interface.
## Integration with Other Frameworks
[Crawlab SDK](https://github.com/crawlab-team/crawlab-sdk) provides helper methods to make it easier for your spiders to integrate with Crawlab, e.g. saving results to Crawlab.
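For example, a Scrapy pipeline might use the Python SDK's `save_item` to persist scraped items so they appear on the task results page. This is a hedged sketch assuming a recent SDK version; check the SDK repository for the current API.

```python
# Minimal Scrapy pipeline sketch using the Crawlab Python SDK.
# The save_item call assumes a recent SDK version; the API may differ.
from crawlab import save_item


class CrawlabPipeline:
    def process_item(self, item, spider):
        # Persist the item so it shows up in Crawlab's task results.
        save_item(dict(item))
        return item
```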