# cc_net
Tools to download and clean Common Crawl as introduced in our paper [CCNet](https://arxiv.org/abs/1911.00359).
If you found these resources useful, please consider citing:
```
@inproceedings{wenzek2020ccnet,
  title={CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data},
  author={Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis and Chaudhary, Vishrav and Guzm{\'a}n, Francisco and Joulin, Armand and Grave, {\'E}douard},
  booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
  pages={4003--4012},
  year={2020}
}
```
[![CircleCI](https://circleci.com/gh/facebookresearch/cc_net.svg?style=svg)](https://circleci.com/gh/facebookresearch/cc_net)
## Installation
We only tried this on Linux, but installation should be possible on macOS too.
1. Create or symlink a `data` folder pointing to where you want to download the corpus.
2. Run `make install`. This will download some resources and install required packages.
3. If you have a C++ 17 compiler, you can also run
`pip install .[getpy]`; it provides a more memory-efficient hash set.
4. Install the following tools manually if `make install` fails:
- `lmplz` and `build_binary` from [KenLM](https://github.com/kpu/kenlm)
- `spm_train` and `spm_encode` from [Sentence Piece](https://github.com/google/sentencepiece)
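As a quick recap, the installation steps above boil down to something like the following sketch (`/path/to/corpus_storage` is just a placeholder for wherever you want the corpus to live):
```sh
# Point ./data at the storage location for the corpus (or simply create the folder).
ln -s /path/to/corpus_storage data

# Download resources and install the required packages.
make install

# Optional: a more memory-efficient hash set (needs a C++ 17 compiler).
pip install .[getpy]
```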
## Training Language Models
The `Makefile` is used to train the Sentence Piece and LM models on Wikipedia data.
* `make help` shows help
* `make lang=de lm` trains a Sentence Piece model and an LM on German Wikipedia
* `make all_lm` trains the same models as in the paper
* `make lang=de dl_lm` downloads the LM trained for the paper
* `make dl_all_lm` downloads all of them
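For instance, a minimal sketch of getting the German models in place, either by downloading the ones used in the paper or by training from scratch:
```sh
# Download the pretrained German Sentence Piece + LM used in the paper...
make lang=de dl_lm

# ...or train your own on German Wikipedia (slower).
make lang=de lm
```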
## Pipeline overview
The full mining pipeline is divided into 3 steps:
- `hashes` downloads one Common Crawl snapshot and computes hashes for each paragraph
- `mine` removes duplicates, detects the language, runs the LM and splits by language/perplexity buckets
- `regroup` regroups the files created by `mine` into chunks of 4 GB
Each step needs the previous one to finish before it can start.
You can launch the full pipeline using `python -m cc_net`.
* `python -m cc_net --help` shows help
* `python -m cc_net --dump 2019-13` processes a specific snapshot
* `python -m cc_net -l my -l gu` restricts to specific languages
* `python -m cc_net --lm_dir my_lms/` uses custom LMs
* `python -m cc_net --lang_threshold 0.3` sets a specific field in `mine.Config`
* `python -m cc_net --config test` runs on a tiny subset of a snapshot
* `python -m cc_net --config config/my_config.json` uses configuration from the given config file
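These flags can be combined into a single invocation. A sketch (the snapshot, languages and parallelism below are arbitrary examples, not recommendations):
```sh
# Mine the 2019-13 snapshot, restricted to German and French,
# with a lower task parallelism (see "Adapting to your infrastructure" below).
python -m cc_net --dump 2019-13 -l de -l fr --task_parallelism 32
```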
## Reproducing our work
Given the CPU resources required to run the full pipeline on such a big corpus, we share a mapping from URL to the information we computed.
You can reconstruct the corpus used in the paper by running:
```sh
python -m cc_net --conf reproduce --dump 2019-09
```
## Extract XLM-R data
The [Unsupervised Cross-lingual Representation Learning at Scale (XLM-RoBERTa)](https://arxiv.org/pdf/1911.02116.pdf)
model was trained on data extracted by an internal version of cc_net.
Since the format is a little bit different, please use the following commands instead:
```sh
python cc_net/tools/dl_cc_100.py --help
python cc_net/tools/dl_cc_100.py --outdir data_cc100 --process 8
```
If you use this version of the data please also consider citing:
```bibtex
@article{conneau2019unsupervised,
  title={Unsupervised Cross-lingual Representation Learning at Scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1911.02116},
  year={2019}
}
```
## Adapting to your infrastructure
Given the computation cost of running the full pipeline, we distributed the computation
on a [Slurm](https://slurm.schedmd.com/) cluster using [submitit](https://github.com/facebookincubator/submitit).
`submitit` will default to spawning processes on your machine if no Slurm cluster is found.
You should tweak `--task_parallelism` to a value adapted to your machine.
Defaults are 512 for mining and 20 for reproducing.
To run the tasks in-process use `--execution debug`.
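For example, a quick way to sanity-check an installation without touching Slurm is to combine the tiny `test` config with in-process execution:
```sh
# Run the whole pipeline in-process on a tiny subset of a snapshot.
python -m cc_net --config test --execution debug
```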
## Output format
Generated files are compressed JSON files. There is one JSON object per line.
__List of fields__:
- url: webpage URL (part of CC)
- date_download: date of download (part of CC)
- digest: sha1 digest of the webpage (part of CC)
- length: number of chars
- nlines: number of lines
- source_domain: web domain of the webpage
- title: page title (part of CC)
- raw_content: webpage content after deduplication
- original_nlines: number of lines before deduplication
- original_length: number of chars before deduplication
- language: language detected by FastText LID
- language_score: language score
- perplexity: perplexity of a LM trained on Wikipedia
__Sample JSON object__:
```json
{
"url": "http://www.pikespeakhospice.org/members/1420",
"date_download": "2019-02-15T18:40:25Z",
"digest": "sha1:VQW3KXUOALO543IJGTK2JLVEAN2XXKHI",
"length": 752,
"nlines": 5,
"source_domain": "www.pikespeakhospice.org",
"title": "LeeRoy Aragon",
"raw_content": "Date Honored: March 2017\nHe was a man of integrity, a hard worker, and a dedicated family man. He loved spending time with family camping, fishing, hunting, boating and just hanging out.\nHis Catholic faith was extremely important to him as he gave of his time and talents to the community. He had many friends through church and the Knights of Columbus. He was a meticulous handyman, and enjoyed building and fixing things and restoring antique furniture to perfection. He was a fan and supported his Colorado Rockies and Denver Broncos. Throughout the years he had devoted four-legged friends (his dogs and a horse named Sunny Boy).\nWe have many cherished memories of him that we will treasure until we are with him again.\n~ Family of LeeRoy F. Aragon",
"original_nlines": 7,
"original_length": 754,
"language": "en",
"language_score": 0.99,
"perplexity": 255.11
}
```
You can peek at those files using the UNIX tools `zcat` and [`jq`](https://stedolan.github.io/jq/manual/), e.g.:
`zcat data/mined/2019-09/en_head_0000.json.gz | head -1 | jq .`
`jq` can do some complicated filtering.
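For instance, here is a small sketch (reusing the file from the `zcat` example above) that prints the URLs of documents whose perplexity is below an arbitrary threshold:
```sh
# Keep only documents with perplexity < 300 and print their URL.
zcat data/mined/2019-09/en_head_0000.json.gz | jq -r 'select(.perplexity < 300) | .url'
```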
`jsonql.py` provides a Python API with multiprocessing support for more complex operations, such as LM scoring of the documents.
## License
By contributing to `cc_net`, you agree that your contributions will be licensed
under the LICENSE file in the root directory of this source tree.