GPT4All-J: An Apache-2 Licensed Assistant-Style Chatbot
Yuvanesh Anand
yuvanesh@nomic.ai
Zach Nussbaum
zach@nomic.ai
Brandon Duderstadt
brandon@nomic.ai
Benjamin M. Schmidt
ben@nomic.ai
Adam Treat
treat.adam@gmail.com
Andriy Mulyar
andriy@nomic.ai
Abstract
GPT4All-J is an Apache-2 licensed chatbot
trained over a massive curated corpus of as-
sistant interactions including word problems,
multi-turn dialogue, code, poems, songs, and
stories. It builds on the March 2023 GPT4All
release by training on a significantly larger
corpus, by deriving its weights from the
Apache-licensed GPT-J model rather than the
GPL-licensed LLaMA model, and by demonstrat-
ing improved performance on creative tasks
such as writing stories, poems, songs and
plays. We openly release the training data,
data curation procedure, training code, and fi-
nal model weights to promote open research
and reproducibility. Additionally, we release
Python bindings and a Chat UI to a quantized
4-bit version of GPT4All-J allowing virtually
anyone to run the model on CPU.
1 Data Collection and Curation
We gather a diverse sample of questions/prompts
by leveraging several publicly available datasets
and curating our own set of prompts:
• Several subsamples from subsets of
LAION OIG including unified_chip2,
unified_unifiedskg_instruction,
unified_hc3_human, unified_multi_news and
unified_abstract_infill
• Coding questions with a random sub-sample
of Stackoverflow Questions
• Instruction-tuning with a sub-sample of Big-
science/P3
• Custom-generated creative questions.
We accompany this paper with the 800k-point
GPT4All-J dataset, a superset of the origi-
nal 400k-point GPT4All dataset. We dedicated
substantial attention to data preparation and cura-
tion.
Building on the GPT4All dataset, we curated
the GPT4All-J dataset by augmenting the origi-
nal 400k GPT4All examples with new samples
encompassing additional multi-turn QA samples
and creative writing such as poetry, rap, and short
stories. We designed prompt templates to create
different scenarios for creative writing. The cre-
ative prompt template was inspired by Mad Libs
style variations of ‘Write a [creative story type]
about [NOUN] in the style of [PERSON]’. In ear-
lier versions of GPT4All, we found that rather than
writing actual creative content, the model would
discuss how it would go about writing the content.
Training on this new dataset allows GPT4All-J to
write poems, songs, and plays with increased com-
petence.
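The Mad Libs style template described above can be sketched as follows; the word lists here are hypothetical placeholders for illustration, not the authors' actual vocabulary or prompt set:

```python
import random

# Illustrative fillers for the template
# 'Write a [creative story type] about [NOUN] in the style of [PERSON]'.
# These lists are assumptions, not the paper's actual data.
STORY_TYPES = ["poem", "song", "short story", "play"]
NOUNS = ["a lighthouse", "an old train", "the ocean"]
PERSONS = ["Emily Dickinson", "Bob Dylan", "Shakespeare"]

def make_creative_prompt(rng: random.Random) -> str:
    """Fill the creative-writing template with random slot values."""
    return (f"Write a {rng.choice(STORY_TYPES)} "
            f"about {rng.choice(NOUNS)} "
            f"in the style of {rng.choice(PERSONS)}.")

prompt = make_creative_prompt(random.Random(0))
print(prompt)
```

Sampling many such prompts yields varied creative-writing scenarios, which is the effect the template variations aim for.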
We used Atlas to inform our data cleaning and
curation efforts. We started with a collection of ap-
proximately 1,000,000 points. Several data cura-
tion iterations produced our final GPT4All-J train-
ing set. Among other changes, we removed exact
duplicate prompts and responses characterized by
homogeneous clusters in the Atlas map. We also
removed prompts shorter than 10 characters,
such as single words like ‘The’ and ‘And’, as well as
poorly formatted examples.
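The deduplication and length-filtering steps described above can be sketched as follows (a simplified illustration of the two stated rules; the actual pipeline also relied on Atlas cluster maps, which this sketch does not model):

```python
def clean_examples(examples):
    """Drop exact duplicate (prompt, response) pairs and prompts
    shorter than 10 characters, mirroring the curation rules above."""
    seen = set()
    cleaned = []
    for ex in examples:
        key = (ex["prompt"], ex["response"])
        if key in seen:
            continue                      # exact duplicate pair
        if len(ex["prompt"]) < 10:
            continue                      # too-short prompt, e.g. 'The', 'And'
        seen.add(key)
        cleaned.append(ex)
    return cleaned

raw = [
    {"prompt": "The", "response": "..."},
    {"prompt": "Write a poem about rain.", "response": "Rain falls..."},
    {"prompt": "Write a poem about rain.", "response": "Rain falls..."},
]
print(len(clean_examples(raw)))  # only the first well-formed pair survives
```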
Interactively explore the cleaned dataset in At-
las:
• GPT4All-J Curated Training Set Map
2 Model Training
We trained several models finetuned from both
LLaMA 7B (Touvron et al., 2023) and GPT-J
(Wang and Komatsuzaki, 2021) checkpoints. The
model associated with our initial public release
is trained with LoRA (Hu et al., 2021) on the
437,605 post-processed examples for four epochs
while the finetuned GPT-J was trained for one
epoch. Detailed model hyper-parameters and
training code can be found in the associated repository.
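As context for the LoRA reference above, here is a minimal numpy sketch of the low-rank adaptation idea from Hu et al. (2021); shapes and values are illustrative assumptions, not the authors' actual training configuration:

```python
import numpy as np

# LoRA keeps the pretrained weight W frozen and learns a low-rank
# update B @ A, scaled by alpha / r, so far fewer parameters train.
rng = np.random.default_rng(0)
d, k, r, alpha = 16, 16, 4, 8        # toy dimensions and rank

W = rng.normal(size=(d, k))          # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                 # zero-initialized, so W' == W at start

def lora_forward(x):
    """Forward pass through the frozen weight plus the low-rank adapter."""
    return x @ (W + (alpha / r) * (B @ A)).T

x = rng.normal(size=(1, k))
# With B still zero, the adapted model reproduces the frozen model exactly.
assert np.allclose(lora_forward(x), x @ W.T)
```

During finetuning only A and B receive gradient updates, which is what makes LoRA much cheaper than full finetuning of the 437,605-example run described above.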