GPT4All-J: An Apache-2 Licensed Assistant-Style Chatbot
Yuvanesh Anand
yuvanesh@nomic.ai
Zach Nussbaum
zach@nomic.ai
Brandon Duderstadt
brandon@nomic.ai
Benjamin M. Schmidt
ben@nomic.ai
Adam Treat
treat.adam@gmail.com
Andriy Mulyar
andriy@nomic.ai
Abstract
GPT4All-J is an Apache-2 licensed chatbot
trained over a massive curated corpus of as-
sistant interactions including word problems,
multi-turn dialogue, code, poems, songs, and
stories. It builds on the March 2023 GPT4All
release by training on a significantly larger
corpus, by deriving its weights from the
Apache-licensed GPT-J model rather than the
GPL-licensed LLaMA model, and by demonstrat-
ing improved performance on creative tasks
such as writing stories, poems, songs and
plays. We openly release the training data,
data curation procedure, training code, and fi-
nal model weights to promote open research
and reproducibility. Additionally, we release
Python bindings and a Chat UI to a quantized
4-bit version of GPT4All-J allowing virtually
anyone to run the model on CPU.
1 Data Collection and Curation
We gather a diverse sample of questions/prompts
by leveraging several publicly available datasets
and curating our own set of prompts:
• Several subsamples from subsets of
LAION OIG including unified_chip2,
unified_unifiedskg_instruction,
unified_hc3_human, unified_multi_news and
unified_abstract_infill
• Coding questions with a random sub-sample
of Stackoverflow Questions
• Instruction-tuning with a sub-sample of Big-
science/P3
• Custom-generated creative questions.
We accompany this paper with the 800k-point
GPT4All-J dataset, a superset of the origi-
nal 400k-point GPT4All dataset. We dedicated
substantial attention to data preparation and cura-
tion.
Building on the GPT4All dataset, we curated
the GPT4All-J dataset by augmenting the origi-
nal 400k GPT4All examples with new samples
encompassing additional multi-turn QA samples
and creative writing such as poetry, rap, and short
stories. We designed prompt templates to create
different scenarios for creative writing. The cre-
ative prompt template was inspired by Mad Libs
style variations of ‘Write a [creative story type]
about [NOUN] in the style of [PERSON]’. In ear-
lier versions of GPT4All, we found that rather than
writing actual creative content, the model would
discuss how it would go about writing the content.
Training on this new dataset allows GPT4All-J to
write poems, songs, and plays with increased com-
petence.
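The Mad Libs style template described above can be sketched as follows; the word lists here are hypothetical placeholders for illustration, not the authors' actual vocabulary or prompt set:

```python
import random

# Illustrative fillers for the template
# 'Write a [creative story type] about [NOUN] in the style of [PERSON]'.
# These lists are assumptions, not the paper's actual data.
STORY_TYPES = ["poem", "song", "short story", "play"]
NOUNS = ["a lighthouse", "an old train", "the ocean"]
PERSONS = ["Emily Dickinson", "Bob Dylan", "Shakespeare"]

def make_creative_prompt(rng: random.Random) -> str:
    """Fill the creative-writing template with random slot values."""
    return (f"Write a {rng.choice(STORY_TYPES)} "
            f"about {rng.choice(NOUNS)} "
            f"in the style of {rng.choice(PERSONS)}.")

prompt = make_creative_prompt(random.Random(0))
print(prompt)
```

Sampling many such prompts yields varied creative-writing scenarios, which is the effect the template variations aim for.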
We used Atlas to inform our data cleaning and
curation efforts. We started with a collection of ap-
proximately 1,000,000 points. Several data cura-
tion iterations produced our final GPT4All-J train-
ing set. Among other changes, we removed exact
duplicate prompts and responses characterized by
homogeneous clusters in the Atlas map. We also
removed prompts shorter than 10 characters,
such as single words like ‘The’ and ‘And’, as well as
poorly formatted examples.
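The deduplication and length-filtering steps described above can be sketched as follows (a simplified illustration of the two stated rules; the actual pipeline also relied on Atlas cluster maps, which this sketch does not model):

```python
def clean_examples(examples):
    """Drop exact duplicate (prompt, response) pairs and prompts
    shorter than 10 characters, mirroring the curation rules above."""
    seen = set()
    cleaned = []
    for ex in examples:
        key = (ex["prompt"], ex["response"])
        if key in seen:
            continue                      # exact duplicate pair
        if len(ex["prompt"]) < 10:
            continue                      # too-short prompt, e.g. 'The', 'And'
        seen.add(key)
        cleaned.append(ex)
    return cleaned

raw = [
    {"prompt": "The", "response": "..."},
    {"prompt": "Write a poem about rain.", "response": "Rain falls..."},
    {"prompt": "Write a poem about rain.", "response": "Rain falls..."},
]
print(len(clean_examples(raw)))  # only the first well-formed pair survives
```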
Interactively explore the cleaned dataset in At-
las:
• GPT4All-J Curated Training Set Map
2 Model Training
We trained several models finetuned from both
LLaMA 7B (Touvron et al., 2023) and GPT-J
(Wang and Komatsuzaki, 2021) checkpoints. The
model associated with our initial public release
is trained with LoRA (Hu et al., 2021) on the
437,605 post-processed examples for four epochs
while the finetuned GPT-J was trained for one
epoch. Detailed model hyper-parameters and
training code can be found in the associated repository.
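As context for the LoRA reference above, here is a minimal numpy sketch of the low-rank adaptation idea from Hu et al. (2021); shapes and values are illustrative assumptions, not the authors' actual training configuration:

```python
import numpy as np

# LoRA keeps the pretrained weight W frozen and learns a low-rank
# update B @ A, scaled by alpha / r, so far fewer parameters train.
rng = np.random.default_rng(0)
d, k, r, alpha = 16, 16, 4, 8        # toy dimensions and rank

W = rng.normal(size=(d, k))          # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                 # zero-initialized, so W' == W at start

def lora_forward(x):
    """Forward pass through the frozen weight plus the low-rank adapter."""
    return x @ (W + (alpha / r) * (B @ A)).T

x = rng.normal(size=(1, k))
# With B still zero, the adapted model reproduces the frozen model exactly.
assert np.allclose(lora_forward(x), x @ W.T)
```

During finetuning only A and B receive gradient updates, which is what makes LoRA much cheaper than full finetuning of the 437,605-example run described above.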