<h1 align="center">FlagEmbedding</h1>
<p align="center">
<a href="https://github.com/FlagOpen/FlagEmbedding">
<img alt="Build" src="https://img.shields.io/badge/Contribution-Welcome-blue">
</a>
<a href="https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE">
<img alt="License" src="https://img.shields.io/badge/LICENSE-MIT-green">
</a>
<a href="https://huggingface.co/C-MTEB">
<img alt="Build" src="https://img.shields.io/badge/C_MTEB-🤗-yellow">
</a>
<a href="https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding">
<img alt="Build" src="https://img.shields.io/badge/FlagEmbedding-1.1-red">
</a>
</p>
<h4 align="center">
<p>
<a href="#news">News</a> |
<a href="#projects">Projects</a> |
<a href="#model-list">Model List</a> |
<a href="#contributor">Contributor</a> |
<a href="#citation">Citation</a> |
<a href="#license">License</a>
</p>
</h4>
[English](README.md) | [中文](https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md)
FlagEmbedding focuses on retrieval-augmented LLMs and currently consists of the following projects:
- **Long-Context LLM**: [Activation Beacon](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/activation_beacon), [LongLLM QLoRA](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/longllm_qlora)
- **Fine-tuning of LM** : [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail)
- **Embedding Model**: [Visualized-BGE](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/visual), [BGE-M3](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3), [LLM Embedder](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder), [BGE Embedding](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding)
- **Reranker Model**: [llm rerankers](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker), [BGE Reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker)
- **Benchmark**: [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB)
## News
- 4/30/2024: Release [Llama-3-8B-Instruct-80K-QLoRA](https://huggingface.co/namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA), extending the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA training on a few synthesized long-context data. The model achieves remarkable performance on various long-context benchmarks. [Code](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/longllm_qlora) :fire:
- 3/18/2024: Release new [rerankers](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker), built upon the powerful M3 and LLM (GEMMA and MiniCPM, not so large actually :smiley:) backbones, supporting multilingual processing and larger inputs, with massive improvements in ranking performance on BEIR, C-MTEB/Retrieval, MIRACL, and LlamaIndex Evaluation :fire:
- 3/18/2024: Release [Visualized-BGE](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/visual), equipping BGE with visual capabilities. Visualized-BGE can be utilized to generate embeddings for hybrid image-text data. :fire:
- 1/30/2024: Release **BGE-M3**, a new member of the BGE model series! M3 stands for **M**ulti-Linguality (100+ languages), **M**ulti-Granularity (input length up to 8192), and **M**ulti-Functionality (unification of dense, lexical, and multi-vector/ColBERT retrieval).
It is the first embedding model that supports all three retrieval methods, achieving new SOTA on multi-lingual (MIRACL) and cross-lingual (MKQA) benchmarks.
[Technical Report](https://arxiv.org/pdf/2402.03216.pdf) and [Code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3). :fire:
- 1/9/2024: Release [Activation-Beacon](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/activation_beacon), an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLM. [Technical Report](https://arxiv.org/abs/2401.03462)
- 12/24/2023: Release **LLaRA**, a LLaMA-7B-based dense retriever that achieves state-of-the-art performance on MS MARCO and BEIR. The model and code will be open-sourced; please stay tuned. [Technical Report](https://arxiv.org/abs/2312.15503)
- 11/23/2023: Release [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail), a method to maintain general capabilities during fine-tuning by merging multiple language models. [Technical Report](https://arxiv.org/abs/2311.13534)
- 10/12/2023: Release [LLM-Embedder](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder), a unified embedding model to support diverse retrieval augmentation needs for LLMs. [Technical Report](https://arxiv.org/pdf/2310.07554.pdf)
- 09/15/2023: The [technical report](https://arxiv.org/pdf/2309.07597.pdf) of BGE has been released
- 09/15/2023: The [massive training data](https://data.baai.ac.cn/details/BAAI-MTP) of BGE has been released
- 09/12/2023: New models:
- **New reranker model**: release the cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than embedding models. We recommend using or fine-tuning them to re-rank the top-k documents returned by embedding models.
- **Update embedding model**: release the `bge-*-v1.5` embedding models to alleviate the issue of the similarity distribution and enhance their retrieval ability without instructions.
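As a rough illustration of the retrieve-then-rerank pattern these cross-encoders are meant for, here is a minimal sketch with toy stand-ins for both stages (the scoring functions below are hypothetical placeholders, not the BGE APIs):

```python
from collections import Counter
from math import sqrt

def embed(text):
    """Toy bag-of-words 'embedding', standing in for a bi-encoder such as bge-*-v1.5."""
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse bag-of-words vectors.
    dot = sum(a[t] * b[t] for t in a)
    return dot / (sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values())) or 1.0)

def rerank_score(query, doc):
    """Toy pairwise score, standing in for a cross-encoder such as bge-reranker-large.
    A real cross-encoder reads (query, doc) jointly; here we just count query-term hits."""
    return sum(doc.lower().count(w) for w in query.lower().split())

def retrieve_then_rerank(query, corpus, k=3):
    q = embed(query)
    # Stage 1: cheap retrieval narrows the corpus to the top-k candidates.
    candidates = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]
    # Stage 2: the more expensive reranker re-orders only those k candidates.
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
```

The point of the two stages is cost: the reranker scores each (query, document) pair jointly and is far more expensive than a vector lookup, so it is only applied to the small candidate set.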
<details>
<summary>More</summary>
<!-- ### More -->
- 09/07/2023: Update [fine-tune code](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md): Add script to mine hard negatives and support adding instruction during fine-tuning.
- 08/09/2023: BGE Models are integrated into **Langchain**, you can use it like [this](#using-langchain); C-MTEB **leaderboard** is [available](https://huggingface.co/spaces/mteb/leaderboard).
- 08/05/2023: Release base-scale and small-scale models, with the **best performance among models of the same size 🤗**
- 08/02/2023: Release the `bge-large-*` (short for BAAI General Embedding) models, **ranking 1st on the MTEB and C-MTEB benchmarks!** :tada: :tada:
- 08/01/2023: We release the [Chinese Massive Text Embedding Benchmark](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB) (**C-MTEB**), consisting of 31 test datasets.
</details>
## Projects
### BGE-M3([Paper](https://arxiv.org/pdf/2402.03216.pdf), [Code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3))
In this project, we introduce BGE-M3, the first embedding model that supports multiple retrieval modes, multilinguality, and multi-granularity retrieval.
- Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval.
- Multi-Linguality: It can support more than 100 working languages.
- Multi-Granularity: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.
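The three functionalities can be illustrated on a single query-document pair. The vectors and token weights below are made-up toy values, and the scoring rules are only a minimal sketch of dense, sparse/lexical, and multi-vector (ColBERT-style late-interaction) matching, not the model's actual outputs:

```python
import numpy as np

# Toy representations; dimensions and values are invented for illustration.
q_dense = np.array([0.6, 0.8])                 # one dense vector per text
d_dense = np.array([0.8, 0.6])

q_sparse = {"deep": 1.2, "learning": 0.9}      # token -> learned lexical weight
d_sparse = {"deep": 1.0, "nets": 0.7}

q_multi = np.array([[1.0, 0.0], [0.0, 1.0]])   # one vector per query token
d_multi = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])  # one per doc token

def dense_score(q, d):
    # Dense retrieval: cosine similarity between the two single vectors.
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

def sparse_score(q, d):
    # Sparse/lexical retrieval: product of weights summed over overlapping tokens.
    return sum(w * d[t] for t, w in q.items() if t in d)

def multi_vector_score(q, d):
    # Late interaction: each query token takes its best-matching document token.
    return float((q @ d.T).max(axis=1).sum())
```

All three scores can be computed from one forward pass of the model, which is what "unification" of the three retrieval methods means in practice.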
We propose a novel self-knowledge distillation approach to improve the performance of each single retrieval mode.
We optimize the batching strategy to enable a large batch size, which can be used simply when fine-tuning with long text or a large language model.
We also construct a dataset for document retrieval and propose a simple strategy to improve the ability to model long text.
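A minimal sketch of the self-knowledge distillation idea: the combined score of the three retrieval modes serves as a teacher signal that each individual mode is trained toward. The cross-entropy loss form, equal weighting, and plain score summation here are assumptions for illustration, not the paper's exact recipe:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def self_distill_loss(dense, sparse, multi):
    """dense/sparse/multi: relevance scores of one query against N candidate passages.
    The summed score acts as the teacher; each mode (student) is pulled toward it
    via cross-entropy between the two distributions (an assumed loss form)."""
    teacher = softmax(dense + sparse + multi)
    loss = 0.0
    for student_scores in (dense, sparse, multi):
        student = softmax(student_scores)
        loss += -np.sum(teacher * np.log(student + 1e-12))
    return loss / 3.0
```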
**The training code and fine-tuning data will be open-sourced in the near future.**
### [Visualized-BGE](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/visual)
In this project, we introduce Visualized-BGE, which integrates image token embedding into the BGE Text Embedding framework. Visualized-BGE can be used for various hybrid modal retrieval tasks, such as Multi-Modal K