<h1 align="center">FlagEmbedding</h1>
<p align="center">
<a href="https://github.com/FlagOpen/FlagEmbedding">
<img alt="Build" src="https://img.shields.io/badge/Contribution-Welcome-blue">
</a>
<a href="https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE">
<img alt="License" src="https://img.shields.io/badge/LICENSE-MIT-green">
</a>
<a href="https://huggingface.co/C-MTEB">
<img alt="Build" src="https://img.shields.io/badge/C_MTEB-🤗-yellow">
</a>
<a href="https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding">
<img alt="Build" src="https://img.shields.io/badge/FlagEmbedding-1.1-red">
</a>
</p>
<h4 align="center">
<p>
<a href="#news">News</a> |
<a href="#projects">Projects</a> |
<a href="#model-list">Model List</a> |
<a href="#contributor">Contributor</a> |
<a href="#citation">Citation</a> |
<a href="#license">License</a>
</p>
</h4>
[English](README.md) | [中文](https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md)
FlagEmbedding focuses on retrieval-augmented LLMs and currently consists of the following projects:
- **Long-Context LLM**: [Activation Beacon](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/activation_beacon), [LongLLM QLoRA](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/longllm_qlora)
- **Fine-tuning of LM** : [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail)
- **Embedding Model**: [Visualized-BGE](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/visual), [BGE-M3](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3), [LLM Embedder](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder), [BGE Embedding](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding)
- **Reranker Model**: [llm rerankers](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker), [BGE Reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker)
- **Benchmark**: [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB)
## News
- 4/30/2024: Release [Llama-3-8B-Instruct-80K-QLoRA](https://huggingface.co/namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA), extending the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA training on a few synthesized long-context data. The model achieves remarkable performance on various long-context benchmarks. [Code](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/longllm_qlora) :fire:
- 3/18/2024: Release new [rerankers](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker), built upon the powerful M3 and LLM (GEMMA and MiniCPM, not so large actually :smiley:) backbones, supporting multilingual processing and larger inputs, with massive improvements in ranking performance on BEIR, C-MTEB/Retrieval, MIRACL, and LlamaIndex Evaluation :fire:
- 3/18/2024: Release [Visualized-BGE](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/visual), equipping BGE with visual capabilities. Visualized-BGE can be utilized to generate embeddings for hybrid image-text data. :fire:
- 1/30/2024: Release **BGE-M3**, a new member of the BGE model series! M3 stands for **M**ulti-Linguality (100+ languages), **M**ulti-Granularity (input length up to 8192), and **M**ulti-Functionality (unification of dense, lexical, and multi-vector/ColBERT retrieval).
It is the first embedding model that supports all three retrieval methods, achieving new SOTA on multi-lingual (MIRACL) and cross-lingual (MKQA) benchmarks.
[Technical Report](https://arxiv.org/pdf/2402.03216.pdf) and [Code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3). :fire:
- 1/9/2024: Release [Activation-Beacon](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/activation_beacon), an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLM. [Technical Report](https://arxiv.org/abs/2401.03462)
- 12/24/2023: Release **LLaRA**, a LLaMA-7B-based dense retriever that achieves state-of-the-art performance on MS MARCO and BEIR. The model and code will be open-sourced; please stay tuned. [Technical Report](https://arxiv.org/abs/2312.15503)
- 11/23/2023: Release [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail), a method to maintain general capabilities during fine-tuning by merging multiple language models. [Technical Report](https://arxiv.org/abs/2311.13534)
- 10/12/2023: Release [LLM-Embedder](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder), a unified embedding model to support diverse retrieval augmentation needs for LLMs. [Technical Report](https://arxiv.org/pdf/2310.07554.pdf)
- 09/15/2023: The [technical report](https://arxiv.org/pdf/2309.07597.pdf) of BGE has been released
- 09/15/2023: The [massive training data](https://data.baai.ac.cn/details/BAAI-MTP) of BGE has been released
- 09/12/2023: New models:
- **New reranker model**: release the cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than embedding models. We recommend using or fine-tuning them to re-rank the top-k documents returned by embedding models.
- **Update embedding model**: release the `bge-*-v1.5` embedding models to alleviate the issue of the similarity distribution and enhance their retrieval ability without instructions.
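As a rough illustration of the retrieve-then-rerank pattern these cross-encoders are meant for, here is a minimal sketch with toy stand-ins for both stages (the scoring functions below are hypothetical placeholders, not the BGE APIs):

```python
from collections import Counter
from math import sqrt

def embed(text):
    """Toy bag-of-words 'embedding', standing in for a bi-encoder such as bge-*-v1.5."""
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse bag-of-words vectors.
    dot = sum(a[t] * b[t] for t in a)
    return dot / (sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values())) or 1.0)

def rerank_score(query, doc):
    """Toy pairwise score, standing in for a cross-encoder such as bge-reranker-large.
    A real cross-encoder reads (query, doc) jointly; here we just count query-term hits."""
    return sum(doc.lower().count(w) for w in query.lower().split())

def retrieve_then_rerank(query, corpus, k=3):
    q = embed(query)
    # Stage 1: cheap retrieval narrows the corpus to the top-k candidates.
    candidates = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]
    # Stage 2: the more expensive reranker re-orders only those k candidates.
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
```

The point of the two stages is cost: the reranker scores each (query, document) pair jointly and is far more expensive than a vector lookup, so it is only applied to the small candidate set.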
<details>
<summary>More</summary>
<!-- ### More -->
- 09/07/2023: Update [fine-tune code](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md): Add script to mine hard negatives and support adding instruction during fine-tuning.
- 08/09/2023: BGE Models are integrated into **Langchain**, you can use it like [this](#using-langchain); C-MTEB **leaderboard** is [available](https://huggingface.co/spaces/mteb/leaderboard).
- 08/05/2023: Release base-scale and small-scale models, with the **best performance among models of the same size 🤗**
- 08/02/2023: Release the `bge-large-*` (short for BAAI General Embedding) models, **ranking 1st on the MTEB and C-MTEB benchmarks!** :tada: :tada:
- 08/01/2023: We release the [Chinese Massive Text Embedding Benchmark](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB) (**C-MTEB**), consisting of 31 test datasets.
</details>
## Projects
### BGE-M3([Paper](https://arxiv.org/pdf/2402.03216.pdf), [Code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3))
In this project, we introduce BGE-M3, the first embedding model that supports multiple retrieval modes, multilinguality, and multi-granularity retrieval.
- Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval.
- Multi-Linguality: It can support more than 100 working languages.
- Multi-Granularity: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.
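The three functionalities can be illustrated on a single query-document pair. The vectors and token weights below are made-up toy values, and the scoring rules are only a minimal sketch of dense, sparse/lexical, and multi-vector (ColBERT-style late-interaction) matching, not the model's actual outputs:

```python
import numpy as np

# Toy representations; dimensions and values are invented for illustration.
q_dense = np.array([0.6, 0.8])                 # one dense vector per text
d_dense = np.array([0.8, 0.6])

q_sparse = {"deep": 1.2, "learning": 0.9}      # token -> learned lexical weight
d_sparse = {"deep": 1.0, "nets": 0.7}

q_multi = np.array([[1.0, 0.0], [0.0, 1.0]])   # one vector per query token
d_multi = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])  # one per doc token

def dense_score(q, d):
    # Dense retrieval: cosine similarity between the two single vectors.
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

def sparse_score(q, d):
    # Sparse/lexical retrieval: product of weights summed over overlapping tokens.
    return sum(w * d[t] for t, w in q.items() if t in d)

def multi_vector_score(q, d):
    # Late interaction: each query token takes its best-matching document token.
    return float((q @ d.T).max(axis=1).sum())
```

All three scores can be computed from one forward pass of the model, which is what "unification" of the three retrieval methods means in practice.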
We propose a novel self-knowledge distillation approach to improve the performance of each single retrieval mode.
We optimize the batching strategy to enable a large batch size, which can be used simply when fine-tuning with long text or a large language model.
We also construct a dataset for document retrieval and propose a simple strategy to improve the ability to model long text.
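A minimal sketch of the self-knowledge distillation idea: the combined score of the three retrieval modes serves as a teacher signal that each individual mode is trained toward. The cross-entropy loss form, equal weighting, and plain score summation here are assumptions for illustration, not the paper's exact recipe:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def self_distill_loss(dense, sparse, multi):
    """dense/sparse/multi: relevance scores of one query against N candidate passages.
    The summed score acts as the teacher; each mode (student) is pulled toward it
    via cross-entropy between the two distributions (an assumed loss form)."""
    teacher = softmax(dense + sparse + multi)
    loss = 0.0
    for student_scores in (dense, sparse, multi):
        student = softmax(student_scores)
        loss += -np.sum(teacher * np.log(student + 1e-12))
    return loss / 3.0
```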
**The training code and fine-tuning data will be open-sourced in the near future.**
### [Visualized-BGE](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/visual)
In this project, we introduce Visualized-BGE, which integrates image token embedding into the BGE Text Embedding framework. Visualized-BGE can be used for various hybrid modal retrieval tasks, such as Multi-Modal K