<p align="center">
<br>
<img src="https://raw.githubusercontent.com/huggingface/datasets/master/docs/source/imgs/datasets_logo_name.jpg" width="400"/>
<br>
</p>
<p align="center">
<a href="https://circleci.com/gh/huggingface/datasets">
<img alt="Build" src="https://img.shields.io/circleci/build/github/huggingface/datasets/master">
</a>
<a href="https://github.com/huggingface/datasets/blob/master/LICENSE">
<img alt="GitHub" src="https://img.shields.io/github/license/huggingface/datasets.svg?color=blue">
</a>
<a href="https://huggingface.co/docs/datasets/index.html">
<img alt="Documentation" src="https://img.shields.io/website/http/huggingface.co/docs/datasets/index.html.svg?down_color=red&down_message=offline&up_message=online">
</a>
<a href="https://github.com/huggingface/datasets/releases">
<img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/datasets.svg">
</a>
<a href="https://huggingface.co/datasets/">
<img alt="Number of datasets" src="https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen">
</a>
<a href="CODE_OF_CONDUCT.md">
<img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg">
</a>
<a href="https://zenodo.org/badge/latestdoi/250213286"><img src="https://zenodo.org/badge/250213286.svg" alt="DOI"></a>
</p>
🤗 Datasets is a lightweight library providing **two** main features:

- **one-line dataloaders for many public datasets**: one-liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (in 467 languages and dialects!) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("squad")`, get any of these datasets ready to use in a dataloader for training/evaluating an ML model (NumPy/pandas/PyTorch/TensorFlow/JAX),
- **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the above public datasets as well as your own local datasets in CSV/JSON/text. With simple commands like `tokenized_dataset = dataset.map(tokenize_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training (see the sketch right after this list).
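
For example, here is a minimal sketch combining both features on a local file. The file name `my_data.csv`, its `title` column, and the `add_upper_title` helper are assumptions for illustration only:

```python
from datasets import load_dataset

# Load a local CSV file as a dataset (hypothetical file name)
dataset = load_dataset("csv", data_files="my_data.csv")

# Pre-process every example with map(); the "title" column is assumed to exist
def add_upper_title(example):
    return {"title_upper": example["title"].upper()}

dataset = dataset.map(add_upper_title)
```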
[🎓 **Documentation**](https://huggingface.co/docs/datasets/) [🕹 **Colab tutorial**](https://colab.research.google.com/github/huggingface/datasets/blob/master/notebooks/Overview.ipynb)

[🔎 **Find a dataset in the Hub**](https://huggingface.co/datasets) [🌟 **Add a new dataset to the Hub**](https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md)
<h3 align="center">
<a href="https://hf.co/course"><img src="https://raw.githubusercontent.com/huggingface/datasets/master/docs/source/imgs/course_banner.png"></a>
</h3>
🤗 Datasets also provides access to more than 15 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics.

🤗 Datasets has many additional interesting features:

- Thrive on large datasets: 🤗 Datasets naturally frees you from RAM limitations; all datasets are memory-mapped using an efficient zero-serialization-cost backend (Apache Arrow).
- Smart caching: never wait for your data to be processed several times.
- Lightweight and fast with a transparent and pythonic API (multi-processing/caching/memory-mapping).
- Built-in interoperability with NumPy, pandas, PyTorch, TensorFlow 2 and JAX (see the sketch below).
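
As a quick sketch of the interoperability feature, a dataset can be asked to return framework-native tensors; this assumes PyTorch and 🤗 Transformers are installed:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Tokenize the contexts, then format the model inputs as PyTorch tensors
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
dataset = load_dataset("squad", split="train")
dataset = dataset.map(lambda x: tokenizer(x["context"], truncation=True), batched=True)
dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
print(type(dataset[0]["input_ids"]))  # a torch.Tensor
```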
🤗 Datasets originated from a fork of the awesome [TensorFlow Datasets](https://github.com/tensorflow/datasets) and the HuggingFace team wants to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and `tfds` can be found in the section [Main differences between 🤗 Datasets and `tfds`](#main-differences-between--datasets-and-tfds).
# Installation
## With pip
🤗 Datasets can be installed from PyPI and is best installed in a virtual environment (venv or conda, for instance):
```bash
pip install datasets
```
## With conda
🤗 Datasets can be installed using conda as follows:
```bash
conda install -c huggingface -c conda-forge datasets
```
Follow the installation pages of TensorFlow and PyTorch to see how to install them with conda.
For more details on installation, check the installation page in the documentation: https://huggingface.co/docs/datasets/installation.html
## Installation to use with PyTorch/TensorFlow/pandas
If you plan to use 🤗 Datasets with PyTorch (1.0+), TensorFlow (2.2+) or pandas, you should also install PyTorch, TensorFlow or pandas.
For more details on using the library with NumPy, pandas, PyTorch or TensorFlow, check the quick tour page in the documentation: https://huggingface.co/docs/datasets/quicktour.html
# Usage
🤗 Datasets is made to be very simple to use. The main methods are:
- `datasets.list_datasets()` to list the available datasets
- `datasets.load_dataset(dataset_name, **kwargs)` to instantiate a dataset
- `datasets.list_metrics()` to list the available metrics
- `datasets.load_metric(metric_name, **kwargs)` to instantiate a metric
Here is a quick example:
```python
from datasets import list_datasets, load_dataset, list_metrics, load_metric
# Print all the available datasets
print(list_datasets())
# Load a dataset and print the first example in the training set
squad_dataset = load_dataset('squad')
print(squad_dataset['train'][0])
# List all the available metrics
print(list_metrics())
# Load a metric
squad_metric = load_metric('squad')
# Process the dataset - add a column with the length of the context texts
dataset_with_length = squad_dataset.map(lambda x: {"length": len(x["context"])})
# Process the dataset - tokenize the context texts (using a tokenizer from the ð¤ Transformers library)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
tokenized_dataset = squad_dataset.map(lambda x: tokenizer(x['context']), batched=True)
```
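
The metric loaded above can then score predictions against references. Below is a minimal sketch with made-up toy inputs (the SQuAD metric expects this id/answers format):

```python
# Score a toy prediction against a toy reference with the SQuAD metric
predictions = [{"id": "0", "prediction_text": "Denver Broncos"}]
references = [{"id": "0", "answers": {"text": ["Denver Broncos"], "answer_start": [177]}}]
print(squad_metric.compute(predictions=predictions, references=references))
# expected for these matching toy inputs: {'exact_match': 100.0, 'f1': 100.0}
```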
For more details on using the library, check the quick tour page in the documentation: https://huggingface.co/docs/datasets/quicktour.html and the specific pages on:
- Loading a dataset https://huggingface.co/docs/datasets/loading_datasets.html
- What's in a Dataset: https://huggingface.co/docs/datasets/exploring.html
- Processing data with 🤗 Datasets: https://huggingface.co/docs/datasets/processing.html
- Writing your own dataset loading script: https://huggingface.co/docs/datasets/add_dataset.html
- etc.
Another introduction to 🤗 Datasets is the tutorial on Google Colab here:
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/datasets/blob/master/notebooks/Overview.ipynb)
# Add a new dataset to the Hub
We have a very detailed step-by-step guide to add a new dataset to the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) datasets already provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets).
You will find [the step-by-step guide here](https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md) to add a dataset to this repository.
You can also have your own repository for your dataset on the Hub under your or your organization's namespace and share it with the community. More information in [the documentation section about dataset sharing](https://huggingface.co/docs/datasets/share_dataset.html).
# Main differences between 🤗 Datasets and `tfds`
If you are familiar with the great TensorFlow Datasets, here are the main differences between 🤗 Datasets and `tfds`:

- the scripts in 🤗 Datasets are not provided within the library but are queried, downloaded/cached and dynamically loaded upon request
- 🤗 Datasets also provides evaluation metrics in a similar fashion to the datasets, i.e. as dynamically installed scripts with a unified API
- the backend serialization of 🤗 Datasets is based on Apache Arrow instead of TF Records and leverages python dataclasses for info and features (encoding and preprocessing are mostly done in a one-off preprocessing step rather than on the fly during reading)
- the user-facing dataset object of 🤗 Datasets is not a `tf.data.Dataset` but a built-in framework-agnostic dataset class with methods inspired by what we like in `tf.data` (like a `map()` method)