PyPI Download | datasets-1.13.3.tar.gz: Python datasets library download resource - CSDN Wenku


127 views, uploaded 2022-01-27 03:28:01, 246 KB, GZ

105 files in total

py: 86

json: 6

txt: 5

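PyPI Download | datasets-1.13.3.tar.gz: An In-Depth Look

Python is widely used for its concise syntax and extensive library support. PyPI (the Python Package Index) is the official repository for Python software and hosts a large ecosystem of third-party libraries that developers can download and share. This article looks at the "datasets-1.13.3.tar.gz" resource downloaded from PyPI, together with the related topics of distributed systems, cloud native computing, and Python libraries.

"datasets-1.13.3.tar.gz" is a compressed archive containing version 1.13.3 of the Python library "datasets". The library is used mainly in data science and machine learning projects and gives access to a large number of preprocessed, structured datasets, which makes data handling in Python faster and more convenient. After extracting the tar.gz file, developers will find the source code and the other files needed to install and use the "datasets" library.

Next is the "zookeeper" tag. ZooKeeper is a distributed coordination service widely used in large distributed systems for tasks such as data consistency and cluster management. When a Python project involves distributed computation or cross-node data synchronization, ZooKeeper may be used to provide reliable communication and data storage. The "datasets" library does not integrate ZooKeeper directly, but developers can combine the two to manage and distribute large datasets, particularly in data science projects that run in a distributed environment.

Then there is the "distributed" tag. In the cloud computing era, distributed systems have become an important way to handle big data and high concurrency. Although the "datasets" library focuses on data processing, it can be combined with distributed frameworks such as Spark or Hadoop when working with large datasets, so that data loading and computation are distributed and processing performance improves. Through Python interfaces, data can be operated on and managed in a distributed environment.

"Cloud native" is an approach to building and running applications that relies on containers, microservices, continuous delivery, and declarative APIs to achieve fast iteration and elastic scaling. In a cloud native environment, the "datasets" library can serve as a basic data analysis component that helps data scientists in the cloud obtain, process, and analyze data quickly. Cloud services also usually provide substantial compute resources, which makes processing large datasets feasible.

"Python library" refers to an essential part of Python programming. The Python ecosystem contains a very large number of libraries covering everything from web development to scientific computing. The "datasets" library illustrates Python's strength in data science: it simplifies data acquisition, transformation, and cleaning so that developers can concentrate on model building and analysis.

In summary, the "PyPI Download | datasets-1.13.3.tar.gz" resource is a practical Python tool for data science that, combined with distributed systems and cloud native practices, can play an important role in complex technical environments. Understanding these topics matters for modern data-driven project development.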
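The description above mentions extracting the tar.gz file to find the source code. As a minimal sketch of that inspection step, the standard-library `tarfile` module can list an archive's contents and count files by extension without unpacking anything. The helper names and the tiny `demo-0.1` archive built in memory are hypothetical stand-ins so that the example runs without a download; they are not part of the "datasets" package.

```python
import io
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def count_suffixes(archive: tarfile.TarFile) -> Counter:
    """Count the regular files in an open tar archive by file extension."""
    counts = Counter()
    for member in archive.getmembers():
        if member.isfile():
            suffix = PurePosixPath(member.name).suffix or "(none)"
            counts[suffix] += 1
    return counts


def build_demo_archive() -> bytes:
    """Build a tiny in-memory .tar.gz (hypothetical contents) for the demo."""
    buffer = io.BytesIO()
    names = (
        "demo-0.1/setup.py",
        "demo-0.1/src/demo/__init__.py",
        "demo-0.1/README.md",
    )
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name in names:
            payload = b"# placeholder\n"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


demo_bytes = build_demo_archive()
with tarfile.open(fileobj=io.BytesIO(demo_bytes), mode="r:gz") as archive:
    suffix_counts = count_suffixes(archive)

print(suffix_counts)
```

For the real download, the same helper could be applied to `tarfile.open("datasets-1.13.3.tar.gz", mode="r:gz")`, assuming the file sits in the working directory. The counts reported on this page (86 `.py`, 6 `.json`, 5 `.txt`) would be the figures to compare against.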


PyPI Download | datasets-1.13.3.tar.gz (105 files)

setup.cfg 857B

languages.json 30KB

licenses.json 25KB

tasks.json 3KB

creators.json 274B

multilingualities.json 211B

size_categories.json 185B

LICENSE 11KB

README.md 11KB

not-zip-safe 2B

PKG-INFO 4KB

arrow_dataset.py 182KB

load.py 81KB

builder.py 62KB

dataset_dict.py 45KB

features.py 43KB

table.py 37KB

search.py 30KB

arrow_writer.py 26KB

arrow_reader.py 25KB

file_utils.py 25KB

metric.py 24KB

dummy_data.py 23KB

splits.py 22KB

formatting.py 21KB

py_utils.py 21KB

iterable_dataset.py 21KB

fingerprint.py 19KB

metadata.py 18KB

info.py 17KB

inspect.py 15KB

filelock.py 14KB

streaming_download_manager.py 13KB

data_files.py 13KB

download_manager.py 12KB

readme.py 12KB

beam_utils.py 11KB

combine.py 10KB

mock_download_manager.py 9KB

convert.py 8KB

setup.py 8KB

test.py 8KB

json.py 8KB

csv.py 7KB

config.py 7KB

run_beam.py 6KB

extract.py 6KB

compression.py 6KB

logging.py 5KB

__init__.py 5KB

json.py 5KB

version.py 4KB

parquet.py 4KB

translation.py 4KB

tqdm_utils.py 4KB

info_utils.py 4KB

tf_formatter.py 4KB

csv.py 4KB

hf_api.py 3KB

keyhash.py 3KB

parquet.py 3KB

jax_formatter.py 3KB

audio.py 3KB

naming.py 3KB

streaming.py 3KB

text.py 3KB

__init__.py 3KB

torch_formatter.py 3KB

s3filesystem.py 2KB

patching.py 2KB

text.py 2KB

__init__.py 1KB

deprecation_utils.py 1KB

__init__.py 1KB

text_classification.py 1KB

__init__.py 1KB

pandas.py 1KB

image_classification.py 1KB

datasets_cli.py 1KB

env.py 1KB

question_answering.py 1KB

base.py 931B

abc.py 909B

automatic_speech_recognition.py 876B

summarization.py 712B

doc_utils.py 422B

__init__.py 325B

__init__.py 292B

typing.py 184B

__init__.py 0B

SOURCES.txt 3KB

requires.txt 2KB

105 entries in total

<p align="center">
  <br>
  <img src="https://raw.githubusercontent.com/huggingface/datasets/master/docs/source/imgs/datasets_logo_name.jpg" width="400"/>
  <br>
</p>
<p align="center">
  <a href="https://circleci.com/gh/huggingface/datasets">
    <img alt="Build" src="https://img.shields.io/circleci/build/github/huggingface/datasets/master">
  </a>
  <a href="https://github.com/huggingface/datasets/blob/master/LICENSE">
    <img alt="GitHub" src="https://img.shields.io/github/license/huggingface/datasets.svg?color=blue">
  </a>
  <a href="https://huggingface.co/docs/datasets/index.html">
    <img alt="Documentation" src="https://img.shields.io/website/http/huggingface.co/docs/datasets/index.html.svg?down_color=red&down_message=offline&up_message=online">
  </a>
  <a href="https://github.com/huggingface/datasets/releases">
    <img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/datasets.svg">
  </a>
  <a href="https://huggingface.co/datasets/">
    <img alt="Number of datasets" src="https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen">
  </a>
  <a href="CODE_OF_CONDUCT.md">
    <img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg">
  </a>
  <a href="https://zenodo.org/badge/latestdoi/250213286"><img src="https://zenodo.org/badge/250213286.svg" alt="DOI"></a>
</p>

🤗 Datasets is a lightweight library providing **two** main features:

- **one-line dataloaders for many public datasets**: one liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (in 467 languages and dialects!) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets).
  With a simple command like `squad_dataset = load_dataset("squad")`, get any of these datasets ready to use in a dataloader for training/evaluating an ML model (NumPy/pandas/PyTorch/TensorFlow/JAX).
- **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the above public datasets as well as your own local datasets in CSV/JSON/text. With simple commands like `tokenized_dataset = dataset.map(tokenize_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.

[**Documentation**](https://huggingface.co/docs/datasets/) [**Colab tutorial**](https://colab.research.google.com/github/huggingface/datasets/blob/master/notebooks/Overview.ipynb)

[**Find a dataset in the Hub**](https://huggingface.co/datasets) [**Add a new dataset to the Hub**](https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md)

<h3 align="center">
  <a href="https://hf.co/course"><img src="https://raw.githubusercontent.com/huggingface/datasets/master/docs/source/imgs/course_banner.png"></a>
</h3>

🤗 Datasets also provides access to 15+ evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics.

🤗 Datasets has many additional interesting features:

- Thrive on large datasets: 🤗 Datasets naturally frees the user from RAM limitations, as all datasets are memory-mapped using an efficient zero-serialization cost backend (Apache Arrow).
- Smart caching: never wait for your data to process several times.
- Lightweight and fast with a transparent and pythonic API (multi-processing/caching/memory-mapping).
- Built-in interoperability with NumPy, pandas, PyTorch, TensorFlow 2 and JAX.

🤗 Datasets originated from a fork of the awesome [TensorFlow Datasets](https://github.com/tensorflow/datasets) and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library.
More details on the differences between 🤗 Datasets and `tfds` can be found in the section [Main differences between 🤗 Datasets and `tfds`](#main-differences-between-🤗-datasets-and-tfds).

# Installation

## With pip

🤗 Datasets can be installed from PyPI and should be installed in a virtual environment (venv or conda for instance):

```bash
pip install datasets
```

## With conda

🤗 Datasets can be installed using conda as follows:

```bash
conda install -c huggingface -c conda-forge datasets
```

Follow the installation pages of TensorFlow and PyTorch to see how to install them with conda.

For more details on installation, check the installation page in the documentation: https://huggingface.co/docs/datasets/installation.html

## Installation to use with PyTorch/TensorFlow/pandas

If you plan to use 🤗 Datasets with PyTorch (1.0+), TensorFlow (2.2+) or pandas, you should also install PyTorch, TensorFlow or pandas.

For more details on using the library with NumPy, pandas, PyTorch or TensorFlow, check the quick tour page in the documentation: https://huggingface.co/docs/datasets/quicktour.html

# Usage

🤗 Datasets is made to be very simple to use.
The main methods are:

- `datasets.list_datasets()` to list the available datasets
- `datasets.load_dataset(dataset_name, **kwargs)` to instantiate a dataset
- `datasets.list_metrics()` to list the available metrics
- `datasets.load_metric(metric_name, **kwargs)` to instantiate a metric

Here is a quick example:

```python
from datasets import list_datasets, load_dataset, list_metrics, load_metric

# Print all the available datasets
print(list_datasets())

# Load a dataset and print the first example in the training set
squad_dataset = load_dataset('squad')
print(squad_dataset['train'][0])

# List all the available metrics
print(list_metrics())

# Load a metric
squad_metric = load_metric('squad')

# Process the dataset - add a column with the length of the context texts
dataset_with_length = squad_dataset.map(lambda x: {"length": len(x["context"])})

# Process the dataset - tokenize the context texts (using a tokenizer from the 🤗 Transformers library)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

tokenized_dataset = squad_dataset.map(lambda x: tokenizer(x['context']), batched=True)
```

For more details on using the library, check the quick tour page in the documentation: https://huggingface.co/docs/datasets/quicktour.html and the specific pages on:

- Loading a dataset: https://huggingface.co/docs/datasets/loading_datasets.html
- What's in a Dataset: https://huggingface.co/docs/datasets/exploring.html
- Processing data with 🤗 Datasets: https://huggingface.co/docs/datasets/processing.html
- Writing your own dataset loading script: https://huggingface.co/docs/datasets/add_dataset.html
- etc.
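The two `map` calls in the quick example differ in what the mapped function receives: without `batched=True` it gets one example as a dict, and with `batched=True` it gets a dict whose values are lists covering a batch of examples. Below is a minimal sketch of both function shapes. It uses plain dictionaries as stand-ins for what `map` would pass in, so it runs without the library installed; the function names and sample texts are illustrative assumptions.

```python
from typing import Dict, List


def add_length(example: Dict[str, str]) -> Dict[str, int]:
    """Per-example shape: one row in, a dict of new column values out."""
    return {"length": len(example["context"])}


def add_length_batched(batch: Dict[str, List[str]]) -> Dict[str, List[int]]:
    """Batched shape: a dict of column lists in, lists of new values out."""
    return {"length": [len(text) for text in batch["context"]]}


# Plain-Python stand-ins for the inputs that `map` would supply.
rows = [{"context": "short text"}, {"context": "a somewhat longer text"}]
batch = {"context": [row["context"] for row in rows]}

per_example = [add_length(row)["length"] for row in rows]
batched = add_length_batched(batch)["length"]
print(per_example, batched)
```

With the library installed, these would be passed as `dataset.map(add_length)` and `dataset.map(add_length_batched, batched=True)` respectively. The batched form is what lets a tokenizer process many texts in a single call.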
Another introduction to 🤗 Datasets is the tutorial on Google Colab here:
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/datasets/blob/master/notebooks/Overview.ipynb)

# Add a new dataset to the Hub

We have a very detailed step-by-step guide to add a new dataset to the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) datasets already provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets).

You will find [the step-by-step guide here](https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md) to add a dataset to this repository.

You can also have your own repository for your dataset on the Hub under your or your organization's namespace and share it with the community. More information in [the documentation section about dataset sharing](https://huggingface.co/docs/datasets/share_dataset.html).

# Main differences between 🤗 Datasets and `tfds`

If you are familiar with the great TensorFlow Datasets, here are the main differences betwe
