[![PyPI version](https://badge.fury.io/py/cloud-files.svg)](https://badge.fury.io/py/cloud-files) [![Test Suite](https://github.com/seung-lab/cloud-files/workflows/Test%20Suite/badge.svg)](https://github.com/seung-lab/cloud-files/actions?query=workflow%3A%22Test+Suite%22)
CloudFiles: Fast access to cloud storage and local FS.
========
```python
from cloudfiles import CloudFiles, dl
results = dl(["gs://bucket/file1", "gs://bucket2/file2", ... ]) # shorthand
cf = CloudFiles('gs://bucket', progress=True) # s3://, https://, and file:// also supported
results = cf.get(['file1', 'file2', 'file3', ..., 'fileN']) # threaded
results = cf.get(paths, parallel=2) # threaded and two processes
file1 = cf['file1']
part = cf['file1', 0:30] # first 30 bytes of file1
cf.put('filename', content)
cf.put_json('filename', content)
cf.puts([{
    'path': 'filename',
    'content': content,
}, ... ]) # automatically threaded
cf.puts(content, parallel=2) # threaded + two processes
cf.puts(content, storage_class="NEARLINE") # apply vendor-specific storage class
cf.put_jsons(...) # same as puts
cf['filename'] = content
for fname in cf.list(prefix='abc123'):
    print(fname)
list(cf) # same as list(cf.list())
cf.delete('filename')
del cf['filename']
cf.delete([ 'filename_1', 'filename_2', ... ]) # threaded
cf.delete(paths, parallel=2) # threaded + two processes
boolean = cf.exists('filename')
results = cf.exists([ 'filename_1', ... ]) # threaded
```
CloudFiles was developed to access files from object storage without ever touching disk. The goal was to reliably and rapidly access a petabyte of image data, broken into tens to hundreds of millions of files, in parallel across thousands of cores. The predecessor of CloudFiles, `CloudVolume.Storage`, the core of which is retained here, has been used to process dozens of images, many of them in the hundreds-of-terabytes range. Storage has reliably read and written tens of billions of files to date.
## Highlights
1. Fast file access with transparent threading and optional multi-processing.
2. Google Cloud Storage, Amazon S3, local filesystems, and arbitrary web servers, making hybrid or multi-cloud access easy.
3. Robust to flaky network connections. Uses exponential random window retries to avoid network collisions on a large cluster. Validates MD5 checksums for GCS and S3.
4. gzip, brotli, and zstd compression (see the example after this list).
5. Supports HTTP Range reads.
6. Supports green threads, which are important for achieving maximum performance on virtualized servers.
7. High efficiency transfers that avoid compression/decompression cycles.
8. High speed gzip decompression using libdeflate (compared with zlib).
9. Bundled CLI tool.
10. Accepts iterator and generator input.
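As a taste of highlights 4 and 5, the sketch below writes an object with a chosen codec and performs a byte-range read. The bucket and filenames are placeholders, and the `compress` keyword is assumed to accept codec names such as `'gzip'`.
```python
from cloudfiles import CloudFiles

cf = CloudFiles('gs://bucket')  # placeholder bucket

# Highlight 4: pick a compression codec at write time (assumed keyword).
cf.put('compressed.bin', b'hello world' * 1000, compress='gzip')

# Highlight 5: HTTP Range read of just the first 11 bytes of an
# uncompressed object, mirroring the cf['file1', 0:30] syntax above.
cf.put('raw.bin', b'hello world')
first_bytes = cf['raw.bin', 0:11]  # b'hello world'
```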
## Installation
```bash
pip install cloud-files
pip install cloud-files[test] # to enable testing with pytest
```
If you run into trouble installing dependencies, make sure you're using at least Python 3.6 and that you have updated pip. On Linux, some dependencies require manylinux2010 or manylinux2014 binaries, which earlier versions of pip do not search for. macOS, Linux, and Windows are supported platforms.
### Credentials
You may wish to install credentials under `~/.cloudvolume/secrets`. CloudFiles is descended from CloudVolume, and for now we'll leave the same configuration structure in place.
You need credentials only for the services you'll use. The local filesystem doesn't need any. Google Storage ([setup instructions here](https://github.com/seung-lab/cloud-volume/wiki/Setting-up-Google-Cloud-Storage)) will attempt to use default account credentials if no service account is provided.
If neither of those two conditions applies, you need a service account credential. `google-secret.json` is a service account credential for Google Storage, `aws-secret.json` is a service account for S3, etc. You can support multiple projects at once by prefixing the credential filename with the bucket you plan to access. `google-secret.json` will be your default service account, but if you also want to access bucket ABC, you can provide `ABC-google-secret.json` and you'll have simultaneous access to your ordinary buckets and ABC. The secondary credentials are looked up by bucket name, not project name.
```bash
mkdir -p ~/.cloudvolume/secrets/
mv aws-secret.json ~/.cloudvolume/secrets/ # needed for Amazon
mv google-secret.json ~/.cloudvolume/secrets/ # needed for Google
mv matrix-secret.json ~/.cloudvolume/secrets/ # needed for Matrix
```
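If you prefer not to install files under `~/.cloudvolume/secrets`, credentials can also be supplied at runtime through the constructor's `secrets` argument (documented below). Here is a minimal sketch that assumes `secrets` accepts a dictionary shaped like the credential file; the path is a placeholder.
```python
import json
from cloudfiles import CloudFiles

# Load (or otherwise construct) the credential dictionary.
with open('/path/to/aws-secret.json') as f:
    aws_secret = json.load(f)

# Hand the credential directly to CloudFiles instead of relying on
# ~/.cloudvolume/secrets.
cf = CloudFiles('s3://bucket', secrets=aws_secret)
```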
#### `aws-secret.json` and `matrix-secret.json`
Create an [IAM user service account](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users.html) that can read, write, and delete objects from at least one bucket.
```json
{
  "AWS_ACCESS_KEY_ID": "$MY_AWS_ACCESS_KEY_ID",
  "AWS_SECRET_ACCESS_KEY": "$MY_SECRET_ACCESS_TOKEN",
  "AWS_DEFAULT_REGION": "$MY_AWS_REGION"
}
```
`AWS_DEFAULT_REGION` is optional and defaults to `us-east-1`.
#### `google-secret.json`
You can create the `google-secret.json` file [here](https://console.cloud.google.com/iam-admin/serviceaccounts). You don't need to fill in the JSON by hand; the example below shows what the end result should look like. The service account should be able to read, write, and delete objects from at least one bucket.
```json
{
"type": "service_account",
"project_id": "$YOUR_GOOGLE_PROJECT_ID",
"private_key_id": "...",
"private_key": "...",
"client_email": "...",
"client_id": "...",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://accounts.google.com/o/oauth2/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": ""
}
```
## API Documentation
Note that the "Cloud Costs" mentioned below are current as of June 2020 and are subject to change. As of this writing, S3 and Google use identical cost structures for these operations.
### Constructor
```python
# import gevent.monkey # uncomment when using green threads
# gevent.monkey.patch_all(thread=False)
from cloudfiles import CloudFiles
cf = CloudFiles(
    cloudpath, progress=False,
    green=False, secrets=None, num_threads=20,
    use_https=False, endpoint=None, request_payer=None
)
# cloudpath examples:
cf = CloudFiles('gs://bucket/') # google cloud storage
cf = CloudFiles('s3://bucket/') # Amazon S3
cf = CloudFiles('s3://https://s3emulator.com/coolguy/') # alternate s3 endpoint
cf = CloudFiles('file:///home/coolguy/') # local filesystem
cf = CloudFiles('mem:///home/coolguy/') # in memory
cf = CloudFiles('https://website.com/coolguy/') # arbitrary web server
```
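As noted for the `green` parameter below, gevent's monkey patching must run before any other imports. A minimal sketch of a green-threaded configuration (the bucket and filenames are placeholders):
```python
import gevent.monkey
gevent.monkey.patch_all(thread=False)  # must run before other imports

from cloudfiles import CloudFiles

cf = CloudFiles('gs://bucket', green=True, progress=True)
results = cf.get(['file1', 'file2'])  # requests now run on green threads
```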
* cloudpath: The path to the bucket you are accessing. The path is formatted as `$PROTOCOL://BUCKET/PATH`. Files will then be accessed relative to the path. The protocols supported are `gs` (GCS), `s3` (AWS S3), `file` (local FS), `mem` (RAM), and `http`/`https`.
* progress: Whether to display a progress bar when processing multiple items simultaneously.
* green: Use green threads. For this to work properly, you must uncomment the two gevent lines at the top of the constructor example, as in the sketch above.
* secrets: Provide secrets dynamically rather than fetching from the credentials directory `$HOME/.cloudvolume/secrets`.
* num_threads: Number of simultaneous requests to make. Usually 20 per core is pretty close to optimal unless file sizes are extreme.
* use_https: `gs://` and `s3://` require credentials to access their files. However, each has a read-only https endpoint that sometimes requires no credentials. If True, automatically convert `gs://` to `https://storage.googleapis.com/` and `s3://` to `https://s3.amazonaws.com/`.
* endpoint: (s3 only) Provide an endpoint other than the official Amazon servers. This is useful for accessing the various S3 emulators offered by on-premises deployments of object storage.
* request_payer: Specify the account that should be charged for requests towards the bucket, rather than the bucket owner.
* `gs://`: `request_payer` can be any Google Cloud project id. Please refer to the do