## Pandasticsearch
[![Build Status](https://travis-ci.org/onesuper/pandasticsearch.svg?branch=master)](https://travis-ci.org/onesuper/pandasticsearch) [![PyPI](https://img.shields.io/pypi/v/pandasticsearch.svg)](https://pypi.python.org/pypi/pandasticsearch)
Pandasticsearch is an Elasticsearch client for data analysis.
It provides table-like access to Elasticsearch documents, similar
to the Python Pandas library and R DataFrames.
To install:
```
pip install pandasticsearch
# if you intend to export Pandas DataFrames
pip install pandasticsearch[pandas]
```
Elasticsearch excels at real-time indexing, search, and data analysis.
Pandasticsearch can convert the analysis results (e.g. multi-level nested aggregation)
into [Pandas](http://pandas.pydata.org) DataFrame objects for subsequent data analysis.
Check out the API docs: [http://pandasticsearch.readthedocs.io/en/latest/](http://pandasticsearch.readthedocs.io/en/latest/).
## Usage
### DataFrame API
A `DataFrame` object accesses Elasticsearch data through high-level operations.
It is type-safe, easy-to-use and Pandas-flavored.
```python
# Create a DataFrame object
from pandasticsearch import DataFrame
df = DataFrame.from_es(url='http://localhost:9200', index='people')
# Print the schema (mapping) of the index
df.print_schema()
# company
# |-- employee
# |-- name: {'index': 'not_analyzed', 'type': 'string'}
# |-- age: {'type': 'integer'}
# |-- gender: {'index': 'not_analyzed', 'type': 'string'}
# Inspect the columns
df.columns
# ['name', 'age', 'gender']
# Denote a column
df.name
# Column('name')
df['age']
# Column('age')
# Projection
df.filter(df.age < 25).select('name', 'age').collect()
# [Row(age=12,name='Alice'), Row(age=11,name='Bob'), Row(age=13,name='Leo')]
# Print the rows to the console
df.filter(df.age < 25).select('name').show(3)
# +------+
# | name |
# +------+
# | Alice|
# | Bob |
# | Leo |
# +------+
# Convert to Pandas object for subsequent analysis
df[df.gender == 'male'].agg(df.age.avg).to_pandas()
# avg(age)
# 0 12
# Translate the DataFrame to an ES query (dictionary)
df[df.gender == 'male'].agg(df.age.avg).to_dict()
# {'query': {'filtered': {'filter': {'term': {'gender': 'male'}}}}, 'aggregations': {'avg(age)':
# {'avg': {'field': 'age'}}}, 'size': 0}
```
### Filter
```python
# Filter by a boolean condition
df.filter(df.age < 13).collect()
# [Row(age=12,gender='female',name='Alice'), Row(age=11,gender='male',name='Bob')]
# Filter by a set of boolean conditions
# (parenthesize each comparison: `&` binds tighter than `<` and `==` in Python)
df.filter((df.age < 13) & (df.gender == 'male')).collect()
# [Row(age=11,gender='male',name='Bob')]
# Filter by a wildcard (sql `like`)
df.filter(df.name.like('A*')).collect()
# [Row(age=12,gender='female',name='Alice')]
# Filter by a regular expression (sql `rlike`)
df.filter(df.name.rlike('A.l.e')).collect()
# [Row(age=12,gender='female',name='Alice')]
# Filter by a prefixed string pattern
df.filter(df.name.startswith('Al')).collect()
# [Row(age=12,gender='female',name='Alice')]
# Filter by a script
from pandasticsearch.operators import ScriptFilter
df.filter(ScriptFilter('2016 - doc["age"].value > 1995')).collect()
# [Row(age=12,name='Alice'), Row(age=13,name='Leo')]
```
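The filter expressions above rely on Python operator overloading. The following is a minimal sketch of that idea, not pandasticsearch's actual implementation: `ToyColumn` and `ToyFilter` are hypothetical names introduced only for illustration, and the `bool`/`must` output shape is an assumption about how two filters might be combined.

```python
class ToyFilter:
    """Hypothetical stand-in for a filter expression; wraps an ES filter dict."""

    def __init__(self, body):
        self.body = body

    def __and__(self, other):
        # `&` combines two filters into a single bool/must clause
        return ToyFilter({"bool": {"must": [self.body, other.body]}})


class ToyColumn:
    """Hypothetical stand-in for a column reference such as df.age."""

    def __init__(self, name):
        self.name = name

    def __lt__(self, value):
        # `column < value` becomes a range filter
        return ToyFilter({"range": {self.name: {"lt": value}}})

    def __eq__(self, value):
        # `column == value` becomes a term filter
        return ToyFilter({"term": {self.name: value}})


age, gender = ToyColumn("age"), ToyColumn("gender")

# Each comparison is parenthesized so that `&` sees two filters
combined = (age < 13) & (gender == "male")
print(combined.body)
```

Without the parentheses, Python evaluates `13 & gender` first, which in this toy raises a `TypeError` because an `int` cannot be combined with a column.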
**5.0 compatibility**: By default, pandasticsearch uses the `filtered` query, which was deprecated in Elasticsearch 2.0 and removed in 5.0.
To use pandasticsearch against Elasticsearch 5.0 or later, pass a `compat` argument to `from_es`:
```python
df = DataFrame.from_es(url='http://localhost:9200', index='people', compat=5)
```
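To make the difference concrete, here is a rough sketch of how a legacy `filtered` body maps onto the `bool` query with a `filter` clause that newer Elasticsearch versions expect. The helper `filtered_to_bool` is hypothetical and written only for illustration; it is not part of pandasticsearch and does not describe how `compat=5` is implemented internally.

```python
def filtered_to_bool(body):
    """Rewrite a legacy `filtered` query body into the equivalent `bool` form."""
    query = body.get("query", {})
    if "filtered" not in query:
        return body  # nothing to convert

    filtered = query["filtered"]
    bool_query = {}
    if "query" in filtered:
        bool_query["must"] = filtered["query"]      # scoring part
    if "filter" in filtered:
        bool_query["filter"] = filtered["filter"]   # non-scoring part

    converted = dict(body)  # shallow copy, leaves the input untouched
    converted["query"] = {"bool": bool_query}
    return converted


legacy = {
    "query": {"filtered": {"filter": {"term": {"gender": "male"}}}},
    "aggregations": {"avg(age)": {"avg": {"field": "age"}}},
    "size": 0,
}
modern = filtered_to_bool(legacy)
print(modern["query"])
```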
### Aggregation
```python
# Aggregation
df[df.gender == 'male'].agg(df.age.avg).collect()
# [Row(avg(age)=12)]
# Metric alias
df[df.gender == 'male'].agg(df.age.avg.alias('avg_age')).collect()
# [Row(avg_age=12)]
# Groupby only (will give the `doc_count`)
df.groupby('gender').collect()
# [Row(doc_count=1), Row(doc_count=2)]
# Groupby and then aggregate
df.groupby('gender').agg(df.age.max).collect()
# [Row(doc_count=1, max(age)=12), Row(doc_count=2, max(age)=13)]
# Group by a set of ranges
df.groupby(df.age.ranges([10,12,14])).to_pandas()
# doc_count
# range(10,12,14)
# 10.0-12.0 2
# 12.0-14.0 1
# Advanced ES aggregation
df.groupby(df.gender).agg(df.age.stats).to_pandas()
df.agg(df.age.extended_stats).to_pandas()
df.agg(df.age.percentiles).to_pandas()
df.groupby(df.date.date_interval('1d')).to_pandas()
# Customized aggregation terms
df.groupby(df.age.terms(size=5, include=[1, 2, 3]))
```
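For intuition about what the group-by results look like before they become rows, the sketch below flattens one level of `terms` buckets from a raw Elasticsearch aggregation response into plain dictionaries. The function `flatten_buckets` and the sample response are assumptions made for illustration; pandasticsearch's own parsing may differ, especially for multi-level nesting.

```python
def flatten_buckets(response, group_name):
    """Turn one level of terms buckets into a list of flat row dicts."""
    rows = []
    for bucket in response["aggregations"][group_name]["buckets"]:
        row = {group_name: bucket["key"], "doc_count": bucket["doc_count"]}
        for name, value in bucket.items():
            # Single-value metric sub-aggregations come back as {"value": ...}
            if isinstance(value, dict) and "value" in value:
                row[name] = value["value"]
        rows.append(row)
    return rows


# Hand-written sample in the shape of an ES terms aggregation response
sample = {
    "aggregations": {
        "gender": {
            "buckets": [
                {"key": "female", "doc_count": 1, "max(age)": {"value": 12.0}},
                {"key": "male", "doc_count": 2, "max(age)": {"value": 13.0}},
            ]
        }
    }
}
rows = flatten_buckets(sample, "gender")
for row in rows:
    print(row)
```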
### Sort
```python
# Sort
df.sort(df.age.asc).select('name', 'age').collect()
# [Row(age=11,name='Bob'), Row(age=12,name='Alice'), Row(age=13,name='Leo')]
# Sort by a script
from pandasticsearch.operators import ScriptSorter
df.sort(ScriptSorter('doc["age"].value * 2')).collect()
# [Row(age=11,name='Bob'), Row(age=12,name='Alice'), Row(age=13,name='Leo')]
```
## Use with Another Python Client
Pandasticsearch can also be used alongside another full-featured Python client:
* [elasticsearch-py](https://github.com/elastic/elasticsearch-py) (Official)
* [Elasticsearch-SQL](https://github.com/NLPchina/elasticsearch-sql)
* [pyelasticsearch](https://github.com/pyelasticsearch/pyelasticsearch)
* [pyes](https://github.com/aparo/pyes)
### Build query
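```python
from elasticsearch import Elasticsearch
from pandasticsearch import DataFrame
# Build the query body with pandasticsearch
df = DataFrame.from_es(url='http://localhost:9200', index='recruit')
body = df[df['gender'] == 'male'].agg(df['age'].avg).to_dict()
# Run the query with the official client
es = Elasticsearch('http://localhost:9200')
result_dict = es.search(index="recruit", body=body)
```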
### Parse result
```python
from elasticsearch import Elasticsearch
es = Elasticsearch('http://localhost:9200')
result_dict = es.search(index="recruit", body={"query": {"match_all": {}}})
from pandasticsearch import Select
pandas_df = Select.from_dict(result_dict).to_pandas()
```
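If you only need the raw documents, the search response can also be unpacked by hand. The sketch below assumes the standard response layout (`hits.hits[*]._source`); `hits_to_rows` is a hypothetical helper shown for illustration and is not how `Select.from_dict` is necessarily implemented.

```python
def hits_to_rows(result_dict):
    """Extract the `_source` document of every hit as a plain dict."""
    return [dict(hit.get("_source", {})) for hit in result_dict["hits"]["hits"]]


# Hand-written sample in the shape of an ES search response
sample_result = {
    "hits": {
        "total": 2,
        "hits": [
            {"_index": "recruit", "_id": "1", "_source": {"name": "Alice", "age": 12}},
            {"_index": "recruit", "_id": "2", "_source": {"name": "Bob", "age": 11}},
        ],
    }
}
rows = hits_to_rows(sample_result)
print(rows)
```

A list of dictionaries like this can be passed directly to `pandas.DataFrame(rows)` when pandas is installed.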
## Related Articles
* [Spark and Elasticsearch for real-time data analysis](https://spark-summit.org/2015-east/wp-content/uploads/2015/03/SSE15-35-Leau.pdf)
## LICENSE
MIT