# Bqueryd
A companion library to [bquery](https://github.com/visualfabriq/bquery/) that makes distributed bquery calls possible. Think of it as a far more rudimentary alternative to [Hadoop](http://hadoop.apache.org/) or [Dask](https://dask.pydata.org/en/latest/).
## The Idea
Web applications or client analysis tools do not perform the heavy lifting of calculations over large sets of data themselves; the data is stored on a collection of other servers that respond to queries over the network. Data files used in computations are stored as [bcolz](http://bcolz.blosc.org/en/latest/) files.
For _really_ large datasets, the bcolz files can also be split up into 'shards' over several servers; a query can then be performed across several servers, with the results combined and returned to the calling function by the bqueryd library.
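The scatter-gather idea can be sketched in plain Python (toy data and helper functions for illustration only, not bqueryd's actual implementation): each shard computes a partial group-by, and the caller combines the partials into the final answer.

```python
def groupby_sum(shard, key, value):
    """Aggregate one shard: sum `value` per `key`."""
    out = {}
    for row in shard:
        out[row[key]] = out.get(row[key], 0) + row[value]
    return out

def combine(partials):
    """Merge per-shard partial sums into one final result."""
    total = {}
    for part in partials:
        for k, v in part.items():
            total[k] = total.get(k, 0) + v
    return total

# Two toy 'shards' of trip records
shards = [
    [{'payment_type': 1, 'fare_amount': 10.0},
     {'payment_type': 2, 'fare_amount': 5.0}],
    [{'payment_type': 1, 'fare_amount': 7.5}],
]
result = combine(groupby_sum(s, 'payment_type', 'fare_amount') for s in shards)
# result == {1: 17.5, 2: 5.0}
```

Because a sum of sums is itself a sum, this combine step is cheap; bqueryd applies the same principle over the network.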
## Getting started
Make sure you have Python virtualenv installed first.
As a start we need some interesting data, that is reasonably large in size. Download some Taxi data from the [NYC Taxi & Limousine Commission](http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml)
```shell
virtualenv bqueryd_getting_started
cd bqueryd_getting_started
. bin/activate
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-01.csv
pip install bqueryd pandas
```
We are only downloading the data for one month; a more interesting test is of course to download the data for an entire year. But this is a good start: the data for one month already contains around 10 million records.
Run IPython, and let's convert the CSV file to a bcolz file:
```python
import bcolz
import pandas as pd

data = pd.read_csv('yellow_tripdata_2016-01.csv',
                   parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])
ct = bcolz.ctable.fromdataframe(data, rootdir='tripdata_2016-01.bcolz')
```
Now we have a bcolz file on disk that can be queried using [bquery](https://github.com/visualfabriq/bquery/). But we also want to show how to use the distributed functionality of bqueryd, so we split the file that we have just created into some smaller chunks.
```python
import bcolz
import bquery

ct = bquery.open(rootdir='tripdata_2016-01.bcolz')

NR_SHARDS = 10
step = len(ct) // NR_SHARDS       # rows per shard (integer division)
remainder = len(ct) % NR_SHARDS   # leftover rows go into the last shard

for count in range(NR_SHARDS):
    idx = count * step
    if count == NR_SHARDS - 1:
        step += remainder
    print('Creating file tripdata_2016-01-%s.bcolzs' % count)
    ct_shard = bcolz.fromiter(
        ct.iter(idx, idx + step),
        ct.dtype,
        step,
        rootdir='tripdata_2016-01-%s.bcolzs' % count,
        mode='w'
    )
    ct_shard.flush()
```
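The step/remainder arithmetic above makes the shards cover every record exactly once: the first `NR_SHARDS - 1` shards hold `step` rows each, and the last shard also picks up the leftover rows. A quick sanity check of that logic, using a hypothetical row count (no bcolz needed):

```python
n = 10_906_858                    # hypothetical row count for one month of trips
NR_SHARDS = 10
step = n // NR_SHARDS
remainder = n % NR_SHARDS

sizes = [step] * (NR_SHARDS - 1) + [step + remainder]
assert sum(sizes) == n            # every row lands in exactly one shard
assert len(sizes) == NR_SHARDS
```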
## Running bqueryd
Now to test bqueryd. If bqueryd was successfully installed with pip and your virtual environment is activated, you should now have a script named ```bqueryd``` on your path, which you can use to start a controller. Before starting bqueryd, also make sure that you have a [Redis](https://redis.io/) server running locally.
```shell
bqueryd controller &
```
If you already have a controller running, you can also run ```bqueryd``` without any arguments; it will try to connect to your controller and drop you into an [IPython](https://ipython.org/) shell to communicate with your bqueryd cluster.
```shell
bqueryd
```
From the IPython prompt you have access to a variable named ```rpc``` (provided you have at least one running controller). Through the ```rpc``` variable you can send commands to the bqueryd cluster. For example:
```python
>>> rpc.info()
```
This shows status information on your current cluster; with only one controller node running there is not much info yet. First exit your IPython session back to the shell.
Let's also start two worker nodes:
```shell
bqueryd worker --data_dir=`pwd` &
bqueryd worker --data_dir=`pwd` &
```
At this point you should have a controller and two workers running in the background. When you execute ```bqueryd``` again and do:
```python
>>> rpc.info()
```
There should now be more information on the running controller plus the two worker nodes. By default, worker nodes look for bcolz files in the ```/srv/bcolz/``` directory; above we started the worker nodes with the command line argument ```--data_dir``` so that they use the bcolz files in the current directory.
So what kind of other commands can we send to the nodes? Here are some things to try:
```python
>>> rpc.ping()
>>> rpc.sleep(10)
>>> rpc.loglevel('debug')
>>> rpc.sleep(2)
>>> rpc.loglevel('info')
>>> rpc.killworkers()
```
Notice the last command sent: it kills all the workers connected to all running controllers on the network, while the controllers keep running. In a typical setup the nodes will have been started and kept running by a tool like [Supervisor](http://supervisord.org/), so the ```killworkers``` command effectively reboots all your workers.
The ```sleep``` call is just for testing whether any workers are responding. The call is not performed on the caller or the connecting node, but by a worker chosen at random.
It is possible to stop all workers and all controllers in the bqueryd network by issuing the command:
```python
>>> rpc.killall()
```
## Configuration
Assuming all other defaults are kept, there is minimally **one** thing to configure to use bqueryd on a network: **the address of the Redis server**.
This is set in the file ```/etc/bqueryd.cfg```. Create this file with a line like:
```
redis_url = redis://127.0.0.1:6379/0
```
changing the IP address to that of your running Redis instance. This needs to be done on every machine on which you plan to run a bqueryd node.
As a convenience there is also an example configuration file for running a bqueryd installation under Supervisor in [misc/supervisor.conf](misc/supervisor.conf)
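For orientation, a Supervisor program section for a controller might look something like the following sketch (the paths and user here are hypothetical; see [misc/supervisor.conf](misc/supervisor.conf) for the version shipped with bqueryd):

```ini
[program:bqueryd-controller]
command=/srv/venvs/bqueryd/bin/bqueryd controller
autostart=true
autorestart=true
user=bqueryd
```

Similar ```[program:...]``` sections, one per worker process, would keep the workers running and restart them after a ```killworkers``` call.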
## Doing calculations
The whole point of having a bqueryd cluster running is doing some calculations. So once you have a controller with some worker nodes running and connected, you can drop into the bqueryd ipython shell, and for example do:
```python
>>> rpc.groupby(['tripdata_2016-01.bcolz'], ['payment_type'], ['fare_amount'], [])
```
But we can also use the sharded data to do the same calculation:
```python
>>> import os
>>> bcolzs_files = [x for x in os.listdir('.') if x.endswith('.bcolzs')]
>>> rpc.groupby(bcolzs_files, ['payment_type'], [['fare_amount', 'sum', 'fare_amount']], [], aggregate=True)
```
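The ```aggregate=True``` flag asks for the per-shard partial results to be re-aggregated into one final answer. For a sum that is simply a sum of sums, as in the sketch earlier; a mean is the classic case that needs care, since averaging per-shard means is wrong when shards differ in size. A small illustration in plain Python (hypothetical numbers, not bqueryd internals):

```python
# Carry (sum, count) per shard instead of per-shard means,
# then divide once at the end.
partials = [(25.0, 4), (9.0, 2)]       # hypothetical (sum, count) per shard

total_sum = sum(s for s, _ in partials)
total_count = sum(c for _, c in partials)
mean = total_sum / total_count          # correct global mean
```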
To see how long an RPC call took, you can check:
```python
>>> rpc.last_call_duration
```
The sharded version actually takes longer to run than the version using the single bcolz file. But if we start up more workers, the call speeds up. For relatively small files like in this example the speedup is small, but for larger datasets the overhead is worthwhile. Start a few more workers and run the query above again.
## Executing arbitrary code
It is possible to have your bqueryd workers import and execute arbitrary Python code. **This is a huge security risk if you do not run your nodes on trusted servers behind a good firewall.** Make sure you know what you are doing before starting up and running bqueryd nodes. With that said, if you have a cluster running, try something like:
```python
>>> rpc.execute_code(function='os.listdir', args=['.'], wait=True)
```
This should pick a random worker from those connected to the controller and run the Python ```os.listdir``` function with the args specified. The point of this is to run code beyond the built-in bquery/bcolz aggregation logic, which enables one to perform other business-specific operations over the network using bqueryd nodes.
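At its core, a worker handling such a call has to resolve a dotted function path and apply it to the given arguments. A simplified, hypothetical version of that dispatch (the real bqueryd worker does more, including transporting the result back over the network):

```python
import importlib

def execute_code(function, args=None, kwargs=None):
    """Resolve a dotted name like 'os.listdir' and call it with args/kwargs."""
    module_name, _, func_name = function.rpartition('.')
    module = importlib.import_module(module_name)
    func = getattr(module, func_name)
    return func(*(args or []), **(kwargs or {}))

entries = execute_code('os.listdir', args=['.'])
# returns the same list as os.listdir('.') on the machine that ran it
```

This also makes the security risk concrete: whatever dotted name the caller supplies, the worker will import and run it.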
## Distributing bcolz files
If your system is properly configured to use [boto](https://github.com/boto/boto3) for communication with Amazon Web Services, you can use bqueryd to automatically distribute collections of files to all nodes in the bqueryd cluster.
Create some bcolz files in the default bqueryd directory ```/srv/bcolz/```.