# BiggerQuery — Python library for BigQuery
BiggerQuery is a Python library that simplifies working with BigQuery datasets. It wraps the BigQuery client, providing an elegant
API for the most common use cases. It also provides an API that simplifies creating Dataflow pipelines.
## Installation
`pip install biggerquery`
## Compatibility
BiggerQuery is compatible with Python 2.7.
## Tutorial
### Task definition
To guide you through all the features that BiggerQuery provides, we prepared a simple task. There is a table **transactions**, which looks like this:

| user_id | transaction_value | partition_timestamp |
|---------|-------------------|---------------------|
| john123 | 800 | 2019-01-01 00:00:00 |
| smith99 | 10000 | 2019-01-01 00:00:00 |
| smith99 | 30000 | 2019-01-01 00:00:00 |
The table contains all transactions that users make on a specific day. Your task is to calculate two metrics for each user:
daily user transaction value and daily user transaction count.
The final result should be the **user_transaction_metrics** table:

| user_id | metric_value | metric_type | partition_timestamp |
|---------|--------------|------------------------|---------------------|
| john123 | 800 | USER_TRANSACTION_VALUE | 2019-01-01 00:00:00 |
| smith99 | 40000 | USER_TRANSACTION_VALUE | 2019-01-01 00:00:00 |
| john123 | 1 | USER_TRANSACTION_COUNT | 2019-01-01 00:00:00 |
| smith99 | 2 | USER_TRANSACTION_COUNT | 2019-01-01 00:00:00 |
### Setting up test environment
Before you start working with BiggerQuery, you need to install [Google Cloud SDK](https://cloud.google.com/sdk/gcloud/).
With gcloud installed, set up default gcloud credentials:
`gcloud auth application-default login`
Next, set up virtualenv with BiggerQuery:
```
mkdir test_biggerquery
cd test_biggerquery
pip install virtualenv
virtualenv -p /usr/bin/python2.7 venv
source venv/bin/activate
pip install biggerquery
```
Then, prepare datasets. Start by creating a new Python module:
`touch user_transaction_metrics.py`
Edit the created module with your favorite editor and add the following lines:
```python
from biggerquery import create_dataset_manager

PROJECT_ID = 'your-project-id'
USER_TRANSACTION_METRICS_DATASET_NAME = 'user_transaction_metrics'
TRANSACTION_DATASET_NAME = 'transactions'


def setup_test_transactions_table(project_id, dataset_name):
    dataset_id, dataset_manager = create_dataset_manager(
        project_id,
        '2019-01-01',
        dataset_name,
        internal_tables=['transactions'])

    dataset_manager.create_table("""
    CREATE TABLE IF NOT EXISTS transactions (
        user_id STRING,
        transaction_value FLOAT64,
        partition_timestamp TIMESTAMP)
    PARTITION BY DATE(partition_timestamp)""")

    dataset_manager.write_truncate('transactions', """
    SELECT 'john123' as user_id, 800.0 as transaction_value, TIMESTAMP('2019-01-01') as partition_timestamp
    """)
    dataset_manager.write_append('transactions', """
    SELECT 'smith99' as user_id, 10000.0 as transaction_value, TIMESTAMP('2019-01-01') as partition_timestamp
    """)
    dataset_manager.write_append('transactions', """
    SELECT 'smith99' as user_id, 30000.0 as transaction_value, TIMESTAMP('2019-01-01') as partition_timestamp
    """)

    return '{dataset_id}.transactions'.format(dataset_id=dataset_id)


TRANSACTIONS_TABLE_ID = setup_test_transactions_table(PROJECT_ID, TRANSACTION_DATASET_NAME)

user_transaction_dataset_id, user_transaction_metrics_dataset_manager = create_dataset_manager(
    project_id=PROJECT_ID,
    runtime='2019-01-01',
    dataset_name=USER_TRANSACTION_METRICS_DATASET_NAME,
    internal_tables=['user_transaction_metrics'],
    external_tables={
        'transactions': TRANSACTIONS_TABLE_ID
    })
```
This code creates two datasets:
* the **transactions** dataset, which contains the source data table to be processed,
* the **user_transaction_metrics** dataset, which will contain the result table of our processing.
### Creating dataset manager
A dataset manager is an object that lets you manipulate tables in a given dataset using basic operations: `write_truncate`,
`write_append`, `create_table`, `collect`, and `write_tmp`. Let's go through a few examples to illustrate those operations.
Start by creating a dataset manager object. Parameters `project_id` and `dataset_name` define the dataset you want to work with.
Parameter `internal_tables` specifies tables that are **inside** the dataset defined by `project_id` and `dataset_name`.
Parameter `external_tables` specifies tables that are **outside** the dataset defined by `project_id` and `dataset_name`.
External tables have to be referenced by their full table ids, for example:
```python
external_tables = {
    'transactions': 'dataset.id.transactions',
    'some_external_table': 'dataset.id2.external_table'
}
```
Parameter `runtime` is used to determine the partition being processed.
```python
user_transaction_dataset_id, user_transaction_metrics_dataset_manager = create_dataset_manager(
    project_id=PROJECT_ID,
    runtime='2019-01-01',
    dataset_name=USER_TRANSACTION_METRICS_DATASET_NAME,
    internal_tables=['user_transaction_metrics'],
    external_tables={
        'transactions': TRANSACTIONS_TABLE_ID
    })
```
### Create table
Now, create a table that you can use to store your metrics. You can use plain SQL to create this table. Add the following lines to `user_transaction_metrics.py`:
```python
user_transaction_metrics_dataset_manager.create_table("""
    CREATE TABLE IF NOT EXISTS user_transaction_metrics (
        user_id STRING,
        metric_value FLOAT64,
        metric_type STRING,
        partition_timestamp TIMESTAMP)
    PARTITION BY DATE(partition_timestamp)
""")
```
### Write truncate
Next, calculate the first metric — `USER_TRANSACTION_VALUE`. Add the following lines:
```python
user_transaction_metrics_dataset_manager.write_truncate('user_transaction_metrics', """
    SELECT user_id,
        sum(transaction_value) as metric_value,
        'USER_TRANSACTION_VALUE' as metric_type,
        TIMESTAMP('{dt}') as partition_timestamp
    FROM `{transactions}`
    WHERE DATE(partition_timestamp) = '{dt}'
    GROUP BY user_id
    """)
```
```
Result:

| user_id | metric_value | metric_type | partition_timestamp |
|---------|--------------|------------------------|---------------------|
| john123 | 800 | USER_TRANSACTION_VALUE | 2019-01-01 00:00:00 |
| smith99 | 40000 | USER_TRANSACTION_VALUE | 2019-01-01 00:00:00 |
The `write_truncate` function writes the result of the provided query to a specified table, in this case `user_transaction_metrics`.
This function removes all data from a given table before writing new data.
Inside the query, you don't have to write full table ids. You can use the names provided in the parameters `internal_tables` and `external_tables`.
Parameter `runtime` is also available inside queries as `{dt}`.
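The same templating works in ad-hoc queries. For example, you could read back the freshly written metrics with `collect`, one of the operations listed earlier. Its exact return type isn't covered in this tutorial, so treat the following as a sketch that assumes `collect` takes a query string and returns the resulting rows:
```python
# Sketch only: assumes collect(query) runs the query and returns its rows.
# '{user_transaction_metrics}' and '{dt}' are resolved by the dataset manager,
# just like in the write_truncate call above.
rows = user_transaction_metrics_dataset_manager.collect("""
    SELECT user_id, metric_value
    FROM `{user_transaction_metrics}`
    WHERE DATE(partition_timestamp) = '{dt}'
    ORDER BY user_id
    """)
```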
### Write append
So what about adding data to a table? Calculate another metric — `USER_TRANSACTION_COUNT`. Add the following lines:
```python
user_transaction_metrics_dataset_manager.write_append('user_transaction_metrics', """
    SELECT user_id,
        count(transaction_value) * 1.0 as metric_value,
        'USER_TRANSACTION_COUNT' as metric_type,
        TIMESTAMP('{dt}') as partition_timestamp
    FROM `{transactions}`
    WHERE DATE(partition_timestamp) = '{dt}'
    GROUP BY user_id
    """)
```
```
Result:

| user_id | metric_value | metric_type | partition_timestamp |
|---------|--------------|------------------------|---------------------|
| john123 | 800 | USER_TRANSACTION_VALUE | 2019-01-01 00:00:00 |
| smith99 | 40000 | USER_TRANSACTION_VALUE | 2019-01-01 00:00:00 |
| john123 | 1 | USER_TRANSACTION_COUNT | 2019-01-01 00:00:00 |
| smith99 | 2 | USER_TRANSACTION_COUNT | 2019-01-01 00:00:00 |
The difference between `write_truncate` and `write_append` is that `write_append` does not clear the table before writing; it appends the query result to the data that is already in the `user_transaction_metrics` table.
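With all of the above added to `user_transaction_metrics.py`, you can run the whole example from the activated virtualenv with `python user_transaction_metrics.py`. Note that this executes real BigQuery jobs in the project you set as `PROJECT_ID`, so use a project that is safe for testing.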