# BiggerQuery — Python library for BigQuery
BiggerQuery is a Python library that simplifies working with BigQuery datasets. It wraps the BigQuery client, providing an elegant
API for the most common use cases. It also provides an API that simplifies creating Dataflow pipelines.
## Installation
`pip install biggerquery`
## Compatibility
BiggerQuery is compatible with Python 2.7.
## Tutorial
### Task definition
To guide you through all the features that BiggerQuery provides, we prepared a simple task. There is a table **transactions**, which looks like this:

| user_id | transaction_value | partition_timestamp |
|---------|-------------------|---------------------|
| john123 | 800 | 2019-01-01 00:00:00 |
| smith99 | 10000 | 2019-01-01 00:00:00 |
| smith99 | 30000 | 2019-01-01 00:00:00 |
The table contains all transactions that users make on a specific day. Your task is to calculate two metrics for each user:
daily user transaction value and daily user transaction count.
The final result should be the **user_transaction_metrics** table:

| user_id | metric_value | metric_type | partition_timestamp |
|---------|--------------|------------------------|---------------------|
| john123 | 800 | USER_TRANSACTION_VALUE | 2019-01-01 00:00:00 |
| smith99 | 40000 | USER_TRANSACTION_VALUE | 2019-01-01 00:00:00 |
| john123 | 1 | USER_TRANSACTION_COUNT | 2019-01-01 00:00:00 |
| smith99 | 2 | USER_TRANSACTION_COUNT | 2019-01-01 00:00:00 |
### Setting up test environment
Before you start working with BiggerQuery, you need to install [Google Cloud SDK](https://cloud.google.com/sdk/gcloud/).
With gcloud installed, set up default gcloud credentials:
`gcloud auth application-default login`
Next, set up virtualenv with BiggerQuery:
```
mkdir test_biggerquery
cd test_biggerquery
pip install virtualenv
virtualenv -p /usr/bin/python2.7 venv
source venv/bin/activate
pip install biggerquery
```
Then, prepare datasets. Start by creating a new Python module:
`touch user_transaction_metrics.py`
Edit the created module with your favorite editor and add the following lines:
```python
from biggerquery import create_dataset_manager

PROJECT_ID = 'your-project-id'
USER_TRANSACTION_METRICS_DATASET_NAME = 'user_transaction_metrics'
TRANSACTION_DATASET_NAME = 'transactions'


def setup_test_transactions_table(project_id, dataset_name):
    dataset_id, dataset_manager = create_dataset_manager(
        project_id,
        '2019-01-01',
        dataset_name,
        internal_tables=['transactions'])

    dataset_manager.create_table("""
    CREATE TABLE IF NOT EXISTS transactions (
        user_id STRING,
        transaction_value FLOAT64,
        partition_timestamp TIMESTAMP)
    PARTITION BY DATE(partition_timestamp)""")

    dataset_manager.write_truncate('transactions', """
    SELECT 'john123' as user_id, 800.0 as transaction_value, TIMESTAMP('2019-01-01') as partition_timestamp
    """)
    dataset_manager.write_append('transactions', """
    SELECT 'smith99' as user_id, 10000.0 as transaction_value, TIMESTAMP('2019-01-01') as partition_timestamp
    """)
    dataset_manager.write_append('transactions', """
    SELECT 'smith99' as user_id, 30000.0 as transaction_value, TIMESTAMP('2019-01-01') as partition_timestamp
    """)

    return '{dataset_id}.transactions'.format(dataset_id=dataset_id)


TRANSACTIONS_TABLE_ID = setup_test_transactions_table(PROJECT_ID, TRANSACTION_DATASET_NAME)

user_transaction_dataset_id, user_transaction_metrics_dataset_manager = create_dataset_manager(
    project_id=PROJECT_ID,
    runtime='2019-01-01',
    dataset_name=USER_TRANSACTION_METRICS_DATASET_NAME,
    internal_tables=['user_transaction_metrics'],
    external_tables={
        'transactions': TRANSACTIONS_TABLE_ID
    })
```
This code creates two datasets:
* the **transactions** dataset, which contains the source data table to be processed,
* the **user_transaction_metrics** dataset, which will contain the result table of our processing.
### Creating dataset manager
A dataset manager is an object that lets you manipulate tables in a given dataset using basic operations: `write_truncate`,
`write_append`, `create_table`, `collect`, and `write_tmp`. Let's go through a few examples to illustrate those operations.
Start by creating a dataset manager object. Parameters `project_id` and `dataset_name` define the dataset you want to work with.
Parameter `internal_tables` specifies tables that are **inside** the dataset defined by `project_id` and `dataset_name`.
Parameter `external_tables` specifies tables that are **outside** the dataset defined by `project_id` and `dataset_name`.
External tables have to be referenced by their full table ids, for example:
```python
external_tables = {
    'transactions': 'dataset.id.transactions',
    'some_external_table': 'dataset.id2.external_table'
}
```
Parameter `runtime` is used to determine the partition being processed.
```python
user_transaction_dataset_id, user_transaction_metrics_dataset_manager = create_dataset_manager(
    project_id=PROJECT_ID,
    runtime='2019-01-01',
    dataset_name=USER_TRANSACTION_METRICS_DATASET_NAME,
    internal_tables=['user_transaction_metrics'],
    external_tables={
        'transactions': TRANSACTIONS_TABLE_ID
    })
```
### Create table
Now, create a table that you can use to store your metrics. You can use plain SQL to create this table. Add the following lines to `user_transaction_metrics.py`:
```python
user_transaction_metrics_dataset_manager.create_table("""
    CREATE TABLE IF NOT EXISTS user_transaction_metrics (
        user_id STRING,
        metric_value FLOAT64,
        metric_type STRING,
        partition_timestamp TIMESTAMP)
    PARTITION BY DATE(partition_timestamp)
""")
```
### Write truncate
Next, calculate the first metric — `USER_TRANSACTION_VALUE`. Add the following lines:
```python
user_transaction_metrics_dataset_manager.write_truncate('user_transaction_metrics', """
    SELECT user_id,
        sum(transaction_value) as metric_value,
        'USER_TRANSACTION_VALUE' as metric_type,
        TIMESTAMP('{dt}') as partition_timestamp
    FROM `{transactions}`
    WHERE DATE(partition_timestamp) = '{dt}'
    GROUP BY user_id
    """)
```
```
Result:

| user_id | metric_value | metric_type | partition_timestamp |
|---------|--------------|------------------------|---------------------|
| john123 | 800 | USER_TRANSACTION_VALUE | 2019-01-01 00:00:00 |
| smith99 | 40000 | USER_TRANSACTION_VALUE | 2019-01-01 00:00:00 |
The `write_truncate` function writes the result of the provided query to a specified table, in this case `user_transaction_metrics`.
This function removes all data from a given table before writing new data.
Inside the query, you don't have to write full table ids. You can use the names provided in the parameters `internal_tables` and `external_tables`.
Parameter `runtime` is also available inside queries as `{dt}`.
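The same templating works in ad-hoc queries. For example, you could read back the freshly written metrics with `collect`, one of the operations listed earlier. Its exact return type isn't covered in this tutorial, so treat the following as a sketch that assumes `collect` takes a query string and returns the resulting rows:
```python
# Sketch only: assumes collect(query) runs the query and returns its rows.
# '{user_transaction_metrics}' and '{dt}' are resolved by the dataset manager,
# just like in the write_truncate call above.
rows = user_transaction_metrics_dataset_manager.collect("""
    SELECT user_id, metric_value
    FROM `{user_transaction_metrics}`
    WHERE DATE(partition_timestamp) = '{dt}'
    ORDER BY user_id
    """)
```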
### Write append
So what about adding data to a table? Calculate another metric — `USER_TRANSACTION_COUNT`. Add the following lines:
```python
user_transaction_metrics_dataset_manager.write_append('user_transaction_metrics', """
    SELECT user_id,
        count(transaction_value) * 1.0 as metric_value,
        'USER_TRANSACTION_COUNT' as metric_type,
        TIMESTAMP('{dt}') as partition_timestamp
    FROM `{transactions}`
    WHERE DATE(partition_timestamp) = '{dt}'
    GROUP BY user_id
    """)
```
```
Result:

| user_id | metric_value | metric_type | partition_timestamp |
|---------|--------------|------------------------|---------------------|
| john123 | 800 | USER_TRANSACTION_VALUE | 2019-01-01 00:00:00 |
| smith99 | 40000 | USER_TRANSACTION_VALUE | 2019-01-01 00:00:00 |
| john123 | 1 | USER_TRANSACTION_COUNT | 2019-01-01 00:00:00 |
| smith99 | 2 | USER_TRANSACTION_COUNT | 2019-01-01 00:00:00 |
The difference between `write_truncate` and `write_append` is that `write_append` does not clear the table before writing; it appends the query result to the data that is already in the `user_transaction_metrics` table.
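With all of the above added to `user_transaction_metrics.py`, you can run the whole example from the activated virtualenv with `python user_transaction_metrics.py`. Note that this executes real BigQuery jobs in the project you set as `PROJECT_ID`, so use a project that is safe for testing.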