# Analytics DB With 'Open Data' from Zurich, Switzerland
Capstone Project Data Engineer Nanodegree, June 2020
## Introduction
### Project Background
When starting this project, I wanted to work with the largest publicly available _local_ datasets I could get. It turned out that there is not much of impressive size around here, but at least [Open Data Zurich](https://data.stadt-zuerich.ch/) provides some multi-million-row datasets (although they are far from being "big data").
In terms of content, my goal was to combine different data sets in a distributed AWS Redshift database in such a way that they would be easy to extract for analytical purposes.
### Project Overview
In the end I worked with two data sets:
1. **Traffic count for non-motorized traffic (pedestrians and bicycles)**, which has been collected every quarter of an hour since the end of 2009 at a total of over 130 stations.
2. **Weather measurements (temperature, humidity, wind, precipitation etc.)** which have been collected every ten minutes since 2007 (one station only).
Use Case:
Data scientists investigating how the traffic at the various locations has developed over the years and what influence the weather may have on it.
Requirements:
To provide the data in a way that is easy to handle but at the same time allows flexibility, scalability and the possibility to add more dimensions and facts in the future (e.g. counts for motorized traffic, data on air pollution).
Solution:
To create a DB containing one fact table each for the traffic counts and the weather data, and to link them with a dim_date and a dim_time table so that queries for comparative analysis are easy. (This takes into account that the measurements are not in sync, hence the special design of the dim_time table.)
Project Flow:
![Steps](resources/data_steps.JPG)
## Data
### Data Model
The data model contains five tables:
![model](resources/data_model.JPG)
`fact_count` is by far the largest table, with more than 9 million rows at the time of completing the project:
![row_count](resources/table_rows.JPG)
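To illustrate how the two fact tables meet on the shared dimensions, here is a minimal DDL sketch in the style of the project's `sql_queries.py`. Apart from the table names shown in the model, the column names and key choices are assumptions, not the exact production schema:
``` python
# Hedged sketch: columns beyond the table names in the model are assumptions.

create_dim_time = """
CREATE TABLE IF NOT EXISTS dim_time (
    time_id      INT PRIMARY KEY,   -- e.g. minutes since midnight
    hour         SMALLINT,
    minute       SMALLINT,
    quarter_hour SMALLINT           -- lets the 15-min traffic grid and the
                                    -- 10-min weather grid meet on a coarser bucket
);
"""

create_fact_count = """
CREATE TABLE IF NOT EXISTS fact_count (
    count_id    BIGINT IDENTITY(0,1),
    date_id     INT NOT NULL,       -- FK to dim_date
    time_id     INT NOT NULL,       -- FK to dim_time
    location_id INT NOT NULL,       -- FK to the locations dimension
    pedestrians INT,
    bicycles    INT
)
DISTKEY(date_id) SORTKEY(date_id, time_id);
"""
```
With this layout, joining `fact_count` and the weather fact table is a plain star-schema query on `date_id` plus a coarse time bucket, even though the raw measurement intervals differ.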
### Data Sources
- Traffic counts: one CSV file per year; these are downloaded programmatically.
- Weather data:
    - A single CSV file covering all data up to 2019; this is downloaded programmatically.
    - Data for the current year is requested from an [API](https://tecdottir.herokuapp.com/docs/) (see the sketch after this list).
- Traffic locations: JSON file, which has to be downloaded manually (request by email).
- Date and time dimensions: Because Redshift does not support all necessary data types, these tables have been developed locally in PostgreSQL and were then copied into Redshift (see the Acknowledgements section at the bottom of this file).
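A minimal sketch of the current-year request, assuming the `requests` library and a hypothetical `/measurements/{station}` endpoint with `startDate`/`endDate` parameters (check the linked API docs for the real route and parameter names):
``` python
import requests

# Hypothetical endpoint and parameters; verify against the API docs at
# https://tecdottir.herokuapp.com/docs/ before using.
BASE_URL = "https://tecdottir.herokuapp.com/measurements/{station}"

def fetch_weather(station: str, start: str, end: str) -> list:
    """Request weather measurements for one station and date range."""
    url = BASE_URL.format(station=station)
    resp = requests.get(url, params={"startDate": start, "endDate": end}, timeout=30)
    resp.raise_for_status()
    # Assumed response shape: {"result": [ {...measurement...}, ... ]}
    return resp.json().get("result", [])

# Example: pull the first week of the current year for one (assumed) station name.
rows = fetch_weather("tiefenbrunnen", "2020-01-01", "2020-01-07")
```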
### Data Quality Checks
Quality checks were incorporated into the ETL process. The script tests for missing values (and handles them appropriately). It also tests for duplicate records and eliminates them prior to loading into the database, since Redshift does not enforce uniqueness and other constraints.
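To illustrate the kind of check involved, here is a minimal pandas sketch; the file and column names are assumptions, and the actual script may handle missing values differently:
``` python
import pandas as pd

# Hypothetical raw traffic counts; file and column names are assumptions.
df = pd.read_csv("traffic_counts_2020.csv")

# Drop rows lacking the keys needed for the fact table.
df = df.dropna(subset=["timestamp", "location_id"])

# Redshift does not enforce uniqueness, so deduplicate before loading.
before = len(df)
df = df.drop_duplicates(subset=["timestamp", "location_id"], keep="first")
print(f"Removed {before - len(df)} duplicate rows")
```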
## Other Scenarios
**If the database size was increased by 100X**: Even though the actual data would fit neatly into a DB on a local machine, it is stored in a distributed Redshift database running on 2 nodes. This setup is highly scalable: you can switch to higher-performing nodes or add many new nodes if required. But to maintain high query performance in such a scenario, it might be worth considering a move to a NoSQL database such as Cassandra, with each table optimized for a specific query.
**If the database is updated every morning at 7am**: This would make perfect sense, because the weather data can be requested in real time and the CSV with the current traffic counts is updated every day on the Open Data Zurich site. The best approach might be a pipeline scheduling application such as Airflow, with the uploading tasks implemented using Airflow hooks to AWS S3 buckets and Redshift. Transform tasks could be implemented as Python callables with fairly limited modifications to the existing ETL script, especially in the case of the weather data, which already has an update function implemented (see the sketch below).
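A minimal Airflow sketch of such a schedule; the DAG and the callables imported from the project scripts are assumptions, not the existing code:
``` python
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Hypothetical wrappers around the existing ETL steps (assumed to exist).
from prepare_data import update_weather_data
from etl import load_staging_to_facts

default_args = {"owner": "open-data-zh", "retries": 1}

with DAG(
    dag_id="daily_open_data_update",
    default_args=default_args,
    start_date=datetime(2020, 6, 1),
    schedule_interval="0 7 * * *",   # every morning at 7am
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch_and_upload", python_callable=update_weather_data)
    load = PythonOperator(task_id="load_into_redshift", python_callable=load_staging_to_facts)
    fetch >> load
```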
**If the database needed to be accessed by 100+ people**: Redshift supports concurrency scaling, adding clusters as needed to handle increases in demand for concurrent querying of the database. But there are a number of technical requirements for concurrency scaling, such as node type, sort key type (interleaved sorting cannot be used) and query type (e.g. read-only), that must be met. The existing data model and cluster configuration would have to be reviewed against these requirements.
## Run
Script to create the Redshift cluster:
``` sh
python create_redshift_cluster.py
```
Script to create the database tables or to reset the database:
``` sh
python create_tables.py
```
Script to retrieve the data, preprocess it locally and upload it to S3 (ETL, pt. 1):
``` sh
python prepare_data.py
```
ETL pipeline to populate the database tables from the data on S3 (ETL, pt. 2):
``` sh
python etl.py
```
## Acknowledgements
Resources which helped me develop the date and time dimensions in PostgreSQL:
- [wiki on postgresql.org](https://wiki.postgresql.org/wiki/Date_and_Time_dimensions)
- [blogpost from nicholasduffy.com](https://nicholasduffy.com/posts/postgresql-date-dimension/)
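
For illustration, here is a minimal sketch of how such a dimension can be generated locally with PostgreSQL's `generate_series` (a simplified version of what the linked resources describe; the real dim_date has more attributes, and the connection string is an assumption):
``` python
import psycopg2

# Simplified dim_date generation; the actual table carries more attributes.
dim_date_sql = """
INSERT INTO dim_date (date_id, date, year, month, day, day_of_week)
SELECT
    TO_CHAR(d, 'YYYYMMDD')::INT AS date_id,
    d::DATE                     AS date,
    EXTRACT(YEAR   FROM d)::INT AS year,
    EXTRACT(MONTH  FROM d)::INT AS month,
    EXTRACT(DAY    FROM d)::INT AS day,
    EXTRACT(ISODOW FROM d)::INT AS day_of_week
FROM generate_series('2007-01-01'::DATE, '2030-12-31'::DATE, INTERVAL '1 day') AS d;
"""

with psycopg2.connect("dbname=dev user=postgres") as conn:  # assumed local DSN
    with conn.cursor() as cur:
        cur.execute(dim_date_sql)
```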