# Analytics DB With 'Open Data' from Zurich, Switzerland
Capstone Project Data Engineer Nanodegree, June 2020
## Introduction
### Project Background
When starting this project, I wanted to work with the largest publicly available _local_ datasets I could get. It turned out that there is not much of impressive size around here, but at least [Open Data Zurich](https://data.stadt-zuerich.ch/) provides some multi-million-row datasets (although they are far from being "big data").
In terms of content, my goal was to combine different data sets in a distributed AWS Redshift database in such a way that they would be easy to extract for analytical purposes.
### Project Overview
In the end I worked with two data sets:
1. **Traffic count for non-motorized traffic (pedestrians and bicycles)**, which has been collected every quarter of an hour since the end of 2009 at a total of over 130 stations.
2. **Weather measurements (temperature, humidity, wind, precipitation etc.)** which have been collected every ten minutes since 2007 (one station only).
Use Case:
Data scientists investigating how the traffic at the various locations has developed over the years and what influence the weather may have on it.
Requirements:
To provide the data in a way that is easy to handle but at the same time allows flexibility, scalability and the possibility to add more dimensions and facts in the future (e.g. counts for motorized traffic, data on air pollution).
Solution:
To create a DB containing one fact table each for the traffic counts and the weather data, and to link them with a dim_date and a dim_time table so that queries for comparative analysis are easy. (This takes into account that the measurements are not in sync, hence the special design of the dim_time table.)
Project Flow:
![Steps](resources/data_steps.JPG)
## Data
### Data Model
The data model contains five tables:
![model](resources/data_model.JPG)
`fact_count` is by far the largest table, with more than 9 million rows at the time of completing the project:
![row_count](resources/table_rows.JPG)
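To illustrate how the two fact tables meet on the shared dimensions, here is a minimal DDL sketch in the style of the project's `sql_queries.py`. Apart from the table names shown in the model, the column names and key choices are assumptions, not the exact production schema:
``` python
# Hedged sketch: columns beyond the table names in the model are assumptions.

create_dim_time = """
CREATE TABLE IF NOT EXISTS dim_time (
    time_id      INT PRIMARY KEY,   -- e.g. minutes since midnight
    hour         SMALLINT,
    minute       SMALLINT,
    quarter_hour SMALLINT           -- lets the 15-min traffic grid and the
                                    -- 10-min weather grid meet on a coarser bucket
);
"""

create_fact_count = """
CREATE TABLE IF NOT EXISTS fact_count (
    count_id    BIGINT IDENTITY(0,1),
    date_id     INT NOT NULL,       -- FK to dim_date
    time_id     INT NOT NULL,       -- FK to dim_time
    location_id INT NOT NULL,       -- FK to the locations dimension
    pedestrians INT,
    bicycles    INT
)
DISTKEY(date_id) SORTKEY(date_id, time_id);
"""
```
With this layout, joining `fact_count` and the weather fact table is a plain star-schema query on `date_id` plus a coarse time bucket, even though the raw measurement intervals differ.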
### Data Sources
- Traffic counts: one CSV file per year; these are downloaded programmatically.
- Weather data:
    - A single CSV file covering all data up to 2019; this is downloaded programmatically.
    - Data for the current year is requested from an [API](https://tecdottir.herokuapp.com/docs/) (see the sketch after this list).
- Traffic locations: JSON file, which has to be downloaded manually (request by email).
- Date and time dimensions: Because Redshift does not support all necessary data types, these tables have been developed locally in PostgreSQL and were then copied into Redshift (see the Acknowledgements section at the bottom of this file).
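A minimal sketch of the current-year request, assuming the `requests` library and a hypothetical `/measurements/{station}` endpoint with `startDate`/`endDate` parameters (check the linked API docs for the real route and parameter names):
``` python
import requests

# Hypothetical endpoint and parameters; verify against the API docs at
# https://tecdottir.herokuapp.com/docs/ before using.
BASE_URL = "https://tecdottir.herokuapp.com/measurements/{station}"

def fetch_weather(station: str, start: str, end: str) -> list:
    """Request weather measurements for one station and date range."""
    url = BASE_URL.format(station=station)
    resp = requests.get(url, params={"startDate": start, "endDate": end}, timeout=30)
    resp.raise_for_status()
    # Assumed response shape: {"result": [ {...measurement...}, ... ]}
    return resp.json().get("result", [])

# Example: pull the first week of the current year for one (assumed) station name.
rows = fetch_weather("tiefenbrunnen", "2020-01-01", "2020-01-07")
```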
### Data Quality Checks
Quality checks were incorporated into the ETL process. The script tests for missing values (and handles them appropriately). It also tests for duplicate records and eliminates them prior to loading into the database, since Redshift does not enforce uniqueness and other constraints.
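To illustrate the kind of check involved, here is a minimal pandas sketch; the file and column names are assumptions, and the actual script may handle missing values differently:
``` python
import pandas as pd

# Hypothetical raw traffic counts; file and column names are assumptions.
df = pd.read_csv("traffic_counts_2020.csv")

# Drop rows lacking the keys needed for the fact table.
df = df.dropna(subset=["timestamp", "location_id"])

# Redshift does not enforce uniqueness, so deduplicate before loading.
before = len(df)
df = df.drop_duplicates(subset=["timestamp", "location_id"], keep="first")
print(f"Removed {before - len(df)} duplicate rows")
```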
## Other Scenarios
**If the database size was increased by 100X**: Even though the actual data would fit neatly into a DB on a local machine, it is stored in a distributed Redshift database running on 2 nodes. This setup is highly scalable: you can switch to higher-performing nodes or add many new nodes if required. But to maintain high query performance in such a scenario, it might be worth considering a move to a NoSQL database such as Cassandra, with each table optimized for a specific query.
**If the database is updated every morning at 7am**: This would make perfect sense, because the weather data can be requested in real time and the CSV with the current traffic counts is updated every day on the Open Data Zurich site. The best approach might be a pipeline scheduling application such as Airflow, with the uploading tasks implemented using Airflow hooks to AWS S3 buckets and Redshift. Transform tasks could be implemented as Python callables with fairly limited modifications to the existing ETL script, especially in the case of the weather data, which already has an update function implemented (see the sketch below).
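A minimal Airflow sketch of such a schedule; the DAG and the callables imported from the project scripts are assumptions, not the existing code:
``` python
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Hypothetical wrappers around the existing ETL steps (assumed to exist).
from prepare_data import update_weather_data
from etl import load_staging_to_facts

default_args = {"owner": "open-data-zh", "retries": 1}

with DAG(
    dag_id="daily_open_data_update",
    default_args=default_args,
    start_date=datetime(2020, 6, 1),
    schedule_interval="0 7 * * *",   # every morning at 7am
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch_and_upload", python_callable=update_weather_data)
    load = PythonOperator(task_id="load_into_redshift", python_callable=load_staging_to_facts)
    fetch >> load
```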
**If the database needed to be accessed by 100+ people**: Redshift supports concurrency scaling, adding clusters as needed to handle increases in demand for concurrent querying of the database. But there are a number of technical requirements for concurrency scaling, such as node type, sort key type (interleaved sorting cannot be used) and query type (e.g. read-only), that must be met. The existing data model and cluster configuration would have to be reviewed against these requirements.
## Run
Script to create the Redshift cluster:
``` sh
python create_redshift_cluster.py
```
Script to create the database tables or to reset the database:
``` sh
python create_tables.py
```
Script to retrieve the data, preprocess it locally and upload it to S3 (ETL, pt. 1):
``` sh
python prepare_data.py
```
ETL pipeline to populate the database tables from the data on S3 (ETL, pt. 2):
``` sh
python etl.py
```
## Acknowledgements
Resources which helped me develop the date and time dimensions in PostgreSQL:
- [wiki on postgresql.org](https://wiki.postgresql.org/wiki/Date_and_Time_dimensions)
- [blogpost from nicholasduffy.com](https://nicholasduffy.com/posts/postgresql-date-dimension/)
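
For illustration, here is a minimal sketch of how such a dimension can be generated locally with PostgreSQL's `generate_series` (a simplified version of what the linked resources describe; the real dim_date has more attributes, and the connection string is an assumption):
``` python
import psycopg2

# Simplified dim_date generation; the actual table carries more attributes.
dim_date_sql = """
INSERT INTO dim_date (date_id, date, year, month, day, day_of_week)
SELECT
    TO_CHAR(d, 'YYYYMMDD')::INT AS date_id,
    d::DATE                     AS date,
    EXTRACT(YEAR   FROM d)::INT AS year,
    EXTRACT(MONTH  FROM d)::INT AS month,
    EXTRACT(DAY    FROM d)::INT AS day,
    EXTRACT(ISODOW FROM d)::INT AS day_of_week
FROM generate_series('2007-01-01'::DATE, '2030-12-31'::DATE, INTERVAL '1 day') AS d;
"""

with psycopg2.connect("dbname=dev user=postgres") as conn:  # assumed local DSN
    with conn.cursor() as cur:
        cur.execute(dim_date_sql)
```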