# pdpc-decisions
[![GitHub last commit](https://img.shields.io/github/last-commit/houfu/pdpc-decisions)](https://github.com/houfu/pdpc-decisions)
[![Build Status](https://travis-ci.com/houfu/pdpc-decisions.svg?branch=master)](https://travis-ci.com/houfu/pdpc-decisions)
[![Docker Cloud Automated build](https://img.shields.io/docker/cloud/automated/houfu/pdpc-decisions)](https://hub.docker.com/r/houfu/pdpc-decisions)
[![PyPI version](https://badge.fury.io/py/pdpc-decisions.svg)](https://badge.fury.io/py/pdpc-decisions)
This package contains utilities which allow you to create a corpus of decisions from the
Personal Data Protection Commission of Singapore's
[Data Protection Enforcement Cases](https://www.pdpc.gov.sg/Commissions-Decisions/Data-Protection-Enforcement-Cases).
The primary use of such a corpus is for study, for example with data science tools such as
natural language processing.
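For a taste of what the corpus enables, here is a minimal sketch that reads the generated text files with spaCy (one of this package's dependencies). The `corpus/` folder and the `en_core_web_sm` model are assumptions; adjust them to your setup.
```python
# Minimal sketch: read the generated corpus with spaCy.
# Assumes the text files are in ./corpus/ (the tool's default output folder)
# and that en_core_web_sm is installed (python -m spacy download en_core_web_sm).
import pathlib

import spacy

nlp = spacy.load("en_core_web_sm")
for path in pathlib.Path("corpus").glob("*.txt"):
    doc = nlp(path.read_text(encoding="utf-8"))
    print(path.name, len(list(doc.sents)), "sentences")
```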
It currently has the following features:
* Visit the Personal Data Protection Commission of Singapore's
[Data Protection Enforcement Cases](https://www.pdpc.gov.sg/Commissions-Decisions/Data-Protection-Enforcement-Cases)
and compile a table of decisions with information from the summaries provided by the PDPC for each case.
* Save this table of decisions as a CSV file.
* Download all the PDF files of the decisions from the PDPC's website.
If a decision is not a PDF, collect the information provided on the decision web page and save it as a text file.
* Convert the PDF files into text files (a conversion sketch follows this list).
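The conversion step works along the lines of this minimal sketch, which uses pdfminer.six's high-level API; the file names are illustrative and the package's own implementation may differ.
```python
# Sketch of the PDF-to-text step using pdfminer.six's high-level API.
# File names are illustrative; pdpc-decisions' own logic may differ.
from pdfminer.high_level import extract_text

text = extract_text("decision.pdf")
with open("decision.txt", "w", encoding="utf-8") as f:
    f.write(text)
```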
## Features provided by the scraper
* Published date
* Respondent
* Title
* Summary
* URL of PDF of decision
The following extra features are collected by passing `--extras` to the command:
* **[Extras]** Citation
* **[Extras]** Basic enforcement information (Financial penalty, warning, directions)
* **[Extras]** References (referred by, referring to)
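For example, to gather these extra features while saving the scraper's results as a CSV (assuming a local install, described below):
```shell script
pdpc-decisions --extras csv
```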
## What pdpc-decisions uses
* Python 3
* PDF Miner
* Selenium
* Chrome
* spaCy
## Installation
### Docker Image
I dockerised the application for ease of use.
It is probably the easiest and most straightforward way
to use the application, and I recommend it.
The dockerised application contains all prerequisites,
so there is no need for any manual installs.
You need to have Docker installed.
Pull the image from [docker hub](https://hub.docker.com/r/houfu/pdpc-decisions).
```shell script
docker pull houfu/pdpc-decisions
```
After that, you can run the image, passing actions and options to it.
For example, to have the application perform all actions:
```shell script
docker run houfu/pdpc-decisions all
```
On its own, this is not very useful, because downloads are stored
inside the container and are not easily accessed.
Bind-mount a directory from your filesystem and use the `--root` option
to direct the application to save the files there. For example:
```shell script
# The target directory (./target here) must exist before running this command.
docker run \
  --mount type=bind,source="$(pwd)"/target,target=/code/download \
  houfu/pdpc-decisions \
  all \
  --root /code/download/
```
### Local install
* Install via pip:
```shell script
pip install pdpc-decisions
```
* Once the package is installed, use the command-line tool `pdpc-decisions` to run the script.
* If necessary, install [Chrome](https://www.google.com/chrome/)
and [ChromeDriver](https://sites.google.com/a/chromium.org/chromedriver/) for Selenium to work
(a quick sanity check is sketched below).

The main entry point for the script is `pdpcdecision.py`.
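To verify that Selenium can find Chrome and ChromeDriver, you can run a short check like this; it is just an environment test, not part of the package:
```python
# Sanity check: Selenium should be able to start headless Chrome.
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://www.pdpc.gov.sg/")
print(driver.title)
driver.quit()
```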
## Usage
The script accepts the following actions and options.

Actions:

* `all`: Does all the actions (scraping the website, saving a CSV,
downloading all files and creating a corpus).
* `corpus`: After downloading all the decisions from the website, converts
them into text files.
* `csv`: Saves the items gathered by the scraper as a CSV file.
* `files`: Downloads all the decisions from the PDPC website into a
folder.
Options:

* `--csv FILE`: Filename for saving the items gathered by the scraper as a
CSV file. [default: scrape_results.csv]
* `--download DIRECTORY`: Destination folder for downloads of all PDF/web pages
of PDPC decisions. [default: download/]
* `--corpus DIRECTORY`: Destination folder for PDPC decisions converted to
text files. [default: corpus/]
* `-r, --root DIRECTORY`: Root directory for downloads and files. [default:
your current working directory]
* `--extras/--no-extras`: Add extra features to the data collected.
This increases processing time and is experimental, as it requires reading the actual decisions.
Ignored if the action is `files` or `downloads`. [default: --no-extras]
* `--help`: Show this message and exit.
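For example, to perform all actions with the extra features, saving everything under a directory of your choice (the path is illustrative):
```shell script
pdpc-decisions all --extras --root ./pdpc-data
```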
## Contact
Feel free to let me have your suggestions, comments or issues using the issue tracker or by
[emailing me](mailto:houfu@outlook.sg).
It would also be nice to hear how you have used this corpus by using the above contacts.