[![License](http://img.shields.io/:license-Apache_v2-blue.svg)](https://github.com/maropu/spark-data-repair-plugin/blob/master/LICENSE)
[![Build and test](https://github.com/maropu/spark-data-repair-plugin/workflows/Build%20and%20tests/badge.svg)](https://github.com/maropu/spark-data-repair-plugin/actions?query=workflow%3A%22Build+and+tests%22)
<!---
[![Coverage Status](https://coveralls.io/repos/github/maropu/spark-data-repair-plugin/badge.svg?branch=master)](https://coveralls.io/github/maropu/spark-data-repair-plugin?branch=master)
-->
This is an experimental prototype for building statistical models that repair errors in tabular data on [Apache Spark](https://spark.apache.org/),
a parallel and distributed framework for large-scale data processing.
Clean and consistent data is a major concern for downstream analytics:
clean data makes machine learning and BI reporting more accurate, and
data that is consistent with constraints (e.g., functional dependencies) is important for efficient query plans.
Data repair is therefore a first step toward a reliable analytics pipeline.
## How to Repair Error Cells
```
$ git clone https://github.com/maropu/spark-data-repair-plugin.git
$ cd spark-data-repair-plugin
# This repository includes a simple wrapper script `bin/python` to create
# a conda virtual environment to resolve the required dependencies
# (e.g., Python 3.7 and PySpark 3.2), and then
# launch a Python VM with our plugin.
$ ./bin/python
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.2.0
/_/
Using Python version 3.7.11 (default, Jul 27 2021 07:03:16)
SparkSession available as 'spark'.
Delphi APIs (version 0.1.0-spark3.2-EXPERIMENTAL) available as 'delphi'.
# Loads CSV data having seven NULL cells
>>> spark.read.option("header", True).csv("./testdata/adult.csv").createOrReplaceTempView("adult")
>>> spark.table("adult").show()
+---+-----+------------+-----------------+-------------+------+-------------+-----------+
|tid| Age| Education| Occupation| Relationship| Sex| Country| Income|
+---+-----+------------+-----------------+-------------+------+-------------+-----------+
| 0|31-50|Some-college| Craft-repair| Husband| Male|United-States|LessThan50K|
| 1| >50|Some-college| Exec-managerial| Own-child|Female|United-States|LessThan50K|
| 2|31-50| Bachelors| Sales| Husband| Male|United-States|LessThan50K|
| 3|22-30| HS-grad| Craft-repair| Own-child| null|United-States|LessThan50K|
| 4|22-30| HS-grad| Farming-fishing| Husband|Female|United-States|LessThan50K|
| 5| null|Some-college| Craft-repair| Husband| Male|United-States| null|
| 6|31-50| HS-grad| Prof-specialty|Not-in-family|Female|United-States|LessThan50K|
| 7|31-50| Prof-school| Prof-specialty| Husband| null| India|MoreThan50K|
| 8|18-21|Some-college| Adm-clerical| Own-child|Female|United-States|LessThan50K|
| 9| >50| HS-grad| Farming-fishing| Husband| Male|United-States|LessThan50K|
| 10| >50| Assoc-voc| Prof-specialty| Husband| Male|United-States|LessThan50K|
| 11| >50| HS-grad| Sales| Husband|Female|United-States|MoreThan50K|
| 12| null| Bachelors| Exec-managerial| Husband| null|United-States|MoreThan50K|
| 13|22-30| HS-grad| Craft-repair|Not-in-family| Male|United-States|LessThan50K|
| 14|31-50| Assoc-acdm| Exec-managerial| Unmarried| Male|United-States|LessThan50K|
| 15|22-30|Some-college| Sales| Own-child| Male|United-States|LessThan50K|
| 16| >50|Some-college| Exec-managerial| Unmarried|Female|United-States| null|
| 17|31-50| HS-grad| Adm-clerical|Not-in-family|Female|United-States|LessThan50K|
| 18|31-50| 10th|Handlers-cleaners| Husband| Male|United-States|LessThan50K|
| 19|31-50| HS-grad| Sales| Husband| Male| Iran|MoreThan50K|
+---+-----+------------+-----------------+-------------+------+-------------+-----------+
# Runs a job to compute repair updates for the seven NULL cells above in the `adult` table
# The `repaired` column holds the proposed updates to repair them
>>> from repair.errors import NullErrorDetector
>>> repair_updates_df = delphi.repair \
... .setInput("adult") \
... .setRowId("tid") \
... .setErrorDetectors([NullErrorDetector()]) \
... .run()
>>> repair_updates_df.show()
+---+---------+-------------+-----------+
|tid|attribute|current_value| repaired|
+---+---------+-------------+-----------+
| 7| Sex| null| Female|
| 12| Age| null| 18-21|
| 12| Sex| null| Female|
| 3| Sex| null| Female|
| 5| Age| null| 18-21|
| 5| Income| null|MoreThan50K|
| 16| Income| null|MoreThan50K|
+---+---------+-------------+-----------+
# Set `repair_data` to `True` to get the repaired data directly
>>> clean_df = delphi.repair \
... .setInput("adult") \
... .setRowId("tid") \
... .setErrorDetectors([NullErrorDetector()]) \
... .run(repair_data=True)
>>> clean_df.show()
+---+-----+------------+-----------------+-------------+------+-------------+-----------+
|tid| Age| Education| Occupation| Relationship| Sex| Country| Income|
+---+-----+------------+-----------------+-------------+------+-------------+-----------+
| 0|31-50|Some-college| Craft-repair| Husband| Male|United-States|LessThan50K|
| 1| >50|Some-college| Exec-managerial| Own-child|Female|United-States|LessThan50K|
| 2|31-50| Bachelors| Sales| Husband| Male|United-States|LessThan50K|
| 3|22-30| HS-grad| Craft-repair| Own-child| Male|United-States|LessThan50K|
| 4|22-30| HS-grad| Farming-fishing| Husband|Female|United-States|LessThan50K|
| 5|31-50|Some-college| Craft-repair| Husband| Male|United-States|LessThan50K|
| 6|31-50| HS-grad| Prof-specialty|Not-in-family|Female|United-States|LessThan50K|
| 7|31-50| Prof-school| Prof-specialty| Husband| Male| India|MoreThan50K|
| 8|18-21|Some-college| Adm-clerical| Own-child|Female|United-States|LessThan50K|
| 9| >50| HS-grad| Farming-fishing| Husband| Male|United-States|LessThan50K|
| 10| >50| Assoc-voc| Prof-specialty| Husband| Male|United-States|LessThan50K|
| 11| >50| HS-grad| Sales| Husband|Female|United-States|MoreThan50K|
| 12|31-50| Bachelors| Exec-managerial| Husband| Male|United-States|MoreThan50K|
| 13|22-30| HS-grad| Craft-repair|Not-in-family| Male|United-States|LessThan50K|
| 14|31-50| Assoc-acdm| Exec-managerial| Unmarried| Male|United-States|LessThan50K|
| 15|22-30|Some-college| Sales| Own-child| Male|United-States|LessThan50K|
| 16| >50|Some-college| Exec-managerial| Unmarried|Female|United-States|LessThan50K|
| 17|31-50| HS-grad| Adm-clerical|Not-in-family|Female|United-States|LessThan50K|
| 18|31-50| 10th|Handlers-cleaners| Husband| Male|United-States|LessThan50K|
| 19|31-50| HS-grad| Sales| Husband| Male| Iran|MoreThan50K|
+---+-----+------------+-----------------+-------------+------+-------------+-----------+
# Or, you can merge the computed repair updates with the input table as follows
>>> repair_updates_df.createOrReplaceTempView("predicted")
>>> clean_df = delphi.misc.options({"repair_updates": "predicted", "table_name": "adult", "row_id": "tid"}).repair()
>>> clean_df.show()
<the same output as above>
```
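The merge step at the end of the session above can be pictured with a small, Spark-free sketch. This is only an illustration of the idea, not the plugin's implementation: the helper `apply_repair_updates`, the `Row` and `Update` aliases, and the three-row toy table are all hypothetical, and the update tuples simply mirror the `tid`, `attribute`, and `repaired` columns of `repair_updates_df`.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory stand-ins for a Spark table and its repair updates.
Row = Dict[str, Optional[str]]
# (row id, attribute name, proposed value), mirroring the `tid`,
# `attribute`, and `repaired` columns shown above.
Update = Tuple[str, str, str]


def apply_repair_updates(rows: List[Row], updates: List[Update],
                         row_id: str = "tid") -> List[Row]:
    """Return copies of `rows` with every cell named in `updates` overwritten."""
    # Copy each row so the input table is left untouched.
    repaired = {row[row_id]: dict(row) for row in rows}
    for rid, attribute, value in updates:
        if rid in repaired:
            repaired[rid][attribute] = value
    # Preserve the original row order.
    return [repaired[row[row_id]] for row in rows]


# Toy data: a three-row, three-column slice with two NULL cells.
rows = [
    {"tid": "3", "Age": "22-30", "Sex": None},
    {"tid": "4", "Age": "22-30", "Sex": "Female"},
    {"tid": "5", "Age": None, "Sex": "Male"},
]
updates = [("3", "Sex", "Male"), ("5", "Age", "31-50")]

clean_rows = apply_repair_updates(rows, updates)
```

Keying the updates by row ID and attribute is what lets a compact list of cell-level changes be applied to a table of any width; in Spark the same effect would normally come from a join on the row ID rather than a Python loop.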
For more running examples, please see the Python scripts in the [resources/examples](./resources/examples) folder.
NOTE: Dirty data can contain many types of errors [9], but this plugin aims to repair
attributes whose correct values already appear elsewhere in the data.
For instance, in the `Sex` column i