Functionality for building statistical models to repair dirty tabular data in Spark (Jupyter Notebook, Python)

147 files in total

csv: 33

py: 33

scala: 20


182 views · Uploaded 2023-04-28 14:10:16 · 27.58 MB ZIP


Contents of the ZIP archive (147 files):

tax_clean.csv 56.6MB

tax.csv 15.18MB

movies_clean.csv 6.37MB

movies.csv 4.34MB

hospital_clean.csv 546KB

beers_clean.csv 488KB

rayyan_clean.csv 427KB

flights_clean.csv 375KB

hospital.csv 285KB

rayyan.csv 267KB

beers.csv 249KB

flights.csv 151KB

boston_clean.csv 86KB

boston_orig.csv 35KB

boston.csv 35KB

boston.csv 34KB

hospital_error_cells.csv 16KB

iris_clean.csv 12KB

iris_orig.csv 3KB

iris.csv 3KB

adult_clean.csv 3KB

iris.csv 3KB

adult_clean.csv 1KB

adult.csv 1KB

adult_repair.csv 160B

adult_dirty.csv 65B

RegexBase.g4 1KB

.gitignore 690B

.gitkeep 0B

mypy.ini 1022B

tox.ini 1018B

boston.ipynb 297KB

adult.ipynb 53KB

hospital.ipynb 46KB

spark-data-repair-plugin_2.12_spark3.2_0.1.0-EXPERIMENTAL-with-dependencies.jar 552KB

snippets.jupyterlab-settings 9KB

plugin.jupyterlab-settings 341B

commands.jupyterlab-settings 254B

tracker.jupyterlab-settings 64B

LICENSE 11KB

lint-python 9KB

Makefile 638B

README.md 20KB

README.md 849B

README.md 100B

mvn 6KB

tax.py.out 11KB

rayyan.py.out 10KB

movies.py.out 9KB

flights.py.out 8KB

beers.py.out 8KB

error-detectors.py.out 7KB

iris.py.out 5KB

hospital.py.out 5KB

boston.py.out 5KB

adult.py.out 4KB

hospital-dist.parquet 201KB

hospital-error-analysis.parquet 14KB

hospital-training-data-hist.parquet 7KB

log4j.properties 2KB

model.py 72KB

test_model.py 53KB

errors.py 26KB

test_errors.py 17KB

test_model_perf.py 15KB

misc.py 14KB

train.py 12KB

run-tests.py 11KB

utils.py 9KB

test_misc.py 8KB

test_utils.py 7KB

conda.py 7KB

testutils.py 6KB

test_costs.py 4KB

error-detectors.py 3KB

main.py 3KB

boston.py 3KB

costs.py 3KB

requirements.py 3KB

conf.py 3KB

hospital.py 2KB

beers.py 2KB

tax.py 2KB

adult.py 2KB

.startup.py 2KB

flights.py 2KB

movies.py 2KB

rayyan.py 2KB

api.py 2KB

iris.py 2KB

hospital-preprocess-blocking.py 758B

__init__.py 0B


[![License](http://img.shields.io/:license-Apache_v2-blue.svg)](https://github.com/maropu/spark-data-repair-plugin/blob/master/LICENSE)
[![Build and test](https://github.com/maropu/spark-data-repair-plugin/workflows/Build%20and%20tests/badge.svg)](https://github.com/maropu/spark-data-repair-plugin/actions?query=workflow%3A%22Build+and+tests%22)

This is an experimental prototype for building a statistical model to repair tabular data errors on [Apache Spark](https://spark.apache.org/), a parallel and distributed framework for large-scale data processing. Clean and consistent data is a major concern for downstream analytics: clean data makes machine learning and BI reporting more accurate, and data that is consistent with constraints (e.g., functional dependencies) is important for efficient query plans. Data repair is therefore a first step toward a reliable analytics pipeline.

## How to Repair Error Cells

```
$ git clone https://github.com/maropu/spark-data-repair-plugin.git
$ cd spark-data-repair-plugin

# This repository includes a simple wrapper script `bin/python` to create
# a conda virtual environment to resolve the required dependencies
# (e.g., Python 3.7 and PySpark 3.2), and then
# launch a Python VM with our plugin.
$ ./bin/python

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.0
      /_/

Using Python version 3.7.11 (default, Jul 27 2021 07:03:16)
SparkSession available as 'spark'.
Delphi APIs (version 0.1.0-spark3.2-EXPERIMENTAL) available as 'delphi'.

# Loads CSV data having seven NULL cells
>>> spark.read.option("header", True).csv("./testdata/adult.csv").createOrReplaceTempView("adult")
>>> spark.table("adult").show()
+---+-----+------------+-----------------+-------------+------+-------------+-----------+
|tid|  Age|   Education|       Occupation| Relationship|   Sex|      Country|     Income|
+---+-----+------------+-----------------+-------------+------+-------------+-----------+
|  0|31-50|Some-college|     Craft-repair|      Husband|  Male|United-States|LessThan50K|
|  1|  >50|Some-college|  Exec-managerial|    Own-child|Female|United-States|LessThan50K|
|  2|31-50|   Bachelors|            Sales|      Husband|  Male|United-States|LessThan50K|
|  3|22-30|     HS-grad|     Craft-repair|    Own-child|  null|United-States|LessThan50K|
|  4|22-30|     HS-grad|  Farming-fishing|      Husband|Female|United-States|LessThan50K|
|  5| null|Some-college|     Craft-repair|      Husband|  Male|United-States|       null|
|  6|31-50|     HS-grad|   Prof-specialty|Not-in-family|Female|United-States|LessThan50K|
|  7|31-50| Prof-school|   Prof-specialty|      Husband|  null|        India|MoreThan50K|
|  8|18-21|Some-college|     Adm-clerical|    Own-child|Female|United-States|LessThan50K|
|  9|  >50|     HS-grad|  Farming-fishing|      Husband|  Male|United-States|LessThan50K|
| 10|  >50|   Assoc-voc|   Prof-specialty|      Husband|  Male|United-States|LessThan50K|
| 11|  >50|     HS-grad|            Sales|      Husband|Female|United-States|MoreThan50K|
| 12| null|   Bachelors|  Exec-managerial|      Husband|  null|United-States|MoreThan50K|
| 13|22-30|     HS-grad|     Craft-repair|Not-in-family|  Male|United-States|LessThan50K|
| 14|31-50|  Assoc-acdm|  Exec-managerial|    Unmarried|  Male|United-States|LessThan50K|
| 15|22-30|Some-college|            Sales|    Own-child|  Male|United-States|LessThan50K|
| 16|  >50|Some-college|  Exec-managerial|    Unmarried|Female|United-States|       null|
| 17|31-50|     HS-grad|     Adm-clerical|Not-in-family|Female|United-States|LessThan50K|
| 18|31-50|        10th|Handlers-cleaners|      Husband|  Male|United-States|LessThan50K|
| 19|31-50|     HS-grad|            Sales|      Husband|  Male|         Iran|MoreThan50K|
+---+-----+------------+-----------------+-------------+------+-------------+-----------+

# Runs a job to compute repair updates for the seven NULL cells above in the `adult` view
# The `repaired` column holds the proposed updates to repair them
>>> from repair.errors import NullErrorDetector
>>> repair_updates_df = delphi.repair \
...   .setInput("adult") \
...   .setRowId("tid") \
...   .setErrorDetectors([NullErrorDetector()]) \
...   .run()

>>> repair_updates_df.show()
+---+---------+-------------+-----------+
|tid|attribute|current_value|   repaired|
+---+---------+-------------+-----------+
|  7|      Sex|         null|     Female|
| 12|      Age|         null|      18-21|
| 12|      Sex|         null|     Female|
|  3|      Sex|         null|     Female|
|  5|      Age|         null|      18-21|
|  5|   Income|         null|MoreThan50K|
| 16|   Income|         null|MoreThan50K|
+---+---------+-------------+-----------+

# Set `repair_data` to `True` to get the repaired data directly
>>> clean_df = delphi.repair \
...   .setInput("adult") \
...   .setRowId("tid") \
...   .setErrorDetectors([NullErrorDetector()]) \
...   .run(repair_data=True)

>>> clean_df.show()
+---+-----+------------+-----------------+-------------+------+-------------+-----------+
|tid|  Age|   Education|       Occupation| Relationship|   Sex|      Country|     Income|
+---+-----+------------+-----------------+-------------+------+-------------+-----------+
|  0|31-50|Some-college|     Craft-repair|      Husband|  Male|United-States|LessThan50K|
|  1|  >50|Some-college|  Exec-managerial|    Own-child|Female|United-States|LessThan50K|
|  2|31-50|   Bachelors|            Sales|      Husband|  Male|United-States|LessThan50K|
|  3|22-30|     HS-grad|     Craft-repair|    Own-child|  Male|United-States|LessThan50K|
|  4|22-30|     HS-grad|  Farming-fishing|      Husband|Female|United-States|LessThan50K|
|  5|31-50|Some-college|     Craft-repair|      Husband|  Male|United-States|LessThan50K|
|  6|31-50|     HS-grad|   Prof-specialty|Not-in-family|Female|United-States|LessThan50K|
|  7|31-50| Prof-school|   Prof-specialty|      Husband|  Male|        India|MoreThan50K|
|  8|18-21|Some-college|     Adm-clerical|    Own-child|Female|United-States|LessThan50K|
|  9|  >50|     HS-grad|  Farming-fishing|      Husband|  Male|United-States|LessThan50K|
| 10|  >50|   Assoc-voc|   Prof-specialty|      Husband|  Male|United-States|LessThan50K|
| 11|  >50|     HS-grad|            Sales|      Husband|Female|United-States|MoreThan50K|
| 12|31-50|   Bachelors|  Exec-managerial|      Husband|  Male|United-States|MoreThan50K|
| 13|22-30|     HS-grad|     Craft-repair|Not-in-family|  Male|United-States|LessThan50K|
| 14|31-50|  Assoc-acdm|  Exec-managerial|    Unmarried|  Male|United-States|LessThan50K|
| 15|22-30|Some-college|            Sales|    Own-child|  Male|United-States|LessThan50K|
| 16|  >50|Some-college|  Exec-managerial|    Unmarried|Female|United-States|LessThan50K|
| 17|31-50|     HS-grad|     Adm-clerical|Not-in-family|Female|United-States|LessThan50K|
| 18|31-50|        10th|Handlers-cleaners|      Husband|  Male|United-States|LessThan50K|
| 19|31-50|     HS-grad|            Sales|      Husband|  Male|         Iran|MoreThan50K|
+---+-----+------------+-----------------+-------------+------+-------------+-----------+

# Or, you can merge the computed repair updates with the input table as follows
>>> repair_updates_df.createOrReplaceTempView("predicted")
>>> clean_df = delphi.misc.options({"repair_updates": "predicted", "table_name": "adult", "row_id": "tid"}).repair()
>>> clean_df.show()
<the same output as above>
```

For more running examples, please check the Python scripts in the [resources/examples](./resources/examples) folder.

NOTE: There are many types of errors in dirty data [9], but our purpose is to repair cells in attributes that already contain the correct values for those errors. For instance, in the `Sex` column i
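
To make the merge of repair updates with the input table more concrete, here is a minimal pure-Python sketch of applying `(tid, attribute, repaired)` updates to input rows. It is an illustration only and not the plugin's implementation: `apply_repair_updates` is a hypothetical helper, and the rows are a toy subset made up for this example using the column names from the session above.

```python
def apply_repair_updates(rows, updates, row_id="tid"):
    """Return copies of `rows` with each updated cell replaced by its proposed repair.

    Hypothetical helper for illustration; the plugin performs this merge on Spark.
    """
    # Copy every row so the caller's input is left untouched.
    by_id = {row[row_id]: dict(row) for row in rows}
    for upd in updates:
        target = by_id.get(upd[row_id])
        if target is None:
            continue  # the update refers to a row that is not in the input
        target[upd["attribute"]] = upd["repaired"]
    # Preserve the original row order.
    return [by_id[row[row_id]] for row in rows]


# Toy data (assumed for this sketch), shaped like the tables in the session.
rows = [
    {"tid": 3, "Age": "22-30", "Sex": None},
    {"tid": 4, "Age": "22-30", "Sex": "Female"},
]
updates = [
    {"tid": 3, "attribute": "Sex", "current_value": None, "repaired": "Female"},
]

cleaned = apply_repair_updates(rows, updates)
print(cleaned)
```

Keeping the updates as a separate table of `(row id, attribute, repaired)` triples, as the session does, lets the proposed repairs be reviewed or filtered before they are applied to the data.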
