# pandas_dq
Analyze and clean your data in a single line of code with a Scikit-Learn compatible Transformer.
# Table of Contents
<ul>
<li><a href="#introduction">What is pandas_dq</a></li>
<li><a href="#uses">How to use pandas_dq</a></li>
<li><a href="#install">How to install pandas_dq</a></li>
<li><a href="#usage">Usage</a></li>
<li><a href="#api">API</a></li>
<li><a href="#maintainers">Maintainers</a></li>
<li><a href="#contributing">Contributing</a></li>
<li><a href="#license">License</a></li>
</ul>
<p>
## Introduction
`pandas_dq` is a new Python library for data quality analysis and improvement. It is fast, efficient and scalable.
<b>Alert!</b>: If you are using `pandas version 2.0` ("the new pandas"), be aware that unexpected errors have been appearing in many libraries that use pandas underneath, and `pandas_dq` is no exception. If you plan to use `pandas_dq` with `pandas version 2.0`, you may see such errors, and we can't and won't fix them!
### What is pandas_dq?
`pandas_dq` is a new Python library for automatically cleaning your dirty dataset using pandas and scikit-learn functions. You can analyze your dataset and fix its issues, all in a single line of code! Recent addition: pandas_dq can check your dataset's data types against a specific schema.
![pandas_dq](./images/pandas_dq_logo.png)
## Uses
`pandas_dq` has multiple important modules: `dq_report`, `Fix_DQ` and now `DataSchemaChecker`. <br>
### 1. dq_report function
![dq_report_code](./images/find_dq_screenshot.png)
<p>`dq_report` analyzes your dataset and prints a data quality report covering the following issues:
<ol>
<li>It detects ID columns</li>
<li>It detects zero-variance columns </li>
<li>It identifies rare categories (categories that account for less than 5% of the values in a column)</li>
<li>It finds infinite values in a column</li>
<li>It detects mixed data types (i.e. a column that has more than a single data type)</li>
<li>It detects outliers (i.e. values in a float column that fall outside the bounds derived from the Inter Quartile Range)</li>
<li>It detects high cardinality features (i.e. a feature that has more than 100 categories)</li>
<li>It detects highly correlated features (i.e. two features that have an absolute correlation higher than 0.8)</li>
<li>It detects duplicate rows (i.e. the same row occurs more than once in the dataset)</li>
<li>It detects duplicate columns (i.e. the same column occurs twice or more in the dataset)</li>
<li>It detects skewed distributions (i.e. a feature that has a skew more than 1.0) </li>
<li>It detects imbalanced classes (i.e. one class of the target variable occurs significantly more often than the others) </li>
<li>It detects feature leakage (i.e. a feature that is highly correlated to target with correlation > 0.8)</li>
</ol>
Note that for large datasets, generating this report may take time, so please be patient while it analyzes your dataset!
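
Several of the checks listed above can be approximated with plain pandas. The sketch below is not the code inside `dq_report`. The function name `sketch_checks`, the 1.5 x IQR outlier rule, and the reading of "rare" as "under 5% of rows" are assumptions made for illustration; the 0.8 correlation cut-off comes from the list above.

```python
import numpy as np
import pandas as pd


def sketch_checks(df, rare_threshold=0.05, corr_threshold=0.8):
    """Illustrative only: approximate a few data quality checks with pandas."""
    findings = {}

    # Zero-variance columns: a single distinct value (NaN counts as a value).
    findings["zero_variance"] = [
        c for c in df.columns if df[c].nunique(dropna=False) <= 1
    ]

    # Rare categories: values that cover less than rare_threshold of the rows.
    rare = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue
        freq = df[col].value_counts(normalize=True)
        rare_values = freq[freq < rare_threshold].index.tolist()
        if rare_values:
            rare[col] = rare_values
    findings["rare_categories"] = rare

    # Infinite values in numeric columns.
    numeric = df.select_dtypes(include="number").astype("float64")
    findings["infinite"] = [c for c in numeric.columns if np.isinf(numeric[c]).any()]

    # Outliers: finite values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (assumed rule).
    finite = numeric.replace([np.inf, -np.inf], np.nan)
    q1, q3 = finite.quantile(0.25), finite.quantile(0.75)
    iqr = q3 - q1
    outside = (finite < q1 - 1.5 * iqr) | (finite > q3 + 1.5 * iqr)
    findings["outliers"] = {c: int(n) for c, n in outside.sum().items() if n > 0}

    # Highly correlated pairs: absolute Pearson correlation above the threshold.
    corr = finite.corr().abs()
    cols = list(corr.columns)
    findings["correlated_pairs"] = [
        (a, b)
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
        if corr.loc[a, b] > corr_threshold
    ]

    # Duplicate rows: number of rows that repeat an earlier row.
    findings["duplicate_rows"] = int(df.duplicated().sum())
    return findings
```

Calling `sketch_checks(df)` on a dataframe returns a plain dictionary, which makes the individual findings easy to inspect or log.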
### 2. Fix_DQ class: a scikit-learn transformer that detects data quality issues and cleans them, all in one line of code
![fix_dq](./images/fix_dq_screenshot.png)
<p>`Fix_DQ` is a great way to clean an entire train dataset and apply the same steps to a test dataset in an MLOps pipeline. `Fix_DQ` detects most issues in your data (similar to `dq_report` but without the target-related steps) in one step, during the `fit` method. The fitted transformer can then be saved (or "pickled") so that the same steps can be applied to test data, either at the same time or later.<br>
<p>Fix_DQ will perform the following data quality cleaning steps:
<ol>
<li>It removes ID columns from further processing</li>
<li>It removes zero-variance columns from further processing</li>
<li>It identifies rare categories and groups them into a single category called "Rare"</li>
<li>It finds infinite values and replaces them with an upper bound based on Inter Quartile Range</li>
<li>It detects mixed data types and drops those mixed-type columns from further processing</li>
<li>It detects outliers and suggests removing them or using robust statistics.</li>
<li>It detects high cardinality features but leaves them as they are.</li>
<li>It detects highly correlated features and drops one of them (whichever comes first in the column sequence)</li>
<li>It detects duplicate rows and keeps only one copy of each</li>
<li>It detects duplicate columns and keeps only one copy of each</li>
<li>It detects skewed distributions and applies log or box-cox transformations on them </li>
<li>It detects imbalanced classes and leaves them as they are </li>
<li>It detects feature leakage and drops one of those features if they are highly correlated to the target </li>
</ol>
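
To make a few of these steps concrete, here is a minimal pandas sketch of three of them: grouping rare categories, capping infinite values, and dropping duplicates. It is not `Fix_DQ`'s implementation. The helper names (`group_rare`, `cap_infinite`, `drop_duplicate_rows_and_columns`), the 5% threshold and the 1.5 x IQR whisker are hypothetical choices for illustration, and only positive infinity is handled.

```python
import numpy as np
import pandas as pd


def group_rare(series, threshold=0.05, label="Rare"):
    """Replace categories seen in fewer than `threshold` of the rows with `label`."""
    freq = series.value_counts(normalize=True)
    rare = freq[freq < threshold].index
    return series.where(~series.isin(rare), label)


def cap_infinite(series, whisker=1.5):
    """Replace +inf with Q3 + whisker * IQR, computed from the finite values only."""
    finite = series[np.isfinite(series)]
    q1, q3 = finite.quantile(0.25), finite.quantile(0.75)
    upper = q3 + whisker * (q3 - q1)
    return series.replace(np.inf, upper)


def drop_duplicate_rows_and_columns(df):
    """Keep the first copy of each duplicated row, then of each duplicated column."""
    df = df.drop_duplicates()
    return df.loc[:, ~df.T.duplicated()]
```

In a real train/test workflow, the rare-category list and the upper bound would be learned on the training data and reused on the test data, which is the role the `fit` and `transform` methods play in a transformer.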
<b>How can we use Fix_DQ in GridSearchCV to find the best model pipeline?</b>
<p>This is another way to find the best data cleaning steps for your train data and then use the cleaned data in hyperparameter tuning with GridSearchCV or RandomizedSearchCV, along with a LightGBM, XGBoost or scikit-learn model.<br>
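
The general pattern is a scikit-learn `Pipeline` whose first step is a cleaning transformer and whose last step is a model, wrapped in `GridSearchCV`. The sketch below uses a hypothetical stand-in transformer, `DropConstantColumns`, so that it runs without `pandas_dq` installed. Assuming `Fix_DQ` follows the scikit-learn transformer protocol as described above, `Fix_DQ()` would take the place of the stand-in, but that substitution is not tested here. The data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class DropConstantColumns(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for a cleaning transformer such as Fix_DQ."""

    def fit(self, X, y=None):
        # Learn which columns to keep from the training fold only.
        self.keep_ = [c for c in X.columns if X[c].nunique() > 1]
        return self

    def transform(self, X):
        # Apply the same column selection to any later dataframe.
        return X[self.keep_]


# Synthetic data: two informative-looking features and one constant column.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "f1": rng.normal(size=60),
    "f2": rng.normal(size=60),
    "const": 1.0,
})
y = (X["f1"] > 0).astype(int)

pipe = Pipeline([
    ("clean", DropConstantColumns()),  # Fix_DQ() would go here (assumption)
    ("model", LogisticRegression()),
])

# Tune the model's regularization strength; cleaning is refit inside each fold.
search = GridSearchCV(pipe, {"model__C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Because the cleaning step sits inside the pipeline, it is refit on each training fold, so nothing learned from a validation fold leaks into the cleaning.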
### 3. DataSchemaChecker class: a scikit-learn transformer that checks whether a pandas dataframe conforms to a given schema and transforms it
The class has two methods: fit and transform. You need to initialize the class with a schema that you want to compare your data's dtypes against. A schema is a dictionary that maps column names to data types.
```
# Example of a schema: every dtype must be written as a quoted string.
schema = {'name': 'string',
          'age': 'float32',
          'gender': 'object',
          'income': 'float64',
          'target': 'integer'}
```
The fit method takes a dataframe as an argument and checks if it matches the schema. The fit method first checks if the number of columns in the dataframe and the schema are equal. If not, it creates an exception. Finally, the fit method prints a table of exceptions it found in your data against the given schema.
The transform method takes a dataframe as an argument and, based on the given schema and the exceptions found, converts every exception column to the data type given in the schema. If a conversion fails, it skips that column and prints an error message.
![dq_ds](./images/data_schema_checker.png)
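
The check-then-convert idea described above can be sketched in plain pandas. This is not `DataSchemaChecker`'s code: the function names `check_schema` and `apply_schema` are hypothetical, raising `ValueError` on a column-count mismatch is an assumption, and the sketch assumes each schema value is a dtype name exactly as `str(dtype)` prints it (for example `'float32'` or `'int64'`).

```python
import pandas as pd


def check_schema(df, schema):
    """Return a table of columns whose dtype differs from the schema."""
    if len(df.columns) != len(schema):
        raise ValueError(
            f"dataframe has {len(df.columns)} columns, schema has {len(schema)}"
        )
    rows = []
    for col, expected in schema.items():
        actual = str(df[col].dtype) if col in df.columns else "missing"
        if actual != expected:
            rows.append({"column": col, "expected": expected, "actual": actual})
    return pd.DataFrame(rows, columns=["column", "expected", "actual"])


def apply_schema(df, schema, exceptions):
    """Convert each exception column to its schema dtype, skipping failures."""
    out = df.copy()
    for col in exceptions["column"]:
        try:
            out[col] = out[col].astype(schema[col])
        except (KeyError, ValueError, TypeError) as err:
            # Leave the column unchanged and report why it was skipped.
            print(f"Skipping column {col!r}: {err}")
    return out


data = pd.DataFrame({
    "age": [21.0, 35.0],
    "income": ["1000.5", "oops"],  # "oops" cannot be converted to float
    "target": [0.0, 1.0],
})
schema = {"age": "float32", "income": "float64", "target": "int64"}

exceptions = check_schema(data, schema)
converted = apply_schema(data, schema, exceptions)
```

Here `age` and `target` are converted, while `income` is left unchanged because one of its values cannot be parsed as a number.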
## Install
<p>
**Prerequisites:**
<ol>
<li><b>pandas_dq is built using pandas, numpy and scikit-learn - that's all.</b> It should run on almost all Python3 Anaconda installations without additional installs. You won't have to import any special libraries.</li>
</ol>
The best method to install pandas_dq is to use pip:<p>
```
pip install pandas_dq
```
To install from source:
```
cd <pandas_dq_Destination>
git clone git@github.com:AutoViML/pandas_dq.git
```
or download and unzip https://github.com/AutoViML/pandas_dq/archive/master.zip
```
conda create -n <your_env_name> python=3.7 anaconda
conda activate <your_env_name> # on older conda versions: `source activate <your_env_name>` (Linux/macOS) or `activate <your_env_name>` (Windows)
cd pandas_dq
pip install -r requirements.txt
```
## Usage
<p>
You can invoke `Fix_DQ` as a scikit-learn compatible fit and transform object. See syntax below.<p>
```
from pandas_dq import Fix_DQ
# Create a Fix_DQ transformer with default parameters. It reports
# data quality issues and cleans your data, all in one step.
fdq = Fix_DQ()
# Fit the transformer on X_train and transform it
X_train_transformed = fdq.fit_transform(X_train)
# Transform X_test using the fitted transformer
X_test_transformed = fdq.transform(X_test)
```
### If you are not using the transformer, you can simply call the dq_report function
```
from pandas_dq import dq_report
dq_report(data, target=target, csv_engine="pandas", verbose=1)
```
It prints out a data quality report like this:
![dq_report](./images/dq_report_screenshot.png)
## API
<p>
pandas_dq has a very simple API with just two modules to import: one finds data quality issues in your data and the other fixes them. Simple!
**Arguments**
`dq_report` has only 4 arguments:<br>
<b>Caution:</b> For very large data sets, we randomly sample 100K rows from your CSV file to speed up rep