# pandas_dq
`pandas_dq` is a data quality toolkit for pandas dataframes.
![pandas_dq](./images/pandas_dq_logo.png)
# Table of Contents
<ul>
<li><a href="#introduction">What is pandas_dq</a></li>
<li><a href="#Components">What are its main components</a></li>
<li><a href="#uses">How to use pandas_dq</a></li>
<li><a href="#install">How to install pandas_dq</a></li>
<li><a href="#usage">Usage</a></li>
<li><a href="#api">API</a></li>
<li><a href="#maintainers">Maintainers</a></li>
<li><a href="#contributing">Contributing</a></li>
<li><a href="#license">License</a></li>
</ul>
<p>
## Introduction
`pandas_dq` is a new Python library for data quality analysis and improvement. It is fast, efficient and scalable. `pandas_dq` is:
- A smart and simple way to clean and improve your pandas dataframes.
- A powerful way to boost your data analysis with high-quality pandas dataframes.
- A powerful and flexible library for data quality management in pandas.
### Data quality made easy with pandas and scikit-learn transformers
The new `pandas_dq` library in Python is a great addition to the `pandas` ecosystem. It provides a set of tools for data quality assessment, which can be used to identify and address potential problems with data sets. This can help to improve the quality of data analysis and ensure that results are reliable.
The `pandas_dq` library is still under development, but it already includes a number of useful features. These include:
- <b>Data profiling</b>: pandas_dq displays a report either in-line or in HTML to give you a quick overview of your data, including its features, their types, their null and unique value percentages, and their maximum and minimum values.
- <b>Train Test comparison</b>: pandas_dq displays a comparison report either in-line or in HTML to give you a quick comparison of your train and test datasets, including their distributional differences (using the KS Test) and their null and unique value percentages.
- <b>Data cleaning</b>: pandas_dq allows you to quickly identify and remove data quality issues and inconsistencies in your data set.
- <b>Data imputation</b>: pandas_dq allows you to fill missing values with your own choice of values for each feature in your data. For example, you can have one default for `age` feature and another for `income` feature.
- <b>Data transformation</b>: pandas_dq allows you to transform skewed features into a more normal-like distribution.
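
The imputation and transformation ideas above can be illustrated with plain `pandas` and `numpy`. This is a minimal sketch and not the `pandas_dq` API: the column names, the fill values and the use of `log1p` are assumptions chosen only for this example.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per feature (illustrative values only).
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, 52.0, 47.0],
    "income": [20_000.0, 25_000.0, np.nan, 40_000.0, 80_000.0, 400_000.0],
})

# One default per feature: `age` and `income` each get their own fill value.
fill_values = {"age": 30.0, "income": 30_000.0}
filled = df.fillna(value=fill_values)

# log1p compresses the long right tail of a non-negative, right-skewed feature.
skew_before = filled["income"].skew()
filled["income_log"] = np.log1p(filled["income"])
skew_after = filled["income_log"].skew()

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

A log transform is only one of several ways to reduce skew, and it applies only to non-negative values.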
The `pandas_dq` library is a valuable tool for anyone who works with data. It can help you to improve the quality of your data analysis and ensure that your results are reliable.
Here are some of the benefits of using the pandas_dq library:
- It can help you to identify and address potential problems with data sets before modeling.
- It can fix data quality issues and improve the quality of your data.
- It is easy to use and can be integrated with other `scikit-learn` pipelines.
<b>Alert!</b>: If you are using `pandas version 2.0` ("the new pandas"), be aware that unexpected errors are appearing in many libraries that use pandas underneath, and `pandas_dq` is no exception. If you plan to use `pandas_dq` with `pandas version 2.0`, you may see such errors, and we can't and won't fix them!
## Components
`pandas_dq` has the following main modules:
<li><b>dq_report</b>: This function displays a data quality report either inline or in HTML after it analyzes your dataset for various issues, such as missing values, outliers, duplicates, correlations, etc. It also checks the relationship between the features and the target variable (if provided) to detect data leakage.</li>
<li><b>dc_report</b>: This function displays a comparison report between train and test datasets either inline or in HTML after it analyzes both datasets for various issues, such as missing values, unique values, min and max, etc. It also provides a statistical test (KS test) to compare the distributional differences of numeric features and detect data drift. You can exclude target column(s) from the comparison between train and test.</li>
<li><b>Fix_DQ</b>: This class is a scikit-learn compatible transformer that can detect and fix data quality issues in one line of code. It can remove ID columns, zero-variance columns, rare categories, infinite values, mixed data types, outliers, high cardinality features, highly correlated features, duplicate rows and columns, skewed distributions and imbalanced classes.</li>
<li><b>DataSchemaChecker</b>: This class can check your dataset data types against a specific schema and report any mismatches or errors.</li>
`pandas_dq` is designed to provide you the cleanest features with the fewest steps.
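
To make the schema-checking idea concrete, here is a minimal sketch in plain `pandas`. It does not use `DataSchemaChecker` itself: the helper name `find_schema_mismatches` and the schema format (a dict mapping column names to dtype strings) are hypothetical and chosen only for illustration.

```python
import pandas as pd


def find_schema_mismatches(df, schema):
    """Return {column: (expected_dtype, actual_dtype)} for every violation.

    A column listed in `schema` but absent from `df` is reported with
    an actual dtype of None. (Hypothetical helper, not part of pandas_dq.)
    """
    mismatches = {}
    for column, expected in schema.items():
        if column not in df.columns:
            mismatches[column] = (expected, None)
            continue
        actual = str(df[column].dtype)
        if actual != expected:
            mismatches[column] = (expected, actual)
    return mismatches


df = pd.DataFrame({"age": ["31", "45"], "score": [0.5, 0.9]})
schema = {"age": "int64", "score": "float64", "city": "int64"}

mismatches = find_schema_mismatches(df, schema)
for column, (expected, actual) in mismatches.items():
    print(f"{column}: expected {expected}, found {actual}")
```

Here `age` is flagged because it holds strings rather than integers, and `city` is flagged because it is missing from the dataframe.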
## Uses
`pandas_dq` has multiple important modules: `dq_report`, `dc_report`, `Fix_DQ` and now `DataSchemaChecker`. <br>
### 1. dq_report function
![dq_report_code](./images/find_dq_screenshot.png)
<p>`dq_report` displays a data quality report (inline or HTML) after it analyzes your dataset looking for these issues:
<ol>
<li>It detects ID columns</li>
<li>It detects zero-variance columns </li>
<li>It identifies rare categories (i.e. categories that make up less than 5% of the values in a column)</li>
<li>It finds infinite values in a column</li>
<li>It detects mixed data types (i.e. a column that has more than a single data type)</li>
<li>It detects outliers (i.e. values in a float column that lie beyond the Inter Quartile Range)</li>
<li>It detects high cardinality features (i.e. a feature that has more than 100 categories)</li>
<li>It detects highly correlated features (i.e. two features that have an absolute correlation higher than 0.8)</li>
<li>It detects duplicate rows (i.e. the same row occurs more than once in the dataset)</li>
<li>It detects duplicate columns (i.e. the same column occurs twice or more in the dataset)</li>
<li>It detects skewed distributions (i.e. a feature that has a skew more than 1.0) </li>
<li>It detects imbalanced classes (i.e. one class of the target variable occurs significantly more often than the others) </li>
<li>It detects feature leakage (i.e. a feature that is highly correlated to target with correlation > 0.8)</li>
</ol>
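
A few of the checks listed above can be approximated with plain `pandas`, which may help in understanding what the report flags. This is a rough sketch under stated assumptions, not the `dq_report` implementation: the function name `basic_quality_checks` is hypothetical, and only the 0.8 correlation and 1.0 skew thresholds are taken from the list.

```python
import numpy as np
import pandas as pd


def basic_quality_checks(df, corr_threshold=0.8, skew_threshold=1.0):
    """Run a handful of simple data quality checks (illustrative only)."""
    numeric = df.select_dtypes(include="number")
    # Treat infinities as missing so they do not distort skew and correlation.
    finite = numeric.replace([np.inf, -np.inf], np.nan)

    corr = finite.corr().abs()
    cols = list(corr.columns)
    correlated_pairs = [
        (a, b)
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
        if corr.loc[a, b] > corr_threshold  # NaN compares as False
    ]

    return {
        "zero_variance": [c for c in df.columns if df[c].nunique(dropna=False) <= 1],
        "infinite": [c for c in numeric.columns if np.isinf(numeric[c]).any()],
        "duplicate_rows": int(df.duplicated().sum()),
        # Absolute skew, so left-skewed columns are flagged too (an assumption).
        "skewed": [c for c in finite.columns if abs(finite[c].skew()) > skew_threshold],
        "correlated_pairs": correlated_pairs,
    }


df = pd.DataFrame({
    "constant": [7, 7, 7, 7, 7, 7],
    "x": [1.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "x_double": [2.0, 2.0, 4.0, 6.0, 8.0, 10.0],
    "tail": [1.0, 1.0, 1.0, 50.0, 2.0, 1.0],
})

report = basic_quality_checks(df)
for check, result in report.items():
    print(f"{check}: {result}")
```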
Note that report generation may take time for large datasets, so when you pass in a CSV file, only a 100K sample is read from it. If you want the whole dataset analyzed, pass it in as a dataframe.
### 2. dc_report function
![dc_report_code](./images/dc_report.png)
`dc_report` is a data comparison tool that accepts two pandas dataframes as input and returns a report highlighting any differences between them. Specifically:
<ol>
<li>It calls `dqr = dq_report(df)` on each dataframe to generate a data quality report and compares the results using the column names from the report.</li>
<li>It also computes the Kolmogorov-Smirnov test statistic to measure the distribution difference for numeric columns with low cardinality.</li>
<li>It also compares the Missing Values% and Unique Values% between the two dataframes and adds a comment in the "Distribution Difference" column if the two percentages are different.</li>
<li>You can exclude target column(s) from comparison between train and test.</li>
</ol>
Note that report generation may take time for large datasets, so make sure you take a sample of your train and test data before calling this report!
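
The comparison described above can be sketched with `pandas` and `numpy` alone. This is an illustration, not the `dc_report` implementation: the helper names `ks_statistic` and `compare_frames` and the `exclude` argument are hypothetical, and the KS statistic is computed here for every numeric column.

```python
import numpy as np
import pandas as pd


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between two empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


def compare_frames(train, test, exclude=()):
    """Compare missing %, unique % and KS statistic for shared columns."""
    rows = []
    for col in train.columns:
        if col in exclude or col not in test.columns:
            continue
        row = {
            "column": col,
            "missing_pct_train": train[col].isna().mean() * 100,
            "missing_pct_test": test[col].isna().mean() * 100,
            "unique_pct_train": train[col].nunique() / len(train) * 100,
            "unique_pct_test": test[col].nunique() / len(test) * 100,
            "ks_statistic": np.nan,
        }
        if pd.api.types.is_numeric_dtype(train[col]) and pd.api.types.is_numeric_dtype(test[col]):
            row["ks_statistic"] = ks_statistic(train[col].dropna(), test[col].dropna())
        rows.append(row)
    return pd.DataFrame(rows).set_index("column")


train = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "income": [1.0, 2.0, np.nan, 4.0],
    "target": [0, 1, 0, 1],
})
test = pd.DataFrame({"age": [60, 70, 80, 90], "income": [1.0, 2.0, 3.0, 4.0]})

comparison = compare_frames(train, test, exclude=["target"])
print(comparison)
```

A KS statistic near 1 (as for `age` here, where the train and test ranges do not overlap) indicates a strong distribution shift, while a value near 0 indicates similar distributions.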
### 3. Fix_DQ class: a scikit-learn transformer which can detect data quality issues and fix them all in one line of code
![fix_dq](./images/fix_dq_screenshot.png)
<p>`Fix_DQ` is a great way to clean an entire train dataset and apply the same steps to a test dataset in an MLOps pipeline. `Fix_DQ` detects most issues in your data in one step (similar to `dq_report`, but without the `target`-related issues). It identifies those issues in the `fit` method and fixes them in the `transform` method. The fitted transformer can then be saved (or "pickled") so that the same steps can be applied to test data, either at the same time or later.<br>
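
The fit, transform and pickle workflow can be illustrated with a toy transformer. This sketch is not `Fix_DQ`: the class `DropConstantColumns` is hypothetical and handles only zero-variance columns. A real scikit-learn transformer would typically also subclass `BaseEstimator` and `TransformerMixin`.

```python
import pickle

import pandas as pd


class DropConstantColumns:
    """Toy transformer that follows the scikit-learn fit/transform convention."""

    def fit(self, X, y=None):
        # Decide which columns to drop by looking at the training data only.
        self.columns_to_drop_ = [
            c for c in X.columns if X[c].nunique(dropna=False) <= 1
        ]
        return self

    def transform(self, X):
        # Apply the decisions learned in fit, whatever the new data looks like.
        return X.drop(columns=self.columns_to_drop_, errors="ignore")


train = pd.DataFrame({"constant": [1, 1, 1], "x": [0.1, 0.2, 0.3]})
test = pd.DataFrame({"constant": [1, 2], "x": [0.4, 0.5]})

cleaner = DropConstantColumns().fit(train)
train_clean = cleaner.transform(train)

# Round-trip through pickle, as one might when saving the fitted transformer.
restored = pickle.loads(pickle.dumps(cleaner))
test_clean = restored.transform(test)

print(list(train_clean.columns), list(test_clean.columns))
```

The `constant` column is dropped from the test data even though it varies there, because the decision was learned from the train data during `fit`. This is what keeps train and test processing consistent.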