Pandas Profiling
Documentation | Slack | Stack Overflow | Latest changelog
Generates profile reports from a pandas DataFrame.
The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.
For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:
- Type inference: detect the types of columns in a dataframe.
- Essentials: type, unique values, missing values
- Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Most frequent values
- Histogram
- Correlations: highlighting of highly correlated variables; Spearman, Pearson and Kendall matrices
- Missing values: matrix, count, heatmap and dendrogram of missing values
- Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.
- File and Image analysis: extract file sizes, creation dates and dimensions, and scan for truncated images or those containing EXIF information.
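To make the quantile and descriptive statistics listed above concrete, here is a minimal sketch that computes a few of them for a single numeric column using only Python's standard library. It is an illustration, not how pandas-profiling computes its report: the helper name summarize, the sample data, the choice of linear ("inclusive") interpolation for quartiles, and the use of the sample standard deviation are all assumptions made for this example.

```python
import statistics


def summarize(values):
    """Compute a few per-column statistics for a numeric sequence.

    Hypothetical helper for illustration only; not part of pandas-profiling.
    """
    data = sorted(values)
    # Quartiles with linear interpolation between data points (assumed choice).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    mean = statistics.mean(data)
    std = statistics.stdev(data)  # sample standard deviation (n - 1)
    # Median absolute deviation: median distance from the median.
    mad = statistics.median([abs(x - median) for x in data])
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "range": data[-1] - data[0],
        "iqr": q3 - q1,
        "mean": mean,
        "mode": statistics.mode(data),
        "std": std,
        "mad": mad,
        # Coefficient of variation is undefined when the mean is zero.
        "cv": std / mean if mean else float("nan"),
    }


stats = summarize([2, 4, 4, 5, 7, 9, 11])
for name, value in stats.items():
    print(f"{name}: {value}")
```

The profile report presents statistics of this kind for every column, when relevant for the column type, alongside the histograms, correlations and missing-value views.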
Announcements
Spark backend in progress: We are happy to announce that we're nearing v1 of the Spark backend for generating profile reports. Beta testers wanted! The Spark backend will be published as a pre-release of this package.
Monitoring time series?: I'd like to draw your attention to popmon. Whereas pandas-profiling allows you to explore patterns in a single dataset, popmon allows you to uncover temporal patterns. It's worth checking out!
Support pandas-profiling
The development of pandas-profiling relies completely on contributions.
If you find value in the package, we welcome you to support the project directly through GitHub Sponsors!
Please help me to continue to support this package.
Find more information: Sponsor the project on GitHub
Contents: Examples | Installation | Documentation | Large datasets | Command line usage | Advanced usage | Support | Go beyond | Support the project | Types | How to contribute | Editor Integration | Dependencies
Examples
The following examples can give you an impression of what the package can do:
- Census Income (US Adult Census data relating to income)
- NASA Meteorites (comprehensive set of meteorite landings)
- Titanic (the "Wonderwall" of datasets)
- NZA (open data from the Dutch Healthcare Authority)
- Stata Auto (1978 Automobile data)
- Vektis (Vektis Dutch Healthcare data)
- Colors (a simple colors dataset)
- UCI Bank Dataset (banking marketing dataset)
- RDW (vehicle registrations from RDW, the Dutch DMV; 10 million rows, 71 features)
Specific features:
- Russian Vocabulary (demonstrates text analysis)
- Cats and Dogs (demonstrates image analysis from the file system)
- Celebrity Faces (demonstrates image analysis with EXIF information)
- Website Inaccessibility (demonstrates URL analysis)
- Orange prices and Coal prices (showcases report themes)
Tutorials:
- Tutorial: report structure using Kaggle data (advanced) (modify the report's structure) [
](https://colab.research.