pandas-profiling generates profile reports from a pandas DataFrame. The pandas df.describe() function is handy but a little basic for exploratory data analysis. pandas-profiling extends the pandas DataFrame with df.profile_report(), which automatically generates a standardized univariate and multivariate report for understanding the data.
For each column, the following information (whenever relevant for the column type) is presented in an interactive HTML report:
Type inference: detect the types of columns in a DataFrame
Essentials: type, unique values, indication of missing values
Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
Most frequent and extreme values
Histograms: categorical and numerical
Correlations: high correlation warnings, based on different correlation metrics (Spearman, Pearson, Kendall, Cramér’s V, Phik)
Missing values: through counts, matrix, heatmap and dendrograms
Duplicate rows: list of the most common duplicated rows
Text analysis: most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic) and blocks (ASCII, Cyrillic)
File and Image analysis: file sizes, creation dates, dimensions, indication of truncated images and existence of EXIF metadata
The report contains three additional sections:
Overview: mostly global details about the dataset (number of records, number of variables, overall missingness and duplicates, memory footprint)
Alerts: a comprehensive and automatic list of potential data quality issues (high correlation, skewness, uniformity, zeros, missing values, constant values, among others)
Reproduction: technical details about the analysis (time, version and configuration)
⚡ Looking for a Spark backend to profile large datasets? It's work in progress.
⌛ Interested in uncovering temporal patterns? Check out popmon.
▶️ Quickstart
Start by loading your pandas DataFrame as you normally would, e.g. by using:
There are two interfaces to consume the report inside a Jupyter notebook: through widgets and through an embedded HTML report.
To display the report as a set of widgets, run the following in a Jupyter Notebook:
profile.to_widgets()
The HTML report can be directly embedded in a cell in a similar fashion:
profile.to_notebook_iframe()
Exporting the report to a file
To generate an HTML report file, assign the ProfileReport to a variable and call its to_file() method:
profile.to_file("your_report.html")
Alternatively, the report's data can be obtained as a JSON file:
# As a JSON string
json_data = profile.to_json()
# As a file
profile.to_file("your_report.json")
Using in the command line
For standard formatted CSV files (which can be read directly by pandas without additional settings), the pandas_profiling executable can be used in the command line. The example below generates a report named Example Profiling Report, using a configuration file called default.yaml, in the file report.html by processing a data.csv dataset.
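The command described above could look like the following sketch, driven from Python for consistency with the other examples. It assumes the pandas_profiling executable takes --title and --config_file options followed by the input and output paths; run pandas_profiling -h to check the options of the installed version. The command is only executed when the executable and both input files are present:

```python
import shlex
import shutil
import subprocess
from pathlib import Path

# Equivalent shell command (assumed form):
#   pandas_profiling --title "Example Profiling Report" --config_file default.yaml data.csv report.html
command = [
    "pandas_profiling",
    "--title", "Example Profiling Report",
    "--config_file", "default.yaml",
    "data.csv",      # input dataset
    "report.html",   # output report
]

# Show the command as it would be typed in a shell.
print(shlex.join(command))

# Run it only if the executable and the input files are actually available.
inputs_exist = Path("data.csv").exists() and Path("default.yaml").exists()
if shutil.which(command[0]) and inputs_exist:
    subprocess.run(command, check=True)
```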