autoEDA-resources
A list of software and papers related to automated Exploratory Data Analysis, including
fast data exploration and visualization,
visualization recommendation and other tools that speed up data exploration (visual exploration in particular).
Pull requests with software, papers and conference presentations are welcome.
Software
R packages
My summary of R packages is in R Journal.
Complete Packages
dataMaid (CRAN package) - automated checks of data validity.
DataExplorer (CRAN package) - automated data exploration (including univariate and bivariate plots, PCA) and treatment.
funModeling (CRAN package) - automated EDA, simple feature engineering and outlier detection.
auto-EDA (GitHub package) - uni- and bivariate plots for data exploration in regression and classification problems. The package cleans data automatically to improve the plots. Another version of Xander Horn's package.
visdat (CRAN package) - 6 exploratory/diagnostic plots for initial data analysis.
dlookr (CRAN package) - tools for data quality diagnosis, basic exploration and feature transformations.
arsenal (CRAN package) - statistical summaries (models and exploration) and quick reporting.
exploreR (CRAN package) - exploration based on univariate linear regression.
summarytools (CRAN package) - tables that summarise datasets and simple uni- and bivariate analyses.
explore (CRAN package) - interactive Shiny app for comprehensive dataset exploration (including uni- and bivariate relationships, correlation analysis and simple modeling with decision trees) and a stand-alone function for quick exploration. Examples are given in a vignette.
Packages in Development
AEDA (GitHub package) - summary statistics, correlation analysis, cluster analysis, PCA & other projections.
dataexpks (GitHub package) - quick reports with basic data summaries.
automatic-data-explorer (GitHub package) - basic EDA and creating Markdown reports from multiple R scripts.
xda (GitHub package) - basic data summaries.
EDA - stub of a package.
modeler (GitHub package) - tools for exploration and pre-processing.
IEDA (GitHub package) - EDA simplified through interactive visualization.
seda (GitHub package) - fast EDA tool in active development.
dfvis (GitHub package) - ggplot2 based implementation of tabplot.
ExPanDaR - package for interactive data visualization. Designed for longitudinal data, but can also be used with other types of data after setting an artificial time variable. Shiny apps with examples are provided on the package's GitHub page.
brolgar (GitHub package) - tools to assist in longitudinal data analysis.
featuretoolsR (CRAN package) - R port of a Python library for automated feature engineering.
report - automated modeling report generation.
FactoInvestigate (CRAN package) - has an automatic reporting module which selects the best plots to summarise different projection techniques.
gtsummary (GitHub package) - presentation-ready tables summarizing data sets, regression models, and more.
clean (CRAN package) - fast data cleaning.
finalfit (CRAN package) - tables and plots to quickly visualize regression results.
modelsummary (GitHub package) - summary tables for regression models.
Python libraries
Complete Packages
DataPrep (pip library) - data preparation library with an EDA package.
Dora (pip library) - data cleaning, feature engineering and simple modeling tools.
statsmodels (pip library) - collection of statistical tools, including EDA.
TPOT (pip library) - autoML tool with feature engineering module.
HoloViews (pip library) - automated visualization based on short data annotations.
pandas-profiling - popular library for quick data summaries and correlation analysis.
speedML (pip library) - large library for ML with a module dedicated to fast EDA.
edaviz - Python library for fast data exploration that provides functions for dataset overviews, bivariate plots and finding good predictors (the free version only works for small datasets).
AutoViz - Python library for automated visualization.
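To make concrete what a "quick data summary" involves, here is a minimal sketch of the kind of per-column overview these libraries automate. It uses only the Python standard library and is not the API of any package listed above; the function name `summarize_columns` and the sample rows are assumptions made for illustration.

```python
import statistics
from typing import Any


def summarize_columns(rows: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Return a per-column summary: missing count, distinct count and basic stats.

    Hypothetical helper for illustration; None is treated as a missing value.
    """
    columns = sorted({name for row in rows for name in row})
    summary: dict[str, dict[str, Any]] = {}
    for name in columns:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        info: dict[str, Any] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
        # A column counts as numeric only if every present value is int or float.
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        if present and len(numeric) == len(present):
            info["kind"] = "numeric"
            info["mean"] = statistics.fmean(numeric)
            info["median"] = statistics.median(numeric)
            info["min"], info["max"] = min(numeric), max(numeric)
        else:
            info["kind"] = "categorical"
            info["top"] = statistics.mode(present) if present else None
        summary[name] = info
    return summary


# Made-up sample data, for illustration only.
sample_rows = [
    {"age": 31, "city": "Oslo"},
    {"age": 45, "city": "Lima"},
    {"age": None, "city": "Oslo"},
    {"age": 38, "city": None},
]

for column, info in summarize_columns(sample_rows).items():
    print(column, info)
```

The libraries above typically go much further (plots, correlation matrices, HTML reports), but most start from a table like this one.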
Packages in Development
basic-auto-EDA (GitHub library) - automatic report generation.
automated_EDA - stub of a library.
pandas-summary - simple extension to pandas.describe.
featuretools - library for automated feature engineering.
pyvtreat - Python version of R's vtreat package.
autoimpute - easier handling of missing values.
eda - a package that produces a PDF report with all permutations of univariate and bivariate visualizations and tables. Notably, three-dimensional displays are also possible.
DIVE - MIT's tool for data exploration that tries to choose the best (most informative) visualizations.
Automatic Statistician - tool for automated EDA and modeling.
auto-eda - automatic EDA with SQL.
elycite - tools for exploration and modelling available (locally) as a web application. Designed for NLP problems.
Papers and short articles
Methods and tools for autoEDA
Interactive Data Exploration with “Big Data Tukey Plots” - automated visualization of big data.
A Rank-by-Feature Framework for Interactive Exploration of Multidimensional Data - a paper that describes many measures that can be used to sort 1D and 2D data displays.
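As a rough illustration of sorting 2D displays by a measure, the sketch below ranks all pairs of numeric variables by the absolute value of their Pearson correlation, so the most strongly related pairs would be plotted first. This is only one possible ranking measure and is not the implementation from either paper; the function names and the sample columns are assumptions made for illustration.

```python
import math
from itertools import combinations


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant column is scored 0.0 so it sorts last.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_pairs(columns: dict[str, list[float]]) -> list[tuple[str, str, float]]:
    """Score every pair of columns and sort by absolute correlation, strongest first."""
    scored = [
        (a, b, pearson(columns[a], columns[b]))
        for a, b in combinations(columns, 2)
    ]
    return sorted(scored, key=lambda item: abs(item[2]), reverse=True)


# Made-up sample data, for illustration only.
sample_columns = {
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "z": [1, -1, -1, 1],
}

for first, second, score in rank_pairs(sample_columns):
    print(f"{first} vs {second}: r = {score:+.2f}")
```

Swapping `pearson` for another scoring function (for example a measure of outliers or of non-linearity) changes the ordering without changing the rest of the code.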
Visualization recommendation frameworks
Foresight: Recommending Visual Insights - Foresight is a system that helps the user rapidly discover visual insights from large high-dimensional datasets.
DIVE: A Mixed-Initiative System Supporting Integrated Data Exploration Workflows - the web app is available on the MIT website.