Chapter 53 Automate EDA with DataExplorer
Jing You
53.1 Overview
- This section shows how to use the DataExplorer package to automate exploratory data analysis (EDA) and create a data report.
53.2 DataExplorer
- DataExplorer aims to automate the exploratory data analysis (EDA) process and offers one-click report generation to summarize and visualize the basics of a data set. It is a user-friendly and efficient tool for first-step visual analysis that avoids time-consuming manual coding.
53.3 Installation
- The following demonstrations use the dataset “sleep1”, which has both discrete and continuous variables.
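As a minimal sketch of the installation step, the package can be installed from CRAN and loaded as shown here. The `requireNamespace` guard is only a convenience added for illustration, so that the install is skipped when the package is already present.

```r
# Install DataExplorer from CRAN only if it is not already available
if (!requireNamespace("DataExplorer", quietly = TRUE)) {
  install.packages("DataExplorer")
}

# Load the package so that its functions can be called directly
library(DataExplorer)

# Report the installed version, which is useful when reproducing results
print(packageVersion("DataExplorer"))
```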
53.4 Exploratory data analysis (EDA)
- DataExplorer is well suited to exploratory data analysis because it can produce most common plots without the need to manipulate data formats or data types. Each plot takes a single function call instead of a combination of different packages and functions.
53.4.1 Overall Information
There are many plots available in DataExplorer for preliminary data analysis to help us better understand the data set. For example:
- The plot_str function visualizes the basic structure of the data set, with variable names and types. We can see that the sleep dataset has 62 observations and 10 variables (7 discrete and 3 continuous).
- The plot_intro function describes basic information about the data set, including the number of rows and columns, the data types, and the missing values. The plot shows the percentage of each data type and of complete rows. It also indicates whether there are columns in which all values are missing.
- The plot_missing function gives a closer look at the profile of the missing values, specifically which variables have missing values and in what proportion.
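The calls described above might look like the following sketch. Because the preparation of sleep1 is not shown here, the example assumes a stand-in: the built-in `airquality` data with `Month` converted to a factor so that there is one discrete column. The `introduce` and `profile_missing` functions are included as the tabular counterparts of `plot_intro` and `plot_missing`.

```r
library(DataExplorer)

# Stand-in data (assumption): airquality ships with R and contains missing values.
# Month is converted to a factor so the data has a discrete column.
demo <- airquality
demo$Month <- factor(demo$Month)

# Network diagram of the structure: column names and types
plot_str(demo)

# One-row summary table: rows, columns, discrete/continuous counts, missing values
info <- introduce(demo)
print(info)

# The same information shown as percentages in a bar chart
plot_intro(demo)

# Share of missing values per variable, as a plot and as a table
plot_missing(demo)
missing_profile <- profile_missing(demo)
print(missing_profile)
```

For this stand-in, only `Ozone` and `Solar.R` contain missing values, so they are the only variables with a non-zero bar in the `plot_missing` output.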
53.4.2 Distribution
DataExplorer provides bar charts, histograms, density plots, scatterplots, boxplots, and more for exploring the distribution of the data. When one of these functions is called on the whole dataset, it plots only the variables of the matching type (discrete or continuous), so we do not need to specify the columns to plot. In addition, the arguments of the plot functions are mostly consistent with those of the ggplot functions, which makes them easy to use.
For example, the plot_bar function plots the distribution only for the discrete variables.
## `Life` distribution of all discrete variables
plot_bar(sleep1, with = "Life", title = "Bar distribution in response to Life span")
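To make the effect of the `with` argument concrete, here is a small sketch on a stand-in dataset (an assumption: the built-in `airquality` data with `Month` as a factor, not sleep1). Without `with`, bar heights are category frequencies; with it, they are the sum of the named continuous variable within each category.

```r
library(DataExplorer)

# Stand-in data (assumption): airquality with Month as a discrete column
demo <- airquality
demo$Month <- factor(demo$Month)

# Default: each bar shows the number of rows in the category
plot_bar(demo)

# with = "Temp": each bar shows the sum of Temp within the category
plot_bar(demo, with = "Temp", title = "Sum of Temp by Month")
```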
- The plot_histogram function plots the distribution only for the continuous variables.
- The plot_qq function draws Q-Q plots only for the continuous variables.
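The continuous-variable plots can be sketched in the same one-line style. The example again assumes a stand-in dataset (built-in `airquality`, with incomplete rows dropped to keep the plots free of missing-value warnings), and the choice of `Month` and `Temp` as grouping variables is purely illustrative.

```r
library(DataExplorer)

# Stand-in data (assumption): complete rows of airquality, Month as a factor
demo <- na.omit(airquality)
demo$Month <- factor(demo$Month)

# Histograms and density estimates for every continuous column
plot_histogram(demo)
plot_density(demo)

# Q-Q plots for every continuous column, drawn from a sample of 100 rows
plot_qq(demo, sampled_rows = 100L)

# Boxplots of every continuous column, split by the discrete column Month
plot_boxplot(demo, by = "Month")

# Scatterplots of every other column against Temp
plot_scatterplot(demo, by = "Temp")
```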
53.4.3 Correlation Analysis
- DataExplorer provides a correlation heatmap for all non-missing features. The heatmap can also be restricted to continuous or discrete variables only.
plot_correlation(na.omit(sleep1), type = "c", title = "Correlation heatmap of the sleep data, continuous only")
* For discrete variables, the heatmap automatically applies one-hot encoding, which provides further insight.
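The three heatmap variants can be compared with a sketch like the one below. It assumes a stand-in dataset (built-in `airquality` with `Month` as a factor). The last call shows an alternative to `na.omit`: passing `use = "pairwise.complete.obs"` through `cor_args` to the underlying `cor` call.

```r
library(DataExplorer)

# Stand-in data (assumption): airquality with Month as a discrete column
demo <- airquality
demo$Month <- factor(demo$Month)
complete_demo <- na.omit(demo)

# All features: discrete columns are one-hot encoded before correlating
plot_correlation(complete_demo)

# Continuous features only
plot_correlation(complete_demo, type = "c")

# Discrete features only: each level of Month becomes its own 0/1 column
plot_correlation(complete_demo, type = "d")

# Keep rows with missing values and let cor() use pairwise complete observations
plot_correlation(demo, type = "c", cor_args = list(use = "pairwise.complete.obs"))
```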
53.5 Feature Engineering
Feature engineering is often needed in the data analysis process to transform data into more representative features. DataExplorer provides multiple functions for feature engineering, including missing value filling, sparse category grouping, one-hot encoding, and feature transformation.
53.5.1 Missing value
The set_missing function can fill missing values in both discrete and continuous variables with designated values in one line of code.
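A minimal sketch of set_missing follows, using a small hypothetical data frame invented for illustration (it is not sleep1). When a list of two values is supplied, the first is used for continuous columns and the second for discrete columns. For a plain data frame the result has to be assigned, since the input is not modified in place.

```r
library(DataExplorer)

# Hypothetical toy data with one continuous and one discrete column
demo <- data.frame(
  weight = c(1.2, NA, 3.4, 5.6),
  diet   = c("herbivore", NA, "carnivore", "herbivore"),
  stringsAsFactors = FALSE
)

# First list element fills continuous columns, second fills discrete columns
filled <- set_missing(demo, list(0, "unknown"))
print(filled)
```

Filling with a constant such as 0 is only one possible choice; whether it is appropriate depends on the variable and the later analysis.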
- One-hot encoding makes categorical data more expressive. It can be done with the dummify function.
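As a sketch of what dummify produces, the example below encodes a small hypothetical data frame (invented for illustration). Each level of the discrete column becomes its own 0/1 indicator column, while continuous columns are carried over unchanged.

```r
library(DataExplorer)

# Hypothetical toy data with one continuous and one discrete column
demo <- data.frame(
  weight = c(1.2, 2.3, 3.4),
  diet   = c("herbivore", "carnivore", "herbivore"),
  stringsAsFactors = FALSE
)

# Replace the discrete column with one indicator column per level
encoded <- dummify(demo)
print(encoded)
```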
53.6 Data Report
- All the summary statistics and plots for the data set can be organized into a data report in one step with the create_report function. The report automatically generates most of the plots above. It is a rough data profile, but it is very useful for initial analysis and friendly to beginners.
- This function is very flexible: it lets the user configure the report as needed. Each section can be switched on or off and its arguments adjusted, and a response variable can also be added. For example, we can add a boxplot and a scatterplot to the report, set the number of sampled rows in the Q-Q plot, and set the response variable to Life.
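Both the default and the configured report can be sketched as follows. The example assumes a stand-in dataset (built-in `airquality`) with `Temp` as the response in place of `Life`; the file names and the use of `tempdir()` are arbitrary choices for illustration. Rendering relies on rmarkdown and a working pandoc installation.

```r
library(DataExplorer)

# Stand-in data (assumption): airquality with Month as a discrete column
demo <- airquality
demo$Month <- factor(demo$Month)

# Default report: one call renders an HTML data profile
create_report(
  demo,
  output_file = "demo_report.html",
  output_dir  = tempdir()
)

# Customized report: enable boxplots and scatterplots, and sample 100 rows for Q-Q plots
custom_config <- configure_report(
  add_plot_boxplot     = TRUE,
  add_plot_scatterplot = TRUE,
  plot_qq_args         = list(sampled_rows = 100L)
)

# y names the response variable; boxplots and scatterplots are drawn against it
create_report(
  demo,
  y           = "Temp",
  config      = custom_config,
  output_file = "demo_report_by_temp.html",
  output_dir  = tempdir()
)
```

The boxplot and scatterplot sections only appear when a response variable is supplied through `y`, since both plots need a variable to plot against.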
53.7 External Resources
https://boxuancui.github.io/DataExplorer/index.html : DataExplorer GitHub page
https://rpubs.com/mark_sch7/DataExplorerPackage : package reflection