16 Draw graphs using Python and R

Shuyue Xu

16.1 Things to do before fitting models

16.1.0.0.1 Check the unique values of each variable
16.1.0.0.2 Check for null values; consider dropping them or adding a missing-value indicator
16.1.0.0.3 Normalize numerical features
16.1.0.0.4 Handle categorical features using label encoding, one-hot encoding, or other…
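
The four checks above can be sketched in Python with pandas. This is only a minimal illustration: the toy data frame and its column names are made up, and filling the missing value with the column mean is an assumption (the list above only mentions dropping values or adding an indicator).

```python
import pandas as pd

# Toy data; the column names are made up for illustration.
df = pd.DataFrame({
    "age": [25.0, 32.0, None, 41.0],
    "city": ["NY", "LA", "NY", "SF"],
})

# 1. Number of unique values per column (NaN is not counted by default)
n_unique = df.nunique()

# 2. Null counts, then a missing indicator
n_null = df.isna().sum()
df["age_missing"] = df["age"].isna().astype(int)
# Assumption: fill the gap with the column mean so later steps have no NaN
df["age"] = df["age"].fillna(df["age"].mean())

# 3. Normalize the numerical feature (z-score standardization)
df["age_z"] = (df["age"] - df["age"].mean()) / df["age"].std()

# 4. One-hot encode the categorical feature
encoded = pd.get_dummies(df, columns=["city"])

print(n_unique, n_null, encoded.columns.tolist(), sep="\n")
```

Dropping rows instead would be `df.dropna(subset=["age"])`; which option is better depends on how much data is missing.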


16.2 Continuous Variables

We always look for asymmetry, outliers, multimodality, gaps, heaping, and errors.

16.2.1 Graphs commonly used for continuous variables:

16.2.1.1 Histograms: base R code: hist(x, …)

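16.2.1.1.1 Different bin boundaries can produce different-looking graphs.
16.2.1.1.2 Shape (rules of thumb that usually, but not always, hold):
           skewed to the left typically means mean < median < mode
           skewed to the right typically means mean > median > mode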
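
As a quick check of the right-skew rule of thumb, here is a small sketch using only the Python standard library. The sample values are made up for illustration.

```python
import statistics

# Made-up right-skewed sample: most values are small, a few are large.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 20]

mean = statistics.mean(right_skewed)      # pulled toward the long right tail
median = statistics.median(right_skewed)  # middle of the sorted values
mode = statistics.mode(right_skewed)      # most frequent value

print(mean, median, mode)
print(mean > median > mode)
```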

16.2.1.2 Boxplot: base R code: boxplot(x) or boxplot(y ~ group, data = df)

16.2.1.2.1 With multiple boxplots, usually reorder the groups by median, from high to low
16.2.1.2.2 Outliers are shown on the graph as individual points. A point is an outlier if it lies more than
           1.5 x the hinge spread (fourth spread) above the upper hinge, or
           1.5 x the hinge spread (fourth spread) below the lower hinge
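
The outlier rule can be written out by hand. The sketch below assumes Tukey's definition of the hinges (the medians of the lower and upper halves of the sorted data); the `hinges` helper and the sample are made up for illustration, and plotting libraries may use slightly different quartile definitions.

```python
import statistics

def hinges(values):
    """Return Tukey's lower and upper hinges (medians of the two halves)."""
    s = sorted(values)
    half = (len(s) + 1) // 2  # with an odd count, the median sits in both halves
    return statistics.median(s[:half]), statistics.median(s[-half:])

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # made-up sample with one extreme value

lower, upper = hinges(data)
spread = upper - lower                  # hinge spread (fourth spread)
low_fence = lower - 1.5 * spread
high_fence = upper + 1.5 * spread

# Points beyond the fences are the ones a boxplot draws individually
outliers = [v for v in data if v < low_fence or v > high_fence]
print(lower, upper, outliers)
```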

16.2.2 Ways to check whether the data are normally distributed


16.2.2.1 QQ plot: check for a normal distribution

         qqnorm(): produces a normal QQ plot of the variable
         qqline(): adds a reference line
16.2.2.1.1 If the data are normally distributed, the points in the normal QQ plot lie approximately on a straight diagonal line.


16.2.2.2 Density curve:

16.2.2.2.1 Comparing the density curve with a normal curve is another way to check for a normal distribution


16.2.2.3 Shapiro-Wilk test: code: shapiro.test(x)

16.2.2.3.1 If the p-value is smaller than the alpha level, we reject the null hypothesis that the data are normally distributed and conclude that the data are not normally distributed.
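
For readers working in Python, the same checks are available in SciPy. The sketch below is one possible equivalent, not the course's method: it runs `scipy.stats.shapiro` on a simulated, clearly non-normal sample and uses `scipy.stats.probplot` to get the QQ-plot coordinates. The sample, the seed, and alpha = 0.05 are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=500)  # simulated, clearly non-normal

# Shapiro-Wilk test: the null hypothesis is that the data are normal
statistic, p_value = stats.shapiro(skewed)

alpha = 0.05  # assumed significance level
reject_normality = bool(p_value < alpha)

# QQ-plot coordinates: theoretical normal quantiles vs. ordered sample values,
# plus the slope, intercept and correlation r of the fitted reference line
(theoretical_q, ordered_q), (slope, intercept, r) = stats.probplot(skewed, dist="norm")

print(reject_normality, round(r, 3))
```

Passing `plot=plt` to `probplot` (with `matplotlib.pyplot` imported as `plt`) draws the QQ plot directly.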


16.3 Multivariate categorical data

16.3.0.1 nominal vs. ordinal, ordinal vs. discrete,…


Frequency

Bar plot: base R code: barplot(x)

Sort in the logical order of the categories (level1, level2, level3, …) for ordinal data
Sort from highest to lowest count for nominal data

16.3.0.2 Cleveland dot plot

16.3.0.3 R codes for factor data

16.3.0.3.1 don't assign with "levels(x) <-" to reorder levels (it relabels the values instead); only use "levels(x)" to inspect them
16.3.0.3.2 use "fct_reorder()" to reorder factor levels by the values of another variable
16.3.0.3.3 use "fct_inorder()" to set the level order to the order of first appearance in the data
16.3.0.3.4 use "fct_relevel()" to move chosen levels to the beginning (or another position)
16.3.0.3.5 use "fct_infreq()" to order the levels by decreasing frequency
16.3.0.3.6 use "fct_rev()" to reverse the order of factor levels
16.3.0.3.7 use "fct_explicit_na()" to turn NAs into a real factor level (newer forcats versions replace it with "fct_na_value_to_level()")
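
In Python, pandas categoricals cover similar ground. The sketch below shows rough analogues of a few of the functions above. The mapping is an approximation rather than an exact equivalence, and the data and the "(Missing)" label are made up.

```python
import pandas as pd

s = pd.Series(["low", "high", "mid", "high", None, "high", "mid"])

# Explicit level order (similar in spirit to fct_relevel)
cat = pd.Categorical(s, categories=["low", "mid", "high"], ordered=True)
levels = list(cat.categories)

# Levels ordered by decreasing frequency (similar to fct_infreq)
by_freq = s.value_counts().index.tolist()

# Reversed level order (similar to fct_rev)
reversed_levels = list(cat.categories[::-1])

# Missing values turned into a real level (similar to fct_explicit_na)
with_na_level = s.fillna("(Missing)")

print(levels, by_freq, reversed_levels, with_na_level.tolist(), sep="\n")
```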


Proportion / Association

Mosaic plots: code: mosaic(y ~ x)

Check the association between different variables
Chi-square test: If the p-value is smaller than the alpha level, we reject the null hypothesis that the two variables are independent and conclude that they are associated.
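
A chi-square test of independence can be sketched in Python with `scipy.stats.chi2_contingency`. The 2x2 table of counts and alpha = 0.05 are made up for illustration. For 2x2 tables SciPy applies Yates' continuity correction by default.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up 2x2 table of counts: rows = levels of x, columns = levels of y
table = np.array([[30, 10],
                  [10, 30]])

chi2, p_value, dof, expected = chi2_contingency(table)

alpha = 0.05  # assumed significance level
dependent = bool(p_value < alpha)  # True means we reject independence

print(chi2, p_value, dof, dependent)
print(expected)  # counts expected if the two variables were independent
```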

16.3.0.4 Fluctuation diagrams


16.3.0.5 Tidy

16.3.0.5.1 Tidy data means 1 variable per column and 1 observation per row
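
As an illustration of the definition, the sketch below reshapes a small "wide" table into tidy form with pandas `melt`. The table and its column names are made up.

```python
import pandas as pd

# Wide form: the year is hidden in the column names
wide = pd.DataFrame({
    "id": [1, 2],
    "score_2020": [80, 90],
    "score_2021": [85, 95],
})

# Tidy form: one column per variable (id, year, score), one row per observation
tidy = wide.melt(id_vars="id", var_name="year", value_name="score")
tidy["year"] = tidy["year"].str.replace("score_", "", regex=False).astype(int)

print(tidy)
```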

16.4 Find relations between two variables

16.4.0.1 Scatter plot: base R code: plot(x, y)

16.4.0.1.1 Shows gaps, clusters, outliers, boundaries, conditional relationships, and associations
16.4.0.1.2 When points overlap heavily, consider transforming to a log scale, drawing a square heatmap of bin counts, or adding density-estimate contour lines
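
Two of these remedies can be sketched with matplotlib. The data are simulated, the off-screen "Agg" backend is an assumption made so the code runs without a display, and the bin count of 30 is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=2000)      # simulated skewed data
y = x * rng.lognormal(mean=0.0, sigma=0.3, size=2000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Log-scale both axes to spread out the points crowded near the origin
ax1.scatter(x, y, s=5, alpha=0.3)
ax1.set_xscale("log")
ax1.set_yscale("log")
ax1.set_title("scatter, log scale")

# Square heatmap of bin counts on the log-transformed values
counts, xedges, yedges, image = ax2.hist2d(np.log10(x), np.log10(y), bins=30)
fig.colorbar(image, ax=ax2, label="count")
ax2.set_title("2D bin counts")

fig.savefig("scatter_vs_bins.png")
```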

16.5 Continuous and categorical variables

16.5.0.1 parallel coordinates plot

16.5.0.1.1 Example using parcoords():
df %>%
  parcoords(
    rownames = FALSE,
    brushMode = "1D-axes",
    reorderable = TRUE,
    color = list(colorBy = "Region", colorScale = "scaleOrdinal", colorScheme = "schemeCategory10"),
    withD3 = TRUE,
    width = 1500,
    height = 800
  )

16.5.1 Heatmaps

16.5.1.1 Can be used for continuous or categorical data

16.5.1.2 Show frequency counts (2D histogram) or the value of a third variable
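
Both uses can be sketched in Python: one heatmap of frequency counts and one of a third variable's mean. The data frame and its column names are made up, and the off-screen "Agg" backend is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "day": ["Mon", "Mon", "Tue", "Tue", "Tue"],
    "shift": ["am", "pm", "am", "pm", "pm"],
    "sales": [10, 20, 30, 40, 60],
})

# Frequency counts for each (day, shift) cell
counts = pd.crosstab(df["day"], df["shift"])
# Value of a third variable: mean sales in each cell
mean_sales = df.pivot_table(index="day", columns="shift", values="sales", aggfunc="mean")

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, grid, title in zip(axes, [counts, mean_sales], ["count", "mean sales"]):
    im = ax.imshow(grid.to_numpy(), cmap="viridis")
    ax.set_xticks(range(grid.shape[1]))
    ax.set_xticklabels(grid.columns)
    ax.set_yticks(range(grid.shape[0]))
    ax.set_yticklabels(grid.index)
    ax.set_title(title)
    fig.colorbar(im, ax=ax)

fig.savefig("heatmaps.png")
```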

16.5.2 Alluvial diagrams

16.5.2.1 shows flow of changes over time

16.5.2.2 color by first variable or color by last variable or

16.5.2.3 Different ways to plot an alluvial diagram:

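library(ggplot2)
library(ggalluvial)
# var1 and var2 are placeholder names for the two categorical columns of df
ggplot(df, aes(axis1 = var1, axis2 = var2, y = Freq)) +
  geom_alluvium(color = "blue") +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = paste(after_stat(stratum))))
16.5.2.3.1 Or convert the data to lodes form first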
dfl <- to_lodes_form(df, axes = 1:2)
ggplot(dfl, aes(alluvium = alluvium, x = x, stratum = stratum, y = Freq)) +
  geom_alluvium(color = "blue") +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = paste(after_stat(stratum))))

16.5.3 Some useful packages for plotting in Python

import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.pyplot import figure

16.5.4 Making plots in Python

# scatter plot (x and y are array-like; the label text is a placeholder)
plt.scatter(x, y, marker='o')
plt.xlabel('x label')
plt.ylabel('y label')
# histogram
plt.hist(x)
# pie chart (wf is a DataFrame with "Weak Foot" and "Count" columns)
plt.pie(wf["Count"], labels=wf["Weak Foot"])
# boxplot ('group' and 'value' are placeholder column names of df)
sns.boxplot(x='group', y='value', data=df)
# violin plot
sns.violinplot(x='group', y='value', data=df)
# bar plot
sns.barplot(x='group', y='value', data=df)
# small multiples of bar plots (s1, s2, s3 are DataFrames of counts)
fig, axs = plt.subplots(3, 1, figsize=(15, 8))
sns.barplot(ax=axs[0], data=s1, x='workclass', hue='target', y='Count')
sns.barplot(ax=axs[1], data=s2, x='education', hue='target', y='Count')
sns.barplot(ax=axs[2], data=s3, x='sex', hue='target', y='Count')
# another way to make small multiples, here with scatter plots
plt.subplot(3, 1, 1)
plt.scatter(auto_mpg_X['displacement'], auto_mpg_y)

plt.subplot(3, 1, 2)
plt.scatter(auto_mpg_X['horsepower'], auto_mpg_y)

plt.subplot(3, 1, 3)
plt.scatter(auto_mpg_X['weight'], auto_mpg_y)
# some code to preprocess data ('group' and 'value' are placeholder column names)
df.groupby('group')['value'].mean()
df.dropna(subset=['value'])
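
The small-multiple bar plots above expect tables such as s1 that already hold a Count column. One way such a table might be built is sketched below with `dropna` and `groupby`. The rows are made up, and how the original s1 was actually created is not shown in these notes.

```python
import pandas as pd

# Made-up rows standing in for the raw data behind s1
df = pd.DataFrame({
    "workclass": ["Private", "Private", "State-gov", None, "Private"],
    "target": ["<=50K", ">50K", "<=50K", "<=50K", "<=50K"],
})

# Drop rows where the grouping column is missing
clean = df.dropna(subset=["workclass"])

# Count rows per (workclass, target) pair and store the result as 'Count'
s1 = (clean.groupby(["workclass", "target"])
           .size()
           .reset_index(name="Count"))

print(s1)
```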

For this cheat sheet, I reviewed all the slides again and collected the most important material that I think may be useful when doing EDA. I think this material will help students prepare for their final exams and give them hints before they start fitting models to data. The cheat sheet includes many different types of graphs that we can use to look at the distributions of our data before further work such as classification or prediction. I also added explanations of the graphs and of the information we can get from each one, which I think can help students analyze the graphs better.
I added some basic code for making plots in Python, which gives examples for students who have not learned Python before. The code includes examples of basic graphs such as bar charts, pie charts, histograms, and scatter plots.
While making this cheat sheet, I reviewed everything we learned in class, which helped me understand the course more deeply and consolidate what I had learned. I am also taking a machine learning class that asks us to do data analysis. Before fitting models to the data, we also need to do some preprocessing, and after this course I will never forget to clean the data first. The machine learning course asks us to use Python instead of R, so I think this cheat sheet can save me time in future projects and improve my efficiency.
Next time I may provide more detailed Python code, such as which Python packages can do the same things as ‘dplyr’. I would add code that selects rows or columns of a data frame, changes the ordering of the rows, or performs other data manipulations.