# 16 Drawing graphs using Python and R

Shuyue Xu

## 16.2 Continuous Variables

For continuous variables, we always look for asymmetry, outliers, multimodality, gaps, heaping, and errors.

### 16.2.1 Graphs commonly used for continuous variables

#### 16.2.1.1 Histograms: R base code: hist(x, …)

##### 16.2.1.1.2 Shape

Skewed to the left: typically mean < median < mode<br/>
Skewed to the right: typically mean > median > mode
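
The ordering of mean, median, and mode can be checked on a small sample. The sketch below uses made-up numbers (not course data) and Python's standard `statistics` module. The ordering is a rule of thumb that holds for these samples, not a guarantee for every skewed distribution.

```python
import statistics

# Hypothetical right-skewed sample: many small values, a few large ones.
right_skewed = [1, 1, 1, 2, 2, 3, 4, 6, 9, 15]
# Mirror the sample to get a left-skewed one.
left_skewed = [16 - v for v in right_skewed]

def summarize(values):
    """Return (mean, median, mode) of a list of numbers."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.mode(values),
    )

r_mean, r_median, r_mode = summarize(right_skewed)
l_mean, l_median, l_mode = summarize(left_skewed)

print("right-skewed:", r_mean, r_median, r_mode)
print("left-skewed: ", l_mean, l_median, l_mode)
```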

#### 16.2.1.2 Boxplot: R base code: boxplot(x) or boxplot(formula, data=)

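##### 16.2.1.2.2 Outliers

A boxplot shows outliers as individual points. Outliers are values that lie<br/>
more than 1.5 times the hinge spread (fourth spread) above the upper hinge, or<br/>
more than 1.5 times the hinge spread (fourth spread) below the lower hinge.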
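
The 1.5 rule can be worked through by hand. The sketch below uses a made-up sample and assumes the quartiles from Python's `statistics.quantiles` are an acceptable stand-in for the hinges. The two can differ slightly for small samples, so the fences may not match R's `boxplot()` exactly.

```python
import statistics

# Hypothetical sample with one suspiciously large value.
data = [2, 3, 4, 5, 5, 6, 7, 8, 9, 30]

# Quartiles used here as an approximation of the lower and upper hinges.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
spread = q3 - q1  # approximates the hinge spread (fourth spread)

lower_fence = q1 - 1.5 * spread
upper_fence = q3 + 1.5 * spread

# Values beyond the fences would be drawn as individual points.
outliers = [v for v in data if v < lower_fence or v > upper_fence]
print(lower_fence, upper_fence, outliers)
```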

### 16.2.2 Ways to check whether the data are normally distributed

#### 16.2.2.1 QQ plot: check for normality

qqnorm(): produces a normal QQ plot of the variable<br/>
qqline(): adds a reference line
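
To see what a normal QQ plot compares, the points can be computed by hand. This is a minimal sketch with a made-up sample. It assumes the plotting positions (i - 0.5)/n, which is one common choice, so R's `qqnorm()` may place the points slightly differently.

```python
from statistics import NormalDist

# Hypothetical sample of measurements.
sample = [4.1, 5.3, 4.8, 6.0, 5.1, 4.6, 5.7, 5.0, 4.9, 5.5]
n = len(sample)

# Sample quantiles: simply the sorted observations.
ordered = sorted(sample)

# Theoretical quantiles of the standard normal at the assumed plotting positions.
standard_normal = NormalDist()
theoretical = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Each (theoretical, observed) pair is one point on the QQ plot.
qq_points = list(zip(theoretical, ordered))
for t, s in qq_points:
    print(round(t, 3), s)
```

If the data are roughly normal, these points fall close to a straight line.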

## 16.3 Multivariate categorical data

#### 16.3.0.1 nominal vs. ordinal, ordinal vs. discrete,…

### Frequency

#### Bar plot

Basic R code: barplot(x)

##### Sort in logical order of the categories (level1, level2, level3..) (ordinal data)

##### Sort from highest to lowest count (nominal data)
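
The two sorting rules can be sketched in Python before plotting. The category names and data below are hypothetical. The resulting order is what would be passed to the plotting function.

```python
from collections import Counter

# Hypothetical nominal variable: categories have no natural order.
colors = ["red", "blue", "red", "green", "red", "blue"]
color_counts = Counter(colors)
# Nominal data: sort bars from highest to lowest count.
by_count = sorted(color_counts.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical ordinal variable: categories have a logical order.
responses = ["agree", "neutral", "agree", "disagree", "agree", "neutral"]
response_counts = Counter(responses)
levels = ["disagree", "neutral", "agree"]  # assumed logical order
# Ordinal data: keep the logical order of the levels, whatever the counts are.
by_level = [(level, response_counts[level]) for level in levels]

print(by_count)
print(by_level)
```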

#### 16.3.0.3 R code for factor data

##### 16.3.0.3.7 Use “fct_explicit_na()” to turn NAs into a real factor level (deprecated since forcats 1.0.0 in favor of “fct_na_value_to_level()”)
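
The same idea can be illustrated in plain Python: give missing values their own label so that they are counted and plotted as a category. This is only an analogue of the idea, not how forcats implements it. The data and the `MISSING_LABEL` name are made up for the example.

```python
from collections import Counter

# Hypothetical categorical values, with None standing in for NA.
values = ["a", None, "b", "a", None, "c"]

# Label used for the explicit missing level (an assumption for this sketch).
MISSING_LABEL = "(Missing)"

# Replace missing values with the explicit label.
explicit = [MISSING_LABEL if v is None else v for v in values]

# Missing values now appear as their own category in the counts.
counts = Counter(explicit)
print(counts)
```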

### Proportion / Association

#### Mosaic plots

Code: mosaic(y ~ x)

##### Check association between different variables

##### Chi-square test: if the p-value is smaller than the alpha level, we reject the null hypothesis that the two variables are independent and conclude that they are associated.
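
The decision rule for the chi-square test can be worked through on a small table. The sketch below uses a made-up 2 x 2 table of counts and computes Pearson's statistic by hand, without a continuity correction, so the result can differ from R's `chisq.test()` default for 2 x 2 tables. The p-value formula used is valid only for one degree of freedom.

```python
import math

# Hypothetical 2 x 2 contingency table of counts (rows: x, columns: y).
table = [[30, 10],
         [10, 30]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (observed - expected) ** 2 / expected

# For 1 degree of freedom (a 2 x 2 table), P(X > chi2) = erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

alpha = 0.05
reject_independence = p_value < alpha
print(chi2, p_value, reject_independence)
```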

## 16.5 Continuous Categorical Variables

### 16.5.2 Alluvial diagrams

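#### 16.5.2.3 Different ways to plot an alluvial diagram

    # ggalluvial provides geom_alluvium(), geom_stratum() and to_lodes_form()
    library(ggplot2)
    library(ggalluvial)

    # df is in wide form: one column per axis plus a Freq column
    ggplot(df, aes(axis1, axis2, y = Freq)) +
      geom_alluvium(color = "blue") +
      geom_stratum() +
      geom_text(stat = "stratum", aes(label = paste(after_stat(stratum))))

##### 16.5.2.3.1 Or convert the data to lodes form first

    # to_lodes_form() reshapes the axis columns into alluvium, x and stratum
    dfl <- to_lodes_form(df, axes = 1:2)
    ggplot(dfl, aes(alluvium = alluvium, x = x, stratum = stratum, y = Freq)) +
      geom_alluvium(color = "blue") +
      geom_stratum() +
      geom_text(stat = "stratum", aes(label = paste(after_stat(stratum))))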
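
To make the lodes form concrete, the reshape can be sketched in plain Python. This only illustrates the idea and is not ggalluvial's implementation. The rows are made up, and the key names `alluvium`, `x`, and `stratum` are chosen to mirror the column names used in the R code above.

```python
# Hypothetical wide-form data: one row per alluvium, one column per axis.
wide = [
    {"axis1": "A", "axis2": "X", "Freq": 5},
    {"axis1": "A", "axis2": "Y", "Freq": 3},
    {"axis1": "B", "axis2": "X", "Freq": 2},
]
axes = ["axis1", "axis2"]

# Lodes (long) form: one row per combination of alluvium and axis.
lodes = [
    {"alluvium": i, "x": axis, "stratum": row[axis], "Freq": row["Freq"]}
    for axis in axes
    for i, row in enumerate(wide, start=1)
]

for lode in lodes:
    print(lode)
```

Each original row becomes one lode per axis, which is why the long form has `len(wide) * len(axes)` rows.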

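### 16.5.3 Some useful packages for plotting in Python

    import matplotlib.pyplot as plt           # basic plotting functions
    import seaborn as sns                     # statistical plots built on matplotlib
    from matplotlib.pyplot import figure      # for creating and sizing a figure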

### 16.5.4 Making plots in Python

    # scatter plot
    plt.scatter(x, y, marker='o')
    plt.xlabel('x label')  # xlabel() and ylabel() need the label text as an argument
    plt.ylabel('y label')
    # histogram
    plt.hist(x)
    # pie chart
    plt.pie(wf["Count"], labels=wf["Weak Foot"])
    # boxplot ('group' and 'value' are placeholder column names of a DataFrame df)
    sns.boxplot(x='group', y='value', data=df)
    # violin plot
    sns.violinplot(x='group', y='value', data=df)
    # bar plot
    sns.barplot(x='group', y='value', data=df)
    # small multiples of bar plots
    fig, axs = plt.subplots(3, 1, figsize=(15, 8))
    sns.barplot(ax=axs[0], data=s1, x='workclass', hue='target', y='Count')
    sns.barplot(ax=axs[1], data=s2, x='education', hue='target', y='Count')
    sns.barplot(ax=axs[2], data=s3, x='sex', hue='target', y='Count')
    # another way to draw small multiples, here of scatter plots
    plt.subplot(3, 1, 1)
    plt.scatter(auto_mpg_X['displacement'], auto_mpg_y)

    plt.subplot(3, 1, 2)
    plt.scatter(auto_mpg_X['horsepower'], auto_mpg_y)

    plt.subplot(3, 1, 3)
    plt.scatter(auto_mpg_X['weight'], auto_mpg_y)
    # some code to preprocess data (placeholder column names again)
    df.groupby('group')          # group rows by a column before aggregating
    df.dropna(subset=['value'])  # drop rows with missing values in the listed columns

For this cheat sheet, I reviewed all the slides again and collected the most important material that I think may be useful when doing EDA. I think this material will help students prepare for their final exams and give them hints before they start fitting models to data. The cheat sheet includes many different types of graphs that we can use to examine the distributions of our data before further work such as classification or prediction. I also added explanations of the graphs and of the information we can get from each one, which I think can help students analyze the graphs better.
I added some basic code for making plots in Python, which gives examples for students who have not learned Python before. The code includes examples of basic graphs such as bar charts, pie charts, histograms, scatter plots, and so on.
While making this cheat sheet, I reviewed everything we learned in class, which gave me a deeper understanding of the course and helped consolidate the material in my memory. I am also taking a machine learning class that asks us to do data analysis. Before fitting models to the data, we need to do some preprocessing, and after this course I will never forget to do data cleaning. The machine learning course asks us to use Python instead of R, so I think this cheat sheet can save me time in future projects and improve my efficiency.
Next time, I may provide more detailed Python code, such as which Python packages can do the same things as ‘dplyr’. I would add code that selects rows or columns in a data frame, changes the ordering of rows, or performs other data manipulations.