96 A brief instruction for the half semester of EDAV5702

Siyuan Sang

96.1 Introduction The Exploratory Data Analysis and Visualization(EDAV) is a very useful course for R users to get familiar with the techiques and rules we need to learn in terms of data visualization(honestly, anyone can figure out this from the name). In this instruction, instead of digging into some R vitualization skills, I’ll provide you with some useful suggestions and resources based on my first half of semester. I sincerely hope this instruction can help somebody out in the future.

96.2 The packages you should get familiar with These two lines of code will appear on the top of your every homework assignment. Get familiar with them, ggplot2 is the base of graph in this course, and tidyverse is the core package of tidyverse. Here’re two websites you can learn in advance,

96.3 Pay attention to the rules learned in class In the process of data vatualization, there are some rules that you may ignore. However, many students(including me) lost points because of this. To make it more specific, I’ll quote my work and the correct one from my homework1 here to show you how it works. My code goes like this:

ggplot(loans_full_schema, aes(x =loan_purpose, y = loan_amount)) + 
  geom_boxplot() +
  coord_flip() While the correct one should be:

ggplot(loans_full_schema, aes(x = fct_reorder(loan_purpose, loan_amount), y = loan_amount)) + 
  geom_boxplot() +
  coord_flip() See the difference? The only thing I forget to do is arrange the boxed in descending order of median. However, it turns out that this little difference can cause huge difference.

96.4 Try your best to imporve the graph As a course of data vitualization, in my view, drawing a graph should never be the end of the story. There’re many ways to make your graphs look more clear and refined. Here I give an example of the scatterplot of housing price.

cor_df <- ames %>%
  group_by(Neighborhood) %>%
  summarize(cor = cor(area, price), beta = lm(price~area)$coefficients[2], meanprice = mean(price)) %>%
  ungroup() %>%
ggplot(cor_df, aes(meanprice, beta)) +
  geom_point() +
  geom_smooth(method="lm", se= FALSE) It’s now just a set of points with a line in this graph. Now I want to improve it. Considering our goal to make an analysis. It can be a good idea to put the names of Neighborhoods inside.

ggplot(cor_df, aes(meanprice, beta, label = Neighborhood)) +
  geom_point() +
  geom_text(nudge_y = 10, size = 3) +
  geom_smooth(method="lm", se= FALSE)
#### In this graph, although a lot names are in a mess, we can easily figure out the Neighborhood name of the oulier, which is an observable features of the dataset. Here is a blog I’d like to provide, which is a wide summary of many codes to improve your graph, The author is

96.5 Analysis is important! During the first half of semester, I noticed that many students including me seem to have some misunderstanding of this course. We should always remember that the goal of data vitualization is not that graph but what we conclude from that graph. The reason to improve the graphs is not writting several meaningless words as your final analysis. As a data analyst, I think the time we need to cost on analysis the graph should always be much more than we draw it, even you’re not asked to do so.

96.6 Some other suggestions Although this is a course based on R, I still think learning how to achieve the same thing in python is very useful. When you’re asked to find a partner, do it ealier than anyone else. When you’re facing with some problems, you can also look through the instructions in other languages, there’re many high quality journals that is not in English. As an example, Google tanslater is good enough to make me understand the most ideas of a instruction in Japanese.