Community contributions for EDAV Fall 2019

Chapter 67 Chinese translation links

Some groups of students have contributed to the community by translating useful resources into another language.

67.1 R and ggplot2

Yuchen Pei and Jiaqi Tang

We translated two online tutorials on visualization in R into Chinese. The first is A Comprehensive Guide to Data Visualization in R for Beginners and the second is ggplot2: Mastering the basics. Our translation files can be found on GitHub: https://github.com/Jasmine1231/EDAV-19Fall-Community_Contribution

67.2 forcats package

Xu Xu and Xiaoyun Zhu

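Introduction:

When it comes to data analysis, we have to mention Hadley Wickham.

Hadley Wickham is Chief Scientist at RStudio and an adjunct professor of statistics at Stanford University and Rice University. He is the developer of the well-known graphics package ggplot2 and the author of many other widely used packages, such as plyr and reshape2.

Wickham once said that fundamentally understanding the world through data is really, really cool, and tidying data is the first step toward seeing the world through data. As a prolific R developer, Wickham likes to empower people who enjoy working with data. He created forcats, a package for tidying categorical data: it works with factors and makes modifying them more efficient.

Below we translate an R-bloggers article (by S. Richter-Walsh) that explains how to use forcats. We hope this tutorial makes tidying data easier for everyone.

Link to the original article, for reference: https://www.r-bloggers.com/cats-are-great-and-so-is-the-forcats-r-package/

Translated text:

Forcats is a package created by Hadley Wickham that is very handy for data tidying. Before analysis and modeling, we often need to spend a lot of time cleaning (or preprocessing) data. If I had to estimate, I would say a data scientist spends at least 70%-80% of their time cleaning data. This is also the biggest difference between what is taught in school and real industry projects. The datasets used in teaching are usually tidy and already preprocessed, but datasets in real work almost never are. I enjoy cleaning data and analyzing the problems that come up along the way. I have found the forcats package very effective for tidying categorical data.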

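67.2.1 Preparing the example data

We use the code below to generate some data for demonstration. The dataset describes sales channels: a factor with 7 levels, drawn from a pool that contains 50 missing values (NA). Because the rows are resampled with replacement, the number of NAs in the final data varies from run to run (44 in the run shown here, since 356 of the 400 rows are non-missing).

library(dplyr)   # provides tibble(), mutate(), and the pipe operator %>%
library(forcats)

# Pool of 400 values: 7 sales channels plus NA, each repeated 50 times
df <- tibble(sales = factor(rep(c("Online",
                                  "Post",
                                  "Web",
                                  "Call Centre",
                                  "Inbound Phone",
                                  "Outbound Phone",
                                  "Field Sales",
                                  NA), 50)),
             buy = sample(c(0, 1), 400, replace = TRUE)) %>%
  # Resample the channel column with replacement so the level counts differ
  mutate(sales = sample(sales, size = length(sales), replace = TRUE))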

table(df$sales)
## 
##    Call Centre    Field Sales  Inbound Phone         Online Outbound Phone 
##             49             47             57             41             48 
##           Post            Web 
##             61             53
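
The counts above change on every run because sample() is not seeded, and table() omits NA by default. A minimal sketch of a reproducible variant, assuming an arbitrary seed (2019) and the hypothetical names pool and df_demo, none of which appear in the original article:

```r
set.seed(2019)  # arbitrary seed (an assumption) so the counts are reproducible

# Same pool as in the text: 7 channels plus NA, each repeated 50 times
pool <- factor(rep(c("Online", "Post", "Web", "Call Centre",
                     "Inbound Phone", "Outbound Phone", "Field Sales", NA), 50))

# df_demo is a hypothetical name; base data.frame() is enough for this check
df_demo <- data.frame(
  sales = sample(pool, size = length(pool), replace = TRUE),
  buy   = sample(c(0, 1), length(pool), replace = TRUE)
)

# useNA = "ifany" adds a column for NA, which table() drops by default
table(df_demo$sales, useNA = "ifany")

# Number of missing values in this particular draw
sum(is.na(df_demo$sales))
```

Counting the NAs explicitly makes it easier to confirm later that no rows were lost when the missing values are turned into a level.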

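67.2.2 Handling missing data (NAs)

Missing data is common in real work. When computing the mean, median, variance, and standard deviation of continuous variables in an R data frame, we need to account for missing values. In addition, if we want to use a dataset for modeling, we need to handle the missing values in order to keep certain variables and avoid losing data. Common strategies are to replace missing values in continuous data with the mean or median, or to replace missing values in categorical data with the mode. In some cases these strategies are not appropriate, and we instead need to turn the missing data into an explicit factor level. forcats::fct_explicit_na() does this in a single line of code. Let's try it on the example data: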

df$sales <- fct_explicit_na(df$sales)

table(df$sales)
## 
##    Call Centre    Field Sales  Inbound Phone         Online Outbound Phone 
##             49             47             57             41             48 
##           Post            Web      (Missing) 
##             61             53             44

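After this step, missing data is represented by an explicit factor level: the former NA values are replaced by (Missing). We can also choose our own name for the new level with the code below. Run it instead of the call above, because once the NAs have been converted there is nothing left for a second call to rename: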

df$sales <- fct_explicit_na(df$sales, na_level = "My New Level")
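
In forcats 1.0.0 and later, fct_explicit_na() is deprecated in favor of fct_na_value_to_level(). A minimal sketch on a toy factor, assuming forcats 1.0.0 or later is installed; the vector f is made up for illustration and is not part of the article's data:

```r
library(forcats)

# Toy factor with two missing values (illustrative data, not from the article)
f <- factor(c("Online", NA, "Post", "Online", NA))

# Turn NA values into an explicit level; the level argument sets its name
f_explicit <- fct_na_value_to_level(f, level = "(Missing)")

levels(f_explicit)
table(f_explicit)
```

On older forcats versions, fct_explicit_na(f, na_level = "(Missing)") as used in the text plays the same role.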

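67.2.3 Synonymous factor levels

Sometimes a categorical variable contains two or more factor levels that refer to the same group. The spelling may differ only slightly, for example a capitalized versus a lowercase first letter (GroupA vs. groupA). In this case we can use forcats::fct_collapse() to merge several synonymous factor levels into one. In our test data, let's assume Web and Online refer to the same sales channel. We want to merge the two into a single level named Online.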

df$sales <- fct_collapse(df$sales, Online=c("Online", "Web"))
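
The capitalization case mentioned above can be handled the same way. A minimal sketch with a made-up factor g (not from the article) in which GroupA and groupA denote the same group:

```r
library(forcats)

# Toy factor: "GroupA" and "groupA" are the same group spelled two ways
g <- factor(c("GroupA", "groupA", "GroupB", "groupA"))

# New level name on the left, the old levels it absorbs on the right
g_clean <- fct_collapse(g, GroupA = c("GroupA", "groupA"))

table(g_clean)
```

Levels that are not named in the call, such as GroupB here, are left unchanged.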

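67.2.4 Lumping several low-frequency factor levels into one

Another situation that may arise is that we want to analyze or model groups that need a large sample size to maintain statistical significance. Imagine a factor variable with 20 levels, of which only 5 account for more than 90% of the observations in the dataset. You could drop the observations in the other 15 levels entirely, but data loss should be avoided where possible. Alternatively, you can lump the low-frequency levels into a single level that covers all of them, which tidies the grouping while leaving the other attribute variables attached to those observations unchanged. forcats::fct_lump() lumps low-frequency levels into a default level named Other and keeps that level the smallest of all levels. The n parameter lets you control how many levels are kept after lumping. In our test data, the levels named Outbound, Phone are the least frequent, so they are lumped into a new level named Other.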

df$sales <- fct_lump(df$sales)
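
A minimal sketch of keeping a fixed number of levels, assuming forcats 0.5.0 or later, where fct_lump_n() is available as the dedicated form of fct_lump(f, n = ...). The factor h and its counts are invented for illustration:

```r
library(forcats)

# Toy factor with level counts a = 10, b = 6, c = 2, d = 1
h <- factor(rep(c("a", "b", "c", "d"), times = c(10, 6, 2, 1)))

# Keep the 2 most frequent levels; everything else becomes "Other"
h_lumped <- fct_lump_n(h, n = 2)

table(h_lumped)
```

Here c and d are merged, so Other holds 3 observations and no rows are dropped.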

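67.2.5 Changing the order of bars in a ggplot2 bar chart

tidyverse is a collection of packages that includes dplyr, ggplot2, and forcats. Together these packages give data scientists a well-rounded ecosystem of tools. I want to introduce the forcats::fct_infreq() function, which can be used together with ggplot2 during exploratory data analysis and data presentation. Changing the order of factors in a bar chart can sometimes be awkward, but the forcats package makes the task simple.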

library(ggplot2)
ggplot(df, aes(x = fct_infreq(sales))) + geom_bar()

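The factor levels are now arranged in order of decreasing frequency, including (Missing) and Other. By using forcats for some quick preprocessing, we did not lose any of the original data. Have a great day, everyone!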
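
The ordering that fct_infreq() produces can be checked without drawing a plot. A minimal sketch on an invented factor k (not from the article); fct_rev(), also from forcats, reverses the order when the bars should run from least to most frequent:

```r
library(forcats)

# Toy factor with counts c = 3, a = 2, b = 1
k <- factor(c("b", "a", "c", "a", "c", "c"))

levels(k)                       # alphabetical by default
levels(fct_infreq(k))           # most frequent level first
levels(fct_rev(fct_infreq(k)))  # least frequent level first
```

Either result can be passed to aes(x = ...) in place of the raw column.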

67.3 Continuous variables with R (Chinese)

Bangwei Zhou and Zhihao Ai

We created a tutorial in Chinese on continuous variables in R. We combined and translated text from chapter three of the textbook Graphical Data Analysis with R by Antony Unwin and the Continuous Variables section on edav.info. We also included an example from PSET1 problem 1 to illustrate how a user can use R to assess continuous variables.

We hope this document helps readers whose language background is limited to Chinese quickly gain the skills needed to assess continuous variables with R.

Our document can be found here.

67.4 Visualising Spatial Data

Mutian Wang and Siyuan Wang

We translated the tutorial Introduction to Visualising Spatial Data in R, written by Robin Lovelace, James Cheshire, and Rachel Oldroyd. The source text can be found here.

Our translation can be found on our GitHub repo. In this repo, you can see our translation in the html file, and the source code in the Rmd file.