数据清理技术

second class

Yangyong Ye

SOE,RUE

2020-11-16

Data I/O

  • readr: reads .csv data.

  • haven: reads SPSS, Stata, and SAS files.

  • readxl: reads excel files (both .xls and .xlsx).

  • jsonlite: reads .json data.

  • rio: import, export, convert

First Impression

  • sjPlot::view_df()

  • dplyr::glimpse()

  • skimr::skim()

  • janitor::clean_names() & remove_empty()

tidyverse packages:

  • ggplot2, for data visualization.

  • dplyr, for data manipulation.

  • tidyr, for data tidying.

  • readr, for data import.

tidyverse packages:

  • purrr, for functional programming.

  • tibble, for tibbles, a modern re-imagining of data frames.

  • stringr, for strings.

  • forcats, for factors.

列调整技术

  • select
  • rename
  • relocate
  • mutate
  • case_match

Overview of tidyselect

Tidyverse selections implement a dialect of R where operators make it easy to select variables:

  • : for selecting a range of consecutive variables.

  • ! for taking the complement of a set of variables.

  • & and | for selecting the intersection or the union of two sets of variables.

  • c() for combining selections.

Overview of selection features tidyselect

  • starts_with(): Starts with a prefix.

  • ends_with(): Ends with a suffix.

  • contains(): Contains a literal string.

  • matches(): Matches a regular expression.

  • num_range(): Matches a numerical range like x01, x02, x03.

  • where(): Applies a function to all variables and selects those for which the function returns TRUE.

行调整技术

  • filter

  • arrange

  • slice,slice_head, slice_tail, slice_sample, slice_min, slice_max

filter: conditions

  • conditional: >, >=, <, <=, != (not equal), and == (equal).

  • conditional: & is ??and??, | is ??or??, and ! is ??not??.,

初步统计

  • group_by & summarize

  • count

  • distinct

  • n_distinct

Multiple column or row manipulation

  • across

  • rowwise & c_across

Separate & unite

  • extract

  • separate

  • unite

TWO TABLE VERBS

  • Mutating Joins — inner_join(), left_join(), right_join(), full_join()

  • Filtering Joins — semi_join(), anti_join()

TWO TABLE VERBS

nycflights13

long & wide tidyr

  • pivot_longer()

  • pivot_wider()

factor 15

  • fct_recode

  • fct_reorder

  • fct_lump

string

  • str_c

  • str_sub

  • str_detect

  • str_replace

date/time

date/time in R4DS

Practice

  • 学生id, 学校id, 城市id,是否独生子女、是否近视、是否恋爱、期中考试语文原始成绩、是否实行百分制

  • 只选择期中语文成绩实行100制的学生

  • 将语文成绩的变量名称更改为chinese, 将是否独生子女变量名称更改为child_only, 是否恋爱变更为in_love.

Missing value

  • visdat::vis_miss()

  • tidyr::drop_na()

  • tidyr::replace_na() & na_if()

  • base::is.na()

NEXT WEEK