R in Action, Third Edition
Data analysis and graphics with R and Tidyverse
Robert I. Kabacoff
Manning
Praise for the previous edition of R in Action

“Essential to anyone doing data analysis with R, whether in industry or academia.”
—Cristofer Weber, NeoGrid

“A go-to reference for general R and many statistics questions.”
—George Gaines, KYOS Systems Inc.

“Accessible language, realistic examples, and clear code.”
—Samuel D. McQuillin, University of Houston

“Offers a gentle learning curve to those starting out with R for the first time.”
—Indrajit Sen Gupta, Mu Sigma Business Solutions
R in Action, Third Edition
Data analysis and graphics with R and Tidyverse
Robert I. Kabacoff
Manning
Shelter Island
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com

©2022 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

The author and publisher have made every effort to ensure that the information in this book was correct at press time. The author and publisher do not assume and hereby disclaim any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause, or from any usage of the information herein.

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Development editor: Karen Miller
Technical development editor: Mike Shepard
Review editor: Aleksandar Dragosavljević
Production editor: Deirdre S. Hiam
Copy editor: Suzanne G. Fox
Proofreader: Katie Tennant
Technical proofreader: Ninoslav Cerkez
Typesetter and cover designer: Marija Tudor

ISBN 9781617296055
Printed in the United States of America
brief contents

PART 1 GETTING STARTED
1 ■ Introduction to R
2 ■ Creating a dataset
3 ■ Basic data management
4 ■ Getting started with graphs
5 ■ Advanced data management

PART 2 BASIC METHODS
6 ■ Basic graphs
7 ■ Basic statistics

PART 3 INTERMEDIATE METHODS
8 ■ Regression
9 ■ Analysis of variance
10 ■ Power analysis
11 ■ Intermediate graphs
12 ■ Resampling statistics and bootstrapping

PART 4 ADVANCED METHODS
13 ■ Generalized linear models
14 ■ Principal components and factor analysis
15 ■ Time series
16 ■ Cluster analysis
17 ■ Classification
18 ■ Advanced methods for missing data

PART 5 EXPANDING YOUR SKILLS
19 ■ Advanced graphs
20 ■ Advanced programming
21 ■ Creating dynamic reports
22 ■ Creating a package
contents

preface
acknowledgments
about this book
about the author
about the cover illustration

PART 1 GETTING STARTED

1 Introduction to R
1.1 Why use R?
1.2 Obtaining and installing R
1.3 Working with R
    Getting started ■ Using RStudio ■ Getting help ■ The workspace ■ Projects
1.4 Packages
    What are packages? ■ Installing a package ■ Loading a package ■ Learning about a package
1.5 Using output as input: Reusing results
1.6 Working with large datasets
1.7 Working through an example

2 Creating a dataset
2.1 Understanding datasets
2.2 Data structures
    Vectors ■ Matrices ■ Arrays ■ Data frames ■ Factors ■ Lists ■ Tibbles
2.3 Data input
    Entering data from the keyboard ■ Importing data from a delimited text file ■ Importing data from Excel ■ Importing data from JSON ■ Importing data from the web ■ Importing data from SPSS ■ Importing data from SAS ■ Importing data from Stata ■ Accessing database management systems ■ Importing data via Stat/Transfer
2.4 Annotating datasets
    Variable labels ■ Value labels
2.5 Useful functions for working with data objects

3 Basic data management
3.1 A working example
3.2 Creating new variables
3.3 Recoding variables
3.4 Renaming variables
3.5 Missing values
    Recoding values to missing ■ Excluding missing values from analyses
3.6 Date values
    Converting dates to character variables ■ Going further
3.7 Type conversions
3.8 Sorting data
3.9 Merging datasets
    Adding columns to a data frame ■ Adding rows to a data frame
3.10 Subsetting datasets
    Selecting variables ■ Dropping variables ■ Selecting observations ■ The subset() function ■ Random samples
3.11 Using dplyr to manipulate data frames
    Basic dplyr functions ■ Using pipe operators to chain statements
3.12 Using SQL statements to manipulate data frames

4 Getting started with graphs
4.1 Creating a graph with ggplot2
    ggplot ■ Geoms ■ Grouping ■ Scales ■ Facets ■ Labels ■ Themes
4.2 ggplot2 details
    Placing the data and mapping options ■ Graphs as objects ■ Saving graphs ■ Common mistakes

5 Advanced data management
5.1 A data management challenge
5.2 Numerical and character functions
    Mathematical functions ■ Statistical functions ■ Probability functions ■ Character functions ■ Other useful functions ■ Applying functions to matrices and data frames ■ A solution for the data management challenge
5.3 Control flow
    Repetition and looping ■ Conditional execution
5.4 User-written functions
5.5 Reshaping data
    Transposing ■ Converting from wide to long dataset formats
5.6 Aggregating data

PART 2 BASIC METHODS

6 Basic graphs
6.1 Bar charts
    Simple bar charts ■ Stacked, grouped, and filled bar charts ■ Mean bar charts ■ Tweaking bar charts
6.2 Pie charts
6.3 Tree maps
6.4 Histograms
6.5 Kernel density plots
6.6 Box plots
    Using parallel box plots to compare groups ■ Violin plots
6.7 Dot plots

7 Basic statistics
7.1 Descriptive statistics
    A menagerie of methods ■ Even more methods ■ Descriptive statistics by group ■ Summarizing data interactively with dplyr ■ Visualizing results
7.2 Frequency and contingency tables
    Generating frequency tables ■ Tests of independence ■ Measures of association ■ Visualizing results
7.3 Correlations
    Types of correlations ■ Testing correlations for significance ■ Visualizing correlations
7.4 T-tests
    Independent t-test ■ Dependent t-test ■ When there are more than two groups
7.5 Nonparametric tests of group differences
    Comparing two groups ■ Comparing more than two groups
7.6 Visualizing group differences

PART 3 INTERMEDIATE METHODS

8 Regression
8.1 The many faces of regression
    Scenarios for using OLS regression ■ What you need to know
8.2 OLS regression
    Fitting regression models with lm() ■ Simple linear regression ■ Polynomial regression ■ Multiple linear regression ■ Multiple linear regression with interactions
8.3 Regression diagnostics
    A typical approach ■ An enhanced approach ■ Multicollinearity
8.4 Unusual observations
    Outliers ■ High-leverage points ■ Influential observations
8.5 Corrective measures
    Deleting observations ■ Transforming variables ■ Adding or deleting variables ■ Trying a different approach
8.6 Selecting the “best” regression model
    Comparing models ■ Variable selection
8.7 Taking the analysis further
    Cross-validation ■ Relative importance

9 Analysis of variance
9.1 A crash course on terminology
9.2 Fitting ANOVA models
    The aov() function ■ The order of formula terms
9.3 One-way ANOVA
    Multiple comparisons ■ Assessing test assumptions
9.4 One-way ANCOVA
    Assessing test assumptions ■ Visualizing the results
9.5 Two-way factorial ANOVA
9.6 Repeated measures ANOVA
9.7 Multivariate analysis of variance (MANOVA)
    Assessing test assumptions ■ Robust MANOVA
9.8 ANOVA as regression

10 Power analysis
10.1 A quick review of hypothesis testing
10.2 Implementing power analysis with the pwr package
    T-tests ■ ANOVA ■ Correlations ■ Linear models ■ Tests of proportions ■ Chi-square tests ■ Choosing an appropriate effect size in novel situations
10.3 Creating power analysis plots
10.4 Other packages

11 Intermediate graphs
11.1 Scatter plots
    Scatter plot matrices ■ High-density scatter plots ■ 3D scatter plots ■ Spinning 3D scatter plots ■ Bubble plots
11.2 Line charts
11.3 Corrgrams
11.4 Mosaic plots

12 Resampling statistics and bootstrapping
12.1 Permutation tests
12.2 Permutation tests with the coin package
    Independent two-sample and k-sample tests ■ Independence in contingency tables ■ Independence between numeric variables ■ Dependent two-sample and k-sample tests ■ Going further
12.3 Permutation tests with the lmPerm package
    Simple and polynomial regression ■ Multiple regression ■ One-way ANOVA and ANCOVA ■ Two-way ANOVA
12.4 Additional comments on permutation tests
12.5 Bootstrapping
12.6 Bootstrapping with the boot package
    Bootstrapping a single statistic ■ Bootstrapping several statistics

PART 4 ADVANCED METHODS

13 Generalized linear models
13.1 Generalized linear models and the glm() function
    The glm() function ■ Supporting functions ■ Model fit and regression diagnostics
13.2 Logistic regression
    Interpreting the model parameters ■ Assessing the impact of predictors on the probability of an outcome ■ Overdispersion ■ Extensions
13.3 Poisson regression
    Interpreting the model parameters ■ Overdispersion ■ Extensions

14 Principal components and factor analysis
14.1 Principal components and factor analysis in R
14.2 Principal components
    Selecting the number of components to extract ■ Extracting principal components ■ Rotating principal components ■ Obtaining principal component scores
14.3 Exploratory factor analysis
    Deciding how many common factors to extract ■ Extracting common factors ■ Rotating factors ■ Factor scores ■ Other EFA-related packages
14.4 Other latent variable models

15 Time series
15.1 Creating a time-series object in R
15.2 Smoothing and seasonal decomposition
    Smoothing with simple moving averages ■ Seasonal decomposition
15.3 Exponential forecasting models
    Simple exponential smoothing ■ Holt and Holt–Winters exponential smoothing ■ The ets() function and automated forecasting
15.4 ARIMA forecasting models
    Prerequisite concepts ■ ARMA and ARIMA models ■ Automated ARIMA forecasting
15.5 Going further

16 Cluster analysis
16.1 Common steps in cluster analysis
16.2 Calculating distances
16.3 Hierarchical cluster analysis
16.4 Partitioning-cluster analysis
    K-means clustering ■ Partitioning around medoids
16.5 Avoiding nonexistent clusters
16.6 Going further

17 Classification
17.1 Preparing the data
17.2 Logistic regression
17.3 Decision trees
    Classical decision trees ■ Conditional inference trees
17.4 Random forests
17.5 Support vector machines
    Tuning an SVM
17.6 Choosing a best predictive solution
17.7 Understanding black box predictions
    Break-down plots ■ Plotting Shapley values
17.8 Going further

18 Advanced methods for missing data
18.1 Steps in dealing with missing data
18.2 Identifying missing values
18.3 Exploring missing-values patterns
    Visualizing missing values ■ Using correlations to explore missing values
18.4 Understanding the sources and impact of missing data
18.5 Rational approaches for dealing with incomplete data
18.6 Deleting missing data
    Complete-case analysis (listwise deletion) ■ Available case analysis (pairwise deletion)
18.7 Single imputation
    Simple imputation ■ K-nearest neighbor imputation ■ missForest
18.8 Multiple imputation
18.9 Other approaches to missing data

PART 5 EXPANDING YOUR SKILLS

19 Advanced graphs
19.1 Modifying scales
    Customizing axes ■ Customizing colors
19.2 Modifying themes
    Prepackaged themes ■ Customizing fonts ■ Customizing legends ■ Customizing the plot area
19.3 Adding annotations
19.4 Combining graphs
19.5 Making graphs interactive

20 Advanced programming
20.1 A review of the language
    Data types ■ Control structures ■ Creating functions
20.2 Working with environments
20.3 Non-standard evaluation
20.4 Object-oriented programming
    Generic functions ■ Limitations of the S3 model
20.5 Writing efficient code
    Efficient data input ■ Vectorization ■ Correctly sizing objects ■ Parallelization
20.6 Debugging
    Common sources of errors ■ Debugging tools ■ Session options that support debugging ■ Using RStudio’s visual debugger
20.7 Going further

21 Creating dynamic reports
21.1 A template approach to reports
21.2 Creating a report with R and R Markdown
21.3 Creating a report with R and LaTeX
    Creating a parameterized report
21.4 Avoiding common R Markdown problems
21.5 Going further

22 Creating a package
22.1 The edatools package
22.2 Creating a package
    Installing development tools ■ Creating a package project ■ Writing the package functions ■ Adding function documentation ■ Adding a general help file (optional) ■ Adding sample data to the package (optional) ■ Adding a vignette (optional) ■ Editing the DESCRIPTION file ■ Building and installing the package
22.3 Sharing your package
    Distributing a source package file ■ Submitting to CRAN ■ Hosting on GitHub ■ Creating a package website
22.4 Going further

afterword Into the rabbit hole
appendix A Graphical user interfaces
appendix B Customizing the startup environment
appendix C Exporting data from R
appendix D Matrix algebra in R
appendix E Packages used in this book
appendix F Working with large datasets
appendix G Updating an R installation
references
index
preface

What is the use of a book without pictures or conversations?
—Alice, Alice’s Adventures in Wonderland

It’s wondrous, with treasures to satiate desires both subtle and gross; but it’s not for the timid.
—Q, “Q Who?” Star Trek: The Next Generation

When I began writing this book, I spent quite a bit of time searching for a good quote to start things off. I ended up with two. R is a wonderfully flexible platform and language for exploring, visualizing, and understanding data. I chose the quote from Alice’s Adventures in Wonderland to capture the flavor of statistical analysis today—an interactive process of exploration, visualization, and interpretation.

The second quote reflects the generally held notion that R is difficult to learn. What I hope to show you is that it doesn’t have to be. R is broad and powerful, with so many analytic and graphic functions available (more than 50,000 at last count) that it easily intimidates both novice and experienced users alike. But there is rhyme and reason to the apparent madness. With guidelines and instructions, you can navigate the tremendous resources available, selecting the tools you need to accomplish your work with style, elegance, efficiency—and more than a little coolness.

I first encountered R several years ago when I was applying for a new statistical consulting position. The prospective employer asked in the pre-interview material if I was conversant in R. Following the standard advice of recruiters, I immediately said yes and set off to learn it. I was an experienced statistician and researcher, had 25 years of experience as an SAS and SPSS programmer, and was fluent in a half-dozen programming languages. How hard could it be? Famous last words.