
The Well-Grounded Data Analyst: Solve messy data problems like a pro

Author: David Asboth

Data

Complete eight data science projects that lock in important real-world skills—along with a practical process you can use to learn any new technique quickly and efficiently.

Data analysts need to be problem solvers—and The Well-Grounded Data Analyst will teach you how to solve the most common problems you'll face in industry. You'll explore eight scenarios that your class or bootcamp won’t have covered, so you can accomplish what your boss is asking for.

In The Well-Grounded Data Analyst you'll learn:
• High-value skills to tackle specific analytical problems
• Deconstructing problems for faster, practical solutions
• Data modeling, PDF data extraction, and categorical data manipulation
• Handling vague metrics, deciphering inherited projects, and defining customer records

The Well-Grounded Data Analyst is for junior and early-career data analysts looking to supplement their foundational data skills with real-world problem solving. As you explore each project, you'll also master a proven process for quickly learning new skills developed by author and Half Stack Data Science podcast host David Asboth. You'll learn how to determine a minimum viable answer for your stakeholders, identify and obtain the data you need to deliver, and reliably present and iterate on your findings. The book can be read cover-to-cover or opened to the chapter most relevant to your current challenges.

About the book

The Well-Grounded Data Analyst introduces you to eight scenarios that every data analyst is bound to face. You’ll practice author David Asboth’s results-oriented approach as you model data by identifying customer records, navigate poorly-defined metrics, extract data from PDFs, and much more! It also teaches you how to take over incomplete projects and create rapid prototypes with real data. Along the way, you’ll build an impressive portfolio of projects you can showcase at your next interview.


MANNING
David Asboth
Foreword by Reuven M. Lerner
Solve messy data problems like a pro
Figure: A results-driven process to apply to any data analysis problem
1. Understand the problem
2. Start at the end
3. Identify additional resources
4. Obtain the data
5. Do the work
6. Present the minimum viable answer
7. Iterate if necessary
The Well-Grounded Data Analyst
SOLVE MESSY DATA PROBLEMS LIKE A PRO
DAVID ASBOTH
FOREWORD BY REUVEN M LERNER
MANNING
SHELTER ISLAND
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com

©2025 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

The authors and publisher have made every effort to ensure that the information in this book was correct at press time. The authors and publisher do not assume and hereby disclaim any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause, or from any usage of the information herein.

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Development editor: Sarah G Harter
Technical editor: Brent J Broadnax
Review editor: Kishor Rit
Production editor: Kathy Rossland
Copy editor: Lana Todorovic-Arndt
Proofreader: Katie Tennant
Technical proofreader: Andrew R Freed
Typesetter: Dennis Dalinnik
Cover designer: Marija Tudor

ISBN: 9781633437531
Printed in the United States of America
To my wife, Barbara, who has always believed in me
brief contents

1 ■ Bridging the gap between data science training and the real world
2 ■ Encoding geographies
3 ■ Data modeling
4 ■ Metrics
5 ■ Unusual data sources
6 ■ Categorical data
7 ■ Categorical data: Advanced methods
8 ■ Time series data: Data preparation
9 ■ Time series data: Analysis
10 ■ Rapid prototyping: Data analysis
11 ■ Rapid prototyping: Creating the proof of concept
12 ■ Iterating on someone else’s work: Data preparation
13 ■ Iterating on someone else’s work: Customer segmentation
contents

foreword
preface
acknowledgments
about this book
about the author
about the cover illustration

1 Bridging the gap between data science training and the real world
1.1 The data analyst’s toolkit
1.2 A results-driven approach
    Understand the problem ■ Start at the end ■ Identify additional resources ■ Obtain the data ■ Do the work ■ Present the minimum viable answer ■ Iterate if necessary
1.3 Project structure

2 Encoding geographies
2.1 Project 1: Identifying customer geographies
    Data dictionary
2.2 An example solution: Finding London
    Setting ourselves up for success ■ Creating the first iteration of a solution ■ Review and future steps
2.3 How to use the rest of the book

3 Data modeling
3.1 The importance of data modeling
    Common data modeling tasks
3.2 Project 2: Who are your customers?
    Problem statement ■ Data dictionary ■ Desired outcomes ■ Required tools
3.3 Planning our approach to customer data modeling
    Applying the results-driven process to data modeling ■ Questions to consider
3.4 An example solution: Identifying customers from transactional data
    Developing an action plan ■ Exploring, extracting, and combining multiple sources of data ■ Applying entity resolution to deduplicate records ■ Conclusions and recommendations
3.5 Closing thoughts on data modeling
    Data modeling skills for any project

4 Metrics
4.1 The importance of well-defined metrics
4.2 Project 3: Defining precise metrics for better decision making
    Problem statement ■ Data dictionary ■ Desired outcomes ■ Required tools
4.3 Applying the results-driven method to different metric definitions
    Questions to consider
4.4 An example solution: Finding the best performing products
    Combining and exploring product data ■ Calculating product-level metrics ■ Finding the best products using our defined metrics
4.5 Closing thoughts on metrics
    Skills for defining better metrics for any project

5 Unusual data sources
5.1 Identifying novel data sources
    Considerations for using new datasets
5.2 Project 4: Analyzing film industry trends using PDF data
    Problem statement ■ Data dictionary ■ Desired outcomes ■ Required tools
5.3 Applying the results-driven method to extracting data from PDFs
5.4 An example solution: Effects of the COVID-19 lockdown periods on the film industry
    Inspecting the available data ■ Extracting data from PDFs ■ Analyzing the data extracted from PDFs ■ Project conclusions and recommendations
5.5 Closing thoughts on exploring novel data sources
    Skills for exploring unusual data sources for any project

6 Categorical data
6.1 Working with categorical data
    Methods for handling categorical data ■ Working with survey data
6.2 Project 5: Analyzing a survey to understand developer attitudes toward AI tools
    Problem statement ■ Data dictionary ■ Desired outcomes ■ Required tools
6.3 Applying the results-driven method to analyzing the developer survey
6.4 An example solution: How do developers use AI?
    Exploring categorical data ■ Analyzing categorical survey data ■ Project progress so far

7 Categorical data: Advanced methods
7.1 Project 5 revisited: Analyzing survey data to determine developer attitudes to AI tools
    Data dictionary ■ Desired outcomes ■ Summary of the project so far
7.2 Using advanced methods to answer further questions about categorical data
    Binning continuous values to discrete categories ■ Using statistical tests for categorical data ■ Answering a new question from start to finish ■ Project results
7.3 Closing thoughts on categorical data
    Skills for working with categorical data for any project

8 Time series data: Data preparation
8.1 Working with time series data
    The hidden depth of time series data ■ How to work with time series data
8.2 Project 6: Analyzing time series to improve cycling infrastructure
    Problem statement ■ Data dictionary ■ Desired outcomes ■ Required tools
8.3 Applying the results-driven method to analyzing road traffic data
8.4 An example solution: Where should cycling infrastructure improvements be focused?
    Investigating available data and extracting time series ■ Project progress so far

9 Time series data: Analysis
9.1 Project 6 revisited: Analyzing time series to improve cycling infrastructure
    Problem statement ■ Data dictionary ■ Desired outcomes
9.2 Where should cycling infrastructure improvements be focused?
    Analysis of time series data ■ Project conclusions and recommendations
9.3 Closing thoughts: Time series
    Skills for working with time series data for any project

10 Rapid prototyping: Data analysis
10.1 The rapid prototyping process
    Rapid prototyping example
10.2 Project 7: Build a proof of concept to investigate Welsh property prices
    Problem statement ■ Data dictionary ■ Desired outcomes ■ Required tools
10.3 Applying the results-driven method to investigating Welsh property data
10.4 An example solution: Building a prototype to explore using house price data
    Analyzing data before prototyping ■ Investigating geographic aspects of a dataset ■ Identifying how to present data in the prototype ■ Project progress so far

11 Rapid prototyping: Creating the proof of concept
11.1 Project 7 revisited: Building a proof of concept to investigate Welsh property prices
    Data dictionary ■ Desired outcomes ■ Project summary so far
11.2 Building a proof of concept
    Preparing to build a proof of concept ■ Using streamlit to build a proof of concept ■ Project outcomes and next steps
11.3 Closing thoughts on the rapid prototyping of ideas
    Skills for rapid prototyping for any project

12 Iterating on someone else’s work: Data preparation
12.1 Finding similar entities
12.2 Continuing someone else’s work
12.3 Project 8: Finding customer segments from mobile activity
    Problem statement ■ Data dictionary ■ Desired outcomes ■ Required tools
12.4 Applying the results-driven method to creating the second iteration of a customer segmentation
12.5 An example solution: Creating customer segments
    Recreating someone else’s analysis ■ Analyzing event data to learn about customer behavior ■ Project progress so far

13 Iterating on someone else’s work: Customer segmentation
13.1 Project 8 revisited: Finding customer segments from mobile activity
    Data dictionary ■ Desired outcomes ■ Project summary so far ■ Segmentation of mobile users using clustering ■ Conclusions and next steps
13.2 Closing thoughts: Segmentation and clustering
    Skills learned to use for any project

appendix Python installation instructions
index
foreword

In the modern world, data is everywhere. Applications run by governments and companies collect data about the world and about our actions. We walk around with smartphones, which constantly collect data about our movements, purchases, and preferences, and then share that information with a wide variety of companies.

The good news is that this data makes it easier than ever to ask interesting questions about the world, ourselves, and our customers, and to get coherent answers. The bad news is that you need to find the data that will allow you to solve the problem, which isn’t trivial. You then need to clean that data and modify it to suit your purposes. Only when you have wrestled the data into submission can you finally start to perform analysis. And then, when you finally have answered your questions, you have to decide how to present your analysis to others.

In other words, analyzing data involves much more than just analyzing it. Most of your time in a data project will be spent searching, retrieving, cleaning, editing, and producing reports. Each of these steps, in and of itself, can be quite frustrating, and they require practice and understanding. But for a beginner, it’s worse than that because it’s not clear where to start. Even if you have lots of experience with Python and pandas, that doesn’t mean you know how to solve problems—much as knowing how to use a hammer and screwdriver doesn’t necessarily make you qualified to take on a carpentry project.

That’s where this book, The Well-Grounded Data Analyst, comes in. David Asboth gives you a clear set of steps to follow when you want to solve a problem. Follow these steps, and you’ll know what questions to ask at each stage of a project, what inputs you’ll need, and what outputs you’ll be creating. For example, David tells you to start a project by understanding it, and then by starting from the end—that is, thinking about what answers you’re trying to find and who those answers are meant for. These might seem obvious to someone experienced in data analysis problems, but as I’ve repeatedly seen in my pandas courses, it’s all too easy to forget or ignore them.

This book goes far beyond just laying out the steps: it walks you through a number of projects, each of which presents its own obstacles and problems that you’ll need to overcome. David guides you through the solutions, not only explaining how to solve them, but what pitfalls you might encounter and what tradeoffs are involved with different approaches.

Every project in the book uses real-world data, which is always problematic and dirty. As you work through the examples, you’ll learn how to handle such problems—including how to decide what’s worth keeping, as opposed to throwing away.

No analysis project ever truly ends. It’s thus appropriate that the final step of David’s strategy is iterate. Once you’ve gotten an answer to your question, don’t spend too much time congratulating yourself. Instead, see where the flaws are in what you just did, and see if you can do even better.

If you feel nervous when starting to solve an actual data analysis problem, even (or especially) after learning the basics of Python and pandas, then this book should boost your confidence. That confidence will go a long way toward helping you analyze your own problems and projects.

— REUVEN M LERNER
Owner, LernerPython.com
Author, Python Workout and Pandas Workout
preface

When I graduated from data education and started working as a data scientist, I was shocked at how different the job was from what I expected based on my studies. Data was harder to come by than I imagined. There weren’t clean datasets just sitting around waiting for me to analyze them. When I did get my hands on some data, it was undocumented and full of problems. I soon found out I wasn’t the only one who had this experience, so when I started teaching data science alongside my day job, I wanted to bridge this gap between the classroom and the real world.

I’ve been teaching data topics for a number of years now, and the single most frequently asked question I get after a course is, what should I learn next? Based on my own experiences, I usually give a standard answer: solve real problems and learn by doing. I’ve given this answer so many times now that I wanted to write it down somewhere. This book is my extended answer.

To improve as an analyst, you need two things: get better at the process of analyzing data, regardless of what tools you use, and be immersed in a business environment where your work directly affects your surroundings. The market is saturated with introductory material. “Introduction to data science/analysis” books and courses are everywhere. What has always struck me was the lack of follow-up resources. What about intermediate or advanced data science? Everything that’s out there is exclusively about tools and algorithms. That’s great, but there is more to data science than the technical details. In fact, the technical details change, whereas the job of an analyst fundamentally does not.

Analysts need to be problem solvers. People have questions that can be answered with data; analysts answer them and, in doing so, have to solve technical and organizational problems. At a high level, problem solving is the skill that analysts need to hone most after initial training. The best way to do that is to actually solve problems.

But what problems? And with what projects? With this book, I want aspiring analysts to continue learning, while creating a portfolio of projects to show off their advanced skills. When choosing the projects, I was careful to focus on topics that don’t normally make it into an introductory syllabus but come up often in the real world. I also chose real-world datasets and made little or no modifications so the problems would be as true to life as possible.

I hope you enjoy solving these problems as much as I enjoyed creating them.
acknowledgments

This book took approximately twice as long to write as I thought it would, and it would not exist without the help of many individuals. Most importantly, I thank my wife, Barbara, for supporting me from the moment I decided on a whim to write a book. I’m sorry in advance if I decide to write any more.

I’m grateful to everyone at Manning who made this book a reality. First, I thank Mike Stephens for immediately believing in the book and always making it better, mostly by making me remove words that didn’t need to be there. A huge thanks goes to my development editor, Sarah G Harter, for her continued support and encouragement. I’m sorry there was no precedent in the Manning guides for all the things I decided to do in the book. And I thank everyone else at Manning who was involved in the production and promotion of the book.

Thanks also go to all the reviewers for taking the time to provide feedback on the book. I took everything on board, and the book is better for all your suggestions: Amarjit Bhandal, Alexander Klyanchin, Amílcar de Abreu Netto, Andrew R Freed, Arun Lakhera, Bijith Komalan, Carlos Aya-Moreno, Carlos Pavia, Deborah Mesquita, Dirk Gomez, Ed Lo, Esref Durna, Gaël Penessot, George E Carter, Giampiero Granatella, Gregorio Piccoli, Hilde Van Gysel, Igor Vieira, James Nyika, Johnny Hopkins, Juan Delgado, Louis Luangkesorn, Maxim Volgin, Murugan Lakshmanan, Nick Radcliffe, Oliver Korten, Randy Au, Rene Perrin, Rui Liu, Sriram Macharla, Sumit Bhattacharyya, Walter Alexander Mata López, and Weronika Burman.

Thank you, my technical editor, Brent J Broadnax, for your early contributions as the manuscript took shape. Brent graduated with an MBA in Marketing and Information Systems and currently works as a data engineer for a wide range of clients across telecommunications, home services, and finance industries.

Thank you, Andrew Freed, technical proofreader, for your dedication and meticulous attention to detail. Special thanks go to Reuven Lerner for your thoughts and for writing the foreword; as an admirer of your work, I’m truly honored.

Thanks also are due Shaun McGirr for hiring me for my first data role. Everything I learned about the reality of data science was through our work together. We’re probably overdue for recording some more podcast episodes.

Finally, I thank you, the reader. I hope you find the book valuable, and I’d be interested to hear how you solved its problems in your own unique way.