Statistics: 33 views · 0 downloads · 0 donations
Uploader: 高宏飞 (shared 2025-12-18)
Authors: Nathan Marz, James Warren

Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data. It describes a scalable, easy-to-understand approach to big data systems that can be built and run by a small team. Following a realistic example, this book guides readers through the theory of big data systems, how to implement them in practice, and how to deploy and operate them once they're built. Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

About the Book
Web-scale applications like social networks, real-time analytics, or e-commerce sites deal with a lot of data, whose volume and velocity exceed the limits of traditional database systems. These applications require architectures built around clusters of machines to store and process data of any size or speed. Fortunately, scale and simplicity are not mutually exclusive. Big Data teaches you to build big data systems using an architecture designed specifically to capture and analyze web-scale data. This book presents the Lambda Architecture, a scalable, easy-to-understand approach that can be built and run by a small team. You'll explore the theory of big data systems and how to implement them in practice. In addition to discovering a general framework for processing big data, you'll learn specific technologies like Hadoop, Storm, and NoSQL databases. This book requires no previous exposure to large-scale data analysis or NoSQL tools. Familiarity with traditional databases is helpful.

What's Inside
■ Introduction to big data systems
■ Real-time processing of web-scale data
■ Tools like Hadoop, Cassandra, and Storm
■ Extensions to traditional database skills

About the Authors
Nathan Marz is the creator of Apache Storm and the originator of the Lambda Architecture for big data systems. James Warren is an analytics architect with a background in machine learning.

Tags: none
ISBN: 1617290343
Publisher: Manning Publications
Publish Year: 2015
Language: English
Pages: 328
File Format: PDF
File Size: 7.4 MB
Support Statistics: ¥0.00 · 0 times
Text Preview (First 20 pages)

Big Data
Principles and Best Practices of Scalable Real-Time Data Systems
Nathan Marz with James Warren
Manning Publications, Shelter Island
Licensed to Mark Watson <nordickan@gmail.com>
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact: Special Sales Department, Manning Publications Co., 20 Baldwin Road, PO Box 761, Shelter Island, NY 11964. Email: orders@manning.com

©2015 by Manning Publications Co. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning's policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Development editors: Renae Gregoire, Jennifer Stout
Technical development editor: Jerry Gaines
Copyeditor: Andy Carroll
Proofreader: Katie Tennant
Technical proofreader: Jerry Kuch
Typesetter: Gordan Salinovic
Cover designer: Marija Tudor

ISBN 9781617290343
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – EBM – 20 19 18 17 16 15
brief contents

1 A new paradigm for Big Data

PART 1 BATCH LAYER
2 Data model for Big Data
3 Data model for Big Data: Illustration
4 Data storage on the batch layer
5 Data storage on the batch layer: Illustration
6 Batch layer
7 Batch layer: Illustration
8 An example batch layer: Architecture and algorithms
9 An example batch layer: Implementation

PART 2 SERVING LAYER
10 Serving layer
11 Serving layer: Illustration

PART 3 SPEED LAYER
12 Realtime views
13 Realtime views: Illustration
14 Queuing and stream processing
15 Queuing and stream processing: Illustration
16 Micro-batch stream processing
17 Micro-batch stream processing: Illustration
18 Lambda Architecture in depth
contents

preface
acknowledgments
about this book

1 A new paradigm for Big Data
  1.1 How this book is structured
  1.2 Scaling with a traditional database
      Scaling with a queue · Scaling by sharding the database · Fault-tolerance issues begin · Corruption issues · What went wrong? · How will Big Data techniques help?
  1.3 NoSQL is not a panacea
  1.4 First principles
  1.5 Desired properties of a Big Data system
      Robustness and fault tolerance · Low latency reads and updates · Scalability · Generalization · Extensibility · Ad hoc queries · Minimal maintenance · Debuggability
  1.6 The problems with fully incremental architectures
      Operational complexity · Extreme complexity of achieving eventual consistency · Lack of human-fault tolerance · Fully incremental solution vs. Lambda Architecture solution
  1.7 Lambda Architecture
      Batch layer · Serving layer · Batch and serving layers satisfy almost all properties · Speed layer
  1.8 Recent trends in technology
      CPUs aren't getting faster · Elastic clouds · Vibrant open source ecosystem for Big Data
  1.9 Example application: SuperWebAnalytics.com
  1.10 Summary

PART 1 BATCH LAYER

2 Data model for Big Data
  2.1 The properties of data
      Data is raw · Data is immutable · Data is eternally true
  2.2 The fact-based model for representing data
      Example facts and their properties · Benefits of the fact-based model
  2.3 Graph schemas
      Elements of a graph schema · The need for an enforceable schema
  2.4 A complete data model for SuperWebAnalytics.com
  2.5 Summary

3 Data model for Big Data: Illustration
  3.1 Why a serialization framework?
  3.2 Apache Thrift
      Nodes · Edges · Properties · Tying everything together into data objects · Evolving your schema
  3.3 Limitations of serialization frameworks
  3.4 Summary

4 Data storage on the batch layer
  4.1 Storage requirements for the master dataset
  4.2 Choosing a storage solution for the batch layer
      Using a key/value store for the master dataset · Distributed filesystems
  4.3 How distributed filesystems work
  4.4 Storing a master dataset with a distributed filesystem
  4.5 Vertical partitioning
  4.6 Low-level nature of distributed filesystems
  4.7 Storing the SuperWebAnalytics.com master dataset on a distributed filesystem
  4.8 Summary

5 Data storage on the batch layer: Illustration
  5.1 Using the Hadoop Distributed File System
      The small-files problem · Towards a higher-level abstraction
  5.2 Data storage in the batch layer with Pail
      Basic Pail operations · Serializing objects into pails · Batch operations using Pail · Vertical partitioning with Pail · Pail file formats and compression · Summarizing the benefits of Pail
  5.3 Storing the master dataset for SuperWebAnalytics.com
      A structured pail for Thrift objects · A basic pail for SuperWebAnalytics.com · A split pail to vertically partition the dataset
  5.4 Summary

6 Batch layer
  6.1 Motivating examples
      Number of pageviews over time · Gender inference · Influence score
  6.2 Computing on the batch layer
  6.3 Recomputation algorithms vs. incremental algorithms
      Performance · Human-fault tolerance · Generality of the algorithms · Choosing a style of algorithm
  6.4 Scalability in the batch layer
  6.5 MapReduce: a paradigm for Big Data computing
      Scalability · Fault-tolerance · Generality of MapReduce
  6.6 Low-level nature of MapReduce
      Multistep computations are unnatural · Joins are very complicated to implement manually · Logical and physical execution tightly coupled
  6.7 Pipe diagrams: a higher-level way of thinking about batch computation
      Concepts of pipe diagrams · Executing pipe diagrams via MapReduce · Combiner aggregators · Pipe diagram examples
  6.8 Summary

7 Batch layer: Illustration
  7.1 An illustrative example
  7.2 Common pitfalls of data-processing tools
      Custom languages · Poorly composable abstractions
  7.3 An introduction to JCascalog
      The JCascalog data model · The structure of a JCascalog query · Querying multiple datasets · Grouping and aggregators · Stepping through an example query · Custom predicate operations
  7.4 Composition
      Combining subqueries · Dynamically created subqueries · Predicate macros · Dynamically created predicate macros
  7.5 Summary

8 An example batch layer: Architecture and algorithms
  8.1 Design of the SuperWebAnalytics.com batch layer
      Supported queries · Batch views
  8.2 Workflow overview
  8.3 Ingesting new data
  8.4 URL normalization
  8.5 User-identifier normalization
  8.6 Deduplicate pageviews
  8.7 Computing batch views
      Pageviews over time · Unique visitors over time · Bounce-rate analysis
  8.8 Summary

9 An example batch layer: Implementation
  9.1 Starting point
  9.2 Preparing the workflow
  9.3 Ingesting new data
  9.4 URL normalization
  9.5 User-identifier normalization
  9.6 Deduplicate pageviews
  9.7 Computing batch views
      Pageviews over time · Uniques over time · Bounce-rate analysis
  9.8 Summary

PART 2 SERVING LAYER

10 Serving layer
  10.1 Performance metrics for the serving layer
  10.2 The serving layer solution to the normalization/denormalization problem
  10.3 Requirements for a serving layer database
  10.4 Designing a serving layer for SuperWebAnalytics.com
      Pageviews over time · Uniques over time · Bounce-rate analysis
  10.5 Contrasting with a fully incremental solution
      Fully incremental solution to uniques over time · Comparing to the Lambda Architecture solution
  10.6 Summary

11 Serving layer: Illustration
  11.1 Basics of ElephantDB
      View creation in ElephantDB · View serving in ElephantDB · Using ElephantDB
  11.2 Building the serving layer for SuperWebAnalytics.com
      Pageviews over time · Uniques over time · Bounce-rate analysis
  11.3 Summary

PART 3 SPEED LAYER

12 Realtime views
  12.1 Computing realtime views
  12.2 Storing realtime views
      Eventual accuracy · Amount of state stored in the speed layer
  12.3 Challenges of incremental computation
      Validity of the CAP theorem · The complex interaction between the CAP theorem and incremental algorithms
  12.4 Asynchronous versus synchronous updates
  12.5 Expiring realtime views
  12.6 Summary

13 Realtime views: Illustration
  13.1 Cassandra's data model
  13.2 Using Cassandra
      Advanced Cassandra
  13.3 Summary

14 Queuing and stream processing
  14.1 Queuing
      Single-consumer queue servers · Multi-consumer queues
  14.2 Stream processing
      Queues and workers · Queues-and-workers pitfalls
  14.3 Higher-level, one-at-a-time stream processing
      Storm model · Guaranteeing message processing
  14.4 SuperWebAnalytics.com speed layer
      Topology structure
  14.5 Summary

15 Queuing and stream processing: Illustration
  15.1 Defining topologies with Apache Storm
  15.2 Apache Storm clusters and deployment
  15.3 Guaranteeing message processing
  15.4 Implementing the SuperWebAnalytics.com uniques-over-time speed layer
  15.5 Summary

16 Micro-batch stream processing
  16.1 Achieving exactly-once semantics
      Strongly ordered processing · Micro-batch stream processing · Micro-batch processing topologies
  16.2 Core concepts of micro-batch stream processing
  16.3 Extending pipe diagrams for micro-batch processing
  16.4 Finishing the speed layer for SuperWebAnalytics.com
      Pageviews over time · Bounce-rate analysis
  16.5 Another look at the bounce-rate-analysis example
  16.6 Summary

17 Micro-batch stream processing: Illustration
  17.1 Using Trident
  17.2 Finishing the SuperWebAnalytics.com speed layer
      Pageviews over time · Bounce-rate analysis
  17.3 Fully fault-tolerant, in-memory, micro-batch processing
  17.4 Summary

18 Lambda Architecture in depth
  18.1 Defining data systems
  18.2 Batch and serving layers
      Incremental batch processing · Measuring and optimizing batch layer resource usage
  18.3 Speed layer
  18.4 Query layer
  18.5 Summary

index
preface

When I first entered the world of Big Data, it felt like the Wild West of software development. Many were abandoning the relational database and its familiar comforts for NoSQL databases with highly restricted data models designed to scale to thousands of machines. The number of NoSQL databases, many of them with only minor differences between them, became overwhelming. A new project called Hadoop began to make waves, promising the ability to do deep analyses on huge amounts of data. Making sense of how to use these new tools was bewildering.

At the time, I was trying to handle the scaling problems we were faced with at the company at which I worked. The architecture was intimidatingly complex—a web of sharded relational databases, queues, workers, masters, and slaves. Corruption had worked its way into the databases, and special code existed in the application to handle the corruption. Slaves were always behind. I decided to explore alternative Big Data technologies to see if there was a better design for our data architecture.

One experience from my early software-engineering career deeply shaped my view of how systems should be architected. A coworker of mine had spent a few weeks collecting data from the internet onto a shared filesystem. He was waiting to collect enough data so that he could perform an analysis on it. One day while doing some routine maintenance, I accidentally deleted all of my coworker's data, setting him behind weeks on his project. I knew I had made a big mistake, but as a new software engineer I didn't know what the consequences would be. Was I going to get fired for being so careless? I sent out an email to the team apologizing profusely—and to my great surprise, everyone was very sympathetic. I'll never forget when a coworker came to my desk, patted my back, and said "Congratulations. You're now a professional software engineer."
In his joking statement lay a deep unspoken truism in software development: we don't know how to make perfect software. Bugs can and do get deployed to production. If the application can write to the database, a bug can write to the database as well. When I set about redesigning our data architecture, this experience profoundly affected me. I knew our new architecture not only had to be scalable, tolerant to machine failure, and easy to reason about—but tolerant of human mistakes as well.

My experience re-architecting that system led me down a path that caused me to question everything I thought was true about databases and data management. I came up with an architecture based on immutable data and batch computation, and I was astonished by how much simpler the new system was compared to one based solely on incremental computation. Everything became easier, including operations, evolving the system to support new features, recovering from human mistakes, and doing performance optimization. The approach was so generic that it seemed like it could be used for any data system.

Something confused me though. When I looked at the rest of the industry, I saw that hardly anyone was using similar techniques. Instead, daunting amounts of complexity were embraced in the use of architectures based on huge clusters of incrementally updated databases. So many of the complexities in those architectures were either completely avoided or greatly softened by the approach I had developed.

Over the next few years, I expanded on the approach and formalized it into what I dubbed the Lambda Architecture. When working on a startup called BackType, our team of five built a social media analytics product that provided a diverse set of realtime analytics on over 100 TB of data. Our small team also managed deployment, operations, and monitoring of the system on a cluster of hundreds of machines.
When we showed people our product, they were astonished that we were a team of only five people. They would often ask "How can so few people do so much?" My answer was simple: "It's not what we're doing, but what we're not doing." By using the Lambda Architecture, we avoided the complexities that plague traditional architectures. By avoiding those complexities, we became dramatically more productive.

The Big Data movement has only magnified the complexities that have existed in data architectures for decades. Any architecture based primarily on large databases that are updated incrementally will suffer from these complexities, causing bugs, burdensome operations, and hampered productivity. Although SQL and NoSQL databases are often painted as opposites or as duals of each other, at a fundamental level they are really the same. They encourage this same architecture with its inevitable complexities. Complexity is a vicious beast, and it will bite you regardless of whether you acknowledge it or not.

This book is the result of my desire to spread the knowledge of the Lambda Architecture and how it avoids the complexities of traditional architectures. It is the book I wish I had when I started working with Big Data. I hope you treat this book as a journey—a journey to challenge what you thought you knew about data systems, and to discover that working with Big Data can be elegant, simple, and fun.

NATHAN MARZ
acknowledgments

This book would not have been possible without the help and support of so many individuals around the world. I must start with my parents, who instilled in me from a young age a love of learning and exploring the world around me. They always encouraged me in all my career pursuits.

Likewise, my brother Iorav encouraged my intellectual interests from a young age. I still remember when he taught me Algebra while I was in elementary school. He was the one to introduce me to programming for the first time—he taught me Visual Basic as he was taking a class on it in high school. Those lessons sparked a passion for programming that led to my career.

I am enormously grateful to Michael Montano and Christopher Golda, the founders of BackType. From the moment they brought me on as their first employee, I was given an extraordinary amount of freedom to make decisions. That freedom was essential for me to explore and exploit the Lambda Architecture to its fullest. They never questioned the value of open source and allowed me to open source our technology liberally. Getting deeply involved with open source has been one of the great privileges of my life.

Many of my professors from my time as a student at Stanford deserve special thanks. Tim Roughgarden is the best teacher I've ever had—he radically improved my ability to rigorously analyze, deconstruct, and solve difficult problems. Taking as many classes as possible with him was one of the best decisions of my life. I also give thanks to Monica Lam for instilling within me an appreciation for the elegance of Datalog. Many years later I married Datalog with MapReduce to produce my first significant open source project, Cascalog.
Chris Wensel was the first one to show me that processing data at scale could be elegant and performant. His Cascading library changed the way I looked at Big Data processing.

None of my work would have been possible without the pioneers of the Big Data field. Special thanks to Jeffrey Dean and Sanjay Ghemawat for the original MapReduce paper; Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels for the original Dynamo paper; and Michael Cafarella and Doug Cutting for founding the Apache Hadoop project.

Rich Hickey has been one of my biggest inspirations during my programming career. Clojure is the best language I have ever used, and I've become a better programmer having learned it. I appreciate its practicality and focus on simplicity. Rich's philosophy on state and complexity in programming has influenced me deeply.

When I started writing this book, I was not nearly the writer I am now. Renae Gregoire, one of my development editors at Manning, deserves special thanks for helping me improve as a writer. She drilled into me the importance of using examples to lead into general concepts, and she set off many light bulbs for me on how to effectively structure technical writing. The skills she taught me apply not only to writing technical books, but to blogging, giving talks, and communication in general. For gaining an important life skill, I am forever grateful.

This book would not be nearly of the same quality without the efforts of my co-author James Warren. He did a phenomenal job absorbing the theoretical concepts and finding even better ways to present the material. Much of the clarity of the book comes from his great communication skills.

My publisher, Manning, was a pleasure to work with. They were patient with me and understood that finding the right way to write on such a big topic takes time.
Through the whole process they were supportive and helpful, and they always gave me the resources I needed to be successful. Thanks to Marjan Bace and Michael Stephens for all the support, and to all the other staff for their help and guidance along the way.

I try to learn as much as possible about writing from studying other writers. Bradford Cross, Clayton Christensen, Paul Graham, Carl Sagan, and Derek Sivers have been particularly influential.

Finally, I can't give enough thanks to the hundreds of people who reviewed, commented, and gave feedback on our book as it was being written. That feedback led us to revise, rewrite, and restructure numerous times until we found ways to present the material effectively. Special thanks to Aaron Colcord, Aaron Crow, Alex Holmes, Arun Jacob, Asif Jan, Ayon Sinha, Bill Graham, Charles Brophy, David Beckwith, Derrick Burns, Douglas Duncan, Hugo Garza, Jason Courcoux, Jonathan Esterhazy, Karl Kuntz, Kevin Martin, Leo Polovets, Mark Fisher, Massimo Ilario, Michael Fogus, Michael G. Noll, Patrick Dennis, Pedro Ferrera Bertran, Philipp Janert, Rodrigo Abreu, Rudy Bonefas, Sam Ritchie, Siva Kalagarla, Soren Macbeth, Timothy Chklovski, Walid Farid, and Zhenhua Guo.

NATHAN MARZ
I'm astounded when I consider everyone who contributed in some manner to this book. Unfortunately, I can't provide an exhaustive list, but that doesn't lessen my appreciation. Nonetheless, there are individuals to whom I wish to explicitly express my gratitude:

■ My wife, Wen-Ying Feng—for your love, encouragement and support, not only for this book but for everything we do together.
■ My parents, James and Gretta Warren—for your endless faith in me and the sacrifices you made to provide me with every opportunity.
■ My sister, Julia Warren-Ulanch—for setting a shining example so I could follow in your footsteps.
■ My professors and mentors, Ellen Toby and Sue Geller—for your willingness to answer my every question and for demonstrating the joy in sharing knowledge, not just acquiring it.
■ Chuck Lam—for saying "Hey, have you heard of this thing called Hadoop?" to me so many years ago.
■ My friends and colleagues at RockYou!, Storm8, and Bina—for the experiences we shared together and the opportunity to put theory into practice.
■ Marjan Bace, Michael Stephens, Jennifer Stout, Renae Gregoire, and the entire Manning editorial and publishing staff—for your guidance and patience in seeing this book to completion.
■ The reviewers and early readers of this book—for your comments and critiques that pushed us to clarify our words; the end result is so much better for it.

Finally, I want to convey my greatest appreciation to Nathan for inviting me to come along on this journey. I was already a great admirer of your work before joining this venture, and working with you has only deepened my respect for your ideas and philosophy. It has been an honor and a privilege.

JAMES WARREN
about this book

Services like social networks, web analytics, and intelligent e-commerce often need to manage data at a scale too big for a traditional database. Complexity increases with scale and demand, and handling Big Data is not as simple as just doubling down on your RDBMS or rolling out some trendy new technology. Fortunately, scalability and simplicity are not mutually exclusive—you just need to take a different approach. Big Data systems use many machines working in parallel to store and process data, which introduces fundamental challenges unfamiliar to most developers.

Big Data teaches you to build these systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data. It describes a scalable, easy-to-understand approach to Big Data systems that can be built and run by a small team. Following a realistic example, this book guides readers through the theory of Big Data systems and how to implement them in practice.

Big Data requires no previous exposure to large-scale data analysis or NoSQL tools. Familiarity with traditional databases is helpful, though not required. The goal of the book is to teach you how to think about data systems and how to break down difficult problems into simple solutions. We start from first principles and from those deduce the necessary properties for each component of an architecture.

Roadmap

An overview of the 18 chapters in this book follows.

Chapter 1 introduces the principles of data systems and gives an overview of the Lambda Architecture: a generalized approach to building any data system. Chapters 2 through 17 dive into all the pieces of the Lambda Architecture, with chapters alternating between theory and illustration chapters. Theory chapters demonstrate the concepts that hold true regardless of existing tools, while illustration chapters use real-world tools to demonstrate the concepts. Don't let the names fool you, though—all chapters are highly example-driven.

Chapters 2 through 9 focus on the batch layer of the Lambda Architecture. Here you will learn about modeling your master dataset, using batch processing to create arbitrary views of your data, and the trade-offs between incremental and batch processing.

Chapters 10 and 11 focus on the serving layer, which provides low latency access to the views produced by the batch layer. Here you will learn about specialized databases that are only written to in bulk. You will discover that these databases are dramatically simpler than traditional databases, giving them excellent performance, operational, and robustness properties.

Chapters 12 through 17 focus on the speed layer, which compensates for the batch layer's high latency to provide up-to-date results for all queries. Here you will learn about NoSQL databases, stream processing, and managing the complexities of incremental computation.

Chapter 18 uses your new-found knowledge to review the Lambda Architecture once more and fill in any remaining gaps. You'll learn about incremental batch processing, variants of the basic Lambda Architecture, and how to get the most out of your resources.

Code downloads and conventions

The source code for the book can be found at https://github.com/Big-Data-Manning. We have provided source code for the running example SuperWebAnalytics.com.

Much of the source code is shown in numbered listings. These listings are meant to provide complete segments of code. Some listings are annotated to help highlight or explain certain parts of the code. In other places throughout the text, code fragments are used when necessary. Courier typeface is used to denote code for Java.
In both the listings and fragments, we make use of a bold code font to help identify key parts of the code that are being explained in the text.

Author Online

Purchase of Big Data includes free access to a private web forum run by Manning Publications where you can make comments about the book, ask technical questions, and receive help from the authors and other users. To access the forum and subscribe to it, point your web browser to www.manning.com/BigData. This Author Online (AO) page provides information on how to get on the forum once you're registered, what kind of help is available, and the rules of conduct on the forum.

Manning's commitment to our readers is to provide a venue where a meaningful dialog among individual readers and between readers and the authors can take place. It's not a commitment to any specific amount of participation on the part of the authors, whose contribution to the AO forum remains voluntary (and unpaid). We suggest you try asking the authors some challenging questions, lest their interest stray!
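The layer interaction described in the roadmap (a batch layer that recomputes views from an immutable master dataset, a speed layer that incrementally covers only recent data, and queries that merge the two) can be sketched in miniature. The following Python sketch is not from the book; the book's real implementations use Hadoop, ElephantDB, Cassandra, and Storm, and all class and function names here are hypothetical stand-ins chosen for illustration.

```python
from collections import defaultdict

class BatchLayer:
    """Holds the immutable, append-only master dataset and recomputes
    batch views from scratch (simple logic, but high latency)."""
    def __init__(self):
        self.master_dataset = []  # raw pageview facts; data is only ever added

    def append(self, fact):
        self.master_dataset.append(fact)

    def compute_view(self):
        # Recompute "pageviews per URL" over ALL data every batch run
        view = defaultdict(int)
        for fact in self.master_dataset:
            view[fact["url"]] += 1
        return dict(view)

class SpeedLayer:
    """Incrementally maintains a realtime view covering only the data
    the most recent batch run has not yet absorbed."""
    def __init__(self):
        self.realtime_view = defaultdict(int)

    def update(self, fact):
        self.realtime_view[fact["url"]] += 1

def query(batch_view, realtime_view, url):
    """Serving: answer a query by merging the batch and realtime views."""
    return batch_view.get(url, 0) + realtime_view.get(url, 0)
```

The point of the sketch is the division of labor: the complex, error-prone part (incremental state) is confined to the small amount of recent data in the speed layer, while the bulk of the data flows through the simple recompute-everything path.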