HBase: The Definitive Guide, Second Edition

Author: Lars George


If you’re looking for a scalable storage solution to accommodate a virtually endless amount of data, this updated edition shows you how Apache HBase can meet your needs. Modeled after Google’s Bigtable architecture, HBase scales to billions of rows and millions of columns, while ensuring that write and read performance remain constant. Fully revised for HBase 1.0, this second edition brings you up to speed on the new HBase client API, as well as security features and new case studies that demonstrate HBase use in the real world. Whether you have just started to evaluate this non-relational database or plan to put it into practice right away, this book has your back.

- Launch into basic, advanced, and administrative features of HBase’s new client-facing API
- Use new classes to integrate HBase with Hadoop’s MapReduce framework
- Explore HBase’s architecture, including the storage format, write-ahead log, and background processes
- Dive into advanced usage, such as extended client and server options
- Learn cluster sizing, tuning, and monitoring best practices
- Design schemas, copy tables, import bulk data, decommission nodes, and perform other tasks
- Go deeper into HBase security, including Kerberos and encryption at rest
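To give a flavor of the 1.0-era client API the book covers, here is a minimal, hypothetical sketch of a put followed by a get. The table name, column family, and values are made up for illustration, and the code assumes a running HBase cluster plus the `hbase-client` library on the classpath:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class GetPutExample {
  public static void main(String[] args) throws IOException {
    // Reads hbase-site.xml from the classpath for cluster settings.
    Configuration conf = HBaseConfiguration.create();
    // Connection is the heavyweight, shareable entry point introduced
    // with the 1.0 client API, replacing the old HTable construction.
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("testtable"))) {
      // Store one cell: row "row1", family "cf", qualifier "q", value "value1".
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"),
          Bytes.toBytes("value1"));
      table.put(put);

      // Read the cell back.
      Get get = new Get(Bytes.toBytes("row1"));
      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"));
      System.out.println("Value: " + Bytes.toString(value));
    }
  }
}
```

Chapter 3 walks through these building blocks (Connection, Table, Put, Get, and the Bytes utility class) in detail.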



SECOND EDITION
HBase: The Definitive Guide
Lars George
ISBN: 063-6-920-03394-3 [?]

HBase: The Definitive Guide, Second Edition
by Lars George

Copyright © 2010 Lars George. All rights reserved.
Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or <corporate@oreilly.com>.

Editor: Ann Spencer
Production Editor: FIX ME!
Copyeditor: FIX ME!
Proofreader: FIX ME!
Indexer: FIX ME!
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Rebecca Demarest

January -4712: Second Edition

Revision History for the Second Edition:
2015-04-10: Early release revision 1
2015-07-07: Early release revision

See http://oreilly.com/catalog/errata.csp?isbn=0636920033943 for release details.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. !!FILL THIS IN!! and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
Table of Contents

Foreword: Michael Stack  ix
Foreword: Carter Page  xiii
Preface  xvii

1. Introduction  1
   The Dawn of Big Data  1
   The Problem with Relational Database Systems  7
   Nonrelational Database Systems, Not-Only SQL or NoSQL?  10
   Dimensions  13
   Scalability  15
   Database (De-)Normalization  16
   Building Blocks  19
   Backdrop  19
   Namespaces, Tables, Rows, Columns, and Cells  21
   Auto-Sharding  26
   Storage API  28
   Implementation  29
   Summary  33
   HBase: The Hadoop Database  34
   History  34
   Nomenclature  37
   Summary  37

2. Installation  39
   Quick-Start Guide  39
   Requirements  43
   Hardware  43
   Software  51
   Filesystems for HBase  67
   Local  69
   HDFS  70
   S3  70
   Other Filesystems  72
   Installation Choices  73
   Apache Binary Release  73
   Building from Source  76
   Run Modes  79
   Standalone Mode  79
   Distributed Mode  79
   Configuration  85
   hbase-site.xml and hbase-default.xml  87
   hbase-env.sh and hbase-env.cmd  88
   regionserver  88
   log4j.properties  89
   Example Configuration  89
   Client Configuration  91
   Deployment  92
   Script-Based  92
   Apache Whirr  94
   Puppet and Chef  94
   Operating a Cluster  95
   Running and Confirming Your Installation  95
   Web-based UI Introduction  96
   Shell Introduction  98
   Stopping the Cluster  99

3. Client API: The Basics  101
   General Notes  101
   Data Types and Hierarchy  103
   Generic Attributes  104
   Operations: Fingerprint and ID  104
   Query versus Mutation  106
   Durability, Consistency, and Isolation  108
   The Cell  112
   API Building Blocks  117
   CRUD Operations  122
   Put Method  122
   Get Method  146
   Delete Method  168
   Append Method  181
   Mutate Method  184
   Batch Operations  187
   Scans  193
   Introduction  193
   The ResultScanner Class  199
   Scanner Caching  203
   Scanner Batching  206
   Slicing Rows  210
   Load Column Families on Demand  213
   Scanner Metrics  214
   Miscellaneous Features  215
   The Table Utility Methods  215
   The Bytes Class  216

4. Client API: Advanced Features  219
   Filters  219
   Introduction to Filters  219
   Comparison Filters  223
   Dedicated Filters  232
   Decorating Filters  252
   FilterList  256
   Custom Filters  259
   Filter Parser Utility  269
   Filters Summary  272
   Counters  273
   Introduction to Counters  274
   Single Counters  277
   Multiple Counters  278
   Coprocessors  282
   Introduction to Coprocessors  282
   The Coprocessor Class Trinity  285
   Coprocessor Loading  289
   Endpoints  298
   Observers  311
   The ObserverContext Class  312
   The RegionObserver Class  314
   The MasterObserver Class  334
   The RegionServerObserver Class  340
   The WALObserver Class  342
   The BulkLoadObserver Class  344
   The EndPointObserver Class  344

5. Client API: Administrative Features  347
   Schema Definition  347
   Namespaces  347
   Tables  350
   Table Properties  358
   Column Families  362
   HBaseAdmin  375
   Basic Operations  375
   Namespace Operations  376
   Table Operations  378
   Schema Operations  391
   Cluster Operations  393
   Cluster Status Information  411
   ReplicationAdmin  422

6. Available Clients  427
   Introduction  427
   Gateways  427
   Frameworks  431
   Gateway Clients  432
   Native Java  432
   REST  433
   Thrift  444
   Thrift2  458
   SQL over NoSQL  459
   Framework Clients  460
   MapReduce  460
   Hive  460
   Mapping Existing Tables  469
   Mapping Existing Table Snapshots  473
   Pig  474
   Cascading  479
   Other Clients  480
   Shell  481
   Basics  481
   Commands  484
   Scripting  497
   Web-based UI  503
   Master UI Status Page  504
   Master UI Related Pages  521
   Region Server UI Status Page  532
   Shared Pages  551

7. Hadoop Integration  559
   Framework  559
   MapReduce Introduction  560
   Processing Classes  562
   Supporting Classes  575
   MapReduce Locality  581
   Table Splits  583
   MapReduce over Tables  586
   Preparation  586
   Table as a Data Sink  603
   Table as a Data Source  610
   Table as both Data Source and Sink  614
   Custom Processing  617
   MapReduce over Snapshots  620
   Bulk Loading Data  627

A. Upgrade from Previous Releases  633
Foreword: Michael Stack

The HBase story begins in 2006, when the San Francisco-based startup Powerset was trying to build a natural language search engine for the Web. Their indexing pipeline was an involved multistep process that produced an index about two orders of magnitude larger, on average, than your standard term-based index. The datastore that they’d built on top of the then nascent Amazon Web Services to hold the index intermediaries and the webcrawl was buckling under the load (Ring. Ring. “Hello! This is AWS. Whatever you are running, please turn it off!”). They were looking for an alternative. The Google Bigtable paper[1] had just been published. Chad Walters, Powerset’s head of engineering at the time, reflects back on the experience as follows:

    Building an open source system to run on top of Hadoop’s Distributed Filesystem (HDFS) in much the same way that Bigtable ran on top of the Google File System seemed like a good approach because: 1) it was a proven scalable architecture; 2) we could leverage existing work on Hadoop’s HDFS; and 3) we could both contribute to and get additional leverage from the growing Hadoop ecosystem.

After the publication of the Google Bigtable paper, there were on-again, off-again discussions around what a Bigtable-like system on top of Hadoop might look like. Then, in early 2007, out of the blue, Mike Cafarella dropped a tarball of thirty-odd Java files into the Hadoop issue tracker: “I’ve written some code for HBase, a Bigtable-like file store. It’s not perfect, but it’s ready for other people to play with and examine.” Mike had been working with Doug Cutting on Nutch, an open source search engine. He’d done similar drive-by code dumps there to add features such as a Google File System clone so the Nutch indexing process was not bounded by the amount of disk you attach to a single machine. (This Nutch distributed filesystem would later grow up to be HDFS.)

Jim Kellerman of Powerset took Mike’s dump and started filling in the gaps, adding tests and getting it into shape so that it could be committed as part of Hadoop. The first commit of the HBase code was made by Doug Cutting on April 3, 2007, under the contrib subdirectory. The first HBase “working” release was bundled as part of Hadoop 0.15.0 in October 2007.

Not long after, Lars, the author of the book you are now reading, showed up on the #hbase IRC channel. He had a big-data problem of his own, and was game to try HBase. After some back and forth, Lars became one of the first users to run HBase in production outside of the Powerset home base. Through many ups and downs, Lars stuck around. I distinctly remember a directory listing Lars made for me a while back on his production cluster at WorldLingo, where he was employed as CTO, sysadmin, and grunt. The listing showed ten or so HBase releases from Hadoop 0.15.1 (November 2007) on up through HBase 0.20, each of which he’d run on his 40-node cluster at one time or another during production.

Of all those who have contributed to HBase over the years, it is poetic justice that Lars is the one to write this book. Lars was always dogging HBase contributors that the documentation needed to be better if we hoped to gain broader adoption. Everyone agreed, nodded their heads in assent, amen’d, and went back to coding. So Lars started writing critical how-to’s and architectural descriptions in between jobs and his intra-European travels as unofficial HBase European ambassador. His Lineland blogs on HBase gave the best description, outside of the source, of how HBase worked, and at a few critical junctures, carried the community across awkward transitions (e.g., an important blog post explained the labyrinthine HBase build during the brief period we thought an Ivy-based build to be a “good idea”). His luscious diagrams were poached by one and all wherever an HBase presentation was given.

HBase has seen some interesting times, including a period of sponsorship by Microsoft, of all things. Powerset was acquired in July 2008, and after a couple of months during which Powerset employees were disallowed from contributing while Microsoft’s legal department vetted the HBase codebase to see if it impinged on SQL Server patents, we were allowed to resume contributing (I was a Microsoft employee working near full time on an Apache open source project). The times ahead look promising, too, whether it’s the variety of contortions HBase is being put through at Facebook—as the underpinnings for their massive Facebook mail app or fielding millions of hits a second on their analytics clusters—or more deploys along the lines of Yahoo!’s 1k-node HBase cluster used to host their snapshot of Microsoft’s Bing crawl. Other developments include HBase running on filesystems other than Apache HDFS, such as MapR.

But it is plain to me that none of these developments would have been possible were it not for the hard work put in by our awesome HBase community, driven by a core of HBase committers. Some members of the core have only been around a year or so—Todd Lipcon, Gary Helmling, and Nicolas Spiegelberg—and we would be lost without them, but a good portion have been there from close to project inception and have shaped HBase into the (scalable) general datastore that it is today. These include Jonathan Gray, who gambled his startup streamy.com on HBase; Andrew Purtell, who built an HBase team at Trend Micro long before such a thing was fashionable; and Ryan Rawson, who got StumbleUpon—which became the main sponsor after HBase moved on from Powerset/Microsoft—on board, and who had the sense to hire John-Daniel Cryans, now a power contributor but just a bushy-tailed student at the time. And then there is Lars, who during the bug fixes was always about documenting how it all worked. Of those of us who know HBase, there is no better man qualified to write this first, critical HBase book.

—Michael Stack
HBase Project Janitor

[1] “Bigtable: A Distributed Storage System for Structured Data” by Fay Chang et al.
Foreword: Carter Page

In late 2003, Google had a problem: We were continually building our web index from scratch, and each iteration was taking an entire month, even with all the parallelization we had at our disposal. What’s more, the web was growing geometrically, and we were expanding into many new product areas, some of which were personalized. We had a filesystem, called GFS, which could scale to these sizes, but it lacked the ability to update records in place, or to insert or delete new records in sequence. It was clear that Google needed to build a new database.

There were only a few people in the world who knew how to solve a database design problem at this scale, and fortunately, several of them worked at Google. On November 4, 2003, Jeff Dean and Sanjay Ghemawat committed the first five source code files of what was to become Bigtable. Joined by seven other engineers in Mountain View and New York City, they built the first version, which went live in 2004. To this day, the biggest applications at Google rely on Bigtable: Gmail, search, Google Analytics, and hundreds of other applications. A Bigtable cluster can hold many hundreds of petabytes and serve over a terabyte of data each second. Even so, we’re still working each year to push the limits of its scalability.

The book you have in your hands, or on your screen, will tell you all about how to use and operate HBase, the open source re-creation of Bigtable. I’m in the unusual position of knowing the deep internals of both systems; and the engineers who, in 2006, set out to build an open source version of Bigtable created something very close in design and behavior.

My first experience with HBase came after I had been with the Bigtable engineering team in New York City. Out of curiosity, I attended an HBase meetup in Facebook’s offices near Grand Central Terminal. There I listened to three engineers describe work they had done in what turned out to be a mirror world of the one I was familiar with. It was an uncanny moment for me. Before long we broke out into sessions, and I found myself giving tips to strangers on schema design in this product that I had never used in my life. I didn’t tell anyone I was from Google, and no one asked (until later at a bar), but I think some of them found it odd when I slipped and mentioned “tablets” and “merge compactions”—alien nomenclature for what HBase refers to as “regions” and “minor compactions”.

One of the surprises at that meetup came when a Facebook engineer presented a new feature that enables a client to read snapshot data directly from the filesystem, bypassing the region server. We had coincidentally developed the exact same functionality internally on Bigtable, calling it Offline Access. I looked into HBase’s history a little more and realized that many of its features were developed in parallel with similar features in Bigtable: replication, coprocessors, multi-tenancy, and most recently, some dabbling in multiple write-ahead logs. That these two development paths have been so symmetric is a testament to both the logical cogency of the original architecture and the ingenuity of the HBase contributors in solving the same problems we encountered at Google.

Over the past year and a half of following HBase and its community, I have consistently observed certain characteristics about its culture. The individual developers love the academic challenge of building distributed systems. They come from different companies, with often competing interests, but they always put the technology first. They show a respect for each other, and a sense of responsibility to build a quality product for others to rely upon. In my shop, we call that “being Googley.” Culture is critical to success at Google, and it comes as little surprise that a similar culture binds the otherwise disparate group of engineers that built HBase.

I’ll share one last realization I had about HBase about a year after that first meetup, at a Big Data conference. In the Jacob Javits Convention Center on the west side of Manhattan, I saw presentation after presentation by organizations that had built data processing infrastructures that scaled to insane levels. One had built its infrastructure on Hadoop, another on Storm and Kafka, and another using the darling of that conference, Spark. But there was one consistent factor, no matter which data processing framework had been used or what problem was being solved. Every brain-explodingly large system that needed a real database was built on HBase. The biggest timeseries architectures? HBase. Massive geo data analytics? HBase. The UIDAI in India, which stores biometrics for more than 600 million people? What else but HBase. Presenters were saying, “I built a system that scaled to petabytes and millions of operations per second!” and I was struck by just how much HBase and its amazing ecosystem and contributors had enabled these applications.

Dozens of the biggest technology companies have adopted HBase as the database of choice for truly big data. Facebook moved its messaging system to HBase to handle billions of messages per day. Bloomberg uses HBase to serve mission-critical market data to hundreds of thousands of traders around the world. And Apple uses HBase to store the hundreds of terabytes of voice recognition data that power Siri.

And you may wonder, what are the eventual limits? From my time on the Bigtable team, I’ve seen that while the data keeps getting bigger, we’re a long way from running out of room to scale. We’ve had to reduce contention on our master server and our distributed lock server, but theoretically, we don’t see why a single cluster couldn’t hold many exabytes of data. To put it simply, there’s a lot of room to grow. We’ll keep finding new applications for this technology for years to come, just as the HBase community will continue to find extraordinary new ways to put this architecture to work.

—Carter Page
Engineering Manager, Bigtable Team, Google
Preface

You may be reading this book for many reasons. It could be because you heard all about Hadoop and what it can do to crunch petabytes of data in a reasonable amount of time. While reading into Hadoop you found that, for random access to the accumulated data, there is something called HBase. Or it was the hype that is prevalent these days addressing a new kind of data storage architecture, one that strives to solve large-scale data problems where traditional solutions may be either too involved or cost-prohibitive. A common term used in this area is NoSQL.

No matter how you have arrived here, I presume you want to know and learn—like I did not too long ago—how you can use HBase in your company or organization to store a virtually endless amount of data. You may have a background in relational database theory, or you want to start fresh and this “column-oriented thing” is something that seems to fit your bill. You also heard that HBase can scale without much effort, and that alone is reason enough to look at it since you are building the next web-scale system. And did I mention it is free, like Hadoop?

I was at that point in late 2007, when I was facing the task of storing millions of documents in a system that needed to be fault-tolerant and scalable while still being maintainable by just me. I had decent skills in managing a MySQL database system, and was using the database to store data that would ultimately be served to our website users. This database was running on a single server, with another as a backup. The issue was that it would not be able to hold the amount of data I needed to store for this new project. I would have to either invest in serious RDBMS scalability skills, or find something else instead.

Obviously, I took the latter route, and since my mantra always was (and still is) “How does someone like Google do it?” I came across Hadoop. After a few attempts to use Hadoop, and more specifically HDFS, directly, I was faced with implementing a random access layer on top of it—but that problem had been solved already: in 2006, Google had published a paper titled “Bigtable”[1] and the Hadoop developers had an open source implementation of it called HBase (the Hadoop Database). That was the answer to all my problems. Or so it seemed…

These days, I try not to think about how difficult my first experience with Hadoop and HBase was. Looking back, I realize that I would have wished for this particular project to start today. HBase is now mature, has completed a 1.0 release, and is used by many high-profile companies, such as Facebook, Apple, eBay, Adobe, Yahoo!, Xiaomi, Trend Micro, Bloomberg, Nielsen, and Salesforce.com (see http://wiki.apache.org/hadoop/Hbase/PoweredBy for a longer, though not complete, list). Mine was one of the very first clusters in production, and my use case triggered a few very interesting issues (let me refrain from saying more). But that was to be expected, betting on a 0.1x version of a community project. And I had the opportunity over the years to contribute back and stay close to the development team, so that eventually I was humbled by being asked to become a full-time committer as well.

I learned a lot over the past few years from my fellow HBase developers and am still learning more every day. My belief is that we are nowhere near the peak of this technology, and it will evolve further over the years to come. Let me pay my respect to the entire HBase community with this book, which strives to cover not just the internal workings of HBase or how to get it going, but more specifically, how to apply it to your use case. In fact, I strongly assume that this is why you are here right now. You want to learn how HBase can solve your problem. Let me help you try to figure this out.

General Information

Before we get started, a few general notes. More information about the code examples and Hush, a complete HBase application used throughout the book, can be found in (to come).

[1] See the Bigtable paper for reference.
HBase Version

This book covers the 1.0.0 release of HBase. This in itself is a major milestone for the project: HBase has matured over the years and is now ready to fall into a proper release cycle. In the past the developers were free to decide the versioning, and indeed changed it a few times. More can be read about this throughout the book, but suffice it to say that this should not happen again. (to come) sheds more light on the future of HBase, while “History” (page 34) shows the past.

Moreover, there is now a system in place that annotates all external-facing APIs with an audience and stability level. In this book we only deal with these classes, and specifically with those that are marked public. You can read about the entire set of annotations in (to come).

The code for HBase can be found in a few official places, for example the Apache archive (http://s.apache.org/hbase-1.0.0-archive), which has the release files as binary and source tarballs (aka compressed file archives). There is also the source repository (http://s.apache.org/hbase-1.0.0-apache) and a mirror on the popular GitHub site (https://github.com/apache/hbase/tree/1.0.0). Chapter 2 has more on how to select the right source and start from there.

Since this book was printed there may have been important updates, so please check the book’s website at http://www.hbasebook.com in case something does not seem right and you want to verify what is going on. I will update the website as I get feedback from readers and as time moves on.

What is in this Book?

The book is organized in larger chapters, where Chapter 1 starts off with an overview of the origins of HBase. Chapter 2 explains the intricacies of spinning up an HBase cluster. Chapter 3, Chapter 4, and Chapter 5 explain all the user-facing interfaces exposed by HBase, continued by Chapter 6 and Chapter 7, both showing additional ways to access data stored in a cluster and—though limited here—how to administrate it.

The second half of the book takes you deeper into the topics, with (to come) explaining how everything works under the hood (with some particular deep details moved into appendixes). [Link to Come] explains the essential need of designing data schemas correctly to get the most out of HBase and introduces you to key design.
