Download Programming Hive PDF

Title: Programming Hive
Tags: Programming
Language: English
File Size: 7.1 MB
Total Pages: 350
Table of Contents
Preface
	Conventions Used in This Book
	Using Code Examples
	Safari® Books Online
	How to Contact Us
	What Brought Us to Hive?
		Edward Capriolo
		Dean Wampler
		Jason Rutherglen
	Acknowledgments
Chapter 1. Introduction
	An Overview of Hadoop and MapReduce
		MapReduce
	Hive in the Hadoop Ecosystem
		Pig
		HBase
		Cascading, Crunch, and Others
	Java Versus Hive: The Word Count Algorithm
	What’s Next
Chapter 2. Getting Started
	Installing a Preconfigured Virtual Machine
	Detailed Installation
		Installing Java
			Linux-specific Java steps
			Mac OS X-specific Java steps
		Installing Hadoop
		Local Mode, Pseudodistributed Mode, and Distributed Mode
		Testing Hadoop
		Installing Hive
	What Is Inside Hive?
	Starting Hive
	Configuring Your Hadoop Environment
		Local Mode Configuration
		Distributed and Pseudodistributed Mode Configuration
		Metastore Using JDBC
	The Hive Command
		Command Options
	The Command-Line Interface
		CLI Options
		Variables and Properties
		Hive “One Shot” Commands
		Executing Hive Queries from Files
		The .hiverc File
		More on Using the Hive CLI
			Autocomplete
		Command History
		Shell Execution
		Hadoop dfs Commands from Inside Hive
		Comments in Hive Scripts
		Query Column Headers
Chapter 3. Data Types and File Formats
	Primitive Data Types
	Collection Data Types
	Text File Encoding of Data Values
	Schema on Read
Chapter 4. HiveQL: Data Definition
	Databases in Hive
	Alter Database
	Creating Tables
		Managed Tables
		External Tables
	Partitioned, Managed Tables
		External Partitioned Tables
		Customizing Table Storage Formats
	Dropping Tables
	Alter Table
		Renaming a Table
		Adding, Modifying, and Dropping a Table Partition
		Changing Columns
		Adding Columns
		Deleting or Replacing Columns
		Alter Table Properties
		Alter Storage Properties
		Miscellaneous Alter Table Statements
Chapter 5. HiveQL: Data Manipulation
	Loading Data into Managed Tables
	Inserting Data into Tables from Queries
		Dynamic Partition Inserts
	Creating Tables and Loading Them in One Query
	Exporting Data
Chapter 6. HiveQL: Queries
	SELECT … FROM Clauses
		Specify Columns with Regular Expressions
		Computing with Column Values
		Arithmetic Operators
		Using Functions
			Mathematical functions
			Aggregate functions
			Table generating functions
			Other built-in functions
		LIMIT Clause
		Column Aliases
		Nested SELECT Statements
		CASE … WHEN … THEN Statements
		When Hive Can Avoid MapReduce
	WHERE Clauses
		Predicate Operators
		Gotchas with Floating-Point Comparisons
		LIKE and RLIKE
	GROUP BY Clauses
		HAVING Clauses
	JOIN Statements
		Inner JOIN
		Join Optimizations
		LEFT OUTER JOIN
		OUTER JOIN Gotcha
		RIGHT OUTER JOIN
		FULL OUTER JOIN
		LEFT SEMI-JOIN
		Cartesian Product JOINs
		Map-side Joins
	ORDER BY and SORT BY
	DISTRIBUTE BY with SORT BY
	CLUSTER BY
	Casting
		Casting BINARY Values
	Queries that Sample Data
		Block Sampling
		Input Pruning for Bucket Tables
	UNION ALL
Chapter 7. HiveQL: Views
	Views to Reduce Query Complexity
	Views that Restrict Data Based on Conditions
	Views and Map Type for Dynamic Tables
	View Odds and Ends
Chapter 8. HiveQL: Indexes
	Creating an Index
		Bitmap Indexes
	Rebuilding the Index
	Showing an Index
	Dropping an Index
	Implementing a Custom Index Handler
Chapter 9. Schema Design
	Table-by-Day
	Over Partitioning
	Unique Keys and Normalization
	Making Multiple Passes over the Same Data
	The Case for Partitioning Every Table
	Bucketing Table Data Storage
	Adding Columns to a Table
	Using Columnar Tables
		Repeated Data
		Many Columns
	(Almost) Always Use Compression!
Chapter 10. Tuning
	Using EXPLAIN
	EXPLAIN EXTENDED
	Limit Tuning
	Optimized Joins
	Local Mode
	Parallel Execution
	Strict Mode
	Tuning the Number of Mappers and Reducers
	JVM Reuse
	Indexes
	Dynamic Partition Tuning
	Speculative Execution
	Single MapReduce MultiGROUP BY
	Virtual Columns
Chapter 11. Other File Formats and Compression
	Determining Installed Codecs
	Choosing a Compression Codec
	Enabling Intermediate Compression
	Final Output Compression
	Sequence Files
	Compression in Action
	Archive Partition
	Compression: Wrapping Up
Chapter 12. Developing
	Changing Log4J Properties
	Connecting a Java Debugger to Hive
	Building Hive from Source
		Running Hive Test Cases
		Execution Hooks
	Setting Up Hive and Eclipse
	Hive in a Maven Project
	Unit Testing in Hive with hive_test
	The New Plugin Developer Kit
Chapter 13. Functions
	Discovering and Describing Functions
	Calling Functions
	Standard Functions
	Aggregate Functions
	Table Generating Functions
	A UDF for Finding a Zodiac Sign from a Day
	UDF Versus GenericUDF
	Permanent Functions
	User-Defined Aggregate Functions
		Creating a COLLECT UDAF to Emulate GROUP_CONCAT
	User-Defined Table Generating Functions
		UDTFs that Produce Multiple Rows
		UDTFs that Produce a Single Row with Multiple Columns
		UDTFs that Simulate Complex Types
	Accessing the Distributed Cache from a UDF
	Annotations for Use with Functions
		Deterministic
		Stateful
		DistinctLike
	Macros
Chapter 14. Streaming
	Identity Transformation
	Changing Types
	Projecting Transformation
	Manipulative Transformations
	Using the Distributed Cache
	Producing Multiple Rows from a Single Row
	Calculating Aggregates with Streaming
	CLUSTER BY, DISTRIBUTE BY, SORT BY
	GenericMR Tools for Streaming to Java
	Calculating Cogroups
Chapter 15. Customizing Hive File and Record Formats
	File Versus Record Formats
	Demystifying CREATE TABLE Statements
	File Formats
		SequenceFile
		RCFile
		Example of a Custom Input Format: DualInputFormat
	Record Formats: SerDes
	CSV and TSV SerDes
	ObjectInspector
	Think Big Hive Reflection ObjectInspector
	XML UDF
	XPath-Related Functions
	JSON SerDe
	Avro Hive SerDe
		Defining Avro Schema Using Table Properties
		Defining a Schema from a URI
		Evolving Schema
	Binary Output
Chapter 16. Hive Thrift Service
	Starting the Thrift Server
	Setting Up Groovy to Connect to HiveService
	Connecting to HiveServer
	Getting Cluster Status
	Result Set Schema
	Fetching Results
	Retrieving Query Plan
	Metastore Methods
		Example Table Checker
			Finding tables not marked as external
	Administrating HiveServer
		Productionizing HiveService
		Cleanup
	Hive ThriftMetastore
		ThriftMetastore Configuration
		Client Configuration
Chapter 17. Storage Handlers and NoSQL
	Storage Handler Background
	HiveStorageHandler
	HBase
	Cassandra
		Static Column Mapping
		Transposed Column Mapping for Dynamic Columns
		Cassandra SerDe Properties
	DynamoDB
Chapter 18. Security
	Integration with Hadoop Security
	Authentication with Hive
	Authorization in Hive
		Users, Groups, and Roles
		Privileges to Grant and Revoke
		Partition-Level Privileges
		Automatic Grants
Chapter 19. Locking
	Locking Support in Hive with Zookeeper
	Explicit, Exclusive Locks
Chapter 20. Hive Integration with Oozie
	Oozie Actions
		Hive Thrift Service Action
	A Two-Query Workflow
	Oozie Web Console
	Variables in Workflows
	Capturing Output
	Capturing Output to Variables
Chapter 21. Hive and Amazon Web Services (AWS)
	Why Elastic MapReduce?
	Instances
	Before You Start
	Managing Your EMR Hive Cluster
	Thrift Server on EMR Hive
	Instance Groups on EMR
	Configuring Your EMR Cluster
		Deploying hive-site.xml
		Deploying a .hiverc Script
			Deploying .hiverc using a config step
			Deploying a .hiverc using a bootstrap action
		Setting Up a Memory-Intensive Configuration
	Persistence and the Metastore on EMR
	HDFS and S3 on EMR Cluster
	Putting Resources, Configs, and Bootstrap Scripts on S3
	Logs on S3
	Spot Instances
	Security Groups
	EMR Versus EC2 and Apache Hive
	Wrapping Up
Chapter 22. HCatalog
	Introduction
	MapReduce
		Reading Data
		Writing Data
	Command Line
	Security Model
	Architecture
Chapter 23. Case Studies
	m6d.com (Media6Degrees)
		Data Science at M6D Using Hive and R
		M6D UDF Pseudorank
		M6D Managing Hive Data Across Multiple MapReduce Clusters
			Cross deployment queries with Hive
			Replicating Hive data between deployments
	Outbrain
		In-Site Referrer Identification
			Cleaning up the URLs
			Determining referrer type
			Multiple URLs
		Counting Uniques
			Why this is a problem
			Load a temp table
			Querying the temp table
		Sessionization
			Setting it up
			Finding origin pageviews
			Bucketing PVs to origins
			Aggregating on origins
			Aggregating on origin type
			Measure engagement
	NASA’s Jet Propulsion Laboratory
		The Regional Climate Model Evaluation System
		Our Experience: Why Hive?
		Some Challenges and How We Overcame Them
			Conclusion
	Photobucket
		Big Data at Photobucket
		What Hardware Do We Use for Hive?
		What’s in Hive?
		Who Does It Support?
	SimpleReach
	Experiences and Needs from the Customer Trenches
		A Karmasphere Perspective
		Introduction
		Use Case Examples from the Customer Trenches
			Customer trenches #1: Optimal data formatting for Hive
			Customer trenches #2: Partitions and performance
			Customer trenches #3: Text analytics with Regex, Lateral View Explode, Ngram, and other UDFs
				Apache Hive in production: Incremental needs and capabilities
				About Karmasphere
Glossary
Appendix. References
Index
                        
Document Text Contents
Page 175

CHAPTER 12

Developing

Hive won’t provide everything you could possibly need. Sometimes a third-party library
will fill a gap. At other times, you or someone else who is a Java developer will need to
write user-defined functions (UDFs; see Chapter 13), SerDes (see “Record Formats:
SerDes” on page 205), input and/or output formats (see Chapter 15), or other
enhancements.

This chapter explores working with the Hive source code itself, including the new
Plugin Developer Kit introduced in Hive v0.8.0.

Changing Log4J Properties
Hive can be configured with two separate Log4J configuration files found in
$HIVE_HOME/conf. The hive-log4j.properties file controls logging in the CLI and
other locally launched components. The hive-exec-log4j.properties file controls
logging inside the MapReduce tasks. These files do not need to be present in the
Hive installation because default properties are built into the Hive JARs. In fact,
the actual files in the conf directory have a .template extension, so they are ignored
by default. To use either of them, copy it to a name without the .template extension
and edit it to taste:

$ cp conf/hive-log4j.properties.template conf/hive-log4j.properties
$ ... edit file ...
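
For example, to make the CLI log at the DEBUG level permanently, you might change
the root logger line in the copied file. This is a minimal sketch: the property names
below follow the template shipped with Hive, but the default appender name can vary
between versions, so verify against your own template:

# in conf/hive-log4j.properties -- a sketch; check your template's defaults
# raise verbosity from the default WARN level:
hive.root.logger=DEBUG,DRFA
# where the log file is written (template defaults shown):
hive.log.dir=/tmp/${user.name}
hive.log.file=hive.log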

It is also possible to change Hive's logging configuration temporarily, without
copying and editing the Log4J files. The hiveconf switch can be specified on start-up
with definitions for any of the properties in the log4j.properties file. For example,
here we set the root logger to the DEBUG level and send output to the console appender:

$ bin/hive -hiveconf hive.root.logger=DEBUG,console
12/03/27 08:46:01 WARN conf.HiveConf: hive-site.xml not found on CLASSPATH
12/03/27 08:46:01 DEBUG conf.Configuration: java.io.IOException: config()


Page 176

Connecting a Java Debugger to Hive
When more verbose output does not reveal the cause of the problem you are
troubleshooting, attaching a Java debugger gives you the ability to step through
the Hive code and, hopefully, find the problem.

Remote debugging is a feature of Java that is enabled manually by setting specific
command-line properties for the JVM. The Hive shell script provides a switch and help
screen that make it easy to set these properties (some output truncated for space;
an example invocation follows the parameter list):

$ bin/hive --help --debug
Allows to debug Hive by connecting to it via JDI API
Usage: hive --debug[:comma-separated parameters list]

Parameters:

recursive=<y|n> Should child JVMs also be started in debug mode. Default: y
port=<port_number> Port on which main JVM listens for debug connection. Defaul...
mainSuspend=<y|n> Should main JVM wait with execution for the debugger to con...
childSuspend=<y|n> Should child JVMs wait with execution for the debugger to c...
swapSuspend Swaps suspend options between main and child JVMs
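
For example, the following starts the CLI listening on a specific port, with the main
JVM suspended until a debugger attaches. This is a sketch: the parameter names come
from the help screen above, but the port value is arbitrary, so adjust it to taste:

$ bin/hive --debug:port=8000,mainSuspend=y,childSuspend=n

You can then attach any JDWP-capable debugger, such as your IDE's remote-debug
configuration or the jdb tool that ships with the JDK:

$ jdb -attach 8000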

Building Hive from Source
Running official Apache releases is usually a good idea; however, you may wish to use
features that are not yet part of a release, or you may have an internal branch with
nonpublic customizations.

In those cases, you'll need to build Hive from source. The minimum requirements for
building Hive are a recent Java JDK, Subversion, and Ant. Hive also contains
components, such as Thrift-generated classes, that are not built by default;
rebuilding those requires a Thrift compiler, too.

The following commands check out the Hive trunk and build it, producing output in
the hive-trunk/build/dist directory:

$ svn co http://svn.apache.org/repos/asf/hive/trunk hive-trunk
$ cd hive-trunk
$ ant package

$ ls build/dist/
bin examples LICENSE README.txt scripts
conf lib NOTICE RELEASE_NOTES.txt

Running Hive Test Cases
Hive has a unique built-in infrastructure for testing. It does have traditional JUnit
tests; however, the majority of the testing happens by running queries saved in .q files
and comparing the results with those of a previous run saved in the Hive source.[1] There are multiple

1. That is, they are more like feature or acceptance tests.
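
As a sketch of how an individual query-file test is typically run, the ant-based build
of that era accepted a driver test class and a .q file name via system properties; the
exact target and file names here are assumptions, so check the developer documentation
for your checkout:

$ cd hive-trunk
$ ant test -Dtestcase=TestCliDriver -Dqfile=groupby1.q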


Page 349

About the Authors
Edward Capriolo is currently a System Administrator at Media6Degrees, where he helps
design and maintain distributed data storage systems for the Internet advertising
industry.

Edward is a member of the Apache Software Foundation and a committer for the
Hadoop-Hive project. He has experience as a developer, as well as a Linux and network
administrator, and enjoys the rich world of open source software.

Dean Wampler is a Principal Consultant at Think Big Analytics, where he specializes
in “Big Data” problems and tools like Hadoop and Machine Learning. Besides Big Data,
he specializes in Scala, the JVM ecosystem, JavaScript, Ruby, functional and object-
oriented programming, and Agile methods. Dean is a frequent speaker at industry and
academic conferences on these topics. He has a Ph.D. in Physics from the University
of Washington.

Jason Rutherglen is a software architect at Think Big Analytics and specializes in Big
Data, Hadoop, search, and security.

Colophon
The animal on the cover of Programming Hive is a European hornet (Vespa crabro) and
its hive. The European hornet is the only hornet in North America, introduced to the
continent when European settlers migrated to the Americas. This hornet can be found
throughout Europe and much of Asia, adapting its hive-building techniques to different
climates when necessary.

The hornet is a social insect, related to bees and ants. The hornet’s hive consists of one
queen, a few male hornets (drones), and a large quantity of sterile female workers. The
chief purpose of drones is to reproduce with the hornet queen, and they die soon after.
It is the female workers who are responsible for building the hive, carrying food, and
tending to the hornet queen’s eggs.

The hornet’s nest itself is the consistency of paper, since it is constructed out of wood
pulp in several layers of hexagonal cells. The end result is a pear-shaped nest attached
to its shelter by a short stem. In colder areas, hornets will abandon the nest in the winter
and take refuge in hollow logs or trees, or even human houses, where the queen and
her eggs will stay until the warmer weather returns. The eggs form the start of a new
colony, and the hive can be constructed once again.

The cover image is from . The cover font is Adobe ITC Garamond.
The text font is Linotype Birka; the heading font is Adobe Myriad Condensed;
and the code font is LucasFont’s TheSansMonoCondensed.
