Facebook | Data/Hive/Hive 2.0 Tutorial - Facebook http://www.dev.facebook.com/intern/wiki/index.php/Data/Hive/Hive_2...
3 of 10 7/8/2008 3:40 PM
CREATE TABLE page_view(viewTime DATETIME, userid MEDIUMINT,
page_url STRING, referrer_url STRING,
friends ARRAY<BIGINT>, properties MAP<STRING, STRING>,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(date DATETIME, country STRING)
BUCKETED ON (userid) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\012'
STORED AS COMPRESSED;
In this example the columns that comprise the table row are specified in the same way as in a type
definition. Comments can be attached at both the column level and the table level. The PARTITIONED
BY clause defines the partitioning columns, which are distinct from the data columns and are not
actually stored with the data. The BUCKETED ON clause specifies which column to use for bucketing
as well as how many buckets to create. The DELIMITED row format specifies how the rows are stored
in the Hive table: in this case it gives the characters that terminate the fields, the items within
collections (arrays or maps), and the map keys. STORED AS COMPRESSED indicates that the data is
compressed and stored in a binary format (using Hadoop SequenceFiles) on HDFS. Besides
COMPRESSED, the data may also be stored as TEXT. The values shown for the ROW FORMAT and
STORED AS clauses in the above example represent the system defaults.
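To make the delimited layout concrete, the following Python sketch (not part of Hive; the column names and sample values are illustrative) splits one such row back into its scalar columns, its array and its map using the default delimiters:

```python
# Sketch of how the default delimiters \001, \002 and \003 lay out one
# row holding page_url, referrer_url, friends ARRAY<BIGINT> and
# properties MAP<STRING, STRING>. Values are illustrative only.

def parse_row(line):
    """Split one delimited row into (page_url, referrer_url, friends, properties)."""
    fields = line.split("\x01")                       # FIELDS TERMINATED BY '\001'
    page_url, referrer_url = fields[0], fields[1]
    friends = [int(x) for x in fields[2].split("\x02") if x]   # items split on '\002'
    properties = dict(
        entry.split("\x03", 1)                        # key split from value on '\003'
        for entry in fields[3].split("\x02") if entry # map entries split on '\002'
    )
    return page_url, referrer_url, friends, properties

row = "\x01".join([
    "http://example.com/a",
    "http://example.com/b",
    "\x02".join(["1", "2", "3"]),                     # friends ARRAY<BIGINT>
    "\x02".join(["k1\x03v1", "k2\x03v2"]),            # properties MAP<STRING, STRING>
])
print(parse_row(row))
```

The same layering (a field delimiter, then a collection-item delimiter, then a map-key delimiter) is what the ROW FORMAT DELIMITED clause above declares to Hive.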
Describing and Showing Tables
Describing and showing tables uses syntax similar to describing and showing types. Accordingly,
all of the following statements return the names of tables that match the specified criteria.
SHOW TABLES;
SHOW TABLES WHERE name = 'page_view';
SHOW TABLES WHERE comment LIKE '%user table%';
To see the full definition of a table, one can use an ordinary describe statement:
DESCRIBE TABLE page_view;
Loading Data
There are multiple mechanisms for loading data into Hive tables. The user can create an external table
that points to a specified location within HDFS. With this approach, the user copies a file into the
specified location using the HDFS put or copy commands and creates a table pointing to that location,
with all the relevant row format information. Once this is done, the user can transform the data and
insert it into any other Hive table. For example, if the file /tmp/pv_2008-06-08.txt contains
comma-separated page views served on 2008-06-08, and these need to be loaded into the
appropriate partition of the page_view table, the following sequence of commands achieves this:
CREATE EXTERNAL TABLE page_view_stg(viewTime DATETIME, userid MEDIUMINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User',
country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054' LINES TERMINATED BY '\012'
LOCATION '/user/facebook/staging/page_view';
hadoop dfs -put /tmp/pv_2008-06-08.txt /user/facebook/staging/page_view
and finally
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(date='2008-06-08', country='US')