Facebook | Data/Hive/Hive 2.0 Tutorial - Facebook http://www.dev.facebook.com/intern/wiki/index.php/Data/Hive/Hive_2...
3 of 10 7/8/2008 3:40 PM
CREATE TABLE page_view(viewTime DATETIME, userid MEDIUMINT,
page_url STRING, referrer_url STRING,
friends ARRAY<BIGINT>, properties MAP<STRING, STRING>,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(date DATETIME, country STRING)
BUCKETED ON (userid) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\012'
STORED AS COMPRESSED;
In this example the columns that comprise the table row are specified in the same way as in a type
definition. Comments can be attached at both the column level and the table level. The PARTITIONED
BY clause defines the partitioning columns, which are distinct from the data columns and are not
actually stored with the data. The BUCKETED ON clause specifies which column to use for bucketing
as well as how many buckets to create. The DELIMITED row format specifies how the rows are stored
in the Hive table: in this case it gives the characters that terminate the fields, the items within
collections (arrays or maps), and the map keys. STORED AS COMPRESSED indicates that the data is
compressed and stored in a binary format (using Hadoop SequenceFiles) on HDFS. Besides
COMPRESSED, the data may also be stored as TEXT. The values shown for the ROW FORMAT and
STORED AS clauses in the above example represent the system defaults.
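To make the delimited layout concrete, the following Python sketch (not part of Hive; the column names and sample values are illustrative) splits one such row back into its scalar columns, its array and its map using the default delimiters:

```python
# Sketch of how the default delimiters \001, \002 and \003 lay out one
# row holding page_url, referrer_url, friends ARRAY<BIGINT> and
# properties MAP<STRING, STRING>. Values are illustrative only.

def parse_row(line):
    """Split one delimited row into (page_url, referrer_url, friends, properties)."""
    fields = line.split("\x01")                       # FIELDS TERMINATED BY '\001'
    page_url, referrer_url = fields[0], fields[1]
    friends = [int(x) for x in fields[2].split("\x02") if x]   # items split on '\002'
    properties = dict(
        entry.split("\x03", 1)                        # key split from value on '\003'
        for entry in fields[3].split("\x02") if entry # map entries split on '\002'
    )
    return page_url, referrer_url, friends, properties

row = "\x01".join([
    "http://example.com/a",
    "http://example.com/b",
    "\x02".join(["1", "2", "3"]),                     # friends ARRAY<BIGINT>
    "\x02".join(["k1\x03v1", "k2\x03v2"]),            # properties MAP<STRING, STRING>
])
print(parse_row(row))
```

The same layering (a field delimiter, then a collection-item delimiter, then a map-key delimiter) is what the ROW FORMAT DELIMITED clause above declares to Hive.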
Describing and Showing Tables
Describing and showing tables uses syntax similar to describing and showing types. Accordingly,
all of the following statements return the names of tables that match the specified criteria.
SHOW TABLES;
SHOW TABLES WHERE name = 'page_view';
SHOW TABLES WHERE comment LIKE '%user table%';
To see the full definition of a table, one can use an ordinary describe statement:
DESCRIBE TABLE page_view;
Loading Data
There are multiple mechanisms for loading data into Hive tables. The user can create an external table
that points to a specified location within HDFS. With this approach, the user copies a file into the
specified location using the HDFS put or copy commands and creates a table pointing to that location,
with all the relevant row format information. Once this is done, the user can transform the data and
insert it into any other Hive table. For example, if the file /tmp/pv_2008-06-08.txt contains
comma-separated page views served on 2008-06-08, and these need to be loaded into the
appropriate partition of the page_view table, the following sequence of commands achieves this:
CREATE EXTERNAL TABLE page_view_stg(viewTime DATETIME, userid MEDIUMINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User',
country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054' LINES TERMINATED BY '\012'
LOCATION '/user/facebook/staging/page_view';
hadoop dfs -put /tmp/pv_2008-06-08.txt /user/facebook/staging/page_view
and finally
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(date='2008-06-08', country='US')