Previous Hive implementation
-- Stage 1: Filter
INSERT OVERWRITE TABLE tmp_table1
PARTITION (...)
SELECT entity_id, target_id, feature_id, feature_value
FROM input_table
WHERE ...;

-- Stage 2: Aggregate
INSERT OVERWRITE TABLE tmp_table2
PARTITION (...)
SELECT entity_id, target_id, AGG(feature_id, feature_value)
FROM tmp_table1
GROUP BY entity_id, target_id;

-- Stage 3: Shard and index
SELECT TRANSFORM (entity_id % SHARDS as shard_id, ...)
USING 'indexer'  -- writes indexed files to hdfs
AS shard_id, status
FROM tmp_table2;
Output (diagram label): indexed hdfs_files
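The slide does not show the 'indexer' program itself. Below is a minimal, hypothetical Python sketch of the kind of streaming script that TRANSFORM ... USING invokes, assuming Hive's standard protocol of tab-separated rows on stdin and tab-separated output rows on stdout. The per-shard file naming, the index format, and the HDFS copy step are illustrative assumptions, not the actual implementation.

#!/usr/bin/env python
# Hypothetical 'indexer' streaming script (illustrative only).
# Hive TRANSFORM pipes rows to stdin as tab-separated lines and reads
# tab-separated output rows from stdout (here: shard_id, status).
import sys
from collections import defaultdict

rows_per_shard = defaultdict(list)

# Input columns arrive in the order listed in TRANSFORM(...): shard_id first.
for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    shard_id, rest = fields[0], fields[1:]
    rows_per_shard[shard_id].append(rest)

for shard_id, rows in sorted(rows_per_shard.items()):
    # Stand-in for building an index file and pushing it to HDFS
    # (e.g. with `hadoop fs -put`); the real index format is not shown.
    with open("shard_%s.idx" % shard_id, "w") as out:
        for rest in rows:
            out.write("\t".join(rest) + "\n")
    # One output row per shard, consumed by the AS shard_id, status clause.
    print("%s\tindexed" % shard_id)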
• 60 TB+ compressed input data size
• Split into hundreds of smaller Hive jobs, sharded by entity id (see the sketch below)
• Unmanageable and slow
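As context for the second bullet, here is a hedged sketch of how the work might have been split into many per-shard Hive jobs. The orchestration is not shown on the slide; NUM_JOBS, the partition spec, and the per-shard WHERE filter are illustrative assumptions.

#!/usr/bin/env python
# Hypothetical sketch of splitting the filter stage into many small Hive jobs,
# one per entity_id shard; the real orchestration is not shown on the slide.
import subprocess

NUM_JOBS = 400  # "hundreds of smaller hive jobs" -- exact count is an assumption

QUERY = """
INSERT OVERWRITE TABLE tmp_table1 PARTITION (shard={shard})
SELECT entity_id, target_id, feature_id, feature_value
FROM input_table
WHERE entity_id % {num_jobs} = {shard}
"""

for shard in range(NUM_JOBS):
    # Each shard runs as its own Hive invocation; with hundreds of them the
    # pipeline became unmanageable and slow end to end.
    subprocess.check_call(["hive", "-e", QUERY.format(shard=shard, num_jobs=NUM_JOBS)])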
Pipeline stages (diagram labels): Filter → Aggregate → Shard