Data Access Optimizing an Apache Hive data warehouse
Optimizing an Apache Hive data warehouse
You can tune your data warehouse infrastructure, components, and client connection parameters to improve the
performance and relevance of business intelligence and other applications. Tuning Hive and the background components
that support Hive processing becomes particularly important as your workload and database volume increase.
Increasingly, enterprises want to run SQL workloads that return results faster than batch processing can provide.
These enterprises often want data analytics applications to support interactive queries. Hive low-latency analytical
processing (LLAP) can improve the performance of interactive queries. On the Hortonworks Data Platform (HDP), a Hive
interactive query is one that meets low-latency, variably gauged benchmarks, to which Hive LLAP responds in
15 seconds or less. LLAP enables application development and IT infrastructure to run queries that return real-time
or near-real-time results.
You can further enhance LLAP performance with real-time data by integrating the enterprise data warehouse (EDW)
with the Druid business intelligence engine.
When you query large-scale EDW data sets, you have to meet service-level agreement (SLA) benchmarks or other
performance expectations. How you tune your query processing environment depends on factors such as system
resources, depth of data analysis, and query latency requirements. You must therefore become familiar with Hive
warehouse processing, prepare for tuning, and configure LLAP using parameters that meet your performance needs.
LLAP ports
You use port 10500 to make the JDBC connection through Beeline to query Hive through the HiveServer Interactive
host. The LLAP daemon uses several other ports.
List of port properties
• HiveServer Interactive (LLAP) port (10500)
• hive.server2.thrift.http.port (10501)
• hive.llap.daemon.rpc.port (0)
• hive.llap.daemon.web.port (15002)
• hive.llap.daemon.yarn.shuffle.port (15551)
• hive.llap.management.rpc.port (15004)
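As a rough illustration of how the ports listed above might be used from a client machine, the following Python sketch builds the JDBC URL that Beeline would connect to on port 10500 and offers a simple TCP reachability check. The host name, the default database name, and the helper function names are assumptions made for this example; only the port values come from the list.

```python
import socket

# Port values copied from the list above; everything else here is illustrative.
LLAP_PORTS = {
    "HiveServer Interactive (LLAP) port": 10500,
    "hive.server2.thrift.http.port": 10501,
    "hive.llap.daemon.rpc.port": 0,
    "hive.llap.daemon.web.port": 15002,
    "hive.llap.daemon.yarn.shuffle.port": 15551,
    "hive.llap.management.rpc.port": 15004,
}


def beeline_jdbc_url(host, database="default"):
    """Build the JDBC URL for a Beeline connection to HiveServer Interactive."""
    port = LLAP_PORTS["HiveServer Interactive (LLAP) port"]
    return f"jdbc:hive2://{host}:{port}/{database}"


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    if port == 0:
        # The list gives 0 for this property, so there is no fixed port to probe.
        return False
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    host = "llap-host.example.com"  # hypothetical host name
    print(beeline_jdbc_url(host))
```

The resulting URL can be passed to Beeline with the -u option. The port_is_open helper only confirms that something accepts TCP connections on the port; it does not verify that the listener is a Hive service.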
Preparations for tuning performance
Before you tune Apache Hive, you should follow best practices. These guidelines include how you configure the
cluster, store data, and write queries.
Best practices
• Set up your cluster to use Apache Tez or the Hive on Tez execution engine.
In HDP 3.x, the MapReduce execution engine is replaced by Tez.
• Disable user impersonation by setting Run as end user to false in Ambari, which is equivalent to setting
hive.server2.enable.doAs to false in hive-site.xml.
LLAP caches data for multiple queries, and this capability does not support user impersonation.
• Add the Ranger security service to your cluster and dependent services.
• Set up LLAP to run interactive queries.
• Store data using the ORC file format.
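Some of the practices above can be spot-checked against a hive-site.xml file. The Python sketch below parses the standard Hadoop configuration layout (property elements with name and value children) and reports settings that differ from the expected values. The hive.server2.enable.doAs property comes from the list above; using hive.execution.engine and hive.default.fileformat to stand for the Tez and ORC practices is an assumption of this example, as are the function names and the sample XML.

```python
import xml.etree.ElementTree as ET

# Expected values. Only hive.server2.enable.doAs is named in the list above;
# the other two property names are assumptions chosen for this sketch.
EXPECTED = {
    "hive.execution.engine": "tez",
    "hive.server2.enable.doAs": "false",
    "hive.default.fileformat": "ORC",
}


def read_properties(xml_text):
    """Parse Hadoop-style configuration XML into a name -> value dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def find_deviations(props):
    """Return (name, expected, actual) for each missing or differing setting."""
    deviations = []
    for name, expected in EXPECTED.items():
        actual = props.get(name)
        if actual is None or actual.lower() != expected.lower():
            deviations.append((name, expected, actual))
    return deviations


# Hypothetical configuration fragment used only to exercise the functions.
SAMPLE = """<configuration>
  <property><name>hive.execution.engine</name><value>tez</value></property>
  <property><name>hive.server2.enable.doAs</name><value>true</value></property>
</configuration>"""

if __name__ == "__main__":
    for name, expected, actual in find_deviations(read_properties(SAMPLE)):
        print(f"{name}: expected {expected}, found {actual}")
```

A property that is absent from the file is reported with an actual value of None, because the effective value then depends on defaults that this sketch does not know about. Practices such as adding Ranger or setting up LLAP are not visible in a single property and are outside the scope of this check.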