Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS, HBase, or the Amazon Simple Storage Service (S3). In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (the Impala query UI in Hue) as Apache Hive. This provides a familiar and unified platform for real-time or batch-oriented queries.

Impala is an addition to the tools available for querying big data. It does not replace batch processing frameworks built on MapReduce, such as Hive; those frameworks remain best suited to long-running batch jobs, such as Extract, Transform, and Load (ETL) workloads.

Note: Impala was accepted into the Apache incubator on December 2, 2015. Where the documentation formerly referred to "Cloudera Impala", the official name is now "Apache Impala (incubating)".

### Impala-2.8: Fast Interactive SQL Queries for Big Data

#### Introduction to Apache Impala (Incubating)

Apache Impala is a high-performance, distributed SQL query engine that enables fast, interactive SQL queries on data stored in Apache Hadoop's HDFS, HBase, or Amazon S3. It offers a familiar and unified platform for real-time or batch-oriented queries by using the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (the Impala query UI in Hue) as Apache Hive.

#### Key Benefits of Impala

- **Interactive Queries**: Impala is optimized for interactive use cases, allowing analysts and developers to query large datasets quickly.
- **Unified Platform**: By using the same metadata, SQL syntax, ODBC driver, and UI as Hive, it simplifies development and deployment.
- **Integration**: It integrates with the existing Hadoop ecosystem, including HDFS, HBase, and other components.

#### How Impala Works with Apache Hadoop

Impala complements the Hadoop ecosystem by providing a fast SQL interface that does not rely on the MapReduce framework. Instead, it uses a low-latency, massively parallel processing (MPP) architecture to execute queries directly on the data stored in HDFS, HBase, or S3.

#### Primary Features of Impala

- **High Performance**: Impala can execute queries significantly faster than traditional batch-oriented systems such as Hive.
- **Interactive Query Processing**: It supports ad-hoc queries, making it well suited to exploratory data analysis.
- **Scalability**: Impala scales out to handle large volumes of data across many nodes.

#### Impala Concepts and Architecture

- **Components of the Impala Server**:
  - **Impala Daemon (impalad)**: Runs on each node, accepts client connections, and plans, coordinates, and executes queries.
  - **Impala Statestore (statestored)**: Tracks the health and membership of the Impala daemons and broadcasts that state across the cluster.
  - **Impala Catalog Service (catalogd)**: Relays metadata changes made through Impala SQL statements to all nodes in the cluster.

#### Developing Impala Applications

- **Overview of the Impala SQL Dialect**: Impala supports a large subset of SQL, including standard constructs such as SELECT, FROM, WHERE, JOIN, and GROUP BY, along with functions and features specific to Impala (see the query sketch after this list).
- **Overview of Impala Programming Interfaces**: Impala can be accessed through JDBC and ODBC drivers, the impala-shell command-line interface, and the Hue web UI, enabling integration with a wide range of applications and tools.
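To make the dialect concrete, here is a minimal sketch of an interactive Impala query. The table and column names (`orders`, `customers`, and so on) are hypothetical and are assumed to already exist in the metastore.

```sql
-- Hypothetical tables: orders(customer_id, total, order_date) and customers(id, region).
-- A typical ad-hoc aggregation with a join, run from impala-shell or the Hue query UI.
SELECT c.region,
       COUNT(*)     AS order_count,
       SUM(o.total) AS revenue
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.order_date >= '2016-01-01'
GROUP BY c.region
ORDER BY revenue DESC
LIMIT 10;
```

Because Impala executes the plan with its own MPP engine rather than MapReduce, a query like this typically returns in interactive time on data already stored in HDFS, HBase, or S3.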
#### How Impala Fits into the Hadoop Ecosystem

Impala coexists with other Hadoop components, such as Hive and Pig, offering a complementary solution for interactive analytics. While Hive is better suited to long-running batch jobs, Impala excels at delivering fast results for ad-hoc queries.

#### How Impala Works with Hive

- **Shared Metadata**: Impala and Hive share the same metadata through the Hive Metastore, ensuring consistency between the two systems.
- **SQL Syntax**: Impala uses the Hive SQL dialect, making it easy for users familiar with Hive to transition to Impala.

#### Overview of Impala Metadata and the Metastore

- **Metadata Management**: Impala relies on the Hive Metastore for metadata management; the metastore stores information about the structure and location of the data.
- **Schema Evolution**: The metastore supports schema evolution, enabling changes to table schemas without disrupting data access.

#### How Impala Uses HDFS and HBase

- **HDFS**: Impala reads and writes data files directly in HDFS, without intermediate processing stages.
- **HBase**: For HBase tables, Impala uses the HBase client API to access the data directly, so HBase data can be queried through SQL.

#### Planning for Impala Deployment

- **Requirements**:
  - **Supported Operating Systems**: Linux distributions such as CentOS, Red Hat Enterprise Linux, and Ubuntu.
  - **Hive Metastore and Related Configuration**: Impala requires a properly configured Hive Metastore to manage metadata.
  - **Java Dependencies**: A Java runtime environment is required.
  - **Networking Configuration**: The network must allow connectivity between the Impala daemons and the related services.
  - **Hardware Requirements**: Sufficient CPU, memory, and disk space for the size of the data and the expected query workload.
  - **User Account Requirements**: User accounts with appropriate permissions to access the data and run the Impala services.
- **Cluster Sizing Guidelines**:
  - Consider the amount of data, the number of concurrent users, and the expected query complexity when sizing the cluster.
  - Run the Impala daemons on the DataNodes so that queries benefit from data locality, and budget memory and CPU so that Impala does not contend with other Hadoop components.

#### Guidelines for Designing Impala Schemas

- **Columnar Storage**: Use a columnar storage format such as Parquet for improved query performance.
- **Partitioning**: Partition tables to improve query efficiency by reducing the amount of data scanned, as shown in the sketch below.
- **Schema Simplification**: Keep schemas simple and denormalized where possible to reduce the overhead of joins.
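The following is a minimal sketch of these guidelines in Impala SQL; the table names and partition columns (`sales`, `staging_sales`, `sale_year`, `sale_month`) are hypothetical.

```sql
-- Hypothetical partitioned table stored in the Parquet columnar format.
CREATE TABLE sales (
  item_id  BIGINT,
  quantity INT,
  price    DECIMAL(10,2)
)
PARTITIONED BY (sale_year SMALLINT, sale_month TINYINT)
STORED AS PARQUET;

-- Load one partition from a hypothetical staging table.
INSERT INTO sales PARTITION (sale_year = 2016, sale_month = 11)
SELECT item_id, quantity, price
FROM staging_sales
WHERE year = 2016 AND month = 11;

-- Filtering on the partition columns lets Impala scan only the matching
-- partitions (partition pruning) instead of the whole table.
SELECT sale_month, SUM(price * quantity) AS revenue
FROM sales
WHERE sale_year = 2016
GROUP BY sale_month
ORDER BY sale_month;
```

Parquet's columnar layout means a query that touches only a few columns reads only those columns, and partition pruning keeps scans limited to the partitions named in the WHERE clause.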
#### Installing Impala

- **What is Included in an Impala Installation**: The installation typically includes the Impala daemon, the statestore, the catalog service, and the necessary libraries.
- **Managing Impala**: Post-installation configuration, including setting up ODBC and JDBC connections, is important for performance and security.

#### Configuring Impala to Work with ODBC and JDBC

- **ODBC Configuration**: Requires installing and configuring the ODBC driver and defining a DSN (Data Source Name).
- **JDBC Configuration**:
  - **Configuring the JDBC Port**: Specify the port used for JDBC connections (21050 by default).
  - **Choosing the JDBC Driver**: Select a compatible JDBC driver version.
  - **Enabling JDBC Support on Client Systems**: Install the JDBC driver on each client system.
  - **Establishing JDBC Connections**: Configure the connection URL and credentials.

#### Upgrading Impala

- **Upgrade Process**: Follow the upgrade path recommended in the Impala documentation to ensure compatibility and minimize downtime.

#### Starting Impala

- **Starting Impala from the Command Line**: Start the statestore, catalog service, and Impala daemons, typically through the packaged service scripts (for example, `impala-state-store`, `impala-catalog`, and `impala-server`) or by running the daemon executables directly.
- **Modifying Impala Startup Options**: Customize the startup options for the different daemons (impalad, statestored, and catalogd).

#### Checking the Values of Impala Configuration Options

- **Common Startup Options**:
  - **For impalad**: Options such as the hostname, ports, and log directories.
  - **For statestored**: Options for the daemon that tracks cluster membership and coordination.
  - **For catalogd**: Options for the daemon that provides the metadata service.
- The values in effect for each daemon can be reviewed through its built-in web interface.

#### Impala Tutorials

- **Tutorials for Getting Started**:
  - **Explore a New Impala Instance**: Connect to Impala and run basic queries.
  - **Load CSV Data from Local Files**: Import data from CSV files into Impala.
  - **Point an Impala Table at Existing Data Files**: Link Impala tables to data already stored in HDFS.
  - **Describe the Impala Table**: View the schema of a table.
  - **Query the Impala Table**: Execute more complex queries and analyze the results.

#### Advanced Tutorials

- **Attaching an External Partitioned Table to an HDFS Directory Structure**: Create an external table and map its partitions to existing directories in HDFS.
- **Switching Back and Forth Between Impala and Hive**: Move between Impala and Hive for different tasks against the same tables (a combined sketch of both topics follows the conclusion below).

#### Conclusion

Apache Impala is a powerful tool for performing fast, interactive SQL queries on big data stored in Hadoop. Its integration with Hive and other Hadoop components makes it a valuable addition to an organization's data processing infrastructure. By understanding its architecture, features, and deployment guidelines, organizations can use Impala to gain insights from their data more efficiently and effectively.
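As a closing illustration of the two advanced tutorial topics, here is a minimal sketch; the HDFS path `/data/web_logs`, the table `web_logs`, and its columns are all hypothetical.

```sql
-- Hypothetical external, partitioned table over data files that already exist
-- in HDFS; Impala does not move or rewrite the underlying files.
CREATE EXTERNAL TABLE web_logs (
  ts      TIMESTAMP,
  user_id STRING,
  url     STRING
)
PARTITIONED BY (log_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/web_logs';

-- Attach one existing HDFS directory per partition value.
ALTER TABLE web_logs ADD PARTITION (log_date = '2016-11-01')
  LOCATION '/data/web_logs/2016-11-01';

-- When switching back from Hive: tables created or altered in Hive become
-- visible to Impala after INVALIDATE METADATA, and new data files added to
-- an existing table are picked up with REFRESH.
INVALIDATE METADATA web_logs;
REFRESH web_logs;

SELECT log_date, COUNT(*) AS hits
FROM web_logs
GROUP BY log_date
ORDER BY log_date;
```

Because both engines share the metastore, the same table can then be queried from Hive as well.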