Impala currently only supports in-memory hash aggregation. In Impala 2.0 and higher, if the memory requirements for a join or aggregation operation exceed the memory limit for a particular host, Impala uses a temporary work area on disk to help the query complete successfully.
When a SELECT statement fails, the cause usually falls into one of the following categories:
No. Impala does not make any changes to the HDFS or HBase data sets.
The default Parquet block size is relatively large (256 MB in Impala 2.0 and later; 1 GB in earlier releases). You can control the block size when creating Parquet files using the PARQUET_FILE_SIZE query option.
You can set up a proxy server to relay requests back and forth to the Impala servers, for load balancing and high availability.
Impala does not use MapReduce at all.
Impala does not currently have an UPDATE statement, which would typically be used to change a single row, a small group of rows, or a specific column. The HDFS-based files used by typical Impala queries are optimized for bulk operations across many megabytes of data at a time, making traditional UPDATEoperations inefficient or impractical.
You can use the following techniques to achieve the same goals as the familiar UPDATE statement, in a way that preserves efficient file layouts for subsequent queries:
There is no defined maximum. Some customers have used Impala to query a table with over a trillion rows.
Impala streams results whenever they are available, when possible. Certain SQL operations (aggregation or ORDER BY) require all of the input to be ready before Impala can return results.
Impala uses two pieces of metadata: the catalog information from the Hive metastore and the file metadata from the NameNode. Currently, this metadata is lazily populated and cached when an impalad needs it to plan a query.
The REFRESH statement updates the metadata for a particular table after loading new data through Hive. The INVALIDATE METADATA Statement statement refreshes all metadata, so that Impala recognizes new tables or other DDL and DML changes performed through Hive.
In Impala 1.2 and higher, a dedicated catalogd daemon broadcasts metadata changes due to Impala DDL or DML statements to all nodes, reducing or eliminating the need to use the REFRESH and INVALIDATE METADATAstatements.
There is not a single point of failure in Impala. All Impala daemons are fully able to handle incoming queries. If a machine fails however, all queries with fragments running on that machine will fail. Because queries are expected to return quickly, you can just rerun the query if there is a failure.
The longer wer: Impala must be able to connect to the Hive metastore. Impala aggressively caches metadata so the metastore host should have minimal load. Impala relies on the HDFS NameNode, and, in CDH4, you can configure HA for HDFS. Impala also has centralized services, known as the statestore andcatalog services, that run on one host only. Impala continues to execute queries if the statestore host is down, but it will not get state updates. For example, if a host is added to the cluster while the statestore host is down, the existing instances of impalad running on the other hosts will not find out about this new host. Once the statestore process is restarted, all the information it serves is automatically reconstructed from all running Impala daemons.
When you use the INSERT ... SELECT * syntax to copy data into a partitioned table, the columns corresponding to the partition key columns must appear last in the columns returned by the SELECT *. You can create the table with the partition key columns defined last. Or, you can use the CREATE VIEW statement to create a view that reorders the columns: put the partition key columns last, then do the INSERT ... SELECT * from the view.
Cloudera strongly recommends running the impalad daemon on each DataNode for good performance. Although this topology is not a hard requirement, if there are data blocks with no Impala daemons running on any of the hosts containing replicas of those blocks, queries involving that data could be very inefficient. In that case, the data must be trmitted from one host to another for processing by "remote reads", a condition Impala normally tries to avoid.
Impala adds support for UDFs in Impala 1.@You can write your own functions in C++, or reuse existing Java-based Hive UDFs. The UDF support includes scalar functions and user-defined aggregate functions (UDAs). User-defined table functions (UDTFs) are not currently supported.
Impala does not currently support an extensible serialization-deserialization framework (SerDes), and so adding extra functionality to Impala is not as straightforward as for Hive or Pig.
Impala has finished its beta release cycle, and the 1.0, 1.1, and 1.2 GA releases are production ready. The 1.1.x series includes additional security features for authorization, an important requirement for production use in many organizations. The 1.2.x series includes important performance features, particularly for large join queries. Some Cloudera customers are already using Impala for large workloads.
The Impala 1.3.0 and higher releases are bundled with corresponding levels of CDH @The number of new features grows with each release.
If a query fails with an error indicating "memory limit exceeded", you might suspect a memory leak. The problem could actually be a query that is structured in a way that causes Impala to allocate more memory than you expect, exceeded the memory allocated for Impala on a particular node. Some examples of query or table structures that are especially memory-intensive are:
You can get scripts that produce data files and set up an environment for TPC-DS style benchmark tests from this Github repository. In addition to being useful for experimenting with performance, the tables are suited to experimenting with many aspects of SQL on Impala: they contain a good mixture of data types, data distributions, partitioning, and relational data suitable for join queries.
Ad-hoc queries are the primary use case for Impala. We anticipate it being used in many other situations where low-latency is required. Whether Impala is appropriate for any particular use-case depends on the workload, data size and query volume.
Impala utilizes multiple strategies to allow joins between tables and result sets of various sizes. When joining a large table with a small one, the data from the small table is trmitted to each node for intermediate processing. When joining two large tables, the data from one of the tables is divided into pieces, and each node processes only selected pieces.
For example, in an industrial environment, many agents may generate large amounts of data. Can Impala be used to analyze this data, checking for notable changes in the environment?
Complex Event Processing (CEP) is usually performed by dedicated stream-processing systems. Impala is not a stream-processing system, as it most closely resembles a relational database.
Impala allocates memory using tcmalloc, a memory allocator that is optimized for high concurrency. Once Impala allocates memory, it keeps that memory reserved to use for future queries. Thus, it is normal for Impala to show high memory usage when idle. If Impala detects that it is about to exceed its memory limit (defined by the -mem_limit startup option or the MEM_LIMIT query option), it deallocates memory not needed by the current queries.
When issuing queries through the JDBC or ODBC interfaces, make sure to call the appropriate close method afterwards. Otherwise, some memory associated with the query is not freed.
To load a data file into a partitioned table, when the data file includes fields like year, month, and so on that correspond to the partition key columns, use a two-stage process. First, use the LOAD DATA or CREATE EXTERNAL TABLE statement to bring the data into an unpartitioned text table. Then use an INSERT ... SELECT statement to copy the data from the unpartitioned table to a partitioned one. Include a PARTITION clause in the INSERTstatement to specify the partition key columns.
Starting with Impala 1.3.0, Impala documentation is integrated with the CDH 5 documentation, in addition to the standalone Impala documentation for use with CDH @For CDH 5, the core Impala developer and administrator information remains in the associated Impala documentation portion. Information about Impala release notes, installation, configuration, startup, and security is embedded in the corresponding CDH 5 guides.
HBase tables are ideal for queries where normally you would use a key-value store. That is, where you retrieve a single row or a few rows, by testing a special unique key column using the = or IN operators.
HBase tables are not suitable for queries that produce large result sets with thousands of rows. HBase tables are also not suitable for queries that perform full table sc because the WHERE clause does not request specific values from the unique key column.
Use HBase tables for data that is inserted one row or a few rows at a time, such as by the INSERT ... VALUESsyntax. Loading data piecemeal like this into an HDFS-backed table produces many tiny files, which is a very inefficient layout for HDFS data files.
If the lack of an UPDATE statement in Impala is a problem for you, you can simulate single-row updates by doing an INSERT ... VALUES statement using an existing value for the key column. The old row value is hidden; only the new row value is seen by queries.
HBase tables are often wide (containing many columns) and sparse (with most column values NULL). For example, you might record hundreds of different data points for each user of an online service, such as whether the user had registered for an online game or enabled particular account features. With Impala and HBase, you could look up all the information for a specific customer efficiently in a single query. For any given customer, most of these columns might be NULL, because a typical customer might not make use of most features of an online service.
The Hive metastore service is a requirement. Impala shares the same metastore database as Hive, allowing Impala and Hive to access the same tables trparently.
Hive itself is optional, and does not need to be installed on the same nodes as Impala. Currently, Impala supports a wider variety of read (query) operations than write (insert) operations; you use Hive to insert data into tables that use certain file formats.
By default, Impala automatically determines the most efficient order in which to join tables using a cost-based method, based on their overall size and number of rows. (This is a new feature in Impala 1.2.2 and higher.) The COMPUTE STATS statement gathers information about each table that is crucial for efficient join performance. Impala chooses between two techniques for join queries, known as "broadcast joins" and "partitioned joins".
Impala 1.2 and higher does support UDFs and UDAs. You can either write native Impala UDFs and UDAs in C++, or reuse UDFs (but not UDAs) originally written in Java for use with Hive.
When an INSERT statement fails, it is usually the result of exceeding some limit within a Hadoop component, typically HDFS.
Yes. There are some minor differences in how some queries are handled, but Impala queries can also be completed in Hive. Impala SQL is a subset of HiveQL, with some functional limitations such as trforms.
Impala does not cache table data. It does cache some table and file metadata. Although queries might run faster on subsequent iterations because the data set was cached in the OS buffer cache, Impala does not explicitly control this.
Impala takes advantage of the HDFS caching feature in CDH @You can designate which tables or partitions are cached through the CACHED and UNCACHED clauses of the CREATE TABLE and ALTER TABLE statements. Impala can also take advantage of data that is pinned in the HDFS cache through the hdfscacheadmin command.
Cloudera offers a demonstration VM called the QuickStart VM, available in VMWare, VirtualBox, and KVM formats. For more information, see the Cloudera QuickStart VM. After booting the QuickStart VM, many services are turned off by default; in the Cloudera Manager UI that appears automatically, turn on Impala and any other components that you want to try out.
Impala deletes data files when you issue a DROP TABLE on an internal table, but not an external one. By default, the CREATE TABLE statement creates internal tables, where the files are managed by Impala. An external table is created with a CREATE EXTERNAL TABLE statement, where the files reside in a location outside the control of Impala. Issue a DESCRIBE FORMATTED statement to check whether a table is internal or external. The keyword MANAGED_TABLE indicates an internal table, from which Impala can delete the data files. The keyword EXTERNAL_TABLE indicates an external table, where Impala will leave the data files untouched when you drop the table.
Even when you drop an internal table and the files are removed from their original location, you might not get the hard drive space back immediately. By default, files that are deleted in HDFS go into a special trashcan directory, from which they are purged after a period of time (by default, 6 hours). For background information on the trashcan mechanism.
To look at the core features and functionality on Impala, the easiest way to try out Impala is to download the Cloudera QuickStart VM and start the Impala service through Cloudera Manager, then use impala-shell in a terminal window or the Impala Query UI in the Hue web interface.
To do performance testing and try out the management features for Impala on a cluster, you need to move beyond the QuickStart VM with its virtualized single-node environment. Ideally, download the Cloudera Manager software to set up the cluster, then install the Impala software through Cloudera Manager.
Impala is different from Hive and Pig because it uses its own daemons that are spread across the cluster for queries. Because Impala does not rely on MapReduce, it avoids the startup overhead of MapReduce jobs, allowing Impala to return results in real time.
In Impala 1.2 and higher, there is much less need to use the REFRESH and INVALIDATE METADATA statements:
Impala supports the HiveServer2 JDBC driver.
You might be used to running queries against a single-row table named DUAL to try out expressions, built-in functions, and UDFs. Impala does not have a DUAL table. To achieve the same result, you can issue a SELECTstatement without any table name:
Although Impala is not an in-memory database, when dealing with large tables and large result sets, you should expect to dedicate a substantial portion of physical memory for the impalad daemon. Recommended physical memory for an Impala node is 128 GB or higher. If practical, devote approximately 80% of physical memory to Impala.
The amount of memory required for an Impala operation depends on several factors:
For example, if you execute the query:
select * from giant_table order by some_column limit 1000;
and your cluster has 50 nodes, then each of those 50 nodes will trmit a maximum of 1000 rows back to the coordinator node. The coordinator node needs enough memory to sort (LIMIT * cluster_size) rows, although in the end the final result set is at most LIMIT rows, 1000 in this case.
Likewise, if you execute the query:
select * from giant_table where test_val > 100 order by some_column;
then each node filters out a set of rows matching the WHERE conditions, sorts the results (with no size limit), and sends the sorted intermediate rows back to the coordinator node. The coordinator node might need substantial memory to sort the final result set, and so might use a temporary disk work area for that final phase of the query.
The Impala statestore keeps track of how many impalad nodes are currently available. You can see this information through the statestore web interface. For example, at the URL http://statestore_host:25010/metrics you might see lines like the following:
statestore.live-backends.list:[host1:22000, host1:26000, host2:22000]
The number of impalad nodes is the number of list items referring to port 22000, in this case two. (Typically, this number is one less than the number reported by the statestore.live-backends line.) If an impalad node became unavailable or came back after an outage, the information reported on this page would change appropriately.
The load Impala generates is very similar to MapReduce. Impala contacts the NameNode during the planning phase to get the file metadata (this is only run on the host the query was sent to). Every impaladwill read files as part of normal processing of the query.
Currently, if the memory required to process intermediate results on a node exceeds the amount available to Impala on that node, the query is cancelled. You can adjust the memory available to Impala on each node, and you can fine-tune the join strategy to reduce the memory required for the biggest queries. We do plan on supporting external joins and sorting in the future.
Keep in mind though that the memory usage is not directly based on the input data set size. For aggregations, the memory usage is the number of rows after grouping. For joins, the memory usage is the combined size of the tables excluding the biggest table, and Impala can use join strategies that divide up large joined tables among the various nodes rather than trmitting the entire table to each node.
Yes. Impala scales with the number of hosts. It is important to install Impala on all the DataNodes in the cluster, because otherwise some of the nodes must do remote reads to retrieve data not available for local reads. Data locality is an important architectural aspect for Impala performance.
Impala is well-suited to executing SQL queries for interactive exploratory analytics on large data sets. Hive and MapReduce are appropriate for very long running, batch-oriented tasks such as ETL.
There are no additional steps to allow Impala to query tables managed by Hive, whether they are stored in HDFS or HBase. Make sure that Impala is configured to access the Hive metastore correctly and you should be ready to go. Keep in mind that impalad, by default, runs as the impala user, so you might need to adjust some file permissions depending on how strict your permissions are currently.
These are the main factors in the performance of Impala versus that of other Hadoop components and related technologies.
Impala avoids MapReduce. While MapReduce is a great general parallel processing model with many benefits, it is not designed to execute SQL. Impala avoids the inefficiencies of MapReduce in these ways:
Yes, Avro is supported. Impala has always been able to query Avro tables. You can use the Impala LOAD DATAstatement to load existing Avro data files into a table. Starting with Impala 1.4, you can create Avro tables with Impala. Currently, you still use the INSERT statement in Hive to copy data from another table into an Avro table.