Top 26 Apache Tajo Interview Questions You Must Prepare 26.Apr.2024

Some salient features of Tajo are:

  • Superior scalability and optimized performance
  • Low latency
  • User-defined functions
  • Row/columnar storage processing framework.
  • Compatibility with HiveQL and Hive MetaStore
  • Simple data flow and easy maintenance.

The logical view of a data source is defined as a table. A table consists of various properties such as a logical schema, partitions, and a URL. A Tajo table can be a directory in HDFS, a single file, an HBase table, or an RDBMS table.

The types of tables supported by Apache Tajo are:

External table: An external table needs the location property when it is created. For instance, if the data already exists as Text/JSON files or an HBase table, it can be registered as a Tajo external table. The following query is an example of external table creation.

create external table sample(col1 int,col2 text,col3 int) 

location 'hdfs://path/to/table';

Internal table: An internal table is also called a managed table. It is created in a pre-defined physical location called the tablespace.

create table table1(col1 int,col2 text);

By default, Tajo uses “tajo.warehouse.directory”, which is set in “conf/tajo-site.xml”. A tablespace configuration can be used to assign a different location to the table.

Tajo supports the following storage systems:

  • HDFS
  • JDBC
  • Amazon S3
  • Apache HBase
  • Elasticsearch

Apache Tajo is a relational and distributed data processing framework. It is designed for low latency and scalable ad-hoc query analysis.

  • Tajo supports standard SQL and various data formats. Most of the Tajo queries can be executed without any modification.
  • Tajo provides fault tolerance through a restart mechanism for failed tasks, and it has an extensible query rewrite engine.
  • Tajo performs the necessary ETL (Extract, Transform, and Load) operations to summarize large datasets stored on HDFS. It is an alternative to Hive/Pig.

To add a new column to the “students” table, use the following syntax:

ALTER TABLE <table_name> ADD COLUMN <column_name> <data_type>

 alter table students add column grade text;
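The effect of the statement above can be tried locally with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for Tajo, which is an assumption made only so the example runs without a cluster. The starting schema of "students" (id, name) is hypothetical.

```python
import sqlite3

def add_grade_column():
    """Add a 'grade' column to a toy 'students' table and return its column names."""
    conn = sqlite3.connect(":memory:")  # in-memory SQLite, not a Tajo connection
    # Hypothetical starting schema; the source does not show the real one.
    conn.execute("create table students (id int, name text)")
    # Same statement as in the text above.
    conn.execute("alter table students add column grade text")
    # PRAGMA table_info is SQLite-specific; row[1] holds the column name.
    columns = [row[1] for row in conn.execute("pragma table_info(students)")]
    conn.close()
    return columns

print(add_grade_column())
```

In Tajo itself, the same check would be done from tsql rather than through a PRAGMA.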

The HAVING clause enables you to specify conditions that filter which group results appear in the final results. The WHERE clause places conditions on the selected columns, whereas the HAVING clause places conditions on the groups created by the GROUP BY clause.

SELECT column1, column2 FROM table1 

     GROUP BY column HAVING [ conditions ]

 select age from mytable group by age having sum(mark) > 200;
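The following sketch illustrates how HAVING filters groups rather than rows. It runs the query above against SQLite (Python's built-in sqlite3 module) as a stand-in for Tajo, and the sample rows in "mytable" are invented for illustration.

```python
import sqlite3

def ages_with_total_mark_over(threshold):
    """Return the ages whose summed marks exceed the threshold."""
    conn = sqlite3.connect(":memory:")  # SQLite stand-in, not Tajo
    conn.execute("create table mytable (id int, name text, age int, mark int)")
    # Invented sample data: age 12 sums to 265, age 13 to 135, age 14 to 70.
    conn.executemany(
        "insert into mytable values (?, ?, ?, ?)",
        [
            (1, "Adam", 12, 90),
            (2, "Amit", 12, 95),
            (3, "Bob", 12, 80),
            (4, "David", 13, 85),
            (5, "Esha", 13, 50),
            (6, "Ganga", 14, 70),
        ],
    )
    # GROUP BY builds one group per age; HAVING keeps only qualifying groups.
    rows = conn.execute(
        "select age from mytable group by age having sum(mark) > ? order by age",
        (threshold,),
    ).fetchall()
    conn.close()
    return [age for (age,) in rows]

print(ages_with_total_mark_over(200))
```

A WHERE clause could not express this condition, because sum(mark) only exists after the rows have been grouped.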

A tablespace defines a location in the storage system. Tablespaces are supported only for internal tables and are accessed by their names. Each tablespace can use a different storage type. If no tablespace is specified, Tajo uses the default tablespace in the root directory. Records of a Tajo internal table can be accessed only from another table. An internal table can be configured with a tablespace.

CREATE TABLE [IF NOT EXISTS] <table_name>

   [(column_list)] [TABLESPACE tablespace_name]

   [using <storage_type> [with (<key> = <value>, ...)]] [AS <select_statement>]

Some of the SQL functions supported by Apache Tajo are categorized into:

  • Math Functions
  • String Functions
  • DateTime Functions
  • JSON Functions

Start server

$ bin/start-tajo.sh

Start Shell

$ bin/tsql

List Database

default> \l

List out Built-in Functions

default> \df

Describe Function: \df <function name> returns the complete description of the given function.

default> \df sqrt

Quit Terminal

default> \q

Cluster Info

default> \admin -cluster

Show master

default> \admin -showmasters

Worker Heap Memory Size: The environment variable TAJO_WORKER_HEAPSIZE in conf/tajo-env.sh allows the Tajo Worker to use the specified heap memory size. To adjust the heap memory size, set the TAJO_WORKER_HEAPSIZE variable in conf/tajo-env.sh to a proper size as follows:

TAJO_WORKER_HEAPSIZE=8000

The default size is 1000 (1GB).

Temporary Data Directory: TajoWorker stores temporary data on the local file system because it uses out-of-core algorithms. You can specify one or more directories where this temporary data will be stored.

Maximum number of parallel running tasks for each worker: Each worker can execute multiple tasks at a time. Tajo allows users to specify the maximum number of parallel running tasks for each worker.

Client: Client submits the SQL statements to the Tajo Master to get the result.

Master: Master is the main daemon. It is responsible for query planning and is the coordinator for workers.

Catalog server: Maintains the table and index descriptions. It is embedded in the Master daemon. The catalog server uses Apache Derby as the storage layer and connects via JDBC client.

Worker: The Master node assigns tasks to worker nodes. TajoWorker processes data. As the number of TajoWorkers increases, the processing capacity also increases linearly.

Query Master: The Tajo Master assigns a query to the Query Master. The Query Master is responsible for controlling a distributed execution plan. It launches the TaskRunner and schedules tasks to the TaskRunner. The main role of the Query Master is to monitor the running tasks and report them to the Master node.

Node Managers: Manage the resources of the worker node and decide on allocating requests to the node.

TaskRunner: Acts as a local query execution engine. It runs and monitors query processing. The TaskRunner processes one task at a time.

A task has the following three main attributes:

Logical plan - An execution block which created the task.

A fragment - an input path, an offset range, and schema.

Fetches URIs:

Query Executor: It is used to execute a query.

Storage service: Connects the underlying data storage to Tajo.

Tajo’s configuration is based on Hadoop’s configuration system.

Tajo uses two config files:

catalog-site.xml: configuration for the catalog server.

tajo-site.xml: configuration for other Tajo modules. Tajo has a variety of internal configs. If you don’t set a config explicitly, its default value is used. Tajo is designed to need only a few configs in usual cases, so you often do not need to be concerned with the configuration.

By default, there is no tajo-site.xml in the $TAJO_HOME/conf directory. To set some configs, first copy $TAJO_HOME/conf/tajo-site.xml.template to tajo-site.xml. Then, add the configs to your tajo-site.xml.
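Because Tajo's configuration follows Hadoop's configuration system, a tajo-site.xml is a list of name/value properties inside a configuration element. The sketch below generates such a file body with Python's standard library. The property name tajo.warehouse.directory comes from this article; the value shown is a hypothetical path, and build_site_xml is a helper invented for this example.

```python
import xml.etree.ElementTree as ET

def build_site_xml(properties):
    """Render a dict of config names and values as Hadoop-style site XML."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

# Hypothetical value; replace it with the warehouse location of your cluster.
site_xml = build_site_xml(
    {"tajo.warehouse.directory": "hdfs://namenode:9000/tajo/warehouse"}
)
print(site_xml)
```

Only the properties you list are written, which matches the idea that every unset config falls back to its default.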

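The syntax used to drop a database is:

DROP DATABASE [IF EXISTS] <database_name>

Switch to another database before dropping the one you are connected to:

Ex: test> \c default

default> drop database test;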

Predicates: A predicate is an expression that evaluates to TRUE, FALSE, or UNKNOWN. Predicates are used in the search conditions of WHERE and HAVING clauses, and in other constructs that require a Boolean value.
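The three possible outcomes of a predicate can be observed with a small sketch. It uses SQLite (Python's built-in sqlite3 module) as a stand-in for Tajo, on the assumption that both follow standard SQL three-valued logic for NULL comparisons. The table contents are invented.

```python
import sqlite3

def classify_by_predicate():
    """Group names by whether 'mark > 50' is true, false, or unknown."""
    conn = sqlite3.connect(":memory:")  # SQLite stand-in, not Tajo
    conn.execute("create table mytable (name text, mark int)")
    # Carol has no mark, so 'mark > 50' is UNKNOWN (NULL) for her row.
    conn.executemany(
        "insert into mytable values (?, ?)",
        [("Adam", 90), ("Bob", 40), ("Carol", None)],
    )

    def names(where):
        rows = conn.execute(f"select name from mytable where {where} order by name")
        return [name for (name,) in rows]

    result = {
        "true": names("mark > 50"),
        "false": names("not (mark > 50)"),
        "unknown": names("(mark > 50) is null"),
    }
    conn.close()
    return result

print(classify_by_predicate())
```

A WHERE clause keeps only rows for which the predicate is TRUE, so the row with the missing mark is returned by neither the predicate nor its negation.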

Explain: EXPLAIN returns the query execution plan of a statement, including its logical plan and global plan.

Join: SQL joins are used to combine rows from two or more tables.

The following are the different types of SQL Joins:

  • Inner join
  • { LEFT | RIGHT | FULL } OUTER JOIN
  • Cross join
  • Self join
  • Natural join
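As a rough illustration of how some of these join types differ, the sketch below runs an inner join, a left outer join, and a cross join on two tiny tables. SQLite (Python's built-in sqlite3 module) stands in for Tajo here, and the "students" and "marks" tables with their rows are invented for the example.

```python
import sqlite3

def run_joins():
    """Return the results of three join types over two small sample tables."""
    conn = sqlite3.connect(":memory:")  # SQLite stand-in, not Tajo
    conn.execute("create table students (id int, name text)")
    conn.execute("create table marks (student_id int, mark int)")
    conn.executemany(
        "insert into students values (?, ?)",
        [(1, "Adam"), (2, "Bob"), (3, "Carol")],
    )
    # Carol (id 3) deliberately has no row in marks.
    conn.executemany("insert into marks values (?, ?)", [(1, 90), (2, 80)])

    inner = conn.execute(
        "select s.name, m.mark from students s "
        "inner join marks m on s.id = m.student_id order by s.id"
    ).fetchall()
    left = conn.execute(
        "select s.name, m.mark from students s "
        "left outer join marks m on s.id = m.student_id order by s.id"
    ).fetchall()
    cross_count = conn.execute(
        "select count(*) from students cross join marks"
    ).fetchone()[0]
    conn.close()
    return {"inner": inner, "left": left, "cross_count": cross_count}

print(run_joins())
```

The inner join drops the student without a mark, the left outer join keeps that student with a NULL mark, and the cross join pairs every student with every mark row.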

Apache Tajo offers the following benefits:

  • Easy to use
  • Simplified architecture
  • Cost-based query optimization
  • Vectorized query execution plan
  • Fast delivery
  • Simple I/O mechanism and support for various types of storage
  • Fault tolerance

The CREATE INDEX statement is used to create indexes on tables. An index is used for fast retrieval of data. The current version supports indexes only for plain TEXT formats stored on HDFS.

CREATE INDEX [ name ] ON table_name ( { column_name | ( expression ) } )

create index student_index on mytable(id);
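The statement above can be tried with a minimal sketch that uses SQLite (Python's built-in sqlite3 module) as a stand-in for Tajo. The schema of "mytable" is an assumption, and the PRAGMA used to list indexes is SQLite-specific, not a Tajo command.

```python
import sqlite3

def create_student_index():
    """Create student_index on mytable(id) and return the table's index names."""
    conn = sqlite3.connect(":memory:")  # SQLite stand-in, not Tajo
    conn.execute("create table mytable (id int, name text)")  # assumed schema
    # Same statement as in the text above.
    conn.execute("create index student_index on mytable(id)")
    # PRAGMA index_list is SQLite-specific; row[1] holds the index name.
    index_names = [row[1] for row in conn.execute("pragma index_list(mytable)")]
    conn.close()
    return index_names

print(create_student_index())
```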

The SET PROPERTY clause of ALTER TABLE is used to change a table’s properties.

ALTER TABLE students SET PROPERTY 'compression.type' = 'RECORD',

'compression.codec' = 'org.apache.hadoop.io.compress.SnappyCodec' ;

Window functions execute on a set of rows and return a single value for each row. A window function in a query defines its window using the OVER() clause.

The OVER() clause has the following capabilities:

  • Defines window partitions to form groups of rows. (PARTITION BY clause)
  • Orders rows within a partition. (ORDER BY clause)

Some of the window functions are:

  • rank()
  • row_number()
  • lead(value[, offset integer[, default any]])
  • lag(value[, offset integer[, default any]])
  • first_value(value)
  • last_value(value)
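The sketch below shows PARTITION BY and ORDER BY inside OVER() with two of the listed functions, rank() and first_value(). It runs on SQLite through Python's built-in sqlite3 module as a stand-in for Tajo, and it assumes SQLite 3.25 or newer, the first release with window functions. The sample rows are invented.

```python
import sqlite3

def rank_marks_within_age():
    """Rank students by mark inside each age group and show the top mark per group."""
    conn = sqlite3.connect(":memory:")  # SQLite stand-in, not Tajo
    conn.execute("create table mytable (name text, age int, mark int)")
    # Invented sample data; Adam and Bob tie on purpose to show how rank() treats ties.
    conn.executemany(
        "insert into mytable values (?, ?, ?)",
        [("Adam", 12, 90), ("Amit", 12, 95), ("Bob", 12, 90), ("David", 13, 85)],
    )
    rows = conn.execute(
        "select name, age, mark, "
        "rank() over (partition by age order by mark desc) as mark_rank, "
        "first_value(mark) over (partition by age order by mark desc) as top_mark "
        "from mytable order by age, mark desc, name"
    ).fetchall()
    conn.close()
    return rows

for row in rank_marks_within_age():
    print(row)
```

Unlike GROUP BY, the query still returns one row per student; the window only determines which neighbouring rows each function sees.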

Tajo supports the PostgreSQL storage handler. It enables user queries to access database objects in PostgreSQL. The handler is available by default in Tajo, so it is easy to configure.

{
  "spaces": {
    "postgre": {
      "uri": "jdbc:postgresql://hostname:port/database1",
      "configs": {
        "mapped_database": "sampledb",
        "connection_properties": {
          "user": "tajo",
          "password": "pwd"
        }
      }
    }
  }
}

Here, “database1” refers to the PostgreSQL database, which is mapped to the database “sampledb” in Tajo.
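Hand-written JSON is easy to break with a missing comma or a typographic quote, so one option is to generate the configuration and parse it back before using it. The sketch below does that with Python's json module. build_storage_config is a helper invented for this example, the key names are taken from the configuration above, and 5432 is PostgreSQL's default port used here as a placeholder.

```python
import json

def build_storage_config(host, port, pg_database, tajo_database, user, password):
    """Build the PostgreSQL tablespace configuration as a Python dict."""
    return {
        "spaces": {
            "postgre": {
                "uri": f"jdbc:postgresql://{host}:{port}/{pg_database}",
                "configs": {
                    "mapped_database": tajo_database,
                    "connection_properties": {"user": user, "password": password},
                },
            }
        }
    }

# Placeholder values mirroring the example in the text.
config_text = json.dumps(
    build_storage_config("hostname", 5432, "database1", "sampledb", "tajo", "pwd"),
    indent=2,
)
# json.loads raises an error if the text is not valid JSON.
json.loads(config_text)
print(config_text)
```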

A table column may contain duplicate values. The DISTINCT keyword can be used to return only distinct (different) values.

SELECT DISTINCT column1,column2 FROM table_name;

 select distinct age from mytable;
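A minimal sketch of the query above, run on SQLite (Python's built-in sqlite3 module) as a stand-in for Tajo, with invented rows that contain repeated ages:

```python
import sqlite3

def distinct_ages():
    """Return each age that occurs in mytable exactly once, in ascending order."""
    conn = sqlite3.connect(":memory:")  # SQLite stand-in, not Tajo
    conn.execute("create table mytable (name text, age int)")
    # Invented sample data with duplicate ages.
    conn.executemany(
        "insert into mytable values (?, ?)",
        [("Adam", 12), ("Amit", 12), ("Bob", 13), ("David", 14), ("Esha", 13)],
    )
    # ORDER BY is added only to make the output order predictable.
    rows = conn.execute("select distinct age from mytable order by age").fetchall()
    conn.close()
    return [age for (age,) in rows]

print(distinct_ages())
```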

Apache Tajo supports the following data formats:

  • JSON
  • Text file(CSV)
  • Parquet
  • Sequence File
  • AVRO
  • Protocol Buffer
  • Apache ORC

The statement used to create a database in Tajo is CREATE DATABASE. Its syntax is:

CREATE DATABASE [IF NOT EXISTS] <database_name>

Ex: default> create database if not exists test;

If you want to customize the catalog service, copy $TAJO_HOME/conf/catalog-site.xml.template to catalog-site.xml. Then, add the following configs to catalog-site.xml. Note that the default configs are enough to launch Tajo cluster in most cases.

tajo.catalog.master.addr - If you want to launch a Tajo cluster in distributed mode, you must specify this address. For more detailed information, see Default Ports.

tajo.catalog.store.class - If you want to change the persistent storage of the catalog server, specify the class name. Its default value is tajo.catalog.store.DerbyStore. In the current version, Tajo provides three persistent storage classes as follows:

tajo.catalog.store.DerbyStore - this storage class uses Apache Derby.

tajo.catalog.store.MySQLStore - this storage class uses MySQL.

tajo.catalog.store.MemStore - this is the in-memory storage. It is only used in unit tests to shorten the duration of unit tests.

To insert records in the 'test' table, type the following query.

db sample> insert overwrite into test select * from mytable;
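INSERT OVERWRITE replaces the existing contents of the target table with the result of the SELECT. SQLite has no INSERT OVERWRITE statement, so the sketch below only emulates that behaviour with a DELETE followed by INSERT ... SELECT inside one transaction. It is an approximation for illustration, not how Tajo implements the statement, and the table contents are invented.

```python
import sqlite3

def overwrite_test_from_mytable():
    """Replace all rows of 'test' with the rows of 'mytable'."""
    conn = sqlite3.connect(":memory:")  # SQLite stand-in, not Tajo
    conn.execute("create table mytable (id int, name text)")
    conn.execute("create table test (id int, name text)")
    conn.executemany(
        "insert into mytable values (?, ?)", [(1, "Adam"), (2, "Bob")]
    )
    # A stale row that the overwrite should remove.
    conn.execute("insert into test values (99, 'stale')")

    # 'with conn' commits on success and rolls back if a statement fails.
    with conn:
        conn.execute("delete from test")
        conn.execute("insert into test select * from mytable")

    rows = conn.execute("select id, name from test order by id").fetchall()
    conn.close()
    return rows

print(overwrite_test_from_mytable())
```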

To launch the tajo master, execute start-tajo.sh.

$ $TAJO_HOME/sbin/start-tajo.sh

After that, you can use tsql to access the command-line interface of Tajo. To learn how to use tsql, read the Tajo Interactive Shell document.

$ $TAJO_HOME/bin/tsql