Top 50 Spark SQL Programming Interview Questions and Answers for 27.Jul.2024

Q1. What Is Catalyst Framework?

Catalyst is the query optimization framework built into Spark SQL. It represents a query as a tree and applies transformation rules to it, so Spark can automatically rewrite SQL queries into faster execution plans, and new optimizations can be added to it over time.

Q2. Name A Few Companies That Use Apache Spark In Production.

Pinterest, Conviva, Shopify, OpenTable

Q3. Name A Few Commonly Used Spark Ecosystems.

Spark SQL (the successor to Shark)

Spark Streaming

GraphX

MLlib

SparkR

Q4. Is It Possible To Run Apache Spark On Apache Mesos?

Yes, Apache Spark can be run on the hardware clusters managed by Mesos.

Q5. How Can You Minimize Data Transfers When Working With Spark?

Minimizing data transfers and avoiding shuffles helps you write Spark programs that run quickly and reliably. The main ways to minimize data transfers when working with Apache Spark are:

Using broadcast variables - A broadcast variable makes joins between a small and a large RDD more efficient, because the small dataset is shipped to each node once instead of being shuffled.
Using accumulators - Accumulators let tasks update a shared value in parallel while they execute.
Avoiding shuffles - The most common approach is to avoid ByKey operations, repartition, and any other operations that trigger a shuffle.
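The broadcast idea above can be sketched in plain Python. This is a toy model, not Spark's API: the function name map_side_join and the sample data are hypothetical, and nested lists stand in for partitions. It only illustrates why a small lookup table copied to every task avoids moving records between partitions.

```python
def map_side_join(large_partitions, small_table):
    """Join every partition against a local copy of the small table."""
    # The dict stands in for a broadcast variable: one read-only copy
    # that every "task" can consult without any shuffle.
    broadcast = dict(small_table)
    joined = []
    for partition in large_partitions:
        for key, value in partition:
            if key in broadcast:
                joined.append((key, (value, broadcast[key])))
    return joined


# Two partitions of a "large" dataset and one small lookup table.
orders = [[(1, "book"), (2, "pen")], [(1, "lamp"), (3, "desk")]]
countries = [(1, "DE"), (2, "FR")]
result = map_side_join(orders, countries)
```

Each partition is processed where it already lives; only the small table is copied.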

Q6. Hadoop Uses Replication To Achieve Fault Tolerance. How Is This Achieved In Apache Spark?

The data storage model in Apache Spark is based on RDDs, which achieve fault tolerance through lineage. An RDD always holds the information needed to rebuild it from other datasets. If a partition of an RDD is lost due to a failure, lineage lets Spark rebuild only that lost partition.
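As a rough illustration of lineage-based recovery, here is a toy model in plain Python. It is not how Spark is implemented; the ToyRDD class and its methods are hypothetical names chosen for this sketch, and the source dataset is assumed to be durable.

```python
class ToyRDD:
    """A toy dataset that remembers how each partition is derived."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.parent = parent  # the dataset this one was derived from
        self.fn = fn          # the function applied to each parent element
        # Source datasets hold their data directly; derived ones start empty.
        self.cache = dict(enumerate(partitions)) if partitions is not None else {}

    def map(self, fn):
        # Record the dependency only; nothing is computed yet.
        return ToyRDD(parent=self, fn=fn)

    def compute(self, index):
        # A missing partition is rebuilt from the parent via the lineage.
        if index not in self.cache:
            parent_rows = self.parent.compute(index)
            self.cache[index] = [self.fn(x) for x in parent_rows]
        return self.cache[index]


source = ToyRDD(partitions=[[1, 2], [3, 4]])
doubled = source.map(lambda x: x * 2)
first_pass = [doubled.compute(i) for i in range(2)]

del doubled.cache[1]            # simulate losing one partition
recovered = doubled.compute(1)  # only partition 1 is recomputed
```

Partition 0 stays untouched in the cache while partition 1 is rebuilt from its parent.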

Q7. What Is A Parquet File?

Parquet is a columnar file format supported by many data processing systems. Spark SQL can both read and write Parquet files, and Parquet is widely considered one of the best formats for big data analytics.

Q8. What Are The Various Levels Of Persistence In Apache Spark?

Apache Spark automatically persists the intermediate data from various shuffle operations; however, users are often advised to call the persist() method on an RDD they plan to reuse. Spark offers several persistence levels that store RDDs on disk, in memory, or a combination of both, with different replication levels.

The various storage/persistence levels in Spark are -

MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
DISK_ONLY
OFF_HEAP

Q9. Is It Necessary To Install Spark On All The Nodes Of A YARN Cluster While Running Apache Spark On YARN?

No, it is not necessary, because Apache Spark runs on top of YARN.

Q10. What Is The Advantage Of A Parquet File?

Because Parquet is a columnar file format, it:

Limits I/O operations
Consumes less space
Fetches only the required columns

Q11. What Do You Understand By Pair RDD?

RDDs whose elements are key/value pairs are referred to as Pair RDDs, and Spark provides special operations for them. Pair RDDs let users operate on each key in parallel. They have a reduceByKey() method that aggregates the values for each key and a join() method that combines two RDDs based on elements having the same key.
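The semantics of these two operations can be sketched on ordinary Python lists of (key, value) tuples. This is a single-machine illustration under that assumption, not Spark's implementation; reduce_by_key and join here are hypothetical helper names that only mirror what the Spark methods compute.

```python
from collections import defaultdict


def reduce_by_key(pairs, fn):
    """Combine all values that share a key using fn."""
    acc = {}
    for key, value in pairs:
        acc[key] = fn(acc[key], value) if key in acc else value
    return sorted(acc.items())


def join(left, right):
    """Inner join: pair up values from both sides that share a key."""
    right_index = defaultdict(list)
    for key, value in right:
        right_index[key].append(value)
    return [(key, (lv, rv))
            for key, lv in left
            for rv in right_index.get(key, [])]


sales = [("a", 1), ("b", 2), ("a", 3)]
names = [("a", "apple"), ("c", "cherry")]

totals = reduce_by_key(sales, lambda x, y: x + y)
joined = join(totals, names)
```

Keys that appear on only one side ("b" and "c" here) drop out of the inner join.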

Q12. Why Is There A Need For Broadcast Variables When Working With Apache Spark?

Broadcast variables are read-only variables cached in memory on every machine. Using them eliminates the need to ship a copy of a variable with every task, so data can be processed faster. Broadcast variables are useful for keeping a lookup table in memory, which makes retrieval more efficient than an RDD lookup().

Q13. How Does Spark Use Akka?

Early versions of Spark used Akka for messaging between the workers and the master, which supports scheduling: after registering, each worker requests tasks from the master, and the master assigns them. Spark 2.0 removed the Akka dependency in favor of Spark's own RPC layer.

Q14. What Is Hive On Spark?

Hive is a component of the Hortonworks Data Platform (HDP) and provides an SQL-like interface to data stored in HDP. With Hive on Spark, Spark serves as an execution engine for Hive, so Spark users automatically get the complete set of Hive's rich features, including any new features that Hive might introduce in the future.

The main task in implementing the Spark execution engine for Hive lies in query planning, where the Hive operator plan from the semantic analyzer is translated into a task plan that Spark can execute. It also includes query execution, where the generated Spark plan is actually executed on the Spark cluster.

Q15. Explain About The Different Cluster Managers In Apache Spark

The three cluster managers supported in Apache Spark are:

YARN
Apache Mesos - Has rich resource scheduling capabilities and is well suited to running Spark alongside other applications. It is advantageous when several users run interactive shells, because it scales down the CPU allocation between commands.
Standalone deployments - Well suited to new deployments that run only Spark and are easy to set up.

Q16. Explain About The Different Types Of Transformations On DStreams

Stateless transformations - Processing of a batch does not depend on the output of the previous batch. Examples: map(), reduceByKey(), filter().
Stateful transformations - Processing of a batch depends on the intermediate results of previous batches. Examples: transformations that depend on sliding windows.
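The difference between the two kinds can be sketched in plain Python, with a list of lists standing in for the micro-batches of a stream. This is an illustration only, not the Spark Streaming API; the function names and the window length of two batches are assumptions made for the example.

```python
from collections import Counter

# Three consecutive micro-batches of a stream (hypothetical data).
batches = [["a", "b"], ["a"], ["b", "b"]]


def stateless_counts(batch):
    """Count items using only the current batch."""
    return dict(Counter(batch))


def windowed_counts(all_batches, window=2):
    """Count items over a sliding window of the most recent batches."""
    results = []
    for i in range(len(all_batches)):
        recent = all_batches[max(0, i - window + 1): i + 1]
        results.append(dict(Counter(x for b in recent for x in b)))
    return results


per_batch = [stateless_counts(b) for b in batches]
per_window = windowed_counts(batches, window=2)
```

The stateless result for a batch never changes when earlier batches change, while each windowed result depends on the batch before it.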

Q17. How Does Spark Handle Monitoring And Logging In Standalone Mode?

Spark has a web-based user interface for monitoring a cluster in standalone mode; it shows cluster and job statistics. The log output for each job is written to the work directory of the worker (slave) nodes.

Q18. Explain About The Popular Use Cases Of Apache Spark

Apache Spark is mainly used for

Iterative machine learning.
Interactive data analytics and processing.
Stream processing
Sensor data processing

Q19. What Is A DStream?

A Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets (RDDs) that represents a stream of data. DStreams can be created from various sources such as Apache Kafka, HDFS, and Apache Flume. DStreams support two kinds of operations:

Transformations that produce a new DStream.
Output operations that write data to an external system.

Q20. What Are The Languages Supported By Apache Spark For Developing Big Data Applications?

Scala, Java, Python and R are supported by Spark itself; Clojure can be used through third-party bindings.

Q21. What Is Lineage Graph?

RDDs in Spark depend on one or more other RDDs. The representation of the dependencies between RDDs is known as the lineage graph. Lineage graph information is used to compute each RDD on demand, so that whenever part of a persisted RDD is lost, the lost data can be recovered using the lineage graph.

Q22. How Does Spark Use Hadoop?

Spark has its own cluster management for computation and mainly uses Hadoop for storage.

Q23. Explain About The Common Workflow Of A Spark Program

The first step in a Spark program is to create input RDDs from external data.
Use RDD transformations such as filter() to create new, transformed RDDs based on the business logic.
persist() any intermediate RDDs that might be reused later.
Launch RDD actions such as first() and count() to begin the parallel computation, which Spark then optimizes and executes.

Q24. What Do You Understand By Lazy Evaluation?

Spark is deliberate in the way it operates on data. When you tell Spark to operate on a given dataset, it records the instructions but does nothing until it is asked for a final result. When a transformation such as map() is called on an RDD, the operation is not performed immediately; transformations in Spark are not evaluated until you perform an action. This lets Spark optimize the overall data processing workflow.
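A minimal sketch of this behaviour in plain Python follows. It is a toy model rather than Spark's implementation: the LazyDataset class is hypothetical, with map() and filter() playing the role of transformations and collect() the role of an action.

```python
class LazyDataset:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, data, ops=()):
        self._data = data
        self._ops = tuple(ops)

    def map(self, fn):
        # Transformation: remember the step, compute nothing.
        return LazyDataset(self._data, self._ops + (("map", fn),))

    def filter(self, pred):
        # Transformation: remember the step, compute nothing.
        return LazyDataset(self._data, self._ops + (("filter", pred),))

    def collect(self):
        # Action: apply every recorded step, in order, to each element.
        out = []
        for item in self._data:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []


def noisy_square(x):
    calls.append(x)  # lets us observe when the function really runs
    return x * x


ds = LazyDataset([1, 2, 3, 4]).map(noisy_square).filter(lambda x: x > 4)
calls_before_action = len(calls)
result = ds.collect()
calls_after_action = len(calls)
```

Building the pipeline triggers no calls to noisy_square; the work happens only inside collect().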

Q25. What Is Shark?

Most data users know only SQL and are not comfortable with programming. Shark is a tool developed for people from a database background, giving them access to Scala MLlib capabilities through a Hive-like SQL interface. Shark helps data users run Hive on Spark, offering compatibility with the Hive metastore, queries and data.

Q26. Is It Necessary To Start Hadoop To Run Any Apache Spark Application?

Starting Hadoop is not mandatory to run a Spark application. Spark has no separate storage layer of its own, so it commonly uses Hadoop HDFS, but this is not required: data can also be stored in, loaded from, and processed on the local file system.

Q27. How Can You Remove The Elements With A Key Present In Any Other RDD?

Use the subtractByKey() function.
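What this operation computes can be shown on plain Python lists of (key, value) tuples. The helper subtract_by_key below is a hypothetical single-machine stand-in that mirrors the semantics, not the Spark method itself.

```python
def subtract_by_key(left, right):
    """Keep the pairs from left whose key does not appear in right."""
    right_keys = {key for key, _ in right}
    return [(key, value) for key, value in left if key not in right_keys]


inventory = [("a", 1), ("b", 2), ("c", 3)]
discontinued = [("b", "old"), ("d", "old")]
remaining = subtract_by_key(inventory, discontinued)
```

Only the keys of the second dataset matter; its values are ignored.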

Q28. What Are The Key Features Of Apache Spark That You Like?

Spark provides advanced analytics options such as graph algorithms, machine learning and streaming data.
It has built-in APIs in multiple languages: Java, Scala, Python and R.
It offers good performance gains: it can run an application in a Hadoop cluster up to ten times faster on disk and up to 100 times faster in memory.

Q29. What Do You Understand By Executor Memory In A Spark Application?

Every Spark application has the same fixed heap size and a fixed number of cores for each of its executors. The heap size is referred to as the Spark executor memory, which is controlled with the spark.executor.memory property or the --executor-memory flag. In the simplest case, an application has one executor on each worker node. The executor memory is essentially a measure of how much of the worker node's memory the application will use.

Q30. How Can You Compare Hadoop And Spark In Terms Of Ease Of Use?

Hadoop MapReduce requires programming in Java, which is difficult, though Pig and Hive make it considerably easier; learning Pig and Hive syntax still takes time. Spark has interactive APIs for languages such as Java, Python and Scala, and also includes Spark SQL (formerly Shark) for SQL users, making it comparatively easier to use than Hadoop.

Q31. Why Is BlinkDB Used?

BlinkDB is a query engine for executing interactive SQL queries on huge volumes of data and renders query results marked with meaningful error bars. BlinkDB helps users balance ‘query accuracy’ with response time.

Q32. What Is RDD?

RDDs (Resilient Distributed Datasets) are the basic abstraction in Apache Spark and represent the data coming into the system in object format. RDDs are used for in-memory computations on large clusters in a fault-tolerant manner. RDDs are read-only, partitioned collections of records that are:

Immutable - RDDs cannot be altered.
Resilient - If a node holding a partition fails, the partition is rebuilt on another node.

Q33. What Is Spark SQL?

Spark SQL, the successor to Shark, is a Spark module for working with structured data and performing structured data processing. Through this module, Spark executes relational SQL queries on the data. The core of the component supports a special kind of RDD called SchemaRDD (renamed DataFrame in Spark 1.3), composed of row objects and a schema that defines the data type of each column in the row. It is similar to a table in a relational database.

Q34. What Is The Default Level Of Parallelism In Apache Spark?

If the user does not specify it explicitly, the number of partitions is taken as the default level of parallelism in Apache Spark.

Q35. What Is A Sparse Vector?

A sparse vector has two parallel arrays, one for indices and the other for values. These vectors store only the non-zero entries to save space.
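The two-parallel-arrays layout can be sketched with a small plain-Python class. This SparseVector is a hypothetical toy written for illustration and is not the MLlib class of the same name; it assumes the indices are valid positions within the given size.

```python
class SparseVector:
    """Stores only non-zero entries as parallel index and value arrays."""

    def __init__(self, size, indices, values):
        if len(indices) != len(values):
            raise ValueError("indices and values must have the same length")
        self.size = size
        self.indices = list(indices)  # positions of the non-zero entries
        self.values = list(values)    # the non-zero entries themselves

    def to_dense(self):
        dense = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            dense[i] = v
        return dense

    def dot(self, dense):
        # Only the stored entries contribute, so zeros cost nothing.
        return sum(v * dense[i] for i, v in zip(self.indices, self.values))


sv = SparseVector(5, [1, 3], [2.0, 4.0])
as_dense = sv.to_dense()
product = sv.dot([1.0, 1.0, 1.0, 1.0, 1.0])
```

A vector of length 5 with two non-zero entries needs only two indices and two values instead of five stored numbers.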

Q36. Explain About The Major Libraries That Constitute The Spark Ecosystem

Spark MLlib - Machine learning library in Spark for commonly used learning algorithms such as clustering, regression and classification.
Spark Streaming - This library is used to process real-time streaming data.
Spark GraphX - Spark API for graph-parallel computations with basic operators such as joinVertices, subgraph and aggregateMessages.
Spark SQL - Helps execute SQL-like queries on Spark data using standard visualization or BI tools.

Q37. Can We Do Real-time Processing Using Spark SQL?

Not directly but we can register an existing RDD as a SQL table and trigger SQL queries on top of that.

Q38. What Are The Benefits Of Using Spark With Apache Mesos?

It renders scalable partitioning among various Spark instances and dynamic partitioning between Spark and other big data frameworks.

Q39. How Can You Trigger Automatic Clean-ups In Spark To Handle Accumulated Metadata?

You can trigger the clean-ups by setting the parameter 'spark.cleaner.ttl' (available in Spark 1.x releases), or by dividing long-running jobs into batches and writing the intermediate results to disk.

Q40. When Running Spark Applications, Is It Necessary To Install Spark On All The Nodes Of YARN Cluster?

Spark need not be installed on every node when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without effecting any change to the cluster.

Q41. Can You Use Spark To Access And Analyse Data Stored In Cassandra Databases?

Yes, it is possible if you use Spark Cassandra Connector.

Q42. List The Functions Of Spark SQL.

Spark SQL is capable of:

Loading data from a variety of structured sources
Querying data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC). For instance, using business intelligence tools like Tableau
Providing rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more

Q43. What Are The Common Mistakes Developers Make When Running Spark Applications?

Developers often make the mistake of-

Hitting the web service several times by using multiple clusters.
Running everything on the local node instead of distributing it.

Developers need to be careful with this, as Spark makes use of memory for processing.

Q44. What Are Benefits Of Spark Over MapReduce?

Due to the availability of in-memory processing, Spark can run processing around 10-100x faster than Hadoop MapReduce, which relies on persistent storage for all of its data processing tasks.

Unlike Hadoop, Spark provides built-in libraries to perform multiple tasks from the same core, such as batch processing, streaming, machine learning and interactive SQL queries. Hadoop MapReduce supports only batch processing.
Hadoop is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.
Spark can perform computations multiple times on the same dataset, which is called iterative computation; Hadoop MapReduce has no built-in support for iterative computing.

Q45. What Makes Apache Spark Good At Low-latency Workloads Like Graph Processing And Machine Learning?

Apache Spark stores data in memory for faster model building and training. Machine learning algorithms require multiple iterations to produce an optimal model, and graph algorithms similarly traverse all the nodes and edges. Workloads that need multiple iterations benefit from keeping data in memory, which leads to increased performance. Less disk access and controlled network traffic make a huge difference when there is a lot of data to be processed.

Q46. How Is Spark SQL Different From HQL And SQL?

Spark SQL is a special component on the Spark Core engine that supports SQL and Hive Query Language (HQL) without changing any syntax. It is possible to join a SQL table and an HQL table.

Q47. What Does The Spark Engine Do?

Spark engine schedules, distributes and monitors the data application across the spark cluster.

Q48. How Can You Achieve High Availability In Apache Spark?

Implementing single node recovery with local file system
Using standby Masters with Apache ZooKeeper.

Q49. Is Apache Spark A Good Fit For Reinforcement Learning?

No. Apache Spark works well only for simple machine learning algorithms like clustering, regression, classification.

Q50. What Do You Understand By SchemaRDD?

An RDD that consists of row objects (wrappers around basic string or integer arrays) with schema information about the type of data in each column.
