Top 50 Apache Spark Interview Questions and Answers

Q1. What is RDD?

Ans. RDD (Resilient Distributed Dataset) is a fault-tolerant collection of elements that can be operated on in parallel. The partitioned data in an RDD is immutable and distributed across the cluster.

Q2. Name the different types of RDD.

Ans. There are primarily two types of RDDs – parallelized collections and Hadoop datasets.

Q3. What are the methods of creating RDDs in Spark?

Ans. There are two methods –

By parallelizing a collection in your Driver program.
By loading an external dataset from external storage like HDFS, HBase, or a shared file system (both methods are sketched below).
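A minimal Scala sketch of both methods (the app name, master URL, and HDFS path are placeholders chosen for illustration, not values from this article):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("rdd-creation-demo").setMaster("local[*]")
val sc = new SparkContext(conf)

// Method 1: parallelize an in-memory collection from the driver program.
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Method 2: load an external dataset, e.g. a text file stored in HDFS.
val lines = sc.textFile("hdfs://namenode:9000/data/input.txt")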

Q4. What is a Sparse Vector?

Ans. A sparse vector has two parallel arrays – one for indices and the other for values.
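For example, a small Scala sketch using Spark's Vectors helper (the sizes and values below are made up for illustration):

import org.apache.spark.ml.linalg.Vectors

// Length-6 vector with non-zero entries only at indices 1 and 4:
// indices array -> Array(1, 4), values array -> Array(3.0, 7.5)
val sparse = Vectors.sparse(6, Array(1, 4), Array(3.0, 7.5))

// Dense equivalent, which stores every position explicitly.
val dense = Vectors.dense(0.0, 3.0, 0.0, 0.0, 7.5, 0.0)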

Q5. Mention some of the areas where Spark outperforms Hadoop in processing.

Ans. Sensor data processing, real-time querying of data, and stream processing.

Q6. Which languages does Apache Spark support?

Ans. There are four languages supported by Apache Spark – Scala, Java, Python, and R. Scala is the most popular one.

Q7. What is YARN?

Ans. YARN (Yet Another Resource Negotiator) is Hadoop's cluster resource management platform. Running Spark on YARN provides central resource management and delivers scalable operations across the cluster.

Q8. Do you need to install Spark on all nodes of the YARN cluster? Why?

Ans. No. Because Spark runs on top of YARN, it does not need to be installed on every node; YARN takes care of distributing the Spark runtime to the worker nodes.

Q9. Is it possible to run Apache Spark on Apache Mesos?

Ans. Yes, Spark can run on hardware clusters managed by Apache Mesos.

Q10. What is the lineage graph?

Ans. RDDs in Spark depend on one or more other RDDs. The representation of these dependencies between RDDs is known as the lineage graph.
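As a rough illustration, toDebugString prints the lineage of an RDD in spark-shell (where sc is predefined):

val base = sc.parallelize(1 to 100)
val doubled = base.map(_ * 2)
val filtered = doubled.filter(_ % 3 == 0)

// Prints the chain of parent RDDs that Spark would use to
// recompute 'filtered' if a partition were lost.
println(filtered.toDebugString)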

Q11. Define Partitions in Apache Spark

Ans. A partition is a smaller, logical division of data, similar to a 'split' in MapReduce. It is a logical chunk of a large distributed data set. Partitioning is the process of deriving logical units of data in order to speed up processing.
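A small sketch of how partitions can be inspected and changed in spark-shell (the partition counts are chosen only for illustration):

val rdd = sc.parallelize(1 to 1000, 8)   // explicitly request 8 partitions
println(rdd.getNumPartitions)            // 8

// repartition() changes the number of logical chunks (it triggers a shuffle).
val coarser = rdd.repartition(4)
println(coarser.getNumPartitions)        // 4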

Q12. What is a DStream?

Ans. A Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets that represents a stream of data.

Q13. What is a Catalyst framework?

Ans. Catalyst framework is an optimization framework present in Spark SQL. It allows Spark to automatically transform SQL queries by adding new optimizations to build a faster processing system.

Q14. What are the Actions in Spark?

Ans. An action brings data back from an RDD to the local machine (the driver). An action's execution is the result of all previously created transformations.

Q15. What is a Parquet file?

Ans. Parquet is a columnar format file supported by many other data processing systems.

Q16. What is GraphX?

Ans. Spark uses GraphX for graph processing to build and transform interactive graphs.
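A minimal GraphX sketch, assuming spark-shell (the vertex and edge data are made up for illustration):

import org.apache.spark.graphx.{Edge, Graph}

// Vertices are (id, property) pairs; edges are (srcId, dstId, property).
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph = Graph(vertices, edges)
println(graph.vertices.count())                     // 3
println(graph.inDegrees.collect().mkString(", "))   // in-degree per vertex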

Q17. What file systems does Spark support?

Ans. Spark supports the following systems:

Hadoop Distributed File System (HDFS)
Local file system
Amazon S3

Q18. What is the difference between persist() and cache()?

Ans. persist() allows the user to specify the storage level, whereas cache() uses the default storage level.
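A short sketch of the difference, assuming spark-shell (the HDFS paths are placeholders):

import org.apache.spark.storage.StorageLevel

val logsA = sc.textFile("hdfs://namenode:9000/data/a.txt")
val logsB = sc.textFile("hdfs://namenode:9000/data/b.txt")

// cache() always uses the default storage level (MEMORY_ONLY for RDDs).
logsA.cache()

// persist() lets you pick the level, e.g. spill to disk when memory is full.
logsB.persist(StorageLevel.MEMORY_AND_DISK)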

Q19. What do you understand by SchemaRDD?

Ans. SchemaRDD is an RDD that consists of row objects (wrappers around the basic string or integer arrays) with schema information about the type of data in each column.

Q20. What is a lazy evaluation in Spark?

Ans. For large data in Spark, multiple operations take place even for the execution of a basic transformation. When a transformation is called on an RDD, the operation does not occur immediately. Transformations in Spark are not evaluated until you trigger an action. This is known as lazy evaluation. It avoids unnecessary memory and CPU usage that could take place due to certain mistakes.
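The behaviour can be seen in a small spark-shell sketch (the path is a placeholder):

val lines = sc.textFile("hdfs://namenode:9000/data/app.log")   // nothing is read yet
val errors = lines.filter(_.contains("ERROR"))                 // still nothing executed

// Only this action triggers the whole chain: read the file, filter, count.
val numErrors = errors.count()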

Q21. What is Apache Spark?

Ans. Apache Spark is an easy-to-use and flexible data processing framework. Spark can run on Hadoop, standalone, or in the cloud. It is capable of accessing diverse data sources, including HDFS, Cassandra, and others.

Q22. Explain DStream with reference to Apache Spark

Ans. A DStream is a sequence of Resilient Distributed Datasets that represents a stream of data. You can create a DStream from various sources like HDFS, Apache Flume, Apache Kafka, etc.

Q23. Name three data sources available in Spark SQL

Ans. Three data sources available in Spark SQL are (a short read sketch follows this list):

JSON Datasets
Hive tables
Parquet file
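A minimal sketch of reading each source with a SparkSession (the paths and table name are placeholders; enableHiveSupport is needed for Hive tables):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("datasource-demo")
  .enableHiveSupport()
  .getOrCreate()

val jsonDf    = spark.read.json("hdfs://namenode:9000/data/events.json")        // JSON dataset
val hiveDf    = spark.table("sales_db.orders")                                   // Hive table
val parquetDf = spark.read.parquet("hdfs://namenode:9000/data/users.parquet")    // Parquet file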

Q24. Name some internal daemons used in Spark.

Ans. Important daemons used in Spark are BlockManager, MemoryStore, DAGScheduler, Driver, Worker, Executor, Tasks, etc.

Q25. Define the term ‘Sparse Vector.’

Ans. A sparse vector is a vector that has two parallel arrays, one for indices and one for values, used for storing non-zero entries to save space.

Q26. Name the languages supported by Apache Spark for developing big data applications

Ans. Important languages used for developing big data applications are:

Java
Python
R
Clojure
Scala

Q27. What is the method to create a DataFrame?

Ans. In Apache Spark, a DataFrame can be created from tables in Hive and from structured data files.

Q28. Explain SchemaRDD

Ans. An RDD that consists of row objects with schema information about the type of data in each column is called a SchemaRDD.

Q29. What are accumulators?

Ans. Accumulators are write-only variables (from the workers' point of view). They are initialized once and sent to the workers. The workers update them based on the logic written, and the updated values are sent back to and aggregated on the driver.
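A small sketch using a built-in long accumulator in spark-shell (the path and the "badRecords" name are placeholders):

val badRecords = sc.longAccumulator("badRecords")

sc.textFile("hdfs://namenode:9000/data/input.txt").foreach { line =>
  // Workers only add to the accumulator; they never read its value.
  if (line.trim.isEmpty) badRecords.add(1)
}

// The merged value is reliable only on the driver, after the action has run.
println(s"Empty lines seen: ${badRecords.value}")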

Q30. What are the components of the Spark ecosystem?

Ans. The important components of Spark are:

Spark Core: The base engine for large-scale parallel and distributed data processing
Spark Streaming: Used for real-time data streaming
Spark SQL: Integrates relational processing with Spark's functional programming API
GraphX: Enables graphs and graph-parallel computation
MLlib: Lets you perform machine learning in Apache Spark

Q31. Name three features of using Apache Spark

Ans. The three most important features of Apache Spark are:

Support for sophisticated analytics
Integration with Hadoop and existing Hadoop data
The ability to run applications in a Hadoop cluster up to 100 times faster in memory and ten times faster on disk

Q32. Explain the default level of parallelism in Apache Spark

Ans. If the user does not explicitly specify it, the number of partitions is considered the default level of parallelism in Apache Spark.

Q33. Name three companies that use Spark Streaming services

Ans. Three known companies using Spark Streaming services are:

Uber
Netflix
Pinterest

Q34. What is Spark SQL?

Ans. Spark SQL is a Spark module for structured data processing that lets you run SQL queries on that data.
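A minimal sketch of running a SQL query over a DataFrame registered as a view (assuming an existing SparkSession named spark, as in spark-shell; the path is a placeholder):

val people = spark.read.json("hdfs://namenode:9000/data/people.json")
people.createOrReplaceTempView("people")

// Standard SQL runs directly against the registered view.
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()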

Q35. Explain Parquet file

Ans. Parquet is a columnar-format file supported by many other data processing systems. Spark SQL allows you to perform both read and write operations with Parquet files.
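A short read/write sketch, again assuming an existing SparkSession named spark and placeholder paths:

val df = spark.read.json("hdfs://namenode:9000/data/people.json")

// Write the DataFrame out in columnar Parquet format...
df.write.parquet("hdfs://namenode:9000/warehouse/people.parquet")

// ...and read it back; the schema travels inside the Parquet files.
val restored = spark.read.parquet("hdfs://namenode:9000/warehouse/people.parquet")
restored.printSchema()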

Q36. Explain Spark Driver?

Ans. The Spark Driver is the program that runs on the master node of the cluster and declares transformations and actions on data RDDs.

Q37. How can you store data in Spark?

Ans. Spark is a processing engine that does not have its own storage engine. It retrieves data from other storage systems such as HDFS and S3.

Q38. Explain the use of File system API in Apache Spark

Ans. The file system API allows you to read data from various storage systems like HDFS, S3, or the local file system.

Q39. What is the task of the Spark Engine?

Ans. The Spark Engine is responsible for scheduling, distributing, and monitoring the data application across the cluster.

Q40. What is the use of SparkContext?

Ans. SparkContext is the entry point to Spark. SparkContext allows you to create RDDs, which provide various ways of churning data.

Q41. Explain about transformations and actions in the context of RDDs.

Ans. Transformations are functions executed on demand to produce a new RDD. All transformations are eventually followed by actions. Some examples of transformations include map, filter, and reduceByKey.

Actions are the results of RDD computations or transformations. After an action is performed, the data from RDD moves back to the local machine. Some examples of actions include reduce, collect, first, and take.
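A word-count style sketch in spark-shell that shows both sides (the input path is a placeholder):

val counts = sc.textFile("hdfs://namenode:9000/data/input.txt")
  .flatMap(_.split("\\s+"))     // transformation: lazily defines a new RDD
  .filter(_.nonEmpty)           // transformation
  .map(word => (word, 1))       // transformation
  .reduceByKey(_ + _)           // transformation

// Actions execute the lineage and bring results back to the driver.
val firstTen = counts.take(10)
val allPairs = counts.collect()
val totalWords = counts.map(_._2).reduce(_ + _)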

Q42. Can you use Spark to access and analyse data stored in Cassandra databases?

Ans. Yes, it is possible if you use Spark Cassandra Connector.

Q43. Is it possible to run Apache Spark on Apache Mesos?

Ans. Yes, Apache Spark can be run on the hardware clusters managed by Mesos.

Q44. Explain about the different cluster managers in Apache Spark

Ans. The three different cluster managers supported in Apache Spark are:

YARN
Apache Mesos – Has rich resource scheduling capabilities and is well suited to running Spark along with other applications. It is advantageous when several users run interactive shells because it scales down the CPU allocation between commands.
Standalone deployments – Well suited for new deployments that only run Spark and are easy to set up.

Q45. How can Spark be connected to Apache Mesos?

Ans. To connect Spark with Mesos (a configuration sketch follows this list):

Configure the Spark driver program to connect to Mesos. The Spark binary package should be in a location accessible by Mesos, or
Install Apache Spark in the same location as that of Apache Mesos and configure the property ‘spark.mesos.executor.home’ to point to the location where it is installed.
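A configuration sketch of the second approach (the Mesos master URL, install path, and package URI below are placeholders, not values from this article):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("mesos-demo")
  .setMaster("mesos://mesos-master:5050")           // Mesos master URL
  .set("spark.mesos.executor.home", "/opt/spark")   // where Spark is installed on the agents
  // Alternative: ship a Spark binary package instead of a local install, e.g.
  // .set("spark.executor.uri", "hdfs://namenode:9000/dist/spark-3.5.0-bin-hadoop3.tgz")

val sc = new SparkContext(conf)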

Q46. Why is there a need for broadcast variables when working with Apache Spark?

Ans. These are read-only variables, cached in memory on every machine. When working with Spark, the use of broadcast variables eliminates the need to ship copies of a variable with every task, so data can be processed faster. Broadcast variables help store a lookup table in memory, which enhances retrieval efficiency compared to an RDD lookup().
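A minimal sketch of a broadcast lookup table in spark-shell (the country-code map is made up for illustration):

// A small lookup table kept as a read-only copy on every executor.
val countryNames = Map("IN" -> "India", "US" -> "United States", "DE" -> "Germany")
val broadcastNames = sc.broadcast(countryNames)

val codes = sc.parallelize(Seq("IN", "US", "DE", "IN"))

// Each task reads the cached copy instead of shipping the map with every task.
val resolved = codes.map(code => broadcastNames.value.getOrElse(code, "Unknown"))
println(resolved.collect().mkString(", "))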

Q47. Is it possible to run Spark and Mesos along with Hadoop?

Ans. Yes, it is possible to run Spark and Mesos with Hadoop by launching each of these as a separate service on the machines. Mesos acts as a unified scheduler that assigns tasks to either Spark or Hadoop.

Q48. What are the benefits of using Spark with Apache Mesos?

Ans. It provides scalable partitioning among various Spark instances and dynamic partitioning between Spark and other big data frameworks.

Q49. What is a DStream?

Ans. A Discretized Stream is a sequence of Resilient Distributed Datasets that represents a stream of data. DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume. DStreams have two operations (sketched after this list) –

Transformations that produce a new DStream.
Output operations that write data to an external system.
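A minimal streaming word-count sketch showing one transformation chain and one output operation (the socket host and port are placeholders; Kafka, Flume, and HDFS sources are wired up the same way):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("dstream-demo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))        // 5-second micro-batches

val lines  = ssc.socketTextStream("localhost", 9999)    // input DStream
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)  // transformations

counts.print()          // output operation: writes each batch to the console
ssc.start()
ssc.awaitTermination()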

Q50. When running Spark applications, is it necessary to install Spark on all the nodes of the YARN cluster?

Ans. Spark need not be installed when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.
