{"id":26669,"date":"2022-02-12T07:35:03","date_gmt":"2022-02-12T07:35:03","guid":{"rendered":"https:\/\/www.devopsschool.com\/blog\/?p=26669"},"modified":"2022-04-13T16:57:11","modified_gmt":"2022-04-13T16:57:11","slug":"top-50-apache-spark-interview-questions-and-answers","status":"publish","type":"post","link":"https:\/\/www.devopsschool.com\/blog\/top-50-apache-spark-interview-questions-and-answers\/","title":{"rendered":"Top 50 Apache Spark interview questions and answers"},"content":{"rendered":"\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/blog-background.jpg\" alt=\"\" class=\"wp-image-26672\" width=\"814\" height=\"349\" srcset=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/blog-background.jpg 700w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/blog-background-300x129.jpg 300w\" sizes=\"auto, (max-width: 814px) 100vw, 814px\" \/><figcaption><em><strong>Apache Spark<\/strong><\/em><\/figcaption><\/figure><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q1-what-is-rdd\">Q1. What is RDD?<\/h2>\n\n\n\n<p> Ans. RDD (Resilient Distributed Dataset) is a fault-tolerant collection of operational elements that run in parallel. The partitioned data in an RDD is immutable and distributed.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q2-name-the-different-types-of-rdd\">Q2. Name the different types of RDD.<\/h2>\n\n\n\n<p>Ans. There are primarily two types of RDD \u2013 parallelized collections and Hadoop datasets.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q3-what-are-the-methods-of-creating-rdds-in-spark\">Q3. What are the methods of creating RDDs in Spark?<\/h2>\n\n\n\n<p>Ans. 
There are two methods \u2013<\/p>\n\n\n<pre class=\"wp-block-code\"><span><code class=\"hljs\">By parallelizing a collection in your Driver program.\nBy loading an external dataset from external storage like HDFS, HBase, or a shared file system.<\/code><\/span><\/pre>\n\n\n<h2 class=\"wp-block-heading\" id=\"q4-what-is-a-sparse-vector\">Q4. What is a Sparse Vector?<\/h2>\n\n\n\n<p>Ans. A sparse vector has two parallel arrays \u2013 one for indices and the other for values.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q5-mention-some-of-the-areas-where-spark-outperforms-hadoop-in-processing\">Q5. Mention some of the areas where Spark outperforms Hadoop in processing.<\/h2>\n\n\n\n<p>Ans. Sensor data processing, real-time querying of data, and stream processing.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q6-what-are-the-languages-supported-by-apache-spark-and-which-is-the-most-popular-one\">Q6. What are the languages supported by Apache Spark and which is the most popular one?<\/h2>\n\n\n\n<p>Ans. There are four languages supported by Apache Spark \u2013 Scala, Java, Python, and R. Scala is the most popular one.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q7-what-is-yarn\">Q7. What is Yarn?<\/h2>\n\n\n\n<p>Ans. 
YARN (Yet Another Resource Negotiator) is Hadoop\u2019s central resource management platform; Spark can run on YARN to deliver scalable operations across the cluster.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q8-do-you-need-to-install-spark-on-all-nodes-of-the-yarn-cluster-why\">Q8. Do you need to install Spark on all nodes of the Yarn cluster? Why?<\/h2>\n\n\n\n<p>Ans. No, because Spark runs on top of YARN.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q9-is-it-possible-to-run-apache-spark-on-apache-mesos\">Q9. Is it possible to run Apache Spark on Apache Mesos?<\/h2>\n\n\n\n<p>Ans. Yes.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q10-what-is-the-lineage-graph\">Q10. What is the lineage graph?<\/h2>\n\n\n\n<p>Ans. RDDs in Spark depend on one or more other RDDs. The representation of the dependencies between RDDs is known as the lineage graph.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q11-define-partitions-in-apache-spark\">Q11. Define Partitions in Apache Spark<\/h2>\n\n\n\n<p>Ans. A partition is a smaller, logical division of data, similar to a \u2018split\u2019 in MapReduce. It is a logical chunk of a large distributed data set. Partitioning is the process of deriving logical units of data to speed up processing.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q12-what-is-a-dstream\">Q12. What is a DStream?<\/h2>\n\n\n\n<p>Ans. Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets that represents a stream of data.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q13-what-is-a-catalyst-framework\">Q13. What is a Catalyst framework?<\/h2>\n\n\n\n<p>Ans. The Catalyst framework is an optimization framework present in Spark SQL. It allows Spark to automatically transform SQL queries by adding new optimizations to build a faster processing system.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q14-what-are-the-actions-in-spark\">Q14. What are the Actions in Spark?<\/h2>\n\n\n\n<p>Ans. 
An action brings the data back from an RDD to the local machine. An action\u2019s execution is the result of all previously created transformations.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q15-what-is-a-parquet-file\">Q15. What is a Parquet file?<\/h2>\n\n\n\n<p>Ans. Parquet is a columnar format file supported by many other data processing systems.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q16-what-is-graphx\">Q16. What is GraphX?<\/h2>\n\n\n\n<p>Ans. Spark uses GraphX for graph processing, to build and transform interactive graphs.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q17-what-file-systems-does-spark-support\">Q17. What file systems does Spark support?<\/h2>\n\n\n\n<p>Ans. Spark supports the following systems:<\/p>\n\n\n<pre class=\"wp-block-code\"><span><code class=\"hljs\">Hadoop Distributed File System (HDFS).\nLocal file system.\nAmazon S3<\/code><\/span><\/pre>\n\n\n<h2 class=\"wp-block-heading\" id=\"q18-what-is-the-difference-between-persist-and-cache\">Q18. What is the difference between persist() and cache()?<\/h2>\n\n\n\n<p>Ans. persist() allows the user to specify the storage level, whereas cache() uses the default storage level.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q19-what-do-you-understand-by-schemardd\">Q19. What do you understand by SchemaRDD?<\/h2>\n\n\n\n<p>Ans. SchemaRDD is an RDD that consists of row objects (wrappers around basic string or integer arrays) with schema information about the type of data in each column.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q20-what-is-a-lazy-evaluation-in-spark\">Q20. What is a lazy evaluation in Spark?<\/h2>\n\n\n\n<p>Ans. For large data in Spark, multiple operations take place even for the execution of a basic transformation. When a transformation is called on an RDD, the operation does not occur immediately. Transformations in Spark are not evaluated until you trigger an action. This is known as lazy evaluation. 
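<\/p>\n\n\n\n<p>The idea can be pictured with a toy sketch (plain Python, not Spark\u2019s actual API or implementation): transformations are merely recorded, and nothing is computed until an action such as collect() is called.<\/p>\n\n\n

```python
# Toy sketch of lazy evaluation (illustrative only, not Spark's implementation).
class LazySeq:
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []          # transformations recorded, not executed

    def map(self, f):                 # transformation: just remember f
        return LazySeq(self.data, self.ops + [f])

    def collect(self):                # action: apply every recorded op now
        out = self.data
        for f in self.ops:
            out = [f(x) for x in out]
        return out

rdd = LazySeq([1, 2, 3]).map(lambda x: x * 2).map(lambda x: x + 1)
# nothing has been computed yet; collect() triggers the whole chain
print(rdd.collect())  # [3, 5, 7]
```

\n\n\n\n<p>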
It avoids unnecessary memory and CPU usage by skipping computations whose results are never needed.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q21-what-is-apache-spark\">Q21. What is Apache Spark?<\/h2>\n\n\n\n<p>Ans. Apache Spark is an easy-to-use and flexible data processing framework. Spark can run on Hadoop, standalone, or in the cloud. It is capable of accessing diverse data sources, including HDFS, Cassandra, and others.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q22-explain-dsstream-with-reference-to-apache-spark\">Q22. Explain DStream with reference to Apache Spark<\/h2>\n\n\n\n<p>Ans. A DStream is a sequence of resilient distributed datasets that represents a stream of data. You can create DStreams from various sources like HDFS, Apache Flume, Apache Kafka, etc.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q23-name-three-data-source-available-in-sparksql\">Q23. Name three data sources available in SparkSQL<\/h2>\n\n\n\n<p>Ans. The data sources available in SparkSQL are:<\/p>\n\n\n<pre class=\"wp-block-code\"><span><code class=\"hljs\">JSON Datasets\nHive tables\nParquet file<\/code><\/span><\/pre>\n\n\n<h2 class=\"wp-block-heading\" id=\"q24-name-some-internal-daemons-used-in-spark\">Q24. Name some internal daemons used in Spark?<\/h2>\n\n\n\n<p>Ans. Important daemons used in Spark are BlockManager, MemoryStore, DAGScheduler, Driver, Worker, Executor, Tasks, etc.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q25-define-the-term-sparse-vector\">Q25. 
Define the term \u2018Sparse Vector.\u2019<\/h2>\n\n\n\n<p>Ans. A sparse vector is a vector that has two parallel arrays, one for indices and one for values, used for storing non-zero entries to save space.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q26-name-the-language-supported-by-apache-spark-for-developing-big-data-applications\">Q26. Name the languages supported by Apache Spark for developing big data applications<\/h2>\n\n\n\n<p>Ans. Important languages used for developing big data applications are:<\/p>\n\n\n<pre class=\"wp-block-code\"><span><code class=\"hljs\">Java\nPython\nR\nClojure\nScala<\/code><\/span><\/pre>\n\n\n<h2 class=\"wp-block-heading\" id=\"q27-what-is-the-method-to-create-a-data-frame\">Q27. What is the method to create a Data frame?<\/h2>\n\n\n\n<p>Ans. In Apache Spark, a Data frame can be created using Tables in Hive and Structured data files.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q28-explain-schemardd\">Q28. Explain SchemaRDD<\/h2>\n\n\n\n<p>Ans. An RDD that consists of row objects with schema information about the type of data in each column is called a SchemaRDD.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q29-what-are-accumulators\">Q29. What are accumulators?<\/h2>\n\n\n\n<p>Ans. Accumulators are write-only variables. They are initialized once and sent to the workers. The workers update them based on the logic written, and the updated values are sent back to the driver.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q30-what-are-the-components-of-spark-ecosystem\">Q30. What are the components of Spark Ecosystem?<\/h2>\n\n\n\n<p>Ans. 
The important components of the Spark ecosystem are:<\/p>\n\n\n<pre class=\"wp-block-code\"><span><code class=\"hljs\">Spark Core: The base engine for large-scale parallel and distributed data processing\nSpark Streaming: This component is used for real-time data streaming.\nSpark SQL: Integrates relational processing by using Spark\u2019s functional programming API\nGraphX: Allows graphs and graph-parallel computation\nMLlib: Allows you to perform machine learning in Apache Spark<\/code><\/span><\/pre>\n\n\n<h2 class=\"wp-block-heading\" id=\"q31-name-three-features-of-using-apache-spark\">Q31. Name three features of using Apache Spark<\/h2>\n\n\n\n<p>Ans. The three most important features of using Apache Spark are:<\/p>\n\n\n<pre class=\"wp-block-code\"><span><code class=\"hljs\">Support for sophisticated analytics\nIntegration with Hadoop and existing Hadoop data\nThe ability to run an application in a Hadoop cluster, up to 100 times faster in memory and ten times faster on disk<\/code><\/span><\/pre>\n\n\n<h2 class=\"wp-block-heading\" id=\"q32-explain-the-default-level-of-parallelism-in-apache-spark\">Q32. Explain the default level of parallelism in Apache Spark<\/h2>\n\n\n\n<p>Ans. 
If the user does not explicitly specify one, the number of partitions is considered the default level of parallelism in Apache Spark.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q33-name-three-companies-which-is-used-spark-streaming-services\">Q33. Name three companies which use Spark Streaming services<\/h2>\n\n\n\n<p>Ans. Three known companies using Spark Streaming services are:<\/p>\n\n\n<pre class=\"wp-block-code\"><span><code class=\"hljs\">Uber\nNetflix\nPinterest<\/code><\/span><\/pre>\n\n\n<h2 class=\"wp-block-heading\" id=\"q34-what-is-spark-sql\">Q34. What is Spark SQL?<\/h2>\n\n\n\n<p>Ans. Spark SQL is a module for structured data processing where we take advantage of SQL queries running on that database.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q35-explain-parquet-file\">Q35. Explain Parquet file<\/h2>\n\n\n\n<p>Ans. Parquet is a columnar format file supported by many other data processing systems. Spark SQL allows you to perform both read and write operations on Parquet files.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q36-explain-spark-driver\">Q36. Explain Spark Driver?<\/h2>\n\n\n\n<p>Ans. Spark Driver is the program which runs on the master node of the machine and declares transformations and actions on data RDDs.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q37-how-can-you-store-the-data-in-spark\">Q37. How can you store the data in Spark?<\/h2>\n\n\n\n<p>Ans. Spark is a processing engine which doesn\u2019t have any storage engine of its own. It can retrieve data from other storage engines like HDFS and S3.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q38-explain-the-use-of-file-system-api-in-apache-spark\">Q38. Explain the use of File system API in Apache Spark<\/h2>\n\n\n\n<p>Ans. The File system API allows you to read data from various storage devices like HDFS, S3, or the local file system.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q39-what-is-the-task-of-spark-engine\">Q39. What is the task of Spark Engine?<\/h2>\n\n\n\n<p>Ans. 
Spark Engine is responsible for scheduling, distributing, and monitoring the data application across the cluster.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q40-what-is-the-user-of-spark-context\">Q40. What is the use of SparkContext?<\/h2>\n\n\n\n<p>Ans. SparkContext is the entry point to Spark. It allows you to create RDDs, which provide various ways of churning data.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q41-explain-about-transformations-and-actions-in-the-context-of-rdds\">Q41. Explain about transformations and actions in the context of RDDs.<\/h2>\n\n\n\n<p>Ans. Transformations are functions executed on demand to produce a new RDD. All transformations are followed by actions. Some examples of transformations include map, filter and reduceByKey.<\/p>\n\n\n\n<p>Actions are the results of RDD computations or transformations. After an action is performed, the data from the RDD moves back to the local machine. Some examples of actions include reduce, collect, first, and take.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q42-can-you-use-spark-to-access-and-analyse-data-stored-in-cassandra-databases\">Q42. Can you use Spark to access and analyse data stored in Cassandra databases?<\/h2>\n\n\n\n<p>Ans. Yes, it is possible if you use the Spark Cassandra Connector.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q43-is-it-possible-to-run-apache-spark-on-apache-mesos\">Q43. Is it possible to run Apache Spark on Apache Mesos?<\/h2>\n\n\n\n<p>Ans. Yes, Apache Spark can be run on the hardware clusters managed by Mesos.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q44-explain-about-the-different-cluster-managers-in-apache-spark\">Q44. Explain about the different cluster managers in Apache Spark<\/h2>\n\n\n\n<p>Ans. 
The three different cluster managers supported in Apache Spark are:<\/p>\n\n\n<pre class=\"wp-block-code\"><span><code class=\"hljs\">YARN\nApache Mesos \u2013 Has rich resource-scheduling capabilities and is well suited to run Spark along with other applications. It is advantageous when several users run interactive shells because it scales down the CPU allocation between commands.\nStandalone deployments \u2013 Well suited for new deployments, which are simple to run and easy to set up.<\/code><\/span><\/pre>\n\n\n<h2 class=\"wp-block-heading\" id=\"q45-how-can-spark-be-connected-to-apache-mesos\">Q45. How can Spark be connected to Apache Mesos?<\/h2>\n\n\n\n<p>Ans. To connect Spark with Mesos \u2013<\/p>\n\n\n<pre class=\"wp-block-code\"><span><code class=\"hljs\">Configure the Spark driver program to connect to Mesos. The Spark binary package should be in a location accessible by Mesos. (or)\nInstall Apache Spark in the same location as that of Apache Mesos and configure the property \u2018spark.mesos.executor.home\u2019 to point to the location where it is installed.<\/code><\/span><\/pre>\n\n\n<h2 class=\"wp-block-heading\" id=\"q46-why-is-there-a-need-for-broadcast-variables-when-working-with-apache-spark\">Q46. Why is there a need for broadcast variables when working with Apache Spark?<\/h2>\n\n\n\n<p>Ans. These are read-only variables, present in an in-memory cache on every machine. When working with Spark, the usage of broadcast variables eliminates the necessity to ship copies of a variable for every task, so data can be processed faster. Broadcast variables help in storing a lookup table inside the memory, which enhances retrieval efficiency when compared to an RDD lookup().<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q47-is-it-possible-to-run-spark-and-mesos-along-with-hadoop\">Q47. Is it possible to run Spark and Mesos along with Hadoop?<\/h2>\n\n\n\n<p>Ans. Yes, it is possible to run Spark and Mesos with Hadoop by launching each of these as a separate service on the machines. Mesos acts as a unified scheduler that assigns tasks to either Spark or Hadoop.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q48-what-are-the-benefits-of-using-spark-with-apache-mesos\">Q48. What are the benefits of using Spark with Apache Mesos?<\/h2>\n\n\n\n<p>Ans. It renders scalable partitioning among various Spark instances and dynamic partitioning between Spark and other big data frameworks.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q49-what-is-a-dstream\">Q49. What is a DStream?<\/h2>\n\n\n\n<p>Ans. Discretized Stream is a sequence of Resilient Distributed Datasets that represents a stream of data. 
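<\/p>\n\n\n\n<p>Conceptually (a plain-Python sketch, not the Spark Streaming API), a discretized stream can be pictured as a sequence of micro-batches, each processed like a small dataset.<\/p>\n\n\n

```python
# Conceptual sketch only: a stream discretized into fixed-size micro-batches.
def discretize(stream, batch_size):
    # split incoming records into micro-batches
    return [stream[i:i + batch_size] for i in range(0, len(stream), batch_size)]

def transform(batches, f):
    # a DStream-style transformation applies f batch by batch
    return [[f(x) for x in batch] for batch in batches]

events = [1, 2, 3, 4, 5]
batches = discretize(events, 2)                 # [[1, 2], [3, 4], [5]]
print(transform(batches, lambda x: x * 10))     # [[10, 20], [30, 40], [50]]
```

\n\n\n\n<p>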
DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume. DStreams have two operations \u2013<\/p>\n\n\n<pre class=\"wp-block-code\"><span><code class=\"hljs\">Transformations that produce a new DStream.\nOutput operations that write data to an external system.<\/code><\/span><\/pre>\n\n\n<h2 class=\"wp-block-heading\" id=\"q50-when-running-spark-applications-is-it-necessary-to-install-spark-on-all-the-nodes-of-yarn-cluster\">Q50. When running Spark applications, is it necessary to install Spark on all the nodes of YARN cluster?<\/h2>\n\n\n\n<p>Ans. 
Spark need not be installed when running a job under YARN or Mesos because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"related-video\">Related video:<\/h4>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\"  id=\"_ytid_74388\"  width=\"760\" height=\"427\"  data-origwidth=\"760\" data-origheight=\"427\" src=\"https:\/\/www.youtube.com\/embed\/zC9cnh8rJd0?enablejsapi=1&#038;autoplay=0&#038;cc_load_policy=0&#038;cc_lang_pref=&#038;iv_load_policy=1&#038;loop=0&#038;rel=1&#038;fs=1&#038;playsinline=0&#038;autohide=2&#038;theme=dark&#038;color=red&#038;controls=1&#038;disablekb=0&#038;\" class=\"__youtube_prefs__  epyt-is-override  no-lazyload\" title=\"YouTube player\"  allow=\"fullscreen; accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen data-no-lazy=\"1\" data-skipgform_ajax_framebjll=\"\"><\/iframe>\n<\/div><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>Q1. What is RDD? Ans. RDD (Resilient Distributed Dataset) is a fault-tolerant collection of operational elements that run in parallel. The partitioned data in an RDD is immutable and distributed. Q2. Name the different types of RDD. Ans. There are primarily two types of RDD \u2013 parallelized collections and Hadoop datasets. Q3. 
What are the methods of&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_kad_post_transparent":"","_kad_post_title":"","_kad_post_layout":"","_kad_post_sidebar_id":"","_kad_post_content_style":"","_kad_post_vertical_padding":"","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"_kad_post_classname":"","_joinchat":[],"footnotes":""},"categories":[52],"tags":[7252,420,310,296,349,5457,173,325,1543,176,637],"class_list":["post-26669","post","type-post","status-publish","format-standard","hentry","category-interview-questions-answers","tag-spark","tag-apache","tag-application","tag-features","tag-file","tag-framework","tag-java","tag-language","tag-node","tag-python","tag-top"],"_links":{"self":[{"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/26669","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/comments?post=26669"}],"version-history":[{"count":2,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/26669\/revisions"}],"predecessor-version":[{"id":26679,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/26669\/revisions\/26679"}],"wp:attachment":[{"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/media?parent=26669"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/categories?post=26669"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/tags?post=26669"}]
,"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}