{"id":26791,"date":"2022-02-15T11:30:47","date_gmt":"2022-02-15T11:30:47","guid":{"rendered":"https:\/\/www.devopsschool.com\/blog\/?p=26791"},"modified":"2022-03-16T06:10:09","modified_gmt":"2022-03-16T06:10:09","slug":"top-50-interview-questions-and-answers-of-hadoop","status":"publish","type":"post","link":"https:\/\/www.devopsschool.com\/blog\/top-50-interview-questions-and-answers-of-hadoop\/","title":{"rendered":"Top 50 interview questions and answers of Hadoop"},"content":{"rendered":"\n<p><strong>A quick discussion about Hadoop <\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/what-is-hadoop-1024x576.jpg\" alt=\"\" class=\"wp-image-26799\" srcset=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/what-is-hadoop-1024x576.jpg 1024w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/what-is-hadoop-300x169.jpg 300w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/what-is-hadoop-768x432.jpg 768w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/what-is-hadoop-355x199.jpg 355w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/what-is-hadoop.jpg 1280w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Apache <strong>Hadoop <\/strong>is an open-source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data. Instead of using one large computer to store and process the data, Hadoop allows clustering multiple computers to analyze massive datasets in parallel more quickly.<\/p>\n\n\n\n<p>1. 
<strong>What are the different vendor-specific distributions of Hadoop?<\/strong><\/p>\n\n\n\n<p><strong>Answer: <\/strong>The different vendor-specific distributions of Hadoop are Cloudera, MapR, Amazon EMR, Microsoft Azure HDInsight, IBM InfoSphere, and Hortonworks (now part of Cloudera).<\/p>\n\n\n\n<p>2. <strong>What are the different Hadoop configuration files?<\/strong><\/p>\n\n\n\n<p><strong>Answer: <\/strong>The different Hadoop configuration files include:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>hadoop-env.sh<\/li><li>mapred-site.xml<\/li><li>core-site.xml<\/li><li>yarn-site.xml<\/li><li>hdfs-site.xml<\/li><li>masters and slaves<\/li><\/ul>\n\n\n\n<p>3. <strong>What are the three modes in which Hadoop can run?<\/strong><\/p>\n\n\n\n<p><strong>Answer: <\/strong>The three modes in which Hadoop can run are:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>Standalone mode: <\/strong>This is the default mode. It uses the local FileSystem and a single Java process to run the Hadoop services.<\/li><li><strong>Pseudo-distributed mode:<\/strong> This uses a single-node Hadoop deployment to execute all Hadoop services.<\/li><li><strong>Fully-distributed mode:<\/strong> This uses separate nodes to run Hadoop master and slave services.<\/li><\/ul>\n\n\n\n<p>4. What are the differences between a regular <strong>FileSystem<\/strong> and <strong>HDFS<\/strong>?<\/p>\n\n\n\n<p><strong>Answer: <\/strong><\/p>\n\n\n\n<p><strong>Regular FileSystem: <\/strong>In a regular FileSystem, data is maintained on a single system. If the machine crashes, data recovery is difficult due to low fault tolerance. Seek time is higher, so processing the data takes longer.<br><strong>HDFS: <\/strong>Data is distributed and maintained across multiple systems. If a DataNode crashes, data can still be recovered from the other nodes in the cluster. Reading a given block can take comparatively longer, as data is read from local disks and coordinated across multiple systems.<\/p>\n\n\n\n<p>5. 
<strong>What are the two types of metadata that a NameNode server holds?<\/strong><\/p>\n\n\n\n<p><strong>Answer: <\/strong>The two types of metadata that a NameNode server holds are:<\/p>\n\n\n\n<p><strong>Metadata on disk <\/strong>&#8211; This contains the edit log and the FsImage<br><strong>Metadata in RAM <\/strong>&#8211; This contains the information about DataNodes<\/p>\n\n\n\n<p>6. <strong>How can you restart NameNode and all the daemons in Hadoop?<\/strong><\/p>\n\n\n\n<p><strong>Answer: <\/strong><\/p>\n\n\n\n<p>The following commands will help you restart NameNode and all the daemons:<\/p>\n\n\n\n<p>You can stop the NameNode with the .\/sbin\/hadoop-daemon.sh stop namenode command and then start it again with the .\/sbin\/hadoop-daemon.sh start namenode command.<\/p>\n\n\n\n<p>You can stop all the daemons with the .\/sbin\/stop-all.sh command and then start them again with the .\/sbin\/start-all.sh command.<\/p>\n\n\n\n<p>7. <strong>Which command will help you find the status of blocks and FileSystem health?<\/strong><\/p>\n\n\n\n<p><strong>Answer: <\/strong><\/p>\n\n\n\n<p>To check the status of the blocks, use the command:<\/p>\n\n\n\n<p>hdfs fsck \/ -files -blocks<\/p>\n\n\n\n<p>To check the health status of the FileSystem, use the command:<\/p>\n\n\n\n<p>hdfs fsck \/ -files -blocks -locations &gt; dfs-fsck.log<\/p>\n\n\n\n<p>8. <strong>What would happen if you store too many small files in a cluster on HDFS?<\/strong><\/p>\n\n\n\n<p><strong>Answer: <\/strong>Storing a large number of small files on HDFS generates a lot of metadata. Keeping this metadata in RAM is a challenge, as each file, block, or directory takes about 150 bytes of metadata. Thus, the cumulative size of all the metadata will be too large.<\/p>\n\n\n\n<p>9. 
<strong>How do you copy data from the local system onto HDFS?<\/strong><\/p>\n\n\n\n<p><strong>Answer: <\/strong><\/p>\n\n\n\n<p>The following command will copy data from the local file system onto HDFS:<\/p>\n\n\n\n<p>hadoop fs -copyFromLocal [source] [destination]<\/p>\n\n\n\n<p>Example: hadoop fs -copyFromLocal \/tmp\/data.csv \/user\/test\/data.csv<\/p>\n\n\n\n<p>In the above syntax, the source is the local path and the destination is the HDFS path. Add the -f (force) option to overwrite a file that already exists at the destination on HDFS.<\/p>\n\n\n\n<p>10. <strong>Is there any way to change the replication of files on HDFS after they are already written to HDFS?<\/strong><\/p>\n\n\n\n<p><strong>Answer: <\/strong>Yes, the following are ways to change the replication of files on HDFS:<\/p>\n\n\n\n<p>We can change the dfs.replication value to a particular number in the $HADOOP_HOME\/conf\/hdfs-site.xml file, which will apply that replication factor to any new content that comes in.<\/p>\n\n\n\n<p>If you want to change the replication factor for a particular file or directory, use:<\/p>\n\n\n\n<p>$HADOOP_HOME\/bin\/hadoop fs -setrep -w 4 \/path\/to\/file<\/p>\n\n\n\n<p>Example: $HADOOP_HOME\/bin\/hadoop fs -setrep -w 4 \/user\/temp\/test.csv<\/p>\n\n\n\n<p>11.<strong> Is it possible to change the number of mappers to be created in a MapReduce job?<\/strong><\/p>\n\n\n\n<p><strong>Answer: <\/strong><\/p>\n\n\n\n<p>By default, you cannot change the number of mappers, because it is equal to the number of input splits. However, there are different ways in which you can either set a property or customize the code to change the number of mappers.<\/p>\n\n\n\n<p>For example, if you have a 1 GB file that is split into eight blocks (of 128 MB each), there will be only eight mappers running on the cluster. 
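<\/p>\n\n\n\n<p>The split arithmetic behind this example can be sketched in a few lines (an illustrative Python snippet, not part of the Hadoop API; the function name is invented for this sketch):<\/p>

```python
import math

def num_mappers(file_size_mb, split_size_mb=128):
    """Mappers ~= input splits: ceiling of file size over split size."""
    return math.ceil(file_size_mb / split_size_mb)

# A 1 GB (1024 MB) file with the default 128 MB split size -> 8 mappers.
print(num_mappers(1024))      # 8
# Halving the split size doubles the mapper count.
print(num_mappers(1024, 64))  # 16
```

<p>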
Changing the split size (for example, via the mapreduce.input.fileinputformat.split.maxsize property) or supplying a custom InputFormat changes the number of splits and, with it, the number of mappers.<\/p>\n\n\n\n<p>12. <strong>What is speculative execution in Hadoop?<\/strong><\/p>\n\n\n\n<p><strong>Answer: <\/strong><\/p>\n\n\n\n<p>If a DataNode is executing a task slowly, the master node can redundantly execute another instance of the same task on another node. The task that finishes first is accepted, and the other is killed. Speculative execution is therefore useful in workload-intensive environments.<\/p>\n\n\n\n<p>For example, suppose node A is running a task slowly. The scheduler keeps track of the available resources, and with speculative execution turned on, a copy of the slower task runs on node B. The output is accepted from whichever node finishes first.<\/p>\n\n\n\n<p>13. <strong>What are the major configuration parameters required in a MapReduce program?<\/strong><\/p>\n\n\n\n<p><strong>Answer: <\/strong><\/p>\n\n\n\n<p>We need to have the following configuration parameters:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Input location of the job in HDFS<\/li><li>Output location of the job in HDFS<\/li><li>Input and output formats<\/li><li>Classes containing the map and reduce functions<\/li><li>JAR file for the mapper, reducer, and driver classes<\/li><\/ul>\n\n\n\n<p>14. 
<strong>What is the role of the OutputCommitter class in a MapReduce job?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong> As the name indicates, OutputCommitter describes the commit of task output for a MapReduce job.<\/p>\n\n\n\n<p>Example: org.apache.hadoop.mapreduce.OutputCommitter<\/p>\n\n\n\n<p>public abstract class FileOutputCommitter extends OutputCommitter<\/p>\n\n\n\n<p>MapReduce relies on the OutputCommitter for the following:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Setting up the job during initialization<\/li><li>Cleaning up the job after completion<\/li><li>Setting up the task\u2019s temporary output<\/li><li>Checking whether a task needs a commit<\/li><li>Committing the task output<\/li><li>Discarding the task commit<\/li><\/ul>\n\n\n\n<p>15. <strong>Explain the process of spilling in MapReduce.<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong> Spilling is the process of copying data from the memory buffer to disk when buffer usage reaches a specific threshold. This happens when there is not enough memory to fit all of the mapper output. By default, a background thread starts spilling the content from memory to disk after 80 percent of the buffer is filled.<\/p>\n\n\n\n<p>For a 100 MB buffer, spilling will start once the content of the buffer reaches 80 MB.<\/p>\n\n\n\n<p>16.<strong> How can you set the mappers and reducers for a MapReduce job?<\/strong><\/p>\n\n\n\n<p><strong>Answer: <\/strong><\/p>\n\n\n\n<p>The number of mappers and reducers can be set on the command line using:<\/p>\n\n\n\n<p>-D mapred.map.tasks=5 -D mapred.reduce.tasks=2<\/p>\n\n\n\n<p>In the code, one can configure the JobConf variables:<\/p>\n\n\n\n<p>job.setNumMapTasks(5); \/\/ 5 mappers<\/p>\n\n\n\n<p>job.setNumReduceTasks(2); \/\/ 2 reducers<\/p>\n\n\n\n<p>17. <strong>
What happens when a node running a map task fails before sending the output to the reducer?<\/strong><\/p>\n\n\n\n<p><strong>Answer: <\/strong>If this ever happens, map tasks will be assigned to a new node, and the entire task will be rerun to re-create the map output. In Hadoop v2, the YARN framework has a temporary daemon called application master, which takes care of the execution of the application. If a task on a particular node failed due to the unavailability of a node, it is the role of the application master to have this task scheduled on another node.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"698\" height=\"299\" src=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/Capture1.png\" alt=\"\" class=\"wp-image-26794\" srcset=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/Capture1.png 698w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/Capture1-300x129.png 300w\" sizes=\"auto, (max-width: 698px) 100vw, 698px\" \/><\/figure>\n\n\n\n<p>18.<strong> Can we write the output of MapReduce in different formats?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong> Yes. Hadoop supports various input and output File formats, such as:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>TextOutputFormat <\/strong>&#8211; This is the default output format and it writes records as lines of text.<\/li><li><strong>SequenceFileOutputFormat <\/strong>&#8211; This is used to write sequence files when the output files need to be fed into another MapReduce job as input files.<\/li><li><strong>MapFileOutputFormat<\/strong> &#8211; This is used to write the output as map files.<\/li><li><strong>SequenceFileAsBinaryOutputFormat <\/strong>&#8211; This is another variant of SequenceFileInputFormat. 
It writes keys and values to a sequence file in binary format.<\/li><li><strong>DBOutputFormat &#8211;<\/strong> This is used for writing to relational databases and HBase. This format also sends the reduce output to a SQL table.<\/li><\/ul>\n\n\n\n<p>19.<strong> What benefits did YARN bring in Hadoop 2.0 and how did it solve the issues of MapReduce v1?<\/strong><\/p>\n\n\n\n<p><strong>Answer: <\/strong><\/p>\n\n\n\n<p>In Hadoop v1, MapReduce performed both data processing and resource management; there was only one master process for the processing layer, known as JobTracker. JobTracker was responsible for resource tracking and job scheduling.<\/p>\n\n\n\n<p>Managing jobs with a single JobTracker and utilizing computational resources was inefficient in MapReduce v1. As a result, JobTracker was overburdened with job scheduling and resource management. Among the issues were scalability, availability, and poor resource utilization. In addition, non-MapReduce jobs couldn\u2019t run in v1.<\/p>\n\n\n\n<p>To overcome these issues, Hadoop 2 introduced YARN as the processing layer. In YARN, there is a processing master called ResourceManager. In Hadoop v2, you can have the ResourceManager running in high availability mode. There are NodeManagers running on multiple machines, and a temporary daemon called the ApplicationMaster. 
Here, the ResourceManager is only handling the client connections and taking care of tracking the resources.<\/p>\n\n\n\n<p>In Hadoop v2, the following features are available:<\/p>\n\n\n\n<p><strong>Scalability &#8211; <\/strong>You can have a cluster size of more than 10,000 nodes and you can run more than 100,000 concurrent tasks.<br><strong>Compatibility &#8211;<\/strong> The applications developed for Hadoop v1 run on YARN without any disruption or availability issues.<br><strong>Resource utilization <\/strong>&#8211; YARN allows the dynamic allocation of cluster resources to improve resource utilization.<br><strong>Multitenancy <\/strong>&#8211; YARN can use open-source and proprietary data access engines, as well as perform real-time analysis and run ad-hoc queries.<\/p>\n\n\n\n<p>20. <strong>Which of the following has replaced JobTracker from MapReduce v1?<\/strong><\/p>\n\n\n\n<p><strong>Answer: <\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>NodeManager<\/li><li>ApplicationManager<\/li><li>ResourceManager<\/li><li>Scheduler<\/li><li>The answer is <strong>ResourceManager<\/strong>. It is the name of the master process in Hadoop v2.<\/li><\/ul>\n\n\n\n<p>21. <strong>Write the YARN commands to check the status of an application and kill an application.<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong> The commands are as follows:<\/p>\n\n\n\n<p>a) To check the status of an application:<\/p>\n\n\n\n<p>yarn application -status ApplicationID<\/p>\n\n\n\n<p>b) To kill or terminate an application:<\/p>\n\n\n\n<p>yarn application -kill ApplicationID<\/p>\n\n\n\n<p>22. <strong>Can we have more than one ResourceManager in a YARN-based cluster?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong><\/p>\n\n\n\n<p>Yes, Hadoop v2 allows us to have more than one ResourceManager. 
You can have a high availability YARN cluster where you can have an active ResourceManager and a standby ResourceManager, where the ZooKeeper handles the coordination.<\/p>\n\n\n\n<p>There can only be one active ResourceManager at a time. If an active ResourceManager fails, then the standby ResourceManager comes to the rescue.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"783\" height=\"169\" src=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/image-3.png\" alt=\"\" class=\"wp-image-26797\" srcset=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/image-3.png 783w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/image-3-300x65.png 300w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/image-3-768x166.png 768w\" sizes=\"auto, (max-width: 783px) 100vw, 783px\" \/><\/figure>\n\n\n\n<p>23. <strong>What happens if a ResourceManager fails while executing an application in a high availability cluster?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong> In a high availability cluster, there are two ResourceManagers: one active and the other standby. If a ResourceManager fails in the case of a high availability cluster, the standby will be elected as active and instructs the ApplicationMaster to abort. The ResourceManager recovers its running state by taking advantage of the container statuses sent from all node managers.<\/p>\n\n\n\n<p>24. <strong>What are the different schedulers available in YARN?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong><\/p>\n\n\n\n<p>The different schedulers available in YARN are:<\/p>\n\n\n\n<p><strong>FIFO scheduler<\/strong> &#8211; This places applications in a queue and runs them in the order of submission (first in, first out). 
It is not desirable, as a long-running application might block small applications.<br><strong>Capacity scheduler<\/strong> &#8211; A separate dedicated queue allows a small job to start as soon as it is submitted. Large jobs finish later than they would under the FIFO scheduler.<br><strong>Fair scheduler <\/strong>&#8211; There is no need to reserve a set amount of capacity, since it dynamically balances resources between all the running jobs.<\/p>\n\n\n\n<p>25. <strong>In a cluster of 10 DataNodes, each having 16 GB RAM and 10 cores, what would be the total processing capacity of the cluster?<\/strong><\/p>\n\n\n\n<p><strong>Answer: <\/strong>Every node in a Hadoop cluster will have one or more processes running that need RAM. The machine itself, which runs a Linux file system, has its own processes that consume a certain amount of RAM. Therefore, with 10 DataNodes, you need to allow at least 20 to 30 percent for these overheads, Cloudera-based services, etc. That leaves roughly 11 or 12 GB and six or seven cores available on every machine for processing. Multiply that by 10, and that&#8217;s your processing capacity.<\/p>\n\n\n\n<p>26. <strong>What are the different components of a Hive architecture?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong> The different components of Hive are:<\/p>\n\n\n\n<p>User Interface: This calls the execute interface to the driver and creates a session for the query. Then, it sends the query to the compiler to generate an execution plan for it<br>Metastore: This stores the metadata information and sends it to the compiler for the execution of a query<br>Compiler: This generates the execution plan. It has a DAG of stages, where each stage is either a metadata operation, a map or reduce job, or an operation on HDFS<br>Execution Engine: This acts as a bridge between Hive and Hadoop to process the query. 
Execution Engine communicates bidirectionally with Metastore to perform operations, such as creating or dropping tables.<\/p>\n\n\n\n<p>27. <strong>What is the difference between an external table and a managed table in Hive?<\/strong><\/p>\n\n\n\n<p><strong><strong>Answer:<\/strong><\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"856\" height=\"514\" src=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/external-table-internal-table.png\" alt=\"\" class=\"wp-image-26806\" srcset=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/external-table-internal-table.png 856w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/external-table-internal-table-300x180.png 300w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/external-table-internal-table-768x461.png 768w\" sizes=\"auto, (max-width: 856px) 100vw, 856px\" \/><\/figure>\n\n\n\n<p>28. <strong>What is a partition in Hive and why is partitioning required in Hive<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong><\/p>\n\n\n\n<p>Partition is a process for grouping similar types of data together based on columns or partition keys. Each table can have one or more partition keys to identify a particular partition.<\/p>\n\n\n\n<p>Partitioning provides granularity in a Hive table. It reduces the query latency by scanning only relevant partitioned data instead of the entire data set. We can partition the transaction data for a bank based on month \u2014 January, February, etc. Any operation regarding a particular month, say February, will only have to scan the February partition, rather than the entire table data.<\/p>\n\n\n\n<p>29. 
<strong>What are the components used in Hive query processors?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong> The components used in Hive query processors are:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Parser<\/li><li>Semantic Analyzer<\/li><li>Execution Engine<\/li><li>User-Defined Functions<\/li><li>Logical Plan Generation<\/li><li>Physical Plan Generation<\/li><li>Optimizer<\/li><li>Operators<\/li><li>Type Checking<\/li><\/ul>\n\n\n\n<p>30. <strong>Why does Hive not store metadata information in HDFS?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong> We know that Hive\u2019s data is stored in HDFS. However, the metadata is stored either locally or in an RDBMS. The metadata is not stored in HDFS, because HDFS read\/write operations are time-consuming. As such, Hive stores metadata information in the metastore using an RDBMS instead of HDFS. This allows for low latency and faster access.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"782\" height=\"192\" src=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/low-latency.png\" alt=\"\" class=\"wp-image-26808\" srcset=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/low-latency.png 782w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/low-latency-300x74.png 300w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/low-latency-768x189.png 768w\" sizes=\"auto, (max-width: 782px) 100vw, 782px\" \/><\/figure>\n\n\n\n<p>31. <strong>What are the different ways of executing a Pig script?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong> The different ways of executing a Pig script are as follows:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Grunt shell<\/li><li>Script file<\/li><li>Embedded script<\/li><\/ul>\n\n\n\n<p>32. <strong>
What are the major components of a Pig execution environment?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong><\/p>\n\n\n\n<p>The major components of a Pig execution environment are:<\/p>\n\n\n\n<p><strong>Pig Scripts:<\/strong> They are written in Pig Latin using built-in operators and UDFs, and submitted to the execution environment.<br><strong>Parser: <\/strong>Completes type checking and checks the syntax of the script. The output of the parser is a Directed Acyclic Graph (DAG).<br><strong>Optimizer: <\/strong>Performs optimization using merge, transform, split, etc. Optimizer aims to reduce the amount of data in the pipeline.<br><strong>Compiler:<\/strong> Converts the optimized code into MapReduce jobs automatically.<br>Execution Engine: MapReduce jobs are submitted to execution engines to generate the desired results.<\/p>\n\n\n\n<p>33.<strong> State the usage of the group, order by, and distinct keywords in Pig scripts.<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong> The group statement collects various records with the same key and groups the data in one or more relations.<\/p>\n\n\n\n<p>Example: Group_data = GROUP Relation_name BY AGE<\/p>\n\n\n\n<p>The order statement is used to display the contents of relation in sorted order based on one or more fields.<\/p>\n\n\n\n<p>Example: Relation_2 = ORDER Relation_name1 BY (ASC|DSC)<\/p>\n\n\n\n<p>Distinct statement removes duplicate records and is implemented only on entire records, and not on individual records.<\/p>\n\n\n\n<p>Example: Relation_2 = DISTINCT Relation_name1<\/p>\n\n\n\n<p>34. 
<strong>Write the code needed to open a connection in HBase.<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong> The following code is used to open a connection in HBase:<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-1\" data-shcb-language-name=\"Java\" data-shcb-language-slug=\"java\"><span><code class=\"hljs language-java\">Configuration myConf = HBaseConfiguration.create();<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-1\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Java<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">java<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-2\" data-shcb-language-name=\"Java\" data-shcb-language-slug=\"java\"><span><code class=\"hljs language-java\">HTableInterface usersTable = new HTable(myConf, \"users\");<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-2\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Java<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">java<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>35. <strong>What does replication mean in terms of HBase?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong><\/p>\n\n\n\n<p>The replication feature in HBase provides a mechanism to copy data between clusters. 
This feature can be used as a disaster recovery solution that provides high availability for HBase.<\/p>\n\n\n\n<p>The following commands alter the hbase1 table and set the REPLICATION_SCOPE to 1. A REPLICATION_SCOPE of 0 indicates that the table is not replicated.<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-3\" data-shcb-language-name=\"Ruby\" data-shcb-language-slug=\"ruby\"><span><code class=\"hljs language-ruby\">disable 'hbase1'\n\nalter 'hbase1', {NAME =&gt; 'family_name', REPLICATION_SCOPE =&gt; '<span class=\"hljs-number\">1<\/span>'}\n\nenable 'hbase1'<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-3\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Ruby<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">ruby<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>36.<strong> What is compaction in HBase?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"851\" height=\"303\" src=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/compaction.png\" alt=\"\" class=\"wp-image-26818\" srcset=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/compaction.png 851w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/compaction-300x107.png 300w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/compaction-768x273.png 768w\" sizes=\"auto, (max-width: 851px) 100vw, 851px\" \/><\/figure>\n\n\n\n<p>37. <strong>How does a Bloom filter work?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong> The HBase Bloom filter is a mechanism to test whether an HFile contains a specific row or row-col cell. The Bloom filter is named after its creator, Burton Howard Bloom. 
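<\/p>\n\n\n\n<p>The underlying idea can be sketched in a few lines of Python (a toy illustration only, not HBase\u2019s actual implementation; the class and method names are invented for this sketch):<\/p>

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash functions set/check k bits in a bit array."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k deterministic bit positions from the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("row-001")
print(bf.might_contain("row-001"))  # True
print(bf.might_contain("row-999"))  # very likely False (false positives are possible)
```

<p>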
It is a data structure that predicts whether a given element is a member of a set of data. These filters provide an in-memory index structure that reduces disk reads and determines the probability of finding a row in a particular file.<\/p>\n\n\n\n<p>38. <strong>How does the Write Ahead Log (WAL) help when a RegionServer crashes?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong> If a RegionServer hosting a MemStore crashes, the data that existed in memory but was not yet persisted is lost. HBase guards against that by writing to the WAL before the write completes. The HBase cluster keeps a WAL to record changes as they happen. If HBase goes down, replaying the WAL will recover data that was not yet flushed from the MemStore to the HFile.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"737\" height=\"358\" src=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/region-server.png\" alt=\"\" class=\"wp-image-26819\" srcset=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/region-server.png 737w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/region-server-300x146.png 300w\" sizes=\"auto, (max-width: 737px) 100vw, 737px\" \/><\/figure>\n\n\n\n<p>39. <strong>What are catalog tables in HBase?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong><\/p>\n\n\n\n<p>The catalog has two tables: hbase:meta and -ROOT-.<\/p>\n\n\n\n<p>The catalog table hbase:meta exists as an HBase table and is filtered out of the HBase shell\u2019s list command. It keeps a list of all the regions in the system, and the location of hbase:meta itself is stored in ZooKeeper. In older HBase versions, the -ROOT- table kept track of the location of the .META. table.<\/p>\n\n\n\n<p>40. 
<strong>How is Sqoop different from Flume?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"851\" height=\"597\" src=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/spoof-difference.png\" alt=\"\" class=\"wp-image-26820\" srcset=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/spoof-difference.png 851w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/spoof-difference-300x210.png 300w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/spoof-difference-768x539.png 768w\" sizes=\"auto, (max-width: 851px) 100vw, 851px\" \/><\/figure>\n\n\n\n<p>41. <strong>What is the importance of the eval tool in Sqoop?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong> The Sqoop eval tool allows users to execute user-defined queries against respective database servers and preview the result in the console.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"750\" height=\"273\" src=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/Capture.png\" alt=\"\" class=\"wp-image-26823\" srcset=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/Capture.png 750w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/02\/Capture-300x109.png 300w\" sizes=\"auto, (max-width: 750px) 100vw, 750px\" \/><\/figure>\n\n\n\n<p>42.<strong> What is a checkpoint?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong> In brief, \u201cCheckpointing\u201d is a process that takes an FsImage, edit log and compacts them into a new FsImage. Thus, instead of replaying an edit log, the NameNode can load the final in-memory state directly from the FsImage. This is a far more efficient operation and reduces NameNode startup time. Checkpointing is performed by Secondary NameNode.<\/p>\n\n\n\n<p>43. 
<strong>How is HDFS fault-tolerant?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong> When data is stored in HDFS, the NameNode replicates it to several DataNodes. The default replication factor is 3, and you can change it as needed. If a DataNode goes down, the NameNode automatically copies the data to another node from the replicas and keeps it available. This is what makes HDFS fault-tolerant.<\/p>\n\n\n\n<p>44. <strong>Why do we use HDFS for applications having large data sets and not when there are a lot of small files?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong> HDFS is better suited to a large amount of data stored in a single file than to the same data spread across many small files. The NameNode stores the metadata of the file system in RAM, so the amount of RAM limits the number of files an HDFS file system can hold. In other words, too many files generate too much metadata, and storing all that metadata in RAM becomes a challenge. As a rule of thumb, the metadata for a file, block, or directory takes about 150 bytes.<\/p>\n\n\n\n<p>45. <strong>What does the \u2018jps\u2019 command do?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong> The \u2018jps\u2019 command checks whether the Hadoop daemons are running. It lists all the Hadoop daemons, i.e. NameNode, DataNode, ResourceManager, NodeManager, etc., that are running on the machine.<\/p>\n\n\n\n<p>46. <strong>What is \u201cspeculative execution\u201d in Hadoop?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong> If a node appears to be executing a task more slowly than expected, the master node can redundantly launch another instance of the same task on another node. The task that finishes first is accepted and the other is killed. This process is called \u201cspeculative execution\u201d.<\/p>\n\n\n\n<p>47. 
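<\/p>\n\n\n\n<p>The 150-bytes-per-object rule of thumb from question 44 makes the small-files problem easy to quantify. The sketch below is a rough, illustrative estimate only; real NameNode memory usage varies, and the file counts and block counts are invented for the example:<\/p>\n\n\n\n

```python
# Back-of-the-envelope NameNode heap estimate using the rough rule of thumb
# of ~150 bytes of metadata per file, block, or directory object.
# Illustrative only; actual NameNode memory usage varies.

BYTES_PER_OBJECT = 150

def namenode_metadata_bytes(num_files, blocks_per_file=1):
    # each file contributes one file object plus one object per block
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# 100 million small files (one block each)...
small_files = namenode_metadata_bytes(100_000_000)
# ...versus the same data packed into 100,000 large files of 1,000 blocks each
large_files = namenode_metadata_bytes(100_000, blocks_per_file=1000)

print(small_files)  # 30000000000  (~30 GB of heap just for metadata)
print(large_files)  # 15015000000  (~15 GB for the same amount of data)
```

\n\n\n\n<p>47. 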
<strong>What is the difference between an \u201cHDFS Block\u201d and an \u201cInput Split\u201d?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong> An \u201cHDFS Block\u201d is the physical division of the data, while an \u201cInput Split\u201d is the logical division. HDFS divides data into blocks for storage, whereas, for processing, MapReduce divides the data into input splits and assigns them to mapper functions.<\/p>\n\n\n\n<p>48. <strong>How do \u201creducers\u201d communicate with each other?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong> This is a tricky question. The \u201cMapReduce\u201d programming model does not allow \u201creducers\u201d to communicate with each other. \u201cReducers\u201d run in isolation.<\/p>\n\n\n\n<p>49. <strong>How will you write<\/strong> <strong>a custom partitioner?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong> A custom partitioner for a Hadoop job can be written by following the steps below:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Create a new class that extends the Partitioner class.<\/li><li>Override the getPartition method in the wrapper that runs in MapReduce.<\/li><li>Add the custom partitioner to the job with the setPartitioner method, or add it to the job as a config file.<\/li><\/ul>\n\n\n\n<p>50. <strong>What do you know about \u201cSequenceFileInputFormat\u201d?<\/strong><\/p>\n\n\n\n<p><strong>Answer:<\/strong> <\/p>\n\n\n\n<p>\u201cSequenceFileInputFormat\u201d is an input format for reading sequence files. 
It is a compressed binary file format optimized for passing data from the output of one \u201cMapReduce\u201d job to the input of another.<\/p>\n\n\n\n<p>Sequence files can be generated as the output of other MapReduce tasks and are an efficient intermediate representation for data passing from one MapReduce job to another.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A quick discussion about Hadoop Apache Hadoop is an open-source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data&#8230;. <\/p>\n","protected":false},"author":1,"featured_media":26832,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_joinchat":[],"footnotes":""},"categories":[2,52],"tags":[641,5449,7290,766,3347,7264,482,7225],"class_list":["post-26791","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorised","category-interview-questions-answers","tag-answers","tag-devopsschool","tag-hadoop","tag-interview","tag-interview-questions-answers","tag-learning","tag-questions","tag-top-50"],"_links":{"self":[{"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/26791","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/comments?post=26791"}],"version-history":[{"count":4,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/26791\/revisions"}],"predecessor-version":[{"id":26834,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/26791\/revisions\/26834"}],"wp:fea
turedmedia":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/media\/26832"}],"wp:attachment":[{"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/media?parent=26791"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/categories?post=26791"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/tags?post=26791"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}