Short description about Hive
Hive is a data warehouse system that is used to analyze structured data. It is built on the top of Hadoop. It was developed by Facebook.
Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL-like queries called HQL (Hive query language) which gets internally converted to MapReduce jobs.
Using Hive, we can skip the requirement of the traditional approach of writing complex MapReduce programs. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and User Defined Functions (UDF).
Moving to Questions and Answers:
1. WHAT ARE THE DIFFERENT TABLES AVAILABLE IN THE HIVE?
Answer: There are two types of tables available in Hive – managed and external.
2. WHAT IS THE DIFFERENCE BETWEEN EXTERNAL AND MANAGED TABLES?
Answer: While external tables give data control to Hive but not control of a schema, managed tables give both schema and data control.
3. WHAT APPLICATIONS ARE SUPPORTED BY HIVE?
Answer: Hive supports client applications based on Java, PHP, Python, C, and Ruby coding languages.
4. CAN HIVE BE USED IN OLTP SYSTEMS?
Answer: Since Hive does not support row-level data insertion, it is not suitable for use in OLTP systems.
5. CAN A TABLE NAME BE CHANGED IN HIVE?
Answer: Yes, you can change a table name in Hive. You can rename a table name by using: Alter Table table_name RENAME TO new_name.
6. WHERE DOES THE DATA OF A HIVE TABLE GET STORED?
Answer: The Hive table gets stored in an HDFS directory – /user/hive/warehouse, by default. You can adjust it by setting the desired directory in the configuration parameter hive.metastore.warehouse.dir in hive-site.xml.
7. WHAT IS A HIVE METASTORE?
Answer: A Metastore is a relational database that stores the metadata of Hive partitions, tables, databases, and so on.
8. CAN THE DEFAULT LOCATION OF A MANAGED TABLE BE CHANGED IN HIVE?
Answer: Yes, the default managed table location can be changed in Hive by using the LOCATION ‘’ clause.
9. WHERE IS HIVE TABLE DATA STORED?
Answer: Hive table data is stored in an HDFS directory by default – user/hive/warehouse. This can be altered.
10. WHAT IS THE DEFAULT APACHE HIVE METASTORE DATABASE?
Answer: The default database for metastore is the embedded Derby database provided by Hive which is backed by the local disk.
11. WHAT IS THE DIFFERENCE BETWEEN LOCAL AND REMOTE META STORES?
Answer: Local meta stores run on the same Java Virtual Machine (JVM) as the Hive service whereas remote meta stores run on a separate, distinct JVM.
12. WHAT ARE THE TYPES OF META STORES?
Answer: Local and Remote meta stores are the two types of Hive meta stores.
13. WHAT ARE THE THREE DIFFERENT MODES IN WHICH HIVE CAN BE OPERATED?
Answer: The three modes in which Hive can be operated are Local mode, distributed mode, and pseudo-distributed mode.
14. IS THERE A DATA TYPE IN THE HIVE TO STORE DATE INFORMATION?
Answer: The TIMESTAMP data type in Hive stores all data information in the java.sql.timestamp format.
15. CAN MULTIPLE USERS USE ONE METASTORE?
Answer: No, metastore sharing is not supported by Hive.
16. CAN YOU AVOID MAPREDUCE ON THE HIVE?
Answer: You can make Hive avoid MapReduce to return query results by setting the hive.exec.mode.local.auto property to ‘true’.
17. WHAT IS A VIEW IN A HIVE?
Answer: A view is a logical construct that allows search queries to be treated as tables.
18. CAN A CARTESIAN JOIN BE CREATED BETWEEN TWO HIVE TABLES?
Answer: This is not possible as it cannot be implemented in MapReduce programming.
19. WHAT IS INDEXING IN HIVE?
Answer: Hive indexing is a query optimization technique to reduce the time needed to access a column or a set of columns within a Hive database.
20. CAN WE USE THE LOAD OR INSERT COMMAND TO VIEW?
Answer: No, these commands cannot be used with respect to a view in Hive.
21. CAN THE NAME OF A VIEW BE THE SAME AS A HIVE TABLE NAME?
Answer: No, the name of the view must always be unique in the database.
22. WHAT IS THE HIVE OBJECT INSPECTOR FUNCTION?
Answer: It helps to analyze the structure of individual columns and rows and provides access to the complex objects that are stored within the database.
23. ARE MULTI-LINE COMMENTS SUPPORTED BY HIVE?
Answer: No, multi-line comments are supported by Hive.
24. HOW CAN YOU VIEW THE INDEXES OF A HIVE TABLE?
Answer: By using the following command: SHOW INDEX ON table_name
25. WHAT IS BUCKETING?
Answer: Bucketing is the process of hashing the values in a column into several user-defined buckets which helps avoid over-partitioning.
26. WHAT IS HCATALOG?
Answer: Hcatalog is a tool that helps to share data structures with other external systems in the Hadoop ecosystem
27. CAN YOU SPECIFY THE NAME OF THE TABLE CREATOR IN HIVE?
Answer: Yes, by using the TBLPROPERTIES clause. For example – TBLPROPERTIES (‘creator’= ‘john’)
28. WHAT IS UDF IN HIVE?
Answer: UDF is a user-designed function created with a Java program to address a specific function that is not part of the existing Hive functions.
29. WHAT ARE THE LIMITATIONS OF HIVE?
Answer: Hive has the following limitations:
- Real-time queries cannot be executed and it has no row-level support.
- Hive cannot be used for online transaction processing.
30. NAME THE COMPONENTS OF A HIVE QUERY PROCESSOR?
Answer: Following are the components of a Hive query processor:
- Logical Plan of Generation.
- Physical Plan of Generation.
- Execution Engine.
- UDF’s and UDAF.
- Semantic Analyzer.
- Type Checking.
31. HOW DO ORC FORMAT TABLES HELP HIVE TO ENHANCE THE PERFORMANCE?
Answer: You can easily store the Hive Data with the ORC (Optimized Row Column) format as it helps to streamline several limitations.
32. Name five components of a hive architecture.
Answer: Following are the five components of a Hive Architecture:
- User Interface: It helps the user to send queries to the Hive system and other operations. The user interface provides hive Web UI, Hive Command-Line and Hive HDInsight.
- Driver: It designs a session handle for the query, and then the queries are sent to the compiler for the execution plan.
- Metastore: It contains the organized data and information on various warehouse tables and partitions.
- Compiler: It creates the execution plan for the queries, performs semantic analysis on different query blocks, and generates query expression.
- Execution Engine: It implements the execution plans created by the compiler.
33. What is the difference between local and remote metastore?
It is the metastore service that runs in the same JVM in which the Hive service is running and connects to a database running in a separate JVM. Either on the same machine or on a remote machine.
In this configuration, the metastore service runs on its own separate JVM and not in the Hive service JVM.
34. When should we use SORT BY instead of ORDER BY?
Answer: Despite ORDER BY we should use SORT BY. Especially while we have to sort huge datasets. The reason is SORT BY clause sorts the data using multiple reducers. ORDER BY sorts all of the data together using a single reducer.
35. Why do we perform partitioning in Hive?
Answer: In a Hive table, Partitioning provides granularity. Hence, by scanning only relevant partitioned data instead of the whole dataset it reduces the query latency.
36. How does Hive distribute the rows into buckets?
Answer: By using the formula: hash_function (bucketing_column) modulo (num_of_buckets) Hive determines the bucket number for a row. Basically, hash_function depends on the column data type. Although, hash_function for integer data type will be:
hash_function (int_type_column)= value of int_type_column
37. What is the use of Hcatalog?
Answer: Basically, to share data structures with external systems we use Hcatalog. It offers access to hive metastore to users of other tools on Hadoop. Hence, they can read and write data to the hive’s data warehouse.
38. Explain the different types of join in Hive.
Answer: There are 4 different types of joins in HiveQL –
- JOIN- It is very similar to Outer Join in SQL
- FULL OUTER JOIN – This join Combines the records of both the left and right outer tables. Basically, that fulfills the join condition.
- LEFT OUTER JOIN- Through this Join, All the rows from the left table are returned even if there are no matches in the right table.
- RIGHT OUTER JOIN – Here also, all the rows from the right table are returned even if there are no matches in the left table
39. Wherever (Different Directory) I run the hive query, it creates new metastore_db, please explain the reason for it?.
Answer: Basically, it creates the local metastore, while we run the hive in embedded mode. Also, it looks at whether metastore already exists or not before creating the metastore. Hence, in configuration file hive-site.xml. Property is “javax.jdo.option.ConnectionURL” with default value “jdbc:derby:;databaseName=metastore_db;create=true” this property is defined. Hence, to change the behavior change the location to the absolute path, thus metastore will be used from that location.
40. Features of Hive.
Answer: Features of Hive
- The best feature is it offers data summarization, query, and analysis in much easier manner.
- To process data without actually storing in HDFS, Hive supports external tables.
- Moreover, it fits the low-level interface requirement of Hadoop perfectly.
41. Write a query to rename a table Student to Student_New.
Answer: Alter Table Student RENAME to Student_New
42. Explain the difference between partitioning and bucketing.
Answer: Partitioning and Bucketing of tables are done to improve the query performance. Partitioning helps execute queries faster, only if the partitioning scheme has some common range filtering i.e. either by timestamp ranges, by location, etc. Bucketing does not work by default.
Partitioning helps eliminate data when used in the WHERE clause. Bucketing helps organize data inside the partition into multiple files so that the same set of data will always be written in the same bucket. Bucketing helps in joining various columns.
In the partitioning technique, a partition is created for every unique value of the column and there could be a situation where several tiny partitions may have to be created. However, with bucketing, one can limit it to a specific number and the data can then be decomposed in those buckets.
Basically, a bucket is a file in Hive whereas a partition is a directory.
43. Differentiate between describing and describe extended.
Answer: Describe database/schema- This query displays the name of the database, the root location on the file system, and comments if any.
Describe extended database/schema- Gives the details of the database or schema in a detailed manner.
44. I want to see the present working directory in UNIX from the hive. Is it possible to run this command from the hive?
Answer: Hive allows execution of UNIX commands with the use of exclamatory (!) symbol. Just use the ! Symbol before the command to be executed at the hive prompt. To see the present working directory in UNIX from hive run !pwd at the hive prompt.
45. In case of embedded Hive, can the same metastore be used by multiple users?
Answer: We cannot use metastore in sharing mode. It is suggested to use standalone real database like PostGreSQL and MySQL.
46. What will be the output of cast (‘XYZ’ as INT)?
Answer: It will return a NULL value.
47. What is the default database provided by Hive for Metastore ?
Answer: Derby is the default database.
48. What is the purpose of partitioning in Hive?
Answer: Partitioning in Hive refers to the process of dividing the table into smaller parts based on the values of a particular column, such as date, city, age, or country. Partitioning is done so that data can be stored in slices, enabling the query response to become faster. Since Hadoop is used to handle large amounts of data, it is important to find ways to access the data using the best approach.
49. Can Hive be used in OLTP systems?
Answer: Hive cannot be used to support online transaction processing (OLTP), but can only be used for online analytical processing (OLAP). Hive does not support update or delete at the row level and is not suitable for real-time query processing.
50. Is there any method to delete the DBPROPERTY in Hive?
Answer: There is no way to delete or “unset” the DBPROPERTY in Hive.