I am sharing my knowledge about hive partitioning and will describe you with example .Hope you all get benefited from this.
Partitioning is a way of organizing a table into different parts (partitions) based on the values of particular columns (partitioned column).It is helpful when the table has one or more partition keys which are the basic elements to identify the particular partition.
When large amount of data stored in HDFS and we are going to query over the data set from HDFS then it slows down the query performance .Thus it is recommended to create partitions in tables. Apache Hive helps us by implementing partitions by its automatic partition scheme at the time of table creation.It is a optimization technique that improves performance in Hive.
Hive supports two types of partitioning
1.Static Partitioning
2.Dynamic Partitioning
In static partitioning we can directly load data into partitioned table .However,Dynamic partitioning is handled by hive itself.Hive does not have idea about raw file data.Hence,It first gets loaded into a temporary table then by use of temporary table we have to create partitioned table.
Static Partitioning:
If we want to use static partition the we need to the set property
set hive.mapred.mode = strict
This property is set by default in hive-site.xml .Static partition is preferable while loading big files data to hive table.
Input data:
/home/hadoop/table.csv
eid, ename, phone_no, loc
1, surya, 9373365498, DEL
2, animesh, 9365845214, KOL
3, sameer, SC, 2010569823,BANG
4, karthik, TP, 9010528694,KOL
First of all create an external table or managed table:
CREATE EXTERNAL TABLE employee( eid int,ename String,phone_no bigint) PARTITIONED BY (loc String) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
Then load data over partitioned table.
LOAD DATA LOCAL INPATH '/home/hadoop/table.csv' OVERWRITE INTO TABLE employee PARTITION (loc='KOL') ;
Verify data by using following query
SELECT * from employee where loc ='KOL';
Dynamic Partitioning:
Hive is be default enabled dynamic partitioning.
If not then we need to set property
set hive.exec.dynamic.partition=true;
By default dynamic partitioning is strict means we need to do one level of STATIC partitioning before doing hive dynamic partitions.hence we need to set it nonstrict mode to enable full partitioning.
set hive.exec.dynamic.partition.mode=nonstrict;
Number of partitions allowed by default for single node is 100 and cluster is 1000 so if your hive runs in pseudo distributed mode with single node and your partitioned column cardinality is more than 100, than you have to set the below hive property as well.
set hive.exec.max.dynamic.partitions.pernode=1000;
first of all create a temporary table of employee with all of its attributes.
CREATE TEMPORARY TABLE employee(ename String, eid int,loc String,phone_no bigint) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
then ,load data into employee table
LOAD DATA LOCAL INPATH '/home/hadoop/table.csv' INTO TABLE employee;
Now create partition table
CREATE TABLE part(ename String,eid int,phone_no bigint)PARTITIONED BY (loc String)
STORED AS SEQUENCEFILE;
then load data into partitioned table
INSERT INTO TABLE part PARTITION (loc) SELECT ename,eid,phone_no,loc FROM employee;
Verify data as well
SELECT ename,phone_no FROM part WHERE loc='kolkata'
To check no of partitions
show partitions part;
To describe employee table in detail
describe formatted employee;
Happy learning !!
Debashree
0 Comments
Please do not enter any spam link in the comment box