大数据开发之Hive篇19-Hive分区表详解

墨天轮 · · 1269 次点击 · · 开始浏览

这是一个创建于的文章，其中的信息可能已经有所发展或是发生改变。

备注:
Hive 版本 2.1.1

一.Hive分区表概述
二.静态分区

2.1 单分区测试
2.2 多分区测试

三.动态分区

3.1 动态分区测试
3.2 动态分区和静态分区混合使用

四.分区的其它操作

4.1 恢复分区
4.2 归档分区
4.3 交换分区

参考

一.Hive分区表概述

数据分区的概念以及存在很久了，通常使用分区来水平分散压力，将数据从物理上移到和使用最频繁的用户更近的地方，以及实现其目的。 hive中有分区表的概念，我们可以看到分区具重要性能优势，而且分区表还可以将数据以一种符合逻辑的方式进行组织，比如分层存储

Hive官方网站对Hive partition的介绍:
可以使用Partitioned BY子句创建分区表。一个表可以有一个或多个分区列，并且为分区列中的每个不同的值组合创建一个单独的数据目录。此外，可以使用按列聚集的方式对表或分区进行存储，并且可以通过按列排序的方式在存储区内对数据进行排序。这可以提高某些查询的性能。

如果在创建分区表时，出现这样的错误:“FAILED: error in semantic analysis: Column repeated in partitioning columns”，这意味着您试图将分区的列包含在表本身的数据中。您可能确实定义了列。但是，您创建的分区会生成一个可以查询的伪列，因此必须将表列重命名为其他东西(用户不应该查询的东西!)。

分区表分别有静态分区和动态分区

创建分区表语法:

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)                  -- (Note: Available in Hive 0.10.0 and later)]
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]
  [
   [ROW FORMAT row_format] 
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];   -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)

修改分区表语法:
通过在ALTER TABLE语句中使用PARTITION子句，可以添加、重命名、交换(移动)、删除或归档分区

-- 新增分区
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'][, PARTITION partition_spec [LOCATION 'location'], ...];
 
partition_spec:
  : (partition_column = partition_col_value, partition_column = partition_col_value, ...)

-- 重命名分区
ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec;

-- 删除分区
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec;

-- 删除分区(回收站不保留，直接删除)
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec PURGE;

-- 归档和还原归档分区表
ALTER TABLE table_name ARCHIVE PARTITION partition_spec;
ALTER TABLE table_name UNARCHIVE PARTITION partition_spec;

二.静态分区

可以根据PARTITIONED BY创建分区表，一个表可以拥有一个或者多个分区，每个分区以文件夹的形式单独存在表文件夹的目录下。
分区是以字段的形式在表结构中存在，通过describe table命令可以查看到字段存在，但是该字段不存放实际的数据内容，仅仅是分区的表示。
分区建表分为2种，一种是单分区，也就是说在表文件夹目录下只有一级文件夹目录。另外一种是多分区，表文件夹下出现多文件夹嵌套模式。

2.1 单分区测试

代码:

-- 创建分区表
create table test_part(key int,value string) partitioned by (dt string);
-- 查看分区表
describe formatted test_part;
-- 往分区表录入数据
insert into test_part partition(dt = '2020-12-29') values (1,'abc');
insert into test_part partition(dt = '2020-12-30') values (2,'def');
insert into test_part partition(dt = '2020-12-31') values (3,'ghi');

-- 新增分区
alter table  test_part add partition(dt = '2021-01-01');
-- 重命名分区
alter table test_part partition(dt = '2021-01-01') rename to partition(dt = '2021-01-02');
-- 删除分区
alter table test_part drop partition(dt='2020-12-31');
-- 删除分区(回收站不保留，直接删除)
alter table test_part drop partition(dt='2020-12-30') purge;

测试记录:

hive> 
    > create table test_part(key int,value string) partitioned by (dt string);
OK
Time taken: 0.087 seconds
hive> describe formatted test_part;
OK
# col_name              data_type               comment             
                 
key                     int                                         
value                   string                                      
                 
# Partition Information          
# col_name              data_type               comment             
                 
dt                      string                                      
                 
# Detailed Table Information             
Database:               test                     
OwnerType:              USER                     
Owner:                  root                     
CreateTime:             Tue Dec 29 15:59:02 CST 2020     
LastAccessTime:         UNKNOWN                  
Retention:              0                        
Location:               hdfs://nameservice1/user/hive/warehouse/test.db/test_part        
Table Type:             MANAGED_TABLE            
Table Parameters:                
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
        numFiles                0                   
        numPartitions           0                   
        numRows                 0                   
        rawDataSize             0                   
        totalSize               0                   
        transient_lastDdlTime   1609228742          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        serialization.format    1                   
Time taken: 0.193 seconds, Fetched: 38 row(s)
hive> 


hive> insert into test_part partition(dt = '2020-12-29') values (1,'abc');
Query ID = root_20201229160327_aef35dcf-1aaa-4b30-b47b-cbe9ea3aec3e
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
20/12/29 16:03:28 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1609141291605_0003, Tracking URL = http://hp3:8088/proxy/application_1609141291605_0003/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1609141291605_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-29 16:03:35,297 Stage-1 map = 0%,  reduce = 0%
2020-12-29 16:03:42,522 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.85 sec
MapReduce Total cumulative CPU time: 2 seconds 850 msec
Ended Job = job_1609141291605_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/test_part/dt=2020-12-29/.hive-staging_hive_2020-12-29_16-03-27_864_460539835119685808-1/-ext-10000
Loading data to table test.test_part partition (dt=2020-12-29)
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 2.85 sec   HDFS Read: 4229 HDFS Write: 90 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 850 msec
OK
Time taken: 16.389 seconds
hive> 
    > insert into test_part partition(dt = '2020-12-30') values (2,'def');
Query ID = root_20201229160421_9da78dee-ff03-4d5a-9fef-b3808a69a3c6
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
20/12/29 16:04:22 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1609141291605_0004, Tracking URL = http://hp3:8088/proxy/application_1609141291605_0004/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1609141291605_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-29 16:04:29,214 Stage-1 map = 0%,  reduce = 0%
2020-12-29 16:04:36,442 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.99 sec
MapReduce Total cumulative CPU time: 2 seconds 990 msec
Ended Job = job_1609141291605_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/test_part/dt=2020-12-30/.hive-staging_hive_2020-12-29_16-04-21_921_4429247941950558015-1/-ext-10000
Loading data to table test.test_part partition (dt=2020-12-30)
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 2.99 sec   HDFS Read: 4232 HDFS Write: 90 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 990 msec
OK
Time taken: 17.229 seconds
hive> 
    > insert into test_part partition(dt = '2020-12-31') values (3,'ghi');
Query ID = root_20201229160458_be37945d-a462-4ab1-b8fe-0180277c2399
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
20/12/29 16:04:58 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1609141291605_0005, Tracking URL = http://hp3:8088/proxy/application_1609141291605_0005/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1609141291605_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-29 16:05:06,866 Stage-1 map = 0%,  reduce = 0%
2020-12-29 16:05:13,062 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.03 sec
MapReduce Total cumulative CPU time: 3 seconds 30 msec
Ended Job = job_1609141291605_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/test_part/dt=2020-12-31/.hive-staging_hive_2020-12-29_16-04-58_612_6006646373659507783-1/-ext-10000
Loading data to table test.test_part partition (dt=2020-12-31)
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 3.03 sec   HDFS Read: 4232 HDFS Write: 90 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 30 msec
OK
Time taken: 16.151 seconds
hive> 
    > alter table  test_part add partition(dt = '2021-01-01');
OK
Time taken: 0.147 seconds
hive> 
    > alter table test_part partition(dt = '2021-01-01') rename to partition(dt = '2021-01-02');
OK
Time taken: 0.353 seconds
hive> 
    > alter table test_part drop partition(dt='2020-12-31');
Dropped the partition dt=2020-12-31
OK
Time taken: 0.151 seconds
hive> alter table test_part drop partition(dt='2020-12-30') purge;
Dropped the partition dt=2020-12-30
OK
Time taken: 0.135 seconds
hive>

我们来查看下分区的存储
可以看到一份分区一个文件夹，一个分区下可能有0个或多个文件

[root@hp1 ~]# hadoop fs -ls /user/hive/warehouse/test.db/test_part
Found 2 items
drwxrwxrwt   - root hive          0 2020-12-29 16:03 /user/hive/warehouse/test.db/test_part/dt=2020-12-29
drwxrwxrwt   - root hive          0 2020-12-29 16:09 /user/hive/warehouse/test.db/test_part/dt=2021-01-02

2.2 多分区测试

代码:

-- 创建分区表
create table test_part2(key int,value string) partitioned by (prod_name string,dt string);
-- 查看分区表
describe formatted test_part2;
-- 往分区表录入数据
insert into test_part2 partition(prod_name='PROD1',dt = '2020-12-29') values (1,'abc');
insert into test_part2 partition(prod_name='PROD1',dt = '2020-12-30') values (2,'def');
insert into test_part2 partition(prod_name='PROD1',dt = '2020-12-31') values (3,'ghi');

-- 新增分区
alter table  test_part2 add partition(prod_name='PROD2',dt = '2020-12-29');
-- 重命名分区
alter table test_part2 partition(prod_name='PROD2',dt = '2020-12-29') rename to partition(prod_name='PROD3',dt = '2020-12-29');

测试记录:

hive> 
    > 
    > 
    > create table test_part2(key int,value string) partitioned by (prod_name string,dt string);
OK
Time taken: 0.09 seconds
hive> describe formatted test_part2;
OK
# col_name              data_type               comment             
                 
key                     int                                         
value                   string                                      
                 
# Partition Information          
# col_name              data_type               comment             
                 
prod_name               string                                      
dt                      string                                      
                 
# Detailed Table Information             
Database:               test                     
OwnerType:              USER                     
Owner:                  root                     
CreateTime:             Tue Dec 29 16:43:06 CST 2020     
LastAccessTime:         UNKNOWN                  
Retention:              0                        
Location:               hdfs://nameservice1/user/hive/warehouse/test.db/test_part2       
Table Type:             MANAGED_TABLE            
Table Parameters:                
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
        numFiles                0                   
        numPartitions           0                   
        numRows                 0                   
        rawDataSize             0                   
        totalSize               0                   
        transient_lastDdlTime   1609231386          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        serialization.format    1                   
Time taken: 0.074 seconds, Fetched: 39 row(s)
hive> insert into test_part2 partition(prod_name='PROD1',dt = '2020-12-29') values (1,'abc');
Query ID = root_20201229164322_9ffab4bf-8e7b-449f-ac07-91f73016da13
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
20/12/29 16:43:22 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1609141291605_0006, Tracking URL = http://hp3:8088/proxy/application_1609141291605_0006/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1609141291605_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-29 16:43:29,435 Stage-1 map = 0%,  reduce = 0%
2020-12-29 16:43:35,624 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.86 sec
MapReduce Total cumulative CPU time: 3 seconds 860 msec
Ended Job = job_1609141291605_0006
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/test_part2/prod_name=PROD1/dt=2020-12-29/.hive-staging_hive_2020-12-29_16-43-22_143_1775810973752071302-1/-ext-10000
Loading data to table test.test_part2 partition (prod_name=PROD1, dt=2020-12-29)
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 3.86 sec   HDFS Read: 4319 HDFS Write: 107 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 860 msec
OK
Time taken: 16.222 seconds
hive> insert into test_part2 partition(prod_name='PROD1',dt = '2020-12-30') values (2,'def');
Query ID = root_20201229164339_1a52188f-5fcb-4275-a26b-8a36db978218
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
20/12/29 16:43:40 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1609141291605_0007, Tracking URL = http://hp3:8088/proxy/application_1609141291605_0007/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1609141291605_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-29 16:43:47,977 Stage-1 map = 0%,  reduce = 0%
2020-12-29 16:43:55,193 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.61 sec
MapReduce Total cumulative CPU time: 3 seconds 610 msec
Ended Job = job_1609141291605_0007
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/test_part2/prod_name=PROD1/dt=2020-12-30/.hive-staging_hive_2020-12-29_16-43-39_784_2410125837572976640-1/-ext-10000
Loading data to table test.test_part2 partition (prod_name=PROD1, dt=2020-12-30)
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 3.61 sec   HDFS Read: 4319 HDFS Write: 107 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 610 msec
OK
Time taken: 17.101 seconds
hive> insert into test_part2 partition(prod_name='PROD1',dt = '2020-12-31') values (3,'ghi');
Query ID = root_20201229164358_696aac78-0a49-46f0-bb8a-e408b22c48cb
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
20/12/29 16:43:58 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1609141291605_0008, Tracking URL = http://hp3:8088/proxy/application_1609141291605_0008/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1609141291605_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-29 16:44:06,382 Stage-1 map = 0%,  reduce = 0%
2020-12-29 16:44:12,570 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.46 sec
MapReduce Total cumulative CPU time: 3 seconds 460 msec
Ended Job = job_1609141291605_0008
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/test_part2/prod_name=PROD1/dt=2020-12-31/.hive-staging_hive_2020-12-29_16-43-58_167_5034454344356683760-1/-ext-10000
Loading data to table test.test_part2 partition (prod_name=PROD1, dt=2020-12-31)
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 3.46 sec   HDFS Read: 4319 HDFS Write: 107 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 460 msec
OK
Time taken: 17.146 seconds
hive> alter table  test_part2 add partition(prod_name='PROD2',dt = '2020-12-29');
OK
Time taken: 0.114 seconds
hive> alter table test_part2 partition(prod_name='PROD2',dt = '2020-12-29') rename to partition(prod_name='PROD3',dt = '2020-12-29');
OK
Time taken: 0.309 seconds
hive>

我们来查看下分区的存储
可以看到多级分区目录为多级，每一个prod_name一个文件夹，产品文件夹下面，每一个dt是一个目录。
多分区虽然会产生比较多的文件，如果数据量不大的情况下，谨慎使用。

[root@hp1 ~]# hadoop fs -ls /user/hive/warehouse/test.db/test_part2
Found 3 items
drwxrwxrwt   - root hive          0 2020-12-29 16:43 /user/hive/warehouse/test.db/test_part2/prod_name=PROD1
drwxrwxrwt   - root hive          0 2020-12-29 16:44 /user/hive/warehouse/test.db/test_part2/prod_name=PROD2
drwxrwxrwt   - root hive          0 2020-12-29 16:44 /user/hive/warehouse/test.db/test_part2/prod_name=PROD3
[root@hp1 ~]# hadoop fs -ls /user/hive/warehouse/test.db/test_part2/prod_name=PROD1
Found 3 items
drwxrwxrwt   - root hive          0 2020-12-29 16:43 /user/hive/warehouse/test.db/test_part2/prod_name=PROD1/dt=2020-12-29
drwxrwxrwt   - root hive          0 2020-12-29 16:43 /user/hive/warehouse/test.db/test_part2/prod_name=PROD1/dt=2020-12-30
drwxrwxrwt   - root hive          0 2020-12-29 16:44 /user/hive/warehouse/test.db/test_part2/prod_name=PROD1/dt=2020-12-31
[root@hp1 ~]# hadoop fs -ls /user/hive/warehouse/test.db/test_part2/prod_name=PROD1/dt=2020-12-29
Found 1 items
-rwxrwxrwt   3 root hive          6 2020-12-29 16:43 /user/hive/warehouse/test.db/test_part2/prod_name=PROD1/dt=2020-12-29/000000_0
[root@hp1 ~]#

三.动态分区

官方文档对静态分区和动态分区的简单解释:

Static Partition (SP) columns: in DML/DDL involving multiple partitioning columns, the columns whose values are known at COMPILE TIME (given by user).
Dynamic Partition (DP) columns: columns whose values are only known at EXECUTION TIME.

静态分区(SP)列:在涉及多个分区列的DML/DDL中，这些列的值在编译时已知(由用户给出)。
Dynamic Partition (DP)列:在执行时才知道其值的列。

上面我们测试静态分区的时候，可以看到操作分区表的时候一定要指定分区，动态分区就解决了这个问题。

3.1 动态分区测试

开启动态分区，需要调整如下配置:

set hive.exec.dynamic.partition=true;   --开启动态分区 默认为true，开启
set hive.exec.dynamic.partition.mode=nonstrict; --指定动态分区模式，默认为strict，即必须指定至少一个分区为静态分区，nonstrict模式表示允许所有的分区字段都可以使用动态分区

代码:

 CREATE TABLE emp_dynamic_partition (
 empno int,
 ename string,
 job string,
 mgr int,
 hiredate string,
 salary double,
 comm double
 )
 PARTITIONED BY (deptno int);

insert into emp_dynamic_partition partition(deptno) select * from emp;

测试记录:

hive> 
    > 
    >  CREATE TABLE emp_dynamic_partition (
    >  empno int,
    >  ename string,
    >  job string,
    >  mgr int,
    >  hiredate string,
    >  salary double,
    >  comm double
    >  )
    >  PARTITIONED BY (deptno int);
OK
Time taken: 0.106 seconds
hive> 
    > insert into emp_dynamic_partition select * from emp;
FAILED: SemanticException 1:12 Need to specify partition columns because the destination table is partitioned. Error encountered near token 'emp_dynamic_partition'
hive> 
    > set hive.exec.dynamic.partition.mode=nonstrict; 
hive> 
    > 
    > insert into emp_dynamic_partition select * from emp;
FAILED: SemanticException 1:12 Need to specify partition columns because the destination table is partitioned. Error encountered near token 'emp_dynamic_partition'
hive> 
    > 
    > insert into emp_dynamic_partition partition(deptno) select * from emp;
Query ID = root_20201229170212_e550dbfe-6d2e-49b0-83c8-9bc7d88c3144
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
20/12/29 17:02:13 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1609141291605_0009, Tracking URL = http://hp3:8088/proxy/application_1609141291605_0009/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1609141291605_0009
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-12-29 17:02:20,034 Stage-1 map = 0%,  reduce = 0%
2020-12-29 17:02:26,274 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.77 sec
MapReduce Total cumulative CPU time: 6 seconds 770 msec
Ended Job = job_1609141291605_0009
Stage-4 is filtered out by condition resolver.
Stage-3 is selected by condition resolver.
Stage-5 is filtered out by condition resolver.
Launching Job 3 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
20/12/29 17:02:28 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1609141291605_0010, Tracking URL = http://hp3:8088/proxy/application_1609141291605_0010/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1609141291605_0010
Hadoop job information for Stage-3: number of mappers: 3; number of reducers: 0
2020-12-29 17:02:39,398 Stage-3 map = 0%,  reduce = 0%
2020-12-29 17:02:44,560 Stage-3 map = 67%,  reduce = 0%, Cumulative CPU 3.5 sec
2020-12-29 17:02:48,675 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 5.11 sec
MapReduce Total cumulative CPU time: 5 seconds 110 msec
Ended Job = job_1609141291605_0010
Loading data to table test.emp_dynamic_partition partition (deptno=null)


         Time taken to load dynamic partitions: 0.18 seconds
         Time taken for adding to write entity : 0.001 seconds
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2   Cumulative CPU: 6.77 sec   HDFS Read: 13606 HDFS Write: 1041 HDFS EC Read: 0 SUCCESS
Stage-Stage-3: Map: 3   Cumulative CPU: 5.11 sec   HDFS Read: 11072 HDFS Write: 635 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 880 msec
OK
Time taken: 37.464 seconds
hive>

可以看到根据deptno自动创建了3个分区

[root@hp1 ~]# hadoop fs -ls /user/hive/warehouse/test.db/emp_dynamic_partition
Found 3 items
drwxrwxrwt   - root hive          0 2020-12-29 17:02 /user/hive/warehouse/test.db/emp_dynamic_partition/deptno=10
drwxrwxrwt   - root hive          0 2020-12-29 17:02 /user/hive/warehouse/test.db/emp_dynamic_partition/deptno=20
drwxrwxrwt   -

3.2 动态分区和静态分区混合使用

动态分区和静态分区可以混合使用

假设此时我有一个订单表，订单有一个状态。
我需要做一个快照表，每天凌晨把所有订单的状态做一个快照，但是数据量比较大，除了按天分区外，我还想根据状态进行分区，此时我们可以将两种分区混合使用

代码:

create table test1(user_id int,prod_name string,apply_status string);
insert into test1 values (1,'Prod1','申请中');
insert into test1 values (2,'Prod1','审批中');
insert into test1 values (3,'Prod1','完成');
insert into test1 values (4,'Prod1','拒绝中');

create table test_mix_partition
(user_id    int,
 prod_name  string
)
partitioned by (dt string,apply_status string);

insert into test_mix_partition partition(dt='2020-12-29',apply_status) select user_id,prod_name,apply_status from test1;

测试记录

hive> 
    > 
    > create table test_mix_partition
    > (user_id    int,
    >  prod_name  string
    > )
    > partitioned by (dt string,apply_status string);
OK
Time taken: 0.069 seconds
hive> insert into test_mix_partition partition(dt='2020-12-29',apply_status) select user_id,prod_name,apply_status from test1;
Query ID = root_20201229172136_c5b5e88f-f2ad-4863-944e-c1dc89be7030
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
20/12/29 17:21:36 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1609141291605_0015, Tracking URL = http://hp3:8088/proxy/application_1609141291605_0015/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1609141291605_0015
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-12-29 17:21:43,261 Stage-1 map = 0%,  reduce = 0%
2020-12-29 17:21:49,445 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.77 sec
MapReduce Total cumulative CPU time: 3 seconds 770 msec
Ended Job = job_1609141291605_0015
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/test_mix_partition/dt=2020-12-29/.hive-staging_hive_2020-12-29_17-21-36_213_8707202021987000828-1/-ext-10000
Loading data to table test.test_mix_partition partition (dt=2020-12-29, apply_status=null)


         Time taken to load dynamic partitions: 0.175 seconds
         Time taken for adding to write entity : 0.001 seconds
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2   Cumulative CPU: 3.77 sec   HDFS Read: 9390 HDFS Write: 423 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 770 msec
OK
Time taken: 14.95 seconds
hive> 


[root@hp1 ~]# hadoop fs -ls /user/hive/warehouse/test.db/test_mix_partition
Found 1 items
drwxrwxrwt   - root hive          0 2020-12-29 17:21 /user/hive/warehouse/test.db/test_mix_partition/dt=2020-12-29
[root@hp1 ~]# 
[root@hp1 ~]# hadoop fs -ls /user/hive/warehouse/test.db/test_mix_partition/dt=2020-12-29
Found 4 items
drwxrwxrwt   - root hive          0 2020-12-29 17:21 /user/hive/warehouse/test.db/test_mix_partition/dt=2020-12-29/apply_status=完成
drwxrwxrwt   - root hive          0 2020-12-29 17:21 /user/hive/warehouse/test.db/test_mix_partition/dt=2020-12-29/apply_status=审批中
drwxrwxrwt   - root hive          0 2020-12-29 17:21 /user/hive/warehouse/test.db/test_mix_partition/dt=2020-12-29/apply_status=拒绝中
drwxrwxrwt   - root hive          0 2020-12-29 17:21 /user/hive/warehouse/test.db/test_mix_partition/dt=2020-12-29/apply_status=申请中
[root@hp1 ~]#

四.分区的其它操作

4.1 恢复分区

语法:

MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];

msck repair table命令执行后Hive会检测如果HDFS目录下存在但表的metastore中不存在的partition元信息，更新到metastore中。如果有一张表已经存放好几年了，用这个命令去执行的话半天都反应不了，所以这个命令太暴力了，生产中不推荐使用。可以用Add partition来添加分区。

4.2 归档分区

语法:

ALTER TABLE table_name ARCHIVE PARTITION partition_spec;
ALTER TABLE table_name UNARCHIVE PARTITION partition_spec;

归档分区可以减轻 name node的压力，当然，数据量非常大、文件非常多(千万级别)的时候，再考虑做归档。

4.3 交换分区

EXCHANGE PARTITION命令将一个分区从源表移动到目标表，并更改每个表的元数据。

约束:

目标表不能包含要交换的分区。
如果有索引，操作将失败。
交换分区不允许将事务表作为源或目标。或者，使用LOAD DATA或INSERT OVERWRITE命令在事务性表之间移动分区。
此命令要求源表名和目标表名具有相同的表模式。
如果模式不同，抛出以下异常:
The tables have different schemas. Their partitions cannot be exchanged

语法:

ALTER TABLE <dest_table> EXCHANGE PARTITION (<[partial] partition spec>) WITH TABLE <src_table>

例1:

-- 创建两个分区表
CREATE TABLE T1(a string, b string) PARTITIONED BY (ds string);
CREATE TABLE T2(a string, b string) PARTITIONED BY (ds string);
ALTER TABLE T1 ADD PARTITION (ds='1');
 
-- 将t1表的分区 ds=‘1' 移动到t2表
ALTER TABLE T2 EXCHANGE PARTITION (ds='1') WITH TABLE T1;

测试记录:

hive> 
    > CREATE TABLE T1(a string, b string) PARTITIONED BY (ds string);
OK
Time taken: 0.081 seconds
hive> CREATE TABLE T2(a string, b string) PARTITIONED BY (ds string);
OK
Time taken: 0.085 seconds
hive> ALTER TABLE T1 ADD PARTITION (ds='1');
OK
Time taken: 0.103 seconds
hive> ALTER TABLE T2 EXCHANGE PARTITION (ds='1') WITH TABLE T1;
OK
Time taken: 0.303 seconds
hive>

例2

-- 创建2个多分区列的分区表
CREATE TABLE T1 (a string, b string) PARTITIONED BY (ds string, hr string);
CREATE TABLE T2 (a string, b string) PARTITIONED BY (ds string, hr string);
ALTER TABLE T1 ADD PARTITION (ds = '1', hr = '00');
ALTER TABLE T1 ADD PARTITION (ds = '1', hr = '01');
ALTER TABLE T1 ADD PARTITION (ds = '1', hr = '03');
 
--  ds='1' 的三个分区同时移动到 t1分区表
ALTER TABLE T2 EXCHANGE PARTITION (ds='1') WITH TABLE T1;

测试记录:

hive> 
    > 
    > CREATE TABLE T1 (a string, b string) PARTITIONED BY (ds string, hr string);
OK
Time taken: 0.092 seconds
hive> CREATE TABLE T2 (a string, b string) PARTITIONED BY (ds string, hr string);
OK
Time taken: 0.073 seconds
hive> ALTER TABLE T1 ADD PARTITION (ds = '1', hr = '00');
OK
Time taken: 0.106 seconds
hive> ALTER TABLE T1 ADD PARTITION (ds = '1', hr = '01');
OK
Time taken: 0.122 seconds
hive> ALTER TABLE T1 ADD PARTITION (ds = '1', hr = '03');
OK
Time taken: 0.093 seconds
hive> ALTER TABLE T2 EXCHANGE PARTITION (ds='1') WITH TABLE T1;
OK
Time taken: 0.337 seconds
hive>

参考

1.https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables

本文来自：墨天轮

感谢作者：墨天轮

查看原文：大数据开发之Hive篇19-Hive分区表详解

1269 次点击

加入收藏微博

收入我的专栏

上一篇：Elasticsearch 聚合数据结果不精确，怎么破？_铭毅天下的博客

下一篇：正则表达式(括号)、[中括号]、{大括号}的区别小结_Kou_Hi的博客

分区表

测试

apache

重命名

0 回复

添加一条新回复（您需要登录后才能回复没有账号？）

请尽量让自己的回复能够对别人有帮助
支持 Markdown 格式, **粗体**、~~删除线~~、`单行代码`
支持 @ 本站用户；支持表情（输入 : 提示），见 Emoji cheat sheet
图片支持拖拽、截图粘贴等方式上传

大数据开发之Hive篇19-Hive分区表详解

Table of Contents

一.Hive分区表概述

二.静态分区

2.1 单分区测试

2.2 多分区测试

三.动态分区

3.1 动态分区测试

3.2 动态分区和静态分区混合使用

四.分区的其它操作

4.1 恢复分区

4.2 归档分区

4.3 交换分区

参考

用户登录

今日阅读排行

一周阅读排行

Table of Contents

一.Hive分区表概述

二.静态分区

2.1 单分区测试

2.2 多分区测试

三.动态分区

3.1 动态分区测试

3.2 动态分区和静态分区混合使用

四.分区的其它操作

4.1 恢复分区

4.2 归档分区

4.3 交换分区

参考

大数据开发之Hive篇19-Hive分区表详解

Table of Contents

一.Hive分区表概述

二.静态分区

2.1 单分区测试

2.2 多分区测试

三.动态分区

3.1 动态分区测试

3.2 动态分区和静态分区混合使用

四.分区的其它操作

4.1 恢复分区

4.2 归档分区

4.3 交换分区

参考

用户登录

今日阅读排行

一周阅读排行

给该专栏投稿 写篇新文章

收入到我管理的专栏 新建专栏

Table of Contents

一.Hive分区表概述

二.静态分区

2.1 单分区测试

2.2 多分区测试

三.动态分区

3.1 动态分区测试

3.2 动态分区和静态分区混合使用

四.分区的其它操作

4.1 恢复分区

4.2 归档分区

4.3 交换分区

参考

给该专栏投稿写篇新文章

收入到我管理的专栏新建专栏