Hive Data Model
Category: bigdata
Internal table (Table)
External table (External Table)
Partition
Bucket


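Internal tables
A Table stores its data in Hive's own warehouse directory, which defaults to /user/hive/warehouse (configured by hive.metastore.warehouse.dir).
Every Table has a corresponding directory under the warehouse directory,
and all of that Table's data is stored in that directory.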

# Create a table (the array element type was lost from the original text; string is assumed here)
create table if not exists aiops.appinfo (
    appname string,
    level string,
    leader string,
    appline string,
    dep string,
    ips  array<string>
)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ' '
    COLLECTION ITEMS TERMINATED BY ',';
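To make the two delimiters concrete, here is a minimal Python sketch (not Hive code) of how one line of a text file would map onto the appinfo columns. The function parse_appinfo_line and the sample line are hypothetical, invented for illustration, and the sketch ignores escaping and NULL handling.

```python
# Column names follow the aiops.appinfo definition above.
COLUMNS = ["appname", "level", "leader", "appline", "dep", "ips"]


def parse_appinfo_line(line, field_sep=" ", item_sep=","):
    """Split one text line the way the ROW FORMAT clause describes (simplified)."""
    fields = line.rstrip("\n").split(field_sep)
    row = dict(zip(COLUMNS, fields))
    # The last column is an array: split it on the collection-item delimiter.
    row["ips"] = row["ips"].split(item_sep)
    return row


# Hypothetical sample line, invented for illustration.
sample = "pay-gateway P0 alice payments infra 10.0.0.1,10.0.0.2"
row = parse_appinfo_line(sample)
print(row["appname"], row["ips"])
```

Because the field delimiter here is a space, a value that itself contains a space would shift every following column, which is why the data file has to match the declared format exactly.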

# Custom file and record formats
# Append a STORED AS clause to the end of create table to save the data as a sequence file [the default is text]
stored as sequencefile

# Grant privileges on a database
grant create on database dbname to user hadoop;

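# Importing data
# Hive is schema-on-read: a data file is checked against the table format only when it is queried, so uploaded files must match that format. MySQL, by contrast, is schema-on-write.

# Loading from HDFS is essentially a file move
load data inpath  'hdfs://hdfs-name/sure.csv' overwrite into table aiops.appinfo;
# Loading from the local file system copies the data into HDFS
load data local inpath '/home/hdfs/online_state1' overwrite into table online_state PARTITION (end_dt='99991231');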

# Inspect the table structure
describe extended bgops;
describe bgops;

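# Rename a column
## This command can change a column's name, data type, comment, and position: FIRST moves the column to the first position, and AFTER col_name places it immediately after col_name
ALTER TABLE aiops.appinfo CHANGE hostnum ipnum int comment 'some comment' AFTER col3;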

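# Change the table structure (the array element types were lost from the original text; string is assumed here)
ALTER TABLE appinfo replace columns (appname string,appline string,level string,leader string,dep string,idcnum int,idcs array<string>,hostnum int,ips array<string>);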

# Add columns to the table (new columns are appended at the end by default; use change column to reposition them)
alter table appinfo add columns (appclass string comment 'app_perf_class');

# Export query results (the results are written to the testoutput directory)
insert overwrite local directory './testoutput'
row format delimited fields terminated by "\t"
select ip,appname,leader from appinfo  LATERAL VIEW explode(ips) tmpappinfo  AS ip;
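What LATERAL VIEW explode(ips) does to each row can be sketched in a few lines of Python. This is only a model of the row-multiplying behavior, not Hive code; the function name and the sample rows are hypothetical.

```python
def lateral_view_explode(rows):
    """Yield one (ip, appname, leader) tuple per element of each row's ips array."""
    for row in rows:
        # A row whose array is empty produces no output rows at all.
        for ip in row["ips"]:
            yield (ip, row["appname"], row["leader"])


# Hypothetical sample rows, invented for illustration.
rows = [
    {"appname": "pay-gateway", "leader": "alice", "ips": ["10.0.0.1", "10.0.0.2"]},
    {"appname": "search", "leader": "bob", "ips": []},
]
result = list(lateral_view_explode(rows))
for ip, appname, leader in result:
    # Tab-separated, matching the fields terminated by "\t" clause.
    print("\t".join([ip, appname, leader]))
```

In this model the second row disappears from the output because its array is empty, which matches how a plain LATERAL VIEW (without OUTER) behaves.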



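External tables
An External Table must be given the directory to read its data from, whereas an internal table stores its data under the default warehouse path. Dropping an internal table deletes both the data and the metadata; dropping an external table deletes only the metadata and leaves the data files in place. The two kinds of table organize their metadata in the same way. For an external table, creating the table and attaching its data happen in one step, and the data is not moved into the warehouse directory.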

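When to use external and internal tables:

If the data files already exist in HDFS, an external table is recommended (this is the more common case).
If the table is created first and data is inserted afterwards, an internal table is recommended.
In practice, external tables are what we use most in day-to-day development, for example for raw log files or for datasets that several departments work on at the same time. If the metadata is deleted by accident, the data on HDFS is still there and can be recovered, which makes the data safer.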


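## Creating an external table (the array and map type parameters were lost from the original text; string is assumed here)
create external table psn
(id int,
name string,
likes array<string>,
address map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
location '/usr/';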


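Partitioned tables
Partitioning is usually described as either static or dynamic: with static partitioning the target partition is named explicitly when data is loaded, while with dynamic partitioning the partition is derived from the data being loaded. The benefit of partitioning is that data is grouped by partition value, so a query that filters on the partition columns avoids a full table scan.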

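# Create a partitioned table (the array and map type parameters were lost from the original text; string is assumed here)
create table psn
(
id int,
name string,
likes array<string>,
address map<string,string>
)
partitioned by (age int,gender string)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';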

# Load data
load data local inpath '/root/data/data' into table psn partition(age=20,gender='man');

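# Add a partition (this really just creates the partition directory, with no data in it)
# When adding a partition to a table with several partition columns, values for all partition columns must be given
alter table psn add partition(age=20,gender='women');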
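Since a partition is just a nested directory named column=value, the layout can be sketched as follows. This is an illustrative Python helper, not Hive code; partition_dir is a hypothetical name, and the table directory assumes psn lives in the default database under the default warehouse path.

```python
import posixpath


def partition_dir(table_dir, spec):
    """Build the directory for a partition from ordered (column, value) pairs."""
    parts = [f"{col}={value}" for col, value in spec]
    return posixpath.join(table_dir, *parts)


# Assumed location of the psn table (default database, default warehouse path).
TABLE_DIR = "/user/hive/warehouse/psn"

path_man = partition_dir(TABLE_DIR, [("age", 20), ("gender", "man")])
path_women = partition_dir(TABLE_DIR, [("age", 20), ("gender", "women")])
print(path_man)
print(path_women)
```

Both partitions share the age=20 directory and differ only in the gender subdirectory, which is why a query filtering on age and gender only needs to read one leaf directory.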

# Drop partitions; a partial partition specification may be used
alter table psn drop partition(gender='man');

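# Repair partitions (registers partition directories that exist on HDFS but are missing from the metastore)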
msck repair table psn;

# Using static partitions
insert overwrite table partition_test partition(stat_date='20110527',province='liaoning') select member_id,name from partition_test_input;


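Enabling dynamic partitioning


# To use dynamic partitions, first set hive.exec.dynamic.partition to true; older Hive releases default it to false, which disallows dynamic partitions:
set hive.exec.dynamic.partition=true;
# Then set the partition mode to nonstrict (the default is strict)
set hive.exec.dynamic.partition.mode=nonstrict;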

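# In strict mode, at least one partition column must be given a static value; the remaining partition columns may be dynamic.
# In other words, the partition columns may not all be dynamic. This guards against a user who meant to create partitions dynamically only within a sub-partition but forgot to give the top-level partition column a value, so that a single DML statement creates a large number of new partitions (and directories) in a short time and hurts system performance.
# This is Hadoop's philosophy of keeping well-meaning users from making mistakes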

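# Inserting data: a static partition must be specified explicitly
# Dynamic partitions are simple to use. Suppose I want to insert data into the partition stat_date='20110728' and let Hive decide which province sub-partition each row goes to. It can be written like this:
insert overwrite table partition_test partition(stat_date='20110728',province)
select member_id,name,province from partition_test_input where stat_date='20110728';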
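The routing that this statement asks for, and the strict-mode check described earlier, can be modeled in a short Python sketch. It is a simplified illustration rather than Hive's implementation; check_spec, route_rows, and the sample rows are hypothetical. It relies on one Hive rule: dynamic partition values are taken from the trailing columns of the SELECT list, by position.

```python
from collections import defaultdict


def check_spec(spec, mode="strict"):
    """spec is an ordered list of (column, value); value None marks a dynamic column."""
    if mode == "strict" and all(value is None for _, value in spec):
        raise ValueError("strict mode requires at least one static partition column")


def route_rows(rows, spec):
    """Group SELECT output rows by target partition directory.

    Dynamic partition values are read from the trailing columns of each row,
    in the order the dynamic columns appear in the PARTITION clause.
    """
    dynamic = [col for col, value in spec if value is None]
    split_at = None
    partitions = defaultdict(list)
    for row in rows:
        split_at = len(row) - len(dynamic)
        data, dyn_values = row[:split_at], row[split_at:]
        values = iter(dyn_values)
        parts = [
            f"{col}={value if value is not None else next(values)}"
            for col, value in spec
        ]
        partitions["/".join(parts)].append(data)
    return dict(partitions)


# partition(stat_date='20110728', province): one static column, one dynamic.
spec = [("stat_date", "20110728"), ("province", None)]
check_spec(spec)  # passes in strict mode because stat_date is static

# Hypothetical rows as returned by: select member_id, name, province ...
rows = [(1, "zhang", "liaoning"), (2, "li", "beijing"), (3, "wang", "liaoning")]
routed = route_rows(rows, spec)
for directory, members in routed.items():
    print(directory, members)
```

Each distinct province value in the data becomes its own directory, which is exactly how a forgotten static value on a high-cardinality column could create a very large number of partitions.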


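Some important parameters
hive.exec.max.dynamic.partitions.pernode (default 100)
The maximum number of dynamic partitions that each mapper or reducer node may create; exceeding it raises an error

hive.exec.max.dynamic.partitions (default 1000)
The maximum total number of dynamic partitions that one DML statement may create

hive.exec.max.created.files (default 100000)
The maximum number of files that all mappers and reducers in a MapReduce job may create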


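Bucketed tables
Bucketing splits the data in one directory into multiple files, each holding part of the data, which makes lookups easier and improves retrieval efficiency.
How it works:
Take one or more columns of the table, compute their hash code, and use the hash code modulo the number of buckets to decide which file each row is written to.

The data within a bucket can additionally be sorted by one or more columns. Joining two such buckets then becomes an efficient merge-sort, which further improves the efficiency of map-side joins.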
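The hash-then-modulo rule can be written out as a small Python sketch. It assumes an integer bucketing column whose hash code is the integer itself; Hive's real hash function depends on the column type and Hive version, so treat bucket_for_int as a hypothetical model rather than Hive's code. The sample ages are invented.

```python
JAVA_INT_MAX = 0x7FFFFFFF  # mask that clears the sign bit of a 32-bit hash


def bucket_for_int(value, num_buckets):
    """Return the 0-based bucket index for an integer bucketing column.

    Assumes the hash code of an int is the int itself.
    """
    return (value & JAVA_INT_MAX) % num_buckets


# Hypothetical ages, bucketed into 4 buckets.
ages = [18, 20, 21, 23, 24]
buckets = {}
for age in ages:
    buckets.setdefault(bucket_for_int(age, 4), []).append(age)

for index in sorted(buckets):
    print(index, buckets[index])
```

Rows with the same value in the bucketing column always land in the same file, which is what makes bucket-wise sampling and joins possible.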

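# Enable bucketing enforcement (needed on Hive 1.x and earlier; the setting was removed in Hive 2.0, where bucketing is always enforced)
set hive.enforce.bucketing=true;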

# Create the source table:
CREATE TABLE psn( 
id INT, name STRING, age INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

# A bucketed table with sorted buckets
CREATE TABLE bucketed_users (
id INT, name STRING
) 
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS; 

# Create the bucketed table:
create table psn_bucket(
id int ,name string,age int
) 
clustered by(age) into 4 buckets 
row format delimited fields terminated by ',';

# Insert data into the bucketed table:
insert into psn_bucket select * from psn;

# Sample the data in the buckets
SELECT * FROM bucketed_users TABLESAMPLE(BUCKET 1 OUT OF 4 ON id); 


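Use cases:
1. Data sampling
tablesample(bucket x out of y)

y must be a multiple or a factor of the table's total number of buckets. Hive uses y to determine the sampling fraction (total number of buckets / y).
For example, if the table has 64 buckets, y=32 samples (64/32=) 2 buckets of data, and y=128 samples (64/128=) 1/2 of one bucket.

x is the bucket at which sampling starts.
For example, if the table has 32 buckets, tablesample(bucket 3 out of 16) samples (32/16=) 2 buckets of data: bucket 3 and bucket (3+16=) 19.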
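The arithmetic in these examples can be checked with a short Python sketch. sampled_buckets is a hypothetical helper, not a Hive API; it returns 1-based bucket numbers with the fraction of each bucket that is read. For the case where y exceeds the bucket count, the choice of bucket follows from the hash-modulo rule and is an assumption of this sketch.

```python
from fractions import Fraction


def sampled_buckets(total_buckets, x, y):
    """Return [(bucket_number, fraction_read)] for tablesample(bucket x out of y)."""
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    if total_buckets % y == 0:
        # y is a factor of the bucket count: read whole buckets x, x+y, x+2y, ...
        return [(b, Fraction(1)) for b in range(x, total_buckets + 1, y)]
    if y % total_buckets == 0:
        # y is a multiple of the bucket count: read part of a single bucket.
        bucket = (x - 1) % total_buckets + 1
        return [(bucket, Fraction(total_buckets, y))]
    raise ValueError("y must be a multiple or a factor of the bucket count")


print(sampled_buckets(32, 3, 16))
print(sampled_buckets(64, 1, 32))
print(sampled_buckets(64, 1, 128))
```

The first call reproduces the example above (buckets 3 and 19), and the last shows the y=128 case, where only half of one bucket is read.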



A bucketed table can be built on top of a partitioned table, or a table can be bucketed without being partitioned.

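Differences between partitioned and bucketed tables
1. A Hive table can be partitioned by certain columns, which gives finer-grained data management and makes some queries faster.

2. Tables and partitions can be further divided into Buckets; bucketing works on a principle similar to HashPartitioner in MapReduce programming.

3. Both partitioning and bucketing refine data management. Because Hive is schema-on-read, data added to a partition is not validated against the schema, whereas the data in a bucketed table is hashed on the bucketing columns into multiple files, so its placement is much more reliable.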