Hive Chapter 8: Schema Design

blackproof


Hive patterns and anti-patterns

1. Table by day: split the data by day. In relational databases this pattern is discouraged; in Hive, use a partitioned table instead.

create table supply(id int, part string, quantity int) partitioned by (day int);

alter table supply add partition (day=20120102);
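As a minimal sketch of why the partitioned table replaces one table per day, the query below filters on the partition column, so Hive only needs to read the matching partition directories. It assumes the supply table is partitioned by an int column named day; the quantity threshold is an arbitrary illustrative value.

```sql
-- Assumes the supply table from this section, partitioned by an INT column named day.
-- The filter on the partition column lets Hive read only the matching
-- partition directories instead of scanning the whole table.
SELECT part, quantity
FROM supply
WHERE day >= 20120102 AND day < 20120103
  AND quantity < 4;  -- illustrative threshold, not from the original post
```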

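Negative effects of partitioning:

1. NameNode limitation

Every subdirectory and file that partitioning creates is stored in HDFS, and the NameNode keeps the metadata for all of them in memory. The downside is therefore the NameNode's upper limit on filesystem capacity (Hadoop has this upper limit on the total number of files; MapR and Amazon S3 don't have this limitation).

2. A job is split into several tasks. Each file is handled by a separate task, and each task runs as its own JVM instance (process). Too many instances mean a lot of JVM start-up and tear-down overhead, which slows the computation down.

So a table should not have too many partitions, and each file should be as large as possible.

A good table-by-day design produces similarly sized chunks of data across the time intervals, and the interval can be widened when needed. Also make sure each file is larger than the filesystem block size. The goal is to keep each partition large enough. Another approach is to partition the data along multiple dimensions.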
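The two alternatives described above could look roughly like this. The table names, the yr_month column, and the state column are hypothetical and chosen only for illustration; whether a month-level or a two-level layout is appropriate depends on how much data each partition ends up holding.

```sql
-- Hypothetical variant 1: a coarser time interval (one partition per month).
CREATE TABLE supply_by_month (id INT, part STRING, quantity INT)
PARTITIONED BY (yr_month INT);   -- e.g. 201201

-- Hypothetical variant 2: two partition dimensions (day plus a state column).
CREATE TABLE supply_by_day_state (id INT, part STRING, quantity INT)
PARTITIONED BY (day INT, state STRING);

ALTER TABLE supply_by_month ADD PARTITION (yr_month = 201201);
ALTER TABLE supply_by_day_state ADD PARTITION (day = 20120102, state = 'CA');
```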

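2. Unique keys and normalization

This is a favorite strategy in relational databases, but Hive has no concept of primary keys. Because Hive can store denormalized data in types such as array, map, and struct, it can avoid one-to-many joins, which speeds up I/O. But you also pay the penalty of denormalization, such as data duplication and a higher risk of inconsistent data.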
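A small sketch of what such a denormalized table might look like. The customers table and all of its columns are hypothetical; the point is only that the collection types hold what a normalized design would put in separate child tables.

```sql
-- Hypothetical table: names and columns are illustrative only.
CREATE TABLE customers (
  id      INT,
  name    STRING,
  phones  ARRAY<STRING>,                      -- replaces a one-to-many phone table
  prefs   MAP<STRING, STRING>,                -- free-form key/value attributes
  address STRUCT<street:STRING, city:STRING, zip:STRING>
);

-- Collection elements are read in place, with no join.
SELECT name, phones[0], prefs['language'], address.city
FROM customers;
```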

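3. Making multiple passes over the same data: optimizing repeated operations on one source

Written as two separate statements, each insert scans history once, so the table is read twice:

insert overwrite table sales
select * from history where action='purchased';

insert overwrite table credits
select * from history where action='returned';

Written as a single multi-insert statement, history is scanned only once:

from history
insert overwrite table sales select * where action='purchased'
insert overwrite table credits select * where action='returned';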

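4. The case for partitioning every table

To keep a failed job from wiping out data, write into a partition when inserting, for example table1 partition(day=20120102). The intermediate partition then has to be dropped once it is no longer needed.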
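One way this might look in practice is sketched below. The staging_events table, the raw_events source, and their columns are hypothetical; only the partition value comes from the text above.

```sql
-- Hypothetical staging table and source; all names besides the partition
-- value are illustrative.
CREATE TABLE staging_events (id INT, action STRING)
PARTITIONED BY (day INT);

-- Each run overwrites only its own day's partition, so a failed or
-- rerun job for one day does not touch the data of other days.
INSERT OVERWRITE TABLE staging_events PARTITION (day = 20120102)
SELECT id, action
FROM raw_events
WHERE event_day = 20120102;

-- Once downstream steps have consumed it, drop the intermediate partition.
ALTER TABLE staging_events DROP IF EXISTS PARTITION (day = 20120102);
```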

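5. Bucketing table data storage

When a table has no obvious column to partition on, or when you want to ease the load on the filesystem, use bucketing. Its advantage is that the number of files does not change as data is added; it also makes sampling easy and helps some join operations.

create table weblog(user_id int, url string, source_ip string) partitioned by (dt string) clustered by (user_id) into 96 buckets;

When populating the table, the number of reducers must match the number of buckets that the hash produces. Either set hive.enforce.bucketing=true before the insert, or set the number of reducers directly to the number of buckets with set mapred.reduce.tasks=96.

set hive.enforce.bucketing=true;

from raw_logs
insert overwrite table weblog partition(dt='2009-02-25') select user_id, url, source_ip where dt='2009-02-25';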
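To illustrate the sampling benefit mentioned in this section, the sketch below reads a single bucket with TABLESAMPLE. It assumes the weblog table is clustered by user_id into 96 buckets and that the partition for the given day has been populated with bucketing enforced.

```sql
-- Assumes the weblog table from this section, clustered by user_id into 96 buckets.
-- Selecting one bucket out of 96 reads roughly 1/96 of that day's rows.
SELECT user_id, url
FROM weblog TABLESAMPLE (BUCKET 1 OUT OF 96 ON user_id) s
WHERE dt = '2009-02-25';
```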

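6. Adding columns to a table

Hive does not enforce a format on the stored data, so a column can be added as requirements grow. If a row has fewer fields than the expected number of columns, the missing ones are filled with NULL; if it has more, the extra fields are dropped.

create table weblogs(version bigint, url string) partitioned by (hit_data int) row format delimited fields terminated by '\t';

Load the data; fields missing from the file are padded as described above:

load data local inpath 'log1.txt' into table weblogs partition (hit_data=20110101);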
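Adding the column itself could be sketched as follows. The user_id column, the log2.txt file, and the 20110102 partition value are hypothetical and do not appear in the original post.

```sql
-- user_id, log2.txt, and the 20110102 partition are hypothetical examples.
ALTER TABLE weblogs ADD COLUMNS (user_id STRING);

-- Newer files can carry the extra field.
LOAD DATA LOCAL INPATH 'log2.txt' INTO TABLE weblogs PARTITION (hit_data = 20110102);

-- Rows loaded from older files that lack the field return NULL for user_id.
SELECT version, url, user_id FROM weblogs;
```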

7. (Almost) always use compression
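The post gives no settings for this item, so the following is only a sketch of session-level properties that are commonly used to turn compression on. The codec choices are assumptions (Snappy in particular requires its native library to be available on the cluster), and the property names are the older mapred-style ones used elsewhere in this post.

```sql
-- Compress data passed between the stages of a multi-stage query.
SET hive.exec.compress.intermediate=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compress the final output files written by a query.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
```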

2013-02-13 22:17