Hive: deduplicating on one column during import

Author: admin  Date: 2017-05-17

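Requirements:
1. Data imported from the source is stored in partitions by ascending date.
2. In actual business use, only the earliest-entered record for each key is wanted, i.e. the first stock-in record.
3. A scheduled job filters the data and stores the result in another table.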

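Key point:
The business data imported from the source each day is a full refresh, but it is stored incrementally. The same value of a given column (imei here) is therefore bound to appear more than once, and the duplicates have to be handled.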
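To see how much repetition the incremental storage has produced, a query along these lines can count the rows per imei. It is only a sketch: it assumes the source table is t_sn and that imei is the column expected to be unique, as in the queries in this post.

```sql
-- List every imei that appears more than once in the source table,
-- together with its earliest and latest createdtime.
select
  t.imei,
  count(*)           as cnt,
  min(t.createdtime) as first_created,
  max(t.createdtime) as last_created
from t_sn t
group by t.imei
having count(*) > 1;
```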

Querying the source data

The data we want to obtain

Hive features used: row_number() and NOT IN
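As a minimal sketch of how row_number() picks the earliest row per key, the following query runs on three inline rows built with Hive's stack() function, so it does not depend on any table. The sample values are made up for illustration.

```sql
-- Three made-up rows: imei A appears twice, imei B once.
select s.imei, s.createdtime
from
(
  select r.imei, r.createdtime,
         -- number the rows of each imei, earliest createdtime first
         row_number() over (partition by r.imei order by r.createdtime asc) rn
  from
  (
    select stack(3,
                 'A', '2017-05-16 09:00:00',
                 'A', '2017-05-14 08:30:00',
                 'B', '2017-05-15 10:15:00') as (imei, createdtime)
  ) r
) s
where s.rn = 1;  -- keeps only the earliest row of each imei
```

NOT IN is the second ingredient: it skips every imei that an earlier run has already stored.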

Querying the earliest-entered source record for each imei

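select
  tt.imei,
  tt.id,
  tt.code,
  substring(tt.createdtime, 1, 10) as createddate,
  tt.rn
from
(
  -- number the rows of each imei by ascending createdtime;
  -- partition by / order by is equivalent to distribute by / sort by here
  select t.*, row_number() over (partition by t.imei order by t.createdtime asc) rn
  from t_sn t
  -- where: add filter conditions here
) tt
-- rn = 1 already leaves one row per imei, so no distinct is needed
where tt.rn = 1;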

The result is as expected.

Creating the new table

create table IF NOT EXISTS t_channel_code
(
imei string,
firstore string,
firstoretime string
)
partitioned by
(
partdate string
);

Importing into the new table

insert overwrite table t_channel_code
partition
(
  partdate='date20170516'
)
select
  distinct trim(tt.imei) as imei,
  trim(tt.code) as firstore,
  substring(tt.createdtime, 1, 10) as firstoretime
from
(
  -- keep only that day's rows whose imei is not yet in the target table,
  -- then number the rows of each imei by ascending createdtime
  -- (the full timestamp, so the earliest row of the day wins)
  select t.imei, t.code, t.createdtime,
         row_number() over (partition by t.imei order by t.createdtime asc) rn
  from t_FirstStocksn t
  where substring(t.createdtime, 1, 10) = '2017-05-16'
    and t.imei not in (select tcai.imei from t_channel_code tcai)
) tt
where tt.rn = 1;
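Because the job is meant to run on a schedule, the two hard-coded dates can be passed in as Hive variables instead of being edited by hand. The following is a sketch under assumptions: the file name dedup_import.hql and the variable names day and partday are hypothetical, and the table checked by NOT IN is assumed to be the target table t_channel_code.

```sql
-- Hypothetical file: dedup_import.hql
-- Example invocation from a scheduler:
--   hive --hivevar day=2017-05-16 --hivevar partday=20170516 -f dedup_import.hql
-- ${hivevar:...} is replaced textually before the statement is compiled.
insert overwrite table t_channel_code
partition (partdate='date${hivevar:partday}')
select
  distinct trim(tt.imei) as imei,
  trim(tt.code) as firstore,
  substring(tt.createdtime, 1, 10) as firstoretime
from
(
  select t.imei, t.code, t.createdtime,
         row_number() over (partition by t.imei order by t.createdtime asc) rn
  from t_FirstStocksn t
  where substring(t.createdtime, 1, 10) = '${hivevar:day}'
    and t.imei not in (select tcai.imei from t_channel_code tcai)
) tt
where tt.rn = 1;
```

Keeping the statement in a file and passing only the dates means the scheduler entry never has to change.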

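The key is the subquery. It restricts the rows to the day being loaded, uses NOT IN to drop every imei that is already in t_channel_code, and numbers the remaining rows of each imei so that the outer query can keep rn = 1:
select t.imei, t.code, t.createdtime,
       row_number() over (partition by t.imei order by t.createdtime asc) rn
from t_FirstStocksn t
where substring(t.createdtime, 1, 10) = '2017-05-16'
  and t.imei not in (select tcai.imei from t_channel_code tcai)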
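One caveat with NOT IN: under SQL's three-valued logic, a single NULL imei in the subquery result makes the predicate match no rows at all. If that can happen in t_channel_code, the same filter can be written as an anti-join. This is a sketch of an alternative, not the query used in this post; the alias loaded is made up for illustration.

```sql
-- Same filter as the subquery above, written as an anti-join:
-- keep rows of t_FirstStocksn that have no match in t_channel_code.
select t.imei, t.code, t.createdtime,
       row_number() over (partition by t.imei order by t.createdtime asc) rn
from t_FirstStocksn t
left outer join
(
  select distinct tcai.imei from t_channel_code tcai
) loaded
  on t.imei = loaded.imei
where substring(t.createdtime, 1, 10) = '2017-05-16'
  and t.imei is not null      -- NOT IN also drops rows whose own imei is NULL
  and loaded.imei is null;    -- no match means the imei has not been stored yet
```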

Verifying the result
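One way to check the import, sketched here under the assumption that every imei should appear at most once in t_channel_code across all partitions:

```sql
-- 1. Duplicate check: this should return no rows.
select imei, count(*) as cnt
from t_channel_code
group by imei
having count(*) > 1;

-- 2. Row count per partition, to confirm that the day's partition was written.
select partdate, count(*) as cnt
from t_channel_code
group by partdate;
```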

References
Hive row_number() function: detailed usage and examples
http://blog.sina.com.cn/s/blog_5ceb51480102wabj.html

http://stackoverflow.com/questions/7677333/how-to-write-subquery-and-use-in-clause-in-hive

Source: original article from this site