hive日期字符串类函数及一些特殊处理记录

1.两个日期相减精确到秒:

select 
unix_timestamp('2022-12-22 14:00:00') - unix_timestamp('2022-12-22 13:00:00')

_c0
3600

如果是年月日和时分秒两个string类型拼起来即:

1	unix_timestamp(concat(trim(wc.operatedate ),' ', trim(wc.operatetime))) - unix_timestamp(concat(trim(wc.receivedate ),' ', trim(wc.receivetime))) operating_duration

2.将用字符串拼起来的时间转换为timestamp类型以插入表(字段建表给的timestamp类型)

1	from_unixtime(to_unix_timestamp(concat(trim(wrb.lastoperatedate),' ', trim(wrb.lastoperatetime)),'yyyy-MM-dd HH:mm:ss'))

3.获取当前UNIX时间戳函数: unix_timestamp

语法: unix_timestamp()
返回值: bigint
说明: 获得当前时区的 UNIX 时间戳
hive> select unix_timestamp() from tableName;
1616906976

语法: unix_timestamp(string date)
说明: 返回指定日期格式的时间戳
注意: 如果后面只有date参数，date的形式必须为'yyyy-MM-dd HH:mm:ss'的形式。
hive (default)> select unix_timestamp('2020-05-01');
NULL
hive (default)> select unix_timestamp('2020-05-01 00:00:00');
1464710400

4.UNIX 时间戳转日期函数: from_unixtime

语法: from_unixtime(bigint unixtime[, string format])
返回值: string
说明: 转化 UNIX 时间戳（从 1970-01-01 00:00:00 UTC 到指定时间的秒数）到当前时区的时间格式
hive> select from_unixtime(1616906976,'yyyyMMdd')
20210328

hive> select from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss')
2023-01-04 03:39:55

(1).3和4混合使用:

1
2
3

hive> select from_unixtime(unix_timestamp(sua.apply_date),'yyyy-MM-dd HH:mm:ss')
其中sua.apply_date是string格式的年月日时分秒毫秒
返回结果：

(2).将yyyyMMdd格式的字符串转换为yyyy-MM-dd的格式的日期类型字段

1
2
3

hive>select to_date(from_unixtime(UNIX_TIMESTAMP(work_date_str,'yyyyMMdd'))) work_date
其中:	work_date_str为'20220801'
结果:	work_date:2022-08-01

5.日期格式化，按照格式返回字符串: date_format(date/timestamp/string, string fmt)

将日期格式化为：2020-05-01和2020-05
返回类型：string
hive (default)> select date_format('2020-05-01 12:00:00','yyyy-MM-dd');
_c0
2020-05-01
hive (default)> select date_format('2020-05-01 12:00:00','yyyy-MM');
_c0
2020-05

场景(1)

1
2
3

case when dayofmonth(work_date) > 25 
then substr(add_months(date_format(work_date,'yyyy-MM-dd'),1),1,7)
else date_format(work_date,'yyyy-MM') end as r_month, --月份

6.字符串转日期函数，默认转为yyyy-MM-dd格式: to_date(string date)

当前时间转为日期格式
hive (default)> select to_date(current_timestamp);
注：current_timestamp：2023-01-04 01:09:18.383
_c0
2023-01-04

hive (default)> select to_date('2020-05-01 00:00:00') ;
2020-05-01
 
hive (default)> select to_date('2020-05-01');
2020-05-01

注意：参数必须是yyyy-MM-dd或yyyy-MM-dd HH:mm:ss格式，如果是yyyy-MM,则为null
hive (default)> select to_date('2016-09');  --有问题，结果为null

7.日期加法函数: date_add(date,number)

将日期增加或减少4天；
hive (default)> select date_add('2019-05-09',4);
_c0
2019-05-13
Time taken: 0.034 seconds, Fetched: 1 row(s)
hive (default)> select date_add('2019-05-09',-4);
_c0
2019-05-05
Time taken: 0.049 seconds, Fetched: 1 row(s)

8.日期减法函数: date_sub(date,number)

将日期增加或减少4天
hive (default)> select date_sub('2019-05-09',4);
_c0
2019-05-05
Time taken: 0.04 seconds, Fetched: 1 row(s)
hive (default)> select date_sub('2019-05-09',-4);
_c0
2019-05-13
Time taken: 0.058 seconds, Fetched: 1 row(s)

注: 7、8一般选择其一即可。

9.取该日期的下一个周几的日期: next_day(date,dayofweek)

星期一到星期日的英文

（Monday，Tuesday、Wednesday、Thursday、Friday、Saturday、Sunday）

取5.1后的下一个周五
hive (default)> select next_day('2020-05-01',"Fri");
_c0
2020-05-08

10.取当月的最后一天日期: last_day(date)

取5月的最后一天日期
hive (default)> select last_day('2020-05-01');
_c0
2020-05-31
hive (default)> select last_day('2020-04-01');
_c0
2020-04-30

11.日期比较函数，第一个日期减去第二个时期数字，为正，则前者大于后者，为负，则前者小于后者； datediff(date1,date2)

比较5月1日和5月5日的大小
hive (default)> select datediff('2020-05-01','2020-05-05');
_c0
-4

12.查询该日期在本月的第多少天: dayofmonth(date)

5月6号在五月是第多少天
hive (default)> select dayofmonth('2020-05-06');
_c0
6

13.获取当前日期: current_date

1
2
3

hive (default)> select current_date;
_c0
2020-05-14

14.获取当前时间: current_timestamp

1
2
3

hive (default)> select current_timestamp;
_c0
2020-05-14 10:26:57.613

15.日期加一个月: add_months(date,number)

2020-05-03加一个月
hive (default)> select add_months('2020-05-03',1);
_c0
2020-06-03

16.返回时间的最开始年份或月份: trunc()

trunc(string date, string format) 返回时间的最开始年份或月份

返回类型：string

select trunc('2019-04-08','YY') --2019-01-01
_c0
2019-01-01

select trunc('2019-04-08','MM') --2019-04-01
_c0
2019-04-01

17.返回下个月第1天: trunc(add_months(current_timestamp(),1),’MM’)

1
2
3

select trunc(add_months('2020-05-03',1),'MM');
_c0
2020-06-01

18.获取时间的年份year(date)、月份month(date)、天day(date)、小时hour(time)、分钟minute(time)、秒second(time)

当前日期所在的年份
hive (default)> select year('2020-05-14 12:00:00');
_c0
2020

当前日期所在的月份
hive (default)> select month('2020-05-14 12:00:00');
_c0
5

获取日期中的天
hive (default)> select day('2020-05-14 12:00:00');
_c0
14

获取时间中的小时
hive (default)> select hour('2020-05-14 12:00:00');
_c0
12

获取时间中的分钟
hive (default)> select minute('2020-05-14 12:04:50');
_c0
4

获取时间中的秒
hive (default)> select second('2020-05-14 12:04:53');
_c0
53

19.日期所在年份的第多少周: weekofyear(date)

当前日期是所在年份的第多少周
hive (default)> select weekofyear(current_date);
_c0
1

20.过滤多个模糊匹配的数据时:

select * from dim.dim_old_new_workflow_name  where workflow_name not REGEXP '无效|测试|预算编制' 

复杂写法:
select * from dim.dim_old_new_workflow_name where workflow_name not like '%无效%' and workflow_name not like '%测试%' and workflow_name not like '%预算编制%'

21.字符串拼接:

1.concat
语法:concat(str1,str2,...)
注意事项:如果拼接的参数里面有一个为null，则结果整体返回null

2.concat_ws  
语法:concat_ws(separator,str1,str2)
说明:第一个参数指定分隔符，分隔符不能为null，否则整体返回null；即时该函数后面待拼接的字符串参数存在为null也不影响整体结果，结果会跳过它继续拼接

3.hive没有group_concat()函数

22.instr函数

instr(str,substr) 返回substr在str第一次出现的位置（从1开始计数），如果substr在str中不存在则返回0

1 2	select instr('abcdefghijklmn','j') -- 10 > 10

配合substr使用：

1 2	select substr('abcdefghijklmn',instr('abcdefghijklmn','j'),3) >jkl

23.colect_list与collect_set

它们都是将分组中的某列转为一个数组返回，不同的是collect_list不去重而collect_set去重。

比如已知有以下表user_watch_movie记录部分用户在某月内看过的电影:

1 2	按用户分组，取出每个用户这个月看过的所有电影名称 select user_name,collect_list(movie) movie_array from user_watch_movie group by user_name;

但上面的查询结果有点问题，因为星际穿越太好看了，李四在第一次看完后几天又去看了一次，导致了观看列表有重复的，这时候就应该用collect_set增加去重。

1	select user_name,collect_set(movie) movie_array from user_watch_movie group by user_name;

突破group by限制

hive中，凡是在select后面没有使用到聚合函数的列都必须出现在group by后面，即select的列必须作为分组依据，但有时候想要根据A分组，然后随便取出每个组织中的一个B，运用到此例中，即:

1	select user_name,collect_list(movie)[1] movie from user_watch_movie group by user_name;

movie不是分组列，但依然能取出该列数据。

一些处理记录

1.行转列:

函数	描述
CONCAT(string A/col,string B/col…)	返回输入字符串连接后的结果，支持任意个输入字符串
CONCAT_WS(separator,str1, str2,…)	第一个参数参数间的分隔符，如果分隔符是 NULL，返回值也将为NULL。这个函数会跳过分隔符参数后的任何 NULL 和空字符串。分隔符将被加到被连接的字符串之间。
COLLECT_SET(col)	将某字段的值进行去重汇总，产生array类型字段
COLLECT_LIST(col)	函数只接受基本数据类型，它的主要作用是将某字段的值进行不去重汇总，产生array类型字段。

2.列转行(一列转多行):

Split(str, separator)： 将字符串按照后面的分隔符切割，转换成字符array。

EXPLODE(col)：将hive一列中复杂的array或者map结构拆分成多行。

LATERAL VIEW

1 2	用法： LATERAL VIEW udtf(expression) tableAlias A S columnAlias

解释:lateral view用于和split, explode等UDTF一起使用，它能够将一行数据拆成多行数据，在此基础上可以对拆分后的数据进行聚合。

lateral view首先为原始表的每行调用UDTF，UDTF会把一行拆分成一或者多行，lateral view再把结果组合，产生一个支持别名表的虚拟表。

例表如下：

1	select user_name,movie_name from user_watch_movie2 lateral view explode(split(movie,',')) movie_name_tmp as movie_name

实际项目运用中的部分代码：

将存有多个id并且以逗号分割的managerstr拆分，行转列后再与id关联

from ods.ods_oa_hrmresource hr
left join(
	select 
	   h.id,
	   h.managerstr,
	   concat_ws(',',collect_list(hr.workcode))  workcode,
	   concat_ws(',',collect_list(hr.lastname))  lastname
	 from 
	  (select
		   id,
		   managerstr,
		   single_alias 
	  from ods.ods_oa_hrmresource lateral view explode(split(managerstr, ',')) alias as single_alias) h 
	  left join ods.ods_oa_hrmresource hr on hr.id = h.single_alias
	  group by 
	   h.id,
	   h.managerstr
)h
on hr.id = h.id

3.order by和case when联用:

例子：

有如下user表

1	select * from user

需要安装status排序，但是排序规则是按照自定义指定的，状态4排在最前面，顺序依次为：4、2、3、6、1、0、5

select * from user
order by 
	case when status = 4 then 0
			 when status = 2 then 1
			 when status = 3 then 2
			 when status = 6 then 3
			 when status = 1 then 4
			 when status = 0 then 5
			 when status = 5 then 6
	else status end