[Hive] Sort

less than 1 minute read


Sort

  • order by
  • sort by
  • distribute by
  • cluster by

order by

  • uses single reducer to do the full sorting of data
  • used with limit often since one single reducer is used(데이터가 너무 많으면 시간이 너무 오래 걸리기에)
select * 
from ordering
order by col2
limit 5;

sort by

  • uses multiple reducer
    • overall sorting is not done
    • 즉, 각 reducer 안에서만 sorting이 이루어짐
set mapreduce.job.reduces=2; -- 2개의 reducer로 set

select * 
from ordering 
sort by col2
limit 5;

sort by

  • uses multiple reducer
    • overall sorting is not done
    • 즉, 각 reducer 안에서만 sorting이 이루어짐
set mapreduce.job.reduces=2; -- 2개의 reducer로 set

select * 
from ordering 
sort by col2
limit 5;

distribute by

  • distribute the ‘key:value’ pairs amongst the reducers
    • 즉, 특정 value들이 특정 reducer에 모이는 것임
    • 따라서 실제 sorting을 위해서는 distribute by + sort by를 해야 함
select * 
from ordering 
distribute by col2 
sort by col2;

cluster by

  • distribute by + sort by
select * 
from ordering 
cluster by col2;

Tags:

Categories:

Updated: