Sunday, June 03, 2012

A smarter way to calculate distinct counts

I was recently writing a Pig script to calculate a distinct count over three fields on a large data set, and it kept failing with an out-of-memory error on the reducer. All three fields are strings, and the issue was that the distinct count was being computed on a single reducer. I couldn't figure out a way around the single reducer, so I used the approach below instead.

Approach throwing the OOM error


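-- Load the tab-separated input.
A = LOAD '$inp' USING PigStorage('\t');
-- Keep only the three string fields of interest.
H = FOREACH A GENERATE $1,$3,$23;
uq_pid = DISTINCT H PARALLEL 20;
-- GROUP ALL gathers every distinct tuple into one bag on a single reducer.
guq_pid = GROUP uq_pid ALL;
itr_uq_pid = FOREACH guq_pid {
    GENERATE COUNT_STAR(uq_pid);
}
STORE itr_uq_pid INTO '$otp/uq_metric';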


Modified approach using a constant


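A = LOAD '$inp' USING PigStorage('\t');
-- Append the constant 1 as a fourth field.
pid = FOREACH A GENERATE $1,$3,$23,1;
uq_pid = DISTINCT pid PARALLEL 20;
-- Keep only the constant; each remaining row stands for one distinct combination.
constant_uq_pid = FOREACH uq_pid GENERATE $3;
-- Group on the constant and count the rows in that group.
guq_pid = GROUP constant_uq_pid BY $0;
itr_uq_pid = FOREACH guq_pid {
    GENERATE COUNT_STAR(constant_uq_pid);
}
STORE itr_uq_pid INTO '$otp/uq_metric';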
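
One way to see what changes between the two scripts is to model them on a toy data set. The Python sketch below is not Pig and says nothing about how Pig schedules reducers or uses memory; the sample rows and the function names are invented for illustration. It only shows that both versions yield the same count, and that in the second version the bag being counted holds copies of the constant rather than the three string fields.

```python
# Toy model of the two Pig scripts. The rows and helper names are hypothetical.
rows = [
    ("u1", "a", "x"),
    ("u1", "a", "x"),  # exact duplicate of the first row
    ("u2", "a", "x"),
    ("u2", "b", "y"),
]


def count_original(rows):
    """DISTINCT, then GROUP ALL: one bag holding every distinct 3-field tuple."""
    bag = list(set(rows))
    return len(bag), bag


def count_with_constant(rows):
    """Append 1, DISTINCT, keep only the constant, then GROUP BY the constant."""
    distinct = {row + (1,) for row in rows}
    constants = [t[3] for t in distinct]
    groups = {}
    for c in constants:
        groups.setdefault(c, []).append(c)
    bag = groups.get(1, [])
    return len(bag), bag


n_original, bag_original = count_original(rows)
n_constant, bag_constant = count_with_constant(rows)
print(n_original, n_constant)
print(bag_constant)
```

On this sample both functions count the same distinct combinations, but the bag in the second version contains only the value 1 repeated once per combination.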
