Group by and distinct handle duplicate rows differently in distributed databases. Group by only sends unique values between AMPs, while distinct redistributes all rows, which can move more data. However, group by wastes time checking for duplicates that may not exist, while distinct does not have this inefficiency. The document then outlines the step-by-step processes for de-duplication using distinct and group by.
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0 ratings0% found this document useful (0 votes)
151 views1 page
Difference Between Distinct and Group by
Group by and distinct handle duplicate rows differently in distributed databases. Group by only sends unique values between AMPs, while distinct redistributes all rows, which can move more data. However, group by wastes time checking for duplicates that may not exist, while distinct does not have this inefficiency. The document then outlines the step-by-step processes for de-duplication using distinct and group by.
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 1
Since DISTINCT redistributes the rows immediately, more data may move between the AMPs, where as
GROUP BY that only sends unique values between the AMPs.
So, we can say that GROUP BY sounds more efficient. But when you assume that data is nearly unique in a table, GROUP BY will spend more time attempting to eliminate duplicates that do not exist at all.Therefore, it is wasting its time to check for duplicates the first time. Then, it must redistribute the same amount of data . Let us see how these steps are used in each case for elimination of Duplicates (can be found out using explain plan) DISTINCT 1. It reads each row on AMP 2. Hashes the column value identified in the distinct clause of select statement. 3. Then redistributes the rows according to row value into appropriate AMP 4. Once redistribution is completed , it a. Sorts data to group duplicates on each AMP b. Will remove all the duplicates on each amp and sends the original/unique value P.s: There are cases when "Error : 2646 No more Spool Space " . In such cases try using GROUP BY. GROUP BY 1. It reads all the rows part of GROUP BY 2. It will remove all duplicates in each AMP for given set of values using "BUCKETS" concept 3. Hashes the unique values on each AMP 4. Then it will re-distribute them to particular /appropriate AMP's 5. Once redistribution is completed , it a. Sorts data to group duplicates on each AMP b. Will remove all the duplicates on each amp and sends the original/unique value Hence it is better to go for GROUP BY - when Many duplicates DISTINCT - when few or no duplicates GROUP BY - SPOOL space is exceeded