Removing Duplicates

00:00 In the previous lesson, I showed you how to reconcile two datasets. In this lesson, I’ll show you how to deal with duplicates in your data. I brushed over an important fact in the previous lesson.

00:10 Both the issued and cashed check CSV files have a column named Amount. If you were paying close attention in the last lesson, you might have noticed that the resulting combined data had these columns renamed.

00:23 Off to the REPL to look at just what NumPy does. Onscreen is the combined dataset resulting from joining the issued and cashed check CSV files from the previous lesson.
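Here’s a minimal sketch of what that join might look like. The column layouts below are hypothetical stand-ins (the lesson’s actual CSV files have more fields), but the rec_join() call from NumPy’s recfunctions module is the same one used in the previous lesson:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical layouts; the lesson's real files have more columns
issued_dtype = [("id", "i8"), ("payee", "U64"), ("amount", "f8")]
cashed_dtype = [("id", "i8"), ("amount", "f8")]

issued = np.loadtxt("issued_checks.csv", dtype=issued_dtype,
                    delimiter=",", skiprows=1)
cashed = np.loadtxt("cashed_checks.csv", dtype=cashed_dtype,
                    delimiter=",", skiprows=1)

# Join the two structured arrays on their shared "id" key
combined = rfn.rec_join("id", issued, cashed, jointype="inner")
```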

00:35 At the end of that lesson, I sliced out a couple of columns. What happens if I do that with the amount column?

00:50 Well, you get an error because there’s no amount field in the combined data. When I printed the combined array out in the REPL above, it included the data type information, but you can also ask for that directly.

01:03 This is the structural information. Note that amount from issued became amount1, and amount from cashed became amount2. For our data, these two columns were identical, so it doesn’t matter which you use, but NumPy doesn’t want to assume that. If I want to include the amount information in a slice of the combined data, I simply use amount1 instead.
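Continuing the sketch above, the dtype confirms the renaming, and the renamed field is what you slice with:

```python
# combined["amount"] would raise a KeyError: the clashing fields
# were renamed during the join, as the dtype shows
print(combined.dtype.names)   # (..., 'amount1', 'amount2', ...)

# Slice using the renamed field instead
print(combined[["id", "amount1"]])
```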

01:36 Instead of using rec_join() to get a subset of data, you can also loop through the contents. Say, for example, you wanted to find all the uncashed checks.

01:58 This list comprehension finds all the IDs that are in issued but not in cashed.

02:05 The result is a list of the outstanding IDs. Note that the data types are NumPy 64-bit ints. You can cast them if you want Python data types instead.
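A sketch of that comprehension, reusing the hypothetical issued and cashed arrays from above, with int() casting each NumPy 64-bit value to a plain Python int:

```python
# IDs that were issued but never cashed, cast to Python ints
outstanding = [
    int(check_id)
    for check_id in issued["id"]
    if check_id not in cashed["id"]
]
print(outstanding)
```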

02:22 NumPy’s speed comes from doing as much as possible through its own mechanisms. You can always loop through data like this, but it will typically be slower than using a NumPy-specific method.
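For comparison, a vectorized version of the same lookup might use np.isin() to build a boolean mask and stay inside NumPy entirely:

```python
import numpy as np

# True where an issued ID does not appear among the cashed IDs
mask = ~np.isin(issued["id"], cashed["id"])
outstanding = issued["id"][mask]  # no Python-level loop
```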

02:40 In addition to having duplicate columns, you might have duplicate row data as well. You can find this kind of data using the find_duplicates() function. One complication is that this function expects a masked array.

02:49 I briefly mentioned these when explaining filtering rows earlier. A masked array is a NumPy array with extra metadata that indicates whether a row is participating in a calculation or not.

03:01 You can convert your regular array into a masked one by passing it to the asarray() function in NumPy’s ma module. If you don’t just want to find the duplicates but remove them, you can use unique() instead of find_duplicates().
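A quick sketch of that conversion, using a small made-up structured array:

```python
import numpy as np

data = np.array(
    [(1341, 100.0), (1344, 50.0)],
    dtype=[("id", "i8"), ("amount", "f8")],
)

# np.ma.asarray() wraps the array in a masked array: the values are
# unchanged, but each entry now carries include/exclude metadata
masked = np.ma.asarray(data)
print(masked.mask)  # nothing is masked out yet
```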

03:20 I’ve got another CSV file to play with, called issued_dupe.csv. I created it by copying issued_checks and duplicating one of the rows. Let’s head into the REPL and clean out our dupe data.

03:32 Those are our usual imports and a list of data type tuples for our CSV file. Now, I’ll load it

03:56 and there it is. Note that ID 1344 is in there twice. To find the dupes, I call find_duplicates(), passing in a masked version of our array.

04:11 The result is a new masked array showing the duplicate data. If instead I want to get rid of the duplicates, I can call unique(),

04:25 and this is the result with only a single row containing ID 1344.
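Putting the whole cleanup together, a sketch of that REPL session might look like this (again with a hypothetical, simplified column layout):

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical layout for issued_dupe.csv; the real file has more columns
dtype = [("id", "i8"), ("payee", "U64"), ("amount", "f8")]
checks = np.loadtxt("issued_dupe.csv", dtype=dtype,
                    delimiter=",", skiprows=1)

# find_duplicates() expects a masked array, so wrap the data first
dupes = rfn.find_duplicates(np.ma.asarray(checks))
print(dupes)  # the repeated rows, e.g. both copies of ID 1344

# To drop the duplicates instead, unique() keeps one copy of each row
deduped = np.unique(checks)
```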

04:33 That’s all for the second example. Next up, I’ll start the third example, charting hierarchical data.

avinash12papad on March 24, 2025

Hi Chris,

How can I do the reconciliation with a single amount column instead of amount1 and amount2? I guess we can do it by passing a parameter to the join? Please share your thoughts, as this is the most common problem in real-world data analysis. Thanks in advance.

Christopher Trudeau RP Team on March 24, 2025

Hi Avinash12papad,

I’m not quite following your question. In this case, the data between the cashed and issued checks datasets is redundant. In the case of a check, there’s no way for the payment to be different from what was on the check. So, amount1 and amount2 are dupes, and in this case you’d just delete (or ignore, as we did in the lesson) one of the columns.

If the scenario were different, where you had two combined columns with the same name but different values, you’d need the result to have two different names.

I’m not sure I answered what you were asking. If not, give me an example I can play with and maybe I can help some more.
