ENH: Make categorical repr nicer. #4368

Closed
wants to merge 5 commits into from

Conversation

jseabold
Contributor

Make looking at Categorical types a little nicer. Needs some tests still, but works fine locally so far.

@jseabold
Contributor Author

Added tests.

@jreback
Contributor

jreback commented Jul 26, 2013

@jseabold this is nice......can you hook up travis? (prob just need to flip the switch), then

git commit --amend -C HEAD to recommit last commit

@jseabold
Contributor Author

Should be going now. Ping me if it's not. Maybe needed a setup lag.

@jreback
Contributor

jreback commented Jul 26, 2013

does not appear to have taken.....

@jseabold
Contributor Author

Appears to be going now.

@jseabold
Contributor Author

Well, I don't see the banner here, but I see it running on travis. I dunno.

@cpcloud
Member

cpcloud commented Jul 26, 2013

travis usually takes a couple of minutes to actually start, that's how it rolls

@cpcloud
Member

cpcloud commented Jul 26, 2013

i see the banner

@jseabold
Contributor Author

Anyone know off the top of their head the 2.6 errors? I assume it's a unicode comparison issue...

@cpcloud
Member

cpcloud commented Jul 26, 2013

that kind of looks like a bug

python 2.6:

Categorical:
[a, b, b, a, a, c, c, c]
Levels (3): Index([a, b, c], dtype=object)

In [10]: factor.levels
Out[10]: Index([u'a', u'b', u'c'], dtype=object)

python 2.7

In [1]: factor = Categorical.from_array(['a','b','b','a','a','c','c','c'])

In [2]: factor
Out[2]:
Categorical:
[a, b, b, a, a, c, c, c]
Levels (3): Index(['a', 'b', 'c'], dtype=object)

In [3]: factor.levels
Out[3]: Index([u'a', u'b', u'c'], dtype=object)

@cpcloud
Member

cpcloud commented Jul 26, 2013

or possibly a difference in numpy since categorical repr calls np.array_repr...my python 2.6 has np 1.6.1 and py 2.7 up there has 1.7.1

def _repr_footer(self):
    levheader = 'Levels (%d): ' % len(self.levels)
    #TODO: should max_line_width respect a setting?
    levstring = np.array_repr(self.levels, max_line_width=60)
Contributor

yep...this should be com.pprint_thing(self.levels)

Contributor

or maybe self.levels.format()

Member

self.levels.format() does the correct thing

Contributor Author

I'm almost inclined to just fix the tests here and live with the numpy inconsistency unless there's another way around this. pprint_thing drops the object name and I like knowing levels is an Index. E.g.,

>>> np.array_repr(np.arange(3))
'array([0, 1, 2])'

>>> com.pprint_thing(np.arange(3))
u'[0, 1, 2]'
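
For anyone wanting to reproduce the contrast without pandas internals, here is a minimal sketch using NumPy alone. `np.array2string` is only a stand-in for `com.pprint_thing` here (an assumption for illustration, since `pprint_thing` is a pandas-internal helper); the point is just that one call keeps the constructor name and the other prints the values alone.

```python
import numpy as np

arr = np.arange(3)

# array_repr wraps the values in the constructor name.
with_name = np.array_repr(arr)

# array2string (stand-in for pprint_thing in this sketch) prints values only.
values_only = np.array2string(arr, separator=', ')

print(with_name)    # array([0, 1, 2])
print(values_only)  # [0, 1, 2]
```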

Member

what's wrong with self.levels.format()?

Contributor Author

It's a list (for 1.8.x)? So either array_repr fails with an error or I'm back to pprint_thing. Either way I lose the Index([...]) information.

Contributor Author

Or of course, leaving it, I lose the Index info. I could just do something like "Index(%s)" but I hoped to avoid this in case levels is ever not an Index for any reason, but I guess the tests will catch a change like that now.
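
A rough sketch of the "Index(%s)" idea that avoids hard-coding the class name, so the footer would still be accurate if levels were ever some other type. The `Index` class below is a hypothetical stand-in for `pandas.Index`, and `levels_footer` is an illustrative name, not the function in this PR.

```python
class Index(list):
    """Hypothetical stand-in for pandas.Index, just enough for this sketch."""


def levels_footer(levels):
    # Take the name from the runtime type instead of hard-coding "Index".
    name = type(levels).__name__
    body = ', '.join(str(level) for level in levels)
    return 'Levels (%d): %s([%s])' % (len(levels), name, body)


footer = levels_footer(Index(['a', 'b', 'c']))
print(footer)  # Levels (3): Index([a, b, c])
```

Because the values are formatted with `str`, the output does not depend on how a given NumPy version chooses to quote strings in object arrays.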

Member

oh duh sorry yes you're right.

Contributor Author

In a delicious bit of irony I'm the one that changed the unicode repr of numpy object arrays.

numpy/numpy#459

@cpcloud
Member

cpcloud commented Jul 26, 2013

yep np.array2string is different in those two versions of numpy

@jseabold
Contributor Author

Thanks. For the record, y'all beat my attempts to build Python 2.6, the whole numpy stack, and pandas in the background of my work.

@cpcloud
Member

cpcloud commented Jul 26, 2013

@jseabold building the stack is not fun. i get strange issues with matplotlib and tkinter so i can't do any plotting...i'm also accumulating a bit of enmity for the timedelta64 (non-)functionality of numpy 1.6 (it silently doesn't and then sometimes does convert units). i would suggest avoiding numpy 1.6 like the plague if you have the option. python 2.6 is probably ok

@jreback
Contributor

jreback commented Jul 26, 2013

hah....we have had our fair share of issues with py2.6 lately!

here's a fun bug in python (that was actually not fixed): http://bugs.python.org/issue2325

@jseabold
Contributor Author

Yes, I switched to openblas recently and my hacked together numpy/distutils is not working for scipy (numpy tests pass) for 2.6.

@jseabold
Contributor Author

Fixed the tests.

@cpcloud
Member

cpcloud commented Jul 26, 2013

gets a bit unwieldy for a large number of levels (i tried 100), but my opinion here doesn't matter that much, i never use Categorical

@jseabold
Contributor Author

I left a TODO in there for this. It's admittedly not handled, but I can't imagine having categorical variables with too many categories. The degrees of freedom loss is too prohibitive in estimation. Maybe it could be useful in some machine learning contexts, but I don't know how ML people use factors in R. I suspect they don't.
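
If the TODO is ever picked up, one possible approach is to show only the first and last few levels. This is a sketch under assumptions, not what the PR does: `abbreviate_levels` and its `max_items` parameter are hypothetical names.

```python
def abbreviate_levels(levels, max_items=6):
    """Format levels, eliding the middle when there are more than max_items."""
    max_items = max(max_items, 2)  # always keep at least one item on each side
    items = [str(level) for level in levels]
    if len(items) <= max_items:
        return '[%s]' % ', '.join(items)
    head = max_items // 2
    tail = max_items - head
    return '[%s, ..., %s]' % (', '.join(items[:head]), ', '.join(items[-tail:]))


short = abbreviate_levels(['a', 'b', 'c'])
long_ = abbreviate_levels(range(100))
print(short)  # [a, b, c]
print(long_)  # [0, 1, 2, ..., 97, 98, 99]
```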

@cpcloud
Member

cpcloud commented Jul 26, 2013

totally fine by me. like i said, i never found a need for this

@jreback
Contributor

jreback commented Aug 23, 2013

@jseabold this looks ready? maybe squash a bit

@cpcloud ?

@cpcloud
Member

cpcloud commented Aug 23, 2013

fine by me

@hayd
Contributor

hayd commented Aug 26, 2013

Too late to make the case for the repr being valid Python (so it can eval itself)? :s
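
To illustrate what an eval-able repr means, here is a toy sketch. `Factor` is a hypothetical class made up for the example, not `pandas.Categorical`; the multi-line repr proposed in this PR would not round-trip this way.

```python
class Factor(object):
    """Hypothetical toy class whose repr is a valid constructor call."""

    def __init__(self, values):
        self.values = list(values)

    def __repr__(self):
        # %r on the list yields a Python literal, so the whole string is valid code.
        return 'Factor(%r)' % (self.values,)

    def __eq__(self, other):
        return isinstance(other, Factor) and self.values == other.values


original = Factor(['a', 'b', 'b'])
round_tripped = eval(repr(original))  # rebuilds an equal object from the repr
print(repr(original))             # Factor(['a', 'b', 'b'])
print(round_tripped == original)  # True
```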

@jreback
Contributor

jreback commented Sep 24, 2013

@jseabold this almost got lost...

can you rebase and we can merge it in....

thxs

@jseabold
Contributor Author

Rebased. There was a merge conflict, which I didn't check too closely, so make sure tests pass.

@jseabold
Contributor Author

Any idea what's going on here?

@jseabold
Contributor Author

Looks like a circular import introduced in 85f191c?

@jreback
Contributor

jreback commented Sep 26, 2013

merged via 7c76086

thanks @jseabold

@jreback jreback closed this Sep 26, 2013