how to deal with unicode? #34

Open
jseabold opened this issue Feb 17, 2014 · 8 comments

Comments

@jseabold
Member

Dealing with just Python 2 for now, I understand that patsy expects strings (str). But the data containers might not follow that design, so what's the recommended way to handle this? Should we be messing with the data keys under the hood, or should patsy? The only way I can think of to handle this (other than statsmodels doing it under the hood) is for patsy to accept unicode along with an encoding, so that the formula and the data keys can both be encoded consistently. E.g., this fails:

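import numpy as np
import pandas as pd
import patsy

# Python 2: one column has a unicode key, the other a plain str key
data = pd.DataFrame({
    u'àèéòù' : np.random.randn(100),
    'x' : np.random.randn(100)})

# The formula is a utf-8 byte string, so Q('àèéòù') looks up a str key
# that does not match the unicode key above
formula = u"Q('àèéòù') ~ x".encode('utf-8')
dmatrices = patsy.dmatrices(formula, data=data)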

But if we also encode the data keys, it's fine. So should dmatrices and whatever other entry points also take an encoding? Am I missing something?

@njsmith
Member

njsmith commented Feb 17, 2014

The formula parser depends on the Python lexer, and the python lexer works
only on str objects. Does py2 even accept raw utf8 embedded between quote
marks in source code? A call to Q(...) is just vanilla python code and
subject to its usual constraints. At the very least for this to work you
should be writing Q(u'...')...

I'm not sure how much we can really do to fix this within py2. Sure you
don't just want to tell people who depend on unicode to upgrade to py3? :-)
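
For comparison, a minimal sketch of the py3 situation, where str is already unicode and the standard tokenize module lexes non-ASCII text directly. The helper name `lex` is made up for this illustration, and this is not patsy's actual internals, only the stdlib lexer it relies on.

```python
import io
import tokenize


def lex(source):
    """Return (token_name, text) pairs for a snippet of Python source."""
    readline = io.StringIO(source).readline
    # Drop the bookkeeping tokens that carry no source text
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in skip
    ]


# The Q(...) call from the formula is ordinary Python code to the lexer
tokens = lex("Q('àèéòù')")
```

On Python 3 the string literal comes back as a single STRING token with the accented characters intact, which is the "just works" behaviour being referred to.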

@jseabold
Member Author

I'm just trying to clean up a PR that has been sitting around for a while, and it tries to support unicode. It also dawned on us that we have no tests for unicode formula input, so I imagine it won't quite work for non-ascii characters.

I'll let you go through the permutations, but like I said, AFAICT, this is the only thing that "works." It'd be nice if patsy did it under the hood, so I don't have to decode things on the way back out to return unicode, but you know better than me.

data = pd.DataFrame({
    u'àèéòù'.encode('utf-8') : np.random.randn(100),
    'x' : np.random.randn(100)})
formula = u"Q('àèéòù') ~ x".encode('utf-8')
dmatrices = patsy.dmatrices(formula, data=data)

@njsmith
Member

njsmith commented Feb 18, 2014

There's really no reliable way for patsy to somehow reach inside the 'data'
object and replace unicode keys with str keys.

Two options that work now with the original DataFrame with unicode keys:

# Assumes source code is in utf-8
dmatrices("Q('àèéòù'.decode('utf-8')) ~ x", data=data)

# Works in general
dmatrices(u"Q(u'àèéòù') ~ x".encode("unicode-escape"), data=data)

Neither of these gives you nice term names, but that seems impossible
AFAICT.
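
To make the unicode-escape option concrete, here is a small sketch of what that encoding does to the formula text. It is written for Python 3 so that it runs today (the thread itself is about py2), and the variable names are only illustrative. The point is that the encoded formula is pure ASCII, and the escaped literal still evaluates to the original column name.

```python
import ast

formula = "Q(u'àèéòù') ~ x"

# unicode-escape turns every non-ASCII character into a \xNN escape,
# so the result is a plain ASCII byte string
escaped = formula.encode("unicode-escape")
# b"Q(u'\\xe0\\xe8\\xe9\\xf2\\xf9') ~ x"

# Pull out the string literal between "Q(" and ")" and evaluate it,
# which is roughly what happens when the Q(...) call is executed
escaped_text = escaped.decode("ascii")
literal = escaped_text[escaped_text.index("(") + 1:escaped_text.index(")")]
column_name = ast.literal_eval(literal)
```

The escaped form is also the formula text that gets carried around afterwards, which would explain why the term names do not come out looking nice.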

Some things that patsy could do:

  • If it receives a unicode string on py2, automatically call
    .encode("unicode-escape") on it, so that if you're very careful to write
    your formulas exactly in the form u"Q(u'àèéòù') ~ x", then they'll work.
  • If it tries to look up a variable name and finds that it doesn't exist,
    then try calling .decode("utf-8") on the variable name and try again.

I'm really reluctant to implement either of these because they're both
really horrible hacks that don't really solve the problem at all. OTOH
switching to py3 is a clean solution that just works...
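
For reference, the second hack (retry the lookup after decoding) could be sketched roughly as below. `lookup_with_utf8_fallback` is a hypothetical helper, not anything in patsy, and the sketch uses Python 3 bytes and str to stand in for py2's str and unicode.

```python
def lookup_with_utf8_fallback(name, data):
    """Look up name in data; if a byte-string name is missing,
    retry once with the name decoded as utf-8 (hypothetical helper)."""
    try:
        return data[name]
    except KeyError:
        if not isinstance(name, bytes):
            raise
        try:
            decoded = name.decode("utf-8")
        except UnicodeDecodeError:
            # The bytes were not utf-8 at all, so the fallback cannot help
            raise KeyError(name) from None
        return data[decoded]


# Data container with text keys, formula names arriving as byte strings
data = {"àèéòù": [1.0, 2.0], "x": [3.0, 4.0]}
found = lookup_with_utf8_fallback("àèéòù".encode("utf-8"), data)
```

The fallback only succeeds when the byte string really is utf-8; a name encoded in any other charset still ends in a KeyError, which is part of why this does not solve the problem in general.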


Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

@jseabold
Member Author

On Mon, Feb 17, 2014 at 7:10 PM, njsmith wrote:

> Some things that patsy could do:
>
>   • If it receives a unicode string on py2, automatically call
>     .encode("unicode-escape") on it, so that if you're very careful to write
>     your formulas exactly in the form u"Q(u'àèéòù') ~ x", then they'll work.
>   • If it tries to look up a variable name and finds that it doesn't exist,
>     then try calling .decode("utf-8") on the variable name and try again.

Yeah, we tried both and the latter was my "solution" given that it's easier
on users. Why is this not reliable?

> I'm really reluctant to implement either of these because they're both
> really horrible hacks that don't really solve the problem at all. OTOH
> switching to py3 is a clean solution that just works...

I agree that it's completely a least worst solution, and I understand if
you don't want to implement it. I'm just not sure I see the harm in trying
a fallback except from a code purity standpoint. We'll likely have to do
this more systematically, if we continue to have PRs from international
users, which means the hack goes up a level and likely has to touch more
code.

@njsmith
Member

njsmith commented Feb 25, 2014

If you just put unicode characters into a string literal in py2, what even
happens? Don't they end up encoded in the user's locale charset or
something? I just don't understand enough about this to know if or why or
when using the utf8 decode back would even work.
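
A small sketch of the concern, written for Python 3 so it can be run as-is. As far as I understand it, a py2 byte-string literal holds whatever bytes the source file's declared encoding (or the terminal's encoding, in an interactive session) produced, so the same visible text can be two different byte strings. The helper name below is made up for the illustration.

```python
def roundtrips_via_utf8(raw, expected):
    """Return True if decoding raw as utf-8 recovers the expected text."""
    try:
        return raw.decode("utf-8") == expected
    except UnicodeDecodeError:
        return False


text = "àèéòù"

# The same five characters as they would sit in a utf-8 source file
# versus a latin-1 source file
utf8_bytes = text.encode("utf-8")      # 10 bytes, two per character
latin1_bytes = text.encode("latin-1")  # 5 bytes, one per character

utf8_ok = roundtrips_via_utf8(utf8_bytes, text)
latin1_ok = roundtrips_via_utf8(latin1_bytes, text)
```

Decoding with utf-8 recovers the name only in the first case, so a .decode("utf-8") fallback depends on how the user's source or terminal happened to be encoded.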

@jankatins

Yikes, just had the same problem: I had a big list of column names, which were in unicode, and then constructed the formula like formula = "%s ~ %s" % (depended, " + ".join(independent)), which resulted in a unicode formula because one of the column names was unicode. This resulted in PatsyError: model is missing required outcome variables :-(

If there is no proper solution: please make this more obvious by e.g. warning in _do_highlevel_design if formula is a unicode string or one of the columns...
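
A rough sketch of that kind of formula construction, with each name wrapped in Q() so that unusual column names stay valid formula terms. `quote_term` and `build_formula` are hypothetical helpers, not patsy functions, and the sketch targets Python 3, where the resulting str can be handed to patsy as-is. On Python 2 the result would presumably still have to be turned into a byte string along the lines discussed above.

```python
def quote_term(name):
    """Wrap a column name in Q("...") for use inside a formula."""
    # Keep the sketch simple: refuse names that would need escaping
    if '"' in name or "\\" in name:
        raise ValueError("cannot quote column name: %r" % (name,))
    return 'Q("%s")' % name


def build_formula(dependent, independent):
    """Build 'Q("y") ~ Q("a") + Q("b")' from column names."""
    rhs = " + ".join(quote_term(name) for name in independent)
    return "%s ~ %s" % (quote_term(dependent), rhs)


formula = build_formula("y", ["x", "àèéòù"])
```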

@BrenBarn

Has there been any progress on this? Looking back through the comments here, I don't see an explanation of why patsy requires bytestrings in the first place.

@njsmith
Member

njsmith commented Mar 20, 2016

Patsy does at least provide more sensible/detailed error messages now: https://github.com/pydata/patsy/blob/master/patsy/highlevel.py#L49-L60

@BrenBarn: unfortunately, the bytestring requirement on py2 is baked into the language itself: patsy formulas contain Python code, and on Python 2, Python code is bytestrings (specifically, if you try passing unicode to the tokenize module, it errors out, and patsy relies on this module). Not much I can do about it :-(. There's a bit about this in the manual: https://patsy.readthedocs.org/en/latest/py2-versus-py3.html
