how to deal with unicode? #34

jseabold · 2014-02-17T16:32:38Z

Dealing with just Python 2 for now, I understand that patsy expects strings. But the data containers might not have this design. So what's the recommended way for handling this? Should we be messing with the data keys under the hood, or should patsy? The only way I can think to handle this (other than statsmodels doing it under the hood) is for patsy to accept unicode but also an encoding, so the formula and the data keys can both be encoded correctly. E.g., this fails:

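import numpy as np
import pandas as pd
import patsy

# Python 2: the column key is a unicode object...
data = pd.DataFrame({
    u'àèéòù': np.random.randn(100),
    'x': np.random.randn(100)})

# ...but the formula is a utf-8 encoded byte string, so the names don't match.
formula = u"Q('àèéòù') ~ x".encode('utf-8')
dmatrices = patsy.dmatrices(formula, data=data)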

But if we also encode the data keys, it's fine. So should dmatrices and whatever other entry points also take an encoding? Am I missing something?
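A minimal sketch of the mismatch, written for Python 3 so it runs on a current interpreter. Two assumptions are made purely for illustration: `bytes` stands in for the Python 2 `str` type, and a plain dict stands in for the DataFrame's columns.

```python
# Python 3 sketch: `bytes` plays the role of the Python 2 `str` type,
# and a plain dict stands in for the DataFrame's column index.
name = u'àèéòù'

unicode_keyed = {name: [1.0, 2.0], 'x': [3.0, 4.0]}
bytes_keyed = {name.encode('utf-8'): [1.0, 2.0], 'x': [3.0, 4.0]}

# This is the name as it would appear inside a utf-8 encoded formula.
name_in_formula = name.encode('utf-8')

# The encoded name never matches the unicode key...
found_with_unicode_keys = name_in_formula in unicode_keyed
# ...but it does match once the key is encoded the same way.
found_with_encoded_keys = name_in_formula in bytes_keyed

print(found_with_unicode_keys)
print(found_with_encoded_keys)
```

The lookup only succeeds when the key and the name in the formula went through the same encoding step.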


njsmith · 2014-02-17T19:47:10Z

The formula parser depends on the Python lexer, and the python lexer works
only on str objects. Does py2 even accept raw utf8 embedded between quote
marks in source code? A call to Q(...) is just vanilla python code and
subject to its usual constraints. At the very least for this to work you
should be writing Q(u'...')...

I'm not sure how much we can really do to fix this within py2. Sure you
don't just want to tell people who depend on unicode to upgrade to py3? :-)
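For comparison, a minimal sketch of what the lexer sees on Python 3, where the standard `tokenize` module accepts text strings directly. The helper name `formula_tokens` is made up for illustration and is not part of patsy; the snippet only shows the lexing step, not how patsy itself calls the tokenizer.

```python
import io
import tokenize


def formula_tokens(code):
    """Return (token type name, token text) pairs for a formula-like string."""
    # Drop the bookkeeping tokens that carry no formula content.
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}
    readline = io.StringIO(code).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in skip
    ]


tokens = formula_tokens("Q('àèéòù') ~ x")
for type_name, text in tokens:
    print(type_name, text)
```

On Python 3 the non-ASCII name comes through as an ordinary STRING token, with no encoding step in between.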

jseabold · 2014-02-17T20:25:31Z

I'm just trying to clean up a PR that has been sitting around for a while, and it tries to support unicode. It also dawned on us that we have no tests for unicode formula input, so I imagine it won't quite work for non-ascii characters.

I'll let you go through the permutations, but like I said, AFAICT, this is the only thing that "works." It'd be nice if patsy did it under the hood, so I don't have to decode things on the way back out to return unicode, but you know better than me.

data = pd.DataFrame({
    u'àèéòù'.encode('utf-8') : np.random.randn(100),
    'x' : np.random.randn(100)})
formula = u"Q('àèéòù') ~ x".encode('utf-8')
dmatrices = patsy.dmatrices(formula, data=data)

njsmith · 2014-02-18T00:10:47Z

There's really no reliable way for patsy to somehow reach inside the 'data'
object and replace unicode keys with str keys.

Two options that work now with the original DataFrame with unicode keys:

# Assumes source code is in utf-8
dmatrices("Q('àèéòù'.decode('utf-8')) ~ x", data=data)

# Works in general
dmatrices(u"Q(u'àèéòù') ~ x".encode("unicode-escape"), data=data)

Neither of these gives you nice term names, but that seems impossible
AFAICT.
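A small sketch of why the "unicode-escape" variant does not depend on the source encoding, run here on Python 3 (where the encoded result is a `bytes` object): the escaped formula is pure ASCII, and the Python parser rebuilds the original name from the escapes. Using `ast.literal_eval` on just the string literal is my own shortcut for illustration, not what patsy does internally.

```python
import ast

formula = u"Q(u'àèéòù') ~ x"

# Each non-ASCII character becomes a backslash escape, so the encoded
# formula contains only ASCII bytes.
escaped = formula.encode("unicode-escape")
is_ascii = all(byte < 128 for byte in escaped)
print(escaped)
print(is_ascii)

# Cut out the string literal u'...' and let Python parse it: the escapes
# are turned back into the original characters.
escaped_text = escaped.decode("ascii")
start = escaped_text.index("u'")
end = escaped_text.index(")")
recovered = ast.literal_eval(escaped_text[start:end])
print(recovered == u'àèéòù')
```

The escaped text is also what ends up in the term name, which is why the names are not nice.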

Some things that patsy could do:

1. If it receives a unicode string on py2, automatically call
   .encode("unicode-escape") on it, so that if you're very careful to write
   your formulas exactly in the form u"Q(u'àèéòù') ~ x", then they'll work.
2. If it tries to look up a variable name and finds that it doesn't exist,
   then try calling .decode("utf-8") on the variable name and try again.

I'm really reluctant to implement either of these because they're both
really horrible hacks that don't really solve the problem at all. OTOH
switching to py3 is a clean solution that just works...
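A rough sketch of what the lookup fallback could look like, written for Python 3 with `bytes` standing in for the Python 2 `str` type. The function `lookup_with_utf8_fallback` is hypothetical and not part of patsy, and a plain dict stands in for the data object.

```python
def lookup_with_utf8_fallback(data, name):
    """Hypothetical helper: try `name` as given, then its utf-8 decoding."""
    try:
        return data[name]
    except KeyError:
        # Only byte-string names can be decoded; anything else is simply missing.
        if not isinstance(name, bytes):
            raise
        try:
            decoded = name.decode("utf-8")
        except UnicodeDecodeError:
            # The bytes were not utf-8 (for example latin-1), so the
            # fallback cannot help; report the original name as missing.
            raise KeyError(name)
        return data[decoded]


data = {u'àèéòù': [1.0, 2.0], 'x': [3.0, 4.0]}
values = lookup_with_utf8_fallback(data, u'àèéòù'.encode('utf-8'))
print(values)
```

The fallback finds the column only when the byte string really is utf-8; a name encoded in some other charset still ends in a KeyError.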


Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

jseabold · 2014-02-18T00:23:10Z

Yeah, we tried both and the latter was my "solution" given that it's easier
on users. Why is this not reliable?

> I'm really reluctant to implement either of these because they're both
> really horrible hacks that don't really solve the problem at all. OTOH
> switching to py3 is a clean solution that just works...

I agree that it's completely a least worst solution, and I understand if
you don't want to implement it. I'm just not sure I see the harm in trying
a fallback except from a code purity standpoint. We'll likely have to do
this more systematically, if we continue to have PRs from international
users, which means the hack goes up a level and likely has to touch more
code.

njsmith · 2014-02-25T04:08:15Z

If you just put unicode characters into a string literal in py2, what even
happens? Don't they end up encoded in the user's locale charset or
something? I just don't understand enough about this to know if, why, or
when decoding back from utf8 would even work.
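A sketch of the concern, assuming the byte string ends up in whatever encoding the source file or terminal happens to use (that assumption is mine, not something established in this thread). Run on Python 3, it shows that the same text produces different bytes under different charsets, and that only the utf-8 bytes survive a utf-8 decode.

```python
text = u'àèéòù'

# The same five characters, encoded two different ways.
utf8_bytes = text.encode('utf-8')
latin1_bytes = text.encode('latin-1')
print(len(utf8_bytes), len(latin1_bytes))

# Decoding works when the bytes really are utf-8...
roundtrip_ok = utf8_bytes.decode('utf-8') == text

# ...and fails outright when they came from a different charset.
try:
    latin1_bytes.decode('utf-8')
    latin1_decodes_as_utf8 = True
except UnicodeDecodeError:
    latin1_decodes_as_utf8 = False

print(roundtrip_ok, latin1_decodes_as_utf8)
```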

jankatins · 2015-07-10T10:29:39Z

Yikes, just had the same problem: I had a big list of column names, which were in unicode, and then constructed the formula like formula = "%s ~ %s" % (depended, " + ".join(independent)), which resulted in a unicode formula as one of the column names was unicode. This resulted in PatsyError: model is missing required outcome variables :-(

If there is no proper solution: please make this more obvious by e.g. warning in _do_highlevel_design if formula is a unicode string or one of the columns...
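A sketch of the kind of early check meant here, done on the caller's side. The helper `build_formula` is hypothetical and not part of patsy; it only verifies that the joined formula is plain ASCII and raises a readable error otherwise.

```python
def build_formula(dependent, independent):
    """Hypothetical helper: join column names into a formula and flag
    non-ASCII names early instead of failing later with an unclear error."""
    formula = u"%s ~ %s" % (dependent, u" + ".join(independent))
    try:
        formula.encode("ascii")
    except UnicodeEncodeError:
        raise ValueError(
            "formula contains non-ASCII characters: %r" % (formula,)
        )
    return formula


formula = build_formula("y", ["x1", "x2"])
print(formula)
```

A name like u'àèéòù' makes the helper raise ValueError right where the formula is built, rather than surfacing later as a missing outcome variable.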

BrenBarn · 2016-03-20T04:32:37Z

Has there been any progress on this? Looking back through the comments here, I don't see an explanation of why patsy requires bytestrings in the first place.

njsmith · 2016-03-20T06:33:58Z

Patsy does at least provide more sensible/detailed error messages now: https://github.com/pydata/patsy/blob/master/patsy/highlevel.py#L49-L60

@BrenBarn: unfortunately, the bytestring requirement on py2 is baked into the language itself: patsy formulas contain python code, and on python 2, python code is bytestrings (specifically, if you try passing unicode to the tokenize module, it errors out, and patsy relies on this module). Not much I can do about it :-(. There's a bit about this in the manual: https://patsy.readthedocs.org/en/latest/py2-versus-py3.html

jseabold mentioned this issue Feb 17, 2014

ENH: Add facet plots statsmodels/statsmodels#1388

Closed

aenfield mentioned this issue Jan 16, 2015

Updated to avoid 'PatsyError: model is missing required outcome variables' AllenDowney/ThinkStats2#15

Closed
