(Data & Variable Management) Converting Many String Variables To Numeric Variables
(Data & Variable Management) Converting Many String Variables To Numeric Variables
| Stata FAQ
All of the examples on this page use the same dataset, so let’s start by examining the data. The example
dataset, hsbs, is a subset of the High School and Beyond data file with all of the variables as string
variables.
As you see from the describe command below, the variables are all defined as string variables (e.g.,
science is str2, a string of length 2).
describe
obs: 20
vars: 6 25 Nov 1999 19:09
size: 400 (99.6% of memory free)
-------------------------------------------------------------------------------
1. id str3 %3s
2. gender str1 %5s
3. race str5 %9s
4. schtyp str3 %5s
5. read str2 %5s
6. science str2 %5s
-------------------------------------------------------------------------------
Sorted by:
Now that we know the variables are string variables, we can use the list command to see what the
strings stored in these variables look like. Although the variable science is defined as str2, you can see
from the list below that it contains just numeric values. Even so, because the variable is defined as str2,
Stata cannot perform any kind of numerical analysis of the variable science. The same is true for the
variable read.
list
1. 70 m 1 pub 45 47
2. 121 f 1 pub 68 63
3. 86 m 1 pub 44 58
4. 141 m 1 pub 63 53
5. 172 m 1 pub 47 53
6. 113 m 1 pub 44 63
7. 50 m 3 pub 50 53
8. 11 m 2 pub 34 39
9. 84 m 1 pub 63 .
10. 48 m 3 pub 57 50
11. 75 m 1 pub 60 53
12. 60 m X pub 57 63
13. 95 m 1 pub 73 61
14. 104 m 1 pub 54 55
15. 38 m 3 pub 45 31
16. 115 m 1 pub 42 50
17. 76 m 1 pub 47 50
18. 195 m 1 pri 57
19. 114 m 1 pub 68 55
20. 85 m 1 pub 55 53
+---------------+
| read read_n |
|---------------|
1. | 45 45 |
2. | 68 68 |
3. | 44 44 |
4. | 63 63 |
5. | 47 47 |
+---------------+
A second method of achieving the same result is the command destring. Let’s try using the destring
command and see how it works. The first line of syntax loads the dataset again, so that we are starting
with a dataset containing only string variables again. The second line of syntax runs the destring
command.
As you can see from the describe command below, the destring command converted all of the variables
to numeric, except for race, gender and schtyp. Since these variables had characters in them, the
destring command left such variables alone. If there had been any numeric variables in the dataset, they
would remain unchanged.
describe
obs: 20
vars: 6 25 Nov 1999 19:09
size: 340 (98.5% of memory free)
-------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
id int %10.0g
gender str1 %5s
race str5 %9s
schtyp str3 %5s
read byte %10.0g
science byte %10.0g
-------------------------------------------------------------------------------
Sorted by:
Note: dataset has changed since last saved
Both of the techniques described above have attributes that in some situations are advantages and in
other situations may be disadvantages. The command destring can be run on an entire dataset in one
step, the method using the function “real” requires issuing a command for each variable to be converted
(although this can be done with a loop rather than typing out the syntax for each variable). One potential
advantage to using the function “real” (the first method) is that if the function “real” encounters a non-
numeric value, it sets the variable equal to missing in that case and moves on. To some extent destring
can be made to behave similarly, but not identically. In order to convert a string variable containing any
non-numeric value using destring one must list the characters that should be ignored (e.g. “,” or “.”).
Additionally, rather than setting values for those cases containing non-numeric values to missing (what
the function “real” does), destring removes the specified non-numeric characters. destring will extract
the specified strings and then convert, meaning that “a4” can be converted to “4”. destring‘s behavior is
very good if one has numeric values stored as strings that occasionally contain things like commas (e.g.
4,354), but there may be situations where this behavior is undesirable.
Notice in the describe command below that gender2 and schtyp2 are numeric variables and they have
labels associated with them (called gender2 and schtyp2).
describe
If we list out the data, it appears that gender2 and schtyp2 are identical to gender and schtyp, however
they are really numeric and what you are seeing are the value labels associated with the variables.
list
1. 70 m 1 pub 45 47 m pub
2. 121 f 1 pub 68 63 f pub
3. 86 m 1 pub 44 58 m pub
4. 141 m 1 pub 63 53 m pub
5. 172 m 1 pub 47 53 m pub
6. 113 m 1 pub 44 63 m pub
7. 50 m 3 pub 50 53 m pub
8. 11 m 2 pub 34 39 m pub
9. 84 m 1 pub 63 . m pub
10. 48 m 3 pub 57 50 m pub
11. 75 m 1 pub 60 53 m pub
12. 60 m X pub 57 63 m pub
13. 95 m 1 pub 73 61 m pub
14. 104 m 1 pub 54 55 m pub
15. 38 m 3 pub 45 31 m pub
16. 115 m 1 pub 42 50 m pub
17. 76 m 1 pub 47 50 m pub
18. 195 m 1 pri 57 m pri
19. 114 m 1 pub 68 55 m pub
20. 85 m 1 pub 55 53 m pub
Below we use the nolabel option and you see that gender2 and schtyp2 are really numeric.
list, nolabel
1. 70 m 1 pub 45 47 2 2
2. 121 f 1 pub 68 63 1 2
3. 86 m 1 pub 44 58 2 2
4. 141 m 1 pub 63 53 2 2
5. 172 m 1 pub 47 53 2 2
6. 113 m 1 pub 44 63 2 2
7. 50 m 3 pub 50 53 2 2
8. 11 m 2 pub 34 39 2 2
9. 84 m 1 pub 63 . 2 2
10. 48 m 3 pub 57 50 2 2
11. 75 m 1 pub 60 53 2 2
12. 60 m X pub 57 63 2 2
13. 95 m 1 pub 73 61 2 2
14. 104 m 1 pub 54 55 2 2
15. 38 m 3 pub 45 31 2 2
16. 115 m 1 pub 42 50 2 2
17. 76 m 1 pub 47 50 2 2
18. 195 m 1 pri 57 2 1
19. 114 m 1 pub 68 55 2 2
20. 85 m 1 pub 55 53 2 2
What about the variable race? It is still a character variable because our prior destring command saw the
X in the data and did not attempt to convert it because it had non-numeric values. Below we can convert
it to numeric by include the ignore(X) option that tells destring to convert the variable to numeric and
when it encounters X to convert that to a missing value. You can see the results in the list command
below.
list
As you have seen, we can use destring to convert string variables that contain numbers into numeric
variables, and it can handle situations where some values are stored as a character (like the X we saw
with race). If you have a character variable that is stored as all characters, you can use encode to
convert the character variable to numeric and it will create value labels that have the values that were
stored with the character variable.
Fore more information, see the help or reference manual about the destring
(http://www.stata.com/help.cgi?destring) and encode (http://www.stata.com/help.cgi?encode)
commands.