FELIX: Automatic and Interpretable Feature Engineering Using LLMs
1 Introduction
Fig. 1. In this example, FELIX learns features for detecting fake news articles: FELIX
takes unstructured input data together with ground truth labels and a context/task
description and converts it into structured numerical or categorical data. While doing
this, FELIX learns both appropriate features and their values. FELIX-learned features
can serve as input for further data analysis.
Data scientists typically dedicate a large share of their time to engineering and evaluating suitable features. According to a survey by Forbes, data scientists spend about 70% of their time preparing data and mining it for patterns [35]. These activities
often require in-depth domain knowledge and complex judgment, complicating
automation and thus making feature engineering costly.
The massive amount of knowledge embedded in pre-trained Large Language
Models (LLMs) [27] potentially makes them a key ingredient in future Auto-
mated Machine Learning (AutoML) systems [29]. As identified by [38], three
opportunities arise to use LLMs for AutoML: (1) simplifying human interac-
tion with AutoML systems (especially setup and interpretation), (2) suggesting
good initial configurations based on distilled knowledge, and (3) acting as com-
ponents of an AutoML system. However, the required prompt engineering can
once again become time-consuming [7,38]. In line with these opportunities, we
propose Feature Engineering with LLMs for Interpretability and Explainability
(FELIX), a method to automatically transform unstructured text data into a
structured tabular representation with human-interpretable features. Figure 1
shows an example of how to use FELIX to transform text data. Our contribu-
tion can be summarized as follows:
1. We propose FELIX, a method that uses LLMs to generate candidate features from labeled example pairs, filter out redundant features, and accordingly assign feature values to each input instance.
2. We evaluate the relevance of FELIX-learned features across five downstream
classification tasks, showing that FELIX outperforms traditional text feature
extraction methods such as TF-IDF and LLM embeddings as well as state-
of-the-art LLM zero-shot capabilities and a fine-tuned RoBERTa model [21].
3. We thoroughly analyze FELIX with regard to sample efficiency and generalization to unseen data, demonstrating that it is a highly sample-efficient feature engineering method that remains robust on out-of-domain data.
2 Related Work
We point to related works in three different areas: (1) automated feature engi-
neering without LLMs, (2) automated feature engineering with LLMs, and (3)
LLMs in data science and AutoML.
LLMs in Data Science and AutoML. Beyond using LLMs for feature extrac-
tion, several papers have explored opportunities to utilize LLMs for other steps
of the data science workflow and for AutoML. Tornede et al. [38] explore the
opportunities, challenges, and risks of using LLMs to improve AutoML and using
AutoML to improve LLMs. Chopra et al. [7] identify challenges of data scientists
conversing with LLM-powered chatbots. They find that data scientists spend a
significant amount of time (64%) constructing prompts, including gathering and
expressing the overarching domain context. Hassani et al. [12] present another
overview of the opportunities and challenges associated with using ChatGPT in
data science. They conclude that ChatGPT has the potential to greatly enhance
the productivity and accuracy of data science workflows. Hassan et al. [11]
developed a ChatGPT-based end-to-end system for conversational data science.
AutoML-GPT by Zhang et al. [41] automates several steps of the machine
learning pipeline using GPT. Ma et al. [22] introduce InsightPilot, an automated
data exploration system that combines an LLM with an insight engine integrat-
ing different insight discovery tools. Other works apply LLMs for data wrangling
[17,30] and data pre-processing—e.g. error detection, data imputation, schema
matching, and entity matching [40].
FELIX stands out from previous work with its end-to-end, generalized approach to automatic feature extraction from unstructured text data. Unlike exist-
ing methods, FELIX leverages LLM reasoning capabilities to directly analyze
unstructured data, generating meaningful and interpretable classification fea-
tures without intermediate code artifacts. Additionally, hierarchically grouping
and selecting candidate features introduces a filtering step that ensures stronger
independence between the features and the information they convey.
3 FELIX
We introduce FELIX as a new approach to learn interpretable data represen-
tations for textual data using LLMs. FELIX comes in two variants: one learns numerical features (with integer values ranging from 0 to 10) and the other learns categorical features (taking exactly one value out of a set of 2–5
possible values). FELIX aims to discover highly-contextualized, sometimes com-
plex or abstract features in the data (e.g., concepts such as logical cohesiveness)
that are simple to grasp and interpret for human users. These features are found
through a three-phase process (Fig. 2 shows a visual overview):
Fig. 2. This figure shows FELIX’s model architecture. First, features are generated
from pairs of examples from different classes. Second, the generated features are con-
solidated into a final feature set by clustering them based on their semantic embed-
dings. Third, each example is converted into a feature representation by using an LLM
to assign a value for each feature. Finally, classical machine learning models can be
trained on the obtained data representation.
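To make the first phase more concrete, the following minimal Python sketch generates candidate features from pairs of differently labeled examples, mirroring the description above. The FeatureSchema structure, the prompt wording, and the call_llm helper (a thin, hypothetical wrapper around a chat-completion API) are illustrative assumptions of this sketch rather than FELIX's exact prompts or data structures.

import json
import random
from dataclasses import dataclass

@dataclass
class FeatureSchema:
    name: str              # short feature name, e.g. "use_of_emotional_language"
    description: str       # what the feature measures
    kind: str              # "numerical" (0-10) or "categorical"
    allowed_values: list   # e.g. list(range(11)) or ["low", "medium", "high"]

def generate_candidate_features(texts, labels, task_description, n_pairs=10):
    """Ask the LLM to propose features from pairs of examples with different labels."""
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    class_a, class_b = list(by_label)  # binary classification assumed in this sketch
    candidates = []
    for _ in range(n_pairs):
        prompt = (
            f"Task: {task_description}\n\n"
            f"Example of class '{class_a}':\n{random.choice(by_label[class_a])}\n\n"
            f"Example of class '{class_b}':\n{random.choice(by_label[class_b])}\n\n"
            "Propose features that help distinguish the two classes. Return a JSON "
            "list of objects with keys name, description, kind, allowed_values."
        )
        # temperature 0.7 during feature generation, as described in Sect. 4;
        # call_llm is a hypothetical wrapper around the chat-completion API
        response = call_llm(prompt, temperature=0.7)
        candidates += [FeatureSchema(**f) for f in json.loads(response)]
    return candidates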
The feature generation phase may have resulted in a large number m_cand of feature candidates (in our experience, typically 5–10 features per example pair).
A large feature set would be computationally costly in the following steps and
might yield undesirable effects of high dimensionality [3]. FELIX builds on the
fact that the feature set was generated through multiple independent prompts
and likely contains several semantically redundant features. FELIX performs a
three-step consolidation of the feature set to select m_sel ≤ m_cand unique features:
In the value assignment phase, FELIX transforms examples from a textual form
into a structured feature representation by assigning concrete values to features.
As assigning appropriate values to such highly-contextualized features typically
requires reasoning, some form of general world understanding, and possibly
domain knowledge, FELIX uses a pre-trained LLM to assign values to each
feature. The LLM is provided with one example as well as the m_sel schemas of
the selected features and prompted to return exactly one value per feature, i.e.,
an integer for numerical features or one of the allowed categorical values.
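As a rough illustration, the value assignment step could look as follows, reusing the FeatureSchema objects and the hypothetical call_llm helper from the feature generation sketch; the prompt and the JSON response format are again assumptions of this sketch.

import json

def assign_feature_values(example_text, selected_features):
    """Prompt the LLM to assign exactly one value per selected feature to one example."""
    schema_lines = "\n".join(
        f"- {f.name} ({f.kind}, allowed values: {f.allowed_values}): {f.description}"
        for f in selected_features
    )
    prompt = (
        "Assign exactly one value to each feature for the text below. "
        "Return a JSON object mapping feature names to values.\n\n"
        f"Features:\n{schema_lines}\n\nText:\n{example_text}"
    )
    # temperature 0.0 during value assignment for consistent behavior (see Sect. 4)
    raw = json.loads(call_llm(prompt, temperature=0.0))
    values = {}
    for f in selected_features:
        v = raw.get(f.name)
        # non-compliant or absent values are treated as missing (None)
        values[f.name] = v if v in f.allowed_values else None
    return values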
In rare cases, the LLM may return non-compliant values, i.e., values not in
the set of allowed values. FELIX treats these occasions as missing values. For
the purpose of this paper, we handle missing values as follows (see the sketch after this list):
– For numerical features, we drop all feature columns that have more than 10% missing values. We fill the remaining missing values using k-Nearest-Neighbors (KNN) imputation [39] with scikit-learn’s KNNImputer implementation based on k = 5 nearest neighbors and distance weighting.
– For categorical features, we simply convert the data into a one-hot encoded
data representation where missing values result in 0 values on all correspond-
ing dummy columns.
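Both branches can be sketched with standard pandas and scikit-learn functionality. The 10% threshold, k = 5, and distance weighting follow the description above; the DataFrame-based interface is an assumption of this sketch, and in practice the imputer should be fit on the training split only.

import pandas as pd
from sklearn.impute import KNNImputer

def prepare_numerical(df: pd.DataFrame) -> pd.DataFrame:
    """Drop feature columns with >10% missing values, then KNN-impute the rest."""
    keep = [c for c in df.columns if df[c].isna().mean() <= 0.10]
    imputer = KNNImputer(n_neighbors=5, weights="distance")
    return pd.DataFrame(imputer.fit_transform(df[keep]), columns=keep, index=df.index)

def prepare_categorical(df: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode; missing values yield all-zero rows on the corresponding dummies."""
    return pd.get_dummies(df, dummy_na=False)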
In both cases, examples are converted into a purely numeric tabular format
that can be handled by Logistic Regression and Random Forest while still ensur-
ing traceability back to the LLM-generated feature schemas for explainability.
As with the feature selection phase, we provide another validity analysis of using
an LLM for assigning feature values in the supplementary material.
4 Experiments
To assess the quality of the features and data transformations generated by
FELIX, we perform several experiments where FELIX transformations serve as
input to a classifier. We measure end-to-end classification performance using F1
scores. As classifiers, we use the scikit-learn [33] implementations of Logistic
Regression (LR) [8] and Random Forest (RF) [5] with default hyperparam-
eters. Both classifiers are commonly used in research and industry and com-
plement FELIX with suitable explanation insights through their learned coeffi-
cients and feature importance values, respectively. As LLMs, we use OpenAI’s
gpt-3.5-turbo-0613 [31], falling back to gpt-3.5-turbo-16k-0613 for longer-
context prompts, and gpt-4-1106-preview [32]. We use OpenAI’s default tem-
perature of 0.7 during the feature generation phase to obtain sufficiently creative
features and a temperature of 0.0 during the value assignment phase for con-
sistent behavior. Feature embeddings are created using chunk_size=1000 with OpenAI’s text-embedding-ada-002-v2, which works well with both cosine and Euclidean distance metrics. We choose the latter for HDBSCAN clustering and
selecting the feature closest to the cluster centroid from each cluster.
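A minimal sketch of this consolidation step is given below. It assumes the FeatureSchema objects from the earlier sketch and a hypothetical embed function wrapping the embeddings API; the HDBSCAN parameters and the handling of noise points (kept as individual features here) are assumptions, as the text above only fixes the Euclidean metric and the centroid-based selection.

import numpy as np
import hdbscan  # HDBSCAN library by McInnes et al. [26]

def consolidate_features(candidates, embed):
    """Cluster semantically similar candidates and keep one feature per cluster."""
    X = np.array([embed(f"{f.name}: {f.description}") for f in candidates])
    labels = hdbscan.HDBSCAN(metric="euclidean").fit_predict(X)
    selected = []
    for c in sorted(set(labels)):
        idx = np.where(labels == c)[0]
        if c == -1:
            # noise points form no cluster; keeping them all is an assumption
            selected += [candidates[i] for i in idx]
            continue
        centroid = X[idx].mean(axis=0)
        closest = idx[np.argmin(np.linalg.norm(X[idx] - centroid, axis=1))]
        selected.append(candidates[closest])
    return selected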
In the following, Sect. 4.1 describes the datasets, Sect. 4.2 describes the base-
lines, and Sects. 4.3, 4.4, and 4.5 describe the experiments evaluating the rel-
evance of FELIX features, sample efficiency, and generalization capabilities.
Lastly, Sect. 4.6 discusses the interpretability of FELIX-generated features.
4.1 Datasets
We select five binary text classification datasets from different domains and sam-
ple a balanced subset from each to evaluate FELIX under various circumstances:
as either hate speech (i.e., disparaging a target group of people based on some
characteristic such as race or gender) or no hate speech. We only use the text
column as classification input.
3. Amazon/Yelp: The Amazon Review Polarity dataset4 [24] contains product
reviews from Amazon. Each review is labeled as negative (1–2-star rating) or
positive (4–5-star rating). The Yelp Polarity dataset5 [42] contains negative
(1–2-star rating) and positive (3–4-star rating) reviews from Yelp. We use it
together with the Amazon dataset for our generalization experiment.
4. Fake news: The Fake News TFG dataset6 contains news articles that are
labeled as false (i.e., fake news) or true (i.e., real news).
5. Papers: The IDMGSP dataset7 [1] contains the titles, abstracts, introduc-
tions, and conclusions of scientific papers, which are either human-written or
machine-generated using SCIgen, GPT-2, GPT-3, ChatGPT, and Galactica
models. Our sample includes an equal number of examples from each machine
generator.
4 https://huggingface.co/datasets/amazon_polarity
5 https://huggingface.co/datasets/yelp_polarity
6 https://huggingface.co/datasets/GonzaloA/fake_news
7 https://huggingface.co/datasets/tum-nlp/IDMGSP
4.2 Baselines
Setup. From each of the five datasets, we sample a training set of size n_train = 50 and a test set of size n_test = 100. We fit/fine-tune FELIX, TF-IDF, and RoBERTa to the training dataset. We then transform all examples in the training and test datasets into feature representations using the fitted FELIX, TF-IDF, and text embedding models, respectively. We fit both LR and RF classifiers to the transformed training datasets and measure classification per-
formance (F1 scores) on the corresponding transformed test datasets. For the
zero-shot LLM classification baselines, we use the training set only to extract
the set of unique classes and then classify the test dataset zero-shot into these
classes. We repeat all FELIX, zero-shot LLM, and RoBERTa runs three times
and report the average performance.
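A minimal sketch of the downstream classification step, using the scikit-learn defaults mentioned above (the function signature itself is an assumption of this sketch):

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def evaluate_representation(X_train, y_train, X_test, y_test):
    """Fit LR and RF with default hyperparameters and return test-set F1 scores."""
    scores = {}
    for name, clf in [("LR", LogisticRegression()), ("RF", RandomForestClassifier())]:
        clf.fit(X_train, y_train)
        scores[name] = f1_score(y_test, clf.predict(X_test))
    return scores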
Results. Table 1 shows the results of the feature relevance experiment. Accord-
ing to the cross-dataset averages, all FELIX-based classifiers achieve a better
classification performance than TF-IDF- and embedding-based classifiers as well
as the fine-tuned RoBERTa classifier. With the exception of FELIX GPT-3.5
Numerical, all FELIX-based classifiers also beat their zero-shot baseline on aver-
age, suggesting that FELIX can further improve classification performance inde-
pendent of the underlying LLM. The fine-tuned RoBERTa model (R-1) is not
able to learn accurate classifications from the limited training data. We found that it requires up to 100 epochs of training (R-100) to catch up with some of the other
models. For FELIX, we observe some variance across the datasets. Categorical
variants of FELIX seem to achieve slightly more consistent performance results
while numerical FELIX variants show higher performance variability.
Looking at individual datasets, the tweet sentiment dataset is the only
dataset where all FELIX-based classifiers fall noticeably behind their zero-shot
LLM counterparts but still clearly beat the TF-IDF, text embeddings, and fine-
tuned RoBERTa baselines. We hypothesize that this is because sentiment classi-
fication is a relatively narrow-faceted NLP task where FELIX does not extract
any other highly predictive features besides multiple variations of sentiment.
On the hate speech and fake news datasets, FELIX variants using GPT-3.5 or
numerical features do not consistently outperform their zero-shot baselines. How-
ever, FELIX GPT-4 Categorical demonstrates a consistent performance advan-
tage over zero-shot GPT-4 on both datasets.
The Amazon dataset seems to be the least challenging among the datasets,
with most classifiers achieving 90+% F1 scores. The zero-shot LLM classi-
fiers already achieve near-perfect performance (99.0% and 97.7%), leaving some FELIX variants to perform slightly better or identically and others to perform slightly worse; however, all reach F1 scores greater than 94%.
The scientific papers dataset is the most difficult classification task, according
to the measured zero-shot LLM classification performance. Zero-shot GPT-3.5
even achieves worse classification performance than the TF-IDF and text embed-
dings baselines. For both GPT-3.5 and GPT-4, FELIX adds a significant per-
formance improvement over a zero-shot LLM classification. This indicates that
classifiers greatly profit from the features extracted by FELIX on this dataset.
FELIX GPT-4 variants achieve the best classification performance by far.
We conclude that FELIX learns relevant features with predictive value on
all tested datasets. FELIX GPT-4 variants seem to show a consistently high
ability to learn such features. When using FELIX with the less powerful GPT-
3.5, performance results seem more consistent when learning categorical features,
whereas results are more variable when learning numerical features.
4.4 Sample Efficiency
The results from experiment A suggest that FELIX can more efficiently extract
features from limited training data than simpler methods such as TF-IDF or fine-
tuned text classifiers such as RoBERTa. To further investigate this hypothesis,
we run a sample efficiency analysis to evaluate classifier performance for different
training set sizes.
Setup. The experimental setup is the same as for experiment A. The only difference is the size of the training set, which we vary to be n_train ∈ {10, 20, 50, 100}. Classification performance is still evaluated against a test set of size n_test = 100 in all cases. To limit our API spending, we perform each run only once (except the n_train = 50 runs already performed threefold for experiment A) and average performance across all five datasets. We do not evaluate FELIX GPT-4 variants with n_train = 100 on all five datasets as they tend to become fairly expensive at this sample size.
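A compact sketch of this protocol, assuming a hypothetical transform function that fits a feature extractor (e.g., FELIX or TF-IDF) on the training texts and transforms both splits, and reusing evaluate_representation from the sketch above:

def sample_efficiency_curve(transform, texts, labels, sizes=(10, 20, 50, 100), n_test=100):
    """Test-set F1 for growing training set sizes; a single run per size."""
    test_texts, y_test = texts[:n_test], labels[:n_test]
    results = {}
    for n_train in sizes:
        train_texts = texts[n_test:n_test + n_train]
        y_train = labels[n_test:n_test + n_train]
        # transform fits the feature extractor on the training texts and
        # returns feature matrices for both splits (hypothetical signature)
        X_train, X_test = transform(train_texts, y_train, test_texts)
        results[n_train] = evaluate_representation(X_train, y_train, X_test, y_test)
    return results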
Results. Figure 3 shows the classification performance for different training set
sizes. As expected, classification performance generally improves the more train-
ing examples are available. FELIX outperforms TF-IDF and text embeddings for essentially all tested sample sizes, with the advantage being largest for lower sample sizes of 10 to 50 training examples. For 100 training examples, the performance
of embedding-based classifiers is almost on a par with the performance of FELIX
GPT-3.5 whereas TF-IDF is still significantly behind. FELIX seems to require at
least 20–50 training examples to perform competitively with its zero-shot LLM
counterpart, suggesting that the most efficient range for learning features with
FELIX lies between 20 and 100 training examples. Below 20 examples, zero-shot
LLM classification would provide better results, and above 100 examples, the
other baselines catch up with FELIX GPT-3.5’s performance. The fine-tuned
RoBERTa classifier (R-1) is not able to efficiently leverage the available train-
ing examples unless trained for a large number of epochs (R-100). Even then,
it requires more than 50 labeled training examples to reach FELIX GPT-3.5’s
level of performance while still falling far behind FELIX GPT-4 and zero-shot
GPT-4.
Fig. 3. Results of experiment B (sample efficiency). Scores show the average across all
five datasets. FELIX GPT-4 has not been evaluated on 100 training examples due to
limited budget. Zero-shot LLM performance (Raw Text GPT-3.5/4) is constant with regard to the training sample size as no model is trained.
4.5 Generalization
In this experiment, we evaluate classification performance when models are trained on data from one domain and then tasked to predict classes of data from a slightly different domain. We hypothesize that, by using a pre-trained LLM for feature generation and value assignment, FELIX should
adapt to out-of-domain situations better than the more direct and mechanical
feature engineering methods such as TF-IDF or text embeddings.
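A minimal sketch of the corresponding evaluation, again assuming the hypothetical transform helper and reusing evaluate_representation from the earlier sketches: the feature extractor is fit on one domain (e.g., Amazon) and evaluated on test sets from the same and from a different domain (e.g., Yelp).

def cross_domain_f1(transform, train_texts, y_train, test_sets):
    """Evaluate one trained representation on several (possibly out-of-domain) test sets.

    test_sets maps a domain name (e.g. "amazon", "yelp") to a (texts, labels) pair.
    """
    scores = {}
    for domain, (texts, y_test) in test_sets.items():
        # the extractor is fit on the training domain only and applied to each test set
        X_train, X_test = transform(train_texts, y_train, texts)
        scores[domain] = evaluate_representation(X_train, y_train, X_test, y_test)
    return scores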
Results. We compare classification performance in in-domain (train and test sets from the same dataset) and out-of-domain (train and test sets from different datasets) settings. For text embeddings and TF-IDF, we
see significant performance drops of up to 21.7% when switching from an in-
domain to an out-of-domain setting. In contrast, more than half of the FELIX
variants even improve their classification performance in out-of-domain settings
and never lose more than 4.1% of their in-domain classification performance
when evaluated on out-of-domain test sets. This suggests that FELIX features
tend to be fairly generalizable and robust, even in out-of-domain settings.
4.6 Interpretability
Fig. 4. Exemplary analyses of features learned by FELIX GPT-4 on the hate speech
and fake news datasets. Figures 4a and 4b show the distribution of numerical feature
values for the top 4 features according to Random Forest feature importance. Fig-
ures 4c and 4d show the Logistic Regression coefficients learned for some hand-selected
dummies (i.e., feature-value pairs).
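Such analyses follow directly from the fitted classifiers. The sketch below assumes a fitted LogisticRegression lr and RandomForestClassifier rf that were trained on the same FELIX feature (or dummy) columns, with feature_names holding the corresponding column names.

import pandas as pd

def explain_features(lr, rf, feature_names):
    """Pair FELIX feature (or dummy) names with LR coefficients and RF importances."""
    return pd.DataFrame({
        "feature": feature_names,
        "lr_coefficient": lr.coef_[0],            # binary task: one coefficient per column
        "rf_importance": rf.feature_importances_,
    }).sort_values("rf_importance", ascending=False)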
5 Conclusion
Limitations and Future Work. Future work with a greater API budget
may evaluate FELIX on more datasets, larger samples, and with other LLMs
to uncover additional strengths and weaknesses of the methodology and hint at
potential aspects of FELIX to be further improved. This could include combining
8 https://github.com/simonmalberg/felix
Ethical Considerations. While FELIX may support users throughout the feature
discovery process, it is important to note that FELIX relies on a pre-trained LLM.
This may lead to biases, gaps, and other inaccuracies in model outputs that are not
immediately obvious. Users should always sanity-check results and take specific care
when using these results in high-stakes applications. FELIX should never be used for
making automated decisions about humans.
References
1. Abdalla, M.H.I., Malberg, S., Dementieva, D., Mosca, E., Groh, G.: A benchmark
dataset to distinguish human-written and machine-generated scientific papers.
Information 14(10), 522 (2023)
2. Barbieri, F., Espinosa Anke, L., Camacho-Collados, J.: XLM-T: multilingual lan-
guage models in Twitter for sentiment analysis and beyond. In: Proceedings of
the Thirteenth Language Resources and Evaluation Conference, pp. 258–266.
European Language Resources Association, Marseille, France (2022). https://
aclanthology.org/2022.lrec-1.27
3. Bellman, R.: Dynamic programming. Science 153(3731), 34–37 (1966)
4. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing
Text with the Natural Language Toolkit. O’Reilly Media, Inc., Sebastopol (2009)
5. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
6. Chen, X., et al.: Neural feature search: a neural architecture for automated feature
engineering. In: 2019 IEEE International Conference on Data Mining (ICDM), pp.
71–80. IEEE (2019)
7. Chopra, B., et al.: Conversational challenges in AI-powered data science: obstacles,
needs, and design opportunities (2023)
8. Cox, D.R.: The regression analysis of binary sequences. J. Roy. Stat. Soc.: Ser. B
(Methodol.) 20(2), 215–232 (1958)
9. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep
bidirectional transformers for language understanding (2019)
10. de Gibert, O., Perez, N., García-Pablos, A., Cuadros, M.: Hate speech dataset
from a white supremacy forum. In: Proceedings of the 2nd Workshop on Abu-
sive Language Online (ALW2), pp. 11–20. Association for Computational Linguis-
tics, Brussels, Belgium (2018). https://doi.org/10.18653/v1/W18-5102, https://
aclanthology.org/W18-5102
11. Hassan, M.M., Knipper, A., Santu, S.K.K.: ChatGPT as your personal data sci-
entist. arXiv preprint arXiv:2305.13657 (2023)
12. Hassani, H., Silva, E.S.: The role of ChatGPT in data science: how AI-assisted
conversational interfaces are revolutionizing the field. Big Data Cogn. Comput.
7(2), 62 (2023)
13. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning:
Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer,
Heidelberg (2009). https://books.google.de/books?id=eBSgoAEACAAJ
14. Hegselmann, S., Buendia, A., Lang, H., Agrawal, M., Jiang, X., Sontag, D.:
TabLLM: few-shot classification of tabular data with large language models. In:
International Conference on Artificial Intelligence and Statistics, pp. 5549–5581.
PMLR (2023)
15. Hollmann, N., Müller, S., Hutter, F.: Large language models for automated data
science: Introducing CAAFE for context-aware automated feature engineering. In:
Advances in Neural Information Processing Systems, vol. 36 (2024)
16. Horn, F., Pack, R., Rieger, M.: The autofeat Python library for automated feature
engineering and selection. In: Cellier, P., Driessens, K. (eds.) ECML PKDD 2019,
Part I. CCIS, vol. 1167, pp. 111–120. Springer, Cham (2020). https://doi.org/10.
1007/978-3-030-43823-4_10
17. Jaimovitch-López, G., Ferri, C., Hernández-Orallo, J., Martínez-Plumed, F.,
Ramírez-Quintana, M.J.: Can language models automate data wrangling? Mach.
Learn. 112(6), 2053–2082 (2023)
18. Khurana, U., Samulowitz, H., Turaga, D.: Feature engineering for predictive mod-
eling using reinforcement learning. In: Proceedings of the AAAI Conference on
Artificial Intelligence, vol. 32 (2018)
19. Khurana, U., Turaga, D., Samulowitz, H., Parthasrathy, S.: Cognito: automated
feature engineering for supervised learning. In: 2016 IEEE 16th International Con-
ference on Data Mining Workshops (ICDMW), pp. 1304–1307. IEEE (2016)
20. Lin, Y., Ding, B., Jagadish, H., Zhou, J.: SmartFeat: efficient feature con-
struction through feature-level foundation model interactions. arXiv preprint
arXiv:2309.07856 (2023)
21. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach (2019)
22. Ma, P., Ding, R., Wang, S., Han, S., Zhang, D.: InsightPilot: an LLM-empowered
automated data exploration system. In: Proceedings of the 2023 Conference on
Empirical Methods in Natural Language Processing: System Demonstrations, pp.
346–352 (2023)
23. MacQueen, J., et al.: Some methods for classification and analysis of multivariate
observations. In: Proceedings of the fifth Berkeley Symposium on Mathematical
Statistics and Probability, vol. 1, pp. 281–297. Oakland, CA, USA (1967)
24. McAuley, J., Leskovec, J.: Hidden factors and hidden topics: understanding rat-
ing dimensions with review text. In: Proceedings of the 7th ACM Conference on
Recommender Systems, pp. 165–172 (2013)
25. McInerney, D.J., Young, G., van de Meent, J.W., Wallace, B.C.: Chill: zero-shot
custom interpretable feature extraction from clinical notes with large language
models. arXiv preprint arXiv:2302.12343 (2023)
26. McInnes, L., Healy, J., Astels, S.: HDBScan: hierarchical density based clustering.
J. Open Source Softw. 2(11), 205 (2017). https://doi.org/10.21105/joss.00205
27. Min, B., et al.: Recent advances in natural language processing via large pre-trained
language models: a survey. ACM Comput. Surv. 56(2) (2023). https://doi.org/10.
1145/3605943
28. Mosca, E., Abdalla, M.H.I., Basso, P., Musumeci, M., Groh, G.: Distinguishing
fact from fiction: a benchmark dataset for identifying machine-generated scientific
papers in the LLM era. In: Ovalle, A., et al. (eds.) Proceedings of the 3rd Work-
shop on Trustworthy Natural Language Processing (TrustNLP 2023), pp. 190–207.
Association for Computational Linguistics, Toronto, Canada (2023). https://doi.
org/10.18653/v1/2023.trustnlp-1.17, https://aclanthology.org/2023.trustnlp-1.17
29. Mumuni, A., Mumuni, F.: Automated data processing and feature engineering for
deep learning and big data applications: a survey. J. Inf. Intell. (2024)
30. Narayan, A., Chami, I., Orr, L., Arora, S., Ré, C.: Can foundation models wrangle
your data? (2022)
31. OpenAI: ChatGPT (2022). https://openai.com/blog/chat-ai/. Accessed 26 Feb
2023
32. OpenAI: GPT-4 technical report (2023)
33. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn.
Res. 12, 2825–2830 (2011)
34. Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
35. Press, G.: Cleaning big data: most time-consuming, least enjoyable data science
task, survey says (2016). https://www.forbes.com/sites/gilpress/2016/03/23/
data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-
says/
36. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval.
Inf. Process. Manag. 24(5), 513–523 (1988)
37. Sun, X., et al.: Text classification via large language models (2023)
38. Tornede, A., et al.: AutoML in the age of large language models: current challenges,
future opportunities and risks (2023)
39. Troyanskaya, O., et al.: Missing value estimation methods for DNA microarrays.
Bioinformatics 17(6), 520–525 (2001). https://doi.org/10.1093/bioinformatics/17.
6.520
40. Zhang, H., Dong, Y., Xiao, C., Oyamada, M.: Large language models as data
preprocessors. arXiv preprint arXiv:2308.16361 (2023)
41. Zhang, S., Gong, C., Wu, L., Liu, X., Zhou, M.: AutoML-GPT: automatic machine
learning with GPT. arXiv preprint arXiv:2305.02499 (2023)
42. Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. arXiv:1509.01626 [cs] (2015)