1. Introduction
The global trends of big data and artificial intelligence (AI) have introduced various types of data that cannot be handled by existing analytical technologies; thus, attention to areas not centered on AI technologies has increased. Rather than relying on a single data source, methods have been proposed to solve such problems and obtain new value from data through the distribution, exchange, and linking of data across various fields. With improvements in the catalogs and portal sites available in data markets, opportunities for users to obtain data from data holders and providers have increased. Therefore, a data market has been developed in which various stakeholders exchange data and information about the data across different fields [1,2]. In particular, with the development of the Internet of Things and cloud computing and the prevalence of mobile devices, digital markets for data have emerged [3,4]. Various stakeholders have discussed the potential benefits of reusing and analyzing massive amounts of data [5,6]. However, such reuse typically affects data privacy and security [7,8,9,10]. Moreover, it is often difficult to obtain and utilize data that are specifically related to our interests. Even if relevant information is publicly available on the web, users may find it challenging to specify areas of interest owing to information overload. From the perspective of limited human cognition, it has been observed that excessive information makes it difficult for human decision makers to derive the necessary information and discover useful knowledge [11]. Therefore, a support system is required to obtain information related to user interests.
Another related issue is the difficulty users encounter in obtaining data that accurately correspond to their intentions, because users might not express their objects of interest using the exact terms (names of variables, outlines, etc.) used in the relevant data [12]. For example, a user wishing to obtain new data for a business focusing on foreign tourists may collect street interviews and questionnaires completed by foreigners. However, if specific information of interest, such as the nationality of the respondents, is missing from the acquired data, the procedure may have to be repeated. Owing to the costs involved in obtaining data, such reworking should be avoided after data acquisition. In product management, reworking in the latter stages of product design has been recognized as a serious risk [13]. To avoid this risk in data design and management, it is important to specify the exact data that should be obtained to implement effective decision making.
Hayashi and Ohsawa [14,15] proposed a method called Variable Quest (VQ) for inferring variables from a data outline when information on the variables is missing or unknown. Here, variables refer to the attributes of the data; for example, "latitude" and "longitude" are elements of the variables. The proposed method infers variables that may be present in the data by inputting a summary of the data presented in natural language. Information on the variables and the data outline is extracted from a dataset of data jackets. A data jacket (DJ) is a technique for sharing information on data without exposing the data itself by describing a summary of the data in natural language [16]. The idea of a DJ is to share a "summary of the data" as metadata while reducing the data management cost and privacy risk. Information regarding the variables is included in a DJ through variable labels (VLs); a VL is the name and/or meaning of a variable in the data. The variables and values in the data are summarized as VLs in the DJ. For example, the dataset on the "UNHCR Refugee Population Statistics" obtained from the Humanitarian Data Exchange (https://data.humdata.org/) includes the variables "country," "origin," "population type," "year," and "population," each of which contains values (Figure 1). Even if the data are not publicly available, we can learn and evaluate whether the data would be useful for our purposes using the data summary described in a DJ. Some data include private information, that is, values and variables such as "name," "age," or "address." The description framework of a DJ allows stakeholders to learn a summary of the data from the attributes mentioned in the DJ, thus reducing the risks inherent to data management and privacy. The DJ has been introduced to support cross-disciplinary data exchange and collaboration in a creative workshop method, the Innovators Marketplace on Data Jackets (IMDJ). For further details regarding the methodology and results of the IMDJ, see references [1,12,16].
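For illustration, a DJ can be represented as a simple structured record, as in the following minimal sketch; the field names (title, outline, variable_labels, types, formats) are illustrative assumptions and do not reflect an official DJ schema.

```python
# A minimal, hypothetical representation of a data jacket (DJ):
# the data themselves stay private; only the summary is shared.
# Field names here are illustrative, not the official DJ schema.
unhcr_dj = {
    "title": "UNHCR Refugee Population Statistics",
    "outline": ("Yearly refugee population statistics by country of residence "
                "and country of origin, published on the Humanitarian Data Exchange."),
    "variable_labels": ["country", "origin", "population type", "year", "population"],
    "types": ["number", "text"],     # data types of the values
    "formats": ["CSV"],              # file format of the dataset
}

# Stakeholders can judge the usefulness of the data from this summary
# without ever accessing the raw values.
print(unhcr_dj["variable_labels"])
```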
VQ, however, focuses only on the variables in the data. The data possess other important attributes, namely, the types and formats. In the dataset on the "UNHCR Refugee Population Statistics," the data format is "CSV," whereas the data types are "number" and "text," which are important attributes for stakeholders when considering data combinations. In this study, we extend the matrix-based method for the inference of variables to a method for inferring data attributes. The motivation and objective of our study are to infer the related attributes of the data (types, formats, and variables) from data outlines presented as free text. We used a dataset of DJs as the training data. The significance of our approach and the contributions of our study can be summarized as follows. First, the proposed method is the first approach for inferring data attributes by focusing on the similarities of datasets using a data outline. Second, our method can infer the related data attributes from free-text queries; in particular, it can be used not only for knowledge discovery from data but also by decision makers who wish to acquire new data. Third, our method can support the search for a useful set of variables, types, and formats used as data for decision making. Note that the definition of data in this study is a set of described abstracted events in the world. That is, as shown in Figure 1, the data consist of sets of variables with values. In contrast, the DJs in this study are summaries of the data consisting of attributes (types, formats, and variables) with elements.
The remainder of this paper is organized as follows: In Section 2, we briefly review the previous matrix-based method for inferring VLs and subsequently formulate the proposed method and its inference procedures. In Section 3, we demonstrate the effectiveness of the proposed method by comparing its performance with that of other methods used for this purpose. Furthermore, we analyze the characteristics of the DJs and their attributes. In Section 4, we discuss the results obtained from the experiment. Finally, we provide some concluding remarks and discuss areas of future work in Section 5. The notations used herein are summarized in Table 1.
4. Results and Discussion
Table 4, Table 5 and Table 6 show the evaluation results for each data attribute. With respect to the results for the formats and types, we list the top-five elements returned as the inferred results, scored based on their similarities to each query. Furthermore, we present the top-ten VLs returned as the inferred results, scored based on their similarities to each query, using our two proposed term-element matrices as well as TSM and Doc2vec.
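As a rough illustration of how such top-k results can be produced from a free-text query, the sketch below scores candidate elements by the similarity between the query and the training ODs. TF-IDF vectors and cosine similarity are simplifying assumptions here; this is not the exact matrix formulation of the proposed method.

```python
# Minimal sketch: infer attribute elements for a query outline by
# letting similar training outlines (ODs) vote for their elements.
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def infer_elements(query_od, train_ods, train_elements, top_k=10):
    """train_ods: list of outline strings; train_elements: list of
    element lists (e.g., VLs) aligned with train_ods."""
    vectorizer = TfidfVectorizer()
    od_vecs = vectorizer.fit_transform(train_ods)
    query_vec = vectorizer.transform([query_od])
    sims = cosine_similarity(query_vec, od_vecs)[0]

    scores = defaultdict(float)
    for sim, elements in zip(sims, train_elements):
        for e in elements:
            scores[e] += sim          # similar outlines vote for their elements
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]

# Toy example:
ods = ["refugee population by country and year",
       "daily temperature observations per city"]
vls = [["country", "origin", "year", "population"],
       ["city", "date", "temperature"]]
print(infer_elements("statistics of foreign residents per country", ods, vls, top_k=5))
```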
By comparing the F-measures calculated from the precision and recall of each method, we observed that the results inferred using the two proposed matrices demonstrated a better performance in inferring the type, format, and variable elements. In particular, the matrix based solely on the similarity of the data outlines achieved the best F-measure score. The results indicate that, although the data outline is an important attribute for characterizing the data, it does not always include information regarding the other attributes. In other words, string matching between the ODs and the elements of each attribute is insufficient to infer the elements. The performance of Doc2vec is comparatively poor. One reason is that the ODs contain few terms that describe the type, format, and variable elements. We compared the commonality of the terms derived from the ODs in the corpus of the training data with the VLs, formats, and types: only 162 out of 7871 terms were shared with the VLs. That is, only 162 words in the ODs contributed to the discovery of the VLs. If the commonality of terms is low, the similarity cannot be computed adequately, even if the dimensionality of the word embedding is low compared with that of one-hot vectors. In contrast, the formats and types were both included in the ODs; consequently, the F-measures of the formats and types using TSM and Doc2vec are higher than those of the VLs.
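The term-commonality figure quoted above can be checked with a simple vocabulary intersection, sketched below; whitespace tokenization and lowercasing are assumptions and may differ from the preprocessing actually used.

```python
# Minimal sketch: how many terms in the ODs also occur as VLs.
def vocabulary_overlap(ods, variable_labels):
    od_terms = {t.lower() for od in ods for t in od.split()}
    vl_terms = {vl.lower() for vl in variable_labels}
    return len(od_terms & vl_terms), len(od_terms)

shared, total = vocabulary_overlap(
    ["refugee population by country and year"],
    ["country", "origin", "year", "population"],
)
print(f"{shared} of {total} OD terms also appear as VLs")
```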
According to Hayashi and Ohsawa [15], the performances of the two term-element matrices are almost the same. However, when comparing the F-measures of the results obtained here, we found significant differences between the two matrices for each attribute; paired t-tests indicated significant differences for the format, type, and variable results. From this experiment, we concluded that a model based on the idea that "a pair of datasets whose outlines are highly similar also have similar elements in their attributes," namely, the matrix based only on the similarity of the data outlines, is suitable for inferring elements in the data attributes. In other words, the information on other datasets (the relationship between the ODs and the elements in each attribute) may compensate well for terms missing from a data explanation and may be suitable for discovering elements from the outlines of data whose attribute elements are missing.
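Such a paired comparison of the two matrices can be carried out per query, for example with scipy, as in the minimal sketch below; the per-query F-measure lists are placeholders, not the values reported in this study.

```python
# Minimal sketch: paired t-test on per-query F-measures of the two matrices.
# The numbers below are placeholders, not the paper's results.
from scipy import stats

f_similarity_only = [0.40, 0.55, 0.10, 0.62, 0.31]    # matrix using OD similarity only
f_with_cooccurrence = [0.35, 0.50, 0.08, 0.60, 0.28]  # matrix adding element co-occurrence

t_stat, p_value = stats.ttest_rel(f_similarity_only, f_with_cooccurrence)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # significant if p is below the chosen alpha
```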
The inferred examples using the co-occurrence model are exemplified in the study by Hayashi and Ohsawa; the results are not contrary to human intuition. However, when the performance is evaluated mechanically, the model that considers only the similarity of the ODs, without the co-occurrence of elements, performs better. We consider that the frequency distribution of the elements follows a power-law distribution rather than a Gaussian distribution. Although only a few kinds of format and type elements exist, "CSV," "TXT," "number," and "text" appear far more frequently than the other elements (Figure 3 and Figure 4). Moreover, many more kinds of VLs exist, and this distribution clearly influences the performance. Therefore, in this study, we conducted a detailed analysis using a threshold value for the VL frequency.
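The heavy-tailed (power-law-like) shape of the element frequencies can be inspected directly by counting element occurrences across the training DJs, as in this sketch with toy data.

```python
# Minimal sketch: count how often each element appears across DJs and
# inspect the heavy-tailed frequency distribution (toy data shown).
from collections import Counter

dj_formats = [["CSV"], ["CSV", "TXT"], ["CSV"], ["XLS"], ["TXT"], ["CSV"]]
freq = Counter(e for elements in dj_formats for e in elements)

for element, count in freq.most_common():
    print(element, count)   # a few elements (e.g., "CSV") dominate the counts
```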
As shown in Table 3, approximately 5600 types of variables exist in the training data, and the number of dimensions becomes extremely large when we create the term-VL matrices. As discussed in the previous section, the distribution of variables consists of a few extremely frequent variables and many variables with low frequencies. Therefore, we compare the performance based on a threshold on the variable frequencies.
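The frequency threshold used in the following analysis can be applied as a simple filter over the VL counts, sketched below; the threshold value and the toy counts are illustrative.

```python
# Minimal sketch: keep only VLs whose frequency reaches the threshold,
# shrinking the dimensionality of the term-VL matrices (toy counts).
from collections import Counter

vl_counts = Counter({"country": 40, "year": 35, "population": 12, "origin": 2, "elevation": 1})
threshold = 2
frequent_vls = {vl for vl, c in vl_counts.items() if c >= threshold}
print(frequent_vls)   # low-frequency VLs such as "elevation" are dropped
```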
Figure 7, Figure 8, Figure 9, Figure 10 and Figure 11 show boxplots of the F-measure obtained using VLs that appear at least once (i.e., all VLs), more than once, more than twice, more than three times, and more than four times, respectively. The dots represent the mean values, and the lines inside each box represent the medians. The bottom and top of each box represent the first and third quartiles, respectively, and the lower and upper whiskers represent the minimum and maximum values.
The number of VL types decreases according to the power law shown in Figure 6. For all VLs (Figure 7), the maximum F-measure is 0.947 for both proposed matrices. Because the medians are zero, a small number of highly frequent VLs drives the performance, and the F-measures for most of the data are low. This is because the types of VLs are diverse and low-frequency VLs constitute the majority. Hence, the means and medians of the F-measures increase for all methods when a threshold is set to restrict the VLs by frequency, as shown in Figure 8, Figure 9, Figure 10 and Figure 11.
The results indicate that the performance of all methods improved when the number of unique VLs was reduced; the F-measures generally improved until the threshold reached two (Figure 8, Figure 9 and Figure 10). When the threshold is three, however, there is almost no difference between the two matrices, and the result of the matrix that considers the co-occurrence of elements rises above that of the matrix based only on the similarity of the outlines. These results indicate that the method considering the co-occurrence of VLs may be suitable when using VLs that frequently appear together. A larger number of variables implies that more noise may be included in the training data, which affects the inference performance.
In contrast, when we set the threshold to four, the F-measures of all methods tend to decrease. Reducing the number of variables also reduces the amount of test data. In other words, although the performance for some of the data improves, it may become difficult to infer the variables of data that contain less frequent variables. These results suggest that using only 100 types of VLs, as is the case when the threshold is five, is insufficient for inferring the VLs.