1. Introduction
Earth Science, an interdisciplinary field, encompasses various disciplines, including geology, geophysics, atmospheric science, oceanography, and environmental science. The continuous advancement of modern Earth observation techniques has heralded the era of geoscience big data, encompassing observations of space, atmosphere, and land, as well as providing comprehensive coverage in all weather conditions and across multiple elements [1]. The knowledge and intricate relationships within the field of Earth Science necessitate effective organization and communication [2]. In recent years, the construction and utilization of knowledge graphs have emerged as powerful tools for capturing, integrating, and inferring knowledge. Earth Science knowledge graphs, which are applications of knowledge graph technology in the field of Earth Science [3], can be regarded as extensions of knowledge graph technology [4]. By representing knowledge using graphical structures, they effectively organize, integrate, and retrieve information related to Earth Science. The interconnections within Earth Science knowledge graphs facilitate the exploration of intricate relationships among geological phenomena, climate patterns, and environmental factors. This structured knowledge representation not only enhances knowledge management but also facilitates knowledge discovery, resulting in novel insights and discoveries within the realm of Earth Science.
In geological research, Earth Science knowledge graphs play a crucial role in integrating various Earth Science disciplines, including geochemistry, petrology, and structural geology, among others. Their purpose is to facilitate researchers’ comprehension of geological processes and evolution. For instance, Earth Science knowledge graphs uncover connections between distinct rock types, structural characteristics, and crustal movements, thereby aiding in understanding the causality and evolutionary aspects of geological events. Moreover, the integration of geological models, seismic data, and gravity–magnetic data within these knowledge graphs enables their application in seismic prediction and geological hazard assessment [5,6]. In terms of resource exploration, Earth Science knowledge graphs consolidate diverse geological information and exploration data, helping to identify potential mineral resources and mineral deposit distributions. By linking geological elements, mineral types, geological structures, and geophysical exploration data, they provide a comprehensive perspective for resource exploration, giving geologists deeper insight into the formation and distribution patterns of mineral deposits and guiding their work in resource exploration and development. Additionally, Earth Science knowledge graphs serve as a foundation for constructing and predicting Earth Science models. Through the integration of geological information, geophysical data, and model outputs, knowledge graphs furnish the necessary basis and validation for Earth Science models. This improves the accuracy, reliability, and scientific foundation of Earth Science models, contributing to resource management, disaster prevention, environmental assessment, and related fields.
In recent years, significant progress has been made in the construction of Earth Science knowledge graphs. Various research studies have focused on corpus construction [7], knowledge extraction (including entity extraction [8,9] and relation extraction [10,11]) [12,13,14,15,16,17,18,19,20], knowledge fusion [3,21], knowledge representation [22,23,24], and knowledge quality evaluation [25,26] for the development of Earth Science knowledge graphs. In terms of practical applications, task-specific knowledge graphs have been created based on specific tasks and their objectives. Liu [27] proposed a method for constructing knowledge graphs through feature analysis of geographic spatial data and Baidu Baike, with a primary focus on extracting geographic entities from spatial data complemented by attribute information from Baidu Baike. Zhou [28] constructed a knowledge graph for porphyry copper deposits based on a conceptual model. Yang [29] developed a conceptual framework for wetland knowledge modeling, taking into account factors such as wetland types and distribution patterns, thus facilitating the construction of wetland knowledge graphs. Ran [30] established a knowledge graph and a big data sharing platform for niobium–tantalum deposits, with a specific emphasis on key metals, offering a novel perspective on the spatiotemporal evolutionary patterns of niobium–tantalum deposits. Zhang [31] built a knowledge graph for gold mines by utilizing ontological knowledge to transform geological exploration knowledge, including geological structures and ore-seeking indicators, derived from large-scale gold mine geological exploration data. Feng [32] leveraged the characteristics of rare-earth minerals to define ontological knowledge and automatically constructed a knowledge graph for rare-earth minerals. Ma [33] integrated different versions of geological time through ontology and knowledge graph construction. Wang [34] proposed a rapid method for constructing Earth Science knowledge graphs by mapping relational data to triples. A series of Earth Science knowledge graphs have been developed with the support of the deep-time digital Earth framework [35,36,37,38,39,40,41,42,43].
Research in these two areas, construction techniques and task-specific applications, has provided substantial empirical experience in studying Earth Science knowledge graphs. However, the construction of large-scale Earth Science knowledge graphs [4] faces technical challenges at different stages, leading to a lack of foundational instantiated graph support that combines crowdsourcing and expert decision-making. Hence, this paper presents a “pipeline”-based automated construction methodology for Earth Science knowledge graphs that incorporates functional modules such as knowledge extraction, knowledge fusion, graph construction, and visualization. The methodology is then applied to the construction of a knowledge graph of iron ore deposits. This approach can serve as a crowdsourced process for constructing graphs, providing foundational graph support for large-scale Earth Science knowledge graphs and offering data support for comprehensive Earth Science intelligent question-answering systems and knowledge evolution systems.
2. Methods
This section provides comprehensive descriptions of the methodology workflow employed in this paper. It encompasses the design of individual modules and sub-modules in the process, along with the design of a data pipeline that facilitates communication between these modules.
2.1. Technical Roadmap and Knowledge Graph Construction Process
The construction of a knowledge graph is an iterative process that requires continuous collection, integration, cleansing, and updating of knowledge to improve and enrich the content and quality of the knowledge graph. Meanwhile, technologies from fields such as artificial intelligence, machine learning, and natural language processing are integrated to enhance the efficiency and accuracy of knowledge graph construction. The detailed steps involved in constructing a knowledge graph are as follows:
Step 1: Requirement analysis: Clearly define the goals and requirements for knowledge graph construction, determine the domain scope and knowledge coverage, and gain an understanding of the intended use and functionality of the knowledge graph.
Step 2: Data acquisition: Collect data related to domain knowledge from various sources, including structured data such as databases and tables, as well as unstructured data such as text documents, papers, and webpages.
Step 3: Data preprocessing: Cleanse and preprocess the collected data, which may involve noise removal, format standardization, deduplication, and redundancy elimination.
Step 4: Knowledge extraction: Extract knowledge and relevant information from the preprocessed data, utilizing techniques like natural language processing for tasks such as named entity recognition, relation extraction, and event identification.
Step 5: Knowledge representation: Convert the extracted knowledge and information into a structured form for storage and querying within the knowledge graph. Techniques such as ontology modeling using languages like OWL or RDF can be employed to define concepts, properties, and relationships.
Step 6: Knowledge fusion: Integrate knowledge from different data sources and eliminate conflicts and duplicates, ensuring a consistent representation. Techniques like consistency checking and conflict resolution can be employed for handling inconsistent knowledge.
Step 7: Knowledge storage: Store the constructed knowledge graph in an appropriate knowledge graph storage repository to support subsequent queries, retrieval, and analysis. Commonly used technologies include graph databases and semantic repositories.
Step 8: Knowledge inference: Utilize inference engines to reason and infer knowledge within the knowledge graph, enabling comprehensive and in-depth knowledge discovery. Inference can be based on ontological rules, logical reasoning, and other approaches.
Step 9: Application development: Develop applications and tools based on the constructed knowledge graph to support functions such as knowledge querying, recommendation, and question answering. This may involve utilizing techniques such as natural language processing, machine learning, and data mining.
Step 10: Continuous updating: The knowledge graph requires regular updating and maintenance to accommodate new knowledge, updated data, and changing business requirements. This may include monitoring new literature publications and updates in statistical data, among other sources.
The “pipeline”-based automated construction approach is based on the traditional knowledge graph construction process. Drawing inspiration from the concept of assembly lines in automated manufacturing, the complete construction process is divided into several operational modules within a pipeline setup. The design and configuration of these modules can be adjusted to suit task requirements and technological advancements. The technical roadmap is illustrated in Figure 1.
The “pipeline” is an approach that breaks down tasks into multiple independent steps and completes them in a specific order. Analogous to a physical production pipeline, each step has a specific function and role, enabling the entire process to operate efficiently. In the context of automating the construction of an Earth Science knowledge graph, the “pipeline” approach can divide the knowledge graph construction process into distinct stages. By assigning explicit tasks to each stage, this approach improves work efficiency, reduces repetitive labor, and provides a systematic and consistent framework for building an Earth Science knowledge graph.
In contrast to traditional methodologies, this paper divides the construction process into two distinct sections: human-assisted and automated operations (Figure 2). Human-assisted tasks, such as corpus construction and ontology development, necessitate human involvement. Conversely, automated tasks encompass model training, knowledge extraction, knowledge fusion, and knowledge graph construction, which are connected by data transmission pipelines capable of fully autonomous operation. The classification therefore reflects whether a task can run without human intervention. Moreover, each section and module can receive new data as input and produce operational results as output, allowing for dynamic control of the pipeline workflow.
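The idea of chaining automated modules through a data pipeline can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in this paper: the stage functions `extract`, `fuse`, and `build_graph`, the `head|relation|tail` input format, and the dictionary payload are all hypothetical stand-ins for the real modules.

```python
from typing import Any, Callable, Dict, List

# A stage receives the shared payload and returns an updated payload.
Stage = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_pipeline(stages: List[Stage], payload: Dict[str, Any]) -> Dict[str, Any]:
    """Pass the payload through each stage in order."""
    for stage in stages:
        payload = stage(payload)
    return payload


def extract(payload: Dict[str, Any]) -> Dict[str, Any]:
    # Hypothetical extraction: each input line already encodes "head|relation|tail".
    triples = [tuple(line.split("|")) for line in payload["lines"]]
    return {**payload, "triples": triples}


def fuse(payload: Dict[str, Any]) -> Dict[str, Any]:
    # Hypothetical fusion: drop triples that differ only by letter case.
    seen, fused = set(), []
    for head, rel, tail in payload["triples"]:
        key = (head.lower(), rel.lower(), tail.lower())
        if key not in seen:
            seen.add(key)
            fused.append((head, rel, tail))
    return {**payload, "triples": fused}


def build_graph(payload: Dict[str, Any]) -> Dict[str, Any]:
    # Store the graph as an adjacency mapping: head -> [(relation, tail), ...].
    graph: Dict[str, list] = {}
    for head, rel, tail in payload["triples"]:
        graph.setdefault(head, []).append((rel, tail))
    return {**payload, "graph": graph}


result = run_pipeline(
    [extract, fuse, build_graph],
    {"lines": [
        "Hematite|is_a|Iron ore",
        "hematite|is_a|iron ore",
        "Hematite|occurs_in|Banded iron formation",
    ]},
)
print(result["graph"])
```

Because every stage has the same signature, a stage can be replaced or a new one inserted without touching the others, which is the property the pipeline design relies on.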
2.2. Module Design
2.2.1. Human-Assisted Module
The quality of the corpus profoundly impacts deep learning. Thus, it is essential to create a professional and high-quality corpus that aligns with domain-specific characteristics. This serves as a critical foundation for knowledge extraction and the construction of a knowledge graph. To address the domain-specific characteristics of various professional fields, it is necessary to establish a constraint framework that encompasses professional content. During knowledge graph construction, this constraint framework is often achieved through the development of a domain ontology. Therefore, in the design of the human-assisted module, the construction of the domain ontology holds the utmost importance. The human-assisted module is further divided into two modules: domain ontology construction and corpus construction. Depending on the specific circumstances, these sub-modules can be further categorized into seven functional sub-modules, including ontology entity modeling, ontology semantic relation modeling, ontology modeling, data collection, unstructured data extraction and cleansing, corpus annotation, and corpus format transformation. These sub-modules are critical to the human-assisted module.
The ontology construction module [44] begins by determining the objectives of the domain ontology, thereby clarifying its intended goals and purposes. This facilitates the organization and sharing of knowledge within Earth Science and supports data processing in various applications. Subsequently, relevant knowledge resources, including literature, expert knowledge, and datasets, are gathered to acquire the information and concepts needed to construct the Earth Science domain ontology. The structure of the ontology is then defined in terms of concepts, properties, and relationships, which describe the associations among different elements and their characteristics. These definitions are commonly expressed in ontology description languages, which can describe inclusion relationships, property values, and related links. Lastly, the ontology is validated and revised through communication and collaboration with Earth Science domain experts. Experts typically follow these steps when validating a domain ontology:
Domain Expert Involvement: Domain experts play a vital role in the validation of domain ontologies. They possess extensive knowledge and familiarity with the domain, enabling them to identify potential issues within the ontology and provide feedback.
Exploration and Evaluation: Domain experts and the ontology construction team collaboratively explore and assess the accuracy and consistency of the ontology. This involves a thorough examination of the various levels and concepts within the ontology to ensure their alignment with actual domain knowledge.
Real-World Application Testing: To validate the applicability of the ontology in practical scenarios, it can be integrated into relevant applications or systems and undergo real-world testing. Domain experts and users can interact with the system to examine the utility and effectiveness of the ontology.
Multiple iterations may be required during this process to ensure the accuracy and completeness of the ontology (Figure 3, Table 1). Equation (1) outlines the primary components of the ontology:
$$\mathrm{Ontology} = \{\mathrm{Conception},\ \mathrm{Property},\ \mathrm{Axiom},\ \mathrm{Value},\ \mathrm{Nominal}\} \tag{1}$$
In the equation, Ontology denotes the ontology as a whole, Conception refers to concepts or classes, Property signifies the attributes associated with these concepts, Axiom acts as a constraint that regulates the development of the domain ontology, Value represents the attribute values, and Nominal establishes the link between concepts and real-world instances. The ontology of geoscience is abstracted into concepts, properties, relationships, rules, and instances, thereby forming the quintuple denoted in Equation (2):
$$\mathrm{GOnto} = \{\mathrm{GCon},\ \mathrm{GProp},\ \mathrm{GRel},\ \mathrm{GRul},\ \mathrm{GIns}\} \tag{2}$$
Within the equation, GOnto represents the ontology specifically designed for the domain of Earth Science. GCon corresponds to the conceptual framework that encompasses Earth Science. GProp encapsulates the diverse properties associated with Earth Science, encompassing mineralization timing, tectonic location, geological structures, spatial distribution, and scale. GRel reflects the intricate relationships among entities within the Earth Science domain, including instance relationships between entities and instances, as well as associations between instances and properties. GRul encompasses the essential rules that establish constraints on the types and combinations of concepts and instances during the construction of Earth Science ontology. GIns represents the mapping mechanism that links concepts to instances, providing concrete instantiations of entities derived from the conceptual framework.
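To make the quintuple concrete, the following Python sketch stores the five components in a small data structure. It is an illustrative assumption rather than the paper's ontology: the class name `GeoOntology`, the field layout, the two consistency rules standing in for GRul, and the sample concepts and instances are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class GeoOntology:
    """Illustrative container for the quintuple (GCon, GProp, GRel, GRul, GIns)."""
    concepts: Set[str] = field(default_factory=set)                    # GCon
    properties: Dict[str, Set[str]] = field(default_factory=dict)      # GProp: concept -> property names
    relations: Set[Tuple[str, str, str]] = field(default_factory=set)  # GRel: (head, relation, tail)
    instances: Dict[str, str] = field(default_factory=dict)            # GIns: instance -> concept

    def violations(self) -> List[str]:
        """GRul, reduced here to two hypothetical consistency rules."""
        problems = []
        # Rule 1: every relation must connect concepts that are defined.
        for head, rel, tail in sorted(self.relations):
            for concept in (head, tail):
                if concept not in self.concepts:
                    problems.append(f"relation '{rel}' uses undefined concept '{concept}'")
        # Rule 2: every instance must belong to a defined concept.
        for name, concept in sorted(self.instances.items()):
            if concept not in self.concepts:
                problems.append(f"instance '{name}' has undefined concept '{concept}'")
        return problems


onto = GeoOntology(
    concepts={"Deposit", "Mineral"},
    properties={"Deposit": {"mineralization_timing", "tectonic_location", "scale"}},
    relations={("Deposit", "contains", "Mineral")},
    instances={"Deposit A": "Deposit", "Hematite": "Mineral", "Fault F1": "Structure"},
)
issues = onto.violations()
print(issues)
```

In practice these components would be encoded in an ontology language such as OWL; the plain data structure is used here only to show how rules can constrain concepts and instances.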
Once the construction of the ontology is complete, standard Semantic Web technologies, including SPARQL queries and reasoning engines, can be employed to facilitate the application of the ontology. Furthermore, regular maintenance and updates of the domain ontology should be conducted over time to effectively capture the evolving domain knowledge. These maintenance and updates may encompass tasks such as incorporating new instances, refining the ontology structure, and expanding its coverage.
The process of maintaining and updating domain ontologies usually requires several iterations of improvement. However, due to the diversity and complexity of geoscience disciplines, the following challenges may be encountered throughout the iterative improvement process:
Domain complexity: The field of Earth Sciences is characterized by diversity and complexity, with intricate relationships between knowledge and concepts. In this context, accurately capturing domain knowledge and relationships is challenging.
Knowledge uncertainty: Knowledge in some disciplines of the Earth Sciences may be uncertain, e.g., vaguely defined or with incomplete relationships. Validating an ontology requires in-depth discussions and decision-making with domain experts to address this uncertainty.
Changing needs: Domain knowledge and concepts may change over time, so ontologies need to adapt and reflect these changes. Therefore, validation and improvement of ontologies requires continuous interaction and feedback with domain experts and users.
After the execution of the ontology construction sub-module, the ontology will be used to inform the main constraints and rules of the corpus creation module, thus participating in the corpus creation process.
The corpus creation module must adhere to the constraints of the ontology in order to define labels and corpora based on the domain’s knowledge system. This module encompasses four sub-modules, namely data collection, unstructured data extraction and cleaning, corpus annotation, and corpus format conversion. The data collection sub-module acquires geological texts either online through web crawlers or from local sources such as geological survey reports, mineral resource reports, geoscience research reports, and journal articles. The unstructured data extraction and cleaning sub-module preprocesses geological data by cleaning and standardizing the collected textual information. This involves tasks such as removing HTML tags and non-text characters and segmenting the text into sentences or words. The corpus annotation sub-module involves labeling the text based on the ontology, including entity and relationship annotation. Popular annotation tools include Brat (a web-based annotation tool) [45], GATE (a scalable open-source software framework), Prodigy, and Doccano [46]. Lastly, the corpus format conversion sub-module plays a crucial role in corpus construction. The conventional annotation format is the “BIO” format, but it may require modification depending on the model’s input requirements. As a result, the conversion sub-module primarily transforms the annotated corpus into the form that the model expects as input.
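As an illustration of the format conversion step, the sketch below turns span-level entity annotations into BIO tags. It assumes token-indexed, non-overlapping spans with an exclusive end index; the function name `spans_to_bio` and the labels `ROCK` and `MINERAL` are hypothetical and would in practice come from the domain ontology and the export format of the annotation tool.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end_exclusive, label) token spans into BIO tags.

    Assumes the spans do not overlap.
    """
    tags = ["O"] * len(tokens)           # every token starts outside any entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"       # remaining tokens of the entity
    return tags


tokens = ["The", "banded", "iron", "formation", "hosts", "magnetite", "."]
spans = [(1, 4, "ROCK"), (5, 6, "MINERAL")]
tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Other tagging schemes such as BIOES follow the same pattern with additional tags for entity ends and single-token entities.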
Upon the completion of the aforementioned corpus module, adjustments and updates to the corpus can be made as new data and knowledge accumulate.
2.2.2. Automation Module
Once the corpus construction is complete, it is imported into the automation module and used to train a domain-sensitive extraction model. The knowledge extraction process is then applied to identify pertinent information. Following the extraction, similar entities are merged, and the resulting data are imported into a graph database to enable visualization. Within this module, knowledge extraction assumes a paramount role, as it determines the precision of entities, attributes, and their relationships in the knowledge graph. Consequently, the automation module is subdivided into sub-modules, including model training, knowledge extraction, knowledge fusion, and graph construction. These sub-modules can be further refined, particularly in the domains of knowledge extraction and fusion, to incorporate the latest advancements in natural language processing.
2.3. Design of Functional Modules
2.3.1. Model Training Module
The module primarily comprises sub-modules for model selection, training, evaluation, and testing. The design of the module is presented in Figure 4. Initially, deep learning models should be selected based on the specific problem type, and the models commonly used for knowledge extraction are as follows:
Recurrent neural networks (RNNs): RNN is a model that is suitable for handling sequential data. It has the ability to capture the temporal information in the data. The RNN passes the hidden state at each time step to capture the contextual information of the input sequence. However, traditional RNNs often face the problem of vanishing or exploding gradients when dealing with long-term dependencies in the data.
Long short-term memory (LSTM): LSTM is a variant of RNN that is widely used to handle long-term dependencies. It uses gating mechanisms (a forget gate, an input gate, and an output gate) to control the flow of information, which largely mitigates the vanishing gradient problem; exploding gradients are usually controlled separately, for example by gradient clipping.
Gated recurrent unit (GRU): GRU is another variant of RNN, which is similar to LSTM. It simplifies the model’s structure by using fewer gates. GRU performs comparably to LSTM in many tasks, but it has fewer parameters and trains faster.
Attention mechanism (AT): The attention mechanism addresses the fact that a model does not process all parts of the input sequence equally well. It dynamically weights and aggregates the relevant parts of the sequence at each time step, thus improving the model’s ability to focus on important information.
Pretrained language models (PLMs): PLMs are language models that are trained on large-scale unlabeled text data. They learn rich language representations and can be used for various tasks, including knowledge extraction. Earlier pretrained representations include the static word embeddings Word2Vec and GloVe, while more recent and powerful models include BERT and the GPT series.
Sequence labeling models: Sequence labeling models are a class of models widely used in knowledge extraction tasks. Among them, the conditional random field (CRF) is commonly used for identifying and labeling specific information, such as named entities or entity relationships, in text. Recently, sequence labeling methods that incorporate pretrained language models have achieved good results.
In comparison, LSTM and GRU perform similarly, although LSTM is generally regarded as better at handling long-term dependencies. The attention mechanism can improve the model’s focus on the important parts of the input sequence. Pretrained language models provide richer language representations, improving performance in knowledge extraction tasks. Sequence labeling models that combine a CRF with a pretrained language model allow for more accurate entity recognition and relationship extraction.
In order to improve the performance of knowledge extraction, appropriate deep learning models and algorithms need to be selected based on the specific requirements of the task and characteristics of the dataset. Additionally, ensemble methods or transfer learning techniques can be utilized to enhance the effectiveness of knowledge extraction. The model’s architecture, activation functions, loss functions, and optimizers should also be carefully chosen to ensure optimal performance. Model parameters are initialized using either random initialization or pretrained weights. The input data flow through the network layers, undergoing nonlinear transformations and activation functions to generate the final output. Subsequently, the model’s output is compared against the target labels, and the loss function is computed. The computed loss value is then employed to calculate gradients using the backpropagation algorithm, enabling an assessment of each parameter’s impact on the loss function. Finally, the model’s parameters are updated using an optimization algorithm (e.g., stochastic gradient descent) based on the computed gradients. This optimization algorithm adjusts the parameter values considering the gradient direction and learning rate, intending to minimize the loss function. Following each training iteration, the adequacy of the training progress is evaluated, and if deemed satisfactory, the model parameters are output. Otherwise, the model undergoes another iteration.
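The training cycle described above (forward pass, loss computation, gradient calculation, parameter update, and a stopping check) can be illustrated with a deliberately small example. The sketch below is an assumption-laden stand-in rather than the extraction model itself: it fits a one-variable linear model with full-batch gradient descent on a mean-squared-error loss, and the gradients are written out by hand instead of being obtained by backpropagation through a deep network. The function name `train`, the learning rate, and the toy data are hypothetical.

```python
from typing import List, Tuple


def train(xs: List[float], ys: List[float], lr: float = 0.05,
          max_epochs: int = 2000, tol: float = 1e-8) -> Tuple[float, float, List[float]]:
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0                      # simple fixed initialization
    n = len(xs)
    history = []
    for _ in range(max_epochs):
        # Forward pass: compute predictions with the current parameters.
        preds = [w * x + b for x in xs]
        # Loss: compare predictions against the target labels.
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
        history.append(loss)
        # Stopping check: finish once the loss is small enough.
        if loss < tol:
            break
        # Gradients of the loss with respect to each parameter.
        grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        # Update: move against the gradient, scaled by the learning rate.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, history


# Toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b, history = train(xs, ys)
print(round(w, 2), round(b, 2), len(history))
```

A deep learning framework automates the gradient step and offers optimizers beyond plain gradient descent, but the loop structure stays the same.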
Upon completing the training phase, the performance of the trained model is assessed using a validation set. Based on the evaluation results, necessary adjustments are made to the model’s hyperparameters and architecture, including the learning rate, regularization parameters, and the number of layers, aiming to enhance its accuracy and robustness. Precision and recall are commonly used metrics for evaluating knowledge extraction tasks. Precision refers to the ratio of correctly extracted entities or relations to the total number of entities or relations in the model’s extraction results (Equation (3)). Recall is the ratio of correctly extracted entities or relations to the total number of entities or relations in the manual annotations (Equation (4)). The F1 score (F1-score) is often calculated by combining precision and recall to comprehensively evaluate the performance of the system (Equation (5)). The meaning of each parameter in the formula is shown in Table 2.
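The three metrics can be computed directly from the set of extracted items and the set of manually annotated items. The sketch below follows the verbal definitions given above and assumes the F1 score is the usual harmonic mean of precision and recall; the function name `prf1` and the sample entities are hypothetical, and the symbols of Table 2 are not reproduced here.

```python
from typing import Set, Tuple


def prf1(predicted: Set, gold: Set) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for extracted items against gold annotations."""
    correct = len(predicted & gold)                       # correctly extracted items
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)    # harmonic mean
    return precision, recall, f1


# Each item is an (entity text, entity type) pair.
gold = {
    ("hematite", "MINERAL"),
    ("magnetite", "MINERAL"),
    ("banded iron formation", "ROCK"),
    ("skarn", "ROCK"),
}
predicted = {
    ("hematite", "MINERAL"),
    ("magnetite", "ROCK"),            # wrong type, so it counts as an error
    ("banded iron formation", "ROCK"),
}
precision, recall, f1 = prf1(predicted, gold)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Here an item counts as correct only if both its text and its type match the annotation, which is one common but stricter convention than matching text alone.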
Subsequently, an independent test dataset is employed to conduct a comprehensive evaluation of the model’s performance in real-world scenarios. The outcomes derived from the test data analysis enable an assessment of the model’s effectiveness and its adherence to the application requirements. Once the requirements are fulfilled, the model can be deployed in a production environment, where continuous monitoring and updates are carried out.
2.3.2. Knowledge Extraction Module
The design of the knowledge extraction module is contingent upon the selection of knowledge extraction methods, as they directly impact the precision of the extraction process. As a result, the knowledge extraction module is the most critical functional component of the automation module. Depending on task requirements, data characteristics, and available resources, knowledge extraction can be classified into two frameworks: the pipeline method and joint extraction.
The pipeline method involves breaking down the knowledge extraction task into multiple subtasks, which are processed sequentially. Each subtask is responsible for extracting specific knowledge, and its output serves as the input for the subsequent task. This approach offers advantages such as independent development and debugging of each subtask, as well as the flexibility to incorporate new subtasks as needed. For instance, a common pipeline method combines named entity recognition (NER) and relation extraction as two subtasks. NER identifies entity types in the text, while relation extraction determines relationships between the entities based on the output of NER. The pipeline method has notable benefits, including a clear structure, modularity, and ease of construction and maintenance.
Joint extraction involves collectively modeling multiple knowledge extraction tasks. This method considers the interdependencies between tasks and addresses conflicts and competition through joint optimization. Graph-based models, such as conditional random fields (CRFs) or graph neural networks (GNNs), can be utilized to encode and infer relationships among different tasks. Compared to the pipeline method, joint extraction leverages contextual information across different tasks, resulting in improved accuracy and consistency of extraction. However, the challenge here lies in modeling and training complex joint models that encompass cross-task interactions and optimization. The accompanying figure illustrates the design of the corresponding module.
The choice between the two extraction frameworks should be determined by task requirements. Joint extraction is appropriate when better use of inter-task contextual relationships is desired, sufficient annotated data and computational resources are available, and the difficulty of designing and training an effective joint model can be accepted. Conversely, the pipeline method is appropriate when independent development and optimization of each subtask is preferred, which makes the module easier to understand and debug.
Figure 5 presents a comparison of the workflows for the two knowledge extraction methods.
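The pipeline method can be sketched with two toy subtasks in which the output of entity recognition is the input of relation extraction. This is a hedged illustration only: a real module would use trained models, whereas the dictionary lookup in `recognize_entities`, the trigger-word rule in `extract_relations`, and the contents of `GAZETTEER` and `TRIGGERS` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical lookup tables standing in for trained models.
GAZETTEER = {"magnetite": "MINERAL", "hematite": "MINERAL", "skarn": "ROCK"}
TRIGGERS = {"hosts": "hosts", "contains": "contains"}

Entity = Tuple[int, str, str]  # (token index, entity text, entity type)


def recognize_entities(tokens: List[str]) -> List[Entity]:
    """Subtask 1 (NER): label tokens found in the gazetteer."""
    return [
        (i, tok.lower(), GAZETTEER[tok.lower()])
        for i, tok in enumerate(tokens)
        if tok.lower() in GAZETTEER
    ]


def extract_relations(tokens: List[str], entities: List[Entity]) -> List[Tuple[str, str, str]]:
    """Subtask 2 (RE): link adjacent entities separated by a trigger word."""
    relations = []
    for (i, head, _), (j, tail, _) in zip(entities, entities[1:]):
        for tok in tokens[i + 1:j]:
            if tok.lower() in TRIGGERS:
                relations.append((head, TRIGGERS[tok.lower()], tail))
                break
    return relations


tokens = "The skarn hosts magnetite and hematite".split()
entities = recognize_entities(tokens)            # output of subtask 1 ...
relations = extract_relations(tokens, entities)  # ... is the input of subtask 2
print(entities)
print(relations)
```

The example also shows the main weakness of the pipeline method: relation extraction only sees what entity recognition produced, so any entity that is missed or mislabeled upstream cannot be recovered downstream.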
2.3.3. Knowledge Fusion Module
Upon completion of the knowledge extraction process, much of the extracted knowledge may be duplicated or redundant, which makes the resulting information confusing. Knowledge fusion identifies and eliminates duplicate and redundant knowledge, thereby reducing information redundancy and improving the efficiency with which information is used. This facilitates more effective knowledge management and utilization and reduces wasted resources and repeated labor. There are five commonly employed fusion methods:
String matching: String matching methods involve comparing the similarity between entity names or identifiers to facilitate matching and merging. For instance, algorithms like edit distance and Jaccard similarity can be employed to calculate the similarity between entity names, and a predetermined threshold can be set to determine whether they belong to the same entity. It is suitable for situations that require exact or fuzzy matching, such as recognizing and linking named entities, terminology matching, and information extraction. Although string matching algorithms are intuitive and effective, they may face efficiency issues in complex and large-scale scenarios. They are unable to handle semantic and contextual information, which can lead to poor performance when long strings are involved or when the match similarity is low.
Feature vector matching: Feature vector matching methods represent entities as feature vectors and assess the similarity between them. These feature vectors can encompass entity attributes, relationships, contextual information, and other relevant features. Typically, approaches like the bag-of-words model, TF-IDF, and Word2Vec are utilized to generate feature vectors, and similarity measurement techniques such as cosine similarity are used to compare the vectors. By establishing a similarity threshold, it becomes possible to ascertain whether the entities belong to the same category. Feature vector matching is commonly used in knowledge fusion tasks based on feature and similarity measurement. This method is applicable to various fusion scenarios, such as entity alignment, relation extraction, and link prediction in knowledge graphs. This method has flexibility and scalability, and can adapt to different types of features and similarity measurements. However, it is sensitive to the selection of features and similarity calculation methods, and needs to be fine-tuned according to specific tasks.
Context matching: Context-matching methods take into account the surrounding contextual information of entities to establish their identity. More specifically, matching and merging can be accomplished by analyzing co-occurrence patterns, relative positions, syntactic dependency relationships, and other contextual factors in the text. Context matching is commonly used to improve the accuracy of string matching and knowledge fusion by utilizing surrounding contextual information. It can identify the semantic and contextual consistency among strings in specific language environments by considering contextual and contextualized vocabulary. However, this requires that appropriate context windows, contextual information, and matching strategies be designed according to the domain and tasks. Some complex context-matching methods may have higher computational complexity issues.
Graph matching: Graph-matching methods consolidate entities by comparing their relational connections. This method represents entities and their relationships using graph structures and employs graph-matching algorithms to identify similar graphs and subgraphs. Graph-matching algorithms can be based on principles such as subgraph isomorphism and graph isomorphism. Graph-matching methods can capture complex relationships between entities. These are commonly used for entity and relationship matching and alignment tasks within knowledge graphs. They can perform matching and fusion operations by establishing the structure within the knowledge graph and using graph algorithms, such as entity alignment, information propagation, and graph pruning. Graph matching is applied in scenarios including knowledge graph fusion, graph data analysis, and link prediction. Graph matching can comprehensively consider the topological relationships and semantic similarity between nodes. However, in large-scale and highly dynamic graph structures, graph-matching algorithms may face challenges with regard to computational efficiency and scalability.
Machine learning methods: Machine learning methods treat entity merging as a classification or clustering problem. Entity information is represented as feature vectors, and algorithms such as support vector machines, random forests, and clustering algorithms assign identical entities to the same category. These methods automatically learn shared features and patterns from data, which allows them to adapt to different data and handle complex relationships, and they can be combined with other approaches to improve the accuracy and robustness of merging. They are widely used in knowledge fusion for feature learning, pattern recognition, and decision inference, where trained models learn and infer the relationships and fusion rules between different knowledge sources. Application scenarios include knowledge graph construction, relation extraction, and knowledge integration. However, their performance depends on the quality and accuracy of the training data, and they require a large amount of annotated data and training time. In practice, feature selection, model selection, and overfitting must also be considered.
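The first two methods in the list above can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function names, the character-bigram choice for Jaccard similarity, and the 0.8 threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one dynamic-programming row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def char_ngrams(s: str, n: int = 2) -> set:
    """Set of character n-grams; falls back to the whole string if too short."""
    return {s[i:i + n] for i in range(len(s) - n + 1)} or {s}


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity between the character n-gram sets of two names."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)


def cosine_bow(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two descriptions."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
        math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def same_entity(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical decision rule: merge if either string score passes."""
    return max(edit_similarity(a, b), jaccard(a, b)) >= threshold
```

In practice the threshold would be tuned on a labeled sample, since a value that suits short mineral names may be too strict or too loose for long descriptive names.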
The comprehensive process of a knowledge fusion module consists of the following steps:
Step 1: Data collection and cleaning: Initially, collect entity information from diverse data sources or texts and proceed to clean the data. The cleaning process encompasses eliminating duplicate entities, rectifying spelling errors, addressing aliases and abbreviations, and so on. Data cleaning aims to ensure data consistency and accuracy, laying a reliable foundation for subsequent entity fusion.
Step 2: Entity matching and identification: Match and identify the collected entity information, thereby determining which entities represent the same real-world entity. This can be achieved through methods like similarity calculation and comparing entity attributes. For example, calculating the similarity between entity names allows us to consider them as the same entity when the similarity exceeds a pre-defined threshold.
Step 3: Feature extraction and representation: Extract pertinent features from different data sources or texts for the matched entities. These features can include entity attributes, relationships, contextual information, and more. The objective of feature extraction is to provide specific information and a basis for making judgments in the subsequent entity fusion process.
Step 4: Similarity calculation and threshold setting: Calculate the similarity between entities based on their features. Various measurement methods such as string similarity, vector similarity, and context matching can be employed for similarity calculation. By setting a similarity threshold according to the specific application scenario and requirements, it becomes possible to determine whether two entity mentions refer to the same real-world entity.
Step 5: Collision handling and decision-making: Address collisions between similar entities, i.e., develop strategies to handle entities with similarity values exceeding the threshold. Decisions can be made based on prioritizing information from a certain data source or employing manual review. The decision-making process can be modified and optimized based on the actual situation.
Step 6: Entity merging and integration: Merge and integrate the matched and decided entities to form the final fused entity. This merging process may involve merging entity attributes, relationships, and related operations to ensure the integrity and consistency of entity information.
Step 7: Post-processing and validation: Conduct post-processing and validation on the merged entities to ensure the accuracy and consistency of the fusion results. Post-processing activities may encompass removing redundancies, resolving conflicts, updating attributes, and more. The validation process may involve manual review or validation by domain experts.
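The steps above can be sketched end to end as follows. This is a simplified illustration under stated assumptions, not the module built in this study: the record layout (`name`, `source`, `attrs`), the source names `survey` and `journal`, the bigram Jaccard measure, and the 0.8 threshold are hypothetical, and the manual validation of Step 7 is not shown.

```python
from itertools import combinations


def normalize(name: str) -> str:
    """Step 1 (cleaning): lowercase and collapse whitespace."""
    return " ".join(name.lower().split())


def bigram_jaccard(a: str, b: str) -> float:
    """Steps 3-4: character-bigram features and their Jaccard similarity."""
    ga = {a[i:i + 2] for i in range(len(a) - 1)} or {a}
    gb = {b[i:i + 2] for i in range(len(b) - 1)} or {b}
    return len(ga & gb) / len(ga | gb)


def fuse(records, threshold=0.8, source_priority=("survey", "journal")):
    """Cluster records that refer to the same entity and merge each cluster."""
    names = [normalize(r["name"]) for r in records]
    parent = list(range(len(records)))  # union-find forest over record indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Step 2: link every pair whose similarity reaches the threshold.
    for i, j in combinations(range(len(records)), 2):
        if bigram_jaccard(names[i], names[j]) >= threshold:
            parent[find(j)] = find(i)

    clusters = {}
    for i, rec in enumerate(records):
        clusters.setdefault(find(i), []).append(rec)

    # Steps 5-6: within a cluster, the higher-priority source wins conflicts.
    rank = {s: k for k, s in enumerate(source_priority)}
    fused = []
    for members in clusters.values():
        members.sort(key=lambda r: rank.get(r["source"], len(rank)))
        attrs = {}
        for rec in reversed(members):  # lowest priority first, then overwrite
            attrs.update(rec["attrs"])
        fused.append({
            "name": normalize(members[0]["name"]),
            "aliases": sorted({m["name"] for m in members}),
            "attrs": attrs,
        })
    return fused
```

Keeping the aliases of every merged record makes the later validation step easier, because a reviewer can see which surface forms were collapsed into one entity.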
The triplets that undergo entity fusion already possess all the essential elements required to form a knowledge graph and serve as the foundational data for constructing intelligent question-answering systems and knowledge recommendation systems based on the knowledge graph.
2.3.4. Knowledge Graph Construction Module
Upon completing knowledge fusion, it becomes imperative to visualize the data stored in the knowledge graph, thereby enhancing the comprehension and exploration of information and relationships within the graph. In the domain of graph visualization research, graph databases serve as database systems for efficient storage and processing of graph-structured data. Concepts, entities, and attributes undergo a transformation into nodes, while relationships between various entities and attributes are represented as edges, forming structured triplets. Unlike conventional relational or document-oriented databases, graph databases place emphasis on relationships (edges) and the topological structure among nodes (entities), with a focus on addressing intricate graph querying and analysis tasks. Prominent graph databases encompass Neo4j (recognized for its high performance, reliability, and robust graph query capabilities), Amazon Neptune (distinguished by its scalability, persistence, and exceptional availability), TigerGraph (providing support for parallel computing), ArangoDB (characterized by its versatility as a multi-model database), and Sparksee (delivering rapid graph query and analysis capabilities). Leveraging the unique attributes of graph databases, knowledge graphs can accomplish the following functionalities:
Efficient query and graph analysis: Graph databases employ query languages (such as SPARQL and Cypher) and graph-based algorithms to facilitate efficient query and analysis operations. This enables the utilization of graph databases as a platform for constructing question-answering systems based on knowledge graphs, facilitating fast graph traversal, relationship path queries, node similarity calculations, and other operations.
Large-scale data processing and horizontal scalability: Graph databases possess the capacity to handle large-scale datasets and support horizontal scalability. Through techniques such as partitioning, replication, and distributed computing, graph databases distribute data and computational workloads across multiple nodes, achieving high performance, availability, and scalability. Consequently, they provide the groundwork for expanding knowledge graphs and enriching knowledge systems, serving as a platform for large-scale knowledge graph sharing.
Visualization and exploratory analysis: Graph databases offer visualization tools and query interfaces to aid users in intuitively comprehending and exploring graph data. These tools can depict the topological relationships of nodes and edges, visualize the outcomes of graph algorithms, and assist users in identifying the hidden patterns and insights within the knowledge graph.
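The relationship path queries mentioned above can be illustrated without a database. The sketch below runs a breadth-first search over an in-memory list of triplets; a graph database would perform the equivalent traversal through its own query language. The function names and the example relation labels are assumptions.

```python
from collections import defaultdict, deque


def build_adjacency(triples):
    """Index (head, relation, tail) triplets by head node."""
    adjacency = defaultdict(list)
    for head, relation, tail in triples:
        adjacency[head].append((relation, tail))
    return adjacency


def find_path(triples, start, goal):
    """Return the shortest directed chain of triplets from start to goal.

    Returns an empty list when start == goal and None when no path exists.
    """
    adjacency = build_adjacency(triples)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None
```

For example, given the triplets `("Gongchangling iron mine", "instance_of", "iron ore deposit")` and `("iron ore deposit", "subclass_of", "mineral deposit")`, `find_path` returns both triplets in order when asked for a path from the mine to "mineral deposit".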
2.3.5. Data Pipeline Design
To achieve seamless integration among modules in the automation section, it is essential to construct data pipelines for the transmission and transformation of data. These data pipelines should be designed based on the inputs and outputs of the modules (Figure 6). For instance, in the model training sub-module, labeled corpora are converted into the required format for the model at the input stage, or data reading rules are modified. At the output stage, trained model parameters are generated. In the knowledge extraction sub-module, cleaned unstructured data are inputted and predicted labels are generated. It is imperative to associate these predicted labels with entities, attributes, and relationships and convert them into triplets. In the knowledge fusion sub-module, triplets involving entities are inputted, and fused triplets are generated as output. Finally, in the graph construction sub-module, triplets are inputted to generate a visualized graph as output.
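The conversion from predicted labels to triplets can be sketched as below, assuming that the extraction model emits BIO-style tags (`B-X`, `I-X`, `O`). The tag names and the pairing rule, which attaches each property span to the most recent entity span, are illustrative assumptions rather than the rules used in this study.

```python
def bio_to_spans(tokens, labels, joiner=" "):
    """Group BIO-tagged tokens into (tag, text) spans.

    An I- tag that does not continue an open span of the same type is dropped.
    Use joiner="" for character-level Chinese tokens.
    """
    spans, tag, buffer = [], None, []
    for token, label in zip(tokens, labels):
        if label.startswith("I-") and tag == label[2:]:
            buffer.append(token)            # continue the open span
            continue
        if tag is not None:                 # close the open span
            spans.append((tag, joiner.join(buffer)))
            tag, buffer = None, []
        if label.startswith("B-"):          # open a new span
            tag, buffer = label[2:], [token]
    if tag is not None:
        spans.append((tag, joiner.join(buffer)))
    return spans


def spans_to_triples(spans, entity_tags):
    """Attach each property span to the most recent entity span (heuristic)."""
    triples, subject = [], None
    for tag, text in spans:
        if tag in entity_tags:
            subject = text
        elif subject is not None:
            triples.append((subject, tag, text))
    return triples
```

The nearest-entity heuristic is only a placeholder; where the model also predicts relation labels, those labels would decide which entity each property belongs to.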
3. The Construction of a Knowledge Graph for Iron Ore Deposits
In this section, to verify the feasibility and efficiency of the proposed approach and process design, this paper starts with data collection and ontology creation and implements the construction and visualization of a knowledge graph for iron ore deposits according to the described method.
The construction of the iron ore deposits ontology is accomplished using the automated process of building a knowledge graph. This involves applying the seven-step method for domain construction [47,48,49] and drawing insights from expert articles and patent achievements in the field of iron ore deposits [50,51,52,53,54]. By gaining an understanding of the field’s characteristics, problems, and requirements, iron ore deposits concepts, attributes, and entities are extracted from the relevant literature, professional terminology, and industry standards. The hierarchical structure of concepts in the ontology is then designed, and the main concepts and their hierarchical relationships are determined. The ontology is organized through classification based on features, attributes, and other methods, allowing for the definition of parent–child relationships and properties between concepts. As a result, the iron ore deposits knowledge system is constructed. The attributes, characteristics, and relationships within the iron ore deposits knowledge graph primarily revolve around iron ore deposits entities. During geological exploration, it is essential to first identify the geological conditions in the investigated area, including the stratigraphy, structure, and the presence of igneous rocks, in order to comprehend and analyze variations in ore bodies, predict changes, and assess the reserves, quality, and morphology of mineral deposits. Subsequently, the external morphology and internal structural characteristics of the ore body are studied to determine aspects such as form, distribution, occurrence, ore grade, material composition, and structural features, all of which are crucial factors for evaluating geological traits. Accordingly, the existing iron ore deposits knowledge system is combined to establish relationships between iron ore deposits entities, attributes, and entity-attribute features, thereby modeling the iron ore deposits entities and attributes. Semantic relationships within iron ore deposits encompass the mapping relationship between concepts and instances, the inclusion relationship between entities, and the attribute correlation relationship between entities and attributes (Figure 7).
The mapping relationship between concepts and instances of iron ore deposits can be expressed as Equation (6).
In the equation, C signifies concepts, e denotes instances that correspond to the concepts, and r represents the relationship between concepts and their corresponding instances. In this paper, C stands for a concept, such as a mineral deposit; e stands for an instance of an iron ore deposit, such as the Gongchangling iron mine; and r stands for the mapping relationship between the concept of a mineral deposit and the instance of the Gongchangling iron mine.
The inclusion relationship among iron ore deposits entities is primarily defined by the interconnections between concepts at various levels. This relationship can be mathematically represented by Equation (7), in accordance with the hierarchical structure of the concepts.
In the equation, C1 stands for mineral deposit, C2 for ore zone, C3 for ore body, and C4 for ore district, and R (Ci, Cj) denotes the inclusion relationship between concepts Ci and Cj. Specifically, mineral deposit C1 encompasses ore zone C2, ore zone C2 encompasses ore body C3, and, by transitivity, mineral deposit C1 encompasses ore body C3. These semantic relationships apply to mineral deposit entities associated with concepts at different levels.
The association between entities and properties primarily revolves around the attribute features of iron ore deposits entities, which can be categorized into object properties and data properties. Object properties refer to the semantic relationship between iron ore deposits entities and attribute objects, such as the correlation between the mineral deposit entity and attributes like ore-forming stratigraphy, ore-forming structures, and tectonic structures. On the other hand, data properties represent the correlation between entities and attribute values, such as the association between the ore body entity and numerical values like strike, dip, length, and thickness. The entity–property association in the context of iron ore deposits can be formally expressed as Equation (8).
In the equation, e denotes an instance, property signifies an attribute, and value represents the corresponding attribute value.
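The three kinds of semantic relationship described above can be written down as plain triplets. The sketch below is only one possible encoding: the relation names `instance_of` and `includes`, the helper names, and the tuple layout are assumptions, and the closure function shows how the inclusion of ore bodies in a deposit follows from the two direct inclusions.

```python
def instance_of(instance, concept):
    """Concept-instance mapping, e.g. a named mine mapped to its concept."""
    return (instance, "instance_of", concept)


def includes(parent, child):
    """Inclusion between concepts at adjacent levels of the hierarchy."""
    return (parent, "includes", child)


def has_property(entity, prop, value):
    """Entity-property association: (e, property, value)."""
    return (entity, prop, value)


def inclusion_closure(triples):
    """Add every inclusion implied by transitivity to the direct inclusions."""
    pairs = {(h, t) for h, r, t in triples if r == "includes"}
    changed = True
    while changed:
        changed = False
        for a, b in list(pairs):
            for c, d in list(pairs):
                if b == c and (a, d) not in pairs:
                    pairs.add((a, d))
                    changed = True
    return {(a, "includes", b) for a, b in pairs}


ontology = [
    instance_of("Gongchangling iron mine", "mineral deposit"),
    includes("mineral deposit", "ore zone"),
    includes("ore zone", "ore body"),
]
```

Calling `inclusion_closure(ontology)` yields the two direct inclusions plus the derived triplet linking "mineral deposit" to "ore body".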
Upon completion of the construction of the iron ore deposits ontology, a knowledge corpus specific to iron ore deposits is generated based on the constraints of the ontology. Esteemed Chinese geological journals, including “Acta Petrologica Sinica”, “Journal of Mineral Deposits”, “Mineral Deposits Geology”, and “Geological Review”, serve as data sources. Relevant articles pertaining to iron ore deposits are collected and unstructured data are extracted. Following data cleansing, the Doccano annotation platform is utilized to annotate entities, properties, and semantic relationships based on the iron ore deposits ontology. This process culminates in the creation of a labeled corpus intended for the extraction of iron ore deposits knowledge. The corpus is divided into three subsets: training data, validation data, and data earmarked for extraction, categorized according to their respective purposes.
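Before training, span annotations are typically converted into per-character labels. The sketch below assumes that each annotation is a `(start, end, tag)` triple with an exclusive end offset and that a BIO scheme is used; the actual export layout of the annotation platform and the tag names used in this study are not given in the text, so both are assumptions.

```python
def spans_to_bio(text, spans):
    """Convert (start, end, tag) character spans into one BIO label per character."""
    labels = ["O"] * len(text)
    for start, end, tag in sorted(spans):
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span out of range: {(start, end, tag)}")
        if any(label != "O" for label in labels[start:end]):
            raise ValueError(f"overlapping span: {(start, end, tag)}")
        labels[start] = f"B-{tag}"              # first character of the span
        for i in range(start + 1, end):
            labels[i] = f"I-{tag}"              # remaining characters
    return labels


# Toy sentence: "弓长岭铁矿位于辽宁" (the Gongchangling iron mine is located in Liaoning).
example_text = "弓长岭铁矿位于辽宁"
example_spans = [(0, 5, "DEPOSIT"), (7, 9, "LOCATION")]
example_labels = spans_to_bio(example_text, example_spans)
```

Rejecting overlapping spans keeps the one-label-per-character condition that a CRF output layer relies on.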
Following the human-assisted operations on the data, the developed corpus is integrated into the system to construct the PLMs (Pretrained Language Models) + BiLSTM + CRF framework. This is a composite modeling framework for knowledge extraction that combines pretrained language models (PLMs), bidirectional long short-term memory (BiLSTM) networks, and conditional random fields (CRFs). It has distinctive features compared to other models. First, the framework combines multiple models and incorporates their individual strengths. Pretrained language models have strong contextual understanding. LSTM, as a variant of recurrent neural networks, can process sequential data and capture long-term dependencies. The CRF learns the relationships between entities and the transition patterns of label sequences, optimizing the model outputs to improve the accuracy and consistency of entity boundaries. Since geoscientific texts have strong syntactic dependencies and contextual relationships, the chosen modeling framework meets the data requirements. Second, the PLMs + BiLSTM + CRF framework allows end-to-end training and inference. Feature representation, sequence modeling, and label prediction are learned simultaneously, which reduces the need for feature engineering and simplifies the processing steps. Finally, the flexibility of the chosen framework, combined with the pipeline-based approach described in this paper, allows the model to be modified and extended according to the task requirements, which highlights the versatility of the pipeline-based approach.
A comparison of BERT [54,55], RoBERTa [56], PERT [57], and LERT [58] on named entity recognition and relationship extraction shows that LERT outperforms BERT, RoBERTa, and PERT in Chinese knowledge extraction tasks (Table 3). Geoscience data are highly specialized: the underlying geoscience entities are distinctly domain-specific and exhibit close syntactic dependencies with attributes and relationships. Although previous efforts have been made to train pretrained models, the existing models still have limitations. During the training of these models, attention is mainly focused on incorporating some linguistic features without carefully analyzing the contribution of each feature to the overall performance or the relationship between different tasks. LERT adopts a multi-task pretraining methodology based on masked language modeling and three linguistic tasks: Part-of-Speech (POS) tagging, Named Entity Recognition (NER), and Dependency Parsing (DEP). These are relatively basic linguistic features that perform well in labeling and satisfy the one-to-one labeling condition. Among them, the NER feature depends on the output of POS tokens, while the DEP feature depends on both POS tokens and NER tokens; that is to say, word class is the most basic linguistic feature, followed by noun phrases and verb phrases. Based on these dependencies, a different learning rate is assigned to each linguistic feature during the training of the LERT model, so that POS is learned faster than NER and DEP. This is similar to human learning, where an individual usually learns the basic topic first and then learns the higher-level knowledge that depends on it. Using all three linguistic features yields a final performance gain with consistent improvements across downstream tasks. The model has been evaluated in extensive experiments on ten Chinese natural language tasks, and the results demonstrate that LERT significantly improves performance on comparable downstream tasks.
Consequently, this study selected LERT + BiLSTM + CRF as the framework for model training. Because the model requires the text to be tokenized and converted into LERT token representations, and sequences to be split and padded for training, a maximum sequence length (max_seq_len) of 512 was set so that all text chunks have a uniform length. The loss function and optimizer are fundamental to supervising and optimizing the model during training. In this study, the loss function is defined as the negative log-likelihood loss of the CRF, which takes dependencies between labels into account during training. The hyperparameters of the knowledge extraction model are listed in Table 4.
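To make the loss concrete, the following sketch computes the negative log-likelihood of a linear-chain CRF for one sequence in plain Python. It is a simplified illustration, not the training code of this study: it omits start and end transition scores, and in the real model the `emissions` would come from the BiLSTM layer rather than being hard-coded.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def path_score(emissions, transitions, tags):
    """Unnormalized score of one tag sequence: emissions plus transitions."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score


def log_partition(emissions, transitions):
    """Forward algorithm: log of the summed exp-scores over all tag sequences."""
    n_tags = len(emissions[0])
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            logsumexp([alpha[i] + transitions[i][j] for i in range(n_tags)])
            + emissions[t][j]
            for j in range(n_tags)
        ]
    return logsumexp(alpha)


def crf_nll(emissions, transitions, tags):
    """Negative log-likelihood of the gold tag sequence."""
    return log_partition(emissions, transitions) - path_score(emissions, transitions, tags)


def brute_force_log_partition(emissions, transitions):
    """Reference value obtained by enumerating every path (small inputs only)."""
    n_tags, length = len(emissions[0]), len(emissions)
    return logsumexp([
        path_score(emissions, transitions, path)
        for path in product(range(n_tags), repeat=length)
    ])
```

Because the transition scores enter every path score, minimizing this loss penalizes label sequences with implausible transitions, which is the label dependency that the text refers to.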
Upon completion of the model training, data are directed into the knowledge extraction module. Leveraging the Chinese pretrained model, LERT, the module generates word vectors from the annotated corpus of entity–property labels. By combining the long-range dependency-capturing capability of the BiLSTM network, it extracts the features of iron ore deposits entities and properties. The CRF is utilized to predict labels for the extracted entities and properties, thereby accomplishing the extraction process. While the expressions of iron ore deposits entities in the acquired unstructured data are standardized, descriptions related to object properties and numerical properties tend to be more intricate. Consequently, in terms of practical effectiveness, entity recognition outperforms property recognition. However, despite model integration, a few recognition errors and incompleteness issues persist due to ambiguous boundaries. Thus, this study incorporates the Word2Vec model to adopt a knowledge fusion method, utilizing the cosine similarity algorithm for similarity calculation. Applying Equation (9) to calculate the feature vectors V for entities and comparing the cosine similarity between two entity feature vectors, entities sharing a cosine similarity exceeding 0.9 are deemed the same, thereby initiating the merging operation. To ensure precise fusion, manual assistance is employed for control.
$$V = \alpha V_{char} + \beta V_{word} + \gamma V_{c}, \qquad V_{c} = \frac{1}{N} \sum_{f \in c} V_{f}^{c} \tag{9}$$
In the equation, $V_{char}$ represents the character vector, incorporating the character information of the entity, with corresponding weight parameter $\alpha$. $V_{word}$ represents the word vector, which is obtained through fine-tuning a pretrained model such as BERT, encompassing entities from the corpus, with corresponding weight parameter $\beta$. $V_{c}$ represents the feature vector within the context c, capturing the contextual information surrounding the current entity, with weight parameter $\gamma$. $V_{f}^{c}$ represents the feature vector of the feature word f within the context c. $N$ denotes the total number of words in the context.
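The fusion rule described here can be sketched as follows: an entity vector is formed as a weighted combination of a character vector, a word vector, and the mean of the context word vectors, and two entities are merged when their cosine similarity exceeds 0.9. The weight values are not given in the text, so the defaults below are placeholders, and the function names are assumptions.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors (the context vector)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def entity_vector(char_vec, word_vec, context_vecs, alpha=0.3, beta=0.4, gamma=0.3):
    """Weighted sum of character, word, and averaged context vectors.

    alpha, beta, and gamma are placeholder weights, not values from the study.
    """
    context = mean_vector(context_vecs)
    return [alpha * c + beta * w + gamma * x
            for c, w, x in zip(char_vec, word_vec, context)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def should_merge(u, v, threshold=0.9):
    """Merge rule from the text: similarity strictly above the threshold."""
    return cosine(u, v) > threshold
```

Pairs whose similarity falls just around the threshold are natural candidates for the manual check mentioned in the text.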
Once fusion is completed, the process proceeds to the graph visualization module. In designing this module, Neo4j is employed due to its capabilities of rapid data querying, comprehensive functionality, visualization support, structured and multidimensional storage, as well as its advantageous management of triplets. The iron ore deposits knowledge graph is stored and visualized using Neo4j, capitalizing on its strengths. Concepts, entities, genetic types, and mineralization geological features, such as distribution and scale, are converted into nodes, whereas the connections between different iron ore deposits entities and their corresponding mineralization geological features are represented as edges. The resulting iron ore deposits knowledge graph is visualized as depicted in Figure 8. Node colors in the visualization interface indicate different levels of knowledge within the knowledge system. By utilizing the querying functionality of the graph database, associations can be retrieved between the “Gongchangling iron mine” node and other entity nodes or attribute value nodes.
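Loading fused triplets into Neo4j can be sketched by generating parameterized Cypher statements. The example below only builds the statements and their parameters; executing them requires a database driver, which is not shown. The node label `Entity`, the property key `name`, and the helper names are assumptions, not the schema used in this study.

```python
import re


def relationship_type(label: str) -> str:
    """Turn a free-text relation into a Cypher relationship type.

    Relationship types cannot be passed as query parameters, so the label
    is sanitized and written into the statement text instead.
    """
    cleaned = re.sub(r"\W+", "_", label.strip()).strip("_").upper()
    if not cleaned:
        raise ValueError(f"unusable relation label: {label!r}")
    return cleaned


def triple_to_cypher(head: str, relation: str, tail: str):
    """Return (statement, parameters) that upserts one triplet."""
    statement = (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        f"MERGE (h)-[:`{relationship_type(relation)}`]->(t)"
    )
    return statement, {"head": head, "tail": tail}


# Retrieves every node directly connected to a named node, in either direction.
NEIGHBOR_QUERY = (
    "MATCH (n:Entity {name: $name})-[r]-(m) "
    "RETURN type(r) AS relation, m.name AS neighbor"
)
```

Using `MERGE` rather than `CREATE` means that loading the same triplet twice does not duplicate nodes or edges, and running `NEIGHBOR_QUERY` with `name` set to "Gongchangling iron mine" corresponds to the association lookup described above.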
4. Discussion
Knowledge graphs are knowledge representation models that use semantic associations to organize domain knowledge as structured graphs. They describe the relationships and semantic connections between entities by organizing entities, properties, and relationships into a network graph. The primary objective of knowledge graphs is to capture and express real-world knowledge, helping individuals comprehend and effectively use vast amounts of information. When knowledge graphs are applied in the field of geoscience, they are known as geoscience knowledge graphs. These graphs offer several advantages, including the integration of multiple data sources, knowledge discovery and association, analysis of the Earth system, intelligent search and recommendation capabilities, and knowledge sharing and collaboration functionalities. Consequently, geoscience knowledge graphs serve as a powerful tool for research and application within the geoscience field, opening new avenues for the development and application of Earth Science. However, research on geoscience knowledge graphs is still primarily in the theoretical and experimental phase and lacks sufficient practical case applications. Moreover, given the multitude of disciplines, diverse data, and ambiguous data boundaries in the field of Earth Science, the construction of large-scale geoscience knowledge graph platforms that, like Wikipedia, combine crowd-sourcing with expert knowledge systems will play a crucial role in the future.
The present study presents a “pipeline”-based approach for the automated construction of geoscience knowledge graphs. This approach efficiently generates knowledge graphs of research objects within specific disciplines or sub-disciplines of Earth Science. This study also demonstrates a systematic process for data collection, construction, and visualization of geoscience knowledge graphs. Two sections, human-assisted and automated, are designed, each incorporating various functional modules. The human-assisted section primarily focuses on domain ontology construction and corpus development, with the domain ontology serving as a crucial constraint for corpus development. Construction of domain-specific ontologies requires incorporating expert opinions and considering the characteristics of the domain knowledge system. The knowledge system can be continuously expanded and enriched as research progresses. The generality of the method proposed in this paper is reflected in this section. When applying this method to construct knowledge graphs in other disciplines within the field of Earth Sciences, it is essential to develop discipline-specific ontologies and build a corpus using relevant literature and geological survey reports as data sources and then to proceed to the next section to carry out the subsequent processing. The automated section encompasses model training, knowledge extraction, knowledge fusion, and graph construction. The knowledge extraction model can be designed based on extraction algorithms from the field of natural language processing, thereby enhancing the precision and accuracy of knowledge extraction. To ensure the quality of the knowledge graph, post-extraction verification can be performed.
The approach used in this study for the rapid construction of geoscience knowledge graphs aims to reduce technical barriers and increase the number of domain-specific knowledge graphs in the future. It can provide foundational data support for the development of a comprehensive interdisciplinary knowledge-sharing platform and intelligent question-answering and decision-making systems within the field of Earth Science. It can also be combined with knowledge graph-based algorithms and artificial intelligence techniques to build more advanced AI-driven knowledge systems and serve applications such as resource management, risk assessment, and recommender systems. For example, in critical mineral resource management, it is important to construct a knowledge graph of the entire mineral industry chain from mine supply to smelters and downstream industries. A knowledge graph of the whole industry chain of key minerals, built through knowledge fusion, can be combined with graph representation learning, graph-clustering algorithms (such as spectral clustering and the Louvain algorithm), graph-matching algorithms (such as subgraph isomorphism, graph isomorphism, and graph edit distance), graph-recommending algorithms (such as path-based recommendation and graph convolutional networks), and knowledge graph completion algorithms (such as RESCAL and DistMult, which are based on tensor decomposition models) to build a key mineral resource search engine and a key mineral resource question-answering system.
In addition to describing the design of each functional module and data pipeline in the approach, a knowledge graph of iron ore deposits is constructed in this study, focusing on iron ore deposits as the subject. A Chinese pretrained model is selected based on the training corpora, and the LERT + BiLSTM + CRF knowledge extraction model framework is determined through model training outcomes. The expanding number of knowledge nodes in the iron ore deposits knowledge graph reveals significant potential for knowledge mining and discovery within the field of iron ore deposits. The graph integrates and organizes a substantial amount of information related to geological features and deposit types, forming a comprehensive knowledge network. Consequently, researchers can efficiently acquire and analyze knowledge in the field of iron ore deposits, uncovering patterns and correlations. Additionally, integrating the knowledge graph of iron ore deposits with those in other fields presents a promising avenue for interdisciplinary knowledge discovery. Combining knowledge graphs from geology, geophysics, geochemistry, and other fields facilitates the resolution of complex issues such as the genesis of iron ore deposits, mineral resource evaluation, and mineral exploration. This collaboration nurtures a deeper understanding of Earth evolution and deposit models.
The continuous expansion of the iron ore deposits knowledge graph and its integration with knowledge graphs from other disciplines will provide robust support for research, exploration, and development in the field of iron ore deposits. It not only accelerates the accumulation and dissemination of knowledge but also offers significant references for decision-making and technological innovation in related fields. As a result, it drives progress and advancement in the geological resources field.
5. Conclusions
The advancement of natural language processing (NLP) technology has enabled the resolution of various geoscience issues through NLP-based algorithms. Geoscience knowledge graphs, which are interdisciplinary in nature, merging computer science and Earth Science, have aroused immense interest among geologists and computer scientists. This study presents a “pipeline”-based approach to automating the construction of geoscience knowledge graphs, thereby reducing the technical complexities associated with their development. By using an iron ore deposits knowledge graph as an exemplar, the article illustrates the comprehensive process of constructing a geoscience knowledge graph based on the proposed methodology. From this study, the following conclusions can be drawn:
(1) Constructing large-scale geoscience knowledge graphs necessitates the integration of vast instantiated graphs, requiring a blend of crowdsourcing and expert decision-making to establish a data-sharing platform.
(2) Given the interdisciplinary nature of Earth Science, ontology construction should be tailored to the unique characteristics of each discipline, offering suitable constraints for research objects in specific domains.
(3) Geoscience knowledge graphs, as a specific type of knowledge graph, possess organizational and storage capabilities, intelligent search functionalities, automated recommendations, knowledge discovery and analysis capabilities, scalability, and maintainability. They can effectively facilitate the integration of multi-source data in Earth Science, knowledge discovery and correlation, Earth system analysis, intelligent search and recommendations, as well as knowledge sharing and collaboration, among other applications.
(4) The quality of knowledge extraction relies on both the corpus quality and the construction of the model framework. In different domains, it is essential to compare multiple models.
(5) The approach proposed in this paper constructs knowledge graphs efficiently and significantly simplifies the development of knowledge graph projects. It can also be combined with knowledge graph-based algorithms to build geoscience Q&A systems efficiently.