Krzysztof Jajuga, Krzysztof Najman, Marek Walesiak
Editors
Data Analysis
and Classification
Methods and Applications
Editors
Krzysztof Jajuga
Department of Financial Investments and Risk Management
Wroclaw University of Economics and Business
Wroclaw, Poland

Krzysztof Najman
Department of Statistics
University of Gdańsk
Sopot, Poland

Marek Walesiak
Department of Econometrics and Computer Science
Wroclaw University of Economics and Business
Jelenia Góra, Poland
Mathematics Subject Classification: 62H25, 62H30, 62H86, 62-09, 68U20, 62P12, 62P20, 62P25
© The Editor(s) (if applicable) and The Author(s), under exclusive licence to Springer Nature
Switzerland AG 2021
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, expressed or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The publisher remains neutral with regard
to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
This volume presents papers from the 29th Conference of the Section of
Classification and Data Analysis of the Polish Statistical Society, held at the
University of Gdańsk on September 7–9, 2020. The papers cover a wide range of
recent methodological developments and applications of classification and data
analysis tools to micro- and macroeconomic problems. In the final selection, we
accepted 19 of the papers presented at the conference. Each submission was
reviewed by two anonymous referees, and the authors subsequently revised their
manuscripts to incorporate the referees’ comments and suggestions. The selection
criteria were based on the contribution of the papers to the theory and
applications of modern classification and data analysis.
The chapters are organized along the major fields and themes in classification
and data analysis: Methodology, Application in Finance, Application in Economics,
Application in Social Issues, and Application with COVID-19 Data.
The part on Methodology contains five papers. The paper by Dudek focuses on a
new algorithm from the spectral clustering family and its application to the
analysis of large data sets. The author compares it with other approaches. The
article by Rozmus focuses on determining the number of clusters and on stability
indicators. Its aim is to compare how well classical indexes and stability
measures indicate the correct number of groups. The paper by Majkowska,
Migdał-Najman, Najman, and Raca attempts to characterize the words commonly used
in messages published by Twitter users. Text mining methods and techniques were
used to carry out the research, which focused mainly on the analysis of
individual words and collocations occurring in the users’ tweets. Bryś, in his
paper, reviews 1446 selected publications and provides insights into the
classification algorithms applied to information security tasks, their
popularity, and the challenges of algorithm selection. The paper by Najman and
Zieliński investigates the usefulness of isolation forests in outlier detection.
The results of simulations and empirical studies on selected data sets are
presented. The assessment takes into account the impact of individual
characteristics of big data sets on the effectiveness of the analyzed methods.
Contents
Methodology
Evaluation of Two-Step Spectral Clustering Algorithm for Large Untypical Data Sets
Andrzej Dudek
Determining the Number of Groups in Cluster Analysis Using Classical Indexes and Stability Measures—Comparison of Results
Dorota Rozmus
Identification of the Words Most Frequently Used by Different Generations of Twitter Users
Agata Majkowska, Kamila Migdał-Najman, Krzysztof Najman, and Katarzyna Raca
Classification Algorithms Applications for Information Security on the Internet: A Review
Michał Bryś
Outlier Detection with the Use of Isolation Forests
Krzysztof Najman and Krystian Zieliński
Application in Finance
Propositions of Transformations of Asymmetrical Nominants into Stimulants on the Example of Chosen Financial Ratios
Barbara Batóg and Katarzyna Wawrzyniak
Gini Regression in the Capital Investment Risk Assessment—Sensitivity Risk Measures in Portfolio Analysis
Grażyna Trzpiot
Application in Economics
Enterprise Dark Data
Katarzyna Raca
The Significance of Medical Science Issues in Research Papers Published in the Field of Economics
Urszula Cieraszewska, Monika Hamerska, Paweł Lula, and Marcela Zembura
Application of Duration Analysis Methods in the Study of the Exit of a Real Estate Sale Offer from the Offer Database System
Ewa Putek-Szeląg and Anna Gdakowicz
Is Society Ready for Long-Term Investments?—Profiles of Electricity Users in Silesia
Sylwia Słupik and Joanna Trzęsiok
The Use of the Spatial Taxonomic Measure of Development to Assess the Tourist Attractiveness of Districts of the Lesser Poland Province
Jacek Wolak
Methodology
Evaluation of Two-Step Spectral
Clustering Algorithm for Large
Untypical Data Sets
Andrzej Dudek
Abstract Researchers analyzing large (>100,000 objects) data sets with cluster
analysis methods often face the problem of the computational complexity of
algorithms, which sometimes makes it impossible to complete the analysis in an
acceptable time. A common solution to this problem is to use less computationally
complex algorithms (such as k-means), which in turn can in many cases give much
worse results than, for example, algorithms using eigenvalue decomposition. In
this article, a new algorithm from the spectral clustering family is proposed and
compared with other approaches.
1 Introduction
Researchers analyzing large (>100,000 objects) data sets with cluster analysis
methods often face a number of problems that make the analysis very hard or even
impossible. The computational complexity of algorithms sometimes makes it
impossible to complete the analysis in an acceptable time. Another limitation is
the memory size of standard PC-class computers, which in many cases may be too
small for the necessary calculations on such data sets. Thus, not all clustering
algorithms can be used for this kind of data.
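The memory limitation can be made concrete with a short calculation. The sketch below is an illustration only: it assumes that a clustering method needs a dense pairwise distance (or affinity) matrix stored as 8-byte floating-point values, and the function name `dense_matrix_gib` is hypothetical.

```python
def dense_matrix_gib(n_objects, bytes_per_value=8, symmetric=False):
    """Approximate memory (in GiB) needed for a pairwise matrix of n_objects.

    Assumption: one 8-byte floating-point value per stored entry.
    With symmetric=True only the lower triangle (without the diagonal) is stored.
    """
    if symmetric:
        entries = n_objects * (n_objects - 1) // 2
    else:
        entries = n_objects ** 2
    return entries * bytes_per_value / 2 ** 30


full = dense_matrix_gib(100_000)
triangle = dense_matrix_gib(100_000, symmetric=True)
print(f"full matrix: {full:.1f} GiB, lower triangle: {triangle:.1f} GiB")
```

Under these assumptions, even the triangular form for 100,000 objects requires tens of gibibytes, which is more than the memory of many standard desktop computers.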
Besides this introduction, the article is divided into four parts. The first part
shows which clustering algorithms can and cannot be used for large data sets in
the popular R statistical environment. The second part proposes a modification of
the spectral clustering procedure. The third part presents the results of
computational simulations in which the proposed algorithm is run on data matrices
of over 100,000 objects with known cluster structure and untypical cluster
shapes. The final part contains remarks and conclusions.
A. Dudek
Wrocław University of Economics and Business, Wrocław, Poland
e-mail: [email protected]
Dudek (2013) examined the following clustering algorithms on a data set of one
million objects drawn from a multivariate normal distribution:
• hierarchical agglomerative methods,
• hierarchical divisive method (diana),
• k-means algorithm,
• partition around medoids (pam, k-medoids algorithm),
• spectral clustering approach (von Luxburg 2006),
• ensemble approach (Dimitriadou et al. 2001).
Only one algorithm (k-means) met the following requirements in the
R environment:
• method execution should not report any out-of-memory error,
• method should not run longer than five hours.
However, in further analysis of untypical cluster shapes, k-means gave results
that did not match the actual structure of the clusters.
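The weakness of k-means on untypical cluster shapes can be illustrated with a small, self-contained sketch. The data set (two concentric rings), the ring radii, the starting centroids, and all function names are assumptions chosen for illustration, not data or code from the cited study. Because k-means with two clusters always separates the objects with a straight line, it cannot isolate the inner ring from the outer one.

```python
import math


def make_rings(n_per_ring=200, radii=(1.0, 5.0)):
    """Return points evenly spaced on concentric circles and their ring labels."""
    points, labels = [], []
    for label, radius in enumerate(radii):
        for i in range(n_per_ring):
            angle = 2 * math.pi * i / n_per_ring
            points.append((radius * math.cos(angle), radius * math.sin(angle)))
            labels.append(label)
    return points, labels


def squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, initial_centroids, max_iter=100):
    """Plain Lloyd's algorithm; returns the cluster index of every point."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return assignment


def two_cluster_accuracy(assignment, labels):
    """Share of correctly grouped objects, up to swapping the two cluster names."""
    agree = sum(a == t for a, t in zip(assignment, labels)) / len(labels)
    return max(agree, 1 - agree)


points, labels = make_rings()
# Start from one point on each ring (an arbitrary, deterministic choice).
assignment = kmeans(points, [points[0], points[200]])
accuracy = two_cluster_accuracy(assignment, labels)
print(f"share of objects assigned to the correct ring: {accuracy:.2f}")
```

In this setting the share of correctly grouped objects stays well below 1, since any straight-line split leaves a large arc of the outer ring on the same side as the inner ring.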
$$a_{ij} = \exp\left(-\sigma \sum_{k=1}^{m} (x_{ik} - x_{jk})^{2}\right) \qquad (1)$$
where: σ is the scaling parameter. Most often it is chosen iteratively, following
the algorithm of Ng et al. (2002), so as to minimize the within-class distances
for a random subset X′ of the rows of X (this method requires running
approximately a few hundred clustering procedures on the objects in X′),
n is the number of rows,
m is the number of columns,