Data Analysis and Classification Methods and Applications Entire PDF Ebook

This document is a preface and overview of a volume presenting papers from the 29th Conference of the Section of Classification and Data Analysis of the Polish Statistical Society, held in September 2020. It includes 19 reviewed papers covering various methodological aspects and applications of classification and data analysis in finance, economics, social issues, and COVID-19 data. The volume aims to address modern data science challenges and encourage further research in the field.

Data Analysis and Classification Methods and Applications

More information about this series at http://www.springer.com/series/1564
Krzysztof Jajuga • Krzysztof Najman • Marek Walesiak
Editors

Data Analysis
and Classification
Methods and Applications

Editors
Krzysztof Jajuga
Department of Financial Investments and Risk Management
Wroclaw University of Economics and Business
Wroclaw, Poland

Krzysztof Najman
Department of Statistics
University of Gdańsk
Sopot, Poland

Marek Walesiak
Department of Econometrics and Computer Science
Wroclaw University of Economics and Business
Jelenia Góra, Poland

ISSN 1431-8814 ISSN 2198-3321 (electronic)
Studies in Classification, Data Analysis, and Knowledge Organization
ISBN 978-3-030-75189-0 ISBN 978-3-030-75190-6 (eBook)
https://doi.org/10.1007/978-3-030-75190-6

Mathematics Subject Classification: 62H25, 62H30, 62H86, 62-09, 68U20, 62P12, 62P20, 62P25

© The Editor(s) (if applicable) and The Author(s), under exclusive licence to Springer Nature
Switzerland AG 2021
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, expressed or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The publisher remains neutral with regard
to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

This volume presents papers from the 29th Conference of the Section of
Classification and Data Analysis of the Polish Statistical Society, held at the University
of Gdansk on September 7–9, 2020. The papers address a wide range of recent
methodological aspects and applications of classification and data analysis tools in
micro- and macroeconomic problems. In the final selection, we accepted 19 of the
papers presented at the conference. Each submission was reviewed by two anonymous
referees, and the authors subsequently revised their manuscripts to incorporate the
referees' comments and suggestions. Papers were selected for their contribution to
the theory and applications of modern classification and data analysis.
The chapters are organized along the major fields and themes in
classification and data analysis: Methodology, Application in Finance, Application
in Economics, Application in Social Issues, and Application with COVID-19 Data.
The part on Methodology contains five papers. The paper by Dudek focuses on
a new algorithm from the spectral clustering family and its application to the analysis
of large data sets. The author compares it with other approaches.
Rozmus's article focuses on determining the number of clusters and on stability
indicators. Its aim is to compare how well classical indexes and stability measures
indicate the correct number of groups. The paper by
Majkowska, Migdał-Najman, Najman, and Raca attempts to characterize words
commonly used in the messages published by Twitter users. Text mining methods
and techniques were used to carry out the research, which focused mainly on
the analysis of individual words and collocations occurring in the users’ tweets.
Bryś, in his paper, reviews 1446 selected publications and provides insights
on classification algorithms applied to information security tasks, their popularity,
and the challenges of algorithm selection. The paper by Najman and Zieliński
investigates the usefulness of isolation forests in outlier detection. The
results of simulations and empirical studies on selected data sets are presented. The
assessment takes into account the impact of individual characteristics of big data
sets on the effectiveness of the analyzed methods.

The part on Application in Finance contains two papers. Batóg and
Wawrzyniak’s study is based on selected financial ratios that the literature
treats as nominants with a recommended range of values, under the assumption
that an object's situation is better when the value of the indicator-nominant lies
above the upper limit of the recommended range (right-handed asymmetrical
nominant) or below the lower limit of this range (left-handed asymmetrical
nominant). Trzpiot, in her article, considers whether standard risk estimation
procedures are in line with investors’ expectations. The article presents the
assumptions of Gini regression, the selected estimation method, and its application
to systematic risk assessment. The application part models assets listed on the
Warsaw Stock Exchange.
The part on Application in Economics contains five papers. The paper by Raca
presents an overview of the definitions of the term dark data, a proposal for its
interpretation, and a classification of data in a company with regard to usability,
availability, and quality. As part of the research, four universal features of dark data
sets are identified (unavailability, unawareness, uselessness, and costliness).
Cieraszewska, Hamerska, Lula, and Zembura present the results of research
analyzing abstracts of scientific articles in the field of economics,
prepared in English by authors from 36 European countries and registered in the
Scopus database in the years 2011–2020. An ontology-based approach is used to
identify concepts related to medical science and economics. The paper also
presents the results of research on the relationship between the interdisciplinary
nature of research in the field of economics and the number and degree of
internationalization of authors’ teams. The aim of Putek-Szeląg and Gdakowicz's
article is to present selected methods of duration analysis for assessing the
probability of exit from the real estate sale offer system, taking into account
various types of competing risk (the year of submitting the property for sale). In
the survey, the calculation of the offer duration takes into account both properties
that have been sold and offers that were still current on the day the survey ended.
Słupik and Trzęsiok’s work aims to identify and characterize electricity users in
terms of their attitudes toward energy saving. The authors base their analysis on
the results of proprietary research conducted among households in the Silesian
Province in Poland in 2018 and on a review of the literature on profiling individual
energy consumers. They also characterize the obtained segments and
identify fundamental factors influencing the respondents’ behavior toward saving
energy.
Wolak in the paper presents a study of selected linear ordering algorithms used to
build a ranking of districts in the Lesser Poland Province in terms of tourist
attractiveness, using techniques that consider potential spatial relationships.
The part on Application in Social Issues contains four papers. Bieszk-Stolorz, in
her paper, assesses the impact of the gender of unemployed people on the duration of
registered unemployment and on the duration of staying out of the office’s register,
taking into account different reasons for de-registration. Because of censored
observations, i.e., observations not completed with an event in the analyzed period,
the author uses selected methods of survival analysis. The purpose of Grzenda’s
paper is to show that the Cox regression model can be used to determine
direct adjusted probabilities of finding a job by the unemployed, depending on their
individual characteristics, in the context of long-term unemployment risk. The study
is based on LFS data from 2017 and 2018 for Poland. Przybysz, Stanimir, and
Wasiak propose using the methods of multidimensional comparative analysis to
assess the level of implementation of the Europe 2020 strategy, indicating areas
important for the quality of life of seniors and identifying changes in the assessment
of the implementation of this strategy by this generation. The study shows
very large diversity among seniors in terms of their quality of life and their
assessment of the strategy. Kos-Łabędowicz and Trzęsiok present two
classifications of the elderly in Poland in terms of their preferences regarding means of
transport: one prepared on the basis of literature research and expert knowledge, the
other with the use of a selected taxonomic method. The aim of the article is to test
the agreement between the obtained classifications and thus to verify the validity
of the proposed expert segmentation, which reflects Polish society specifically.
The part on Application with COVID-19 Data contains three papers. Nojszewska
and Sielska analyze the similarities of European countries during the COVID-19
pandemic in terms of the economic sentiment indicator (ESI) and the
employment expectations indicator (EEI) from the beginning of 2020. The research
shows that after the collapse in March/April 2020, the values of variables reflecting
the condition of economies started to increase in most of the identified groups of
countries. Salamaga studies the influence of the corona crisis on
global foreign investment in the near future, especially in the investment market
of the Visegrad Group countries. The main purpose of Landmesser’s paper is to
analyze the patterns of COVID-19 evolution in a group of 27 EU countries. First,
the author applies the concept of dynamic time warping (DTW) to identify groups of EU
countries affected to varying degrees by the COVID-19 pandemic. Then, within
the selected groups, the structure of the time series for infected and deceased
COVID-19 patients is analyzed using ARIMA models.
We wish to thank all the authors for making their studies available for our
volume. Their scholarly efforts and research inquiries made this volume possible.
We are also indebted to the anonymous referees for providing insightful reviews
with many useful comments and suggestions.
In spite of our intention to address a wide range of problems pertaining to
classification and data analysis theory, there are issues that still need to be
researched. We hope that the studies included in our volume will encourage further
research and analyses in modern data science.

Wroclaw, Poland Krzysztof Jajuga
Sopot, Poland Krzysztof Najman
Jelenia Góra, Poland Marek Walesiak
January 2021
Contents

Methodology
Evaluation of Two-Step Spectral Clustering Algorithm for Large Untypical Data Sets
Andrzej Dudek
Determining the Number of Groups in Cluster Analysis Using Classical Indexes and Stability Measures—Comparison of Results
Dorota Rozmus
Identification of the Words Most Frequently Used by Different Generations of Twitter Users
Agata Majkowska, Kamila Migdał-Najman, Krzysztof Najman, and Katarzyna Raca
Classification Algorithms Applications for Information Security on the Internet: A Review
Michał Bryś
Outlier Detection with the Use of Isolation Forests
Krzysztof Najman and Krystian Zieliński

Application in Finance
Propositions of Transformations of Asymmetrical Nominants into Stimulants on the Example of Chosen Financial Ratios
Barbara Batóg and Katarzyna Wawrzyniak
Gini Regression in the Capital Investment Risk Assessment—Sensitivity Risk Measures in Portfolio Analysis
Grażyna Trzpiot

Application in Economics
Enterprise Dark Data
Katarzyna Raca
The Significance of Medical Science Issues in Research Papers Published in the Field of Economics
Urszula Cieraszewska, Monika Hamerska, Paweł Lula, and Marcela Zembura
Application of Duration Analysis Methods in the Study of the Exit of a Real Estate Sale Offer from the Offer Database System
Ewa Putek-Szeląg and Anna Gdakowicz
Is Society Ready for Long-Term Investments?—Profiles of Electricity Users in Silesia
Sylwia Słupik and Joanna Trzęsiok
The Use of the Spatial Taxonomic Measure of Development to Assess the Tourist Attractiveness of Districts of the Lesser Poland Province
Jacek Wolak

Application in Social Issues
Models of Competing Events in Assessing the Effects of the Transition of Unemployed People Between the States of Registration and De-Registration
Beata Bieszk-Stolorz
Direct Adjusted Survival Probabilities in the Analysis of Finding a Job by the Unemployed Depending on Their Individual Characteristics
Wioletta Grzenda
Europe 2020 Strategy—Objective Evaluation of Realization and Subjective Assessment by Seniors as Beneficiaries of Social Assumptions
Klaudia Przybysz, Agnieszka Stanimir, and Marta Wasiak
Do Seniors Get to the Disco by Bike or in a Taxi?—Classification of Seniors According to Their Preferred Means of Transport
Joanna Kos-Łabędowicz and Joanna Trzęsiok

Application with COVID-19 Data
The Impact of the COVID-19 Pandemic on the Economies of European Countries in the Period January–September 2020 Based on Economic Indicators
Ewelina Nojszewska and Agata Sielska
Modelling the Risk of Foreign Divestment in the Visegrad Group Countries During the COVID-19 Pandemic
Marcin Salamaga
Analysis of COVID-19 Dynamics in EU Countries Using the Dynamic Time Warping Method and ARIMA Models
Joanna Landmesser
About the Editors

Krzysztof Jajuga is a professor of finance at Wroclaw University of Economics
and Business, Poland. He holds master's, doctoral, and habilitation degrees from
Wroclaw University of Economics and Business, Poland, the title of professor given by
the President of Poland, an honorary doctorate from Cracow University of Economics,
and an honorary professorship from Warsaw University of Technology. He carries out
research on financial markets, risk management, household finance, and
multivariate statistics.

Krzysztof Najman is an associate professor at the University of Gdansk and Deputy
Dean for Student Affairs and Education at the Faculty of Management. He obtained
his doctoral and habilitation degrees from the University of Gdansk in Poland. He is
a member of the Main Council of the Polish Statistical Association and of the Section of
Classification and Data Analysis SKAD. His scientific interests cover
cluster analysis and classification methods, artificial intelligence models,
self-learning neural networks, multivariate statistical analysis, and data mining.

Marek Walesiak is a professor of economics at Wroclaw University of Economics
and Business in the Department of Econometrics and Computer Science. He holds
master's, doctoral, and habilitation degrees from Wroclaw University of Economics
and Business, Poland, and the title of professor given by the President of Poland. He is a
member of the Methodological Commission and the Scientific Statistical Council in
Statistics Poland (GUS) and an active member of many scientific professional
bodies (e.g., the Section of Classification and Data Analysis SKAD). His main areas of
interest include classification and data analysis, composite indicators, multivariate
statistical analysis, marketing research, and computational techniques in R.
Methodology
Evaluation of Two-Step Spectral
Clustering Algorithm for Large
Untypical Data Sets

Andrzej Dudek

Abstract Researchers analyzing large (>100,000 objects) data sets with the
methods of cluster analysis often face the problem of the computational complexity of
algorithms, which sometimes makes it impossible to complete the analysis in an
acceptable time. A common solution to this problem is to use less computationally
complex algorithms (like k-means), which in turn can in many cases give much worse
results than, for example, algorithms using eigenvalue decomposition. In the article, a
new algorithm from the spectral clustering family is proposed and compared with other
approaches.

Keywords Clustering · Classification · Large data sets · Spectral clustering

1 Introduction

Researchers analyzing large (>100,000 objects) data sets with the methods of
cluster analysis often face a number of problems that make analysis very hard or
even impossible. The computational complexity of algorithms sometimes makes it
impossible to complete the analysis in an acceptable time. Another limitation is the
memory size of standard PC-like computers, which in many cases may be too small
for the necessary calculations on such data sets. Thus, not all clustering algorithms
can be used for that kind of data.
The article consists of an introduction and four parts. The first part shows which
clustering algorithms can or cannot be used for large data sets in the popular statistical
R framework. The second part proposes a modification of the spectral clustering
procedure. The third part presents computational simulation results for the proposed
algorithm on data matrices of over 100,000 objects with known cluster structure and
untypical cluster shapes. The final part contains remarks and conclusions.

A. Dudek (✉)
Wrocław University of Economics and Business, Wrocław, Poland
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
K. Jajuga et al. (eds.), Data Analysis and Classification, Studies in Classification,
Data Analysis, and Knowledge Organization,
https://doi.org/10.1007/978-3-030-75190-6_1

2 Limitations of Large Data Sets Classification

Dudek (2013) examined the following clustering algorithms on a one-million-object
data set drawn from a multivariate normal distribution:
• hierarchical agglomerative methods,
• hierarchical divisive method (diana),
• k-means algorithm,
• partition around medoids (pam, k-medoids algorithm),
• spectral clustering approach (von Luxburg 2006),
• ensemble approach (Dimitriadou et al. 2001).
Only one algorithm (k-means) met the following requirements in the
R environment:
• method execution should not report any lack of memory error,
• method should not run longer than five hours.
In further analysis for untypical cluster shapes, however, k-means gave results
that did not match the actual structure of the clusters.
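The memory requirement can be illustrated with simple arithmetic. The sketch below rests on an assumption that the text does not state explicitly: that a method stores a full n × n matrix of pairwise distances or similarities in double precision (8 bytes per value). The function name `dense_matrix_gib` is hypothetical and serves only this illustration.

```python
def dense_matrix_gib(n_objects: int, bytes_per_value: int = 8) -> float:
    """Memory in GiB needed for a dense n x n matrix of floating-point values."""
    return n_objects * n_objects * bytes_per_value / 2**30


# Assumed sizes for illustration: from a modest data set up to one million objects.
for n in (10_000, 100_000, 1_000_000):
    print(f"n = {n:>9,}: {dense_matrix_gib(n):,.1f} GiB")
```

Under this assumption, 100,000 objects already need roughly 75 GiB for a single dense matrix, which is consistent with the remark that standard PC-like computers may run out of memory. Methods such as k-means avoid this because they never form the full pairwise matrix.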

3 Proposal of New Algorithm

The spectral decomposition algorithm according to von Luxburg (2006) and Ng et al.
(2002) can be stated in its general form in the following way:
Let X denote the data matrix with n rows and m columns, and let u denote the number
of clusters into which X is to be divided (given by the researcher before the
decomposition starts). Sample input data are presented in Fig. 1. The subsequent
figures show the same data in transformed space.
Let A be the similarity matrix of objects from X. A can be calculated in many ways,
but most often its elements a_{ij} are defined according to Eq. 1:
\[
a_{ij} = \exp\left(-\sigma \sum_{k=1}^{m} (x_{ik} - x_{jk})^{2}\right) \qquad (1)
\]
where:
σ is a scaling parameter. Most often it is calculated according to the Ng et al.
(2002) algorithm, which chooses σ iteratively by minimizing the within-class distances
for a random subset X′ of X (randomly selected rows); this method requires running
roughly a few hundred clustering procedures on the objects in X′,
n is the number of rows,
m is the number of columns,
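As a minimal sketch of the similarity computation, the code below assumes that Eq. 1 is the Gaussian-type similarity a_ij = exp(−σ · Σ_k (x_ik − x_jk)²), with the scaling parameter supplied by the caller rather than tuned iteratively. The function name `similarity_matrix`, the toy data, and the value of `sigma` are all hypothetical and chosen only for illustration; this is not the author's implementation.

```python
import math


def similarity_matrix(X, sigma):
    """Return A with a_ij = exp(-sigma * sum_k (x_ik - x_jk)^2) (assumed form of Eq. 1)."""
    n = len(X)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Squared Euclidean distance between rows i and j of X.
            sq_dist = sum((xi - xj) ** 2 for xi, xj in zip(X[i], X[j]))
            A[i][j] = math.exp(-sigma * sq_dist)
    return A


# Toy data (illustrative values only): two nearby points and one distant point.
X = [[0.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
A = similarity_matrix(X, sigma=0.5)

for row in A:
    print(["%.4f" % value for value in row])
```

Nearby objects receive similarities close to 1 and distant objects similarities close to 0, and a larger scaling parameter makes the similarity decay faster with distance. The double loop also makes the quadratic cost in the number of rows visible, which is the bottleneck for large data sets. Some spectral clustering formulations set the diagonal of A to zero; the sketch leaves it as the formula gives it.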
