Semester-Long Internship Report on FLOSS - R
submitted by Tanmay Srinath (BMSCE, Bangalore)
For this feature, the completed TBCs available on the platform need to be checked for errors by running their code files. The assigned task therefore involved executing each code file associated with the 12 completed TBCs listed in Table 2.1 on the platform, recording the errors obtained, and forwarding the list of errors to the FOSSEE web team for correction.
Table 2.1: List of completed TBCs checked over the R on Cloud platform.

S. No.  Book Name
1.      Operations Research: An Introduction by Hamdy A. Taha, Pearson, 2014
2.      Probability and Statistics for Engineering and the Sciences by Jay L. Devore, Richard Stratton, Boston, USA, 2012
3.      Probability and Statistics for Engineers by Richard L. Scheaffer, Madhuri S. Mulekar, James T. McClave, Cengage Learning, USA, 2011
4.      Probability and Statistics for Engineers and Scientists by Ronald E. Walpole, Raymond H. Myers, Sharon L. Myers, Keying Ye, Pearson Education, Boston, USA, 2016
The following types of errors were encountered while testing the TBC codes over the R on Cloud platform -
1. Missing libraries.
2. Errors when loading code from a zip file.
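As an illustration, a minimal sketch of the kind of missing-library check used while running the TBC code files is given below; the package names are hypothetical and only show the pattern of the check:

    # Hypothetical sketch: verify that the libraries a TBC code file needs are
    # installed before sourcing it on the platform.
    needed <- c("ggplot2", "MASS")   # hypothetical package names
    missing <- needed[!vapply(needed, requireNamespace, logical(1), quietly = TRUE)]
    if (length(missing) > 0) {
      message("Missing libraries: ", paste(missing, collapse = ", "))
    }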
1. Introduction
The FOSSEE project promotes the use of FLOSS tools in academia and research. It conducts regular
workshops on different FLOSS to help industry professionals, faculty, researchers, and students from
various institutions shift from proprietary to open-source software. These workshops are conducted
throughout the year and generally consist of spoken tutorials, live lectures, assignments, and interactive
activities to engage the participants. To assess a workshop’s effectiveness, participants are required to fill out a feedback form at the end. The assigned task was to analyze this feedback data using a method known as Exploratory Factor Analysis (EFA) [1], which identifies underlying variables, called factors, that explain the interrelationships among the observed variables (questions) in the feedback data. The obtained factors should help determine which aspects of the workshop contributed most to its effectiveness. The analysis began after cleaning and processing the obtained data. The complete procedure, from data collection to analysis, is described in the following sections.
2. Data Collection
Data was collected through feedback forms for the ChemCollective Virtual Lab Beginner 1 day workshop
conducted on 12th December 2020 and the Jmol Application Advanced Workshop conducted on 12th
September 2020. Thirty-five people attended the ChemCollective workshop, of whom thirty-two filled out the feedback form. Only seventeen people attended the Jmol workshop, and all of them filled out the feedback form. The feedback form contained questions about the workshop experience, organized into sub-sections covering the workshop activity, practice problems, spoken tutorials, knowledge gained from the workshop, and general opinions. The responses to these questions were in the form of Likert scale ratings and subjective comments. Different scales were used for recording responses, depending upon the nature of the question.
3. Data Exploration
The feedback datasets were loaded into the R environment using the “read.csv()” function. A glimpse of
the original datasets can be seen in Figure 3.1 and Figure 3.2.
Figure 3.1: Glimpse of the original ChemCollective workshop feedback response dataset.
Figure 3.2: Glimpse of the original Jmol workshop feedback response dataset.
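A minimal sketch of this loading step is shown below; the file names are assumptions, since the original figures only show the loaded data:

    # Load the raw feedback responses (hypothetical file names).
    chem_data <- read.csv("chemcollective_feedback.csv", stringsAsFactors = TRUE)
    jmol_data <- read.csv("jmol_feedback.csv", stringsAsFactors = TRUE)

    # Quick glimpse of each dataset, similar to Figures 3.1 and 3.2.
    head(chem_data)
    head(jmol_data)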
The datasets mentioned above contained more columns than their respective feedback form questions
because some questions contained sub-sections. The Jmol workshop dataset contained 50 columns,
whereas the ChemCollective workshop dataset contained 56 columns. It was also observed that some column names were lengthy, and some consisted of only numbers or a few characters that provided no information about the question associated with that column, as shown in Figure 3.3. Therefore, the columns were renamed and grouped into the following categories, where each category has sub-divisions based on the associated feedback form questions -
Figure 3.4: Code to rename the ChemCollective workshop feedback dataset columns.
The updated column names are shown in Figure 3.6 and Figure 3.7.
Figure 3.6: Updated column names of the ChemCollective workshop feedback dataset.
Figure 3.7: Updated column names of the Jmol workshop feedback dataset.
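A short sketch of the renaming step is given below; the column positions and new names are hypothetical and only indicate the pattern followed in Figure 3.4:

    # Replace the lengthy or uninformative column names with short,
    # descriptive ones (hypothetical names shown for the first three columns).
    colnames(jmol_data)[1:3] <- c("Name", "Profession", "UsedJmolBefore")

    # Verify the updated names, as in Figures 3.6 and 3.7.
    colnames(jmol_data)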
Data Exploration continued after updating the column names using the “skim()” function from the
“skimr” package [2]. It was observed that there were missing values in the columns of both the datasets as
shown in Figures 3.8 and 3.9.
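A sketch of this summary step, assuming the datasets loaded and renamed above:

    # Compact, column-wise summary of each dataset, including missing-value
    # counts for every column.
    library(skimr)

    skim(chem_data)
    skim(jmol_data)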
Missing values may occur due to a variety of reasons. One reason could be the participants’ unwillingness
to respond to certain optional feedback sections like providing subjective comments, which is acceptable,
but they may also occur due to data formatting or reformatting issues. To move ahead with the exploration, it was necessary to handle the missing values in a way that does not lose any vital information. Therefore, a row-wise and column-wise check for missing values was carried out, as shown in Figures 3.10 and 3.11.
Figure 3.10: Row and column-wise examination of missing values for ChemCollective data.
Figure 3.11: Row and column-wise examination of missing values for Jmol data.
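The same examination can be sketched in a few lines of base R, shown here for the Jmol dataset:

    # Number of missing values in each column and in each row.
    colSums(is.na(jmol_data))
    rowSums(is.na(jmol_data))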
For ChemCollective data, only the “Suggestions” column contained missing values; hence it was ignored
because the feedback form question associated with that column was optional. For the Jmol data, the row-wise examination showed that at most one value was missing in any row, and the column-wise examination showed that only the fourth column contained missing values. Hence, the fourth column was removed from the Jmol data.
After dealing with missing values, both datasets were checked for duplicate entries because such entries
can introduce bias in statistical analyses [3,4]. The Approximate String Matching (Fuzzy Matching)
technique was applied over the “Name” column of both the datasets using the “agrep()” function by Brian
Ripley and Kurt Hornik, provided in the “base” package of R [5], to check for similar participant names. Wherever matching names were found, other background details of those participants, such as their institution name, educational background, and profession, were examined.
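A minimal sketch of this fuzzy-matching check is given below, assuming the renamed “Name” column from earlier:

    # Flag participant names that approximately match another name in the dataset.
    names_vec <- as.character(jmol_data$Name)

    for (i in seq_along(names_vec)) {
      # agrep() returns the indices of approximate matches within the allowed
      # edit distance (10% of the pattern length here).
      matches <- agrep(names_vec[i], names_vec[-i], max.distance = 0.1, ignore.case = TRUE)
      if (length(matches) > 0) {
        cat("Possible duplicate:", names_vec[i], "\n")
      }
    }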
Figure 3.12 and 3.13 show the result of checking duplicate entries for the ChemCollective and Jmol
datasets, respectively. In the ChemCollective dataset, only five unique row entries were found, hence it
was deemed unfit for further analysis, whereas in the Jmol dataset, no duplicate entries were found.
The data exploration process shed light on the structure and format of the original feedback datasets and helped in identifying the errors they contained. The data cleaning process described in the subsequent section was therefore carried out only for the Jmol dataset.
4. Data Cleaning
Data cleaning is the most critical step performed before analyzing the data, as any result obtained from incorrect data will be unreliable. It involves removing erroneous and mislabelled data [6,7]. Incorrect responses may appear in the feedback data for several reasons, such as inattentiveness of participants while filling out the feedback form, lack of understanding of the questions asked, etc. Hence, the data needed to be examined carefully and all misleading responses removed to preserve its reliability.
For data cleaning, all possible ambiguities in the data were systematically checked and recorded.
Grouping the responses into categories (performed during data exploration) made the checking process
easier. The complete process can be broadly divided into three steps, each containing multiple substeps, as listed below; a short sketch of one such consistency check follows the list -
1. Checked participants’ backgrounds and opinions regarding JMOL and similar tools -
● Checked for entries where participants had given a negative response when asked if they had used
any software other than Jmol but entered the name of a software in the subsequent section.
● Checked for entries where participants had given a positive response when asked if they had used
any software other than Jmol but did not mention the name of the software.
Figure 3.14: Columns containing entries related to participants’ experience with modeling software.
● Checked for entries where participants had given a positive response when asked if they had used
Jmol before but failed to mention the purpose of use.
● Checked for entries where participants had given a negative response when asked if they had used
Jmol before but mentioned the purpose of use.
Figure 3.15: Columns indicating participants’ experience with the Jmol software.
2. Checked the qualitative and quantitative feedback responses regarding the procedure and
quality of the workshop -
● Checked for contradicting responses in columns associated with the quality and effectiveness of
the workshop; for example, searched and recorded all such entries where a participant had
selected the option “Strongly Agree” for both “Exposure To New Knowledge” and “Did Not
Learn Much” columns.
Figure 3.16: Column entries associated with the workshop’s quality and effectiveness.
● Checked whether any participant’s reported level of knowledge regarding Jmol dropped after the workshop, as such a drop is improbable.
Figure 3.17: Entries associated with the conceptual knowledge regarding various aspects of Jmol.
Figure 3.18: Entries associated with the overall workshop’s feedback questions.
● After performing all possible checks, only a single misleading entry was found and removed. The
final dataset contained 16 rows and 49 columns.
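As referenced above, a minimal sketch of one such consistency check is given below; the column names are the renamed, hypothetical ones and only illustrate the pattern applied to all the checks:

    # Participants who said they had NOT used any modeling software other than
    # Jmol but still named such a software.
    contradiction_software <- which(
      jmol_data$UsedOtherSoftware == "No" & jmol_data$OtherSoftwareName != ""
    )

    # Participants who strongly agreed with both a positive and a negative
    # statement about the workshop.
    contradiction_quality <- which(
      jmol_data$ExposureToNewKnowledge == "Strongly Agree" &
        jmol_data$DidNotLearnMuch == "Strongly Agree"
    )

    # Inspect (and, if confirmed, remove) the misleading entries.
    jmol_data[c(contradiction_software, contradiction_quality), ]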
5. Data Preprocessing
Only the Likert-scale columns containing categorical responses were kept from the cleaned Jmol workshop feedback dataset for EFA, as shown in Figure 3.19 [8,9]. All information related to the participants’ backgrounds and their subjective comments was removed from the dataset. All entries containing the string "Not Attempted" were replaced with NA. The remaining dataset had 16 rows and 34 columns, where each column had the factor data type.
Therefore, those columns were removed, and the “N_FACTORS()” function from the “EFAtools” package [10] was applied over the correlation matrix obtained from the remaining data. The “N_FACTORS()” function tested the suitability of the correlation matrix for EFA by applying “Bartlett's test of sphericity” over it and calculating its “Kaiser-Meyer-Olkin (KMO)” value. Bartlett’s test of sphericity statistically tests the hypothesis that the correlation matrix contains ones on the diagonal and zeros on the off-diagonals; this test should produce a statistically significant chi-square value to justify the application of EFA [11]. The KMO value indicates the proportion of variance in the variables that might be caused by underlying factors [12]. The “N_FACTORS()” function calculates the appropriate number of factors for the given data only when it obtains a favorable result from Bartlett’s test and a suitable KMO value. Unfortunately, the pre-processed data failed both tests because its correlation matrix was singular, as shown in Figure 3.21.
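A minimal sketch of this preprocessing and suitability check is shown below; the column range, the level ordering, and the exact call are assumptions made for illustration and do not reproduce the original figures:

    library(EFAtools)

    # Keep only the Likert-scale columns (hypothetical index range), replace
    # "Not Attempted" with NA, and convert the responses to integer scores.
    likert_data <- jmol_data[, 16:49]
    likert_data[likert_data == "Not Attempted"] <- NA
    # In practice the factor levels must be ordered explicitly
    # (e.g. "Strongly Disagree" < ... < "Strongly Agree") before conversion.
    likert_numeric <- as.data.frame(lapply(likert_data, function(x) as.integer(factor(x))))

    # Correlation matrix from pairwise complete observations.
    cor_mat <- cor(likert_numeric, use = "pairwise.complete.obs")

    # N_FACTORS() runs Bartlett's test of sphericity and computes the KMO value
    # before estimating the number of factors.
    N_FACTORS(cor_mat, N = nrow(likert_numeric))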
7. Conclusion
In this project, data exploration, cleaning, and preprocessing were performed on the ChemCollective Virtual Lab Beginner 1-day workshop and the Jmol Application Advanced Workshop feedback datasets, entirely using the R programming language, with the objective of applying EFA to them. However, the datasets turned out to be unsuitable for the proposed analysis, as the ChemCollective dataset was too small and the Jmol dataset failed the reliability tests required for EFA. This project could be extended further by using an alternative to EFA, keeping in mind the mixed nature of the Jmol feedback data.
Chapter 4
Spoken Tutorial content creation
1. Introduction
The Spoken Tutorial project aims to make video tutorials on Free and Open Source Software (FOSS)
available in several Indian languages. The goal is to enable the use of spoken tutorials to teach in any
Indian language to learners of various levels of expertise - Beginner, Intermediate or Advanced. Every
tutorial has to go through a series of checks to ensure that it is perfect for its audience, which is crucial
for achieving the goal of this project. I was given the opportunity to contribute to the creation of eleven
scripts and corresponding slides associated with the Advanced R spoken tutorial series. The tutorial
topics were related to machine learning and are listed below.
1. Introduction
SOM is an unsupervised data visualization technique popular among researchers for dimensionality
reduction and clustering [30]. This project aims to create an open-source code base for SOM in R to help
researchers, students, and professionals understand the working of SOM. The material has been designed
to encourage and promote the R programming language among people wanting to learn and apply SOM for their own use cases. The complete code, with proper explanation and examples, has been made freely available for educational purposes in the form of a document on the Resources page of the R FOSSEE website.
With the help of the R profiler, accessed through the “profvis()” function of the “profvis” package [39], it was determined that the BMU and SOM functions consumed the most execution time. Finally, the following changes were made -
BMU: The entire function was reduced to just three lines of code. The first line used the “sweep()” function to compute the Euclidean distance between every neuron in the grid and the given data point. The second line used the “which.min()” function to find the winning neuron in the grid. The final line returned the index of the winning neuron.
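A sketch of this vectorized BMU search is given below; the function and argument names are illustrative, not the exact ones used in the project:

    # Find the best matching unit (BMU) for a single data point.
    # `grid` is a matrix with one neuron weight vector per row; `x` is a
    # numeric vector of the same length as a weight vector.
    find_bmu <- function(grid, x) {
      dists <- rowSums(sweep(grid, 2, x, "-")^2)  # squared Euclidean distances
      which.min(dists)                            # index of the winning neuron
    }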
SOM: A similar approach was applied to the SOM function. The lateral distances from the winning neuron were computed using the “sweep()” function. The weights were updated by first finding the required indices using the “which()” function and then applying a matrix operation instead of a for loop.
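A corresponding sketch of the vectorized weight update is given below, under a common Gaussian-neighbourhood assumption; the variable names and neighbourhood function are illustrative:

    # Update the weights of all neurons inside the current neighbourhood radius
    # in one matrix operation instead of a for loop.
    # `grid`: neuron weights (one row per neuron); `lateral_dist`: grid distances
    # from the winning neuron; `radius`: neighbourhood radius; `lr`: learning rate;
    # `x`: the current data point.
    update_weights <- function(grid, lateral_dist, radius, lr, x) {
      idx <- which(lateral_dist <= radius)                  # neurons to update
      influence <- exp(-lateral_dist[idx]^2 / (2 * radius^2))
      delta <- -sweep(grid[idx, , drop = FALSE], 2, x, "-") # x - w for each neuron
      grid[idx, ] <- grid[idx, , drop = FALSE] + lr * influence * delta
      grid
    }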
2. Data Collection
The COVID-19 Economic Impact Assessment data was collected for this case study from an online
repository known as the ADB Data Library. The ADB Data Library is a platform that hosts publicly
available data from the Asian Development Bank. The data obtained contains a measure of the potential economy- and sector-specific impact of the COVID-19 outbreak [40].
3. Data Exploration
The original dataset had dimensions 1566 x 10 and contained the following columns -
5. Data Analysis
The data analysis step involved the application of linear regression and an artificial neural network to the data, as sketched below -
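A minimal sketch of the two modelling approaches is given below; the synthetic data, column names, and the choice of the “nnet” package are assumptions made for illustration, not details taken from the case study:

    library(nnet)

    # Synthetic stand-in for the cleaned case-study data (hypothetical columns):
    # sector-level impact shares predicting the GDP loss of a country.
    set.seed(1)
    n <- 200
    econ <- data.frame(tourism = runif(n), trade = runif(n), consumption = runif(n))
    econ$gdp_loss <- 2 * econ$tourism + 1.5 * econ$trade + rnorm(n, sd = 0.1)

    train <- econ[1:150, ]
    test  <- econ[151:200, ]

    # Linear regression (white-box model).
    lm_fit  <- lm(gdp_loss ~ ., data = train)
    lm_pred <- predict(lm_fit, newdata = test)

    # Single-hidden-layer neural network for regression (black-box model).
    nn_fit  <- nnet(gdp_loss ~ ., data = train, size = 5, linout = TRUE, trace = FALSE)
    nn_pred <- predict(nn_fit, newdata = test)

    # Compare the models using root mean squared error.
    rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
    c(linear = rmse(test$gdp_loss, lm_pred), nnet = rmse(test$gdp_loss, nn_pred))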
6. Results
6.1 Linear Regression
Figure 6.4: Accuracy measurement results of the final linear regression model.
Figure 6.6: Squared residuals plot associated with the neural network model.
7. Conclusion
The case study attempted to explore the impact of COVID-19 on the GDP of various countries. The
obtained statistical models predicted the GDP loss with only a small error. The final linear regression model seems like a good choice for predicting GDP loss, as it predicted the actual values closely and has the advantage of being a white-box approach. A neural network may provide better results, but it is a black-box approach.
Chapter 7
Conclusion
The FOSSEE Semester-long Internship provided me with an opportunity to learn a lot, both from my fellow interns and my mentors, while working on a variety of projects. The R on Cloud project allows its users to
access all TBC codes over a cloud platform. The task of maintaining it by finding and fixing bugs and
errors helped me contribute towards promoting education via a digital medium. Writing scripts related to
various machine learning topics for the R Spoken Tutorial series allowed me to strengthen my
fundamentals of machine learning and contribute to the noble cause of providing free and high-quality
education for all. Analysis of the FOSSEE workshop feedback data helped me to deepen my
understanding of the workflow in a typical data science project. It helped me understand why cleaning
and preprocessing of data is extremely important to obtain reliable results from an analysis. The SOM
project was the most challenging endeavor that I have undertaken to date. It not only helped improve my
programming skills in R, but it also taught me how I should break down a complex problem into simpler
achievable steps. Coding an intricate machine learning model from scratch proved to be enlightening and
helped me expand my frontiers. The Case Study project that I undertook helped me grasp the nuances of
research work and taught me the importance of machine learning models for both researchers and their
users.
Overall, the FOSSEE Semester-long Internship was much more than what the title seems to suggest. It
taught me how to approach a problem, how I should collaborate with my teammates, how I should apply
my critical thinking to discern and solve problems, and how I should constantly elevate my current skill
level, among various other things. This was my first professional work experience, and it opened up a
whole new world for me. Most importantly, my internship at FOSSEE allowed me to transcend to an
enhanced level of skill while constructively contributing to society at large, paving the way for my future
successes. I hope that my work helps promote the R programming language.
References
[1] A Practical Introduction to Factor Analysis: Confirmatory Factor Analysis. UCLA: Statistical
Consulting Group.
https://stats.idre.ucla.edu/spss/seminars/introduction-to-factor-analysis/a-practical-introduction-to-fa
ctor-analysis/
[2] Elin Waring, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu and Shannon
Ellis (2021). skimr: Compact and Flexible Summaries of Data. R package version 2.1.3.
https://CRAN.R-project.org/package=skimr
[3] Broeck, J., Argeseanu Cunningham, S., Eeckels, R., and Herbst, K. (2005). Data cleaning: detecting,
diagnosing, and editing data abnormalities. PLoS medicine, 2(10), p.e267.
[4] Chu, X., Ilyas, I., Krishnan, S., and Wang, J. (2016). Data cleaning: Overview and emerging
challenges. In Proceedings of the 2016 international conference on management of data (pp.
2201–2206).
[5] R Core Team (2021). R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria. https://www.R-project.org/
[6] Margaret Beaver (2012). Survey Data Cleaning Guidelines: (SPSS and Stata) 1st Edition.
https://www.canr.msu.edu/resources/survey-data-cleaning-guidelines-spss-and-stata-1st-edition
[7] Krishnan, S., Haas, D., Franklin, M., and Wu, E. 2016. Towards reliable interactive data cleaning: A
user survey and recommendations. In Proceedings of the Workshop on Human-In-the-Loop Data
Analytics (pp. 1–5).
[8] Hooper, D. (2012), ‘Exploratory Factor Analysis’, in Chen, H. (Ed.), Approaches to Quantitative
Research – Theory and its Practical Application: A Guide to Dissertation Students, Cork, Ireland:
Oak Tree Press.
[9] Tarka, P. (2015). Likert Scale and Change in Range of Response Categories vs. the Factors
Extraction in EFA Model. Acta Universitatis Lodziensis. Folia Oeconomica, 311.
[10] Steiner, M.D., & Grieder, S.G. (2020). EFAtools: An R package with fast and flexible
implementations of exploratory factor analysis tools. Journal of Open Source Software, 5(53), 2521.
https://doi.org/10.21105/joss.02521
[11] Watkins, M. (2018). Exploratory factor analysis: A guide to best practice. Journal of Black
Psychology, 44(3), p.219–246.
[12] KMO and Bartlett's test, SPSS Statistics Subscription - New, SPSS Statistics, IBM Corporation.
https://www.ibm.com/docs/en/spss-statistics/version-missing?topic=detection-kmo-bartletts-test
[13] Murphy, K.P., 2012. Machine learning: a probabilistic perspective. MIT press.
[14] David Meyer, Evgenia Dimitriadou, Kurt Hornik, Andreas Weingessel and Friedrich Leisch (2021).
e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly:
E1071), TU Wien. R package version 1.7-9. https://CRAN.R-project.org/package=e1071
[15] Max Kuhn (2021). caret: Classification and Regression Training. R package version 6.0-90.
https://CRAN.R-project.org/package=caret
[16] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
[17] Scrucca, L., Fop, M., Murphy, T. B. and Raftery, A. E. (2016). mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8/1, pp. 289-317.
[18] Kanti Mardia, J. Kent, J. Bibby. Multivariate Analysis, 1st Edition. December 14, 1979.
[19] Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer,
New York. ISBN 0-387-95457-0
[20] Gareth, J., Daniela, W., Trevor, H. and Robert, T., 2013. An introduction to statistical learning: with applications in R. Springer.
[21] Ng, Andrew (2000). CS229 Lecture Notes, pp. 16-19.
[22] Thomas W. Yee (2015). Vector Generalized Linear and Additive Models: With an Implementation in
R. New York, USA: Springer.
[23] Rokach, L., & Maimon, O. (2014). Data Mining With Decision Trees: Theory and Applications.
World Scientific Publishing Co., Inc.
[24] Terry Therneau and Beth Atkinson (2019). rpart: Recursive Partitioning and Regression Trees. R
package version 4.1-15. https://CRAN.R-project.org/package=rpart
[25] Stephen Milborrow (2021). rpart.plot: Plot 'rpart' Models: An Enhanced Version of 'plot.rpart'. R
package version 3.1.0. https://CRAN.R-project.org/package=rpart.plot
[26] Genuer, R. and Poggi, J.-M. (2020) Random forests with R. Cham: Springer International Publishing.
[27] A. Liaw and M. Wiener (2002). Classification and Regression by randomForest. R News 2(3), 18-22.
[28] Arthur, D. and S. Vassilvitskii (2007). “k-means++: The advantages of careful seeding.” In H.
Gabow (Ed.), Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms
[SODA07], Philadelphia, pp. 1027-1035. Society for Industrial and Applied Mathematics.
[29] Georg M. Goerg (2013). LICORS: Light Cone Reconstruction of States - Predictive State Estimation
From Spatio-Temporal Data. R package version 0.2.0.
https://CRAN.R-project.org/package=LICORS
[30] Kevin Pang. Self-organizing Maps. https://www.cs.hmc.edu/~kpang/nn/som.html
[31] Kohonen, Teuvo. "The self-organizing map." Proceedings of the IEEE 78.9 (1990): 1464-1480.
[32] Self-Organizing Map Formation: Foundations of Neural Computation. A Bradford Book, MIT Press.
[33] Kohonen, Teuvo. "Essentials of the self-organizing map." Neural networks 37 (2013): 52-65.
[34] Kohonen, Teuvo, and Timo Honkela. "Kohonen network." Scholarpedia 2.1 (2007): 1568.
[35] Sven Krüger. Self-Organizing Maps.
https://www.iikt.ovgu.de/iesk_media/Downloads/ks/computational_neuroscience/vorlesung/comp_n
euro8-p-2090.pdf
[36] John A. Bullinaria. (2004). Self Organizing Maps: Fundamentals.
https://www.cs.bham.ac.uk/~jxb/NN/l16.pdf
[37] Jae-Wook Ahn and Sue Yeon Syn. (2005). Self-Organizing Maps.
https://sites.pitt.edu/~is2470pb/Spring05/FinalProjects/Group1a/tutorial/som.html
[38] UCLA Graduate School Admissions Data. https://stats.idre.ucla.edu/stat/data/binary.csv
[39] Winston Chang, Javier Luraschi and Timothy Mastny (2020). profvis: Interactive Visualizations for
Profiling R Code. R package version 0.3.7. https://CRAN.R-project.org/package=profvis
[40] Covid-19 economic impact assessment template.
https://data.adb.org/dataset/covid-19-economic-impact-assessment-template
[41] D. A. Freedman, Statistical Models: Theory and Practice. Cambridge University Press, 2009.
[42] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning with
Applications in R (Second Edition), 2021.
[43] Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer,
New York. ISBN 0-387-95457-0