TWP
library for machine learning in Python. It provides a selection of efficient tools for
machine learning and statistical modeling including classification, regression,
clustering and dimensionality reduction via a consistent interface in Python.
We will use Python libraries such as Matplotlib, NumPy, and Seaborn to plot and
visualize data.
#Matplotlib:
Matplotlib is a powerful plotting library in Python used for creating static, animated,
and interactive visualizations. Matplotlib’s primary purpose is to provide users with
the tools and functionality to represent data graphically, making it easier to analyze
and understand.
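As a small illustration of representing data graphically, the sketch below draws a single line plot. The data values, the non-interactive "Agg" backend, and the output file name line_plot.png are assumptions made only for this example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Illustrative data (not from the report): y = x squared
x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A simple line plot")
ax.legend()

# Hypothetical output file name
fig.savefig("line_plot.png")
```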
#Seaborn:
Seaborn is a library for making statistical graphics in Python. It builds on top of
matplotlib and integrates closely with pandas data structures. Seaborn helps you
explore and understand your data. Its plotting functions operate on dataframes and
arrays containing whole datasets and internally perform the necessary semantic
mapping and statistical aggregation to produce informative plots.
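The following sketch shows what "operating on dataframes" and "statistical aggregation" can look like in practice: a bar plot is given a whole dataframe and computes the mean of each group internally. The column names, the values, and the output file name are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import pandas as pd
import seaborn as sns

# Illustrative dataframe (not from the report)
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# barplot receives the whole dataframe and aggregates "value" per "group"
# (the mean by default) before drawing one bar per group.
ax = sns.barplot(data=df, x="group", y="value")
ax.set_title("Mean value per group")

# Hypothetical output file name
ax.figure.savefig("group_means.png")
```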
#NumPy:
NumPy is the fundamental package for scientific computing in Python. It is a Python
library that provides a multidimensional array object, various derived objects (such
as masked arrays and matrices), and an assortment of routines for fast operations
on arrays, including mathematical, logical, shape manipulation, sorting, selecting,
I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations,
random simulation and much more.
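A few of the operations listed above can be sketched on a small array. The array values are arbitrary and chosen only so the results are easy to check by hand.

```python
import numpy as np

# A 2 x 2 array with arbitrary illustrative values
a = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Shape manipulation: flatten to one dimension
flat = a.ravel()

# Basic statistics: mean of each column
col_means = a.mean(axis=0)

# Mathematical operation with broadcasting: subtract the column means from every row
centered = a - col_means

# Logical operation and selection: keep the entries greater than 2
mask = a > 2
selected = a[mask]

# Sorting: largest value first
sorted_desc = np.sort(flat)[::-1]

# Basic linear algebra: a matrix times its inverse gives the identity
product = a @ np.linalg.inv(a)

print(col_means, selected, sorted_desc)
```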
In particular, we use Pandas because it can clean messy data sets and make them
readable and relevant. Pandas is a Python library for working with data sets. It
lets us analyze large data sets and draw conclusions based on statistical theory.
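A minimal sketch of cleaning a messy data set with Pandas is shown below. The column names and values are invented for illustration. The steps (trimming whitespace, converting text to numbers, dropping duplicates and missing values) are one possible cleaning sequence, not necessarily the one used in this work.

```python
import pandas as pd

# Illustrative messy data: stray spaces, a duplicated row, a non-numeric age
raw = pd.DataFrame({
    "name": [" Alice", "Bob ", "Bob ", "Carol"],
    "age": ["34", "41", "41", "n/a"],
})

clean = (
    raw.assign(
        # Remove leading and trailing whitespace from names
        name=lambda d: d["name"].str.strip(),
        # Convert ages to numbers; unparseable entries become NaN
        age=lambda d: pd.to_numeric(d["age"], errors="coerce"),
    )
    .drop_duplicates()       # remove the repeated "Bob" row
    .dropna(subset=["age"])  # remove rows whose age could not be parsed
    .reset_index(drop=True)
)

mean_age = clean["age"].mean()
print(clean)
print(mean_age)
```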
https://www.tutorialspoint.com/scikit_learn/scikit_learn_introduction.htm
https://seaborn.pydata.org/tutorial/introduction.html
https://www.geeksforgeeks.org/python-introduction-matplotlib/
https://numpy.org/doc/stable/user/whatisnumpy.html
Several familiar algorithms will be used here: K-means clustering,
PCA (Principal Component Analysis), and Isolation Forest. We give an overview of
each below:
#K-means clustering
K-means clustering is an unsupervised machine learning algorithm that groups an
unlabeled dataset into different clusters. This section covers the fundamentals
and working of K-means clustering along with its implementation. It has some
drawbacks:
The number of clusters must be specified in advance
Convergence speed depends on the initial centroids
Each cluster needs to contain approximately the same number of points
Clusters need to be roughly circular in shape
It has trouble when one cluster lies inside another cluster
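To show where the number of clusters and the initial centroids enter the algorithm, here is a minimal NumPy sketch of the basic K-means iteration with random initial centroids. The function name kmeans, its parameters, and the synthetic two-blob data are assumptions made for illustration. It is a sketch, not an optimized implementation.

```python
import numpy as np

def nearest_centroid(points, centroids):
    """Index of the closest centroid for every point."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k, n_iter=100, seed=0):
    """Basic K-means: k must be given, and the start is chosen at random."""
    rng = np.random.default_rng(seed)
    # Initial centroids: k distinct data points picked at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        labels = nearest_centroid(points, centroids)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids stopped moving
        centroids = new_centroids
    labels = nearest_centroid(points, centroids)
    # Loss: sum of squared distances from each point to its centroid
    loss = float(((points - centroids[labels]) ** 2).sum())
    return labels, centroids, loss

# Synthetic data: two well-separated, roughly circular blobs of equal size
data_rng = np.random.default_rng(42)
points = np.vstack([
    data_rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    data_rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
])
labels, centroids, loss = kmeans(points, k=2)
print(centroids, loss)
```

The synthetic data is deliberately the easy case (equal-sized, circular, well-separated clusters). The drawbacks listed above concern data that departs from it.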
And the accompanying solutions:
K-Means++ method: an efficient way to choose the initial centroids of the clusters
Run the algorithm several times with different initial centroids, then choose the run with the minimum value of the loss function
Elbow method: a heuristic for determining the number of clusters in a dataset
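These three remedies can be sketched together with scikit-learn's KMeans. Setting init="k-means++" selects the K-Means++ initialization, n_init=10 reruns the algorithm from ten different starts and keeps the run with the lowest loss, and the inertia_ attribute gives that loss (the within-cluster sum of squared distances) for the Elbow method. The synthetic three-blob data and the range of k values are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three blobs (illustrative only)
rng = np.random.default_rng(0)
centers = [(0, 0), (6, 0), (3, 5)]
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(60, 2)) for c in centers])

# Fit K-means for several values of k and record the loss of the best run
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    model.fit(X)
    inertias[k] = model.inertia_

# Elbow method: look for the k after which the loss stops dropping sharply
for k, value in inertias.items():
    print(k, round(value, 1))
```

Since the data here was generated from three blobs, the printed loss should fall steeply up to k = 3 and flatten afterwards. On real data the bend is often less clear and reading it remains a judgment call.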
# PCA