Data Science Study Materials
Unit-1
Data Science has become one of the most in-demand fields of the 21st century. Every organization is looking for candidates with knowledge of data science. This tutorial gives an introduction to data science, covering data science job roles, tools for data science, components of data science, applications, and more.
Data Science is a multidisciplinary field that involves the use of statistical and computational
methods to extract insights and knowledge from data. To analyze and comprehend large
data sets, it uses techniques from computer science, mathematics, and statistics.
Data mining, machine learning, and data visualization are just a few of the tools and methods data scientists frequently employ to draw meaning from data. They may deal with both structured and unstructured data, including text, images, databases, and spreadsheets.
A number of sectors, including healthcare, finance, marketing, and more, use the insights
and experience gained via data analysis to steer innovation, advise business decisions, and
address challenging problems.
Typical data science activities include:
o Collecting data from a range of sources, including databases, sensors, websites, etc.
o Making sure data is in a format that can be analyzed while also organizing and
processing it to remove mistakes and inconsistencies.
o Finding patterns and correlations in the data using statistical and machine learning
approaches.
o Creating mathematical models and computer programs that can classify and forecast
based on data.
Example:
Let's suppose we want to travel from station A to station B by car. We need to make some decisions, such as which route will be the best route to reach the location faster, on which route there will be no traffic jam, and which will be cost-effective. All these decision factors act as input data, and we get an appropriate answer from these decisions, so this analysis of data is called data analysis, which is a part of data science.
Some years ago, there was less data, and it was mostly available in a structured form, which could be easily stored in Excel sheets and processed using BI tools.
But in today's world, data is becoming so vast that approximately 2.5 quintillion bytes of data are generated every day, which has led to a data explosion. Researchers estimated that by 2020, 1.7 MB of data would be created every single second by every person on earth. Every company requires data to work, grow, and improve its business.
Now, handling such a huge amount of data is a challenging task for every organization. To handle, process, and analyze it, we require complex, powerful, and efficient algorithms and technology, and that technology came into existence as data science.
Following are some main reasons for using data science technology:
o Every day, the world produces enormous volumes of data, which must be processed
and analysed by data scientists in order to provide new information and
understanding.
o Data science is now crucial for creating and educating intelligent systems as artificial
intelligence and machine learning have grown in popularity.
As per various surveys, data scientist is becoming the most in-demand job of the 21st century due to the increasing need for data science. Some people even call it "the hottest job title of the 21st century". Data scientists are the experts who can use various statistical tools and machine learning algorithms to understand and analyze data.
The average salary range for a data scientist is approximately $95,000 to $165,000 per annum, and as per different studies, about 11.5 million jobs will be created by the year 2026.
If you learn data science, you get the opportunity to take up various exciting job roles in this domain. The main job roles are given below:
1. Data Scientist
2. Data Analyst
3. Machine Learning Engineer
4. Data Engineer
5. Data Architect
6. Data Administrator
7. Business Analyst
1. Data Scientist: A data scientist is in charge of deciphering large, complicated data sets for
patterns and trends, as well as creating prediction models that may be applied to business
choices. They could also be in charge of creating data-driven solutions for certain business
issues.
Skill Required: To become a data scientist, one needs skills in mathematics, statistics, programming languages (such as Python, R, and Julia), Machine Learning, Data Visualization, Big Data Technologies (such as Hadoop), domain expertise (so that the person is capable of understanding data related to the domain), and communication and presentation skills to efficiently convey the insights from the data.
2. Machine Learning Engineer:
Skill Required: Programming languages like Python and Java, statistics, machine learning
frameworks like TensorFlow and PyTorch, big data technologies like Hadoop and Spark,
software engineering, and problem-solving skills are all necessary for a machine learning
engineer.
3. Data Analyst: Data analysts are in charge of gathering and examining data in order to spot
patterns and trends and offer insights that may be applied to guide business choices.
Creating data visualizations and reports to present results to stakeholders may also fall
within the scope of their responsibility.
Skill Required: Data analysis and visualization, statistical analysis, database querying,
programming in languages like SQL or Python, critical thinking, and familiarity with tools and
technologies like Excel, Tableau, SQL Server, and Jupyter Notebook are all necessary for a
data analyst.
4. Business Intelligence Analyst: Data analysis for business development and improvement
is the responsibility of a business intelligence analyst. They could also be in charge of
developing and putting into use data warehouses and other types of data management
systems.
Skill Required: A business intelligence analyst has to be skilled in data analysis and
visualization, business knowledge, SQL and data warehousing, data modeling, and ETL
procedures, as well as programming languages like Python and knowledge of BI tools like
Tableau, Power BI, or QlikView.
5. Data Engineer: A data engineer is in charge of creating, constructing, and maintaining the
infrastructure and pipelines for collecting and storing data from diverse sources. In addition
to guaranteeing data security and quality, they could also be in charge of creating data
integration solutions.
Skill Required: To create, build, and maintain scalable and effective data pipelines and data
infrastructure for processing and storing large volumes of data, a data engineer needs
expertise in database architecture, ETL procedures, data modeling, programming languages
like Python and SQL, big data technologies like Hadoop and Spark, cloud computing
platforms like AWS or Azure, and tools like Apache Airflow or Talend.
6. Big Data Engineer: Big data engineers are in charge of planning and constructing systems
that can handle and analyze massive volumes of data. Additionally, they can be in charge of
putting scalable data storage options into place and creating distributed computing systems.
Skill Required: Big data engineers must be proficient in distributed systems, programming languages like Java or Scala, data modeling, database management, cloud computing platforms like AWS or Azure, and big data technologies like Apache Spark, Kafka, and Hive, and must have experience with tools like Apache NiFi or Apache Beam, in order to design, build, and maintain large-scale distributed data processing systems.
7. Data Architect: Data models and database systems that can support data-intensive
applications must be designed and implemented by a data architect. They could also be in
charge of maintaining data security, privacy, and compliance.
Skill Required: A data architect needs knowledge of database design and modeling, data
warehousing, ETL procedures, programming languages like SQL or Python, proficiency with
data modeling tools like ER/Studio or ERwin, familiarity with cloud computing platforms like
AWS or Azure, and expertise in data governance and security.
8. Data Administrator:
Skill Required: A data administrator needs expertise in database management, backup, and
recovery, data security, SQL programming, data modeling, familiarity with database
platforms like Oracle or SQL Server, proficiency with data management tools like SQL
Developer or Toad, and experience with cloud computing platforms like AWS or Azure.
9. Business Analyst:
Skill Required: A business analyst needs expertise in data analysis, business process
modeling, stakeholder management, requirements gathering and documentation,
proficiency in tools like Excel, Power BI, or Tableau, and experience with project
management.
Non-Technical Prerequisite:
While technical skills are essential for data science, there are also non-technical skills that
are important for success in this field. Here are some non-technical prerequisites for data
science:
1. Curiosity and creativity: Data science frequently entails venturing into unfamiliar
territory, so being able to think creatively and approach issues from several
perspectives may be a significant skill.
2. Critical thinking: In data science, it's critical to be able to assess information with
objectivity and reach logical conclusions. This involves the capacity to spot biases and
assumptions in data and analysis as well as the capacity to form reasonable
conclusions based on the facts at hand.
Technical Prerequisite:
Since data science includes dealing with enormous volumes of data and necessitates a
thorough understanding of statistical analysis, machine learning algorithms, and
programming languages, technical skills are crucial. Here are some technical prerequisites
for data science:
1. Mathematics and Statistics: Data science is working with data and analyzing it using
statistical methods. As a result, you should have a strong background in statistics and
mathematics. Calculus, linear algebra, probability theory, and statistical inference are
some of the important ideas you should be familiar with.
2. Programming: A fundamental skill for data scientists is programming. A solid
command of at least one programming language, such as Python, R, or SQL, is
required. Additionally, you must be knowledgeable about well-known data science
libraries like Pandas, NumPy, and Matplotlib.
3. Deep Learning: Neural networks are used in deep learning, a kind of machine
learning. Deep learning frameworks like TensorFlow, PyTorch, or Keras should be
familiar to you.
4. Big Data Technologies: Data scientists commonly work with large and intricate datasets. Big data technologies like Hadoop, Spark, and Hive should be known to you.
BI stands for Business Intelligence, which is also used for data analysis of business information. Below are some differences between BI and data science:
Method: Business intelligence is analytical (it works on historical data), whereas data science is scientific (it goes deeper to find the reason behind the data report).
Skills: Statistics and visualization are the two skills required for business intelligence, whereas statistics, visualization, and machine learning are the required skills for data science.
Data science involves several components that work together to extract insights and value
from data. Here are some of the key components of data science:
1. Data Collection: Data is gathered and acquired from a number of sources. This can be
unstructured data from social media, text, or photographs, as well as structured data
from databases.
2. Data Exploration and Visualization: This entails exploring the data and gaining
insights using methods like statistical analysis and data visualization. To aid in
understanding the data, this may entail developing graphs, charts, and dashboards.
3. Data Modeling: In order to analyze the data and derive insights, this component
entails creating models and algorithms. Regression, classification, and clustering are
a few examples of supervised and unsupervised learning techniques that may be
used in this.
4. Machine Learning: Building predictive models that can learn from data is required for
this. This might include the increasingly significant deep learning methods, such as
neural networks, in data science.
5. Deployment and Maintenance: The models and algorithms need to be deployed and
maintained when the data science project is over. This may entail keeping an eye on
the models' performance and upgrading them as necessary.
Tools for Data Science
o Data Analysis tools: R, Python, Statistics, SAS, Jupyter, R Studio, MATLAB, Excel,
RapidMiner.
To become a data scientist, one should also be aware of machine learning and its algorithms,
as in data science, there are various machine learning algorithms which are broadly being
used. Following are the names of some machine learning algorithms used in data science:
o Regression
o Decision tree
o Clustering
o Naive Bayes
o Apriori
We will provide a brief introduction to a few of the important algorithms here.
1. Linear Regression Algorithm: Linear regression is the most popular machine learning algorithm based on supervised learning. This algorithm works on regression, which is a method of modeling target values based on independent variables. It takes the form of a linear equation that relates a set of inputs to the predicted output. This algorithm is mostly used in forecasting and prediction. Since it shows a linear relationship between the input and output variables, it is called linear regression.
The relationship between the x and y variables can be described by the equation below:
Y = mx + c
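For illustration, here is a minimal sketch of fitting such a line with scikit-learn (assuming the library is installed; the tiny dataset is invented for this example):

```python
# Minimal linear regression sketch with scikit-learn; the data is made up
# so that the true relationship is roughly y = 2x + 1.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # independent variable (inputs)
y = np.array([3, 5, 7, 9, 11])            # target values

model = LinearRegression()
model.fit(X, y)                           # learns m (slope) and c (intercept)

print("slope m:", model.coef_[0])         # ~2.0
print("intercept c:", model.intercept_)   # ~1.0
print("prediction for x=6:", model.predict([[6]]))  # ~13.0
```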
2. Decision Tree: Decision Tree algorithm is another machine learning algorithm, which
belongs to the supervised learning algorithm. This is one of the most popular machine
learning algorithms. It can be used for both classification and regression problems.
In the decision tree algorithm, we solve the problem using a tree representation in which each node represents a feature, each branch represents a decision, and each leaf represents the outcome.
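A brief sketch of this idea with scikit-learn's DecisionTreeClassifier (the toy features and labels below are invented purely for illustration):

```python
# Decision tree sketch: each internal node tests a feature, each branch is a
# decision, and each leaf holds the predicted outcome.
from sklearn.tree import DecisionTreeClassifier, export_text

# hypothetical features: [age, income]; labels: 0 = "won't buy", 1 = "will buy"
X = [[25, 30000], [35, 60000], [45, 80000], [20, 20000], [50, 90000]]
y = [0, 1, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["age", "income"]))  # text view of the tree
print(tree.predict([[30, 50000]]))                         # predicted class for a new item
```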
3. K-Means Clustering: K-means clustering is one of the most popular algorithms of machine
learning, which belongs to the unsupervised learning algorithm. It solves the clustering
problem.
If we are given a data set of items with certain features and values, and we need to categorize those items into groups, then such problems can be solved using the k-means clustering algorithm.
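A minimal k-means sketch with scikit-learn (the 2-D points and the choice of k = 2 are arbitrary and only meant to show the idea):

```python
# K-means clustering sketch: group unlabeled items into k clusters.
import numpy as np
from sklearn.cluster import KMeans

items = np.array([[1, 2], [1, 4], [1, 0],       # one natural group
                  [10, 2], [10, 4], [10, 0]])   # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(items)

print("cluster labels:", labels)                # e.g. [1 1 1 0 0 0]
print("cluster centers:\n", kmeans.cluster_centers_)
```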
4. SVM: The supervised learning technique known as SVM, or support vector machine, is
used for regression and classification. The fundamental principle of SVM is to identify the
hyperplane in a high-dimensional space that best discriminates between the various classes
of data.
SVM, to put it simply, seeks to identify a decision boundary that maximizes the margin
between the two classes of data. The margin is the separation of each class's nearest data
points, known as support vectors, from the hyperplane.
The use of various kernel types that translate the input data to a higher-dimensional space
where it may be linearly separated allows SVM to be used for both linearly separable and
non-linearly separable data.
Among the various uses for SVM are bioinformatics, text classification, and picture
classification. Due to its strong performance and theoretical assurances, it has been widely
employed in both industry and academic studies.
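As a rough sketch, scikit-learn's SVC can be used with either a linear or an RBF kernel; the synthetic dataset below only illustrates the API:

```python
# SVM sketch: find the hyperplane that maximizes the margin between classes;
# the kernel choice handles linearly vs. non-linearly separable data.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)     # linear decision boundary
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X, y)   # non-linear decision boundary

print("support vectors per class (linear):", linear_svm.n_support_)
print("training accuracy (rbf):", rbf_svm.score(X, y))
```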
5. KNN: The supervised learning technique known as KNN, or k-Nearest Neighbours, is used
for regression and classification. The fundamental goal of KNN is to categorize a data point
by selecting the class that appears most frequently among the "k" nearest labeled data
points in the feature space.
Simply put, KNN is a lazy learning method: it stores all training data points in memory and uses them for classification or regression whenever a new data point is provided, rather than building an explicit model in advance.
The value of "k" indicates how many neighbors should be taken into account for
classification when using KNN, which may be utilized for both classification and regression
issues. A smoother choice boundary will be produced by a bigger value of "k," whereas a
more complicated decision boundary will be produced by a lower value of "k".
There are several uses for KNN, including recommendation systems, text classification, and
picture classification. Due to its efficacy and simplicity, it has been extensively employed in
both academic and industrial research. However, it can be computationally costly when working with big datasets, and it requires careful selection of the value of "k" and of the distance metric used to measure the distance between data points.
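A short sketch showing how the choice of "k" affects accuracy, using the iris dataset bundled with scikit-learn:

```python
# k-NN sketch: classify a point by a majority vote of its k nearest neighbours.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 5, 15):    # small k -> complex boundary, large k -> smoother boundary
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print(f"k={k:2d}  test accuracy = {knn.score(X_test, y_test):.3f}")
```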
6. Naive Bayes: Naive Bayes is a supervised learning method used for classification. It is founded on Bayes' theorem, a probability rule that determines the likelihood of a hypothesis in light of the data currently available.
The term "naive" refers to the assumption made by Naive Bayes, which is that the existence
of one feature in a class is unrelated to the presence of any other features in that class. This
presumption makes conditional probability computation easier and increases the algorithm's
computing efficiency.
Naive Bayes utilizes the Bayes theorem to determine the likelihood of each class given a
collection of input characteristics for binary and multi-class classification problems. The
projected class for the input data is then determined by selecting the class with the highest
probability.
Naive Bayes has several uses, including document categorization, sentiment analysis, and
email spam screening. Due to its ease of use, effectiveness, and strong performance across a
wide range of activities, it has received extensive use in both academic research and
industry. However, it may not be effective for complicated problems in which the independence assumption is violated.
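A minimal Naive Bayes sketch for spam-style text classification; the tiny corpus below is invented for illustration:

```python
# Naive Bayes sketch: word counts as features, class chosen by the highest
# posterior probability under the independence ("naive") assumption.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win a free prize now", "meeting at noon tomorrow",
         "free money click here", "project status update"]
labels = [1, 0, 1, 0]                      # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)        # bag-of-words features

clf = MultinomialNB()
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["free prize tomorrow"])))  # expected: [1] (spam)
```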
7. Random Forest: Random Forest is a supervised learning algorithm used for regression and classification. It is an ensemble learning technique that combines multiple decision trees to increase the model's robustness and accuracy.
Simply said, Random Forest builds a number of decision trees using randomly chosen
portions of the training data and features, combining the results to provide a final
prediction. The characteristics and data used to construct each decision tree in the Random
Forest are chosen at random, and each tree is trained independently of the others.
Both classification and regression issues may be solved with Random Forest, which is
renowned for its excellent accuracy, resilience, and resistance to overfitting. It may be used
for feature selection and ranking and can handle huge datasets with high dimensionality and
missing values.
There are several uses for Random Forest, including bioinformatics, text classification, and
picture classification. Due to its strong performance and capacity for handling complicated
problems, it has been widely employed in both academic research and industry. However, it may be less effective for problems involving strongly correlated features or class imbalance.
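A compact Random Forest sketch with scikit-learn, again using the bundled iris dataset:

```python
# Random Forest sketch: an ensemble of decision trees trained on random
# subsets of the data and features, with predictions combined by voting.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

print("number of trees:", len(forest.estimators_))
print("feature importances:", forest.feature_importances_)  # useful for feature ranking
```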
8. Logistic Regression: For binary classification issues, where the objective is to predict the
likelihood of a binary result (such as Yes/No, True/False, or 1/0), logistic regression is a form
of supervised learning technique. It is a statistical model that converts the result of a linear
regression model into a probability value between 0 and 1. It does this by using the logistic
function.
Simply expressed, logistic functions are used in logistic regression to represent the
connection between the input characteristics and the output probability. Any input value is
converted by the logistic function to a probability value between 0 and 1. Given the input
attributes, this probability number indicates the possibility that the binary result will be 1.
Both basic and difficult issues may be solved using logistic regression, which can handle
input characteristics with both numerical and categorical data. It may be used for feature
selection and ranking since it is computationally efficient and simple to understand.
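To make the logistic function concrete, here is a small sketch (the binary data is synthetic and only for illustration):

```python
# Logistic regression sketch: the logistic (sigmoid) function squashes any
# real value into a probability between 0 and 1.
import numpy as np
from sklearn.linear_model import LogisticRegression

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

print(logistic(0), logistic(4), logistic(-4))    # 0.5, ~0.98, ~0.02

X = np.array([[1], [2], [3], [8], [9], [10]])    # single input feature
y = np.array([0, 0, 0, 1, 1, 1])                 # binary outcome

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[5.5]]))              # [P(class 0), P(class 1)]
```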
How to solve a problem in Data Science using Machine learning algorithms?
Now, let's understand the most common types of problems that occur in data science and the approach to solving them. In data science, problems are solved using algorithms, and below is a mapping of common questions to the applicable algorithms:
Is this A or B? :
This refers to the type of problem that has only two fixed answers, such as Yes or No, 1 or 0, may or may not. This type of problem can be solved using classification algorithms.
Is this different? :
This refers to questions where the data contains various patterns and we need to find the odd one out. Such problems can be solved using anomaly detection algorithms.
Now, if you have a problem that deals with the organization of data, it can be solved using clustering algorithms.
Clustering algorithm organizes and groups the data based on features, colors, or other
common characteristics.
The main phases of data science life cycle are given below:
1. Discovery: The first phase is discovery, which involves asking the right questions. When you start any data science project, you need to determine the basic requirements, priorities, and project budget. In this phase, we determine all the requirements of the project, such as the number of people, technology, time, data, and the end goal, and then we can frame the business problem at an initial hypothesis level.
2. Data preparation: Data preparation is also known as Data Munging. In this phase, we need
to perform the following tasks:
o Data cleaning
o Data Reduction
o Data integration
o Data transformation
After performing all the above tasks, we can easily use this data for our further processes.
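As a hedged illustration of these data-munging tasks in pandas (the file name and column names below are hypothetical):

```python
# Data preparation sketch: cleaning, transformation, and reduction with pandas.
import pandas as pd

df = pd.read_csv("raw_sales.csv")                        # hypothetical input file

df = df.drop_duplicates()                                # cleaning: remove duplicate rows
df["amount"] = df["amount"].fillna(df["amount"].mean())  # cleaning: fill missing values
df = df[df["amount"] >= 0]                               # cleaning: drop inconsistent records

df["order_date"] = pd.to_datetime(df["order_date"])      # transformation: fix data types
monthly = df.groupby(df["order_date"].dt.to_period("M"))["amount"].sum()  # reduction: aggregate
print(monthly.head())
```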
3. Model Planning: In this phase, we need to determine the various methods and techniques to establish the relations between input variables. We will apply exploratory data analysis (EDA) using various statistical formulas and visualization tools to understand the relations between variables and to see what the data can tell us. Common tools used for model planning are:
o R
o SAS
o Python
4. Model Building: In this phase, the process of model building starts. We will create datasets for training and testing purposes. We will apply different techniques such as association, classification, and clustering to build the model. Common tools used for model building are:
o WEKA
o SPSS Modeler
o MATLAB
5. Operationalize: In this phase, we will deliver the final reports of the project, along with
briefings, code, and technical documents. This phase provides you with a clear overview of the complete project performance and other components on a small scale before the full deployment.
6. Communicate results: In this phase, we check whether we have reached the goal set in the initial phase. We will communicate the findings and final result to the business team.
Following are some applications of data science:
o Gaming world:
In the gaming world, the use of machine learning algorithms is increasing day by day. EA Sports, Sony, and Nintendo are widely using data science to enhance the user experience.
o Internet search:
When we want to search for something on the internet, then we use different types
of search engines such as Google, Yahoo, Bing, Ask, etc. All these search engines use
data science technology to make the search experience better, and you can get search results within a fraction of a second.
o Transport:
Transport industries are also using data science technology to create self-driving cars. With self-driving cars, it is easier to reduce the number of road accidents.
o Healthcare:
In the healthcare sector, data science is providing lots of benefits. Data science is
being used for tumor detection, drug discovery, medical image analysis, virtual
medical bots, etc.
o Recommendation systems:
Most companies, such as Amazon, Netflix, Google Play, etc., are using data science technology to create a better user experience with personalized recommendations. For example, when you search for something on Amazon and start getting suggestions for similar products, this is because of data science technology.
o Risk detection:
The finance industry has always faced issues of fraud and risk of losses, but with the help of data science, these risks can be reduced.
Most finance companies are looking for data scientists to avoid risk and losses while increasing customer satisfaction.
1. Programming Languages:
A fundamental tool for every data scientist is a programming language. ‘Python’ and ‘R’ are
two of the most popular languages in the field. Python offers a versatile ecosystem with
libraries such as NumPy, Pandas, and Scikit-learn, making it ideal for data manipulation,
analysis, and machine learning. R, on the other hand, excels in statistical analysis and
visualization, with packages like dplyr and ggplot2. Familiarity with one or both of these
languages will enable you to efficiently handle and process data.
‘Python’, with its clean syntax and extensive community support, has become the go-to
choice for many data scientists. Its rich ecosystem of libraries provides powerful tools for
tasks ranging from data cleaning and preprocessing to model building and deployment. The
simplicity and readability of Python code make it accessible to both beginners and
experienced programmers alike.
‘R’, on the other hand, has a strong foundation in statistical analysis and is highly regarded in
academic and research settings. It provides a wide range of statistical techniques and
packages that are specifically designed for data analysis and visualization. R's syntax and
capabilities make it a powerful tool for exploratory data analysis and statistical modeling.
It's important to note that while Python and R are popular, there are other programming
languages like Julia and Scala that are gaining traction in the data science community. As a
data scientist, being adaptable and open to learning new languages will widen your range of
tools and increase your versatility.
2. Data Visualization:
Data visualization is a powerful technique for communicating insights effectively. Tools like
Matplotlib, Seaborn, Plotly, and Tableau allow data scientists to create visual representations
that aid in understanding complex patterns and trends. Visualizations can simplify complex
concepts, identify outliers, and present data-driven narratives that resonate with
stakeholders. Developing proficiency in data visualization empowers you to tell compelling
stories with data.
Matplotlib, a popular plotting library for Python, provides a flexible framework for creating
static, animated, and interactive visualizations. It offers a wide range of plot types,
customization options, and control over every aspect of the visualization. Seaborn, built on
top of Matplotlib, specializes in statistical graphics and provides a high-level interface for
creating aesthetically pleasing visualizations with minimal code.
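A minimal plotting sketch with Matplotlib and Seaborn (it uses the "tips" sample dataset that Seaborn downloads on first use):

```python
# Visualization sketch: a plain Matplotlib histogram next to a Seaborn scatter plot.
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].hist(tips["total_bill"], bins=20)        # Matplotlib: distribution of a variable
axes[0].set_title("Distribution of total bill")

sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time", ax=axes[1])  # Seaborn: relationship
axes[1].set_title("Tip vs. total bill")

plt.tight_layout()
plt.show()
```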
Plotly, a powerful data visualization library, offers interactive and dynamic visualizations that
can be embedded in web applications. It allows you to create interactive plots, charts, and
dashboards that enhance the user experience and enable exploration of complex datasets.
Tableau, a widely used data visualization tool, provides a user-friendly interface for creating
visually appealing and interactive dashboards. It offers drag-and-drop functionality, intuitive
design features, and robust data connectivity options. Tableau’s strength lies in its ability to
quickly transform data into actionable insights, making it popular among data analysts and
business users.
By mastering these visualization tools, you can effectively communicate your findings,
engage stakeholders, and drive data-informed decision-making within your organization.
3. Machine Learning Algorithms:
Machine learning algorithms enable data scientists to extract valuable insights and make
predictions from data. Familiarity with a range of algorithms empowers you to select the
most appropriate approach for a given problem and optimize model performance.
a. Supervised Learning: Supervised learning algorithms learn patterns from labeled data to
make predictions or classify new instances. Linear regression, decision trees, random forests,
support vector machines (SVM), and neural networks are common examples of supervised
learning algorithms. Each algorithm has its strengths and is suitable for different types of
problems. For instance, linear regression is used for predicting continuous values, while
decision trees and random forests excel in handling categorical or binary outcomes.
b. Deep Learning: Deep learning, a subset of machine learning, focuses on neural networks
with multiple hidden layers. Deep learning algorithms have achieved remarkable success in
various domains, including computer vision, natural language processing, and speech
recognition. Convolutional neural networks (CNN) and recurrent neural networks (RNN) are
widely used architectures in deep learning. They have revolutionized image recognition,
language translation, and sentiment analysis, among other applications.
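For a feel of what a small deep learning model looks like, here is an illustrative (not production-ready) CNN sketch with Keras/TensorFlow; the input shape and layer sizes are arbitrary choices:

```python
# CNN sketch: convolution and pooling layers followed by dense layers.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),                  # e.g. 28x28 grayscale images
    layers.Conv2D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),             # hidden fully connected layer
    layers.Dense(10, activation="softmax"),          # 10 output classes
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```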
4. Data Wrangling:
Data rarely comes in a clean and ready-to-use format. Data wrangling involves cleaning,
transforming, and preparing raw data for analysis. Libraries like Pandas in Python and tidyr in
R provide powerful tools for data wrangling tasks, including handling missing values, merging
datasets, and reshaping data structures. Proficiency in data wrangling allows you to handle
messy data efficiently and extract meaningful insights.
Data wrangling is often an iterative and time-consuming process. It requires skills in data
cleaning, data integration, and data transformation. Cleaning involves removing duplicates,
dealing with missing values, and handling outliers. Integration combines data from different
sources or merges multiple datasets. Transformation includes reshaping data, creating new
variables, or aggregating data at different levels of granularity.
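A short pandas sketch covering these cleaning, integration, and transformation steps (the two small DataFrames are invented for illustration):

```python
# Data wrangling sketch: handle missing values, merge datasets, and aggregate.
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 2, 3],
                       "amount": [100.0, None, 250.0, 80.0]})
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "region": ["North", "South", "North"]})

orders = orders.drop_duplicates()                    # cleaning: remove duplicates
orders["amount"] = orders["amount"].fillna(0)        # cleaning: handle missing values

merged = orders.merge(customers, on="customer_id")   # integration: join the datasets
by_region = merged.groupby("region")["amount"].sum() # transformation: aggregate by region
print(by_region)
```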
By mastering data wrangling techniques, you ensure data quality, enhance the reliability of
your analyses, and lay a solid foundation for further exploration and modeling.
5. SQL and Databases:
Data is often stored in databases, and SQL (Structured Query Language) is a powerful tool
for querying and manipulating structured data. Understanding SQL and working with
database systems like MySQL, PostgreSQL, or SQLite enables data scientists to extract
relevant information, perform aggregations, and join datasets efficiently. SQL skills are
essential for accessing and manipulating data stored in relational databases.
SQL allows you to perform operations such as selecting specific columns, filtering rows
based on conditions, sorting data, and joining tables to combine information from different
sources. It provides a standardized way to interact with databases and retrieve the data
needed for analysis or modeling tasks.
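A hedged sketch of these operations using Python's built-in sqlite3 module; the tables and rows are created in memory purely for illustration:

```python
# SQL sketch: create two tables, then select, join, filter, aggregate, and sort.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Asha', 'North'), (2, 'Ravi', 'South');
    INSERT INTO orders VALUES (1, 1, 120.0), (2, 1, 80.0), (3, 2, 250.0);
""")

rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total      -- select columns and aggregate
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id   -- join the two tables
    WHERE o.amount > 50                        -- filter rows
    GROUP BY c.name
    ORDER BY total DESC;                       -- sort the result
""").fetchall()

print(rows)   # [('Ravi', 250.0), ('Asha', 200.0)]
```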
Furthermore, SQL is not limited to querying databases but also supports creating tables,
modifying data structures, and managing user permissions. This allows data scientists to
handle data engineering tasks, ensuring data is organized, updated, and readily available for
analysis.
6. Big Data Processing Frameworks:
In the era of big data, traditional data processing techniques may not suffice. Familiarity with
distributed computing frameworks like Apache Hadoop and Apache Spark is becoming
increasingly important. These tools enable efficient processing and analysis of large-scale
datasets across distributed clusters. Learning to leverage these frameworks equips data
scientists with the ability to handle big data challenges effectively.
Apache Hadoop is an open-source framework that allows distributed storage and processing
of large datasets across clusters of computers. It utilizes a distributed file system called
Hadoop Distributed File System (HDFS) and a processing framework called MapReduce. With
Hadoop, data scientists can parallelize computations and distribute data across multiple
nodes, enabling the processing of massive datasets in a scalable and fault-tolerant manner.
Apache Spark is a fast and general-purpose distributed computing system. It provides an in-
memory computing engine that allows data scientists to perform iterative computations and
interactive data analysis at a much faster pace compared to traditional disk-based systems.
Spark supports various programming languages, including Python and Scala, and offers high-
level APIs for data manipulation, machine learning, and graph processing.
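A minimal PySpark sketch (it assumes a working Spark installation and the pyspark package; the tiny in-memory dataset stands in for a much larger distributed one):

```python
# Spark sketch: build a DataFrame and run a distributed aggregation.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example").getOrCreate()

df = spark.createDataFrame(
    [("North", 120.0), ("South", 250.0), ("North", 80.0)],
    ["region", "amount"],
)

# Transformations are lazy; the aggregation runs only when show() is called.
df.groupBy("region").agg(F.sum("amount").alias("total")).show()

spark.stop()
```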
After mastering these big data processing frameworks, data scientists can efficiently handle
large volumes of data, perform complex computations, and extract insights that were
previously unattainable with traditional tools.
7. Version Control:
Version control systems like Git provide a structured and collaborative approach to managing
code and project files. Data scientists often work in teams and need to track changes,
collaborate seamlessly, and maintain a history of their work. By adopting version control
practices, you can effectively manage code, experiment with different approaches, and
ensure the reproducibility of your analyses.
Git allows you to track changes, create branches for experimentation, and merge different
versions of your code. It enables collaboration by allowing multiple contributors to work on
the same project simultaneously and provides mechanisms to resolve conflicts that may
arise during the development process. Moreover, Git integrates well with platforms like
GitHub and GitLab, providing additional features like issue tracking, code reviews, and
project management tools.
Version control not only facilitates collaboration but also ensures the integrity of your work.
By keeping a history of changes, you can revert to previous versions if needed, trace the
evolution of your analyses, and maintain a well-documented and organized workflow.
Conclusion:
Becoming a proficient data scientist requires mastering a diverse set of tools and techniques.
The data science toolbox encompasses programming languages, data visualization tools,
machine learning algorithms, data wrangling skills, SQL and database systems, big data
processing frameworks, and version control systems. By investing time and effort into
developing these foundational skills, you’ll be well-equipped to navigate the complex
landscape of data science and contribute valuable insights to organizations.
With a plethora of data collection points at their disposal, it is easy for business leaders to get lost in the information glut.
However, every business and business leader should ponder for a moment: will collecting
such a huge volume of data make sense unless it provides meaningful information and
insights? Big data comes with bigger challenges.
In 2017, The Economist published a story titled, “The world’s most valuable resource is no
longer oil, but data.”
But unfortunately, this story is no longer relevant in today’s data-intensive world. Big data
doesn’t always equate to good data.
Instead, the world’s most valuable resource is the ability to use data to extract meaningful
data insights and leverage untapped potential. If appropriately utilized, those insights can
help an organization:
However, before discussing the process and ways of collecting valuable insights, it’s
important to understand the difference between data and insights.
Many people use data, information, and insights interchangeably. However, there’s a vast
difference between these terms.
If you look at these terms from a pyramid point of view, data sits at the foundation,
information occupies the middle part, and insight is positioned at the pinnacle.
Data: Raw and unprocessed facts in the form of numbers, text, images, audio or video files, etc., which primarily exist in various formats and systems. On its own, data neither makes sense nor provides valuable input to a business.
Information: Information can also be called “data processed, aggregated, and organized into
a more human-friendly format.” It provides more context but is still not ready to inform
business decisions.
Insights: Insights are generated by analyzing information and drawing conclusions. This step
can make or break an organization’s ability to understand its data better and leverage it to
maximize profitability, reduce cost, and create value for shareholders.
If you look around, all successful companies like Coca-Cola, Netflix, Google, Spotify, etc.,
leverage insights to enhance the customer experience and increase their revenue.
In a nutshell, data is the input for extracting relevant information, and then information
becomes the input to obtain meaningful insights.
As discussed above, data on its own can’t influence business decisions. It has to be first
processed and organized in a more human-friendly format and then converted into
actionable insights.
For instance, a company receives hundreds or even thousands of invoices from its vendors
each month. Those invoices are recorded in the accounting system, which results in the generation of big data.
However, such data is of little use in decision making until it is processed further and
actionable insights are drawn from it. Unprocessed data may be limited by severe data
quality issues such as:
1. Duplicate data – Since most organizations collect data from all directions and
systems, it may result in duplication and overlap in these sources. A duplicate invoice
may lead to a duplicate payment.
2. Inaccurate data – Human errors, data drift, and data decay can lead to a loss of data
integrity. Inaccurate data recording may delay payments which can adversely affect
an organization’s relationship with its vendors.
It’s almost impossible for business leaders to make decisions based on myriad data sources
until this procurement big data is converted into relevant information and then actionable
insights.
Only then can leaders uncover hidden patterns and trends and obtain necessary inputs to
make informed business decisions.
Organizations often overlook the accounts payable department while trying to make better
use of data.
AP manages critical financial data that can provide valuable insights for discovering potential
savings.
Organizations can use AP data to optimize their cash flow, create better and deeper
relationships with suppliers, and understand trends in the payment data.
Before we discuss further how an organization can make better sense of AP data, let’s
understand how raw data can be transformed into actionable insights.
Here is a 5-step process that can help you convert raw data into actionable insights:
It’s critical for an organization to keep an eye on the prize — the end goals to be achieved
from data analytics should be clearly outlined.
The goals should align with the company’s strategic priorities. It’s easy to deviate towards
vanity metrics that sound impressive and look good on paper, but in reality, don’t add value
to a business.
A useful framework for setting goals and KPIs is to be SMART – Specific, Measurable,
Attainable, Realistic, and Timely.
Once end goals have been identified, the next step is to figure out key information needed
for informed business decisions.
What are the key drivers of revenue, expenses, and risks in the targeted business
area?
How will specific insights impact the operations and add to the bottom line?
Who will consume these insights? What actions do they want to take based on these
insights?
Every user will have different expectations from the data analytics activity. C-suite executives
may focus on the big financial picture, while managers may be more interested in collecting
insights that improve management practices.
Similarly, executives may want to collect operational insights. Make sure all users' requirements are considered ahead of time.
This step is the most critical step when converting data points into insights. Mostly, data is
stored in disparate systems with varying degrees of accuracy across sources.
Hence, it becomes important for the organization to collect, combine, and collate this data
into a single data model.
Also, the organization has to handle and eliminate common data handling and transformation challenges such as missing values, different output formats, and varying levels of granularity.
Here, pattern recognition also plays a key role. Not all patterns will be relevant or crucial.
Each pattern should be reviewed and moved forward only if it answers necessary questions.
Segmentation is also necessary since it allows you to group data based on common
attributes and then process it further.
Once data has been collected, collated, and cross-examined for accuracy and cleanliness, the
next step is to set up visual analytics.
Visual analytics present information in a highly graphical, interactive, and visual format
through interactive dashboards, reports, summaries, graphs, charts, and maps.
Result?
Critical data is displayed in meaningful, insightful ways to help business leaders make
informed business decisions such as forecasting, planning, analysis, risk
management, strategic sourcing, operational complexity reduction, and anti-fraud
monitoring — to name a few.
The final step is to derive the required information to make better strategic decisions and
generate more value from data.
The insights collected from the entire 5-step process can help the organization manage and
enhance profitability, maximise prosperity, and transform risk into value across the board.
This 5-step process is not a law that has to be followed as it is. The main objective behind
converting data into insights is to present it in an easy-to-understand, simple, and visual
language.
Over the last few years, businesses have understood the importance of being more agile and
proactive in getting access to real-time data and insights.
Instead of relying on manual systems where a finance team pulls data from multiple spreadsheets, crunches numbers, and sends reports to stakeholders and executives, organizations need to adopt an automated system with robust analytics capabilities.
Ideally, it will be a centralized system that captures data in a systematic, standardized format
every time. For example, a centralized Procure-to-Pay software like PLANERGY.
The automated system should have the ability to use various visual formats to provide
actionable insights that help business leaders make informed business decisions.
Data on its own can seem like an alien language to people outside of the analytics team. This
is where data visualization can take raw data and turn it into easily interpretable insights.
A few common data visualization techniques include pie charts, bar charts, histograms,
Gantt charts, heat maps, Waterfall charts, etc.
An organization has a variety of data visualization tools at its disposal – Power BI, Google
Charts, Tableau, etc.
A good data analytics and visualization software is fully customizable and can be embedded
right into the core product or ERP.
You can pull data from multiple sources into a standalone data visualization tool but having
tools in the various areas of your business processes already equipped with data
visualisation can be even better.
You see the relevant data as you are making decisions in the application.
For example, PLANERGY Spend Analysis software offers a powerful real-time business
intelligence software for spend data, equipped with data visualization features such as
reports and customizable dashboards.
You can track every purchase in PLANERGY to power your reporting insights, drill down and
uncover hidden patterns to realize savings of up to 15%, and integrate with almost all ERPs.
Besides, a good visualization tool gives you an option to build custom business intelligence
reports according to your requirements.
All reports are fully filterable which allows you to drill down and see hidden details.
Bottom Line
By adopting automated systems that convert raw data into insights, organizations can manage their finances more effectively, make better decisions, and earn the best ROI on capital.
Combining data processing with machine learning makes the system more intelligent and
capable of handling complex data points.
Steadily, business leaders have recognized the need to transform their data into actionable insights and are finding the right tools to capture data accurately, to provide information in a way that lets them do a deep dive when needed, and to deliver the right data at the right time and in the right format to aid decision making.
Version Control Systems
Version control systems are a category of software tools that help record changes made to files by keeping track of modifications done to the code.
It improves productivity, expedites product delivery, and develops the skills of the employees through better communication and assistance.
For each contributor to the project, a different working copy is maintained and is not merged into the main file until the working copy is validated. The most popular examples are Git, Helix Core, and Microsoft TFS.
It informs us about who made which changes, when, and why.
Copy of Work (sometimes called a checkout): It is the personal copy of all the files in a project. You can edit this copy without affecting the work of others, and you can finally commit your changes to a repository when you are done making your changes.
Working in a group: Consider yourself working in a company where you are asked to work on a live project. You can't change the main code as it is in production, and any change may cause inconvenience to the users; also, you are working in a team, so you need to collaborate with your team and adopt their changes. Version control helps you with this by merging different requests into the main repository without making any undesirable changes. You may test the functionality without putting it live, and you don't need to download and set up the project each time; just pull the changes, make your changes, test them, and merge them back.
Types of Version Control Systems:
Local Version Control Systems: This is one of the simplest forms and has a database that keeps all the changes to files under revision control. RCS is one of the most common VCS tools. It keeps patch sets (the differences between files) in a special format on disk. By adding up all the patches, it can re-create what any file looked like at any point in time.
Centralized Version Control Systems: Centralized version control systems contain just one repository globally, and every user needs to commit for their changes to be reflected in the repository. It is possible for others to see your changes by updating.
Two things are required to make your changes visible to others which are:
You commit
They update
The benefit of CVCS (Centralized Version Control Systems) is that it enables collaboration amongst developers and provides insight, to a certain extent, into what everyone else is doing on the project. It allows administrators fine-grained control over who can do what.
It has some downsides as well, which led to the development of DVCS. The most obvious is the single point of failure that the centralized repository represents: if it goes down, collaboration and saving versioned changes are not possible during that period. What if the hard disk
of the central database becomes corrupted, and proper backups haven’t been kept? You lose
absolutely everything.
Distributed Version Control Systems: Distributed version control systems contain multiple
repositories. Each user has their own repository and working copy. Just committing your
changes will not give others access to your changes. This is because commit will reflect those
changes in your local repository and you need to push them in order to make them visible
on the central repository. Similarly, when you update, you do not get others' changes unless
you have first pulled those changes into your repository.
You commit
You push
They pull
They update
The most popular distributed version control systems are Git, and Mercurial. They help us
overcome the problem of single point of failure.
Purpose of Version Control:
Multiple people can work simultaneously on a single project. Everyone works on and
edits their own copy of the files and it is up to them when they wish to share the
changes made by them with the rest of the team.
It integrates the work that is done simultaneously by different members of the team.
In some rare cases, when conflicting edits are made by two people to the same line
of a file, then human assistance is requested by the version control system in
deciding what should be done.
Markdown
What is Markdown?
Markdown is a lightweight markup language that you can use to add formatting elements to
plaintext text documents. Created by John Gruber in 2004, Markdown is now one of the
world’s most popular markup languages.
Using Markdown is different than using a WYSIWYG editor. In an application like Microsoft
Word, you click buttons to format words and phrases, and the changes are visible
immediately. Markdown isn’t like that. When you create a Markdown-formatted file, you
add Markdown syntax to the text to indicate which words and phrases should look different.
For example, to denote a heading, you add a number sign before it (e.g., # Heading One). Or
to make a phrase bold, you add two asterisks before and after it (e.g., **this text is bold**).
It may take a while to get used to seeing Markdown syntax in your text, especially if you’re
accustomed to WYSIWYG applications. The screenshot below shows a Markdown file
displayed in the Visual Studio Code text editor.
You can add Markdown formatting elements to a plaintext file using a text editor application.
Or you can use one of the many Markdown applications for macOS, Windows, Linux, iOS,
and Android operating systems. There are also several web-based applications specifically
designed for writing in Markdown.
Depending on the application you use, you may not be able to preview the formatted
document in real time. But that’s okay. According to Gruber, Markdown syntax is designed to
be readable and unobtrusive, so the text in Markdown files can be read even if it isn’t
rendered.
The overriding design goal for Markdown’s formatting syntax is to make it as readable as
possible. The idea is that a Markdown-formatted document should be publishable as-is, as
plain text, without looking like it’s been marked up with tags or formatting instructions.
Markdown is future proof. Even if the application you’re using stops working at some
point in the future, you’ll still be able to read your Markdown-formatted text using a
text editing application. This is an important consideration when it comes to books,
university theses, and other milestone documents that need to be preserved
indefinitely.
Markdown is everywhere. Websites like Reddit and GitHub support Markdown, and
lots of desktop and web-based applications support it.
The best way to get started with Markdown is to use it. That’s easier than ever before thanks
to a variety of free tools.
You don’t even need to download anything. There are several online Markdown editors that
you can use to try writing in Markdown. Dillinger is one of the best online Markdown
editors. Just open the site and start typing in the left pane. A preview of the rendered
document appears in the right pane.
You’ll probably want to keep the Dillinger website open as you read through this guide. That
way you can try the syntax as you learn about it. After you’ve become familiar with
Markdown, you may want to use a Markdown application that can be installed on your
desktop computer or mobile device.
Dillinger makes writing in Markdown easy because it hides the stuff happening behind the
scenes, but it’s worth exploring how the process works in general.
When you write in Markdown, the text is stored in a plaintext file that has
an .md or .markdown extension. But then what? How is your Markdown-formatted file
converted into HTML or a print-ready document?
The short answer is that you need a Markdown application capable of processing the
Markdown file. There are lots of applications available — everything from simple scripts to
desktop applications that look like Microsoft Word. Despite their visual differences, all of the
applications do the same thing. Like Dillinger, they all convert Markdown-formatted text to
HTML so it can be displayed in web browsers.
Note: The Markdown application and processor are two separate components. For the sake
of brevity, I've combined them into one element ("Markdown app") in the figure below.
1. Create a Markdown file using a text editor or a dedicated Markdown application. The
file should have an .md or .markdown extension.
2. Use the Markdown application to convert the Markdown file to an HTML document.
3. View the HTML file in a web browser or use the Markdown application to convert it to another file format, like PDF.
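As one example of the conversion step, the third-party Python "markdown" package (one of many Markdown processors; assumed installed via pip install markdown) turns Markdown text into HTML:

```python
# Markdown-to-HTML sketch using the "markdown" package.
import markdown

text = "# Heading One\n\nSome **bold** text and a [link](https://example.com)."
html = markdown.markdown(text)   # convert Markdown-formatted text to HTML
print(html)
# <h1>Heading One</h1>
# <p>Some <strong>bold</strong> text and a <a href="https://example.com">link</a>.</p>
```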
From your perspective, the process will vary somewhat depending on the application you
use. For example, Dillinger essentially combines steps 1-3 into a single, seamless interface —
all you have to do is type in the left pane and the rendered output magically appears in the
right pane. But if you use other tools, like a text editor with a static website generator, you’ll
find that the process is much more visible.
Markdown is a fast and easy way to take notes, create content for a website, and produce
print-ready documents.
It doesn’t take long to learn the Markdown syntax, and once you know how to use it, you
can write using Markdown just about everywhere. Most people use Markdown to create
content for the web, but Markdown is good for formatting everything from email messages
to grocery lists.
Git
Branching and Merging
The Git feature that really makes it stand apart from nearly every other SCM out there is its
branching model.
Git allows and encourages you to have multiple local branches that can be entirely
independent of each other. The creation, merging, and deletion of those lines of
development takes seconds.
Frictionless Context Switching. Create a branch to try out an idea, commit a few
times, switch back to where you branched from, apply a patch, switch back to where
you are experimenting, and merge it in.
Role-Based Codelines. Have a branch that always contains only what goes to
production, another that you merge work into for testing, and several smaller ones
for day to day work.
Feature Based Workflow. Create new branches for each new feature you're working
on so you can seamlessly switch back and forth between them, then delete each
branch when that feature gets merged into your main line.
Disposable Experimentation. Create a branch to experiment in, realize it's not going
to work, and just delete it - abandoning the work—with nobody else ever seeing it
(even if you've pushed other branches in the meantime).
Notably, when you push to a remote repository, you do not have to push all of your
branches. You can choose to share just one of your branches, a few of them, or all of them.
This tends to free people to try new ideas without worrying about having to plan how and
when they are going to merge it in or share it with others.
There are ways to accomplish some of this with other systems, but the work involved is
much more difficult and error-prone. Git makes this process incredibly easy and it changes
the way most developers work when they learn it.
Git is fast. With Git, nearly all operations are performed locally, giving it a huge speed
advantage on centralized systems that constantly have to communicate with a server
somewhere.
Git was built to work on the Linux kernel, meaning that it has had to effectively handle large
repositories from day one. Git is written in C, reducing the overhead of runtimes associated
with higher-level languages. Speed and performance have been a primary design goal of Git from the start.
Distributed
One of the nicest features of any Distributed SCM, Git included, is that it's distributed. This
means that instead of doing a "checkout" of the current tip of the source code, you do a
"clone" of the entire repository.
Multiple Backups
This means that even if you're using a centralized workflow, every user essentially has a full
backup of the main server. Each of these copies could be pushed up to replace the main
server in the event of a crash or corruption. In effect, there is no single point of failure with
Git unless there is only a single copy of the repository.
Any Workflow
Because of Git's distributed nature and superb branching system, an almost endless number
of workflows can be implemented with relative ease.
Integration Manager Workflow
Another common Git workflow involves an integration manager — a single person who
commits to the 'blessed' repository. A number of developers then clone from that
repository, push to their own independent repositories, and ask the integrator to pull in their
changes. This is the type of development model often seen with open source or GitHub
repositories.
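As a rough sketch of the integration-manager model (the user and repository names are hypothetical), the integrator adds each contributor's repository as a remote and pulls their work in:

    git remote add alice https://github.com/alice/project.git   # contributor's public repository
    git fetch alice                                              # fetch the proposed changes for review
    git merge alice/feature-x                                    # integrate them locally
    git push origin main                                         # publish to the 'blessed' repository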
For more massive projects, a development workflow like that of the Linux kernel is often
effective. In this model, some people ('lieutenants') are in charge of a specific subsystem of
the project and they merge in all changes related to that subsystem. Another integrator (the
'dictator') can pull changes from only his/her lieutenants and then push to the 'blessed'
repository that everyone then clones from again.
Data Assurance
The data model that Git uses ensures the cryptographic integrity of every bit of your project.
Every file and commit is checksummed and retrieved by its checksum when checked back
out. It's impossible to get anything out of Git other than the exact bits you put in.
It is also impossible to change any file, date, commit message, or any other data in a Git
repository without changing the IDs of everything after it. This means that if you have a
commit ID, you can be assured not only that your project is exactly the same as when it was
committed, but that nothing in its history was changed.
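The following sketch shows this content addressing in practice; the commands assume an existing repository, and the actual hash values will differ from one repository to the next:

    echo 'hello' | git hash-object --stdin   # the same content always hashes to the same ID
    git rev-parse HEAD                       # the full checksum (ID) of the current commit
    git fsck --full                          # verify the integrity of every object in the database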
Staging Area
Unlike most other version control systems, Git has something called the "staging area" or
"index". This is an intermediate area where commits can be formatted and reviewed before
completing the commit.
One thing that sets Git apart from other tools is that it's possible to quickly stage some of
your files and commit them without committing all of the other modified files in your
working directory or having to list them on the command line during the commit.
This allows you to stage only portions of a modified file. Gone are the days of making two
logically unrelated modifications to a file before you realized that you forgot to commit one
of them. Now you can just stage the change you need for the current commit and stage the
other change for the next commit. This feature scales up to as many different changes to
your file as needed.
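A minimal sketch of working with the staging area (the file names and commit messages are hypothetical):

    git add notes.md                     # stage only one of several modified files
    git commit -m "Update the notes"     # commit it, leaving other changes unstaged
    git add -p analysis.R                # interactively stage individual hunks of a single file
    git commit -m "Fix the loop bounds"  # commit just those hunks; the rest waits for the next commit
    git status                           # see what is staged versus still only modified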
Git is released under the GNU General Public License version 2.0, which is an open source
license. The Git project chose to use GPLv2 to guarantee your freedom to share and change
free software, and to make sure the software is free for all its users.
GitHub
What is GitHub?
GitHub is a web-based version control and collaboration platform for software developers.
Microsoft, the biggest single contributor to GitHub, acquired the platform for $7.5 billion in
2018. GitHub, which is delivered through a software as a service (SaaS) business model, was
started in 2008. It was founded on Git, an open source code management system created by
Linus Torvalds to manage development of the Linux kernel.
Git is used to store the source code for a project and track the complete history of all
changes to that code. It lets developers collaborate on a project more effectively by
providing tools for managing possibly conflicting changes from multiple developers.
GitHub allows developers to change, adapt and improve software from its public repositories
for free; private repositories and additional collaboration features are offered as part of
various paid plans. Each public and private repository contains all of a project's files, as well
as each file's revision history. Repositories can have multiple collaborators and owners.
GitHub facilitates social coding by providing a hosting service and web interface for the Git
code repository, as well as management tools for collaboration. The developer platform can
be thought of as a social networking site for software developers. Members can follow each
other, rate each other's work, receive updates for specific open source projects, and
communicate publicly or privately.
o Fork. A fork is a copy of a repository from one member's account to another member's account. Forks, like branches, let a developer make modifications without affecting the original code.
o Pull request. If a developer would like to share their modifications, they can send a pull request to the owner of the original repository.
o Merge. If, after reviewing the modifications, the original owner would like to pull the modifications into the repository, they can accept the modifications and merge them with the original repository.
o Push. This is the reverse of a pull: a programmer sends code from a local copy to the online repository (a command-line sketch of this cycle follows this list).
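Here is a command-line sketch of that cycle, assuming hypothetical account, repository and branch names; forking itself is done through the GitHub web interface or the GitHub CLI:

    git clone https://github.com/your-username/example-project.git   # clone your fork locally
    cd example-project
    git checkout -b fix-typo                                          # work on a branch
    git commit -am "Fix typo in README"
    git push origin fix-typo                                          # push the branch to your fork
    gh pr create --title "Fix typo in README" --body "Small documentation fix"   # open a pull request with the GitHub CLI, if installed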
GitHub offers an on-premises version in addition to the well-known SaaS product. GitHub
Enterprise supports integrated development environments and continuous integration tools,
as well as many third-party apps and services. It offers more security and auditability than
the SaaS version.
Other GitHub offerings include the following:
o GitHub Pages. Static webpages to host a project, pulling information directly from an individual's or organization's GitHub repository.
o GitHub Desktop. Users can access GitHub from Windows or Mac desktops, rather than going to GitHub's website.
o GitHub Student Developer Pack. A free offering of developer tools for students. It includes cloud resources, programming tools and support, and GitHub access.
o GitHub Campus Experts. A program students can use to become leaders at their schools and develop technical communities there.
o GitHub CLI. A free, open source command-line tool that brings GitHub features, such as pull requests, to a user's local terminal. This capability eliminates the need to switch contexts when coding, streamlining workflows.
GitHub is used to store, track and collaborate on software projects in a number of different
contexts:
o Programming instructors and students make use of GitHub in several ways. The Student Developer Pack gives teachers and students an array of low-cost resources. Students use the platform to learn web development, work on creative development projects and host virtual events.
o Open source software developers use GitHub to share projects with individuals who want to use their software or collaborate on it. Developers network, collaborate and pitch their work to other developers in real time, catching errors in proposed code before changes are finalized. These collaboration and networking capabilities are why GitHub is classified as a social media site; it often links to other community sites such as Reddit in the repository notes. Users also can download applications from GitHub.
To sign up for GitHub and create a repository, new users and beginners follow these steps:
o Learn about the command line. The command line is how users interact with Git and GitHub. The ability to use it is a prerequisite for working with GitHub; tutorials and other tools are available to help with this process. An alternative is the GitHub Desktop client.
o Install Git. Git can be installed for free using instructions on the Git website. Installing GitHub Desktop will also install a command-line version of Git. Git comes installed by default on many Mac and Linux machines.
o Create a new repository. Go to the GitHub homepage, click the + sign and then click New repository. Name the repository and provide a brief description when prompted. Add a README file, .gitignore template and project license. Then scroll to the bottom of the page and click Create repository (a command-line sketch follows these steps).
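Once the repository exists on GitHub, a local project can be connected to it from the command line; the user name, repository name and branch name below are placeholders:

    git init                                                          # turn the local folder into a Git repository
    git add .
    git commit -m "Initial commit"
    git remote add origin https://github.com/your-username/my-first-repo.git
    git push -u origin main                                           # publish and set the default upstream branch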
R
What is R?
Introduction to R
R is a language and environment for statistical computing and graphics. It provides a wide
variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series
analysis, classification, clustering, …) and graphical techniques, and is highly extensible. The
S language, from which R descends, is often the vehicle of choice for research in statistical
methodology, and R provides an Open Source route to participation in that activity.
One of R’s strengths is the ease with which well-designed publication-quality plots can be
produced, including mathematical symbols and formulae where needed. Great care has
been taken over the defaults for the minor design choices in graphics, but the user retains
full control.
R is available as Free Software under the terms of the Free Software Foundation’s GNU
General Public License in source code form. It compiles and runs on a wide variety of UNIX
platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.
The R environment
R is an integrated suite of software facilities for data manipulation, calculation and graphical
display. It includes
o an effective data handling and storage facility,
o a suite of operators for calculations on arrays, in particular matrices,
o a large, coherent, integrated collection of intermediate tools for data analysis,
o graphical facilities for data analysis and display either on-screen or on hardcopy, and
o a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
R, like S, is designed around a true computer language, and it allows users to add additional
functionality by defining new functions. Much of the system is itself written in the R dialect
of S, which makes it easy for users to follow the algorithmic choices made. For
computationally-intensive tasks, C, C++ and Fortran code can be linked and called at run
time. Advanced users can write C code to manipulate R objects directly.
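As a small illustration of these points, here is a minimal R sketch (the function, variable and axis names are illustrative, not from the text) that defines a new function, fits a classical linear model on a built-in data set, and draws a labelled plot:

    # A user-defined function: the coefficient of variation of a numeric vector
    cv <- function(x) sd(x) / mean(x)
    cv(mtcars$mpg)

    # A classical statistical model and a publication-style plot
    fit <- lm(mpg ~ wt, data = mtcars)
    summary(fit)
    plot(mpg ~ wt, data = mtcars,
         xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
         main = "Fuel economy versus weight")
    abline(fit)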
R has its own LaTeX-like documentation format, which is used to supply comprehensive
documentation, both on-line in a number of formats and in hardcopy.
R itself is an application installed on your computer, and it uses your computer's resources to
run programs written in the R programming language. RStudio integrates with R as an IDE
(Integrated Development Environment) to provide further functionality: it combines a source
code editor, build automation tools and a debugger.