Unit 1-FDS
Course Outcomes:
Upon completion of this course, students will be able to:
1. Explain the need for Data Science and analyze the skill sets of data scientists.
2. Describe the Data Science Process and how its components interact.
3. Apply basic machine learning algorithms for predictive modeling.
4. Simplify a real-world problem into mathematical terms.
5. Create effective visualization of given data.
UNIT-I Syllabus
Text Books:
1. Chirag Shah, A Hands-On Introduction to Data Science. Cambridge: Cambridge University Press, 2020.
2. Rafael A. Irizarry, Introduction to Data Science: Data Analysis and Prediction Algorithms with R, CRC Press,
2020.
What is Data Science?
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and
systems to extract knowledge and insights from data in various forms, both structured and
unstructured, similar to data mining.
Every company has data, and its business value depends on how much insight it can extract from that data.
Recently, data science has gained importance because it can help companies increase the business value of their
available data and thus gain an advantage over their rivals.
It can help us know our customers better, refine our processes, and make better decisions.
Data, in the age of information technology, has become a vital asset.
Role of Data Scientist
• Data scientists help organizations understand and handle data, and address complex problems using
knowledge drawn from a range of technical fields.
• They typically have a background in computer science, modeling, statistics, analytics, and
mathematics, combined with a clear business sense.
How to do Data Science?
A typical data science process looks like this; it can be modified for a specific use
case:
1962: American mathematician John W. Tukey first articulated the data science dream. In his now-famous article "The Future of Data Analysis," he
foresaw the inevitable emergence of a new field nearly two decades before the first personal computers.
1977: The theories and predictions of "pre" data scientists like Tukey and Naur became more concrete with the establishment of The International
Association for Statistical Computing (IASC), whose mission was "to link traditional statistical methodology, modern computer technology, and the
knowledge of domain experts in order to convert data into information and knowledge."
1980s and 1990s: Data science began taking more significant strides with the emergence of the first Knowledge Discovery in Databases (KDD)
workshop and the founding of the International Federation of Classification Societies (IFCS).
1994: BusinessWeek published a story on the new phenomenon of "Database Marketing." It described the process by which businesses were collecting
and leveraging enormous amounts of data to learn more about their customers, competition, or advertising techniques. The only problem at the time
was that these companies were flooded with more information than they could possibly manage.
1990s and early 2000s: Data science emerged as a recognized and specialized field.
2000s: Technology made enormous leaps by providing nearly universal access to internet connectivity, communication, and (of course) data collection.
2005: Big data enters the scene. With tech giants such as Google and Facebook accumulating large amounts of data, new technologies capable of
processing it became necessary. Hadoop rose to the challenge, and later on Spark and Cassandra made their debuts.
2014: Due to the increasing importance of data, and organizations’ interest in finding patterns and making better business decisions, demand for data
scientists began to see dramatic growth in different parts of the world.
2015: Machine learning, deep learning, and Artificial Intelligence (AI) officially entered the realm of data science. These
technologies have driven innovations over the past decade, from personalized shopping and entertainment to
self-driving vehicles, along with the insights needed to bring these real-life applications of AI into our daily lives efficiently.
2018: New regulations in the field are perhaps one of the biggest developments in the evolution of data science.
2020s: We are seeing additional breakthroughs in AI and machine learning, and an ever-increasing demand for qualified
professionals in Big Data.
Applications of Data Science
2. Programming Languages
Python is used because of its capacity for statistical analysis and its easy readability. Python also has rich libraries and various packages for
Machine Learning, data visualization, data analysis, etc. that make it suited for data science.
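As a small illustration of Python's suitability for statistical analysis, the sketch below summarizes a sample using only the standard-library `statistics` module. The `daily_orders` values are made-up numbers chosen for the example, not data from the course.

```python
import statistics

# Hypothetical sample: daily order counts for one week (made-up numbers).
daily_orders = [12, 15, 11, 19, 15, 22, 18]

# Basic descriptive statistics from the standard library.
summary = {
    "mean": statistics.mean(daily_orders),
    "median": statistics.median(daily_orders),
    "mode": statistics.mode(daily_orders),
    "stdev": statistics.stdev(daily_orders),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In practice, third-party libraries are commonly used for larger datasets, but the standard library is enough to show how readable such an analysis can be.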
3. Machine Learning
Machine Learning is all the rage in Data Science these days! It enables machines to learn a task from experience without being explicitly
programmed for it. This is done by training machine learning models on data using different algorithms. So you
need to be familiar with Supervised and Unsupervised Machine Learning algorithms such as Linear Regression, Logistic Regression, K-means
Clustering, Decision Trees, and K Nearest Neighbors.
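To make the supervised/unsupervised distinction concrete, here is a minimal sketch in plain Python: a simple linear regression fitted by ordinary least squares (supervised, because each input comes with a known answer) and a one-dimensional K-means (unsupervised, because no labels are given). The function names, the toy data, and the fixed starting centroids are assumptions made for illustration; real projects would normally use an established library.

```python
def fit_linear_regression(xs, ys):
    """Supervised: fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def k_means_1d(points, start_centroids, iterations=10):
    """Unsupervised: group 1-D points around centroids, starting from the given guesses."""
    centroids = list(start_centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster mean (unchanged if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Toy labeled data (made up): hours studied -> exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]
slope, intercept = fit_linear_regression(hours, scores)
predicted_score = slope * 6 + intercept  # prediction for an unseen input

# Toy unlabeled data (made up): two visibly separate groups of values.
values = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids, clusters = k_means_1d(values, start_centroids=[0.0, 5.0])

print(slope, intercept, predicted_score)
print(centroids, clusters)
```

The regression learns its parameters from labeled examples and then predicts a new value, while K-means discovers the two groups without being told which point belongs where.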
4. Cloud Services
More and more companies are moving their databases to the cloud over time. This could be a move to a public, private, or hybrid
cloud, with the most popular contenders being Amazon Web Services and Microsoft Azure. Most companies are also moving big data and
analytics applications to the cloud, so a Data Scientist needs to understand these cloud services well enough to
perform data analytics effectively.
5. SQL
You should be able to write and execute complex queries in SQL that help in carrying out analytical functions and changing the
database as required. As a Data Scientist, you need to be proficient in SQL so that you can access the data easily and work on it. SQL can
give you deep insights into a database, depending on your query.
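The kind of analytical query and database change described here can be sketched with Python's built-in `sqlite3` module and an in-memory database. The `orders` table, its columns, and the sample rows are hypothetical and exist only for this example.

```python
import sqlite3

# In-memory database with a hypothetical "orders" table (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("asha", 120.0), ("ben", 80.0), ("asha", 60.0), ("ben", 40.0), ("chen", 200.0)],
)

# Analytical query: order count and total spend per customer,
# keeping only customers whose total exceeds 150, largest first.
rows = conn.execute(
    """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > 150
    ORDER BY total DESC
    """
).fetchall()
print(rows)

# Changing the database: apply a 10% discount to one customer's orders.
conn.execute("UPDATE orders SET amount = amount * 0.9 WHERE customer = ?", ("chen",))
updated_amount = conn.execute(
    "SELECT amount FROM orders WHERE customer = ?", ("chen",)
).fetchone()[0]
print(updated_amount)

conn.close()
```

The `?` placeholders pass values separately from the SQL text, which is the usual way to avoid building queries by string concatenation.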
Tools for Data Science
1. Python
2. R
3. SQL
4. Hadoop
5. Tableau
6. Weka
The Importance of Ethical Data Usage
Data scientists are at the heart of data: they hold data that can drive powerful decisions and shape
the future. Because data is so valuable, maintaining ethical standards is not merely an obligation but
a fundamental part of a data scientist's role in ensuring responsible data usage.
Ethical data usage is the foundation of trust. When individuals provide their data to organizations or
platforms, they expect it to be handled with integrity and basic ethics. Respecting their privacy is the most
important part, and it also strengthens the organization's reputation.
Bias Mitigation
Identifying and mitigating biases in data and algorithms is critical for fair outcomes.
This includes:
1. Data Audits: Regularly auditing datasets for inherent biases based on demographics or historical
imbalances.
2. Algorithm Fairness: Assessing algorithms to detect and rectify biases in decision-making processes to
ensure fairness across diverse groups.
3. Diverse Representation: Actively seeking diverse perspectives and inclusivity in datasets and model
development to avoid reinforcing existing biases.
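One simple way to start the audits listed above is to compare how each group is represented in a dataset and how often each group receives a positive decision. The sketch below computes group shares and the gap between the highest and lowest selection rates (often called the demographic parity difference). The group labels and decisions are made-up records, and this is only one of many possible fairness checks, not a complete audit.

```python
from collections import defaultdict

# Hypothetical records: (group label, model decision), where 1 = approved, 0 = rejected.
decisions = [
    ("A", 1), ("A", 1), ("A", 0), ("A", 1),
    ("B", 1), ("B", 0), ("B", 0), ("B", 0),
]


def group_shares(records):
    """Data audit: fraction of the dataset that belongs to each group."""
    counts = defaultdict(int)
    for group, _ in records:
        counts[group] += 1
    return {g: counts[g] / len(records) for g in counts}


def selection_rates(records):
    """Algorithm fairness: fraction of positive decisions within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, outcome in records:
        totals[group] += 1
        positives[group] += outcome
    return {g: positives[g] / totals[g] for g in totals}


shares = group_shares(decisions)
rates = selection_rates(decisions)
# Gap between the most and least favored group; 0 would mean equal selection rates.
parity_gap = max(rates.values()) - min(rates.values())

print(shares)
print(rates)
print(parity_gap)
```

A large gap does not by itself prove unfairness, but it flags where the data and the model deserve a closer look.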
Data Privacy and Consent
Respecting data privacy laws and obtaining informed consent are foundational principles:
a. Informed Consent: Clearly communicating to individuals how their data will be used, ensuring they understand
and agree to its usage.
c. Compliance: Adhering to legal frameworks such as GDPR, HIPAA, or CCPA to ensure lawful and ethical data
handling.