
DSF 1-2

The document provides an overview of data science fundamentals. It discusses how data science uses statistics, data analysis, and machine learning to extract knowledge and insights from data. It explains why data science has become important in the era of big data, where large data sets require machine processing. The document also describes key components of data science like data wrangling, modeling, and visualization. It discusses how data science has evolved from earlier statistical techniques limited by computing power.

DATA SCIENCE FUNDAMENTALS
DSC293
Lecture 1-2
Dr. Hufsa Mohsin
OVERVIEW
 Data Science is a combination of multiple disciplines that uses
 statistics,
 data analysis, and
 machine learning

 to analyze data and to extract knowledge and insights from it.


WHY DATA SCIENCE
 Information is what we want, but data are what we’ve got.
 The techniques for transforming data into information go back hundreds of years.
 Early examples are the “bills”: tabulations that condense data on individual events into a form more readily assimilated by the human reader.
 Constructing such tabulations was a manual operation.
WHY DATA SCIENCE…
 Over the centuries, as data became larger, machines were introduced to speed up the
tabulations.
 Example: Herman Hollerith’s development of punched cards for machine tabulation.
 Also in the late 19th century, statistical methods began to develop rapidly.
 These methods have been tremendously important in interpreting data, but they were not
intrinsically tied to mechanical data processing.
 Generations of students have learned to carry out statistical operations by hand on small sets
of data.
WHY DATA SCIENCE…
 Nowadays, it is common to have data sets that are so large they can be processed only by
machine.
 In this era of big data, data are gathered by networks of instruments and computers. The settings
where such data arise are diverse:
 the genome,
 satellite observations of Earth,
 entries by Web users,
 sales transactions, etc.

 There are new opportunities for finding and characterizing patterns using techniques described as
data mining, machine learning, data visualization, and so on.
 Such techniques require computer processing. Among the tasks that need performing are data
cleaning, combining data from multiple sources, and reshaping data into a form suitable as input
to data-summarization operations for visualization and modeling.
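The three tasks named above (data cleaning, combining data from multiple sources, and reshaping for summarization) can be sketched in a few lines. This is only an illustrative sketch using the Python standard library; the records, the store codes, and the helper names `clean`, `combine`, and `reshape` are all made up for the example.

```python
from collections import defaultdict

# Hypothetical raw records from two sources; all names and values are made up.
sales = [
    {"store": " A ", "month": "Jan", "amount": "100"},
    {"store": "B", "month": "Jan", "amount": "80"},
    {"store": "a", "month": "Feb", "amount": "n/a"},  # unusable amount
    {"store": "B", "month": "Feb", "amount": "120"},
    {"store": "A", "month": "Feb", "amount": "50"},
]
regions = {"A": "North", "B": "South"}  # second source: store -> region


def clean(rows):
    """Data cleaning: normalize store codes, drop rows with a non-numeric amount."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip malformed values
        cleaned.append({"store": row["store"].strip().upper(),
                        "month": row["month"],
                        "amount": amount})
    return cleaned


def combine(rows, lookup):
    """Combining sources: attach the region of each store to its sales rows."""
    return [{**row, "region": lookup[row["store"]]} for row in rows]


def reshape(rows):
    """Reshaping: pivot long rows into a region -> month -> total table."""
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        table[row["region"]][row["month"]] += row["amount"]
    return {region: dict(months) for region, months in table.items()}


summary = reshape(combine(clean(sales), regions))
print(summary)
```

The resulting nested table is the kind of input a plotting or modeling step would consume.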
WHY DATA SCIENCE…
 Data science spans a wide range of capacities that have been described as “data acumen.” Key components of data acumen include
 mathematical, computational, and statistical foundations
 data management and curation
 data description and visualization
 data modeling and assessment
 workflow and reproducibility
 communication and teamwork
 domain-specific considerations
 ethical problem solving
DATA WRANGLING
 The process of preparing data for visualization and other modern techniques of statistical interpretation, and of using those data to answer statistical questions via modeling and visualization.
 It requires the ability to reason statistically and to use computational and algorithmic capacities.
 Tools: R and the packages dplyr and ggplot2
 The focus is on a small subset of functions that accomplish data wrangling tasks in a concise and expressive way.
WHAT IS DATA SCIENCE?
 The science of extracting meaningful information from data.
 Data science can be viewed as a fine-grained blend of intellectual traditions from statistics and computer science.
 Computer science
 It is the creation of appropriate abstractions to express computational structures and the development of algorithms that operate on those abstractions.
 Statistics
 It is the interplay of general notions of sampling, models, distributions, and decision-making.

 Data science is based on the idea that these styles of thinking support each other (Pierson
2016).
WHAT IS DATA SCIENCE?
 Data science is best applied in the context of expert knowledge about the domain from which
the data originate.
 The distinction between data and information is the core of data science.
 Data scientists are people who are interested in converting the data that is now abundant into
actionable information that always seems to be scarce.
STATISTICS AND DATA SCIENCE
 Are the goals of data scientists and statisticians the same?
 Much of statistical technique was originally developed in an environment where data were
scarce and difficult or expensive to collect, so statisticians focused on creating methods that
would maximize the strength of inference one is able to make, given the least amount of data.
 These techniques were often ingenious, involved sophisticated mathematics, and have proven invaluable to the empirical sciences.
 While several of the most influential early statisticians saw computing as an integral part of statistics, it is also true that much of the development of statistical theory was to find mathematical approximations for things that we couldn’t yet compute.
STATISTICS AND DATA SCIENCE
 Today, the manner in which we extract meaning from data is different in two ways:
 We are able to compute many more things than we could before.
 Some of the techniques that were ubiquitous in statistics education in the 20th century (e.g., the t-test, ANOVA) are being replaced by computational techniques that are conceptually simpler but were infeasible until the microcomputer revolution (e.g., the bootstrap, permutation tests).
 We have a lot more data than we had before.
 Many of the data we now collect are observational: they don’t come from a designed experiment, and they aren’t really sampled at random.
 Clinical trials and A/B tests
VS
 A predictive model, an interactive visualization of the data, or a web application that allows the user to engage with the data to explore questions and extract meaning.
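The bootstrap and the permutation test mentioned above are simple enough to sketch with the Python standard library. The code below is a minimal illustration under stated assumptions, not a reference implementation: the two groups of measurements are invented, the interval is a plain percentile bootstrap for the mean, and the function names `bootstrap_ci` and `permutation_test` are hypothetical.

```python
import random
import statistics


def bootstrap_ci(sample, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `sample`."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=len(sample)))
        for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


def permutation_test(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign the group labels
        diff = abs(statistics.fmean(pooled[:len(a)])
                   - statistics.fmean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (extreme + 1) / (n_permutations + 1)


# Invented measurements for two groups.
group_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
group_b = [12.9, 13.1, 12.7, 13.0, 12.8, 13.2]

ci_low, ci_high = bootstrap_ci(group_a)
p_value = permutation_test(group_a, group_b)
print(ci_low, ci_high, p_value)
```

Both procedures replace a formula derived from distributional assumptions with repeated resampling, which is why they only became practical once computing was cheap.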
EVOLUTION OF DATA SCIENCE
DATA SCIENCE LIFE CYCLE
DATA SCIENCE COMPONENTS
DATA PLANNING AND STRATEGY
 Developing a plan or a data strategy simply means determining what data you are going to gather and why.
 It is not the strategy for deciding which mathematical techniques or technologies we are going to use.
 The focus is on the data we need to address the business problem or opportunity, and why.
 Hence, deciding on a strategy requires making a connection between the data and the business goals.
 Gathering and formatting the data, and getting rid of the ‘garbage data’ that doesn’t serve the business goal, is how the strategy yields mission-critical data for business goals.
DATA MINING
 Data mining means analyzing patterns in large batches of data using one or more software tools. It has applications in multiple fields, such as science and research.
 As an application of data mining, businesses can learn more about their customers, develop more effective strategies for their business functions, and leverage resources in an optimal and insightful manner.
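One simple form of pattern analysis on customer data is counting which items are bought together. The sketch below is a hedged illustration only: the shopping baskets are invented, and `frequent_pairs` is a hypothetical helper that counts item pairs by brute force rather than using a dedicated data mining tool.

```python
from collections import Counter
from itertools import combinations

# Invented shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "eggs"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
]


def frequent_pairs(baskets, min_support):
    """Return item pairs whose share of baskets is at least `min_support`."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: count / total
            for pair, count in counts.items()
            if count / total >= min_support}


pairs = frequent_pairs(transactions, min_support=0.6)
print(pairs)
```

A business could use such co-occurrence patterns to decide which products to promote together.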
DATA ENGINEERING
 Data engineering primarily involves creating software solutions for data problems, typically by establishing a data system with data pipelines and endpoints within that system.
 Data engineering requires an in-depth understanding of a wide range of data technologies and frameworks, along with the ability to create data solutions that enable business processes.
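A data pipeline with endpoints can be sketched as three small stages: extract, transform, and load. The example below is a minimal sketch using only the Python standard library; the CSV text, the `users` table, and the stage functions are assumptions made for illustration, with an in-memory SQLite database standing in for a real data store.

```python
import csv
import io
import sqlite3

# Invented source data; the third row has no name and should be dropped.
RAW_CSV = """id,name,signup_date
1, Alice ,2024-01-05
2,Bob,2024-02-17
3,,2024-03-01
"""


def extract(text):
    """Source endpoint: read CSV text into dictionaries, one per row."""
    yield from csv.DictReader(io.StringIO(text))


def transform(rows):
    """Clean each row and drop records without a name."""
    for row in rows:
        name = row["name"].strip()
        if not name:
            continue
        yield int(row["id"]), name, row["signup_date"]


def load(records, conn):
    """Sink endpoint: write the cleaned records into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(id INTEGER PRIMARY KEY, name TEXT, signup_date TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a real database
load(transform(extract(RAW_CSV)), conn)
rows = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
print(rows)
```

Because each stage is a generator feeding the next, rows flow through the pipeline one at a time instead of being held in memory all at once.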
DATA ANALYSIS & MODELS
 Often considered the heart of data science, data analysis and mathematical modeling are about using data to extract insights or make business predictions, and about creating a tool that replaces or supplements what a human does.
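As one small illustration of using data to make a business prediction, the sketch below fits a straight line by ordinary least squares and uses it to predict a new value. The advertising and sales figures are invented, and `fit_line` and `predict` are hypothetical helper names; a real analysis would also assess how well the model fits.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predicted y for a new x under the fitted line."""
    return slope * x + intercept


# Invented data: advertising spend and resulting sales, in arbitrary units.
ad_spend = [1, 2, 3, 4, 5]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(ad_spend, sales)
forecast = predict(6, slope, intercept)
print(slope, intercept, forecast)
```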
DATA VISUALIZATION & OPERATIONALIZATION
 Data visualization is not just presenting the analyzed data correctly; it involves understanding the raw data and what needs to be visualized, based on the needs and goals of the users and the operations.
 Data operationalization turns results into a real-time decision or action by a person, a long-term response, or a recommendation on a specific task.
1. DATA ANALYST
Key skills: SQL, R, SAS, and Python

 Data analysts are responsible for a variety of tasks including visualisation, munging, and
processing of massive amounts of data. They also have to perform queries on the databases
from time to time. One of the most important skills of a data analyst is optimization.
 This is because they have to create and modify algorithms that can be used to cull information from
some of the biggest databases without corrupting the data.
 Roles and Responsibilities:
 Extracting data from primary and secondary sources using automated tools
 Developing and maintaining databases
 Performing data analysis and making reports with recommendations
 Analyzing data and forecasting trends that impact the organization/project
 Working with other team members to improve data collection and quality processes
2. DATA ENGINEERS
Key skills: Hive, NoSQL, R, Ruby, Java, C++, and Matlab

 Data engineers build and test scalable Big Data ecosystems for the businesses so that the data
scientists can run their algorithms on the data systems that are stable and highly optimized.
 Data engineers also update the existing systems with newer or upgraded versions of the
current technologies to improve the efficiency of the databases.
 Roles and Responsibilities:
 Design and maintain data management systems
 Data collection/acquisition and management
 Conducting primary and secondary research
 Finding hidden patterns and forecasting trends using data
 Collaborating with other teams to understand organizational goals
 Making reports and updating stakeholders based on analytics
3. DATABASE ADMINISTRATOR
 The job profile of a database administrator is largely self-explanatory: they are responsible for the proper functioning of all the databases of an enterprise, and they grant or revoke access to them for the employees of the company depending on their requirements. They are also responsible for database backups and recoveries.
 Roles and Responsibilities:
 Working on database software to store and manage data
 Working on database design and development
 Implementing security measures for database
 Preparing reports, documentation, and operating manuals
 Data archiving
 Working closely with programmers, project managers, and other team members
4. MACHINE LEARNING ENGINEER
Key skills: Java, Python, JS
 Machine learning engineers are in high demand today. However, the job profile comes with its challenges. Apart from having in-depth knowledge of technologies such as SQL and REST APIs, machine learning engineers are also expected to perform A/B testing, build data pipelines, and implement common machine learning algorithms for tasks such as classification and clustering.
 Roles and Responsibilities:
 Designing and developing Machine Learning systems
 Researching Machine Learning algorithms
 Testing Machine Learning systems
 Developing apps/products based on client requirements
 Extending existing Machine Learning frameworks and libraries
 Exploring and visualizing data for a better understanding
 Training and retraining systems
 Knowing the importance of statistics in machine learning
5. DATA SCIENTIST
Key skills: R, MatLab, SQL, Python

 Data scientists have to understand the challenges of the business and offer the best solutions using data analysis and data processing. For instance, they are expected to perform predictive analysis and go through “unstructured/disorganized” data with a fine-toothed comb to offer actionable insights. They can also do this by identifying trends and patterns that help companies make better decisions.
 Roles and Responsibilities:
 Identifying data collection sources for business needs
 Processing, cleansing, and integrating data
 Automating the data collection and management process
 Using Data Science techniques/tools to improve processes
 Analyzing large amounts of data to forecast trends and provide reports with recommendations
 Collaborating with business, engineering, and product teams
6. DATA ARCHITECT
Key skills: data warehousing, data modelling, extraction, transformation and load (ETL), Hive, Pig, and Spark
 A data architect creates the blueprints for data management so that the databases can be easily
integrated, centralized, and protected with the best security measures. They also ensure that the
data engineers have the best tools and systems to work with.
 Roles and Responsibilities:
 Developing and implementing overall data strategy in line with business/organization
 Identifying data collection sources in line with data strategy
 Collaborating with cross-functional teams and stakeholders for smooth functioning of database
systems
 Planning and managing end-to-end data architecture
 Maintaining database systems/architecture considering efficiency and security
 Regular auditing of data management system performance and making changes to improve systems
accordingly.
7. STATISTICIAN
Key skills: SQL, data mining, and the various machine learning technologies

 A statistician, as the name suggests, has a sound understanding of statistical theories and data
organization. Not only do they extract and offer valuable insights from the data clusters, but
they also help create new methodologies for the engineers to apply.
 Roles and Responsibilities:
 Collecting, analyzing, and interpreting data
 Analyzing data, assessing results, and predicting trends/relationships using statistical
methodologies/tools
 Designing data collection processes
 Communicating findings to stakeholders
 Advising/consulting on organizational and business strategy based on data
 Coordinating with cross-functional teams
8. BUSINESS ANALYST
Key skills: Power BI, Tableau

 The role of business analysts is slightly different than other data science jobs. While they do
have a good understanding of how data-oriented technologies work and how to handle large
volumes of data, they also separate the high-value data from the low-value data. In other
words, they identify how the Big Data can be linked to actionable business insights for
business growth.
 Roles and Responsibilities:
 Understanding the business of the organization
 Conducting detailed business analysis – outlining problems, opportunities, and solutions
 Working on improving existing business processes
 Analysing, designing, and implementing new technology and systems
 Budgeting and forecasting
 Pricing analysis
9. DATA AND ANALYTICS MANAGER
Key skills: Python, SAS, R, Java
 A data and analytics manager oversees the data science operations and assigns duties to the team according to skills and expertise. Their strengths should include technologies like SAS, R, and SQL and, of course, management.
 Roles and Responsibilities:
 Developing data analysis strategies
 Researching and implementing analytics solutions
 Leading and managing a team of data analysts
 Overseeing all data analytics operations to ensure quality
 Building systems and processes to transform raw data into actionable business insights
 Staying up to date on industry news and trends

 How to Become a Data and Analytics Manager?
 To go down the analytics manager career path, you must first and foremost have excellent social skills, leadership qualities, and an out-of-the-box way of thinking. You should also be good at data science technologies like Python, SAS, R, and Java.
