1. Introduction to Data Science
June 3, 2020
What is data science?
The study of data to find useful insights that help make decisions or solve a problem.
What is data?
Whatever we know or can explain is data, and there are many forms of data.
In data science, though, we deal with digitally stored information, structured or unstructured.
Types of data according to structure
Structured data - lists, Excel sheets, SQL databases
Unstructured data - raw data, logs, audio, video
Semi-structured data - has some kind of structure but is still not fully structured:
JSON, XML
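To make the difference between semi-structured and structured data concrete, here is a minimal sketch that flattens JSON records into fixed-column rows. The records and field names are invented for illustration, and only the Python standard library is used.

```python
import json

# Hypothetical semi-structured records: fields vary from one record to the next.
raw = '''
[
  {"name": "Asha", "age": 31, "skills": ["python", "sql"]},
  {"name": "Ravi", "city": "Pune"}
]
'''

records = json.loads(raw)

# Collect every key that appears in any record to form a fixed set of columns.
columns = sorted({key for record in records for key in record})

# Build structured rows: one value per column, None where a field is missing.
rows = [[record.get(col) for col in columns] for record in records]

print(columns)
for row in rows:
    print(row)
```

Each row now has the same columns, which is what a table in a spreadsheet or SQL database expects.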
Where will I get data from?
Sources of data generation
databases - SQL & NoSQL
warehouses - streaming
social platforms - APIs
websites (reviews, product information) - web scraping
government
servers (log servers) - sockets
sensors (machine equipment) - sockets
surveys - manual or automated task
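As an illustration of the first source in the list above, a SQL database, the sketch below stores and fetches a few rows with Python's built-in sqlite3 module. The in-memory database, the reviews table and its values are assumptions made for the example; a real project would connect to an existing database instead.

```python
import sqlite3

# An in-memory database stands in for a real SQL source (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (product TEXT, rating INTEGER)")
conn.executemany(
    "INSERT INTO reviews (product, rating) VALUES (?, ?)",
    [("laptop", 5), ("laptop", 3), ("phone", 4)],
)
conn.commit()

# Fetch the average rating per product.
query = "SELECT product, AVG(rating) FROM reviews GROUP BY product ORDER BY product"
avg_ratings = conn.execute(query).fetchall()
conn.close()

print(avg_ratings)
```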
What skills should a person have to become a data scientist?
Curiosity - should be able to form relevant questions to answer from data
Communication - should be able to tell a story with the help of data
Programming - should be familiar with at least one programming language that has tools to process data
Databases - should know how to fetch data from and store data to a database
Maths - algebra, calculus, matrices & vectors, statistics, probability
Data Mining & Data Engineering - pre-processing of data to make it suitable for analysis
Data Analysis - explore the data to find answers to questions
Data Visualization - graphs to view data and gain meaningful information that is hidden in the data
Machine Learning - supervised & unsupervised
Deep Learning - neural networks
Big Data Technologies - to process huge amounts of data
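A small sketch of how the statistics and data analysis skills above come together, using only Python's standard statistics module. The page-view numbers are made up for illustration.

```python
import statistics

# Hypothetical sample: daily page views, with one unusually busy day (500).
views = [120, 135, 150, 110, 500, 140, 130]

mean_views = statistics.mean(views)
median_views = statistics.median(views)

# The outlier pulls the mean upward, while the median stays near a typical day.
print(f"mean={mean_views:.1f}, median={median_views}")
```

Comparing the two numbers is a simple example of asking a question of the data: is there an outlier, and what caused it?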
Tools: open source or commercial data science tools used in companies
1. Data Management
2. Data Integration & Transformation
3. Data Visualization
4. Model Deployment
5. Model Monitoring & Assessment
6. Code Asset Management tools
7. Development Environment tools
8. Execution Environment
Report:
Which tools, both open source and commercial, are used in the data science world?
Stats
[1]: from tqdm import tqdm
     from time import sleep

     # Simulate a 15-minute task: 900 iterations of 1 second each, with a progress bar
     for _ in tqdm(range(900)):
         sleep(1)
100%|██████████| 900/900 [15:01<00:00, 1.00s/it]
Our Roadmap
1. Maths: stats, algebra, calculus, matrices & vectors, probability
2. Data Science using Python
   1. NumPy & SciPy modules - to process matrices and apply statistical knowledge to data
   2. Pandas - to pre-process and analyze the data
   3. Matplotlib, Seaborn, Plotly - data visualization
   4. sklearn, TensorFlow, Keras, OpenCV - machine learning & deep learning
   5. PySpark - distributed computing & big data processing
3. The above using R
4. Big data - Hadoop, databases
5. AWS, Linux
(Admin) DevOps -> go through: Ansible, Docker, Kubernetes, Jenkins, OpenShift, OpenStack, cep
Data Pipeline Creation
source -> storage -> processing -> modeling -> monitoring -> optimization
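The pipeline stages above can be sketched as a chain of small functions. This is only a toy illustration: the function names, the sensor readings and the mean-based "model" are all assumptions, and the optimization stage is left out.

```python
import statistics

def source():
    """Source: produce raw readings (hypothetical sensor values, one is missing)."""
    return [21.5, 22.0, None, 23.5, 22.5]

def store(raw):
    """Storage: keep a copy of the raw data (a list stands in for a database)."""
    return list(raw)

def process(stored):
    """Processing: drop missing values."""
    return [x for x in stored if x is not None]

def model(clean):
    """Modeling: a trivial 'model' that predicts the mean of the data."""
    return statistics.mean(clean)

def monitor(prediction, clean):
    """Monitoring: mean absolute error of the prediction on the data."""
    return statistics.mean(abs(x - prediction) for x in clean)

# Run the stages in the same order as the pipeline.
clean = process(store(source()))
prediction = model(clean)
error = monitor(prediction, clean)

print(prediction, error)
```

In a real pipeline each stage would be a separate system (a database, a processing job, a trained model, a monitoring dashboard), but the data still flows through them in this order.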
report -> 1 hr
stats -> 3 hr