Data Science Part 2

Class 10 AI

Uploaded by

banani1776

Chapter: Data Science

Data Collection
1) Collecting data is not difficult; analysing it is. This is where Data
Science is required.
2) Data Analysis provides insight into the collected data and adds value to the
dataset. It helps AI machines make predictions and suggestions.
3) The data used in Data Science is mostly numeric or alphanumeric and is
stored in the form of tables or databases.

Sources of Data

1) Data of any required type can be collected from various sources. Some of
the available sources are:
a) Online Mode: Open-source Govt. portals, WHO websites.
b) Offline Mode: Surveys, experiments, personal interviews etc.
2) While handling data online or offline, the following points should be
remembered:
a) The source of data should be authentic and reliable.
b) Authentic data is a must for proper training of an AI model.
c) The privacy of the data source should always be kept in mind.
d) Consent should be taken from the owner before using their personal
data.
e) Data present in the public domain should preferably be used, if
available.

Types of Data

1) The most suitable way of storing a dataset is in the form of tables.
2) The following are some of the popular tabular formats for storing data:
a) Spreadsheet: Data stored in the form of rows and columns under a
filename, using a spreadsheet application. Some popular spreadsheet
applications are MS Excel, Open Office Spreadsheet etc.
b) Comma Separated Values (CSV): These are files with the extension .csv
that contain records with each value separated by a comma. These files
can be created using Excel, Google Sheets etc.
c) Structured Query Language (SQL): A query language that is used to store,
manage and retrieve data from a DBMS (Database Management System).
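
The CSV and SQL formats above can be tried out with a minimal sketch using
only Python's built-in csv and sqlite3 modules. The student records, the
column names and the table name "students" are made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical student records in CSV form: one record per line,
# values separated by commas, first line holding the column names.
CSV_TEXT = """name,grade,marks
Asha,10,82
Ravi,10,74
Meena,10,91
"""


def read_csv_records(text):
    """Parse CSV text into a list of dictionaries, one per record."""
    return list(csv.DictReader(io.StringIO(text)))


def load_into_sql(records):
    """Store the records in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE students (name TEXT, grade INTEGER, marks INTEGER)"
    )
    conn.executemany(
        "INSERT INTO students (name, grade, marks) VALUES (?, ?, ?)",
        [(r["name"], int(r["grade"]), int(r["marks"])) for r in records],
    )
    conn.commit()
    return conn


records = read_csv_records(CSV_TEXT)
conn = load_into_sql(records)

# Retrieve data with an SQL query: names of students scoring above 80.
top_scorers = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM students WHERE marks > 80 ORDER BY name"
    )
]
print(top_scorers)
```

In a real project the CSV text would normally come from a .csv file opened
with open(), and the database would be a file or a server rather than memory.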

Issues Related to Data

While collecting the data needed for Data Science, we might face some
issues like:

a) Erroneous Data: It means the values in a dataset are not what is
expected in that position. There are two ways in which the data can be
erroneous:
i) Incorrect Values: The values at random places in the dataset are not
correct. Either the data is mismatched or it is not relevant to that
position. For example, the phone number column has an eight-digit
landline number instead of a 10-digit mobile number.
ii) Invalid or Null Values: It means the value is either corrupted or has
no meaning. When these values occur in a dataset they need to be
removed, as they hold no value for data processing. For example, a
phone number column that is not filled in appropriately.
b) Missing Data: It means data is not present at the desired location of a
dataset. Missing data is not erroneous data. A dataset with missing values
is considered an incomplete dataset. For example, an email address or pin
code missing in a set of student details.
c) Outlier Data: It means data that differs drastically from the rest of the
data. This kind of unusual data needs to be removed from the dataset or
replaced for accurate results. For example, a student who was absent is
given zero marks instead of being marked as exempt. Including this zero
will not give an accurate class average.
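
The three kinds of issues above can be spotted with a few simple checks. The
following is a minimal sketch in plain Python, not a complete cleaning
routine. The records, the field names and the 10-digit phone rule are
assumptions made for illustration.

```python
# Hypothetical student records; None marks a value that was never filled in.
students = [
    {"name": "Asha", "phone": "9876543210", "email": "asha@example.com",
     "marks": 82, "absent": False},
    {"name": "Ravi", "phone": "22334455", "email": "ravi@example.com",
     "marks": 74, "absent": False},
    {"name": "Meena", "phone": "98x65", "email": None,
     "marks": 91, "absent": False},
    {"name": "Kabir", "phone": "9123456780", "email": "kabir@example.com",
     "marks": 0, "absent": True},
]


def phone_status(phone):
    """Classify a phone value as ok, incorrect, invalid or missing."""
    if phone is None or phone == "":
        return "missing"
    if not phone.isdigit():
        return "invalid"      # corrupted value, e.g. contains letters
    if len(phone) != 10:
        return "incorrect"    # e.g. an eight-digit landline number
    return "ok"


# Erroneous data: check every phone number.
statuses = {s["name"]: phone_status(s["phone"]) for s in students}

# Missing data: students whose email address was never recorded.
missing_email = [s["name"] for s in students if s["email"] is None]

# Outlier data: the zero given to an absent student drags the average down.
all_marks = [s["marks"] for s in students]
present_marks = [s["marks"] for s in students if not s["absent"]]
average_with_outlier = sum(all_marks) / len(all_marks)
average_without_outlier = sum(present_marks) / len(present_marks)

print(statuses)
print(missing_email)
print(average_with_outlier, average_without_outlier)
```

Here the absent student's zero is left out of the average, which is one
possible way of handling an outlier. Replacing the value is another.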

Python for Data Science

1) Data Science uses a combination of Python and mathematical concepts
like Statistics, Data Analysis, Probability etc.
2) Python is a simple, easy-to-write language that is well suited to this
work, and it can handle the highly complex mathematical processing
required to develop applications using AI.
3) Various packages for different purposes are available for free for use
in Python.
4) Some of the open-source packages needed for AI are:
a) NumPy: Numerical Array Data Handling Package. It is used for data
analysis and calculations on large numerical datasets.
b) OpenCV: Image Processing Package. It is used for manipulating and
processing images, for example cropping, resizing, editing etc.
c) Matplotlib: Data Visualization Package. It is used to produce
high-quality graphical representations of numerical data.
d) NLTK (Natural Language Tool Kit): Natural Language Package. It helps in
tasks related to textual data.
e) Pandas: Tabular Data Handling Package. It is used for data arranged in
two or more dimensions, typically in tabular form coming from
spreadsheets or database software.
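
As a small illustration of combining Python with Statistics and Probability,
the sketch below uses Python's built-in statistics module as a stand-in.
Packages such as NumPy and Pandas offer similar calculations for much larger
datasets but are not shown here. The marks and the pass line of 80 are made
up for illustration.

```python
import statistics

# Hypothetical marks of five students in one test.
marks = [82, 74, 91, 68, 85]

# Basic statistics on the data.
mean_marks = statistics.mean(marks)      # average value
median_marks = statistics.median(marks)  # middle value after sorting
spread = statistics.pstdev(marks)        # population standard deviation

# A simple probability estimate: the share of students scoring above 80.
prob_above_80 = sum(1 for m in marks if m > 80) / len(marks)

print(mean_marks, median_marks, round(spread, 2), prob_above_80)
```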
