Notes for collecting data from CSV files in Scikit Learn.

Working With CSV Files

CSV files contain comma-separated values. These files are widely used in Machine
Learning.

We mostly use the Pandas Python library for manipulating CSV files.

Reading a CSV file stored locally

import pandas as pd
df = pd.read_csv('file_name')
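As a minimal sketch of this call (using io.StringIO to stand in for a file on disk, with made-up column names, since read_csv accepts both a path and a file-like object):

```python
import pandas as pd
from io import StringIO

# StringIO stands in for a local file; a real path string works the same way.
csv_data = StringIO("a,b\n1,2\n3,4")
df = pd.read_csv(csv_data)
```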

Opening/Reading a CSV file from a URL

import requests
from io import StringIO

url = 'url_here'
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0"}
req = requests.get(url, headers=headers)
data = StringIO(req.text)

df = pd.read_csv(data)
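The key step above is wrapping the response text in StringIO so read_csv can treat it like a file. A minimal offline sketch of that step (with a hypothetical string standing in for req.text, so no network is needed):

```python
import pandas as pd
from io import StringIO

# Hypothetical response body; in practice this would be req.text from requests.get().
response_text = "name,score\nA,1\nB,2"
df = pd.read_csv(StringIO(response_text))
```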

Sep parameter
By default read_csv reads comma-separated data, which is what CSV files contain.
But when we have to deal with TSV files we have to set sep='\t', as they are
tab-separated. E.g.:

df = pd.read_csv('file_name', sep='\t')
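A small self-contained sketch of this (StringIO standing in for a .tsv file, column names made up):

```python
import pandas as pd
from io import StringIO

# Tab-separated data; without sep="\t" pandas would read each line as one column.
tsv_data = StringIO("a\tb\n1\t2\n3\t4")
df = pd.read_csv(tsv_data, sep="\t")
```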


Header row missing / names parameter

When we don't have a heading row for our data, we can explicitly supply column
names using the names parameter of read_csv().

df = pd.read_csv('file_name', sep='\t', names=['', '', '', ''])
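Sketching this with headerless in-memory data (the column names here are hypothetical):

```python
import pandas as pd
from io import StringIO

# No header line in the data; names= supplies the column labels.
data = StringIO("1,2\n3,4")
df = pd.read_csv(data, names=["col_a", "col_b"])
```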

index_col parameter
When our data already contains an index column, pandas still auto-generates a
default integer index, giving us two. We can use our own column as the index
instead with this parameter.

df = pd.read_csv('file_name', index_col='our_index_col_name')
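A minimal sketch, assuming a hypothetical 'id' column that should serve as the index:

```python
import pandas as pd
from io import StringIO

data = StringIO("id,value\n10,a\n20,b")
# 'id' becomes the index instead of the auto-generated 0, 1, 2, ...
df = pd.read_csv(data, index_col="id")
```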

Header parameter
If the real header row is not the first line of the file (so data or junk gets
treated as the header by mistake), we can point at the correct row with the
header parameter of the read_csv method (0-indexed).
pd.read_csv('filename', header=1)
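Sketching this with a made-up junk first line, so the real header sits on row index 1:

```python
import pandas as pd
from io import StringIO

# First line is junk; the real header is the second line (index 1).
data = StringIO("some,junk,line\na,b,c\n1,2,3")
df = pd.read_csv(data, header=1)
```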

usecols parameter
If we want to take/consider only some columns of our data, we can use this
parameter and specify which columns should be read.
pd.read_csv('file_name', usecols=['col1_name', 'col2_name', ...])
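A small sketch with hypothetical columns a, b, c where only a and c are wanted:

```python
import pandas as pd
from io import StringIO

data = StringIO("a,b,c\n1,2,3\n4,5,6")
# Only the listed columns are loaded; 'b' is dropped at read time.
df = pd.read_csv(data, usecols=["a", "c"])
```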

skiprows parameter
If we want to skip specific rows while importing data, we use this in pd.read_csv
as skiprows=[row numbers separated by commas].
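Sketching this with made-up data (note the row numbers are 0-indexed file lines, so the header is line 0):

```python
import pandas as pd
from io import StringIO

data = StringIO("a,b\n1,2\n3,4\n5,6")
# Skip file line 1, i.e. the first data row "1,2" (the header is line 0).
df = pd.read_csv(data, skiprows=[1])
```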

nrows parameter
Used in pd.read_csv if we only want to import a limited number of rows from large data:
pd.read_csv('filename', nrows=100)
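A quick sketch with in-memory data, reading only the first two rows:

```python
import pandas as pd
from io import StringIO

data = StringIO("a,b\n1,2\n3,4\n5,6")
# Only the first 2 data rows are read; the rest of the file is ignored.
df = pd.read_csv(data, nrows=2)
```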

Encoding parameter
Some data sets don't have UTF-8 encoding, which makes read_csv raise an error.
In that case we can specify encoding='correct encoding of that data set'.
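Sketching this with bytes deliberately encoded as Latin-1 (BytesIO stands in for a file saved in that encoding; the names are made up):

```python
import pandas as pd
from io import BytesIO

# Bytes in Latin-1; reading these as UTF-8 would fail or garble the accents.
raw = "name,city\nJosé,Málaga\n".encode("latin-1")
df = pd.read_csv(BytesIO(raw), encoding="latin-1")
```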
Skip bad lines
Some data sets have rows with an uneven number of fields, which pandas cannot
parse. To avoid the error we use error_bad_lines=False (in pandas 1.3+ this was
replaced by on_bad_lines='skip'). This skips the offending row.
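A sketch using the newer spelling (on_bad_lines='skip', available from pandas 1.3 onward), with a made-up malformed row:

```python
import pandas as pd
from io import StringIO

# The second data row has an extra field; on_bad_lines="skip" drops it.
data = StringIO("a,b\n1,2\n3,4,5\n6,7")
df = pd.read_csv(data, on_bad_lines="skip")
```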

dtype parameter
We can change the data type for a specific column with dtype={'col_name': data_type}.
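Sketching this with a hypothetical 'id' column kept as strings instead of the inferred integers:

```python
import pandas as pd
from io import StringIO

data = StringIO("id,amount\n1,2.5\n2,3.5")
# Force 'id' to be read as strings rather than inferred as integers.
df = pd.read_csv(data, dtype={"id": str})
```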

Handling dates
In data sets, date and time are stored as strings, so we cannot use Python
datetime operations on them directly. To change that we can use
parse_dates=[names of the columns you want stored as dates, separated by commas].
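A small sketch with a hypothetical 'day' column parsed into a proper datetime dtype:

```python
import pandas as pd
from io import StringIO

data = StringIO("day,sales\n2021-01-01,5\n2021-01-02,7")
# Without parse_dates, 'day' would stay an object (string) column.
df = pd.read_csv(data, parse_dates=["day"])
```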

Converters
Sometimes we want to transform values while the data set is being loaded; we use
this for that. E.g. turning "Royal Challengers Bangalore" into "RCB".
converters={'col_name_where_we_want_to_apply': func_for_transformation}
E.g.:

def rename(name):
    if name == "Royal Challengers Bangalore":
        return "RCB"
    else:
        return name
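Putting the pieces together as a runnable sketch (the 'team' column and its values are made up for illustration):

```python
import pandas as pd
from io import StringIO

def rename(name):
    # Hypothetical transformation applied to each value as it is loaded.
    return "RCB" if name == "Royal Challengers Bangalore" else name

data = StringIO("team,runs\nRoyal Challengers Bangalore,180\nOther,150")
df = pd.read_csv(data, converters={"team": rename})
```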

na_values parameter
We can specify which values should be considered NA values in the data set, e.g.
hyphens should be considered NA values: na_values=['-']. Multiple values are
separated by commas.
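Sketching the hyphen example with made-up data:

```python
import pandas as pd
from io import StringIO

# "-" marks a missing score; na_values turns it into NaN on load.
data = StringIO("name,score\nA,-\nB,7")
df = pd.read_csv(data, na_values=["-"])
```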
Loading data in chunks
We can use chunksize=5000 (for example). While using chunksize we have to run our
operations in a loop, so that they apply to every chunk.
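A minimal sketch of that loop, with tiny in-memory data and chunks of 4 rows (the column and operation are made up; the point is that each chunk is a DataFrame processed in turn):

```python
import pandas as pd
from io import StringIO

data = StringIO("a\n" + "\n".join(str(i) for i in range(10)))
total = 0
# read_csv with chunksize returns an iterator of DataFrames, not one DataFrame.
for chunk in pd.read_csv(data, chunksize=4):
    total += chunk["a"].sum()
```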
