Handling Spreadsheets in Ruby: An Essay (Computer Science, NITIE)
Introduction
CSV stands for Comma Separated Values, and it’s a standard format for
storing spreadsheet data. Modern internet systems often utilize this type
of file to provide import and export functionality. Ruby makes it easy and
enjoyable to parse those files in a modern, object-oriented way. This
book is a perfect starting point for anyone who is not familiar with CSV, as well as for developers who work with CSV files daily but face some common problems and are looking for performance tips.
This book covers all of the essential things about parsing CSV files with
Ruby. It will be a from zero to hero journey.
I wrote the contents of the book based on the newest version of Ruby, and I will keep updating it each time a new version of Ruby becomes available.
You don’t have to worry that some of the code snippets presented here will become outdated at some point in the future. The goal of this book is to always provide you with the most relevant and valuable knowledge.
Before we get our hands dirty by writing Ruby code, it’s good to learn
something more about the CSV format itself. Such an introduction
always gives us a better understanding of the topic and makes us
smarter even when programming with other languages.
First name   Age   Location
Tim          33    Berlin
Kate         29    New York
If we would like to export the table to the CSV format, it could look like the following:
first_name,age,location
Tim,33,Berlin
Kate,29,New York
By looking at the above example, we can quickly write down the critical
elements of the CSV format:
● The file starts with the line where headers are listed
● Each file entry is separated with a new line
● Columns are usually separated by the , or ; character
The CSV format is used everywhere there is a need to import some data into a system or export data from it. Such a file is also human-readable, so we can create it by hand. An excellent example of a system that exports data in the CSV format is an online banking system that allows you to export your transactions. You can then import such a file into software that helps you keep track of your expenses or grow your savings.
You can also come across files in the CSV format in many other places.
Even though the CSV format is quite simple, you can run into some
problems when parsing files in that format:
● Parsing large files - usually, CSV files are pretty small, but
sometimes a single file can weigh even a few gigabytes. Opening
and parsing such files can be a tricky thing. I will cover that topic
later and give you some tips and good practices that you can use
later to make it less painful and more performant.
● Badly formatted files - most of the time, you will have to work with
files generated automatically. Still, even those can contain some
poorly formatted values that can make the parsing harder.
● Different encoding - sometimes it happens that you receive values
with a different encoding than you expected. Things might get a
little bit difficult if you have to transform the characters into a
readable format or remove the invalid part without touching the
valid one.
The problems I mentioned above are pretty common and are reported many times around the internet. The solutions for them are not that hard to implement, but it’s sometimes time-consuming to find the exact fix, so I will guide you through fixing those problems as well.
You might look for some more advanced solutions if you would like to
have more flexibility in creating the CSV files, checking the differences
between two CSV files, or accessing a file’s contents in a non-standard
way. Thankfully, the Ruby community already provided a bunch of Ruby
gems to achieve all of this.
Before we start writing some code, let’s prepare a test CSV file that we
will be working on. Visit
https://longliveruby.com/books/mastering-csv-files/file.csv and
download a sample CSV file. I will use its contents to show you how you
can quickly parse the data.
In most cases, there are three ways that you can get the CSV content into your code. I will quickly go through them to show you how to pull the data.
Parsing CSV from a variable
require 'csv'
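# A minimal sketch: the CSV data lives in a plain Ruby string (the sample
# values mirror the table from the introduction), and CSV.parse turns it
# into an array of rows.
csv_string = "first_name,age,location\nTim,33,Berlin\nKate,29,New York"

CSV.parse(csv_string)
# => [["first_name", "age", "location"], ["Tim", "33", "Berlin"], ["Kate", "29", "New York"]]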
As you can see, you can simply pass the variable that contains the data
in CSV format and use CSV.parse to get the formatted output.
require 'csv'
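# A minimal sketch of parsing a CSV file directly; the path is illustrative
# and points at the sample file saved locally.
CSV.read('./file.csv')
# => a multi-dimensional array with one inner array per row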
When using the CSV.read method, you can pass the file path directly. The more explicit way of reading the file is to get the contents of the file first and then pass it to CSV.parse:
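# The explicit variant (a sketch equivalent to the CSV.read call above):
# read the file contents into a string first, then hand it to CSV.parse.
content = File.read('./file.csv')

CSV.parse(content)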
However, the first way of reading the file seems handier.
require 'csv'
require 'open-uri'
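# A sketch of fetching the sample file over HTTP; URI.open comes from the
# open-uri standard library, and the response body is passed to CSV.parse.
content = URI.open('https://longliveruby.com/books/mastering-csv-files/file.csv').read

CSV.parse(content)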
The only difference between this example and the first two is that we open the URL to get the page contents and then pass them directly to the CSV module’s parse method. Simple as that.
# Create a big file
headers = ['id', 'first_name', 'last_name', 'location', 'age']
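# A sketch of generating a big file with the headers above and then reading
# it back row by row (the file name and row count are illustrative).
CSV.open('./big_file.csv', 'w') do |csv|
  csv << headers
  1_000_000.times do |i|
    csv << [i, 'John', 'Doe', 'Berlin', 30]
  end
end

CSV.foreach('./big_file.csv', headers: true) do |row|
  # each row is yielded separately, so the whole file never sits in memory
  puts row['first_name']
end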
The foreach method, instead of loading the whole file into memory,
iterates over the file row by row. Always use it when parsing large files as
you get access to the rows immediately and do not load everything into
memory.
Summary about CSV and parsing modes
Before we jump into the next chapter, let’s quickly summarize the ways
we can use the CSV module to load the data:
● Iterating over every row of the file - we can achieve that by using
CSV.foreach. It’s a perfect solution for parsing large files as we
immediately access the data and do not load everything into
memory.
● Reading and parsing the file - we can achieve that by using
CSV.read. This way is a good solution for smaller files that we
don’t want to open explicitly.
● Parsing the data - if the CSV data is located in the variable, we can
parse it using the CSV.parse method.
In many cases, when parsing CSV data, we don’t want to receive just a multi-dimensional array as a result. Thankfully, we are not limited to accessing the rows by using only indexes.
csv_string = "header1,header2\nvalue1,value2"

CSV.parse(csv_string, headers: true).map do |row|
  row['header1']
end
# => ['value1']

file_path = './file.csv'

CSV.read(file_path, headers: true).map do |row|
  row['header1']
end
# => ['value1']
When you are using CSV with the headers option, you receive a CSV::Table instance, which contains a CSV::Row instance for each row instead of a plain array.
As I mentioned before, you can access a row like a hash, and when you need an actual hash representation, you can simply call to_h on the row:
row.to_h
=> {"first_name"=>"John", "last_name"=>"Doe", "age"=>"35"}
You can also iterate over the header and value pairs of a row by using the each_pair method:

row.each_pair.map do |header, value|
  [header, value]
end
# => [["first_name", "John"], ["last_name", "Doe"], ["age", "35"]]
I want to show you one more cool thing that you can do with a complete
CSV table. If you want to quickly get values only for given columns you
can use the values_at method:
csv = "first_name,last_name,age\nJohn,Doe,35\nTim,Doe,20\nHelen,Doe,30"

CSV.parse(csv, headers: true).values_at('first_name', 'last_name')
# => [["John", "Doe"], ["Tim", "Doe"], ["Helen", "Doe"]]
If you would like to get only the values for one column, first_name for example, you can do that without any effort:
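# A minimal sketch, reusing the csv string from the previous example:
# indexing the parsed CSV::Table with a header name returns that column's values.
CSV.parse(csv, headers: true)['first_name']
# => ["John", "Tim", "Helen"]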
The main rule is to push array headers first and then, for every record we
want to include in CSV, push arrays with values:
users = [
User.new('John', 'Doe', '33'),
User.new('Tim', 'Doe', '25'),
User.new('Tina', 'Doe', '30')
]
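# A sketch of writing the file; the User objects above are assumed to respond
# to first_name, last_name and age (for example a simple Struct).
CSV.open('./users.csv', 'w') do |csv|
  csv << ['first_name', 'last_name', 'age']

  users.each do |user|
    csv << [user.first_name, user.last_name, user.age]
  end
end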
I used CSV.open in the write-only mode, so no error is raised if the file does not exist yet. Also, when a file with the given name exists, the code overwrites it. You can now check the users.csv file, and you will see that we created a standard file in the CSV format with headers and three rows.
If you would like to make your code more readable, you can use the add_row method, which is an alias for <<:
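# The same loop as above, using add_row instead of <<:
CSV.open('./users.csv', 'w') do |csv|
  csv.add_row(['first_name', 'last_name', 'age'])

  users.each do |user|
    csv.add_row([user.first_name, user.last_name, user.age])
  end
end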
Editing existing files
require 'csv'
users = [
User.new('John', 'Doe', '33'),
User.new('Tim', 'Doe', '25'),
User.new('Tina', 'Doe', '30')
]
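# A sketch of one way to edit an existing file: opening users.csv in the "a"
# (append) mode adds the new rows at the end instead of overwriting the file.
CSV.open('./users.csv', 'a') do |csv|
  users.each do |user|
    csv << [user.first_name, user.last_name, user.age]
  end
end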
Modifying existing rows
If you would like to edit an existing row in a CSV file, you have to collect
all rows from the file, find the row you want to update, update it, and then
rewrite the whole file.
If we would like to change the age of Jenny Doe from 23 to 25, we can do this with the following code:
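# A sketch, assuming the users.csv file contains a row for Jenny Doe: load all
# rows, update the matching one, and write the whole file back.
rows = CSV.read('./users.csv', headers: true)

rows.each do |row|
  row['age'] = '25' if row['first_name'] == 'Jenny' && row['last_name'] == 'Doe'
end

CSV.open('./users.csv', 'w') do |csv|
  csv << rows.headers

  rows.each do |row|
    csv << row
  end
end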
Let’s assume that we have the following file named users.csv with the
following contents:
first_name,last_name,age
John,Doe,19
Tim,Doe,25
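Parsing the file the standard way, we can check the age of the first user (a minimal sketch):

CSV.read('./users.csv', headers: true).first['age']
# => "19"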
We get the age value as a string instead of an integer. To get an integer, we have to modify our code and use a converter, as shown in the sketch after the list below. The CSV library ships with the following built-in converters:
● :integer
● :float
● :numeric
● :date
● :date_time
● :all
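Here is a minimal sketch that applies the built-in :integer converter to the file above:

CSV.read('./users.csv', headers: true, converters: :integer).first['age']
# => 19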
Creating your own preprocessor (converter)
CSV::Converters[:float].call("13.1")
# => 13.1
Let’s play with the code a little bit and create an Email value object which
parses the email addresses and provides two methods: domain and
username:
class Email
  def initialize(value)
    @value = value
  end

  def to_s
    @value
  end

  def domain
    @value.split('@').last
  end

  def username
    @value.split('@').first
  end
end
We can now make an email converter that would replace any email
address with the Email object:
csv = "first_name,email\nJohn,[email protected]\nTim,[email protected]"

CSV::Converters[:email] = ->(value) { value.include?('@') ? Email.new(value) : value }

parsed_csv = CSV.parse(csv, headers: :first_row, converters: [:email])

parsed_csv.first['email'].domain
# => doe.com
When writing your own converter, remember two things:
● ensure that you always return a value from the converter
● declare one argument if you want to receive only the value; declare two arguments if you would also like to receive additional information about the given value
csv = "first_name,email\nJohn,[email protected]\nTim,[email protected]"

conv = ->(arg1, arg2) { [arg1, arg2] }

parsed_csv = CSV.parse(csv, headers: :first_row, converters: [conv])

parsed_csv.first['email']
# => ["[email protected]", #<struct CSV::FieldInfo index=1, line=2, header="email">]
Now we have access to the struct that contains the index, line, and
header of the given field. It provides us with a lot of flexibility.
Ruby on Rails and CSV format
Before we generate some CSV files from the Rails application, we have to create the application first, along with some test data. Let’s do it quickly now:
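# A sketch of the setup commands described below; the exact Ruby and Node.js
# versions are illustrative.
rvm use 3.0.0
nvm use 16
gem install rails
rails new csvapp --database=mysql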
In the above snippet, I used RVM to select the newest Ruby version and the NVM tool to select the Node.js version, and then I installed the Rails gem and generated a new project with support for the MySQL database. After a few seconds, the project is ready, and we can develop the first model.
cd csvapp/
rake db:create
bin/rails generate scaffold User first_name:string last_name:string email:string age:integer
rake db:migrate
rails s
Our app is ready. I generated a simple scaffold for the User model, so you can now access the localhost:3000/users/new address and add some new users; we will need them later to manipulate their data. For example, I added:
● Tina Doe, [email protected], age 29
The goal now is to be able to download the CSV list of users when
accessing localhost:3000/users.csv. Create the proper view first:
touch app/views/users/index.csv.erb
Then adjust the index action in app/controllers/users_controller.rb so it responds to the CSV format:

def index
  @users = User.all

  respond_to do |format|
    format.html
    format.json
    format.csv do
      headers['Content-Disposition'] = "attachment; filename=\"users.csv\""
      headers['Content-Type'] ||= 'text/csv'
    end
  end
end
We also need a way to turn a single record into a CSV line, so let’s add a to_csv method to the User model. The csv_headers helper below is an assumption about which columns we want to export:

require 'csv'

class User < ApplicationRecord
  # Assumed helper: the columns to include in the CSV output.
  def self.csv_headers
    %w(first_name last_name email age)
  end

  def to_csv
    ::CSV.generate_line(attributes.slice(*self.class.csv_headers).values)
  end
end
In the view, the first step is to render the headers, and then to iterate over the @users array and call to_csv on each record:
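A sketch of app/views/users/index.csv.erb; it relies on the csv_headers helper assumed in the model above:

<%= ::CSV.generate_line(User.csv_headers) -%>
<% @users.each do |user| -%>
<%= user.to_csv -%>
<% end -%>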
I used <%= … -%> to avoid a blank line between each record as it would
break our CSV with empty rows. Visit localhost:3000/users.csv again,
and now the CSV with data will be downloaded.
Exporting records in the CSV format
You might need to be able to export the records from a given model to the CSV format. It’s good to keep that logic in one place and not repeat it for every model in our codebase. To stay DRY, we will create a model concern:
touch app/models/concerns/csv_exportable.rb
module CsvExportable
  extend ActiveSupport::Concern

  class_methods do
    def export_to_csv(file_path)
      ignore_columns = %w(created_at updated_at)
      headers = self.column_names - ignore_columns

      # Assumed body: write the header row first, then every record's values.
      CSV.open(file_path, 'w') do |csv|
        csv << headers
        all.find_each do |record|
          csv << record.attributes.slice(*headers).values
        end
      end

      true
    end
  end
end
You can now include the module in our User model:
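A minimal sketch of including the concern and calling the new class method (the file path is illustrative):

class User < ApplicationRecord
  include CsvExportable
end

User.export_to_csv('./users_export.csv')
# => true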
Similar to the previous step, where we were exporting records from the
database, I will also create a model concern to make the logic more
reusable across other models. Let’s start with creating the file:
touch app/models/concerns/csv_importable.rb
We also need some file with the data to ensure that our code will work
well with the CSV files. You can save the following rows in the file named
users.csv:
first_name,last_name,email,age,location
John,Doe,[email protected],25,New York
Tim,Doe,[email protected],30,Berlin
Tina,Doe,[email protected],29,Barcelona
We can now open the app/models/concerns/csv_importable.rb file and create a class method that can be easily included in any model to extend it with the functionality of importing records from a CSV file.
The method will be pretty simple: it reads the file row by row, builds the attributes for the given columns, and creates a record for each row:
module CsvImportable
  extend ActiveSupport::Concern

  class_methods do
    def import_from_csv(file_path:, columns:)
      records = []

      # Assumed body: read the file row by row and create a record from the
      # selected columns of each row.
      CSV.foreach(file_path, headers: true) do |row|
        records << create!(row.to_h.slice(*columns))
      end

      records
    end
  end
end
Because I used the create! method, an error will be raised and the process will be stopped when one of the records cannot be saved. You can even wrap the whole loop in a transaction to ensure that we create all records or none if any of them are invalid.
Since I created the above code only for demonstration purposes, I won’t
spend more time refactoring it as it does what it should quite well. We
can now include the concern into our User model:
require 'csv'
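# A sketch of the model including the concern and an example call; the column
# names follow the users.csv file shown above.
class User < ApplicationRecord
  include CsvImportable
end

User.import_from_csv(
  file_path: './users.csv',
  columns: %w(first_name last_name email age)
)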
Using external libraries for CSV can speed up development. Still, you can also successfully write your own code for most cases of processing CSV with Ruby using the knowledge from this book.
I work a lot with large CSV files that contain employee data. Companies often update the data in the system by importing such files frequently.
Doing the full update each time the file is sent to the system might be a
time-consuming process depending on the file size.
The idea to improve the process is to spot the changes, deletions, and
additions between the current file and previous file, so there is no need
to touch the records that didn’t change at all (in my case, most of the
records are the same).
To achieve that, you can use the csv-diff gem. Let me show you how it
works in practice. The first step is to install it:
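gem install csv-diff

You could also add gem 'csv-diff' to your Gemfile and run bundle install.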
Because we will check the differences between two files, let’s create
some test data that we can use. Create the users.csv file with the
following rows:
uid,first_name,last_name,age
1,John,Doe,19
2,Tim,Doe,25
and the file named updated_users.csv that represents the file with the
updated contents:
uid,first_name,last_name,age
1,John,Doe,20
3,Rick,Doe,30
We can now open the ruby console, load the gem and check the
differences between those files:
require 'csv-diff'
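# A sketch of the comparison; with no key specified, the gem treats the first
# column (uid here) as the unique key.
diff = CSVDiff.new('./users.csv', './updated_users.csv')

diff.summary
# => {"Update"=>1, "Add"=>1, "Delete"=>1}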
The library correctly detected that we deleted the row with Tim Doe, added a row for Rick Doe, and updated the age of John Doe. This general information is helpful, but often we need a deeper insight into the changes that were made.
Checking deletions
diff.deletes
# => {"2"=>#<CSVDiff::Algorithm::Diff:0x00007fb5f28539f0
@diff_type=:delete, @fields={"uid"=>"2",
"first_name"=>"Tim", "last_name"=>"Doe", "age"=>"25"},
@row=2, @sibling_position=2>}
In our case, the uid column is the unique column, but you can configure
it how you want. When the column is not explicitly specified, the library
takes the first column, which is the id in most cases.
For the deleted rows, we can check the contents and the position of
the row as well.
Checking additions
diff.adds
# => {"3"=>#<CSVDiff::Algorithm::Diff:0x00007fb5f2852848
@diff_type=:add, @fields={"uid"=>"3", "first_name"=>"Rick",
"last_name"=>"Doe", "age"=>"30"}, @row=2,
@sibling_position=2>}
Like in the case of deletions, we can also check the attributes and the
position of the row for additions.
Checking changes
diff.updates
# => {"1"=>#<CSVDiff::Algorithm::Diff:0x00007fb5f2852dc0
@diff_type=:update, @fields={"uid"=>"1", "age"=>["19",
"20"]}, @row=1, @sibling_position=1>}
The gem also informs the developer when the unique keys are
duplicated, or one of the files does not contain the same type of
columns. To check the warning feature and some additional
configuration possibilities, please check the official library repository:
https://github.com/agardiner/csv-diff
If you don’t want to write your own code for reading and processing CSV files, there is a handy and robust library out there. It’s called Smarter CSV, and it provides a set of methods that will make your life easier. Let’s take a simple users.csv file again:
uid,first_name,last_name,age
1,John,Doe,19
2,Tim,Doe,25
Let’s use it to see how the Smarter CSV library processes it:
require 'smarter_csv'
SmarterCSV.process("./users.csv")
# => [{:uid=>1, :first_name=>"John", :last_name=>"Doe", :age=>19}, {:uid=>2, :first_name=>"Tim", :last_name=>"Doe", :age=>25}]
The library automatically parses the integer values and returns an array
of hashes where the column names are symbols. This is a beneficial
behavior expected by most developers when parsing the CSV file.
It’s also possible to validate the existence of given headers. If a required header is not available in the passed file, the library will raise an error:
SmarterCSV.process("./users.csv", required_headers: %i[first_name email])
# => SmarterCSV::MissingHeaders (ERROR: missing headers: email)
Fixing common problems
I have been working with CSV files for the last few years, and I have typically been dealing with some common errors that are well-known in the Ruby community. I decided to create this section to save you some time searching through StackOverflow or Google, and to address the problematic situations that you can run into when parsing CSV files with Ruby.
If I missed any case you recently dealt with, please contact me, and I’m
happy to update this section as soon as possible.
Typically, the memory usage level is not a problem when parsing CSV files, but it can become more problematic as you parse large files. You should avoid loading all the records into memory at once. I always suggest using the CSV.foreach method, which iterates over every row instead of loading the whole file into memory and processing it afterwards.
Imagine how this approach can speed up a process where you have to
find only one record among millions of other records or insert those
records in the database.
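For example, a minimal sketch of scanning a huge file for a single record without loading the whole file (the file name and lookup value are illustrative):

CSV.foreach('./huge_users.csv', headers: true) do |row|
  if row['email'] == '[email protected]'
    puts row.to_h
    break
  end
end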
Sometimes, the CSV file contains duplicated headers, and a row can
have different values for each occurrence. It is good to know what to
do in such a case.
Standard library
When you are using the standard Ruby library and methods like CSV.parse, CSV.read, or CSV.foreach, the file will be parsed without errors, but only the first occurrence of the given header will be taken into account:
csv = "first_name,email,first_name\nTom,[email protected],Jim"
CSV.parse(csv, headers: true).first['first_name']
# => Tom
The Smarter CSV library, on the other hand, raises an error when the processed file contains duplicated headers:

require 'smarter_csv'
SmarterCSV.process("./users.csv")
# => SmarterCSV::DuplicateHeaders (ERROR: duplicate headers: first_name,first_name)
If you don’t want to ignore the duplications, the standard CSV library allows you to take a row and use the each_pair method on it. With that method, you can collect the duplicated values and process them further if needed:
rows = []
csv = "first_name,email,first_name\nTom,[email protected],Jim"

CSV.parse(csv, headers: true).each do |row|
  attributes = {}

  row.each_pair do |column, value|
    attributes[column] ||= []
    attributes[column] << value
  end

  rows << attributes
end

rows
# => [{"first_name"=>["Tom", "Jim"], "email"=>["[email protected]"]}]
Let’s consider the following case: we are processing a CSV file with a custom encoding, where one of the values contains some special characters:
uid,first_name,last_name,age
1,John,Non spécifié,19
2,Tim,Doe,20
3,Marcel,Doe,30
John’s last name is not specified, and the value is in French. The file encoding is ISO-8859-1. Opening it the standard way will raise an error:
CSV.read("./users.csv")
# => Invalid byte sequence in UTF-8 in line 2. (CSV::MalformedCSVError)
To fix the code, we have to define the encoding explicitly:
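# A sketch: declare that the file is ISO-8859-1 and transcode it to UTF-8
# while reading.
CSV.read("./users.csv", encoding: 'ISO-8859-1:UTF-8')

A related option is col_sep, which helps when a file uses a custom column separator instead of a comma: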
csv = "first_name|last_name\nJohn|Doe"
CSV.parse(csv, col_sep: "|", headers: true).first.to_h
# => {"first_name"=>"John", "last_name"=>"Doe"}
Let’s assume that we are parsing the following file named users.csv:
uid,first_name,last_name,age
1,John,Doe,19
2,Tim,Doe,20
1,John,Doe,19
3,Marcel,Doe,30
The row for John Doe appears twice in the processed file. You can get distinct rows as an array by using the uniq method:
SmarterCSV.process("./users.csv").uniq
Blank lines in a file are another common issue; the Smarter CSV gem will remove them automatically.
Parsing modes
● CSV.read - the method accepts a path to CSV file, loads data
into memory, and parses it
● CSV.foreach - the method accepts a path to a CSV file and returns an enumerator. It does not load the whole file into memory but allows you to iterate over each row
● CSV.parse_line - the method accepts a string which is a
single line from CSV data
Available options
Information is copied from the official documentation.