# How Ruby reads a CSV file:

`require 'csv'` to use the built-in CSV library.

`numbers = CSV.read('/path/to/csv/file')` saves the entire file to a variable.

`CSV.foreach` will read the CSV line-by-line to preserve memory:

```
CSV.foreach('customers.csv') do |row|
  puts row.inspect
end
# ["Dan", "34", "2548", "Lovin it!"]
# ["Maria", "55", "5054", "Good, delicious food"]
# ["Carlos", "22", "4352", "I am \"pleased\", but could be better"]
# ["Stephany", "34", "6542", "I want bigger steaks!!!!!"]
```

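`Array#flatten` will remove the extra level of nesting. `CSV.read` returns an array of rows, and each row is itself an array, so a single-column file comes back as `[["1"], ["2"]]`; `flatten` turns that into `["1", "2"]`.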
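To make the shape of the data concrete, here is a minimal sketch that writes a small single-column CSV to a temporary file and reads it back. The file contents and the use of `Tempfile` are assumptions made only so the example runs on its own.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical sample data: one phone number per row.
file = Tempfile.new(['numbers', '.csv'])
file.write("5157763506\n5157763507\n5157763506\n")
file.close

rows = CSV.read(file.path) # every row is its own array
numbers = rows.flatten     # one flat list of strings

p rows    # [["5157763506"], ["5157763507"], ["5157763506"]]
p numbers # ["5157763506", "5157763507", "5157763506"]

file.unlink # remove the temporary file
```

Note that `flatten` merges every column into one list, so it is only a good fit when each row holds a single value.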

# Find Duplicates

For this example, I have two CSV files: **numbers_in_excel.csv** and **numbers_in_db.csv**.
I’d like to know if numbers_in_excel contains any duplicates.
After importing your CSV files, compare `numbers_in_excel.length` and `numbers_in_excel.uniq.length` to quickly check for duplicates: the difference between the two is the number of redundant entries.
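As a quick illustration of that length check, the sketch below uses a made-up list in place of the imported file.

```ruby
# Hypothetical stand-in for the imported numbers.
numbers_in_excel = %w[100 200 200 300 300 300]

total = numbers_in_excel.length         # every entry, repeats included
distinct = numbers_in_excel.uniq.length # each value counted once
redundant = total - distinct            # entries that repeat an earlier value

puts "#{total} entries, #{distinct} distinct, #{redundant} redundant"
# 6 entries, 3 distinct, 3 redundant
```

This tells you how many entries are repeats, but not which values they are.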

## Common mistake: Why you cannot use Array#- to find dupes

Remember: `Array#-` returns a new array that is a copy of array1, removing any items that also appear in array2. *In other words, it returns only the items from array1 that do not appear in array2*.
So if array2 contains every element found in array1, an empty array is returned, no matter how many times an element is repeated in either array.
Here’s an example:

```
> [1, 2, 3, 4, 5] - [ 1, 1, 2, 2, 3, 3, 4, 5 ]
=> []
> [ 1, 1, 2, 2, 3, 3, 4, 5 ] - [ 1, 2, 3, 4, 5 ]
=> []
```

## Find duplicates using:

### Method 1: Array#select - O(n^2)

```
def duplicate_values(array)
  array.select { |v| array.count(v) > 1 }.uniq
end
```

However, keep in mind that `Array#count` iterates over all the elements in the array, just as `Array#select` does, so the work grows as n^2. For an array of 500K elements this could be cripplingly slow, as it’d do 250 billion comparisons.
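To see the quadratic growth directly, here is a sketch of the same approach instrumented with a counter. The method name `duplicate_values_with_visits` and the counter are additions for illustration only; passing a block to `count` is used just so each comparison can be tallied.

```ruby
# Same idea as duplicate_values, but counting how many elements are visited.
def duplicate_values_with_visits(array)
  visits = 0
  dupes = array.select do |v|
    # count walks the whole array once for every element select hands it
    array.count { |x| visits += 1; x == v } > 1
  end.uniq
  [dupes, visits]
end

dupes, visits = duplicate_values_with_visits([1, 2, 2, 3])
p dupes  # [2]
p visits # 16, which is 4 elements * 4 elements
```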

### Method 2: Array#group_by - O(n)

`Array#group_by` returns a hash where the keys are defined by our grouping rule, and the values are the corresponding objects from our original collection.

In other words, it runs the block once for every item in the array and uses the block’s return value as a hash key. Each key’s value is an array of every item for which the block returned that key. So:

```
# Create a list of names
names = ["Ripley", "McClane", "Ryerson", "Murphy"]
# group by first letter
names_by_letter = names.group_by { |name| name[0] }
# conceptually, the block's return values become the keys:
#{
# "R" => [],
# "M" => []
#}
# and each key's array is filled with the items for which the block returned that key:
#{
# "R" => ["Ripley", "Ryerson"],
# "M" => ["McClane", "Murphy"]
#}
```
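The grouping can also be written out by hand. This is only a sketch of the idea, not how Ruby implements `group_by` internally, and `group_by_sketch` is a hypothetical name.

```ruby
# Hand-rolled grouping: one pass over the items, one hash lookup per item.
def group_by_sketch(items)
  items.each_with_object({}) do |item, groups|
    key = yield(item)             # the grouping rule decides the key
    (groups[key] ||= []) << item  # create the bucket on first use, then append
  end
end

names = ["Ripley", "McClane", "Ryerson", "Murphy"]
by_letter = group_by_sketch(names) { |name| name[0] }

p by_letter["R"] # ["Ripley", "Ryerson"]
p by_letter["M"] # ["McClane", "Murphy"]
puts by_letter == names.group_by { |name| name[0] } # true
```

Because each item is visited once and hash lookups take constant time on average, the whole pass is linear in the size of the array.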

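So, we can easily find duplicates using this. `group_by` returns a hash, so the `select` used here is `Hash#select`, which keeps the key/value pairs for which the block returns true. `map(&:first)` then pulls the key out of each remaining pair: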

```
[1, 2, 2, 3].group_by{ |e| e }
# {1=>[1], 2=>[2, 2], 3=>[3]}
[1, 2, 2, 3].group_by{ |e| e }.select { |k, v| v.size > 1 }
# {2=>[2, 2]}
[1, 2, 2, 3].group_by{ |e| e }.select { |k, v| v.size > 1 }.map(&:first)
# [2]
```

Source: Stack Overflow
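For reuse, the chain can be wrapped in a method. The sketch below also shows a variant built on `Enumerable#tally`, which assumes Ruby 2.7 or newer; both method names are made up for this example.

```ruby
# group_by version from the text, wrapped in a method
def duplicates_group_by(array)
  array.group_by { |e| e }.select { |_, v| v.size > 1 }.map(&:first)
end

# tally variant: counts occurrences directly (assumes Ruby 2.7+)
def duplicates_tally(array)
  array.tally.select { |_, count| count > 1 }.keys
end

sample = [3, 1, 3, 2, 1]
p duplicates_group_by(sample) # [3, 1]
p duplicates_tally(sample)    # [3, 1]
```

Both versions report each duplicated value once, in the order it first appears in the input.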

# Find Existing By Comparing

I wanted to find out how many numbers in CSV file 1 (numbers_in_excel) were present in CSV file 2 (numbers_in_db), as I want to make sure they exist. Sure enough, `Array#-` can help me with this.

My original approach:

```
numbers_in_excel.length
# => 982
numbers_in_db.length
# => 5678
(numbers_in_excel - numbers_in_db).length
# => 981
# This is wrong, it should be 30. There were 30 numbers in the Excel file provided that weren't in the db.
```

Digging deeper to find out why the array difference wasn’t working as expected:

```
numbers_in_excel[0]
# => ["5157763506", "DID_PSTN_PROSODY_ANALYTICS_REC"]
numbers_in_db[0]
# => ["5157763506", "DID_PSTN_PROSODY_ANALYTICS"]
```

Notice the second property is not the same. `Array#-` compares whole rows, and two rows are equal only if every column matches, so almost every row is considered unique and we get 981 instead of the 30 expected. Let’s slice it up.

```
# keep only the first column (the number) of each row
def take_out_props(arr)
  arr.map { |num| num.slice(0) }
end
take_out_props(numbers_in_excel)
# => ["5157763506", ...]
take_out_props(numbers_in_db)
# => ["5157763506", ...]
# now both arrays hold plain number strings that can be compared directly
(take_out_props(numbers_in_excel.uniq) - take_out_props(numbers_in_db)).length
# => 30
```
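The whole comparison can be reproduced on a tiny data set. In this sketch the first row of each array comes from the output above, while the `5550000001` rows and their property names are invented for illustration. It also calls `uniq` after extracting the numbers, an assumption that differs slightly from the code above, so that rows sharing a number but differing in the second column collapse into one.

```ruby
numbers_in_excel = [
  ["5157763506", "DID_PSTN_PROSODY_ANALYTICS_REC"],
  ["5550000001", "HYPOTHETICAL_PROP_A"], # made-up row
  ["5550000001", "HYPOTHETICAL_PROP_B"]  # same number, different property
]
numbers_in_db = [
  ["5157763506", "DID_PSTN_PROSODY_ANALYTICS"]
]

# Whole rows never match because the second column differs.
whole_row_diff = numbers_in_excel - numbers_in_db
p whole_row_diff.length # 3

# Compare on the number alone.
excel_numbers = numbers_in_excel.map(&:first).uniq
db_numbers = numbers_in_db.map(&:first)
missing = excel_numbers - db_numbers
p missing # ["5550000001"]
```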