Command line big data
Big data is cool. What is not cool is trying to manipulate it outside of a database. What if, like me, you don’t like databases? What if the data is in text files or CSVs and you want to work with them?
I have recently started a new project collecting human first names. I began with about 250,000 rows of data and needed to do some sanitisation before it was in the state I needed. Having last done this kind of work quite a while ago, I was out of practice, but it soon came back after some trial and error.
I’m making notes this time around.
Splitting
Starting with my master data in a CSV, I initially split it by gender using LibreOffice Calc. A simple filter on the “gender” column shows the relevant rows, which can be copied and pasted into new documents, male.txt and female.txt respectively.
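The same split can be done without leaving the shell. Here is a sketch with awk — the two-column name,gender layout and the “m”/“f” values are my assumptions about the data, not the real file:

```shell
# Toy stand-in for the master CSV; the column layout and the
# "m"/"f" gender values are assumptions, not the real data.
printf 'Alice,f\nBob,m\nCarol,f\n' > sample.csv

# awk splits on the comma and routes each name by the gender column.
awk -F, '$2 == "m" { print $1 }' sample.csv > male.txt
awk -F, '$2 == "f" { print $1 }' sample.csv > female.txt
```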
I then wanted to:
- Remove all rows containing hyphens (Peggy-Sue)
- Remove all rows containing single quotes (A’isha)
- Strip any trailing spaces
- Order the data alphabetically
- Check data integrity, e.g. search for any diacritics
Remove all rows containing hyphens
I need the rows with hyphens for later, so extract them from the master data:
grep "-" master.txt > hyphens.txt
To remove all rows containing hyphens:
grep -v "-" master.txt > master-2.txt
Remove all rows containing single quotes
To extract the rows containing single quotes AND then remove them, I do the same, this time working on my master-2.txt document, which no longer contains the hyphens:
# Extract
grep "'" master-2.txt > quotes.txt
# Subtract
grep -v "'" master-2.txt > master-3.txt
master-3.txt is now without hyphens and single quotes. I also now have separate hyphens.txt and quotes.txt documents for later reference.
As master-2.txt is an interim file, it can be deleted.
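A quick sanity check is worthwhile at this point: the lines extracted into hyphens.txt plus the lines left behind should add up to the original line count. Shown here with toy stand-in data:

```shell
# Toy stand-in for master.txt, just to illustrate the check.
printf 'Peggy-Sue\nAnn\nJo-Beth\n' > master.txt
grep "-" master.txt > hyphens.txt
grep -v "-" master.txt > master-2.txt

# Extracted + remaining should equal the original line count.
total=$(( $(wc -l < hyphens.txt) + $(wc -l < master-2.txt) ))
echo "$total"   # prints 3, matching master.txt
```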
Strip any trailing spaces
Next, I want to strip any trailing spaces in the file master-3.txt.
sed -i 's/[[:space:]]*$//' master-3.txt
Stripping trailing spaces is best done before any sorting, as stray spaces can affect the order.
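To confirm the strip worked, counting lines that still end in whitespace should give zero. A sketch with a demo file of my own (note that -i as written is GNU sed; BSD/macOS sed wants -i ''):

```shell
# Demo file with a trailing space and a trailing tab.
printf 'Ann \nBea\t\nCyd\n' > demo.txt
sed -i 's/[[:space:]]*$//' demo.txt       # GNU sed; use -i '' on macOS
grep -c '[[:space:]]$' demo.txt || true   # prints 0 when the file is clean
```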
Order the data alphabetically
To sort alphabetically, I prefer byte-wise (ASCII) order, where Apple comes before apple. I also want to remove duplicate lines — plain uniq keeps one copy of each, whereas uniq -u would discard every line that appears more than once:
LC_COLLATE=C sort master-3.txt | uniq > master-sorted.txt
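To see the byte-wise ordering and de-duplication in action (the demo file name is mine):

```shell
printf 'apple\nApple\napple\n' > demo-sort.txt
# With C collation, uppercase bytes sort before lowercase ones,
# and uniq collapses the duplicate "apple".
LC_COLLATE=C sort demo-sort.txt | uniq
# prints:
# Apple
# apple
```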
Check data integrity
This is a nice one. It outputs the letters by frequency (left = more / right = less) and can be used to tell at a glance whether there are any words which contain letters with diacritical marks.
tr -d "\n" < master-sorted.txt | while read -n1 char; \
do echo "$char"; done | sort | uniq -c | sort -rn | sed "s/^.* //" | tr -d "\n"; echo ""
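An alternative that keeps the counts visible uses grep -o to split lines into single characters (the -o flag is in GNU and BSD grep, though not POSIX; the demo file is mine):

```shell
printf 'Ann\nBea\n' > demo-freq.txt
# One character per line, then count and rank.
grep -o . demo-freq.txt | sort | uniq -c | sort -rn
# "n" tops the list here, with a count of 2.
```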
If there are any and this is not desired, the entire file can be “flattened”:
iconv -f utf8 -t ascii//TRANSLIT master-sorted.txt > master-flat.txt
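A quick demo of what //TRANSLIT does — the exact substitutions depend on the iconv implementation and locale, so inspect the output before trusting it:

```shell
# "café" spelled with explicit UTF-8 bytes so the snippet survives
# any editor encoding; the accented é gets an ASCII approximation
# (or a fallback character, depending on the platform).
printf 'caf\303\251\n' | iconv -f utf8 -t ascii//TRANSLIT
```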
It is usually a good idea to run another LC_COLLATE=C sort, as above, after flattening, since it can create new duplicates that need removing.
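That final pass might look like this — master-final.txt and the toy contents are names I have made up, not part of the original workflow:

```shell
# Flattening can turn, say, "Zoë" and "Zoe" into identical lines.
printf 'Zoe\nBea\nZoe\n' > master-flat.txt
LC_COLLATE=C sort master-flat.txt | uniq > master-final.txt
wc -l < master-final.txt   # 2: the duplicate "Zoe" is gone
```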
Working with data in the CLI is not without its challenges, but with a little know-how, the speed is unmatched by any GUI application I’ve used.
