Command line big data
Big data is cool. What is not cool is trying to manipulate it outside of a database. What if, like me, you don’t like databases? What if the data is in text files or CSVs and you want to work with them?
I have recently started a new project collecting human first names. I began with about 250,000 rows of data and needed to do some sanitisation before it was in the state I needed. Having last done this kind of work quite a while ago, I was out of practice, but it soon came back after some trial and error.
I’m making notes this time around.
Splitting
Starting with my master data in a CSV, I initially split it by gender using LibreOffice Calc. A simple filter on the “gender” column shows the relevant rows, which can be copied and pasted into new documents, male.txt and female.txt respectively.
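The same split can be done without leaving the shell. Here is a sketch with awk — the two-column name,gender layout and the “m”/“f” values are my assumptions about the data, not the real file:

```shell
# Toy stand-in for the master CSV; the column layout and the
# "m"/"f" gender values are assumptions, not the real data.
printf 'Alice,f\nBob,m\nCarol,f\n' > sample.csv

# awk splits on the comma and routes each name by the gender column.
awk -F, '$2 == "m" { print $1 }' sample.csv > male.txt
awk -F, '$2 == "f" { print $1 }' sample.csv > female.txt
```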
I then wanted to:
- Remove all rows containing hyphens (Peggy-Sue)
- Remove all rows containing single quotes (A’isha)
- Strip any trailing spaces
- Order the data alphabetically
- Check data integrity, e.g. search for any diacritics
Remove all rows containing hyphens
I need the rows with hyphens for later, so extract them from the master data:
grep "-" master.txt > hyphens.txt
To remove all rows containing hyphens:
grep -v "-" master.txt > master-2.txt
Remove all rows containing single quotes
To extract the rows containing single quotes AND then remove them, I do the same, this time working on my master-2.txt document, which no longer contains the hyphens:
# Extract
grep "'" master-2.txt > quotes.txt
# Subtract
grep -v "'" master-2.txt > master-3.txt
master-3.txt is now without hyphens and single quotes. I also now have separate hyphens.txt and quotes.txt documents for later reference.
As master-2.txt is an interim file, it can be deleted.
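A quick sanity check is worthwhile at this point: the lines extracted into hyphens.txt plus the lines left behind should add up to the original line count. Shown here with toy stand-in data:

```shell
# Toy stand-in for master.txt, just to illustrate the check.
printf 'Peggy-Sue\nAnn\nJo-Beth\n' > master.txt
grep "-" master.txt > hyphens.txt
grep -v "-" master.txt > master-2.txt

# Extracted + remaining should equal the original line count.
total=$(( $(wc -l < hyphens.txt) + $(wc -l < master-2.txt) ))
echo "$total"   # prints 3, matching master.txt
```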
Strip any trailing spaces
Next, I want to strip any trailing spaces in the file master-3.txt.
sed -i 's/[[:space:]]*$//' master-3.txt
Stripping trailing spaces is best done before any sorting, as stray spaces can affect the order.
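To confirm the strip worked, counting lines that still end in whitespace should give zero. A sketch with a demo file of my own (note that -i as written is GNU sed; BSD/macOS sed wants -i ''):

```shell
# Demo file with a trailing space and a trailing tab.
printf 'Ann \nBea\t\nCyd\n' > demo.txt
sed -i 's/[[:space:]]*$//' demo.txt       # GNU sed; use -i '' on macOS
grep -c '[[:space:]]$' demo.txt || true   # prints 0 when the file is clean
```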
Order the data alphabetically
To sort alphabetically, I prefer byte-wise (ASCII) order, where Apple comes before apple. I also want to remove duplicate lines — plain uniq keeps one copy of each, whereas uniq -u would discard every line that appears more than once:
LC_COLLATE=C sort master-3.txt | uniq > master-sorted.txt
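To see the byte-wise ordering and de-duplication in action (the demo file name is mine):

```shell
printf 'apple\nApple\napple\n' > demo-sort.txt
# With C collation, uppercase bytes sort before lowercase ones,
# and uniq collapses the duplicate "apple".
LC_COLLATE=C sort demo-sort.txt | uniq
# prints:
# Apple
# apple
```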
Check data integrity
This is a nice one. It outputs the letters by frequency (left = more / right = less) and can be used to tell at a glance whether there are any words which contain letters with diacritical marks.
tr -d "\n" < master-sorted.txt | while read -n1 char; \
do echo "$char"; done | sort | uniq -c | sort -rn | sed "s/^.* //" | tr -d "\n"; echo ""
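An alternative that keeps the counts visible uses grep -o to split lines into single characters (the -o flag is in GNU and BSD grep, though not POSIX; the demo file is mine):

```shell
printf 'Ann\nBea\n' > demo-freq.txt
# One character per line, then count and rank.
grep -o . demo-freq.txt | sort | uniq -c | sort -rn
# "n" tops the list here, with a count of 2.
```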
If there are any and this is not desired, the entire file can be “flattened”:
iconv -f utf8 -t ascii//TRANSLIT master-sorted.txt > master-flat.txt
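A quick demo of what //TRANSLIT does — the exact substitutions depend on the iconv implementation and locale, so inspect the output before trusting it:

```shell
# "café" spelled with explicit UTF-8 bytes so the snippet survives
# any editor encoding; the accented é gets an ASCII approximation
# (or a fallback character, depending on the platform).
printf 'caf\303\251\n' | iconv -f utf8 -t ascii//TRANSLIT
```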
It is usually a good idea to run another LC_COLLATE=C sort, as above, after flattening, since it can create new duplicates that need removing.
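That final pass might look like this — master-final.txt and the toy contents are names I have made up, not part of the original workflow:

```shell
# Flattening can turn, say, "Zoë" and "Zoe" into identical lines.
printf 'Zoe\nBea\nZoe\n' > master-flat.txt
LC_COLLATE=C sort master-flat.txt | uniq > master-final.txt
wc -l < master-final.txt   # 2: the duplicate "Zoe" is gone
```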
Working with data in the CLI is not without its challenges, but with a little know-how, the speed is unmatched by any GUI application I’ve used.
