cybrkyd

Command line big data

 Fri, 23 Jan 2026 17:13 UTC

Big data is cool. What is not cool is trying to manipulate it outside of a database. What if, like me, you don’t like databases? What if the data is in text files or CSVs and you want to work with them directly?

I have recently started a new project collecting human first names. I began with about 250,000 rows of data and needed to do some sanitisation before it was in the state I needed. Having last done this quite a while ago, I was out of practice, but it soon came back after some trial and error.

I’m making notes this time around.

Splitting

Starting with my master data in a CSV, I initially split it by gender using LibreOffice Calc. A simple filter on the “gender” column shows the relevant rows, which can be copied and pasted into new documents, male.txt and female.txt respectively.
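The same split can also be done without leaving the terminal. A minimal awk sketch, assuming a plain name,gender layout with a header row and no quoted commas (master.csv and the column numbers are stand-ins; adjust them to the real file):

```shell
# Route each data row to male.txt or female.txt based on column 2.
# NR > 1 skips the header row.
awk -F, 'NR > 1 && $2 == "male"   { print $1 > "male.txt" }
         NR > 1 && $2 == "female" { print $1 > "female.txt" }' master.csv
```

This breaks down as soon as fields contain quoted commas, at which point a real CSV parser is the safer choice.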

I then wanted to:

  1. Remove all rows containing hyphens (Peggy-Sue)

  2. Remove all rows containing single quotes (A’isha)

  3. Strip any trailing spaces

  4. Order the data alphabetically

  5. Check data integrity, e.g. search for any diacritics

Remove all rows containing hyphens

I need the rows with hyphens for later, so extract them from the master data:

grep "-" master.txt > hyphens.txt

To remove all rows containing hyphens:

grep -v "-" master.txt > master-2.txt
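The extract and subtract steps can also be collapsed into a single pass over the file; a sketch using awk:

```shell
# One read of master.txt: hyphenated rows go to hyphens.txt,
# all other rows go to stdout, redirected into master-2.txt.
awk '/-/ { print > "hyphens.txt"; next } { print }' master.txt > master-2.txt
```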

Remove all rows containing single quotes

To remove all rows containing single quotes AND to extract the rows with single quotes, I do the same, this time working on my master-2.txt document, which no longer contains the hyphens:

# Extract
grep "'" master-2.txt > quotes.txt

# Subtract
grep -v "'" master-2.txt > master-3.txt

master-3.txt is now without hyphens and single quotes. I also now have separate hyphens.txt and quotes.txt documents for later reference.

As master-2.txt is an interim file, it can be deleted.
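If the extraction files were not needed, the two subtract steps could be chained so no interim file is written at all:

```shell
# Drop hyphen rows and quote rows in one pipeline,
# going straight from master.txt to master-3.txt.
grep -v "-" master.txt | grep -v "'" > master-3.txt
```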

Strip any trailing spaces

Next, I want to strip any trailing spaces in the file master-3.txt.

sed -i 's/[[:space:]]*$//' master-3.txt

Stripping trailing spaces is best done before any sorting, as trailing spaces can affect the order.
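To confirm the strip worked (or to preview which rows are affected before running it), grep can list any lines still ending in whitespace:

```shell
# Show line numbers of rows with trailing whitespace;
# no output means the file is clean.
grep -n '[[:space:]]$' master-3.txt
```

Note that grep exits non-zero when nothing matches, which here is the desired outcome.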

Order the data alphabetically

To sort alphabetically, I prefer byte-wise (ASCII) order, where Apple comes before apple. I also want to remove any duplicate lines, which sort -u does in the same pass:

LC_COLLATE=C sort -u master-3.txt > master-sorted.txt
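The difference the C locale makes is easy to see on a tiny sample:

```shell
# In the C locale, upper-case letters (bytes 65-90) sort before
# lower-case ones (97-122), so "Apple" comes out ahead of "apple".
printf 'apple\nApple\n' | LC_COLLATE=C sort -u
```

A locale-aware sort (e.g. en_GB.UTF-8) would typically interleave the cases instead.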

Check data integrity

This is a nice one. It outputs the letters by frequency (left = more frequent / right = less frequent) and can be used to tell at a glance whether any words contain letters with diacritical marks.

tr -d "\n" < master-sorted.txt | while read -rn1 char; \
do echo "$char"; done | sort | uniq -c | sort -rn | sed "s/^.* //" | tr -d "\n"; echo ""
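Where GNU grep is available, its -o flag gives the same counts without the read loop; an equivalent sketch:

```shell
# Split the file into one character per line, then count and rank.
grep -o . master-sorted.txt | sort | uniq -c | sort -rn
```

This prints one character per line with its count, rather than the single condensed line above, which can be handier for spotting stray characters.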

If there are any and this is not desired, the entire file can be “flattened”:

iconv -f UTF-8 -t ASCII//TRANSLIT master-sorted.txt > master-flat.txt

It is usually a good idea to run another LC_COLLATE=C sort as above afterwards, as the transliteration can introduce new duplicates.

Working with data in the CLI is not without its challenges, but with a little know-how, the speed is unmatched by any GUI application I’ve used.
