A few days ago, as I write this, Kelly Guimont posted the following question:

Let's say you have a data set made up of a selection of songs by one artist. 
Let's also say you want to slice and dice information from said dataset like how many songs come from each album, or average song length. 

As of now what I want to know is the easiest way to "ask" my data set for this information. Do I have to type everything in a spreadsheet and filter? Can I use this to learn a teeny bit of python or something? 

Suggestions welcome, and thanks in advance!

Discussion ranged widely – really, there are countless ways to do this. Suggestions included pivot tables, Perl, Python (especially using a library called “pandas”), spreadsheets, and SQLite.

I, of course, suggested the really nerdy approach:

@Verso If it’s in, say, a CSV file, you can use awk, sort, and other command line tools in crazy combination to do everything. :)

For a file formatted with “Album, song, time”: 

awk -F, '{print $1}' data.csv | sort | uniq -c

will give you # of songs from each album. (Though any albums with commas will mess it up - that’s where tab separated would be better…)

(You kinda have to be a nerd to prefer this approach to a spreadsheet, but once you get good at it, it’s *amazingly* useful for all kinds of flat text datasets.)

To which Kelly, intriguingly, replied:

@darthnull I like this approach, say more

And so, a post was born.

Seriously, I got all excited…this sounded like a fun post to do, and I started writing it in my head right then. So, here we are.

How useful is this craziness?

Okay, let’s start with a big caveat: this is what I use, very frequently, for ad hoc, on-the-fly, I’ve-got-some-data-in-front-of-me-and-need-answers-fast problems. It’s most definitely not Serious Data Science, though I have used it in cases approaching that problem space.

This is the sort of thing that’s great for quickly answering questions, even detailed ones, using complicated plaintext logs (say, HTTP access logs). Or doing detailed analysis on the latest round of ShmooCon ticket sales. Or exploring a text file with millions of “leaked” records. Or even (along with custom tools for intermediate data parsing) finding trends in days of pcap data.

Bottom line: These are the techniques I use when I’m exploring, and don’t know what I’m looking for, or when I have a task I do occasionally, but not often enough that I’ve bothered to write a full-on tool to automate it.

These are definitely not what a serious data scientist would use, though I’d love to have a reason to start playing in that world, too…it looks like it’d be fun.

Also, these will only take you so far – it’s mostly for looking for patterns in the data, like how many times X shows up with Y. But if you want to start working with the meanings of data, like, say, how many times X appears when Y is between two time values, then you quickly move into the realm of needing a real solution.

Framing the Problem

Turns out, Kelly was interested in processing a list of songs her band performs, as an excuse to learn “a bit of python or SQL or something.” So, I’ll stick with music as my example.

Lately, I’ve resumed listening to a streaming music station called Radio Paradise. They have a great mix, and I’ve been playing it at my desk, piped through the macOS Music app to a pair of HomePod minis. I love it because it exposes me to songs and artists I’ve never heard before (some old, like Dr. John’s I Walk on Guilded Splinters, some new, like Father John Misty’s Mahashmashana album). Plus plenty of old classics I know and love, but had forgotten about.

I didn’t want to lose these gems, so I created a shortcut that activates Siri to identify the track (a bug in Music means I lose the track information when AirPlaying to the HomePods), then asks me if I want to save the song to a playlist.

That playlist now has 145 songs, spanning nearly 12 hours of music. Sounds like a great source of data for this exercise…if I can figure out how to export it to a plain text file… This could be a post in itself: share a URL, download it with curl, discover a massive JSON dataset embedded in the response, extract that JSON, parse it with the jq tool, and pull out the artist, title, album, and time fields. Took me a while, but I eventually got it:

$ cat y | jq -r '.[] | [ .title, .artistName, .tertiaryLinks[0].title, .duration ]| @csv' | head -20 
"Twilight Zone","Golden Earring","Cut",478107
"Jane Says","Jane's Addiction","Nothing's Shocking",292880
"I Put A Spell On You","Nina Simone","Blues on Purpose - EP",156267
"Darkness, Darkness","The Youngbloods","Elephant Mountain",228627
"Abraded","Fall Therapy","You Look Different",330240
"Haboob","Mich Gerber","Tales Of The Wind",353787
"Bedbugs and Ballyhoo","Echo & The Bunnymen","Echo & the Bunnymen (Bonus Tracks Edition) [2004 Remaster]",209120
"I Guess Time Just Makes Fools of Us All","Father John Misty","Mahashmashana",515500
"Luz, Amore y Vida","Santana","Multi-Dimensional Warrior",308053
"Black Horse and the Cherry Tree (Radio Version)","KT Tunstall","Eye To the Telescope",172373
"Somewhere Down the Crazy River","Robbie Robertson","Robbie Robertson",295373
"Sumeru","islandman","Kaybola (Deluxe Version)",297918
"I Can Change","LCD Soundsystem","This Is Happening",355813
"Marlene On the Wall","Suzanne Vega","Suzanne Vega",219173
"Ascent","GoGo Penguin","From the North - GoGo Penguin Live in Manchester",272292
"I Still Believe (Great Design)","The Call","Reconciled",332267
"Royals","Vitamin String Quartet","VSQ Performs the Hits of 2013, Vol. 2",203427
"Road to Joy (Dark-Side Mix)","Peter Gabriel","I/O (Bright-Side and Dark-Side Mixes)",325329
"Opening","Philip Glass","Le maître du minimaliste",384933
"Tuva Groove","Ondar","Back Tuva Future: The Adventure Begins",196600

(I’d already extracted a subset of the JSON data from the curl response…)
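
For the curious, the general shape of that extraction is sketched below – the URL and the sed expression are just placeholders (the real shared-playlist page is messier), but the pattern is the same: fetch the page with curl, carve the embedded JSON blob out of the HTML, and save it somewhere jq can chew on it.

# Hypothetical sketch: placeholder URL, placeholder sed pattern.
# Fetch the page, keep only the JSON buried in a <script> tag, save it as "y".
curl -s 'https://example.com/shared-playlist' \
  | sed -n 's/.*<script[^>]*>\({.*}\)<\/script>.*/\1/p' \
  > y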

Let’s Go!

Okay, let’s start slicing and dicing. In my initial response to Kelly, I’d presumed a CSV (“Comma Separated Values”) file with Album, Song, Time. Here, we’ve got Track, Artist, Album, and Duration (in milliseconds). But as I pointed out, anything with a comma in it is gonna mess things up. Turns out, I can just change @csv to @tsv and output the data in Tab Separated Values format, which is much easier to parse with awk. Woot.

So, let’s get our data into a usable place:

cat y | jq -r '.[] | [ .title, .artistName, .tertiaryLinks[0].title, .duration ]| @tsv' > data

We’ll perform the original query I mentioned in my response - find the number of songs from each album. I don’t actually expect a lot of album repeats here, given the fairly random nature of the source data, but you never know – I’m discovering these results as I write the post. So we’ll sort the output by frequency, just to bubble the interesting results to the top.

$ awk -F\t '{print $3}' data | sort | uniq -c | sort -n -r | head -10
  14 An Awesome Wave
   3 Mahashmashana
   2 Metamorphoses
   2 Dig, Lazarus, Dig!!!
   1 You Look Different
   1 Yoshida Brothers Best Vol One
   1 Who Laughs Last (feat. Kristen Stewart) - Single
   1 Whitewater
   1 War and Peace
   1 Walk Don't Run

Yeah, I didn’t expect much.

Okay, what did all this mean? First, the awk command: Incredibly useful. Like, amazingly so. It’s a full-on text processing language. I used to do all kinds of crazy things in awk, but eventually it just became a very simple “read a line, do something with it” kind of tool for me.

In this case, I’m telling it to:

  • Use a tab as the field separator: -F\t – the -F means “field separator” and \t is the usual code for a tab (which you otherwise can’t really type on the command line). One quirk worth knowing: unquoted, the shell eats the backslash, so awk actually sees -Ft; the awk that ships with macOS happens to treat a lone “t” as a tab, but to be portable (with GNU awk, say) you’d quote it as -F'\t'.
  • Then, for every line, print the third ("$3") field. In this case, that’s album title

After that, I sort the data alphabetically, then collapse repeats using uniq -c. The -c option tells uniq to count how many times each line occurs. (uniq only collapses adjacent duplicate lines, which is exactly why the sort comes first.)

So if “Box of Frogs” shows up three times, you’ll get “3 Box of Frogs” in the output. The “uniq” part is that each distinct line only appears in the output once.

Finally, I sort the results, numerically (-n), in reverse order (-r), and limit the results to the first 10 lines (head -10).
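
(As an aside: awk can do the counting all by itself with an associative array, which skips the sort | uniq -c dance. A quick sketch that’s equivalent in spirit to the pipeline above:)

awk -F'\t' '{count[$3]++} END {for (a in count) print count[a], a}' data | sort -n -r | head -10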

The first album is a mistake – I accidentally added a whole album to my playlist, instead of a single track. I need to clean that up. Otherwise, there aren’t a lot of repeats at all. Who was the artist on these? And…let’s ignore the “Awesome Wave” result:

$ cat data | grep -v 'An Awesome Wave' | awk -F\t '{print $2 " - " $3}' | sort | uniq -c | sort -n -r | head -5
   3 Father John Misty - Mahashmashana
   2 Nick Cave & The Bad Seeds - Dig, Lazarus, Dig!!!
   2 Jean-Michel Jarre - Metamorphoses
   1 Yoshida Brothers - Yoshida Brothers Best Vol One
   1 Woodkid - S16  

I’ve added a pre-processing step: cat data | grep -v 'An Awesome Wave'. This removes any occurrence of that album from the data set. (Well, any line with those words, anywhere – but I doubt there are any artists or tracks which also match it.) Then I print the 2nd field in the data set, a separator " - ", and the third field. And now that I know there aren’t a lot of duplicates, let’s limit to just the first five entries.
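
(If I were worried about a false match, I could make that filter field-aware instead of line-wide – awk can do the comparison against just the album column. A sketch of that variation:)

awk -F'\t' '$3 != "An Awesome Wave" {print $2 " - " $3}' data | sort | uniq -c | sort -n -r | head -5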

What are those few songs from the same album? Let’s grab the album title (we know there are three of them, so head -3), and we’ll just print the last field in the line (“$NF” means “field number {number of fields}” - it’s awkward, I know, but whatever). (heh. AWKward.)
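
(Here’s $NF on its own, with awk’s default whitespace splitting, just to make the idea concrete:)

$ echo 'one two three four' | awk '{print $NF}'
four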

$ cat data | grep -v 'An Awesome Wave' | awk -F\t '{print $2 " - " $3}' | sort | uniq -c | sort -n -r | awk -F- '{print $NF}' | head -3
 Mahashmashana
 Dig, Lazarus, Dig!!!
 Metamorphoses

Now I can filter for just those items. Oh, wait, that won’t work - there’s a leading space (my separator is really " - " with spaces on either end). Let’s delete the leading space:

$ cat data | grep -v 'An Awesome Wave' | awk -F\t '{print $2 " - " $3}' | sort | uniq -c | sort -n -r | awk -F- '{print $NF}' | head -3 | sed -r 's/^ //'
Mahashmashana
Dig, Lazarus, Dig!!!
Metamorphoses

The sed command is a “Stream EDitor”, and lets me make simple changes to lines as they fly by. In this case, s/^ // means “replace a space at the beginning of the line (the ^ anchors the match to the start) with nothing (the empty text between the last two slashes)”. (The -r flag asks for extended regular expressions – that’s the GNU spelling; BSD/macOS sed calls it -E – but this simple substitution doesn’t actually need it.)
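
(On its own, that substitution looks like this:)

$ echo ' hello' | sed 's/^ //'
hello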

Now, loop over those results, and find their entries in the data file:

$ cat data | grep -v 'An Awesome Wave' | awk -F\t '{print $2 " - " $3}' | sort | uniq -c | sort -n -r | awk -F- '{print $NF}' | head -3 | sed -r 's/^ //' | while read line ; do grep "$line" data; done
I Guess Time Just Makes Fools of Us All Father John Misty   Mahashmashana   515500
She Cleans Up   Father John Misty   Mahashmashana   266467
Josh Tillman and the Accidental Dose    Father John Misty   Mahashmashana   312326
Hold On to Yourself Nick Cave & The Bad Seeds   Dig, Lazarus, Dig!!!    350920
Hold On to Yourself Nick Cave & The Bad Seeds   Dig, Lazarus, Dig!!!    350920
C'est La Vie    Jean-Michel Jarre   Metamorphoses   431094
C'est La Vie    Jean-Michel Jarre   Metamorphoses   431094

So I took my original results (up to the sort -n -r), extracted the album field, kept just the first three results, and stripped the leading space. Those results went into a little while loop, which reads each line, one at a time, into a variable called line, then searches for that line, as a single string (which is why I wrapped it in double quotes), in the data file.
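
(One wrinkle worth knowing: grep treats that string as a regular expression, so a title containing characters like parentheses or brackets could behave oddly. Adding -F makes grep match it as a literal string – same results here, fewer surprises later:)

cat data | grep -v 'An Awesome Wave' | awk -F\t '{print $2 " - " $3}' | sort | uniq -c | sort -n -r | awk -F- '{print $NF}' | head -3 | sed -r 's/^ //' | while read line ; do grep -F "$line" data; done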

Looks like a couple of these are duplicates. Which leads us to the next section!

(and also shows me that this really wasn’t the best data set to experiment with. But as I said, this is what I use when I don’t know what to expect, so….)

Repeated Entries?

I’ve been using this “Sounded Cool” shortcut for months, and invariably, I end up adding songs to the playlist that were already there. In general, the music app is good about warning you when you do that, but apparently not when you add something using a shortcut. So…how many tracks are duplicated?

My mind would think (is thinking, as I write this)… track title & artist, uniqued. So: awk to get those fields, sort, uniq -c, then sort the results to put the “more than 1” entries on top.

In reality:

$ cat data | grep -v 'An Awesome Wave' | awk -F\t '{print $1 " - " $2}' | sort | uniq -c | sort -n -r | head -5
   2 Hold On to Yourself - Nick Cave & The Bad Seeds
   2 C'est La Vie - Jean-Michel Jarre
   1 Who Laughs Last (feat. Kristen Stewart) - Lord Huron
   1 White Bird - It's A Beautiful Day
   1 War - Edwin Starr

Not bad - first try. I think. Can I double check that? Let’s just check track title, and no sorting or counting, to make sure I’m looking at the right slice of the data:

$ cat data | grep -v 'An Awesome Wave' | awk -F\t '{print $1}' | head -10
Twilight Zone
Jane Says
I Put A Spell On You
Darkness, Darkness
Abraded
Haboob
Bedbugs and Ballyhoo
I Guess Time Just Makes Fools of Us All
Luz, Amore y Vida
Black Horse and the Cherry Tree (Radio Version)

Seems reasonable. Yeah. I think that worked. (plus, you know, it was already pretty much solved in this example with the “multiple songs from the same album” query…)

Longest songs?

I have a duration field. As I said above, these techniques are great if you’re just processing data as, like, symbols themselves. Tokens. Blocks of text. As soon as you want to process on the meanings of any of these fields, like duration, it gets more difficult.

But, since in this case, duration is just an integer…we can do the basic query very easily:

$ cat data | grep -v 'An Awesome Wave' | awk -F\t '{printf("%s - %s - %s - %s\n", $1, $2, $3, $4)}' | head -10
Twilight Zone - Golden Earring - Cut - 478107
Jane Says - Jane's Addiction - Nothing's Shocking - 292880
I Put A Spell On You - Nina Simone - Blues on Purpose - EP - 156267
Darkness, Darkness - The Youngbloods - Elephant Mountain - 228627
Abraded - Fall Therapy - You Look Different - 330240
Haboob - Mich Gerber - Tales Of The Wind - 353787
Bedbugs and Ballyhoo - Echo & The Bunnymen - Echo & the Bunnymen (Bonus Tracks Edition) [2004 Remaster] - 209120
I Guess Time Just Makes Fools of Us All - Father John Misty - Mahashmashana - 515500
Luz, Amore y Vida - Santana - Multi-Dimensional Warrior - 308053
Black Horse and the Cherry Tree (Radio Version) - KT Tunstall - Eye To the Telescope - 172373

I’ve changed the way awk outputs its data. I could either do something like $1 " - " $2 " - " $3 " - " $4, or I could use printf, as I did there, which gives me much more precise control over the output. Which way I go kinda depends on the situation.
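
(For the record, the concatenation version of that same command looks like this – identical output, just a little harder to read once the format gets fancy:)

cat data | grep -v 'An Awesome Wave' | awk -F\t '{print $1 " - " $2 " - " $3 " - " $4}' | head -10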

Okay, now I’ve got the raw data – how do I find the longest? Duration is the last element, so it’s easy to sort on. Here, I’m telling sort to use a “-” as the field separator (-t -), to use the 4th field as the sort key (-k 4), and to sort numerically (-n):

$ cat data | grep -v 'An Awesome Wave' | awk -F\t '{printf("%s - %s - %s - %s\n", $1, $2, $3, $4)}' | sort -t - -k 4 -n | head
Ascent - GoGo Penguin - From the North - GoGo Penguin Live in Manchester - 272292
C'est La Vie - Jean-Michel Jarre - Metamorphoses - 431094
C'est La Vie - Jean-Michel Jarre - Metamorphoses - 431094
Crabbuckit - K-OS - Joyful Rebellion - 228773
Explain Variation 1 - Jeroen Search - Continuum - EP - 231500
Für Elise - Batiste - Jon Batiste - Beethoven Blues - 244340
Grey Day - Red Noise - Grey Day - Single - 393418
I Put A Spell On You - Nina Simone - Blues on Purpose - EP - 156267
La Inocencia (feat. El Guincho) - Israel Fernández & Diego del Morao - La Inocencia (feat. El Guincho) - Single - 224838
London - breathe. - Breathe. - EP - 209665

Well, damn. That didn’t work. Why not? Oh. Many of these results have a “-” in either the album or artist fields. Okay, let’s change what awk spits out to use “|” as a separator instead. Let’s also reverse the sort ("-r") so the longest tracks come first:

$ cat data | grep -v 'An Awesome Wave' | awk -F\t '{printf("%s | %s | %s | %s\n", $1, $2, $3, $4)}' | sort -t \| -k 4 -n -r | head
Green Grass & High Tides | The Outlaws | Alive In America | 868920
Mojave | Afro Celt Sound System | Anatomic | 635520
Emotional Overload (Allan McLoud Remix) | The Highlander | Revisited | 582000
Also Sprach Zarathustra | Deodato | Prelude (CTI Records 40th Anniversary Edition) | 541173
Kiss Them For Me (Kathak #1 Mix) | Siouxsie & The Banshees | Superstition (Remastered and Expanded) | 535613
Compared to What (Live at The Montreux Jazz Festival, Switzerland, June 1969) | Les McCann & Eddie Harris | Swiss Movement (Montreux 30th Anniversary Edition) [Live] | 516173
I Guess Time Just Makes Fools of Us All | Father John Misty | Mahashmashana | 515500
Selig | Helium Vola | Helium Vola. Special Edition | 497012
Olsen Olsen | Sigur Rós | Ágætis Byrjun | 482947
Twilight Zone | Golden Earring | Cut | 478107

Now that’s much better. I also need to get rid of that Outlaws track - the album version is much better, I really didn’t like the live track. (which, btw, I added to the playlist while I heard it in a restaurant – because the shortcut I made also works on my phone. Very convenient.)

What if I really want to deal with duration as a human-readable time? Well, awk can do some basic math, so I’ll … divide out the milliseconds, then divide by 60 for minutes, and take the result modulo 60 (the remainder) for seconds. (Full disclosure - this took a while to get right):

$ cat data | grep -v 'An Awesome Wave' | awk -F\t '{printf("%s | %s | %s | %d:%02d\n", $1, $2, $3, int($4/60000), ($4/1000) % 60)}' | sort -t \| -k 4 -n -r | head
Green Grass & High Tides | The Outlaws | Alive In America | 14:28
Mojave | Afro Celt Sound System | Anatomic | 10:35
Emotional Overload (Allan McLoud Remix) | The Highlander | Revisited | 9:42
Also Sprach Zarathustra | Deodato | Prelude (CTI Records 40th Anniversary Edition) | 9:01
Selig | Helium Vola | Helium Vola. Special Edition | 8:17
Olsen Olsen | Sigur Rós | Ágætis Byrjun | 8:02
Kiss Them For Me (Kathak #1 Mix) | Siouxsie & The Banshees | Superstition (Remastered and Expanded) | 8:55
I Guess Time Just Makes Fools of Us All | Father John Misty | Mahashmashana | 8:35
Compared to What (Live at The Montreux Jazz Festival, Switzerland, June 1969) | Les McCann & Eddie Harris | Swiss Movement (Montreux 30th Anniversary Edition) [Live] | 8:36
Twilight Zone | Golden Earring | Cut | 7:58

If I wanted to go any more complex (“Show me tracks where time is > 5 minutes but under 10”) then I’d definitely move to python or something else.

The %d:%02d format string tells awk to print the first value as an integer (that’s the minutes here, $4 / 1000 / 60, or $4/60000), and the second value as a two-digit integer with a leading 0 (so that 1 second appears as “01”). One side effect: sort is now comparing the “8:17”-style text numerically, and it stops parsing at the colon, which is why the 8:xx entries above aren’t in perfect order.
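
(If the exact ordering mattered, one fix is to sort on the raw millisecond field first and do the pretty-printing afterwards. A sketch, using bash/zsh’s $'\t' so sort splits on a literal tab:)

sort -t$'\t' -k4 -n -r data | grep -v 'An Awesome Wave' | awk -F'\t' '{printf("%s | %s | %s | %d:%02d\n", $1, $2, $3, int($4/60000), ($4/1000) % 60)}' | head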

Hey, I told you this could get complicated. (I did, didn’t I?)

Is all this really useful?

Oh, yeah. As you can see, it does get awfully complicated (now I’m certain I’ve said that before), but for really quick explorations, it’s fantastic.

Are there better tools? Certainly. I’ve only had brief exposure to tools like Jupyter Notebook, but that kind of has a similar paradigm – explore a little, try a script, see what it gets you, etc., just in a nicely-contained, self-documented format. I should play with that.

But when you’re in a hurry (like, say, there’s been an incident and someone throws a massive syslog file at you), this approach can get you basic, preliminary results, very quickly.

What have we learned?

A quick recap of the command line tools I used here, and what each is for:

  • cat - simply spits a data file out
  • grep - in this case, “grep -v {x}” to filter out {x} from the results. Or can use “grep {x}” to include ONLY those lines with x. I’ll often find myself doing a pre-filter, some steps to parse the data, another grep filter (keeping or discarding, depending on need), more processing, etc.
  • awk - to extract individual fields from the data
  • sort - to, well, sort the data. Useful for putting what you’re interested in at the top of the output, so you can ignore the thousands of lines of noise
  • head - to let you actually ignore the noise by printing only the first X lines
  • uniq -c - used with sort, to count things.
  • sed - to make quick, automated edits to the lines as they go by

Another great tool I use is cut, which lets you extract individual columns of data by number - say, the 10th through 15th characters in a line. It won’t work with data like this (with varying field sizes), but great if you have very well-formatted data of exact field lengths. Like the leaked UDID dump I referred to back at the beginning of this post.
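
(A quick cut example, pulling a fixed range of characters out of a line:)

$ echo 'abcdefghijklmnop' | cut -c 10-15
jklmno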

There are more. I use tr to change case (like, make everything uppercase), or the “-f” option on sort to ignore case while sorting. The tr command is also good for swapping one set of characters for another, if what’s there already is inconvenient in some way, or for deleting characters (or whole classes of characters) that just get in the way.
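
(A couple of quick tr examples – uppercasing everything, and deleting characters that are just in the way:)

$ echo 'Hello, World!' | tr 'a-z' 'A-Z'
HELLO, WORLD!
$ echo 'Hello, World!' | tr -d ',!'
Hello World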

I’m sure there are even more tools out there. I “grew up” in the UNIX world on Ultrix, NeXTStep, and SunOS, in the late 80s and early 90s, so the tools I use are basically what was available on those systems. Those are just what’s stuck in my brain. Occasionally, I can learn new tricks (like jq above), but these are the old standards I keep coming back to.

And my bag of tricks goes on – like, processing a whole bunch of data down to a list of numbers, which I then pipe to awk, so I can print them all in a line with " + " between them, then I add a 0 to the end, and pipe the whole thing to bc to get a grand total. Like… er… Okay, let me show you one last trick.

How long is the playlist?

$ (cat data | grep -v 'An Awesome Wave' | awk -F\t '{printf("%s + ", $4)}' ; echo 0) | bc

Parse error: bad expression
    <stdin>:1

What?

$ (cat data | grep -v 'An Awesome Wave' | awk -F\t '{printf("%s + ", $4)}' ; echo 0)
478107 + 292880 + 156267 + 228627 + 330240 + 353787 + 209120 + 515500 + 308053 + 172373 + 295373 + 297918 + 355813 + 219173 + 272292 + 332267 + 203427 + 325329 + 384933 + 196600 + 231573 + 314133 + 350987 + 868920 + 157907 + 192333 +  + 250598 + 269658 + 122933 + 252755 + 222920 + 280427 + 273107 + 320000 + 236360 + 431094 + 340776 + 219724 + 242733 + 535613 + 287827 + 307413 + 370188 + 273333 + 516173 + 244340 + 232632 + 187744 + 482947 + 203507 + 209665 + 246413 + 251022 + 377560 + 371867 + 202480 + 268440 + 224838 + 263000 + 265480 + 393418 + 148107 + 170000 + 321236 + 259093 + 283107 + 266467 + 280693 + 541173 + 188929 + 285293 + 378144 + 431094 + 386967 + 296253 + 318253 + 210267 + 175333 + 166960 + 266000 + 350920 + 200267 + 331036 + 206467 + 231500 + 288933 + 196243 + 330588 + 210240 + 359173 + 230827 + 287992 + 402987 + 582000 + 213347 + 260027 + 198157 + 244973 + 376200 + 199107 + 295973 + 326658 + 317280 + 299533 + 202600 + 235238 + 331117 + 246307 + 173999 + 635520 + 251643 + 195959 + 209813 + 310547 + 312326 + 228773 + 212345 + 350920 + 257027 + 250000 + 405467 + 458548 + 396921 + 283947 + 497012 + 0

Oh. There’s one track with no duration. Hm.

$ (cat data | grep -v 'An Awesome Wave' | awk -F\t '$4 > 0 {printf("%s + ", $4)}' ; echo 0) | bc
36982743

Okay, now I’m only printing lines in awk if the 4th field (duration) is > 0. So I have a total of… 36,982,743 milliseconds of music.

$ (cat data | grep -v 'An Awesome Wave' | awk -F\t '$4 > 0 {printf("%s + ", $4)}' ; echo 0) | bc | awk '{printf("%d:%02d\n", int($1/60000), ($1/1000) % 60)}'
616:22

Oh. More than an hour. Okay…I can do more dividing and modulo arithmetic to automate that, or I can just say “616 / 60 = 10.2666”, so 10 hours, and “0.2666 * 60” ≈ 16, so… about 10 hours, 16 minutes, 22 seconds. More or less. At some point, when you’re down to singular pieces of data, just use a damned calculator.

36982743 / 1000 = 36982 (milliseconds to seconds)
36982 / 3600 (seconds per hour) = 10.272777... - 10 hours
.272777 * 3600 - 981 seconds
981 seconds / 60 = 16.35 minutes
.35 * 60 = 21 seconds

10:16:21

Not quite exactly what I was getting before, but it’s in the ballpark. Is it a long playlist? Yes. Yes, it is.
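
(And if I did want to automate the full hours:minutes:seconds conversion, one more round of awk arithmetic would do it – a sketch:)

$ echo 36982743 | awk '{printf("%d:%02d:%02d\n", int($1/3600000), int($1/60000) % 60, int($1/1000) % 60)}'
10:16:22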

Anyway, those are some of my tricks. Not always efficient, certainly not always the best approach, but they’ve served me well for many years, and get me good answers, quickly. These answers usually lead to more questions, and eventually, more specialized tools and techniques, but the basics give me a very fast head start on many problems.

Is this useful for anyone else, including you? Who knows. At the very least, now you know a little more about just how my weird mind works. :)