Infodump and Excel

From Mefi Wiki

Jump to: navigation, search

[edit] Opening infodump files in Excel

Get & unpack the file. I'm using MetaTalk post stats for this, because it's one of the smaller files.

Open the file. It probably won't show the file by default - instead of showing Files of type: All Excel Files, show Text Files instead.

Note that many of the Infodump files have enough rows to overwhelm Excel. See Infodump#File_size for file sizes & spreadsheet limits.

The Text Import Wizard should start:

  • Step 1: It's delimited, not Fixed width.
  • Step 2: Select only Tab as the delimiter.
  • Step 3: Finish.

The datestamp format is a bit off: instead of showing "2000-03-03 18:22:59.870", it shows up as "22:59.9" here. The full datestamp is still there, it's just not being displayed. You can change it to your liking - I like the format "3/14/01 1:30 PM".

[edit] Do Stuff

The first line is the datestamp for the database dump. This can get in the way.

The second line is a header, labelling each field type.

You could now "Select All," go to Data then Sort, tell Excel that there is a header row, & find

  • the threads with the most comments - in Metatalk, post 15931 currently has the most, 3649
  • the most favorites - post 15606 currently has the most, 270.

Filtering by deleted values of 1 shows only the deleted posts - sorting by userid then scanning shows that userid 191 has the most deleted posts in MetaTalk. (That would be admin pb, deleting test posts.)

[edit] Sorting popular tags

I'm using Excel 2007, which supports more rows. Tagdata_mefi has 366k rows, with multiple rows for posts with multiple tags. I removed the date line, then sorted the tag column alphabetically. Googling "count uniq rows excel" (I was looking for equivalents for the unix command uniq) led me to a way to count tags using a pivot table. I copied the pivot table report & used Paste Special to put the values into a new spreadsheet, which I then sorted by Count of tag_name, & started scanning.

There were 79k unique tags.

For comparison, using Debian Linux tools, here's the command to produce the equivalent list: "sort -f -k 5 tagdata_mefi.txt | uniq -c -i -f 4 | sort -rn | less" That leaves some unnecessary columns.

Personal tools