[ /js/corpus ]

I can remember learning about shift cyphers as a kid, and spending some time encrypting messages by hand on pen and paper. I didn't have anyone interested in decrypting those messages, so it never really developed as a hobby, but it's always been something in which I've been interested.

I haven't been formally enrolled in school lately, but I've been beefing up my skills by studying Cryptography on Coursera (which offers free online courses).

Some of it is review, as I've previously read this excellent book on the history of crypto and cryptanalysis. When I first read the book, however, I had much less experience as a programmer. This time around, I'm implementing all of the algorithms as I watch the videos, and so it's a very different experience.

Many of the functions depend upon conversions between ASCII, hexadecimal, and integer representations. Beyond these simple support functions, there are the actual encryption and decryption algorithms (some of which are identical). Beyond those, there are the analysis functions, which I find particularly interesting, as they bridge my interests in Mathematics and Linguistics.
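To give a feel for those support functions, here is a minimal sketch of the conversions. The helper names (asciiToInts, intsToAscii, intsToHex, hexToInts) are my own placeholders rather than the ones from the course work, and the sketch assumes every character fits in a single byte (0 to 255).

```javascript
// Hypothetical helper names; a sketch that assumes single-byte characters.

// "Hi" -> [72, 105]
var asciiToInts=function(text){
  return text.split("").map(function(c){
    return c.charCodeAt(0); // character -> integer code
  });
};

// [72, 105] -> "Hi"
var intsToAscii=function(ints){
  return ints.map(function(n){
    return String.fromCharCode(n); // integer code -> character
  }).join("");
};

// [72, 105] -> "4869" (two hex digits per byte, zero padded)
var intsToHex=function(ints){
  return ints.map(function(n){
    return ("0"+n.toString(16)).slice(-2);
  }).join("");
};

// "4869" -> [72, 105]
var hexToInts=function(hex){
  return (hex.match(/../g)||[]).map(function(pair){ // read two digits at a time
    return parseInt(pair,16);
  });
};
```

Chaining them in order (text to integers, integers to hex, and back again) should return the original string, which makes for an easy sanity check.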

A wee project

Every now and then the topic of linguistic analysis comes up on IRC, since it's one of the tools typically used to discover the identity of a text's author.

This technique is not necessarily used in modern cryptography, as encryption algorithms have been designed to resist frequency analysis. Nevertheless, a cryptographic adversary can still find ways to apply these techniques to public content to learn more about you.

I'm writing code to gather statistics about my own writing on this blog:

  • Letter frequencies
  • n-graph frequencies
  • Word frequencies
  • Word lengths
  • Sentence length
  • Capital/Minuscule ratios

I intend to run this code on my Markdown files and gather a corpus of my own text to better understand exactly what can be learned about me from my writing style.
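As a rough sketch of where this is headed, three of the statistics in the list could look something like the following. The function names (ngraphFrequencies, wordFrequencies, caseRatio) are placeholders, the definition of a "word" is a simplifying assumption, and only unaccented ASCII letters are recognised.

```javascript
// Hypothetical names; a sketch of three of the listed statistics.

// Tally every run of n consecutive characters (n=2 gives digraphs).
var ngraphFrequencies=function(text,n){
  var r=Object.create(null); // no inherited keys such as "constructor"
  for(var i=0;i+n<=text.length;i++){
    var g=text.slice(i,i+n);
    r[g]=(r[g]||0)+1;
  }
  return r;
};

// Tally lowercased words. Assumption: a word is a run of letters and apostrophes.
var wordFrequencies=function(text){
  var r=Object.create(null);
  (text.toLowerCase().match(/[a-z']+/g)||[]).forEach(function(w){
    r[w]=(r[w]||0)+1;
  });
  return r;
};

// Ratio of capital letters to minuscule letters (ASCII only).
var caseRatio=function(text){
  var upper=(text.match(/[A-Z]/g)||[]).length;
  var lower=(text.match(/[a-z]/g)||[]).length;
  return lower===0?NaN:upper/lower; // undefined when there are no minuscules
};
```

The tallies are built on Object.create(null) rather than {} because a plain object inherits properties like constructor, and a word or n-graph with that spelling would otherwise corrupt its own count.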

I'll probably run these same scripts on my IRC writing to see how they differ, and maybe compare them against other IRC personalities to determine degrees of linguistic similarity with my various peers.

Getting started

var letterFrequencies=function(text){
  var r={}; // the object which returns the results
  r.length=text.length; // remember the length of the text
  text.split("") // convert the text into an array
    .forEach(function(l){ // visit each character of the text
      r[l]=(r[l]||0)+1; // increment the current letter's tally
    });
  return r; // return the results
};

This function is pretty simple. It accepts a string of text and returns an object containing a tally of each character (letters, but also spaces, punctuation, and newlines), plus the length of the original text.
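Raw tallies are hard to compare between texts of different sizes, which is why the length is worth keeping around. Here is a small sketch of turning the result into proportions; relativeFrequencies is a hypothetical helper, not part of the original script.

```javascript
// Hypothetical helper: convert the tallies returned by letterFrequencies
// into proportions of the whole text.
var relativeFrequencies=function(tally){
  var r={};
  Object.keys(tally).forEach(function(k){
    if(k==="length"){ return; } // skip the stored text length
    r[k]=tally[k]/tally.length;
  });
  return r;
};

// The shape letterFrequencies("hello") returns:
var tally={length:5,h:1,e:1,l:2,o:1};
console.log(relativeFrequencies(tally));
// { h: 0.2, e: 0.2, l: 0.4, o: 0.2 }
```

Keeping length in the same object as the tallies is safe here because every tally key is a single character, so nothing can collide with the six-character key.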

Applying the function

I wrote this script in the folder where I keep my Markdown files.

var fs=require("fs");

var texts=fs.readdirSync(".").filter(function(f){
  return f.match(/\.md$/); // I'm only interested in analyzing prose
}).map(function(f){ // load each file's text
  return fs.readFileSync(f,"utf8");
}).join(""); // join the results string array into a single string

console.log(letterFrequencies(texts));

I wrote the results of this script to a JSON file, which can be viewed here. The exact script I used can be seen here.
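One way to produce such a file is sketched below. console.log prints objects in Node's inspection format rather than as strict JSON, so the sketch serialises with JSON.stringify instead. The saveResults helper, the output file name, and the temp-directory location are all assumptions made for illustration.

```javascript
var fs=require("fs");
var os=require("os");
var path=require("path");

// Hypothetical helper: serialise a results object and write it to disk.
var saveResults=function(file,results){
  fs.writeFileSync(file,JSON.stringify(results,null,2)+"\n","utf8");
};

// Stand-in for the object letterFrequencies would return.
var results={length:5,h:1,e:1,l:2,o:1};
var out=path.join(os.tmpdir(),"letterFrequencies.json"); // assumed file name
saveResults(out,results);
```

The two-space indent passed to JSON.stringify is only there to keep the file readable; dropping it gives a more compact file with the same content.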

Repackaging

The code above was a quick and dirty hack, so let's package it up a little better:

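var fs=require("fs");
var nodePath=require("path"); // named nodePath because textsByType takes a `path` argument
var crypta={}; // a place to keep our functions, packaging as a library

crypta.textsByType=function(path,type){
  var ext=new RegExp("\\."+type+"$"); // double backslash so the dot is matched literally
  return fs.readdirSync(path).filter(function(f){
    return ext.test(f); // keep only files with the requested extension
  }).map(function(f){
    return fs.readFileSync(nodePath.join(path,f),"utf8"); // resolve relative to path, not the cwd
  }).join("");
};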

An up-to-date version of this library is available here.

Added functionality