jdportercode/TextMining

Here are a few scripts that are useful for some text mining methods.

ttr_rollingwindow computes the type-token ratio for every .txt file in a directory, averaged over a rolling window. The type-token ratio is the number of unique words (types) divided by the total number of words (tokens). For example, the sentence "A rose is a rose is a rose" has three types ("a", "rose", "is") and eight tokens (the word count), for a type-token ratio of 3/8 = 0.375. So this metric gives us a sense of how diverse or repetitive a text's vocabulary is. But the type-token ratio is strongly correlated with length: the longer a text, the lower the ratio tends to be. So, to compare texts of different lengths, this program averages the ratio over a rolling window of 500 words (though the user can easily adjust that size). That is, it computes the ratio for words 1-500, 2-501, 3-502, etc., and then averages all of those values. The results are written to a spreadsheet file (.tsv by default).
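The rolling-window average described above can be sketched roughly as follows. This is an illustration, not the code in ttr_rollingwindow: the function names `tokenize` and `rolling_ttr`, the regex tokenizer, and the fallback to a plain ratio for texts shorter than the window are all assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def rolling_ttr(tokens, window=500):
    """Average type-token ratio over every window of `window` consecutive tokens."""
    n = len(tokens)
    if n == 0:
        return 0.0
    if n <= window:
        # Assumption: a text shorter than the window gets its plain ratio.
        return len(set(tokens)) / n

    # Count the types in the first window, then slide one token at a time,
    # updating the counts instead of recounting each window from scratch.
    counts = Counter(tokens[:window])
    total = len(counts) / window
    for i in range(window, n):
        leaving = tokens[i - window]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]
        counts[tokens[i]] += 1
        total += len(counts) / window
    return total / (n - window + 1)
```

For the example sentence, `rolling_ttr(tokenize("A rose is a rose is a rose"))` gives the plain ratio of 0.375, because the text is shorter than the default window. Updating the counts incrementally keeps the cost linear in the length of the text, which matters when there is one window per word.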

get_pos gets part-of-speech tags for every .txt file in a directory, using the spaCy parser. It writes the results to a spreadsheet file. I wouldn't run it on a large corpus (it takes a while and produces a lot of output), but for a few long texts or many short ones, it's a nice, simple tool.
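A minimal sketch of that kind of tagging loop is shown below. It is not the code in get_pos: the function names, the output columns, the tab-separated output, and the choice of the `en_core_web_sm` model are all assumptions made for illustration.

```python
import csv
from pathlib import Path


def load_default_pipeline():
    # Assumption: the small English model is installed
    # (python -m spacy download en_core_web_sm).
    import spacy
    return spacy.load("en_core_web_sm")


def tag_directory(nlp, in_dir, out_path):
    """Write one row per token: file name, token text, coarse POS, fine-grained tag."""
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow(["file", "token", "pos", "tag"])
        for path in sorted(Path(in_dir).glob("*.txt")):
            doc = nlp(path.read_text(encoding="utf-8"))
            for token in doc:
                # Skip tokens that are only whitespace (e.g. line breaks).
                if not token.is_space:
                    writer.writerow([path.name, token.text, token.pos_, token.tag_])
```

It could be called as `tag_directory(load_default_pipeline(), "texts", "pos_tags.tsv")`, where both paths are placeholders. Writing one row per token is what makes the output grow so quickly on a large corpus.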
