GitHub - jdportercode/TextMining: Assorted tools for text mining

Here are a few files useful for some text mining methods.

ttr_rollingwindow gets the type token ratio for every .txt file in a directory, on a rolling window. Type token ratio is the ratio of unique words (types) to total words (tokens). For example, in the sentence "A rose is a rose is a rose", there are three types ("a", "rose", "is") and eight tokens (the word count), for a type token ratio of 0.375. So this metric gives us a sense of the diversity of vocabulary and the repetitiveness of a text. But type token ratio depends heavily on length: the longer a text, the lower the ratio tends to be. So to compare texts of different lengths, this program averages the ratio over a rolling window of 500 words (though the user can easily adjust that size). That is, it gets the ratio for words 1-500, 2-501, 3-502, etc. and then averages all of those. The results are written to a spreadsheet file (.tsv by default).
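The rolling-window average described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the names `tokenize` and `rolling_ttr`, the regex tokenizer, and the fallback to a plain ratio for texts shorter than the window are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase, keep runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def rolling_ttr(tokens, window=500):
    """Mean type token ratio over every window of `window` consecutive tokens."""
    n = len(tokens)
    if n == 0:
        return 0.0
    if n <= window:
        # Assumption: a text shorter than the window gets its plain ratio.
        return len(set(tokens)) / n

    # Count the types in the first window, then slide one token at a time.
    counts = Counter(tokens[:window])
    type_total = len(counts)
    for i in range(window, n):
        incoming, outgoing = tokens[i], tokens[i - window]
        counts[incoming] += 1
        counts[outgoing] -= 1
        if counts[outgoing] == 0:
            del counts[outgoing]
        type_total += len(counts)

    n_windows = n - window + 1
    return type_total / (n_windows * window)


tokens = tokenize("A rose is a rose is a rose")
print(rolling_ttr(tokens))            # whole sentence: 3 types / 8 tokens
print(rolling_ttr(tokens, window=4))  # mean over the five 4-word windows
```

Updating a running count as the window slides avoids rebuilding a set for every window, which matters when a long text produces many thousands of windows.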

get_pos gets part-of-speech tags for every .txt file in a directory, using the spaCy parser. It writes the results to a spreadsheet. I wouldn't run it on a large corpus (it takes a while and produces a ton of results), but for a few long texts or many short ones, it's a nice, simple tool.
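A rough sketch of that workflow is below. It is an illustration under stated assumptions rather than the notebook's code: the function name `tag_directory`, the column layout, and the tab-separated output are hypothetical. The `nlp` argument is expected to be a loaded spaCy pipeline; the only spaCy features relied on are the token attributes `text`, `pos_`, `tag_`, and `is_space`.

```python
import csv
from pathlib import Path


def tag_directory(nlp, directory, out_path):
    """Tag every .txt file in `directory` and write one row per token."""
    rows = []
    for path in sorted(Path(directory).glob("*.txt")):
        doc = nlp(path.read_text(encoding="utf-8"))
        for token in doc:
            if token.is_space:
                continue  # skip whitespace-only tokens such as newlines
            # pos_ is the coarse tag, tag_ the fine-grained one.
            rows.append((path.name, token.text, token.pos_, token.tag_))

    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["file", "token", "pos", "tag"])  # hypothetical columns
        writer.writerows(rows)
    return rows
```

With spaCy and an English model installed, a call might look like `tag_directory(spacy.load("en_core_web_sm"), "texts", "pos_tags.tsv")`, where the directory and output names are placeholders. One row per token is why the output grows so quickly on a large corpus.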

Files in this repository:

README.md
get_pos.ipynb
ttr_rollingwindow.ipynb
