Option 2: Prepare a corpus from an Online Knowledge Network

100 points

Overview

This is the more complicated of the two assignments, but it is also the one with less writing. I can't really coach you through it online, so I am offering it as an option you can pursue on your own. Note, however, that I won't be devoting any instructional time to it, so if you choose it, you are on your own.

That said, I think it is a more relevant option for your professional futures.

Computational rhetoric projects, whether they are done in industry, in technical communication courses, or in Digital Humanities research, all begin with the development of a corpus, or body of words. In scholarly pursuits, we ask questions about large trends or patterns across more texts than any one person could ever read. In industry and other kinds of rhetorical work, we try to find out how people categorize and use words to solve problems or make knowledge. Either way, it all begins with a nice, clean "body" of words that a computer and other researchers can look at.

Getting Started

Make an ethical choice concerning an online community of practice engaged in knowledge work. In other words, find a place where folks discuss and share information about how to solve problems. It is very easy to reach false conclusions about communities that you already belong to, even if you are a lurker, so pick a community that you are unfamiliar with. If you get lost, use Temple's Preparing a Corpus for Textual Analysis guide to help you. Make sure to take a look at the scraper options on the Tools page (I use Scraper, a lightweight Chrome plug-in, but you can even roll your own if you want). Each option comes with benefits and drawbacks, so make sure to pick the one that fits your needs and resources.
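
If you do decide to roll your own scraper instead of using a browser plug-in, the sketch below shows roughly what that can look like in Python with requests and BeautifulSoup. The URL and the ".post-body" selector are placeholders only; inspect the HTML of your chosen community and swap in the real values, and check the site's terms of service and robots.txt before you crawl anything.

    # Minimal scraping sketch (Python, requests + BeautifulSoup).
    # The URL and the ".post-body" selector are placeholders -- replace them
    # with values from the forum you actually choose.
    import requests
    from bs4 import BeautifulSoup

    URL = "https://example-forum.org/threads/some-thread"  # placeholder

    response = requests.get(URL, timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    # Collect the visible text of each post on the page.
    posts = [p.get_text(" ", strip=True) for p in soup.select(".post-body")]

    with open("raw_posts.txt", "w", encoding="utf-8") as f:
        f.write("\n\n".join(posts))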

How to pick a good community of practice

Identify the online community of practice you intend to text mine, along with an intellectual or professional rationale for studying it (see Ignatow & Mihalcea, pages 27-31). You should also develop a basic research question about that community of practice.

Some advice on Scrubbing your data

Here is some advice from Temple’s Preparing a Corpus for Textual Analysis guide to help you with scrubbing your corpus:

  1. Think about the order in which you do things. If you start by running everything through Lexos, which eliminates hyphens and makes everything lower-case, it will be much harder to correct hyphenated words and to eliminate names that are also words (e.g., frank/Frank, bell/Bell). Make sure to remove stopwords (see the sketch after this list).
  2. Work carefully and keep saving different versions. It’s inevitable that you’ll make some mistakes while cleaning, and unfortunately it’s not always as easy as you’d expect to correct them. Notepad++ lets you change hundreds of files at once, but to undo what you just did, you have to go to each file in your corpus individually.
  3. If something’s not working, check your settings. If Notepad++ doesn’t seem to be working, make sure that the “Regular expression” or “Normal” search mode is set the way you want it for a particular search-and-replace.
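
If you would rather script the scrubbing than do it all by hand in Lexos or Notepad++, here is a rough Python sketch of the same steps. The stopword list is a tiny stand-in (swap in a fuller list, such as NLTK's), the file names assume the raw text you saved while scraping, and the order mirrors the advice above: deal with hyphens and names before you lowercase.

    # Rough scrubbing sketch: handle hyphens and names first, then lowercase,
    # then drop stopwords -- mirroring the order-of-operations advice above.
    import re

    STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}  # tiny stand-in list

    def scrub(text):
        # 1. Rejoin words split across lines with a hyphen (e.g. "cor-\npus").
        text = re.sub(r"-\s*\n\s*", "", text)
        # 2. This is where you would handle names that are also words
        #    (e.g. Frank/frank) while capitalization is still intact.
        # 3. Lowercase, then strip everything that isn't a letter or space.
        text = re.sub(r"[^a-z\s]", " ", text.lower())
        # 4. Remove stopwords.
        return " ".join(t for t in text.split() if t not in STOPWORDS)

    with open("raw_posts.txt", encoding="utf-8") as f:
        cleaned = scrub(f.read())

    with open("scrubbed_posts.txt", "w", encoding="utf-8") as f:
        f.write(cleaned)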

Deliverables

  1. A brief memo covering the following areas of concern:
    • The online community of practice you intend to text mine, along with your intellectual or professional rationale (see Ignatow & Mihalcea, pages 27-31).
    • A web graph of the websites to be included in your crawl (see Ignatow & Mihalcea, page 36).
    • A commitment to a particular web crawling and scraping program for your project
    • The location where you plan to store your linguistic corpus (must be public)
    • A decision on the text mining method you plan to employ: information extraction, information retrieval, or topic modeling
    • Any difficulties you anticipate having or specific elements you would like me to help with.
    • Please note that this memo should be brief: a half-page justification of the community-ness of your site, using Wenger or Swales
  2. A shared Google Sheets corpus consisting of at least 20,000 scrubbed n-grams (about 60,000 words); see the sketch after this list for one way to generate them
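
Here is a minimal sketch of how you might turn your scrubbed text into n-grams and get them into a spreadsheet. It assumes the scrubbed file from the sketch above, and it uses n = 3 only because three-word grams roughly line up with the 60,000-word estimate; pick whatever n fits the method you committed to in your memo.

    # Minimal n-gram sketch: turn scrubbed tokens into n-grams and write them
    # one per row so they paste cleanly into a Google Sheet.
    import csv

    N = 3  # example value only -- choose the n that fits your method

    with open("scrubbed_posts.txt", encoding="utf-8") as f:
        tokens = f.read().split()

    ngrams = [" ".join(tokens[i:i + N]) for i in range(len(tokens) - N + 1)]

    with open("ngrams.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["ngram"])
        writer.writerows([g] for g in ngrams)

    print(f"Wrote {len(ngrams)} {N}-grams")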

Submission

  • Google Doc link for the memo
  • .txt corpus file

Assessment

There is no rubric for this assignment. Grades will be based on the percentage of the 20,000 n-grams completed and on the quality of your scrubbing. Reach 20,000 n-grams and you get an A.