Prepare a corpus from an Online Knowledge Network

100 points

Overview

Computational rhetoric projects, whether they are done in industry, technical communication courses, or Digital Humanities research, all begin with the development of a corpus, or body of words. In scholarly pursuits, we ask questions about large trends or patterns across more texts than one person could ever read. In industry and other kinds of rhetorical work, we’re trying to find out how people categorize and use words to solve problems or make knowledge. Either way, it all begins with a clean “body” of words that a computer and other researchers can work with.

Getting Started

Make an ethical choice concerning an online community of practice engaged in knowledge work. In other words, find a place where folks discuss and share information about how to solve problems. It is very easy to reach false conclusions about communities that you already belong to, even if you are only a lurker, so pick a community that you are unfamiliar with. If you get lost, use Temple’s Preparing a Corpus for Textual Analysis guide to help you. Make sure to take a look at the scraper options on the Tools page. Each comes with benefits and drawbacks, so pick the one that fits your needs and resources.
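If you want a feel for what the scrapers on the Tools page are doing under the hood, here is a minimal sketch in Python using requests and BeautifulSoup. The URL and the CSS selector are hypothetical placeholders, not from any real community; inspect your community’s pages (and its robots.txt and terms of service) before adapting anything like this.

```python
# A minimal scraping sketch using requests and BeautifulSoup.
# The URL and selector below are hypothetical placeholders -- every forum
# structures its HTML differently, so inspect the real markup first.
import requests
from bs4 import BeautifulSoup

url = "https://forum.example.com/topic/123"  # placeholder thread URL
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Many forums wrap each post in a container with a class like
# "post-content"; substitute the selector your community actually uses.
posts = [div.get_text(" ", strip=True) for div in soup.select("div.post-content")]

# One post per line keeps the raw text easy to scrub later.
with open("raw_posts.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(posts))

print(f"Collected {len(posts)} posts")
```

Whatever tool you commit to, the basic loop is the same: request pages, pull out just the post text, and save it somewhere your scrubbing workflow can reach.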

How to pick a good community of practice

Find an intended online community of practice to engage in text mining, along with an intellectual or professional rationale for studying it (see Ignatow & Mihalcea, pages 27 to 31). You should also develop a basic research question for that community of practice.

Some advice on Scrubbing your data

Here is some advice from Temple’s Preparing a Corpus for Textual Analysis guide to help you with scrubbing your corpus (a short code sketch follows the list):

  1. Think about the order in which you do things. If you start by running everything through Lexos, which eliminates hyphens and makes everything lower-case, it’s going to be much harder to correct hyphenated words and to eliminate names that are also common words (e.g., frank/Frank, bell/Bell).
  2. Work carefully and keep saving different versions. It’s inevitable that you’ll make some mistakes while cleaning, and unfortunately it’s not always as easy as you’d expect to correct them. Notepad++ lets you change hundreds of files at once, but to undo what you just did, you have to go to each file in your corpus individually.
  3. If something’s not working, check your settings. If Notepad++ doesn’t seem to be working, make sure the search mode (“Regular expression” or “Normal”) is set the way you want it for a particular search-and-replace.
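To make point 1 concrete, here is a small Python sketch of a scrubbing order that does the case-sensitive work first. The name list and the hyphen-joining rule are assumptions for illustration only; your corpus will need its own decisions.

```python
# A sketch of a scrubbing order that defers lowercasing: names are
# removed while capitalization still distinguishes them from common
# words, hyphens are handled deliberately, and lowercasing comes last.
import re

text = "Frank told Bell the self-contained unit was frank about the bell."

# 1. Remove proper names while case still separates Frank (name)
#    from frank (adjective). This name list is purely illustrative.
for name in ["Frank", "Bell"]:
    text = re.sub(rf"\b{name}\b", "", text)

# 2. Decide what hyphens mean in your corpus before a tool strips them;
#    here, hyphenated compounds are joined into a single token.
text = re.sub(r"(\w)-(\w)", r"\1\2", text)

# 3. Lowercase last, once the case-sensitive steps are done.
text = text.lower()

# 4. Collapse the whitespace the removals left behind.
text = re.sub(r"\s+", " ", text).strip()

print(text)
# -> "told the selfcontained unit was frank about the bell."
```

Run the same steps in the wrong order through Lexos or Notepad++ and you lose the Frank/frank distinction for good, which is exactly the problem point 1 warns about.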

Deliverables

  1. A brief memo covering the following areas of concern:
    • A web graph of the web sites to be included in your crawl (see Ignatow & Mihalcea, page 36).
    • A commitment to a particular web crawling and scraping program for your project, preferably one you have already tested.
    • An assessment of two or more web crawling and scraping programs (see the “Tools” page of the course website), including a cost/benefit analysis of each.
    • The two best tutorials you have found for the tool of your choice.
    • A decision on the text mining method you plan to employ: information extraction, information retrieval, or topic modeling.
    • Any difficulties you anticipate having or specific elements you would like me to help with.
    Please note that this memo should be brief: no more than one or two sentences per rationale. Share the memo with my NCSU email (dmwalls@ncsu.edu) via Google Docs with editing turned on.
  2. A shared Google Sheets corpus consisting of at least 30,000 scrubbed n-grams (see the sketch after this list for one way to generate them).
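As a rough sketch of how a scrubbed text file becomes the n-gram sheet, here is one way to do it in Python. The trigram size (n = 3) is an assumption inferred from the 30,000 n-grams to roughly 90,000 words ratio in this assignment, and non-overlapping chunks are just one of several reasonable windowing choices.

```python
# A sketch that chunks a scrubbed corpus into consecutive, non-overlapping
# n-grams and writes them to a CSV that Google Sheets can import.
# n = 3 is an assumption based on the 30,000 n-grams / ~90,000 words
# ratio above; a sliding window would be another reasonable choice.
import csv

n = 3

with open("scrubbed_corpus.txt", encoding="utf-8") as f:
    words = f.read().split()

# Step by n so each word lands in exactly one n-gram.
ngrams = [" ".join(words[i:i + n]) for i in range(0, len(words) - n + 1, n)]

with open("ngrams.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["ngram"])
    writer.writerows([g] for g in ngrams)

print(f"Wrote {len(ngrams)} {n}-grams")
```

Importing ngrams.csv into a Google Sheet (File > Import) gives you the shareable deliverable.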

Submission

  • Google Doc link to the memo
  • Google Sheets link or Google Drive .txt link to the corpus

Assessment

There is no rubric for this assignment. Grades will be based on the percentage of the 30,000 n-grams (about 90,000 words) you complete and on the quality of your scrubbing.