Wednesday, August 8, 2012

Plagiarism Vocabulary

I was digging around looking for papers on how exactly plagiarism detection software works when I was directed to "Classifications of Plagiarism Detection Engines", published in 2005 by Fintan Culwin and Thomas Lancaster in the online journal ITALICS 4(2). As a software engineer, I quite enjoyed unearthing the ancient papers cited there about detecting plagiarism in programming exercises in FORTRAN. Ahh, those were the days, my first programming language... and what a great use of Halstead's Software Science!

Then I realized that Thomas Lancaster had submitted his dissertation "Effective and Efficient Plagiarism Detection" in 2003 to the London South Bank University, London, UK. He has an excellent, detailed classification of the plagiarism detection systems available at that time, and a good overview of a lot of the technical papers that are to be found on the topic. The glossary alone is a joy to read, and I have asked for and received permission to repeat portions here. There are also a number of papers that Lancaster has published or prepared on the topic included in the appendix. Lancaster focuses in the thesis on a four-step process for determining plagiarism:
  • Collection stage - The first stage of the four-stage plagiarism detection process. This is where students submit their work to an electronic system so it can later be analysed for similarity.
  • Analysis stage - The second stage of the four-stage plagiarism detection process. Here all submissions are compared with each other (for intra-corpal plagiarism detection) or the external sources such as the Web (for extra-corpal plagiarism detection) to find submissions that are similar to each other or the Web sources.
  • Confirmation stage - The third stage of the four-stage plagiarism detection process. Here a tutor checks the pairs of student submissions that have been judged to be similar to see if they represent plagiarism or they represent legitimate shared citations or false hits. The tutor decides which pairs will go on to be investigated further.
  • Investigation stage - The fourth and final stage of the four-stage plagiarism detection process. This is where pairs of similar submissions have been found and they have been confirmed by human inspection to be similar and possible cases of plagiarism. In this case further evidence is collected, such as student interviews and marked up copies of the submissions and penalties are given.
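To make the analysis stage a little more concrete, here is a minimal sketch of an intra-corpal comparison: every pair of submissions is scored, and the pairs above a cut-off are listed, highest first, for a tutor to check in the confirmation stage. This is not Lancaster's implementation. The function names, the threshold of 0.5 and the word-overlap score are my own assumptions, chosen only to keep the example short.

```python
from itertools import combinations


def word_overlap(a: str, b: str) -> float:
    """Placeholder similarity (an assumption, not a metric from the thesis):
    Jaccard overlap of the lowercased word sets of two texts."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def rank_pairs(corpus: dict[str, str], threshold: float = 0.5):
    """Analysis stage: score every pair in the corpus and return the
    pairs at or above the (hypothetical) threshold, most similar first."""
    scored = [
        (word_overlap(corpus[x], corpus[y]), x, y)
        for x, y in combinations(sorted(corpus), 2)
    ]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)


# Collection stage stand-in: three invented submissions keyed by student.
corpus = {
    "alice": "the quick brown fox jumps over the lazy dog",
    "bob": "the quick brown fox leaps over the lazy dog",
    "carol": "plagiarism detection is a four stage process",
}

# The ranked list is what a tutor would work through when confirming pairs.
for score, x, y in rank_pairs(corpus):
    print(f"{x} / {y}: {score:.2f}")
```

Where the cut-off is set determines the trade-off between false hits and missed pairs, both of which are defined in the glossary below.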
My selection from the glossary (my favorite definition is in blue):
  • Academic plagiarism - Plagiarism carried out by academics, for instance copying journal articles and submitting them as their own work for possible career development.
  • Attribute counting metrics - A count of some property of a single document which
    might involve tokenisation. This has been redefined to remove the inconsistencies
    from the literature but is not considered a sensible classification.
  • Authorship attribution - The branch of linguistics that aims to calculate the author of
    a work based on knowledge of works by other known authors. This is not appropriate for plagiarism detection since there is no corpus of known work by a given student.
  • Characters Metric - A simple metric that measures the number of sequences of
    characters of a chosen length two documents have in common. 
  • Cheating - Unauthorised behaviour that is going against student etiquette when trying for an academic award or to gain an advantage over other students. Examples include plagiarism, use of cribs in exams and paying someone to complete an assignment specification on your behalf. 
  • Closeness Calculation - A computationally part of automated plagiarism detection
    where a single number is generated from a number of different metrics to decide how similar two submissions are.
  • Contractive plagiarism - Plagiarism where the source is larger than the copy and
    hence the source has been reduced in some way to create the student submission.
  • Corpal Metrics - A multi-dimensional metric that is a measure of a property of an
    entire corpus, for instance the proportion of submissions using a given keyword.
  • Collusion - Where two students discuss and work on an assignment specification
    together and complete elements of their final submissions together. This might be
    judged to be intra-cor[p]al plagiarism.
  • Direct copy - Two student submissions that are identical to one another with no
    attempt at disguise. One is a direct copy of the other. 
  • Disguise - Where a student has attempted to change a source and hand it in as their
    own submission so that the use of the original source won't be noticed.
  • Expansive plagiarism - Plagiarism where the source has been extended, either by
    adding new thoughts or adding filler words and phrases to make a student submission. 
  • Extra-corpal plagiarism - Plagiarism where the plagiarism source is outside the
    corpus of student submissions, for instance a Web site or material from a book.
  • False hits - Pairs of submissions that are ranked high enough for a tutor to investigate them but are judged to be dissimilar, thus being a waste of tutor time.
  • Free text plagiarism - Plagiarism that has been done in natural language, for instance, altering the words of another writer and presenting it as your own work.
  • Hybrid metric systems - A system that [uses] a combination of both attribute counts and
    structure metrics to find similar submissions. This has been defined to remove the
    inconsistencies from the literature but is not considered a sensible method of
    classification. 
  • Intra-corpal plagiarism - Plagiarism entirely within a corpus, primarily meaning two
    students who have copied from one another.
  • Missed pairs - A pair of submissions that contains plagiarism but is not automatically ranked in the upper portion of an ordered list of similar pairs and hence not investigated further by a tutor. 
  • Mosaic plagiarism - Plagiarism where chunks from different sources are used and rearranged in a way that could be considered like a mosaic is created from combining and arranging different pictures.
  • Multiply sourced - A student submission or external source that has been used in
    multiple student submissions.
  • Ostrich plagiarism policy - Where an academic institution states that plagiarism does not exist in their institution and has no formal way of dealing with it.
  • Paraphrasing - Using the ideas of another but rewriting them in your own words
    without suitable and continual acknowledgement. 
  • Plagiarism - Taking the words or ideas of another and presenting them as your own
    without suitable acknowledgement.
  • Proactive plagiarism policy - A policy of an academic institution where plagiarism is actively sought out on a regular basis, perhaps by using automated detection methods and cases are followed up when they are found.
  • Professional plagiarism - Plagiarism in a professional setting, for instance copying an internal report or company Web page from another source or using a service that
    writes standard CVs or job applications. 
  • Reactive plagiarism policy - The academic policy where plagiarism is not actively
    sought out but is taken seriously and followed up when it is identified during the
    course of marking.
  • Similarity - Where two submissions have words or ideas in common they are said to
    be similar. When they have been looked at by a tutor they may also be judged to be
    plagiarised.
  • Singularly sourced - A plagiarism source that has been copied from once only.
  • Source code plagiarism - Plagiarism of source code submissions, where two students
    have handed in programs where one has been derived from the other in some way.
    Detecting this is a well understood area since the constrained language reduces the
    number of possibilities that must be checked.
  • Structural Metrics - A metric that measures a property of one or more submissions
    where knowledge of the structure of the documents is needed.
  • Synthetic corpus - A corpus of documents that have been generated using synthetic
    means by taking sequences of words or characters in a known and defined order.
  • Thesaurising - A technique for plagiarism where words in a source are replaced by
    synonyms or changed in such a way that the submission makes the same points but
    the intention is that the plagiarism will not be discovered.
  • Visual metrics - A metric which is based on some property of the similarity
    visualisation that would be generated for a given pair of student submissions.
  • Words Pair Metric - A simple metric that measures the number of sequences of word
    pairs in common between two documents. Identified as the most effective simple
    metric.
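The two simple metrics in the list, the Characters Metric and the Words Pair Metric, are easy to sketch. The glossary does not say whether repeated sequences are counted once or several times, or how the text is normalised, so the version below counts distinct shared sequences and only lowercases words. Those choices, the default length of 5 and all the function names are my assumptions, not definitions from the thesis.

```python
def char_sequences(text: str, n: int) -> set[str]:
    """All distinct character sequences of length n in the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def characters_metric(a: str, b: str, n: int = 5) -> int:
    """Number of distinct length-n character sequences the two texts share.
    The default n is an arbitrary choice for illustration."""
    return len(char_sequences(a, n) & char_sequences(b, n))


def word_pairs(text: str) -> set[tuple[str, str]]:
    """All distinct pairs of adjacent words, lowercased."""
    words = text.lower().split()
    return set(zip(words, words[1:]))


def word_pairs_metric(a: str, b: str) -> int:
    """Number of distinct adjacent word pairs the two texts share."""
    return len(word_pairs(a) & word_pairs(b))


# Two invented sentences that differ in a single word.
source = "the cat sat on the mat"
copy = "the cat lay on the mat"

print(characters_metric(source, copy))
print(word_pairs_metric(source, copy))
```

Raw counts like these grow with document length, so a real system would presumably scale them before feeding them into a closeness calculation.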
I find it immensely helpful to have terms that are generally understood when we are speaking about plagiarism. I would personally use "Synonymizing" instead of "Thesaurising" (which I can't pronounce). I also like the focus on the process of determining plagiarism and not the products - the software that is used in the process. Lancaster's focus in the thesis is on intra-corpal plagiarism and the visualization of similarity. It is well worth a read if you are working in this area.
