Data Mining and Knowledge Engineering

COMP723 Data Mining and Knowledge Engineering
Assignment 2 – Text Classification (50%)

1 Objective
To develop a broad understanding of text mining by performing a representative task, Text Classification.

2 Task Specification
The assignment requires classification of “free text” snippets into categories using machine learning algorithms. You are required to use a chosen text mining tool to train three different classification algorithms on the given dataset, analyse the results, and present a report of your findings.
2.1 Due Dates and Submission
• This assignment can be done individually or in pairs. If you decide to work in a pair, you will need to choose a partner from the class, either by approaching classmates in the lectures or labs or through the discussion forum on Blackboard for Assignment 2.
• If done in a pair:
o Only one person from the pair should submit the assignment. The report should clearly state the names and student IDs of both members of the team. Furthermore, the contributions made by each team member must be clearly stated in the section “Contributions – group name” at the beginning of the report.
• The written part of your assignment is due on 22 October at midnight.
• You are required to submit only an electronic copy of the assignment via the Turnitin assignment Submission tab (on the course homepage) on Blackboard.
2.2 Marking
• This assignment will be marked out of 100 marks and is worth 50% of the overall mark for the paper.
3 Assignment Details
The objective of the assignment is to classify text into two categories using libraries and code snippets covered as part of lectures and labs.

3.1 Dataset

The data set to be used for this assignment is available from Blackboard as part of the assignment package. The data set is a large corpus of emails organised into 5 folders named enron1, enron2, enron3, enron4 and enron5. Each of these folders contains two folders named “ham” and “spam”, containing the emails that belong to each of the two categories. The package also contains two papers that give you background on the dataset and examples of its use for text classification with Naïve Bayes and Support Vector Machines. You will need to acknowledge the use of this dataset appropriately in your report.

3.2 Assignment Tasks

Download the zipped file containing the dataset from Blackboard under the Assignment 2 folder. Unzip it into a working folder which you will use for this assignment. The zipped file contains a total of 5 folders as described above. The files represent 5 sets of data consisting of emails classified into ham and spam.
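The folder layout described above can be read with a short loader. The following is a minimal sketch only: the function name `load_emails`, the `(folder, label, text)` record format and the choice of latin-1 decoding are assumptions for illustration, not part of the assignment package.

```python
from pathlib import Path

def load_emails(root, folders=("enron1", "enron2", "enron3", "enron4", "enron5")):
    """Return a list of (folder, label, text) tuples, one per email file."""
    records = []
    for folder in folders:
        for label in ("ham", "spam"):
            label_dir = Path(root) / folder / label
            if not label_dir.is_dir():
                continue  # tolerate a missing folder instead of failing
            for path in sorted(label_dir.iterdir()):
                if path.is_file():
                    # Assumption: latin-1 can decode any byte sequence,
                    # so no email is skipped because of encoding errors.
                    text = path.read_text(encoding="latin-1")
                    records.append((folder, label, text))
    return records
```

Calling `load_emails` with the path of your working folder would return one record per email, keeping the source folder so that both data slices required later can be built from the same list.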

The objective of this assignment is to compare the performance of three classification algorithms on the task of text classification. Two of the algorithms are given and the third one will be of your choice. The three algorithms are:
1. Naïve Bayes
2. Neural Network
3. Your choice. Some examples are SVM, CRF, J48, etc.
Your task is to conclude which of the three algorithms is the best for text classification.
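To make the first of the listed algorithms concrete, the following is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing, written in plain Python. The class name `MultinomialNaiveBayes` and its methods are hypothetical; in practice you would more likely use the implementation provided by the library covered in the labs.

```python
import math
from collections import Counter

class MultinomialNaiveBayes:
    """Illustrative multinomial Naive Bayes over lists of tokens."""

    def fit(self, token_lists, labels):
        self.classes = sorted(set(labels))
        self.n_docs = len(labels)
        self.doc_counts = Counter(labels)  # documents per class
        self.term_counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(token_lists, labels):
            self.term_counts[label].update(tokens)
        self.vocabulary = set()
        for counts in self.term_counts.values():
            self.vocabulary.update(counts)
        # Total number of term occurrences seen in each class.
        self.total_terms = {c: sum(self.term_counts[c].values())
                            for c in self.classes}
        return self

    def predict(self, tokens):
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for c in self.classes:
            # Log prior plus summed log likelihoods avoids numeric underflow.
            score = math.log(self.doc_counts[c] / self.n_docs)
            for t in tokens:
                if t in self.vocabulary:  # terms unseen in training are ignored
                    smoothed = (self.term_counts[c][t] + 1) / (
                        self.total_terms[c] + vocab_size)
                    score += math.log(smoothed)
            if score > best_score:
                best_label, best_score = c, score
        return best_label
```

The add-one smoothing keeps a term that never appeared in one class from driving that class's probability to zero.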
To reach this conclusion, you can use any combination of the pre-processing tasks to build the features used by the three machine learning algorithms. The pre-processing does not need to be the same for each algorithm.
Some of the pre-processing options you can use are:
a) Stop word processing
b) Stemming/lemmatization
c) Feature size
d) Type of vectorisation
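The options above can be sketched in plain Python as follows. This is an illustration under stated assumptions: `STOP_WORDS` is a tiny made-up list, the function names are hypothetical, and stemming/lemmatization is left out because it normally relies on a library.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop word list; a real one is much longer.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "for", "on"}

def tokenize(text, remove_stop_words=True):
    """Lower-case the text and split it into alphabetic tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

def build_vocabulary(token_lists, max_features):
    """Keep the max_features most frequent terms (the feature size)."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    # Sort by descending frequency, then alphabetically so ties are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:max_features]]

def vectorize(tokens, vocabulary, binary=False):
    """Return term counts, or 0/1 presence flags when binary=True."""
    counts = Counter(tokens)
    if binary:
        return [1 if counts[term] > 0 else 0 for term in vocabulary]
    return [counts[term] for term in vocabulary]
```

Changing `remove_stop_words`, `max_features` or `binary` gives different feature sets, which is one way to study how each pre-processing choice affects the classifiers.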

In order to produce a valid conclusion, you should run your experiments on the data sliced in the following 2 ways:
I. Combine the data from the 5 folders into one dataset. Then split the combined dataset into a 70% training set and a 30% test set while maintaining the ham:spam ratio. Use these sets for training and testing all the algorithms.
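A 70/30 split that preserves the ham:spam ratio is usually called a stratified split. The following is a minimal sketch, assuming each record carries its label; the function name `stratified_split`, the `label_of` callback and the fixed `seed` are hypothetical choices made for illustration.

```python
import random

def stratified_split(records, label_of, train_fraction=0.7, seed=0):
    """Split records into train/test sets, keeping each label's proportion."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_label = {}
    for record in records:
        by_label.setdefault(label_of(record), []).append(record)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]  # copy so the caller's list is untouched
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test
```

Because each class is split separately, the training and test sets end up with approximately the same ham:spam ratio as the combined dataset.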

II. Use enron1, enron3 and enron5 for training and enron2 and enron4 for testing. Use these sets for training and testing all the algorithms.
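The folder-based slice can be sketched as a simple filter. This assumes the folder names enron1 to enron5 from Section 3.1 and a hypothetical record format of `(folder, label, text)` tuples; `split_by_folder` is an illustrative name.

```python
TRAIN_FOLDERS = {"enron1", "enron3", "enron5"}
TEST_FOLDERS = {"enron2", "enron4"}

def split_by_folder(records):
    """Split (folder, label, text) records by their source folder."""
    train = [r for r in records if r[0] in TRAIN_FOLDERS]
    test = [r for r in records if r[0] in TEST_FOLDERS]
    return train, test
```

Unlike the 70/30 split, this slice does not control the ham:spam ratio, so the class balance of the training and test sets may differ.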

You should report all performance results in terms of Precision, Recall and F-measure.
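These three metrics can be computed directly from the true and predicted labels. The sketch below assumes "spam" is treated as the positive class and reports the balanced F-measure (F1); the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Return (precision, recall, F1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

Calling the function a second time with `positive="ham"` gives the per-class figures for the other category.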

3.2.1 Written Report
• You will write a report of a minimum of 6 and a maximum of 12 pages (excluding the references and appendix) describing the results of your experiment.
• You are required to write a coherent report describing all aspects of the experiment as an attempt to prove or disprove the hypothesis. Any screenshots or large result outputs that do not directly contribute to your argument should be included in the appendix, rather than as part of the main report.
• You are also required to submit well-documented code as part of the appendix.
• You are not required to have a table of contents or executive summary for this report.
• There is no fixed format for the report. You can format it like an academic paper containing the usual sections such as Abstract, Introduction, Data Description, Results, Discussion, Conclusion and Bibliography.
• As a minimum, your report should contain a discussion of the following points:
1. A clear statement of the research question you are setting out to answer.
2. A brief introductory discussion of applications of text classification.
3. A description of the dataset and its characteristics.
4. A discussion of the similarities and differences between the three classifiers that you are comparing, as applicable to text classification.
5. The differences in the manner in which classifiers are applied in a structured data scenario and an unstructured text mining scenario.
6. Presentation and discussion of the results obtained. You should use the correct evaluation metrics in your discussion. This part of your write up should include:
o The effect of the variations of the dataset used.
o Your perception of the possible rationale for doing the tasks.
o A thorough discussion of the comparison of the results leading to the conclusion.
7. A reflection on what you learnt from this assignment and what you would do differently if you were to do the assignment again.

4 Marking Scheme
The following approximate matrix will be used to grade your assignment.

Written Report
Formatting, Language and Presentation 10%
Discussion to demonstrate an understanding of the experimental tasks in the context of text mining 30%
Satisfactory completion of the tasks for the hypothesis 30%
Discussion and presentation of the results leading to the conclusion 30%

**********************End of Assignment Specification**********************