COSC425: Final Team Project

Overview

In 2019, a Forbes article stated that "Software Ate The World, Now AI Is Eating Software". From social media platforms to motor vehicles, our lives have become intertwined with technologies that make predictions about the world around us. It is generally expected that, as time moves forward, predictive technology will continue to thread through our lives, for better or worse.

Your final project is a team project in which you will explore the gamut of practical machine learning. You will be evaluated both individually and as a team. The project spans a 10-week period, divided into four milestones:

Goal                          Description                        Value    Due Date
Milestone 1: Setting          Identify a problem and setting.    10%      Oct 9
Milestone 2: Dataset          Design your dataset.               30%      Oct 30
Milestone 3: Approach+Eval    Design your approach & eval.       30%      Nov 20
Milestone 4: Dissemination    Final report + presentation.       30%      Dec 4

Milestones

Below, you can cycle through the various Milestone tabs to see their requirements.

The objective of Milestone 1 (M1) is to identify a problem setting. Far too often, machine learning is applied to problem spaces that "don't matter". However, what "matters" is inherently debatable. We make thousands of decisions each day that influence us in small but meaningful ways. Your project should home in on one such decision.

I. Milestone Requirements

This milestone requires that your team successfully identify a problem you have chosen for your project. The requirements of this milestone are as follows:

  1. Your team must obtain the instructor's approval that your topic is suitable for the course.
  2. Your team's setting must be compliant with the constraints (see II. Setting Constraints).
  3. Your team must submit a document, by the due date, that includes the following:
    • A clear articulation of whether you are solving a classification or regression problem.
    • A brief paragraph describing why the problem matters.
    • A statement describing the contributions of each team member.
Note: Your report must be submitted by 11:59pm on October 9 via Canvas.

II. Setting Constraints

To guide you in identifying your problem setting, your team's problem must fall into one of the following thematic areas:


Theme I: Machine Learning for Productivity

Productivity is a frontier for learning. Organizations care about maximizing people's output when they're on the job. The more productive or efficient people are in their jobs, the closer a company is to reaching its output goals.

Example problems include:

  • Classifying if a person is in a focused state or not at their computer.
  • Predicting the number of bugs a software engineer will fix today.
  • Classifying "morning people" and "afternoon people" in terms of productivity.

Reference Publications

Theme II: Machine Learning for Wellbeing

Research suggests that productivity is intertwined with people's well-being. Burnout, for example, was recognized as a public health epidemic in 2018. Today, we're operating in a near-virtual world under the Coronavirus pandemic, which continues to challenge us in new and surprising ways.

Example problems include:

  • Classifying if a person needs to take a break from work.
  • Classifying people between those who need more sleep and those who don't.
  • Predicting the amount of "unhealthy" time spent on social media.

Reference Publications

Theme III: Machine Learning for Online Discourse

From heated exchanges in political niches to simple disagreements over Facebook updates, online discourse has remained anything but "friendly". Our efforts to have meaningful discourse on the Internet are complicated by third-party actors who leverage the open nature of the Internet to disinform and confuse.

Example problems include:

  • Classifying online communities (e.g. subreddits) as echo chambers.
  • Classifying online posts as harmful to discourse (e.g. trolling, harassment).
  • Classifying fake news on social media websites.

Reference Publications

Theme IV: Machine Learning for COVID-19

The Coronavirus pandemic has forced us to adapt to new learning, working, and personal environments in ways the world's population was not prepared for. In what ways can machine learning help us navigate a world where we, or those around us, need to feel safe?

Example problems include:

  • Classifying physical locations as low-risk/high-risk for COVID-19 exposure across TN.
  • Classifying health-related misinformation in Wikipedia Edits.
  • Predicting the number of mask-wearers in local grocery stores.

Reference Publications

The objective of Milestone 2 (M2) is to design and build a dataset for machine learning. Datasets are composed of two components: (1) a series of features and (2) target labels. For the purposes of our course, a "good" dataset is one that includes a comprehensive feature set that will allow you to explore how both independent features and combinations of features yield more performant learning outcomes.

I. Milestone Requirements

The requirements of this milestone are as follows:

  1. Your team's setting must be compliant with the constraints (see III. Dataset Constraints).
  2. Your team must submit a document, by the due date, that includes the following:
    • A clear articulation of whether your team is creating a dataset or using an existing dataset.
      • For both cases: You must include additional information outlined in the III. Dataset Constraints section.
    • A statement describing the contributions of each team member.
  3. Your team must submit your project's dataset file (i.e. in CSV format).
Note: Your report must be submitted by 11:59pm on October 30 via Canvas.

II. Grading

This milestone is worth 30% of your final project grade. The 30% is calculated like so:
Deliverable        Description              Value
Document           Document file exists.    5%
Data Collection    Dataset file exists.     5%

III. Dataset Constraints

You have two options for satisfying this milestone. You can either (1) Create a Dataset or (2) Use an Existing Dataset. In many cases, your problem setting will make this choice for you. For example, if you're focusing on classifying physical locations as high-risk for COVID-19 exposure, it makes sense to use a dataset collected through more reliable means than our own. In contrast, if you're classifying posts on Reddit as toxic or not, you will likely have to create your own dataset, as public datasets are generally unavailable for such a problem.

Click below to see the guidelines for each option:


Dataset creation involves three sub-tasks: (1) designing your feature set from the ground up, (2) collecting data, and (3) mapping features to target labels.

Task I. Designing a Feature Set

Your feature set serves as a defined list of features that will be used in your learning algorithm for your problem setting. It is your team's responsibility to design this feature set. Your team should address the question: "What information is most useful for helping my algorithm predict accurately?". For example, in the case of classifying COVID-related misinformation on Reddit, you might say the following features would be most useful:

Feature            Description
isScientist        a binary variable indicating the user is a scientist
visitsToCDC        a continuous variable indicating the # of visits to the CDC website
postsInChinaFlu    a binary variable indicating the user posts in /r/ChinaFlu

In creating a dataset, you should be creative. There are no constraints about the number or type of features that you can include in your dataset. It may very well be that you build a model that makes use of a subset of the dataset that you collect. Once datasets are collected, it is easy to exclude features. In contrast, it is not easy to add new features after you've collected data.
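As a concrete sketch, the hypothetical feature set above could be laid out as a CSV using Python's standard csv module. The feature names mirror the running example only; replace them with your own design:

```python
import csv

# Hypothetical feature set from the running example; swap in your own features.
FEATURES = ["ExampleID", "isScientist", "visitsToCDC", "postsInChinaFlu"]

def write_examples(path, rows):
    """Write collected examples to a CSV with one column per feature."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FEATURES)
        writer.writeheader()
        writer.writerows(rows)

write_examples("dataset.csv", [
    {"ExampleID": 1, "isScientist": 1, "visitsToCDC": 14, "postsInChinaFlu": 0},
    {"ExampleID": 2, "isScientist": 1, "visitsToCDC": 64, "postsInChinaFlu": 1},
])
```

Starting from a fixed header like this makes it harder to accidentally drop a feature mid-collection, and the resulting file satisfies the CSV deliverable for this milestone.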

Task II. Collecting Data

After you've designed your dataset, you should seek to develop a mechanism by which you can collect it. In practice, the quality of data is paramount. If your data isn't reliable, you won't be able to utilize it in any practical machine learning scenario. For the purposes of our course, we are less concerned with the quality of the data and more concerned with the presence of a dataset.

There are several ways that you can go about collecting data:

  1. Traditional / Manual Methods
    • By-Hand: All features are collected by hand / manually.
    • Observation: Record events (e.g. in a spreadsheet) as they happen.
    • Survey: Web tools, e.g. QuestionPro, or Google Forms.
  2. Automated Methods
    • Activity Logger: Tools to track activity on computers. Examples include KidLogger or RescueTime.
    • Web Scraper: Tools to extract information from the web. In Python, you can use BeautifulSoup to scrape information.
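As a minimal sketch of the web-scraping route, the snippet below uses BeautifulSoup to pull fields out of HTML. The page structure, class names, and post content here are invented for illustration; a real scraper would first download pages (e.g., with the requests library) before parsing them:

```python
from bs4 import BeautifulSoup

# Static snippet standing in for a downloaded page; in practice you would
# fetch the HTML first, then parse it the same way.
html = """
<div class="post"><span class="author">alice</span><p>First post</p></div>
<div class="post"><span class="author">bob</span><p>Second post</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Extract one dictionary (one future dataset row) per post element.
posts = [
    {
        "author": div.find("span", class_="author").get_text(),
        "text": div.find("p").get_text(),
    }
    for div in soup.find_all("div", class_="post")
]
print(posts)
```

Each extracted dictionary can then be written out as a row of your dataset CSV.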
Once data is collected, you should have a dataset of examples with populated columns for each feature. For example, continuing the example outlined above, I might say that I've collected a dataset of 400 examples with populated features that looks like the following:
ExampleID    isScientist    visitsToCDC    postsInChinaFlu
1            1              14             0
2            1              64             1
...          ...            ...            ...
399          0              14             0
400          1              3              0
An important characteristic to note is that the information in this table could have been collected through any of the traditional or automated methods outlined above. Some of them, however, may be more reliable than others. You should aim to include at least a few hundred training examples. In the event you're concerned about reaching this number, consult Dr. Williams for advice.

Task III. Mapping Features to Target Labels

Your collected data must be mapped to a target label in order to be utilized in the learning algorithms we've discussed thus far. Here, your ultimate goal is to populate the target class label / outcome column in your dataset. Continuing our running example, we've added an additional column to our dataset: the "isReliable" label.

ExampleID    isScientist    visitsToCDC    postsInChinaFlu    isReliable
1            1              14             0                  0
2            1              64             1                  1
...          ...            ...            ...                ...
399          0              14             0                  1
400          1              3              0                  1

In most cases, you (i.e., your team) will need to manually assign a target label to each example (e.g., "isReliable" being 0 or 1). To help ensure that the manual labeling process is unbiased, it is common practice for two people to label the same subset of examples. If you engage in manual labeling, you should randomly pick 20% of your examples and have two members of your team separately label them. Afterwards, you should report Cohen's kappa score, which indicates how well the two labelers agreed with one another.
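One straightforward way to compute Cohen's kappa is scikit-learn's cohen_kappa_score. The labels below are hypothetical stand-ins for two team members' annotations of the same 20% subset:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two team members on the same subset of examples.
labeler_a = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
labeler_b = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]

# Kappa corrects raw agreement for agreement expected by chance.
kappa = cohen_kappa_score(labeler_a, labeler_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

A kappa of 1.0 indicates perfect agreement, 0 indicates chance-level agreement; values above roughly 0.6 are conventionally read as substantial agreement.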

Requirements for the Milestone II Document

In your Milestone II document, you should articulate the following:

  • Your Feature Set: A table that outlines the names and descriptions of your features.
  • Your Data Collection Method: A description of your data collection method with a brief commentary on the strengths and benefits of the method as it relates to your problem setting.
  • Your Mapping Procedure: A description of your procedure for mapping features to target labels. If manual labeling was used, you should report the Cohen's kappa score of your labelers.

Using an existing dataset allows you to avoid the challenges that come with collecting data. However, it's rare that existing datasets meet the exact needs of a new project. If you choose to use an existing dataset, you therefore have three sub-tasks to undergo.

Task I. Identifying a Dataset

Finding a dataset that maps to your problem setting is challenging. There are several reliable sources for datasets, including the UCI Machine Learning repository, the CMU Machine Learning library, and Kaggle. There are a myriad of Medium articles that yield comprehensive lists of publicly available datasets.

Task II. Evaluating Dataset Reliability

Datasets are often collected with a particular purpose or goal in mind. In instances of re-use, datasets can rarely be used out-of-the-box toward a different goal. Unlike the task of collecting data, you are required to evaluate the reliability of the data toward your problem setting. You should include a paragraph in your document that describes why you believe the dataset to be reliable for the purposes of your project.

Here, "reliable" refers to data that was collected accurately. Perhaps this means that the data was used in prior research or machine learning models. Perhaps you believe that the method of collection is the most reliable pathway for collecting such data. Perhaps it was collected from a particular organization or entity that you trust. For example, if you're dealing with a dataset on the spread of COVID-19, you may have stronger faith in data collection methods that occurred in a particular region of the country. Regardless of your problem setting or dataset, your paragraph should make a case for the reliability of your data. Your paragraph should also give particular attention to what you believe the dataset's limitations to be.

Task III. Augment the Dataset (If Needed)

Because a re-used dataset was collected toward a different goal, you should give attention to ways in which you can expand its feature set toward your project's setting, assuming that the project settings are non-identical.

If the dataset needs to be augmented or extended (e.g. with new features), you should develop an approach to see this through to completion. By the end of this task, the dataset should include a complete feature set that maps to target labels.
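As a small sketch of what augmentation can look like, the pandas snippet below derives a new feature from existing columns. The dataset, column names, and values are invented purely for illustration:

```python
import pandas as pd

# Hypothetical re-used dataset; the columns and values are illustrative only.
df = pd.DataFrame({
    "county": ["Knox", "Davidson", "Shelby"],
    "cases": [120, 300, 450],
    "population": [470000, 690000, 930000],
})

# Augment with a derived feature the original collectors did not include:
# a population-normalized case rate.
df["cases_per_100k"] = df["cases"] / df["population"] * 100_000

print(df)
```

Derived features like this cost nothing to compute but can make a dataset collected for one purpose far more useful for yours.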

Requirements for the Milestone II Document

In your Milestone II document, you should articulate the following:

  • Your Dataset: A table that outlines the names and descriptions of the features in the dataset you are re-using.
  • Your Evaluation of the Dataset: A paragraph that argues for the appropriateness of the dataset for use in your project, giving particular attention to why you believe the dataset to be reliable. Dataset limitations should be articulated.
  • Your Augmentation Approach: A description of how you plan to augment your dataset, should you need to. Include a table that outlines the names and descriptions of the features you want to add to the dataset.

The objective of Milestone 3 (M3) is to implement and evaluate a learning algorithm for your problem setting. There are no constraints on the methods or tools that you can utilize toward your problem. The only technical constraint for M3 is that your implementations be completed in Python.

I. Milestone Requirements

The requirements of this milestone are as follows:

  1. Your team must submit a ZIP file, by the due date, that includes the following:
    • A document that presents your learning algorithms and their evaluation.
      • You should describe at least one equation related to each learning algorithm.
      • You should describe each algorithm's implementation, e.g. libraries used.
      • You should describe the evaluation method for each learning algorithm.
        • Provide as much detail as possible to demonstrate that your evaluation was comprehensive and conducted reliably.
        • Explore your data before applying or choosing an approach. Report what you find and how it might guide your learning approach.
        • Support your evaluation with visual aids (e.g., plots) where appropriate. Articulate what your plots suggest.
    • The source code / relevant files for your implementation and evaluation.
    • A statement describing the contributions of each team member.
Note: Your report must be submitted by 11:59pm on November 20 via Canvas.
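As one possible evaluation sketch (not a required recipe), the scikit-learn snippet below holds out a test split and compares two approaches on a synthetic stand-in for your team's dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for your dataset; load your own CSV here instead.
X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit two approaches on the same split so their scores are comparable.
results = {}
for name, model in [("logistic", LogisticRegression(max_iter=1000)),
                    ("tree", DecisionTreeClassifier(random_state=0))]:
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))

print(results)
```

A fuller evaluation would go beyond a single accuracy number, e.g. cross-validation, confusion matrices, and plots of how performance varies with feature subsets.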

II. Grading

This milestone is worth 30% of your final project grade. The 30% is calculated like so:
Deliverable    Description                                                 Value
Source Code    Source files exist and can be executed.                     5%
Approach       Selected approaches are justified and appropriately used.   5%
Evaluation     Evaluation is thorough for two or more approaches.          5%

The objective of Milestone 4 (M4) is to write up your findings in a formal document and present your findings to the class. Your document should be written with the AAAI 2020 Author Kit. It is advisable that your team write in LaTeX. The easiest way to collaborate on LaTeX documents is through Overleaf.

I. Milestone Requirements

The requirements of this milestone are as follows:

  1. Final Report: Your team must submit a document in the AAAI format, by the due date, that includes the following:
    • Your document should include six sections:
      • Introduction: Describe why your project matters.
      • Dataset: Describe your dataset, e.g. features and output label
      • Approach: Describe your approach.
      • Evaluation: Describe your evaluation.
      • Future Work: Describe what could follow your work.
      • Contributions: Describe the contributions of each team member.
  2. Final Presentation: During our allotted final exam period (December 4th @ 10:30am - 12:45pm), you will present your work. Your team must prepare a 5-minute presentation, followed by a one-minute Q&A. The presentation must include the following six slides:
    • Slide 1: State your project's title and your team members.
    • Slide 2: Describe your problem setting.
    • Slide 3: Describe your dataset, e.g. features and output label.
    • Slide 4: Describe your approaches.
    • Slide 5: Describe your evaluation and what you observed.
    • Slide 6: Describe what work could follow your findings.
      • You can use Slide 6 to describe other important aspects tied to your project, such as ethics, interpretability, or trust, that you wish you could've explored.
Note: Your report must be submitted by 11:59pm on December 4 via Canvas. Note that it is perfectly permissible to re-use text from prior Milestone submissions for this final report. You should aim to submit a report that is well-written and clearly articulates all aspects of your team's work.

II. Grading

This milestone is worth 30% of your final project grade. The 30% is calculated like so:
Deliverable     Description                                                    Value
Final Report    A written report exists with all required information.         5%
Presentation    Presentation articulates all aspects of project effectively.   10%