In 2019, a Forbes article stated that "Software Ate The World, Now AI Is Eating Software". From social media platforms to motor vehicles, our lives have become intertwined with technologies that make predictions about the world around us. It is generally expected that, as time moves forward, predictive technology will continue to thread through our lives, for better or worse.
Your final project is a team project in which you will explore the gamut of practical machine learning. You will be evaluated both individually and as a team. The project spans a 10-week period, divided into four milestones:
Goal | Description | Value | Due Date |
---|---|---|---|
Milestone 1 | | 10% | |
Milestone 2 | | 30% | |
Milestone 3 | | 30% | |
Milestone 4: Dissemination | Final report + presentation. | 30% | Dec 4 |
The requirements for each milestone are described below.
The objective of Milestone 1 (M1) is to identify a problem setting. Far too often, machine learning is applied to problem spaces that "don't matter". However, "what matters" is inherently debatable. We make thousands of decisions each day that influence us in small but meaningful ways. Your project should home in on one such decision.
This milestone requires that your team successfully identify a problem you have chosen for your project. The requirements of this milestone are as follows:
To guide you in identifying your problem setting, your team's problem must fall into one of the following thematic areas:
Productivity is a frontier for learning. Organizations care about maximizing people's output when they're on the job. The more productive or efficient people are in their jobs, the closer a company is to reaching its output goals.
Example problems include:
Research suggests that productivity is intertwined with people's well-being. Burnout, for example, was recognized as a public health epidemic in 2018. Today, we're operating in a near-virtual world under the Coronavirus pandemic, which continues to challenge us in new and surprising ways.
Example problems include:
From heated exchanges in political niches to simple disagreements on Facebook updates, online discourse has remained just about anything but "friendly". Our efforts to have meaningful discourse on the Internet are complicated by third-party actors who leverage the open nature of the Internet to disinform and confuse.
Example problems include:
The Coronavirus pandemic has forced us to adapt to new learning, working, and personal environments in ways the world's population was not prepared for. In what ways can machine learning help us navigate a world where we, or those around us, need to feel safe?
Example problems include:
The objective of Milestone 2 (M2) is to design and build a dataset for machine learning. Datasets are composed of two components: (1) a set of features and (2) target labels. For the purposes of our course, a "good" dataset is one that includes a comprehensive feature set that will allow you to explore how both independent features and combinations of features yield more performant learning outcomes.
The requirements of this milestone are as follows:
Deliverable | Description | Value |
---|---|---|
Document | Document file exists. | 5% |
Document | Dataset information supplied. | 10% |
Data Collection | Dataset file exists. | 5% |
Data Collection | File columns map to feature labels and output label. | 5% |
Data Collection | Number of examples is an appropriate size. | 5% |
You have two options for satisfying this milestone. You can either (1) Create a Dataset or (2) Use an Existing Dataset. In many cases, your problem setting will make this choice for you. For example, if you're focusing on classifying physical locations as high-risk for COVID-19 exposure, it makes sense to use a dataset collected through more reliable means than our own. In contrast, if you're classifying posts on Reddit as toxic or not, you will likely have to create your own dataset, as public datasets are generally unavailable for such a problem.
The guidelines for each option are as follows:
Dataset creation involves three sub-tasks: (1) designing your feature set from the ground up, (2) collecting data, and (3) mapping features to target labels.
Task I. Designing a Feature Set
Your feature set is the defined list of features that will be used in your learning algorithm for your problem setting. It is your team's responsibility to design this feature set. Your team should address the question: "What information is most useful for helping my algorithm predict accurately?" For example, in the case of classifying COVID-related misinformation on Reddit, you might say the following features would be most useful:
Feature | Description |
---|---|
isScientist | a binary variable indicating the user is a scientist |
visitsToCDC | a continuous variable indicating the # of visits to the CDC website |
postsInChinaFlu | a binary variable indicating the user posts in /r/ChinaFlu |
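One way to pin a feature set down before collecting anything is to write it out as a small schema in code. The following is a minimal sketch, assuming Python; the `Feature` class, the `kind` values, and the `validate_row` helper are hypothetical names introduced here for illustration, not course-provided tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """One column of the feature set (hypothetical helper for illustration)."""
    name: str
    kind: str  # "binary" or "numeric"
    description: str


# The three example features from the table above.
FEATURE_SET = [
    Feature("isScientist", "binary", "the user is a scientist"),
    Feature("visitsToCDC", "numeric", "# of visits to the CDC website"),
    Feature("postsInChinaFlu", "binary", "the user posts in /r/ChinaFlu"),
]


def validate_row(row):
    """Return a list of problems found in one example; an empty list means valid."""
    problems = []
    for feature in FEATURE_SET:
        if feature.name not in row:
            problems.append(f"missing feature: {feature.name}")
            continue
        value = row[feature.name]
        if feature.kind == "binary" and value not in (0, 1):
            problems.append(f"{feature.name} must be 0 or 1, got {value!r}")
        elif feature.kind == "numeric" and (
            not isinstance(value, (int, float)) or value < 0
        ):
            problems.append(f"{feature.name} must be a non-negative number, got {value!r}")
    return problems
```

Running `validate_row` on every collected example is a cheap way to catch a column that drifted from its definition.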
Task II. Collecting Data
After you've designed your feature set, you should develop a mechanism by which you can collect the data. In practice, the quality of data is paramount. If your data isn't reliable, you're not going to be able to utilize it in any practical machine learning scenario. For the purposes of our course, we are less concerned with the quality of the data and more concerned with the presence of a dataset.
There are several ways that you can go about collecting data:
ExampleID | isScientist | visitsToCDC | postsInChinaFlu |
---|---|---|---|
1 | 1 | 14 | 0 |
2 | 1 | 64 | 1 |
... | ... | ... | ... |
399 | 0 | 14 | 0 |
400 | 1 | 3 | 0 |
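Whatever collection method you choose, the result should end up in a file with one row per example and one column per feature. The following is a minimal sketch using Python's standard `csv` module; the `write_dataset` and `read_dataset` helpers and the file name mentioned in the comment are assumptions for illustration, and the two rows are copied from the table above.

```python
import csv
import io

FIELDNAMES = ["ExampleID", "isScientist", "visitsToCDC", "postsInChinaFlu"]


def write_dataset(rows, handle):
    """Write collected examples (a list of dicts) as CSV to an open file handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


def read_dataset(handle):
    """Read the CSV back, converting every field to an integer."""
    return [
        {key: int(value) for key, value in row.items()}
        for row in csv.DictReader(handle)
    ]


# First two examples from the table above.
collected = [
    {"ExampleID": 1, "isScientist": 1, "visitsToCDC": 14, "postsInChinaFlu": 0},
    {"ExampleID": 2, "isScientist": 1, "visitsToCDC": 64, "postsInChinaFlu": 1},
]

# An in-memory buffer stands in for a real file,
# e.g. open("dataset.csv", "w", newline="").
buffer = io.StringIO()
write_dataset(collected, buffer)
buffer.seek(0)
round_trip = read_dataset(buffer)
```

Reading the file back immediately after writing it confirms that the columns and values survive the round trip.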
Task III. Mapping Features to Target Labels
Your collected data must be mapped to a target label in order to be used in the learning algorithms we've discussed thus far. Here, your ultimate goal is to populate the target label (outcome) column in your dataset. Continuing our running example, we've added a column to our dataset for the "isReliable" label.
ExampleID | isScientist | visitsToCDC | postsInChinaFlu | isReliable |
---|---|---|---|---|
1 | 1 | 14 | 0 | 0 |
2 | 1 | 64 | 1 | 1 |
... | ... | ... | ... | ... |
399 | 0 | 14 | 0 | 1 |
400 | 1 | 3 | 0 | 1 |
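Once the label column is populated, most learning code expects the features and the target as two aligned collections. The following is a minimal sketch, assuming the rows are held as Python dictionaries; `split_features_and_labels` is a hypothetical helper name, and the rows are the four shown in the table above.

```python
FEATURES = ["isScientist", "visitsToCDC", "postsInChinaFlu"]
TARGET = "isReliable"

# Rows copied from the example table above (ExampleID 1, 2, 399, 400).
labeled = [
    {"ExampleID": 1, "isScientist": 1, "visitsToCDC": 14, "postsInChinaFlu": 0, "isReliable": 0},
    {"ExampleID": 2, "isScientist": 1, "visitsToCDC": 64, "postsInChinaFlu": 1, "isReliable": 1},
    {"ExampleID": 399, "isScientist": 0, "visitsToCDC": 14, "postsInChinaFlu": 0, "isReliable": 1},
    {"ExampleID": 400, "isScientist": 1, "visitsToCDC": 3, "postsInChinaFlu": 0, "isReliable": 1},
]


def split_features_and_labels(rows):
    """Return (X, y): feature vectors and target labels in matching order."""
    X = [[row[name] for name in FEATURES] for row in rows]
    y = [row[TARGET] for row in rows]
    return X, y


X, y = split_features_and_labels(labeled)
```

Note that `ExampleID` is deliberately left out of `X`: it identifies a row but carries no information about the label.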
Requirements for the Milestone II Document
In your Milestone II document, you should articulate the following:
Using an existing dataset allows you to ignore the challenges that come with collecting data. However, it's rare that existing datasets meet the exact needs of a new project. If you choose to use an existing dataset, you therefore have three sub-tasks to complete.
Task I. Identifying a Dataset
Finding a dataset that maps to your problem setting is challenging. There are several reliable sources for datasets, including the UCI Machine Learning repository, CMU Machine Learning library, and Kaggle. There are also many Medium articles that provide comprehensive lists of publicly available datasets.
Task II. Evaluating Dataset Reliability
Datasets are often collected with a particular purpose or goal in mind. In instances of re-use, datasets can rarely be used out-of-the-box toward a different goal. Unlike when you collect your own data, here you are required to evaluate the reliability of the data for your problem setting. You should include a paragraph in your document that describes why you believe the dataset to be reliable for the purposes of your project.
Here, "reliable" refers to data that was collected accurately. Perhaps this means that the data was used in prior research or machine learning models. Perhaps you believe that the method of collection is the most reliable pathway for collecting such data. Perhaps it was collected by a particular organization or entity that you trust. For example, if you're dealing with a dataset on the spread of COVID-19, you may have stronger faith in data collection methods that occurred in a particular region of the country. Regardless of your problem setting or dataset, your paragraph should make a case for the reliability of your data. Your paragraph should also give particular attention to what you believe the dataset's limitations to be.
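Part of describing a dataset's limitations can come from a quick audit of the file itself. The sketch below is one possible starting point, not a required procedure: it counts missing values per column and tallies the target labels. The `audit` helper and the three sample rows are hypothetical, and an audit like this complements, rather than replaces, your argument about how the data was collected.

```python
from collections import Counter


def audit(rows, target):
    """Summarize missing values per column and the distribution of the target label."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value == "":
                missing[column] += 1
    label_counts = Counter(
        row[target] for row in rows if row[target] not in (None, "")
    )
    return {
        "n_examples": len(rows),
        "missing": dict(missing),
        "label_counts": dict(label_counts),
    }


# Hypothetical rows standing in for an existing dataset; the second has a gap.
rows = [
    {"visitsToCDC": 14, "isReliable": 0},
    {"visitsToCDC": "", "isReliable": 1},
    {"visitsToCDC": 3, "isReliable": 1},
]
report = audit(rows, target="isReliable")
```

A column with many gaps or a heavily skewed label distribution is worth naming explicitly as a limitation in your paragraph.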
Task III. Augmenting the Dataset (If Needed)
Because a reused dataset rarely fits a new goal out-of-the-box, you should consider ways in which you can expand the dataset's feature set toward your project's setting, assuming that the original and new problem settings are not identical.
If the dataset needs to be augmented or extended (e.g. with new features), you should develop an approach to do so and see it through to completion. By the end of this task, the dataset should include a complete feature set that maps to target labels.
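Augmentation often amounts to deriving a new column from data you already have, or joining in an outside source. The following minimal sketch assumes an existing dataset held as a list of dictionaries and adds one derived feature. The `frequentCDCVisitor` feature, its threshold of 10 visits, and the `augment` helper are all illustrative assumptions rather than recommendations.

```python
def augment(rows, threshold=10):
    """Return new rows with a derived binary feature added; input rows are not modified."""
    augmented = []
    for row in rows:
        new_row = dict(row)  # copy so the original dataset stays untouched
        new_row["frequentCDCVisitor"] = 1 if row["visitsToCDC"] >= threshold else 0
        augmented.append(new_row)
    return augmented


# Hypothetical rows standing in for an existing dataset.
existing = [
    {"visitsToCDC": 14, "isReliable": 0},
    {"visitsToCDC": 3, "isReliable": 1},
]
extended = augment(existing)
```

Copying each row keeps the original dataset intact, which makes it easier to document exactly what your team added.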
Requirements for the Milestone II Document
In your Milestone II document, you should articulate the following:
The objective of Milestone 3 (M3) is to implement and evaluate a learning algorithm for your problem setting. There are no constraints on the methods or tools that you can apply to your problem. The only technical constraint for M3 is that your implementations be completed in Python.
The requirements of this milestone are as follows:
Deliverable | Description | Value |
---|---|---|
Source Code | Source files exist and can be executed. | 5% |
Source Code | Source files include at least two algorithm implementations. | 5% |
Approach | Selected approaches are justified and appropriately used. | 5% |
Approach | Math + Implementation is clearly presented. | 5% |
Evaluation | Evaluation is thorough for two or more approaches. | 5% |
Evaluation | Evaluation is clearly presented. | 5% |
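The rubric asks for at least two implementations and a clear evaluation of each. As a rough sketch of what such a comparison can look like, the block below pits a majority-class baseline against a 1-nearest-neighbor classifier on a held-out split, using only the standard library. The toy data, the hold-out scheme, and every function name are assumptions for illustration; your project will use its own dataset and whichever algorithms you can justify.

```python
from collections import Counter


def majority_baseline(train_X, train_y):
    """Return a predictor that always outputs the most common training label."""
    most_common = Counter(train_y).most_common(1)[0][0]
    return lambda x: most_common


def one_nearest_neighbor(train_X, train_y):
    """Return a predictor that copies the label of the closest training example."""
    def predict(x):
        # Squared Euclidean distance to every training example.
        distances = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in train_X]
        return train_y[distances.index(min(distances))]
    return predict


def accuracy(predict, test_X, test_y):
    """Fraction of test examples whose predicted label matches the true label."""
    correct = sum(predict(x) == label for x, label in zip(test_X, test_y))
    return correct / len(test_y)


# Toy data (an assumption): a numeric feature v, a binary feature, label 1 when v > 50.
X = [[v, v % 2] for v in range(101)]
y = [1 if row[0] > 50 else 0 for row in X]

# Hold out every fourth example for evaluation; train on the rest.
test_idx = set(range(0, len(X), 4))
train_X = [x for i, x in enumerate(X) if i not in test_idx]
train_y = [label for i, label in enumerate(y) if i not in test_idx]
test_X = [x for i, x in enumerate(X) if i in test_idx]
test_y = [label for i, label in enumerate(y) if i in test_idx]

results = {
    "majority baseline": accuracy(majority_baseline(train_X, train_y), test_X, test_y),
    "1-nearest neighbor": accuracy(one_nearest_neighbor(train_X, train_y), test_X, test_y),
}
for name, score in results.items():
    print(f"{name}: {score:.2f}")
```

Including a trivial baseline gives the evaluation a reference point: a more elaborate approach is only interesting to the extent that it beats it on data it has not seen.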
The objective of Milestone 4 (M4) is to write up your findings in a formal document and present them to the class. Your document should be written with the AAAI 2020 Author Kit. It is advisable that your team write in LaTeX. The easiest way to collaborate on LaTeX documents is through Overleaf.
The requirements of this milestone are as follows:
Deliverable | Description | Value |
---|---|---|
Final Report | A written report exists with all required information. | 5% |
Final Report | Report is grammatically and structurally well-written. | 10% |
Presentation | Presentation articulates all aspects of project effectively. | 10% |
Presentation | Team addresses questions during Q&A session successfully. | 5% |