List of Projects

The following are the projects for the course. Each student is to choose one of these projects, and each project will have 3 (or 4) students. At the end of the semester, each group is to submit a written report and the material developed in the project, along with the PowerPoint slides from the group's final in-class presentation (please see the course schedule). Groups will also be asked for a brief mid-semester progress report (details to be provided).

Project 1: Concept clustering enhancements to topic mining in email corpora

Topic hierarchies and ontologies are highly useful for organizing large collections of text documents. In email corpora, they can help support a wide range of applications such as automatic email classification, trend tracking across time, spam detection, etc. This project will explore the efficacy of using concept clustering based on event life cycles to further refine traditional text and information retrieval techniques in the email domain.

Specifically, the project will examine the use of features such as temporal correlation and frequency spectra in clustering, compare their behavior and properties with those of existing techniques (latent semantic analysis, tf-idf distance), and explore potentially meaningful hybrid combinations of these methods.
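As a starting point, the comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the project's method: the function names, the use of Pearson correlation on per-period counts as the "temporal correlation" feature, and the linear weighting `alpha` are all assumptions made here for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def pearson(x, y):
    """Pearson correlation of two equal-length series (e.g. weekly message counts)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def hybrid_similarity(text_sim, temporal_sim, alpha=0.5):
    """Hypothetical hybrid: a linear mix of text and temporal similarity."""
    return alpha * text_sim + (1 - alpha) * temporal_sim

# Toy example: three tokenized emails.
emails = [
    ["budget", "meeting", "monday"],
    ["budget", "review", "monday"],
    ["lunch", "pizza", "friday"],
]
vecs = tfidf_vectors(emails)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The resulting pairwise similarities could then be fed to any standard clustering routine; which weighting of the two signals works best is exactly the kind of question the project is meant to investigate.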

Requirements: Background in information retrieval or signal processing will be an advantage in this project.

Group Members
Nishant Seghal
Raghav Bhatnagar
Poorak Kashyap
Raghav Tulshibagwale

Contact: Kaijian Xu
Email: kaijian dot xu at usc dot edu


Project 2: Ontology-based knowledge extraction from web wrapper results

Knowledge extraction from web pages without semantic annotations is a challenging problem requiring complex solutions (machine learning, natural language processing, etc.). On the other hand, current state-of-the-art web wrapper induction tools make it highly convenient to extract structured data from websites, but the extracted data carries no attached semantics.

This project will explore techniques for automatic and semi-automatic semantic annotation of wrapper results using generic (non-website-specific) ontologies. Of particular interest is the derivation of design principles for web wrappers and ontologies that would help the two match as well as possible under such automatic methods.

Group Members
Paul Chao-hua Hsieh
George Konstantinidis
Rahul Parundekar


Contact: Kaijian Xu
Email: kaijian dot xu at usc dot edu


Project 3: Clustering based on similarity measures for mixed attribute datasets

Clustering is one of the main tasks in data mining, but conventional clustering methods have focused on datasets with a single attribute type. Recently, datasets with mixed attributes (e.g., categorical and numerical) have become increasingly common in informatics fields such as bioinformatics, medical informatics, geoinformatics, and information retrieval.

The goal of this project is to improve a clustering method based on similarity measures for mixed attribute datasets. The project will proceed in four steps: 1) summarize current clustering algorithms for single attribute datasets, 2) design a clustering method for mixed attribute datasets, 3) evaluate the designed clustering method, and 4) submit and present a report on the results.
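To make the idea of a similarity measure for mixed attributes concrete, here is a minimal sketch in the style of Gower's coefficient: numerical attributes contribute one minus their range-normalized distance, and categorical attributes contribute 1 on a match and 0 otherwise. The function name, the record layout, and the equal weighting of attributes are assumptions for illustration; the measure the group designs may differ.

```python
def gower_similarity(a, b, numeric_ranges):
    """Gower-style similarity between two records with mixed attributes.

    a, b:            equal-length tuples of attribute values
    numeric_ranges:  dict mapping the index of each numerical attribute to
                     its range (max - min) over the dataset; every other
                     index is treated as categorical
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_ranges:
            r = numeric_ranges[i]
            # A zero range means the attribute is constant, so values always agree.
            total += (1.0 - abs(x - y) / r) if r else 1.0
        else:
            total += 1.0 if x == y else 0.0
    return total / len(a)

# Toy records: (age, blood type, smoker)
p = (30, "A", "yes")
q = (40, "A", "no")
print(gower_similarity(p, q, {0: 20.0}))  # age range assumed to be 20 years
```

A similarity of this kind can be turned into a distance (one minus the similarity) and passed to a clustering algorithm that accepts a precomputed distance matrix, such as hierarchical clustering.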

After completing this project, group members will have a broad understanding of current clustering algorithms and experience in the design and evaluation of clustering methods for mixed attribute datasets.

Requirements: Java and/or Matlab skills are required to evaluate the designed clustering method.

Group Members
Nidhi Singh
Hardik Patel


Contact: Jongwoo Lim
Email: jonglim at usc dot edu


Project 4: Favorite Photo Recommendation in Flickr

In Flickr, a photo-sharing social networking website, users can upload photos and annotate them with tags. A user can also mark other users' photos as favorites. In this project we will build a system that predicts favorite photos for each user from that user's tag vocabulary on Flickr. Tagging is an indispensable ingredient of social networking and has enriched the social network space (e.g., it enables keyword-based resource search). However, we believe tagging can be even more beneficial. One hypothesis is that user behavior can be predicted from tag vocabulary. More specifically, it should be possible to predict potential favorite photos (user behavior) by investigating the tags of a user's current favorite photos. We will also use user relationship information to increase the accuracy of the prediction. For example, if a user has marked another user's photos as favorites, that user may be more likely to mark other photos by the same user as favorites.

In this project, we will create a classifier that predicts favorite photos for each user by using classification algorithms. To do this, photo data with attached tags is retrieved via the Flickr API. The tag vocabulary is huge, so we reduce its size with dimensionality reduction methods such as LLE, Isomap, or Laplacian Eigenmaps. Once the tag vocabulary has been reduced, the training data is generated and the classifier is created.
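Before building the full pipeline, it may help to have a simple baseline to compare against. The sketch below is such a baseline and nothing more: it skips the Flickr API, the dimensionality reduction, and the trained classifier, and simply ranks candidate photos by the cosine similarity between a user's tag-count profile and each photo's tag set. All names and the toy data are hypothetical.

```python
import math
from collections import Counter

def tag_profile(favorite_photos):
    """Count how often each tag appears across a user's current favorite photos."""
    return Counter(tag for tags in favorite_photos for tag in tags)

def score_photo(profile, tags):
    """Cosine similarity between the tag-count profile and a photo's binary tag set."""
    tags = set(tags)
    if not profile or not tags:
        return 0.0
    dot = sum(profile[t] for t in tags)  # Counter yields 0 for unseen tags
    norm = math.sqrt(sum(c * c for c in profile.values())) * math.sqrt(len(tags))
    return dot / norm

def rank_candidates(profile, candidates):
    """Return candidate photo ids ordered from most to least likely favorite."""
    return sorted(candidates, key=lambda pid: score_photo(profile, candidates[pid]),
                  reverse=True)

# Toy data: tags of two current favorites and two candidate photos.
profile = tag_profile([["sunset", "beach"], ["sunset", "sky"]])
candidates = {"p1": ["sunset", "beach"], "p2": ["cat", "indoor"]}
print(rank_candidates(profile, candidates))
```

If the classifier built on the reduced tag vocabulary cannot beat a baseline of this kind, that is a useful signal about the feature representation.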

Requirements: You should be familiar with Java or Matlab. Machine learning and/or data mining knowledge is preferred.

Group Members
Charalampos Chelmis
Huy Pham
Malay Parekh
Na Chen


Contact: Sang Su Lee
Email: sangsl at usc dot edu


Project 5: Twitter spam filtering


Twitter is a social networking and micro-blogging service that allows users to post their latest updates. Spam, however, is becoming an increasing problem on Twitter. Spammers use Twitter as a tool by replying to your @username, which causes their Tweets to show up in your timeline. There isn't really a way to filter Twitter spam directly from a Twitter client. In this project, you will use the Twitter API to collect data, create an ontology for selected spam features, and then generate a model to identify spammers using supervised or unsupervised machine learning techniques.
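As an illustration of what "selected spam features" might look like in practice, the following sketch extracts a few surface features from the text of a Tweet. It does not call the Twitter API, and the feature set, the function names, and the hand-written rule are all assumptions made for illustration; in the project the rule would be replaced by a learned model.

```python
import re

MENTION = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#\w+")

def extract_features(text):
    """Turn the raw text of a Tweet into a small dictionary of candidate spam features."""
    return {
        "mentions": len(MENTION.findall(text)),
        "urls": len(URL.findall(text)),
        "hashtags": len(HASHTAG.findall(text)),
        "is_reply": text.lstrip().startswith("@"),
        "length": len(text.split()),
    }

def looks_suspicious(features, max_urls=1):
    """Hypothetical hand-written rule standing in for a trained model."""
    return features["is_reply"] and features["urls"] > max_urls

tweet = "@alice win a free phone http://spam.example/x http://spam.example/y"
features = extract_features(tweet)
print(features, looks_suspicious(features))
```

Feature dictionaries of this shape can be collected per user and handed to a supervised classifier (with labeled spammers) or to a clustering method (without labels).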

Requirements: Programming experience sufficient to work with the Twitter APIs for data collection is required; some machine learning knowledge is also desirable.

Group Members
Arnab Dutta
Nagarjuna Kimidi
Raviteja Atluri
Jitendra Patil


Contact: Dongwoo Won
Email: dwon at usc dot edu


Project 6: A basic-sentence, ontology-driven query system for search engines


This project involves automatically building an ontology that reinforces the meaning of keyword-based queries by supplying connection words extracted from web "sentences". During this project, members will modify an open source analyzer and web crawler (in Java).

We will build an automated builder for relational description models. We will modify an existing open source indexer, Apache Lucene, to find connection words, and we will use the Google search engine to reinforce search results.
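One possible reading of "connection words" is the words that appear between two query keywords within a sentence. The sketch below illustrates that reading only; it is written in Python for brevity rather than in Java on top of Lucene, and the tokenization, the function names, and the interpretation itself are assumptions, not the project's actual design.

```python
import re
from collections import Counter

def connection_words(sentence, kw_a, kw_b):
    """Return the words found between the first occurrences of two keywords."""
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    if kw_a not in tokens or kw_b not in tokens:
        return []
    i, j = tokens.index(kw_a), tokens.index(kw_b)
    lo, hi = min(i, j), max(i, j)
    return tokens[lo + 1:hi]

def count_connections(sentences, kw_a, kw_b):
    """Aggregate connection words for a keyword pair over many sentences."""
    counts = Counter()
    for s in sentences:
        counts.update(connection_words(s, kw_a, kw_b))
    return counts

sentences = ["Paris is the capital of France.", "France has Paris as capital"]
print(count_connections(sentences, "paris", "france"))
```

Frequent connection words for a keyword pair could then serve as candidate relation labels between the two concepts in the ontology.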

Requirements: The ability to program in Java is required.

Group Members
Leo Hsu
Ed Dawa
Akhil Gada
Arun Shankar


Contact: Jinwoo Kim
Email: jinwook at usc dot edu

 

Project 7: Content-based image retrieval

The main objective of this project is to provide an understanding of content-based image retrieval (CBIR) and to implement a CBIR system. Content here means any information or features that can be extracted from the image itself. For example, color histograms, shapes of objects in the image, or textures are candidate features. However, whatever features are used for retrieval, there is a semantic gap: even when two images are highly similar in a feature space (e.g., color histogram), their semantics may be totally different.

Recent research in the field of image retrieval has tried to overcome the semantic gap by combining several features in a hierarchical manner. In this project, you are to use two kinds of features, color histograms and texture information, for the image retrieval system. Texture information will be extracted with a simple Discrete Cosine Transform (DCT), and you are to implement two different ways of using it: one applies the DCT to the whole image, and the other applies the DCT to segments of each image. Finally, you will conclude which weighted combination of the three resulting features (the color histogram and the two DCT variants) gives the best performance.
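The building blocks mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a single-channel intensity histogram instead of a full color histogram, an unnormalized DCT-II written out directly rather than a library routine, and a plain weighted average to combine per-feature similarity scores. All function names are hypothetical, and Python is used here only for brevity.

```python
import math

def dct_1d(x):
    """Unnormalized DCT-II of a sequence: X[k] = sum_n x[n] * cos(pi/N * (n + 0.5) * k)."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
            for k in range(n)]

def dct_2d(block):
    """Separable 2D DCT: transform every row, then every column."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def intensity_histogram(pixels, bins=4, max_value=256):
    """Normalized histogram of integer pixel values in [0, max_value)."""
    hist = [0] * bins
    for p in pixels:
        hist[p * bins // max_value] += 1
    return [h / len(pixels) for h in hist]

def histogram_intersection(h1, h2):
    """Similarity of two normalized histograms; 1.0 means identical."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def combined_score(scores, weights):
    """Weighted average of per-feature similarity scores."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# A constant block has all of its energy in the DC coefficient.
print(dct_2d([[1, 1], [1, 1]])[0][0])
```

Applying `dct_2d` once to the whole image and once to each segment would give the two texture variants; trying several weight vectors in `combined_score` against a labeled query set is one way to approach the final comparison.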

You can start the project by reading the paper listed below:

Datta, Ritendra; Joshi, Dhiraj; Li, Jia; Wang, James Z. (2008).
"Image Retrieval: Ideas, Influences, and Trends of the New Age".
(http://infolab.stanford.edu/~wangz/project/imsearch/review/JOUR/datta.pdf)

Requirements: You can use any programming language for the implementation; Matlab is preferred.

Group Members
Sangeet Lohariwala


Contact: Jongeun Jun
Email: jongeunj at usc dot edu
