Anvesha Sinha



M.S. Computer Science
University of Southern California
Expected Graduation: May 2016
Courses: Foundations of Artificial Intelligence, Analysis of Algorithms, Database Systems, Applied Natural Language Processing, Information Retrieval and Web Search Engines, Modern Distributed Systems, and Machine Learning

B.TECH. Computer Science and Engineering
VIT University, Vellore
Cumulative GPA: 9.34/10
Graduation: May 2014


I am passionate about Data Mining, Machine Learning and Distributed Systems. Currently, I am working with Demand Media's society6 team as a Software Engineer Intern. I am part of the 'Search and Discover' team, where I am responsible for boosting the scores of artists based on website traffic. As a side project this semester, I am working on glam and three.js under the guidance of Prof. Sathyanaraya Raghavachary (Saty). Until December 2015, I worked at the Information Sciences Institute (ISI), USC as a Graduate Student Worker under Prof. Prem Natarajan. I was also involved in Data Mining research with Prof. Anuradha J. at VIT University from 2013 to 2014, where we developed a new clustering algorithm called PRCM.
Personally, I prefer working in teams and for a cause. I love teaching and have worked as a Python tutor in the past. In my free time, I generally code, watch movies or spend time with friends.

Technical Skills:

Programming Languages: Proficient in C, C++, Java, Python, QBASIC. Familiar with C# and the .NET Framework.
IDEs: Eclipse, NetBeans, Code::Blocks. Familiar with Visual Studio.
Web Development & Databases: HTML DOM, CSS, JavaScript, PHP, XML, JSON, MySQL, PL/SQL, XQuery.
Operating Systems: Windows 98 - 8, Linux (Ubuntu)


1. TRENDICATOR: Predicting Trends Using Social Media
Goal: To help ad companies place advertisements on trending articles so that their revenue is maximized. Implementation in Python.
Link and Source Code:
2. Distributed Systems: This project contains implementations of the following:
  • MapReduce
  • Primary-Backup Key-Value Service
  • Paxos Key-Value Service
  • Sharded Key-Value Service

3. Reinforcement Learning of Dialogue Policies:
Goal: To build two-issue negotiation policies in multi-agent systems.
Our basic idea is to eliminate the need for hand-crafted corpora when learning two-issue policies. We use the WoLF algorithm proposed by Bowling et al. to update the Q-values and policies of the two agents in the multi-agent scenario. The project is currently in the implementation and debugging phase.
Implementation in Java.

4. Analysis of Research Trends:
Goal: To construct a query/visualization system to analyze research trends using research publications. The system crawls the web sites of research societies and downloads papers, including title, year, authors, affiliations, abstract and PDF. It then extracts important words for a research area (e.g., in materials research, names of composite materials, polymers, processes, etc.). The project involves data cleaning, data integration and record linkage to construct a high-quality knowledge graph. Finally, the system uses visualizations to show the evolution of research trends and where in the world different types of research are being conducted.
Check out the DIG project at:

5. Study of Graph Embedding:
Goal: To embed a graph into a high-dimensional space and project it back to solve the Steiner Tree problem.

6. Naive Bayes Text Classifier:
Goal: To build a text classifier for spam filtering and sentiment analysis. Given a training dataset, the classifier learns a model and then labels the test dataset as spam/ham (spam filtering) or pos/neg (sentiment analysis). The results are also compared with off-the-shelf implementations of Support Vector Machines (SVMlight by Thorsten Joachims) and Maximum Entropy (MegaM).
Implementation in Python.
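
As a rough illustration of the kind of classifier described in project 6 (not the project's source code), a multinomial Naive Bayes model with add-one smoothing can be sketched as below. The class name `NaiveBayes`, the whitespace tokenization, and the toy training data are assumptions made for this sketch.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = doc.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(count / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for token in doc.lower().split():
                if token in self.vocab:  # tokens never seen in training are skipped
                    score += math.log((self.word_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, invented for illustration only.
train_docs = ["win money now", "free money offer",
              "meeting at noon", "lunch at noon tomorrow"]
train_labels = ["spam", "spam", "ham", "ham"]
clf = NaiveBayes().fit(train_docs, train_labels)
```

Calling `clf.predict("free money")` on this toy data picks the spam class, since both words occur only in the spam examples. The same class works for pos/neg sentiment labels by changing the training labels.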

7. Perceptron, part-of-speech tagging and named entity recognition
Goal: To create my own discriminative classifier and apply it to two NLP sequence labeling tasks: part-of-speech tagging and named entity recognition.
Implementation in Python.

8. Error detection and correction: dealing with homophone confusion
Goal: To correct errors that occur among similar- or same-sounding words. Errors involving the following words can be corrected:
it's vs its
you're vs your
they're vs their
loose vs lose
to vs too
The aim of the project is to take a file containing errors as input and produce a file with corrections. The output file is in the same format as the input file, but with the errors corrected. I used the MegaM multi-class perceptron and the NLTK toolkit for this purpose. MegaM is used to train the model and then classify the test file. NLTK is used to find the POS tags of words. Scripts generate the featureFile that goes into MegaM for training. The feature representation chosen is: prev_word2, prev_tag2, prev_word1, prev_tag1, class1, class2, next_word1, next_tag1, next_word2, next_tag2.
Implementation in Python.
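
The context-window part of the feature representation in project 8 can be sketched as follows. This is a minimal illustration, not the project's scripts: the function name `context_features` and the `<PAD>` placeholder for sentence edges are hypothetical, the POS tags in the example are written by hand rather than produced by NLTK, and the class1/class2 slots are left out because their definition is not given here.

```python
PAD = ("<PAD>", "<PAD>")  # hypothetical filler for positions outside the sentence


def context_features(tagged, i):
    """Return word/tag features for the two tokens on each side of position i.

    `tagged` is a list of (word, tag) pairs, such as the output of a POS tagger.
    """
    def at(j):
        return tagged[j] if 0 <= j < len(tagged) else PAD

    feats = {}
    for k in (2, 1):                      # two tokens to the left
        word, tag = at(i - k)
        feats[f"prev_word{k}"] = word
        feats[f"prev_tag{k}"] = tag
    for k in (1, 2):                      # two tokens to the right
        word, tag = at(i + k)
        feats[f"next_word{k}"] = word
        feats[f"next_tag{k}"] = tag
    return feats


# Hand-tagged example sentence; position 3 holds the confusable word "its".
tagged = [("The", "DT"), ("dog", "NN"), ("wagged", "VBD"),
          ("its", "PRP$"), ("tail", "NN")]
feats = context_features(tagged, 3)
```

For the target word at position 3, the left context is "dog"/"wagged" and the right context is "tail" followed by padding, because the sentence ends one token after the target.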

9. Possibilistic Rough C-Means (PRCM): A novel and hybrid approach to clustering (July 2013 - May 2014)
Goal: To come up with a hybrid approach to clustering and then apply it to classification problems.
Result: Improved cluster parameters DB index by 91.44% and D index by 52.87% by combining rough sets and typicality. Also improved accuracy and Hamming loss of prediction in Multi-label Classification (MLC) using Bayesian Probability and k-nearest neighbors (kNN) (referred to as ML-kNN) by 48.6% and 5.8% respectively. Tested on different datasets.
Implementation in Java (Eclipse IDE) using the Weka library.

10. Crawling and Deduplication of Polar Datasets Using Nutch and Tika
Goal: To use Nutch as the core framework to perform crawling, and Tika as the main content detection and extraction framework for polar websites ACADIS, ADE and AMD.

11. Building an Apache-Solr based Search Engine and Ranking Algorithms for NSF and NASA Polar Dataset
Goal: To develop and compare two sets of ranking and retrieval approaches. Content-based approaches leverage the indexed textual content of the data, using IR techniques such as term frequency-inverse document frequency (TF-IDF) to generate relevancy. Link-based approaches leverage citation relationships (graph-based) between the indexed documents, along with information other than the textual content of the document, to perform relevancy analysis.
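
To make the content-based side concrete, here is a small TF-IDF ranking sketch. It is an illustration only: it uses one common textbook variant (term frequency normalized by document length, idf = log(N / df)), which is not Solr's scoring formula, and the function name `tfidf_rank` and the sample documents are made up for the example.

```python
import math
from collections import Counter


def tfidf_rank(docs, query):
    """Return document indices ordered by descending TF-IDF score for the query."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for idx, tokens in enumerate(tokenized):
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if df[term] and tokens:
                tf = counts[term] / len(tokens)
                idf = math.log(n / df[term])
                score += tf * idf
        scores.append((score, idx))

    # Highest score first; ties are broken by original document order.
    return [idx for score, idx in sorted(scores, key=lambda s: (-s[0], s[1]))]


# Made-up documents for illustration.
docs = ["sea ice extent in the arctic",
        "arctic ice core data",
        "penguin population survey"]
ranking = tfidf_rank(docs, "ice core")
```

Here the second document ranks first because it contains both query terms and "core" is rare across the collection; the third document contains neither term and ranks last.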

12. Encoding technique for querying rough data using Second Type Covering-Based Rough Sets (April 2013 - May 2013)
Goal: To develop an algorithm to properly encode and query rough data.
Results: The number of data retrievals increased by 52.165% using our proposed algorithm.
Implementation in C++.

13. AI Reversi Game player: Implementation of Greedy, MinMax and Alpha-Beta Pruning algorithms. (September 2014 - October 2014)
The program predicts the next move for a player in the Reversi game using the Greedy, MinMax, and Alpha-Beta pruning algorithms with positional weight evaluation functions.
Link and Source Code:
Implementation in C++.

14. First Order Logic Inference System: Implementation of Backward Chaining. (October 2014 - November 2014)
Given a knowledge base and a query sentence, the program determines if the query can be inferred from the information given in the knowledge base. The program uses the Backward-Chaining algorithm to solve this problem.
Implementation in Python.
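
A simplified sketch of backward chaining is shown below. To stay short it handles only propositional Horn clauses, so it omits the variable unification that a first-order system needs; the function name `backward_chain`, the rule format, and the sample knowledge base are assumptions made for illustration, not the project's code.

```python
def backward_chain(rules, facts, goal, visiting=frozenset()):
    """Return True if `goal` can be derived from `facts` using `rules`.

    rules: dict mapping a conclusion to a list of alternative premise lists.
    facts: set of symbols known to be true.
    """
    if goal in facts:
        return True
    if goal in visiting:
        return False  # goal is already being proved higher up: avoid looping
    visiting = visiting | {goal}
    # Try each rule that concludes the goal; all of its premises must hold.
    for premises in rules.get(goal, []):
        if all(backward_chain(rules, facts, p, visiting) for p in premises):
            return True
    return False


# Sample knowledge base, invented for illustration.
rules = {
    "criminal": [["american", "sells_weapons", "hostile_buyer"]],
    "sells_weapons": [["owns_missile"]],
    "hostile_buyer": [["enemy_of_america"]],
}
facts = {"american", "owns_missile", "enemy_of_america"}
result = backward_chain(rules, facts, "criminal")
```

The query "criminal" is proved by working backwards from the goal to its premises until every premise bottoms out in a known fact. A query with no supporting rule or fact, such as "innocent", returns False.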

15. Peephole optimizer: (Feb 2013 - March 2013)
Designed a peephole optimizer for segments of generated code and parallelized the code in a multi-core environment.
Implementation in C.


  • GHC (Grace Hopper Conference) 2015 Scholar: Received a full scholarship to attend the biggest celebration of women in computing, held in Houston, Texas from Wednesday, October 14th through Friday, October 16th, 2015.

  • Received scholarships for being a "RANK HOLDER" for all academic years at VIT. (Among the top 10 out of 500+ students): Highest rank: 3. Also received Merit Certificates for outstanding academic performance for years 2011-12, 2011-12, 2012-2013, 2013-2014.


1. Ramesh R, Roy S, Sinha A. Applicability of Rough Set Theory for Analysis of Phishing threats: Annals. Computer Science Series Tome 11, Fasc. 2; ISSN: 1583-7165 (printed journal) ISSN: 2065-7471 (e-journal) by Tibiscus University of Timisoara, Romania; appearance in December 2013. Pages 23-27.
2. Roy S, Gupta A, Sinha A, Ramesh R. Cancer Data Investigation using Variable Precision Rough Set with Flexible Classification. CCSEIT ’12 Proceedings of the Second International Conference on Computational Science, Engineering and Information Technology. Citation id 2393295. Pages 472-475. ACM New York, NY, USA © 2012. ISBN: 978-1-4503-1310-0 in ICPS: ACM International Conference Proceeding Series.