Education

Professional Experience

Technical Skills

Programming Languages:    Java, Python, C, C++, PHP, C#, MySQL, SQLite, VB, jQuery, JavaScript, SPARQL, AJAX

Scripting Languages:     HTML5-CSS3, PowerShell

Tools:     Oracle9i, Eclipse, Wireshark, OpNet, Android SDK, Calico, Google OpenRefine, Karma, AWS, Hortonworks Sandbox, Protégé

Game Engines:    Unity3D

Enterprise Framework:     Spring MVC

Source Code Control tools:     Perforce, Git

Machine Learning Tools:    SVMLight, Vowpal Wabbit, MegaM

Operating Systems:     Windows, Linux, Android, TCP/IP Network Suite, Hadoop Cloud Environment

Projects

  • USC May 2015

    Author Prediction System on Twitter

    Technologies: Python, Java, StanfordNLP, SVM, VowPalWabbit

    A system that predicts the author of a tweet from writing patterns and Stylometric Features, using NLP and Machine Learning techniques. The idea was triggered by the question "How Anonymous Can Someone be on Twitter?"

    Our system has an accuracy of 67% over 320,000 tweets.

    • This system can be used to identify writing patterns amongst users on Twitter to create clusters of users with similar writing patterns
    • System initially focused on a data set of 100 Twitter users, with a corpus of 3200 tweets per user, collected using Twitter's APIs
    • System components developed in Java & Python, with use of NLTK, Stanford NLP, TweetNLP libraries
    • Data set initially cleaned and filtered, manually and automatically, and normalized to avoid bias
    • Data divided into Training Set and Development Set (80:20)
    • Performed Part-Of-Speech tagging on tweets to extract Noun Phrases and Verb Phrases as Features
    • Stylometric Features include hashtags, mentions, Re-Tweets, slang words (LOL, LMFAO, YOLO, etc.), word count, and punctuation counts
    • Developed a Sentiment Analyzer with 5 sentiment classes
    • Performed word clustering to identify word similarity using Edit Distance between word vectors
    • Developed a clustering component to create author clusters to recommend authors to follow each other
    • Implemented a Boosting Algorithm to develop confidence measure amongst multiple classifiers, and results processed using Best of N techniques
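The stylometric features listed for the Author Prediction System could be computed along these lines. This is a minimal sketch, not the project's actual code: the function name, the three-word slang list, and the retweet test (a leading "RT ") are assumptions made for illustration.

```python
import re
import string

# Illustrative slang list (assumption); the system's real lexicon is not given.
SLANG_WORDS = {"lol", "lmfao", "yolo"}


def stylometric_features(tweet):
    """Count simple surface features of a single tweet."""
    tokens = tweet.split()
    # Strip surrounding punctuation and lowercase before the slang lookup.
    words = [t.strip(string.punctuation).lower() for t in tokens]
    return {
        "hashtags": len(re.findall(r"#\w+", tweet)),
        "mentions": len(re.findall(r"@\w+", tweet)),
        "is_retweet": int(tweet.startswith("RT ")),
        "slang_words": sum(1 for w in words if w in SLANG_WORDS),
        "word_count": len(tokens),
        "punctuation": sum(1 for ch in tweet if ch in string.punctuation),
    }


features = stylometric_features("RT @alice: LOL that was great!!! #fun #yolo")
```

Each tweet becomes a small dictionary of counts, which can then be combined with other features (for example POS-based ones) before being passed to a classifier.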
  • USC Mar 2015

    Perceptron Based Auto-Correction for Homophones

    Technologies: Python, StanfordNLP, SVM, MegaM

    Developed an autonomous system that corrects commonly occurring errors in large text documents

    The system auto-corrects large documents with an accuracy of ~98%

    • Developed a system to automatically detect and correct commonly occurring Homophone errors in large documents (Commonly occurring errors are "their" vs "they're", "your" vs "you're", "lose" vs "loose")
    • The system uses Natural Language Processing techniques to identify the context and automatically correct live stream data
    • Developed a Part-Of-Speech tagger to tag words based on their part of speech for phase 1
    • Designed and developed a Perceptron-based classifier to identify pivot words (words that need to be corrected) and correct them based on the surrounding text
    • Built training data for system from Wikipedia Corpus (1 Million text lines), the Brown Corpus, and Gutenberg Corpus - Parsed and tagged the corpora for training
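The perceptron idea behind the homophone corrector can be illustrated with a toy example. This is a sketch under stated assumptions, not the project's implementation: it handles only the pair "their" / "they're", uses just the neighbouring words plus a bias as features, and trains on four made-up sentences rather than the corpora named above. All function names are hypothetical.

```python
from collections import defaultdict

WORD_TO_LABEL = {"their": 1, "they're": -1}
LABEL_TO_WORD = {1: "their", -1: "they're"}


def context_features(tokens, i):
    """Features for the pivot word at position i: neighbours plus a bias."""
    prev_word = tokens[i - 1].lower() if i > 0 else "<s>"
    next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"
    return ["bias", "prev=" + prev_word, "next=" + next_word]


def make_examples(sentences):
    """Turn correctly written sentences into (features, label) pairs."""
    examples = []
    for sentence in sentences:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok.lower() in WORD_TO_LABEL:
                examples.append((context_features(tokens, i), WORD_TO_LABEL[tok.lower()]))
    return examples


def predict(weights, feats):
    return 1 if sum(weights.get(f, 0.0) for f in feats) > 0 else -1


def train_perceptron(examples, epochs=10):
    """Standard perceptron update: adjust weights only on mistakes."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, label in examples:
            if predict(weights, feats) != label:
                for f in feats:
                    weights[f] += label
    return weights


def correct(weights, sentence):
    """Replace every pivot word with the form the classifier prefers."""
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in WORD_TO_LABEL:
            tokens[i] = LABEL_TO_WORD[predict(weights, context_features(tokens, i))]
    return " ".join(tokens)


# Toy training data (assumption); a real system would train on large corpora.
TRAINING = [
    "I like their house",
    "they're going home",
    "we saw their car",
    "I think they're late",
]
weights = train_perceptron(make_examples(TRAINING))
fixed_one = correct(weights, "I like they're house")
fixed_two = correct(weights, "their going home")
```

A realistic version would add richer context, such as part-of-speech tags of the neighbouring words, and train one classifier per confusable pair.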
  • Symantec, LA Aug 2014

    Automated Integration Tool for Norton Internet Security

    Technologies: Python, PowerShell, Perforce

    This is a tool to handle version integration automatically

    • Provides the features of automatic integration and conflict resolution for a source to target mapping of the product versions
    • Scripts developed in Python and PowerShell; the tool was designed to integrate with project builds
    • The tool also facilitated automated notification to developers when manual resolution is required
  • USC Dec 2014

    Federated Ontology Based Query System for Sports

    Technologies: Java, Python

    This is an ontology based system to gather and analyze data about different sports.

    The system provides aggregated information & statistics about players and teams for Tennis, Soccer and Cricket

    • Information integration and semantic mapping done using a federated ontology
    • System runs queries to give statistical and semantic results about the players and teams
    • Data Sets created by developing Java and Python based Web Crawlers
    • Big Data Integration done using Karma and Protégé, with the resulting RDF data stored in a triple-store
    • SPARQL queries run using OpenRDF
  • USC Dec 2013

    Android Based Weather Application

    Technologies: Android, Java, AJAX, PHP, HTML-CSS3, Amazon-Beanstalk

    This is an Android-based system that provides users with real-time, location-specific weather details

    I also developed a Website version of the tool for PCs and Mobile devices

    • Designed and developed a client-server tool on the Android platform to provide Weather Information
    • Intermediate processing was done using a Java-based Servlet which fetches weather information using Yahoo APIs
    • The backend was developed on the Amazon Web Services EC2 cloud
  • USC Apr 2014

    Hadoop-based Map-Reduce Analytics System for Online Shopping Portals

    Technologies: Hadoop, Java, Map-Reduce, Amazon EC2

    This project is a Hadoop Based Analytics tool for Large Shopping Portals

    Tested the system on data of 10 million customers

    • Developed a Hadoop based system to implement Map-Reduce tasks to perform location based and analytics queries on data from a real-time shopping portal
    • Ran queries to analyze product sales and to evaluate customer data by location
    • Results were then analyzed using different VMs over the Amazon Web Services platform and Hortonworks Sandbox Hadoop environment
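The map, shuffle and reduce phases behind a location-based sales query can be sketched in plain Python. This is an illustration only: the real system ran on Hadoop in Java, and the record layout, sample orders and function names below are assumptions.

```python
from collections import defaultdict

# Hypothetical order records: (customer_id, city, product, amount).
ORDERS = [
    (1, "Los Angeles", "laptop", 900.0),
    (2, "Pune", "phone", 300.0),
    (3, "Los Angeles", "phone", 350.0),
    (4, "Pune", "laptop", 850.0),
    (5, "Los Angeles", "laptop", 950.0),
]


def map_order(order):
    """Map phase: emit a ((city, product), amount) pair for each order."""
    _, city, product, amount = order
    yield (city, product), amount


def shuffle(pairs):
    """Shuffle phase: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_sales(key, values):
    """Reduce phase: total sales for one (city, product) key."""
    return key, sum(values)


def run_job(orders):
    mapped = (pair for order in orders for pair in map_order(order))
    return dict(reduce_sales(k, v) for k, v in shuffle(mapped).items())


sales = run_job(ORDERS)
```

On a cluster the framework performs the shuffle and runs many mappers and reducers in parallel; the per-record logic stays the same.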
  • USC Apr 2014

    Olvido - A puzzle based mobile game for iOS and Android devices

    Technologies: Unity3D, C#, JavaScript, Maya, Unity Animator

    This is a 2.5D Side-Scroller puzzle game for iOS devices

    • Developed a 2.5D side-scroller puzzle game using the Unity 4.0 game engine
    • Designed the Terrains and background levels and developed Character animations using Unity Animator
    • The game involves interaction of Character with environment objects
    • Worked on creating Mesh colliders and scripts for interaction between different objects in the Levels
  • Symantec, LA Jul 2014

    IRKey - NFC based VPN Lock System

    Technologies: Java, Android, NFC

    A Java-based locking system for devices using VPN password technology

    It was developed for Symantec's Intern Hackathon 2014

    • Developed for Android devices, with an app acting as the key that connects to Symantec's VPN server using an MD5-hashed PIN
    • Mobile App uses NFC to communicate with the Lock-device to provide a two-tier authentication system
  • USC Feb 2014

    Robot Simulation Using Calico for a Real Time Taxi Agent

    Technologies: Python

    Developed a Simulated Robot System to Implement AI Algorithms on multi-node path traversal

    The system has the capability to be flashed on a physical Myro Bot

    • Developed a simulated robot environment in Myro (A Robotics Programming Framework) and Calico for efficient Map Traversal
    • Created a Simulated robot to traverse the Map
    • Implemented A* algorithm and Simulated Annealing on the robot to efficiently travel between 2 cities
    • The system was implemented in Python, and the robot used IR and positioning sensors to perceive its environment and location
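The A* search used for travelling between two cities can be sketched as follows. This is a generic textbook version, not the robot's actual code: the five-node map, its coordinates and road costs are invented for illustration, and straight-line distance serves as the heuristic. Simulated Annealing is not shown.

```python
import heapq
import math

# Hypothetical map: node -> (x, y) coordinates, and weighted two-way roads.
COORDS = {"A": (0, 0), "B": (3, 0), "C": (3, 4), "D": (0, 4), "E": (6, 4)}
ROADS = {
    "A": {"B": 3, "D": 5},
    "B": {"A": 3, "C": 4},
    "C": {"B": 4, "D": 3, "E": 3},
    "D": {"A": 5, "C": 3},
    "E": {"C": 3},
}


def heuristic(node, goal):
    """Straight-line distance; never overestimates the road costs used here."""
    (x1, y1), (x2, y2) = COORDS[node], COORDS[goal]
    return math.hypot(x2 - x1, y2 - y1)


def a_star(start, goal):
    """Return (cost, path) of a cheapest route, or None if unreachable."""
    frontier = [(heuristic(start, goal), 0, start, [start])]
    best_cost = {start: 0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if cost > best_cost.get(node, math.inf):
            continue  # stale queue entry; a cheaper route was found already
        for neighbor, step in ROADS[node].items():
            new_cost = cost + step
            if new_cost < best_cost.get(neighbor, math.inf):
                best_cost[neighbor] = new_cost
                priority = new_cost + heuristic(neighbor, goal)
                heapq.heappush(frontier, (priority, new_cost, neighbor, path + [neighbor]))
    return None


route = a_star("A", "E")
```

The priority queue always expands the node with the lowest estimated total cost, so the detour through D is never explored beyond its first step.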
  • USC Apr 2014

    An Efficient Cache Handling Technique in Database Systems

    Technologies: Java

    A research project analysing techniques for building a system that manages caches for Relational Databases

    • In depth literature survey which included reading multiple research papers related to efficient mid-level caching of data for faster recovery.
    • Detailed analysis of the algorithm and the implementation
    • Creating an evaluation methodology to evaluate the results of our paper.
  • USC Mar 2014

    Ghostbuster - An Inference-Based Simulation of Pacman

    Technologies: Python

    This project was an inference-based Pacman game in which Pacman automatically detects the ghosts

    • Designed an Inference Based Simulation of the Pacman game, which enables the Pacman character to automatically locate and attack invisible ghosts
    • The system uses an Approximate Inference Based particle-filter to approximate the location of the ghosts at every time-interval and then uses the sensors to attack the ghosts
    • The algorithm then develops a Bayes Network from the locations over which the Particle Filter is applied
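One update step of a particle filter of the kind described for Ghostbuster might look like this. It is a simplified sketch on a one-dimensional corridor rather than the Pacman grid; the motion model (step left, stay, or step right), the sensor model and all names are assumptions for illustration.

```python
import random
from collections import Counter


def observation_weight(particle, pacman, reading):
    """Hypothetical sensor model: weight halves for each unit of distance error."""
    error = abs(abs(particle - pacman) - reading)
    return 0.5 ** error


def particle_filter_step(particles, pacman, reading, size, rng):
    # Elapse time: each particle moves -1, 0 or +1, clamped to the corridor.
    moved = [min(size - 1, max(0, p + rng.choice((-1, 0, 1)))) for p in particles]
    # Observe: weight each particle by how well it explains the noisy reading.
    weights = [observation_weight(p, pacman, reading) for p in moved]
    # Resample: draw a new population in proportion to the weights.
    return rng.choices(moved, weights=weights, k=len(moved))


def belief(particles):
    """Approximate probability of the ghost being at each position."""
    total = len(particles)
    return {pos: n / total for pos, n in Counter(particles).items()}


rng = random.Random(0)
SIZE = 10
particles = [rng.randrange(SIZE) for _ in range(200)]
for reading in (7, 7, 6):  # noisy distance readings taken from position 0
    particles = particle_filter_step(particles, pacman=0, reading=reading, size=SIZE, rng=rng)
estimate = belief(particles)
```

After a few readings the particles concentrate around positions consistent with the sensor, and the belief dictionary gives the approximate distribution Pacman would act on.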
  • USC Nov 2013

    Server Client Socket Programming Project

    Technologies: C, Socket-Programming

    This is a File sharing system using Network Sockets

    • Developed a Client-Server system for a file exchange using sockets
    • There were 3 File Servers and a Directory server which acted as a central Hub
    • Tool was developed on Ubuntu Linux and used threads to scale to multiple clients
  • University of Pune May 2013

    Mobile Cloud Computing Forensic Tool for Image Processing Algorithms

    Technologies: C++, OpenCV, Java, Android, PHP, OpenStack

    This was my Final Year Project at University of Pune

    The system was developed to enable off-loaded processing of images

    • The project aimed to develop a client-server tool on mobile phones for processing images for forensic analysis, with the processing done in the cloud (OpenStack) using OpenCV
    • Performed Facial Detection, People Detection and Image de-blurring on images
    • Image Processing done on a cloud-based back-end server using the OpenCV framework
    • Front End developed for Android devices using Java
    • Back-end server developed in Java and hosted on an OpenStack Cloud Environment
    • Image Processing algorithms implemented in C and C++

How anonymous we are on Twitter?

George Sam, Reihane Boghrati, Vinit Parakh, Nada Aldarrab
Paper: USC CSCI 544 - Applied Natural Language Processing

Abstract

Authorship recognition is one of the well-studied areas in machine learning. However, less work has been done on author identification for short texts, especially in an environment like Twitter where text is limited to 140 characters per tweet. In this project we extracted features from around 3 million tweets from 100 different users, used them along with tf-idf vectors, and were able to get 67% accuracy on the test dataset.

Using Social Networking Data as a Location based Warning System

George Sam, Harsh Alkutkar, Prof. Kailash Tambe, Prof. Bharati Ainapure
Journal Paper: International Journal of Computer Applications, Volume 59 - Number 2, December 2012

Abstract

Twitter and Facebook are huge social networks that contain a lot of data that can be used for sentiment analysis. We often find out that a particular area we travel to is dangerous, after asking around. However what if we could use social networking data like 'tweets' to find out if a place is actually dangerous? This paper introduces how to use social networking data, analyze it and use it for alerting someone in a disaster prone or high crime rate prone area with the help of smartphones using natural language processing and sentiment analysis.

Federated Ontology for Sports

George Sam, Abhishek Agarwal, Noopur Joshi, Hari Haran Venugopal
Paper: USC CSCI 586 - Database Systems Interoperability

Abstract

Our project aims to provide brief information on player background and tournament details (schedule, location, etc.). We have created a federated ontology to include information about soccer, tennis and cricket. This system is then modeled for the data sets from each sport. We then run queries on the system to test our results.

An Efficient Cache Handling Technique in Database Systems

George Sam, Abhishek Shah, Nikhil Lakade
Paper: USC CSCI 585 - Database Systems

Abstract

In various commercial database systems, the queries which have complex structures often take longer time to execute. The efficiency of the query processing can be greatly improved if the results of the previous queries are stored in the form of caches.

These caches can be used to answer queries later on. Furthermore, the cost factor to process large and complex queries is huge in commercial databases due to the size of the databases, and hence we need a way to optimize processing by automatically caching the intermediate results. Creating such an automatic system to cache the results would help in saving time.

Existing cache systems do manage to store the intermediate results, but they suffer from the problem of not knowing how efficiently to use the cache memory to store the results. It also becomes a problem, if the database gets regularly updated. The cache would then become obsolete. It is necessary to decide when to discard a cache and the frequency of checking the updates in the database.
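The problem of stale caches described above can be illustrated with a small result cache that discards entries when a table they depend on is updated. This is a minimal sketch under assumptions, not the technique from the paper: the class name, the explicit list of tables per query, and the invalidate-on-update policy are all chosen for illustration.

```python
class QueryCache:
    """Caches query results and discards them when a table they read changes."""

    def __init__(self):
        self._results = {}  # query text -> cached result
        self._tables = {}   # query text -> set of tables the query read

    def get(self, query, tables, run_query):
        """Return the cached result, running the query only on a miss."""
        if query in self._results:
            return self._results[query]
        result = run_query(query)
        self._results[query] = result
        self._tables[query] = set(tables)
        return result

    def invalidate(self, table):
        """Drop every cached result that depends on the updated table."""
        stale = [q for q, deps in self._tables.items() if table in deps]
        for q in stale:
            del self._results[q]
            del self._tables[q]
        return len(stale)


calls = []


def fake_run(query):
    """Stand-in for a real database call (hypothetical)."""
    calls.append(query)
    return "rows for " + query


cache = QueryCache()
q = "SELECT city, SUM(amount) FROM orders GROUP BY city"
first = cache.get(q, ["orders"], fake_run)
second = cache.get(q, ["orders"], fake_run)  # served from the cache
dropped = cache.invalidate("orders")         # the table was updated
third = cache.get(q, ["orders"], fake_run)   # recomputed after invalidation
```

Invalidating on every update is the simplest policy; deciding how often to check for updates and which entries to keep within limited cache memory is the harder question the abstract raises.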