AI: Automated Detection of Letter Images

Summary In this project, students use the Python Imaging Library (PIL) and “PyTesser” to perform a simple text recognition task: identifying a sequence of letters in each image of their data set. Students are given the necessary background on Optical Character Recognition (OCR) to understand how PyTesser can be used to interpret the pixels of an image and determine the characters it contains.
Topics Artificial Intelligence, Machine Learning, Classification, Language Processing, Computer Vision, Optical Character Recognition
Audience Students who have programming experience and are interested in modern techniques in AI.
Difficulty Variant. At its most basic, suggested for students who are just learning the tasks of AI (e.g. at the granularity of recognizing what classification, clustering, and statistical prediction are.) At its most difficult, a complete small project that has the student implement their own machine learning algorithm (suggested: Naive Bayes), implement their own evaluation metrics, split the data into appropriate sets, and run preprocessing techniques in order to improve the results.
Strengths Customizable: can be used to introduce students to AI, or to give them an end-to-end experience of solving a machine learning task.
Weaknesses The provided dataset is static. OpenCV, though a popular tool, can be difficult to install depending on your computer configuration.
Dependencies Students should already know at least one programming language so that the Python code is easy to follow. Software dependencies: Python 2.7, OpenCV, PyTesser, Tesseract
Variants At the bare minimum:
  • Student runs an OCR program
Suggested customizations:
  • Student implements their own OCR
  • Student runs a provided evaluation script
  • Student implements their own evaluation
  • Student pre-processes their data to see how it affects results
Resources
  • Download Python: https://www.python.org/downloads/
  • Beginners Guide to Python: https://wiki.python.org/moin/BeginnersGuide/Programmers
  • PIL: http://www.pythonware.com/products/pil/
  • PIL (extra): http://effbot.org/imagingbook/pil-index.htm
  • Tesseract: https://github.com/tesseract-ocr/tesseract/wiki
  • OpenCV: http://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_tutorials.html

Project Description:

In this project you are going to use a library named PIL (Python Imaging Library) and “pytesser” to do a simple text recognition task. The task is to identify a sequence of letters from your data set.

The motivation behind this project came from the pain points of grading 400+ grad student assignments: manually finding each student's email and matching it with their score and ID. Looking for ways to optimize this process led us to OCR and the concept of auto-grading exams. This project will not only help students understand OCR, but can also be extended to grading exams.

This project will focus on Optical Character Recognition. Optical Character Recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text. It is widely used as a form of data entry from printed paper data records, whether passport documents, invoices, bank statements, computerized receipts, business cards, mail, printouts of static data, or any suitable documentation. It is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes such as machine translation, text-to-speech, key data and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.

As mentioned, we are going to auto-detect the sequence of letters from images like the one below. In this specific example, we ran auto-detection on USC emails. For privacy reasons, you are not given real emails but sequences of letters in the same format. Figure 1 below shows the original copy. Part of the assignment is to crop the necessary section and remove the vertical lines that border each letter. This is done by pre-processing the raw data; Figure 2 shows the final version of the image after processing.

Figure 1: This is the original copy of the image

Sheila's 1st Example

Figure 2: Cropped on the USC email section and using pre-processing to remove the borders.

Sheila Final Example
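One possible way to remove the vertical borders is sketched below. It is an illustration under stated assumptions, not necessarily the method used to produce Figure 2. It treats the cropped region as a grid of grayscale values (0 is black, 255 is white) and whitens any column that is dark over almost its full height. The function names, the darkness threshold of 128, and the 90% fraction are illustrative choices, not part of the assignment. With PIL, the grid would come from the cropped image after converting it to grayscale.

```python
def find_border_columns(pixels, dark_threshold=128, min_dark_fraction=0.9):
    """Return the indexes of columns that are dark almost from top to bottom.

    `pixels` is a list of rows, each row a list of grayscale values (0-255).
    The threshold and fraction are assumptions to tune on the real images.
    """
    if not pixels:
        return []
    height = len(pixels)
    width = len(pixels[0])
    border_columns = []
    for x in range(width):
        dark = sum(1 for y in range(height) if pixels[y][x] < dark_threshold)
        if dark >= min_dark_fraction * height:
            border_columns.append(x)
    return border_columns


def remove_border_columns(pixels, white=255, **kwargs):
    """Return a copy of `pixels` with every detected border column painted white."""
    columns = set(find_border_columns(pixels, **kwargs))
    return [[white if x in columns else value for x, value in enumerate(row)]
            for row in pixels]


# Tiny illustrative grid: columns 1 and 4 are borders, column 2 holds a letter stroke.
tiny = [
    [255, 0, 255, 255, 0],
    [255, 0,   0, 255, 0],
    [255, 0, 255, 255, 0],
    [255, 0, 255, 255, 0],
]
cleaned = remove_border_columns(tiny)
```

A letter stroke that spans the full height of the crop (for example a tall "l") would also be whitened by this rule, so leaving a small margin above and below the letters when cropping helps keep strokes below the fraction.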

Setting Up Python, PIL, and Pytesser:

Most of you are familiar with Python. If you’re not, don’t worry. An experienced programmer in any programming language can pick up Python very quickly. It is a very simple language to learn.

  • Download Python: Note that you must use Python 2.7, not the latest release (3.4 at the time of writing), because the library we are using does not work with Python 3.4.
  • Your task is to identify emails. PIL is the Python Imaging Library. It has been tested for this project on Windows only, but there should not be any problems on other platforms.
  • Windows
  • Mac/Linux: You can install PIL using the easy_install or pip command. First make sure easy_install or pip is available on your computer, then run one of the following commands:
    • sudo easy_install PIL
    • sudo pip install PIL
    • If an error occurs, such as "could not find a version that satisfies the requirement PIL (from versions) No matching distribution found for PIL", then download the zip file
    • Run python setup.py install or sudo python setup.py install in the downloaded folder
  • We are going to use the Pytesser module for this project. It is an OCR module for Python that takes an image or image file as input and outputs a string.

    • Install Tesseract, since pytesser is a Python wrapper around Tesseract. You can also do this via port or brew:
      • sudo port install tesseract
      • brew install tesseract
    • Download pytesser_v0.0.1
    • Once you have downloaded the file, run python pytesser.py in the downloaded folder and make sure that it is working. Then you can copy the code somewhere else and begin to change it.

    * If you are having any issues with brew, make sure you own everything under /usr/local and that your Homebrew repo is in a clean state. This can be done with the following commands:

    • sudo chown -R $(whoami):admin /usr/local
    • Reset the Homebrew repo to a clean state: cd /usr/local && git fetch && git reset --hard origin/master
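Once the setup above works, calling PyTesser on one image might look like the sketch below. It assumes pytesser.py is importable (for example, the script sits in the same folder) and uses PyTesser's image_file_to_string function. The helper names recognise_letters and clean_ocr_text are hypothetical, and reducing the OCR output to lower-case letters is an assumption based on the lower-case sequences used in this assignment. The OCR function is passed in as a parameter so the cleaning logic can be tried without Tesseract installed.

```python
def clean_ocr_text(raw_text):
    """Keep only letters and lower-case them.

    OCR output often carries whitespace, newlines, or stray symbols;
    dropping them is an assumption suited to letter-only sequences.
    """
    return "".join(ch for ch in raw_text if ch.isalpha()).lower()


def recognise_letters(image_path, ocr=None):
    """Run OCR on one image file and return the cleaned letter sequence.

    `ocr` is any callable that maps a file path to a string. When it is
    omitted, PyTesser's image_file_to_string is imported lazily, so this
    module can be loaded even where PyTesser is not installed.
    """
    if ocr is None:
        # Requires pytesser.py (and a working Tesseract) to be available.
        from pytesser import image_file_to_string
        ocr = image_file_to_string
    return clean_ocr_text(ocr(image_path))
```

With PyTesser available, recognise_letters("img1.bmp") would return whatever letters Tesseract reads from that file; the result depends on how well the image was pre-processed.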

    Input Format

    Five sample input images will be given to you (below). You will have to identify the sequence of letters in each and verify that the results are correct.

    img1, img2, img3, img4, img5

    Output Format

    Create a file named "output.txt" and write the string that pytesser gives for each image. If the output string of pytesser does not match any string in your list, write "None".

    Output Image

    The output shows that:

    • img1.bmp was matched with ichaowa
    • img2.bmp was matched with inxiaod
    • img3.bmp was not matched
    • img4.bmp was matched with ehualin
    • img5.bmp was not matched

    Submission

    You need to submit your source code together with your output file and a readme file. In the readme file, briefly explain your method for solving this task, then analyze your result: explain whether it is good enough, and where and how you could improve it.
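For the analysis, two simple measures can help quantify how good the results are. The sketch below is one option, not a required metric: exact-match accuracy over all images, and the Levenshtein edit distance between a recognised string and the expected one, which shows how far off a failed match was. The function names are hypothetical.

```python
def edit_distance(predicted, expected):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning `predicted` into `expected`.
    """
    previous = list(range(len(expected) + 1))
    for i, p_char in enumerate(predicted, start=1):
        current = [i]
        for j, e_char in enumerate(expected, start=1):
            cost = 0 if p_char == e_char else 1
            current.append(min(previous[j] + 1,          # delete p_char
                               current[j - 1] + 1,       # insert e_char
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def exact_match_accuracy(pairs):
    """Fraction of (predicted, expected) pairs that agree exactly."""
    if not pairs:
        return 0.0
    correct = sum(1 for predicted, expected in pairs if predicted == expected)
    return float(correct) / len(pairs)
```

A distance of 1 on a seven-letter sequence, as in edit_distance("lchaowa", "ichaowa"), suggests a single confusable character rather than a failed crop, which points to different improvements than a large distance would.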

    Files in your zip file:

    • src folder(including all your source code in this folder)
    • output.txt
    • readme.txt

Introduction

The purpose of this website is to provide instructions on teaching an AI course at various levels throughout a student's undergraduate curriculum. The levels vary from beginner with a slight background in computing and computer science to intermediate with a better understanding of computer science fundamentals and algorithms. (Quick link to zip file: Download zip file)

Contributors

  • Kelsey Fargas
  • Lizzy Staruk
  • Dr. Sheila Tejada
  • Kelly Zhou