AI: Automated Detection of Letter Images
|Summary||In this project, students use the Python Imaging Library (PIL) and “PyTesser” to do a simple text recognition task: identifying a sequence of letters from their data set. Students will be given the necessary background on OCR (Optical Character Recognition) to understand how PyTesser can be used to interpret the pixels of an image and determine the characters.|
|Topics||Artificial Intelligence, Machine Learning, Classification, Language Processing, Computer Vision, Optical Character Recognition|
|Audience||Students who have programming experience and are interested in modern techniques in AI.|
|Difficulty||Variable. At its most basic, suggested for students who are just learning the tasks of AI (e.g. at the granularity of recognizing what classification, clustering, and statistical prediction are). At its most difficult, a complete small project that has the student implement their own machine learning algorithm (suggested: Naive Bayes), implement their own evaluation metrics, split the data into appropriate sets, and run preprocessing techniques in order to improve the results.|
|Strengths||Customizable: can be used to introduce students to AI, or to give them the entire start-to-finish experience of solving a machine learning task.|
|Weaknesses||The provided dataset is static. OpenCV, though a popular tool, can be difficult to install depending on your computer configuration.|
|Dependencies||Students must understand at least one programming language so that the Python code is easy to follow. The project also depends on the following software: Python 2.7, OpenCV, PyTesser, Tesseract.|
|Variants||At the bare minimum: the student runs an OCR program. Suggested customizations: the student implements their own OCR; the student runs a provided evaluation script; the student implements their own evaluation; the student pre-processes their data to see how it affects results.|
|Resources||Download Python: https://www.python.org/downloads/ Beginners Guide to Python: https://wiki.python.org/moin/BeginnersGuide/Programmers PIL: http://www.pythonware.com/products/pil/ PIL(extra): http://effbot.org/imagingbook/pil-index.htm Tesseract: https://github.com/tesseract-ocr/tesseract/wiki OpenCV: http://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_tutorials.html|
In this project you are going to use a library named PIL (Python Imaging Library) and “pytesser” to do a simple text recognition task. The task is to identify a sequence of letters from your data set.
The motivation behind this project came from the pain points of grading 400+ grad student assignments, which involved manually finding each student's email and matching it to their score and ID. Looking for ways to optimize this process led us to OCR and the concept of auto-grading exams. This project will not only help students understand OCR, but can also be extended to grading exams.
This project will focus on Optical Character Recognition. Optical Character Recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text. It is widely used as a form of data entry from printed paper data records, whether passport documents, invoices, bank statements, computerized receipts, business cards, mail, printouts of static-data, or any suitable documentation. It is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed on-line, and used in machine processes such as machine translation, text-to-speech, key data and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.
As mentioned, we are going to auto-detect the sequence of letters from images like the one below. In this specific example, we ran auto-detection on USC emails. For privacy reasons, you are not getting real emails but sequences of letters in the same format. Figure 1 below shows the original copy. Part of the assignment is to crop the necessary section and remove the vertical lines that border each letter. This is done by pre-processing the raw data; Figure 2 shows the final version of the image after processing.
Figure 1: This is the original copy of the image
Figure 2: The image cropped to the USC email section, with the borders removed by pre-processing.
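The two pre-processing steps described above (cropping, then removing the vertical border lines) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the method used to produce Figure 2: the image is assumed to be available as a grid of grayscale values (0 = black, 255 = white), for example built from a PIL image converted to mode "L", and the function names, the darkness threshold, and the 90% column rule are all hypothetical choices.

```python
def crop(pixels, left, top, right, bottom):
    """Return rows top..bottom-1 and columns left..right-1 of the grid."""
    return [row[left:right] for row in pixels[top:bottom]]


def remove_vertical_lines(pixels, dark=128, min_fraction=0.9, white=255):
    """Whiten every column that is dark along (almost) its full height.

    Assumption: a border line runs the whole height of the cropped strip,
    while a letter stroke only covers part of a column.
    """
    height = len(pixels)
    width = len(pixels[0])
    cleaned = [list(row) for row in pixels]  # copy, so the input is untouched
    for x in range(width):
        dark_count = sum(1 for y in range(height) if pixels[y][x] < dark)
        if dark_count >= min_fraction * height:
            for y in range(height):
                cleaned[y][x] = white
    return cleaned
```

A rule this simple can also erase a tall, thin letter stroke if the crop is very tight, so the threshold would need tuning against the actual images.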
Setting Up Python, PIL, and Pytesser:
Most of you are familiar with Python. If you’re not, don’t worry: an experienced programmer in any programming language can pick up Python very quickly. It is a very simple language to learn.
- Download Python: Note that you have to use Python 2.7 and not the latest release, which is 3.4. (The library we are using doesn’t work with Python 3.4.)
- Mac/Linux: You can install PIL using the easy_install or pip command. First make sure you have easy_install or pip available on your computer, then use one of the following commands:
- sudo easy_install PIL
- sudo pip install PIL
- If something goes wrong, such as the error "Could not find a version that satisfies the requirement PIL (from versions) No matching distribution found for PIL", then download the zip file
- Run: python setup.py install or sudo python setup.py install in the downloaded folder
- Install Tesseract, since pytesser is a Python wrapper around Tesseract. You can do this via port or brew:
- sudo port install tesseract
- brew install tesseract
- Download pytesser_v0.0.1
- Once you have downloaded the file, run python pytesser.py in the downloaded folder and make sure that it is working. Then you can copy the code somewhere else and begin to change it.
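Once the setup steps above work, calling pytesser from your own script might look like the sketch below. It assumes that pytesser.py is importable from your working folder and that Tesseract can be found from there. The helper name read_letters and the decision to keep only lowercase letters are illustrative assumptions, not part of pytesser.

```python
def read_letters(image_path, ocr=None):
    """Run OCR on one image file and keep only the letters, lowercased.

    `ocr` is any function mapping a file path to a string; if omitted,
    pytesser's image_file_to_string is used (assumed to be installed).
    """
    if ocr is None:
        # Imported lazily so the helper can be tried without pytesser installed.
        from pytesser import image_file_to_string
        ocr = image_file_to_string
    raw_text = ocr(image_path)
    # OCR output usually carries trailing newlines and stray symbols;
    # dropping non-letters is an assumption that fits letter-only data.
    return "".join(ch for ch in raw_text if ch.isalpha()).lower()
```

Passing a stand-in function for ocr makes it possible to check the clean-up logic before the Tesseract installation is sorted out.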
We are going to use the Pytesser module for this project. It is an OCR module for Python that takes an image or image file as input and outputs a string.
* If you are having any issues with brew, be sure you own everything under /usr/local and that your Homebrew repo is in a clean state. This can be done with the following commands:
- sudo chown -R $(whoami):admin /usr/local
- Reset the Homebrew repo to a clean state: cd /usr/local && git fetch && git reset --hard origin/master
Five sample input images will be given to you. You will have to identify the sequence of letters in each image and verify that it is correct.
Create a file named "output.txt" and write the string that pytesser gives for each image. If the output string of pytesser does not match any string in your list, write "None".
The output shows that:
- img1.bmp was matched with ichaowa
- img2.bmp was matched with inxiaod
- img3.bmp was not matched
- img4.bmp was matched with ehualin
- img5.bmp was not matched
You need to submit your source code with your output file and a readme file. In the readme file, briefly explain your method for solving this task. Then analyze your result: explain whether it is good enough, or where and how you could improve it.
Files in your zip file:
- src folder (include all your source code in this folder)
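As one possible shape for the source code, the output.txt step described earlier could be sketched as follows. The function names, the case-insensitive comparison, and the one-result-per-line layout are assumptions made for illustration; the assignment only fixes the file name and the use of "None" for unmatched strings.

```python
def choose_output(ocr_text, known_strings):
    """Return the known string that equals the OCR text, else "None"."""
    cleaned = ocr_text.strip().lower()
    for candidate in known_strings:
        if cleaned == candidate.lower():
            return candidate
    return "None"


def format_output(ocr_texts, known_strings):
    """Build the file content: one result per image, one per line."""
    lines = [choose_output(text, known_strings) for text in ocr_texts]
    return "\n".join(lines) + "\n"


def write_output(ocr_texts, known_strings, path="output.txt"):
    """Write the formatted results to the output file."""
    with open(path, "w") as handle:
        handle.write(format_output(ocr_texts, known_strings))
```

Counting how many lines are not "None" gives a first, rough measure of how well the OCR step did, which can feed the analysis in the readme.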