Tesseract

Using Tesseract in Python to recognize CAPTCHAs

Tesseract

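Tesseract is an optical character recognition (OCR) engine that runs on multiple operating systems. It is free software released under the Apache License, and its development has been sponsored by Google since 2006. In 2006, Tesseract was considered one of the most accurate open-source OCR engines.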

An optical character recognition (OCR) engine
Tesseract is an OCR engine with support for Unicode and the ability to recognize more than 100 languages out of the box. It can be trained to recognize other languages.

0 Installation

Install tesseract on macOS:

brew install tesseract

Check the version:

tesseract -v

To use it from Python, install pytesseract:

pip install pytesseract

1 Usage

Directly in the terminal

tesseract foobar.jpg stdout -l eng --oem 1 --psm 6

-l specifies the language.
--oem sets the OCR Engine Mode, which has four options (the number is the value passed on the command line):

  0. Legacy engine only.
  1. Neural nets LSTM engine only.
  2. Legacy + LSTM engines.
  3. Default, based on what is available.

--psm sets the Page Segmentation Mode, which has fourteen options (the number is the value passed on the command line):

  0. Orientation and script detection (OSD) only.
  1. Automatic page segmentation with OSD.
  2. Automatic page segmentation, but no OSD, or OCR.
  3. Fully automatic page segmentation, but no OSD. (Default)
  4. Assume a single column of text of variable sizes.
  5. Assume a single uniform block of vertically aligned text.
  6. Assume a single uniform block of text.
  7. Treat the image as a single text line.
  8. Treat the image as a single word.
  9. Treat the image as a single word in a circle.
  10. Treat the image as a single character.
  11. Sparse text. Find as much text as possible in no particular order.
  12. Sparse text with OSD.
  13. Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.

See: ControlParams · tesseract-ocr/tesseract Wiki · GitHub
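The same command can be assembled and run from a script. The following is only a minimal sketch: the helper names `build_tesseract_command` and `run_tesseract` are made up for illustration, and it assumes the mode values are the zero-based integers printed by `tesseract --help-extra` (0 to 3 for `--oem`, 0 to 13 for `--psm`).

```python
import subprocess


def build_tesseract_command(image_path, lang="eng", oem=1, psm=6):
    """Return the argument list for a tesseract call that prints to stdout.

    Hypothetical helper: it only validates the two mode numbers and
    formats the flags shown in the terminal example.
    """
    if oem not in range(4):
        raise ValueError("--oem must be between 0 and 3")
    if psm not in range(14):
        raise ValueError("--psm must be between 0 and 13")
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,
        "--oem", str(oem),
        "--psm", str(psm),
    ]


def run_tesseract(image_path, **options):
    """Run tesseract (it must be on PATH) and return the recognized text."""
    command = build_tesseract_command(image_path, **options)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

Calling `run_tesseract("foobar.jpg", psm=7)` would then try the single-text-line mode, which is often worth testing on one-line CAPTCHAs.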

In Python

Load the image with PIL and run OCR on it with pytesseract:

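from PIL import Image
import pytesseract

im = Image.open('foobar.jpg')
im = im.convert('RGB')  # convert to RGB channels
# image_to_string runs Tesseract on the image and returns the recognized text
print(pytesseract.image_to_string(im))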

2 Improving recognition accuracy

Preprocess the image

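Running OCR directly on a CAPTCHA rarely produces the right text, especially when the image contains deliberate interference such as noise. Preprocessing the image first can improve the success rate.
See: ImproveQuality · tesseract-ocr/tesseract Wiki · GitHub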

- Rescaling
- Binarization
- Noise removal
- Rotation / deskewing
- Border removal
- Removing transparency / the alpha channel
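Two of these steps, removing the alpha channel and rescaling, take only a few lines with PIL. This is a rough sketch rather than a recommended recipe: the function name `flatten_and_rescale`, the white background, and the scale factor of 2 are all assumptions chosen for illustration, and the best values depend on the CAPTCHA.

```python
from PIL import Image


def flatten_and_rescale(img, scale=2, background=(255, 255, 255)):
    """Paste the image onto a solid background, then enlarge it.

    Hypothetical helper: background colour and scale are illustrative.
    """
    has_alpha = img.mode in ("RGBA", "LA") or (
        img.mode == "P" and "transparency" in img.info
    )
    if has_alpha:
        rgba = img.convert("RGBA")
        canvas = Image.new("RGB", rgba.size, background)
        # The alpha channel is used as the paste mask, so fully
        # transparent pixels end up showing the background colour.
        canvas.paste(rgba, mask=rgba.getchannel("A"))
        img = canvas
    else:
        img = img.convert("RGB")
    new_size = (img.width * scale, img.height * scale)
    return img.resize(new_size, Image.LANCZOS)
```

The result is an RGB image that can then be converted to grayscale and binarized.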

Binarization with PIL:

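def convert_Image(img, standard=127.5):
    '''
    [Binarization]
    Set every pixel to 0 (black) or 255 (white) according to the threshold,
    which makes the later segmentation step easier.
    '''
    image = img.convert('L')  # convert to grayscale
    pixels = image.load()
    for x in range(image.width):
        for y in range(image.height):
            # pixels brighter than the threshold become white, the rest black
            if pixels[x, y] > standard:
                pixels[x, y] = 255
            else:
                pixels[x, y] = 0
    return image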
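Once an image has been binarized, isolated specks can be cleaned up with a simple neighbour count. The sketch below is one common approach and is not taken from the Tesseract documentation; the name `remove_isolated_pixels` and the default of two neighbours are assumptions. It works on anything indexable as `pixels[x, y]`, which includes the object returned by PIL's `image.load()`.

```python
def remove_isolated_pixels(pixels, width, height, min_neighbors=2):
    """Turn black pixels white when too few of their 8 neighbours are black.

    Assumes a binarized image where 0 is black and 255 is white.
    Returns the number of pixels that were cleared.
    """
    to_clear = []
    for x in range(width):
        for y in range(height):
            if pixels[x, y] != 0:
                continue  # only black pixels can be noise here
            black_neighbors = 0
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    if dx == 0 and dy == 0:
                        continue
                    nx, ny = x + dx, y + dy
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and pixels[nx, ny] == 0:
                        black_neighbors += 1
            if black_neighbors < min_neighbors:
                to_clear.append((x, y))
    # Apply the changes only after the scan, so that clearing one pixel
    # does not affect the neighbour count of the next.
    for x, y in to_clear:
        pixels[x, y] = 255
    return len(to_clear)
```

With PIL this could be called as `remove_isolated_pixels(image.load(), image.width, image.height)` on a binarized grayscale image. Thin strokes may need a lower `min_neighbors` so that real characters are not eroded.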

Adjust parameters and configuration files

By default Tesseract is optimized to recognize sentences of words. If you’re trying to recognize something else, like receipts, price lists, or codes, there are a few things you can do to improve the accuracy of your results, as well as double-checking that the appropriate segmentation method is selected.
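For short codes, one such adjustment is to pick a segmentation mode for a single line and restrict the allowed characters. The sketch below passes these settings through pytesseract's `config` argument; the helper names `build_config` and `recognize_code` are hypothetical. `--psm 7` is the mode Tesseract documents as "treat the image as a single text line", and `tessedit_char_whitelist` is a Tesseract config variable whose effect with the LSTM engine may depend on the installed version, so treat the result as something to verify.

```python
def build_config(psm=7, oem=1, whitelist=None):
    """Build the extra-arguments string that pytesseract forwards to tesseract."""
    parts = ["--oem", str(oem), "--psm", str(psm)]
    if whitelist:
        # -c sets a Tesseract config variable; this one limits the
        # output to the listed characters (no spaces in the list).
        parts += ["-c", "tessedit_char_whitelist=" + whitelist]
    return " ".join(parts)


def recognize_code(image, psm=7, oem=1, whitelist=None):
    """Run OCR on a PIL image with settings aimed at a short code."""
    import pytesseract  # imported here so build_config works without it

    config = build_config(psm=psm, oem=oem, whitelist=whitelist)
    text = pytesseract.image_to_string(image, lang="eng", config=config)
    return text.strip()
```

For a digits-only CAPTCHA, `recognize_code(im, whitelist="0123456789")` would be a reasonable first attempt.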

Training Tesseract

Medium: articles tagged "training Tesseract"
python+tesseract: training and cracking CAPTCHAs - Zhihu
Official guide:
https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
jTessboxEditor:
http://vietocr.sourceforge.net/training.html