Using Tesseract in Python to Recognize CAPTCHAs
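Tesseract
Tesseract is an optical character recognition (OCR) engine that runs on multiple operating systems. It is free software released under the Apache License, and its development has been sponsored by Google since 2006. In 2006, Tesseract was considered one of the most accurate open-source OCR engines.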
An optical character recognition (OCR) engine
Tesseract is an OCR engine with support for unicode and the ability to recognize more than 100 languages out of the box. It can be trained to recognize other languages.
0 Installation
Install tesseract on macOS:
brew install tesseract
Check the version:
tesseract -v
To use it from Python, install pytesseract:
pip install pytesseract
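Before calling Tesseract from Python, it can help to confirm that the `tesseract` executable is actually reachable. The following is a minimal sketch using only the standard library; the helper name `find_tesseract` is an assumption made for illustration and is not part of pytesseract:

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH; install it first (e.g. brew install tesseract)")
    else:
        # Same check as running `tesseract -v` by hand.
        result = subprocess.run([path, "-v"], capture_output=True, text=True)
        # Some versions print the version banner to stderr instead of stdout.
        print(result.stdout or result.stderr)
```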
1 Usage
Using it directly in the terminal
tesseract foobar.jpg stdout -l eng --oem 1 --psm 6
-l specifies the language.
--oem specifies the OCR Engine Mode, which has four options:
0. Legacy engine only.
1. Neural nets LSTM engine only.
2. Legacy + LSTM engines.
3. Default, based on what is available.
--psm specifies the Page Segmentation Mode, which has fourteen options (0 to 13):
0. Orientation and script detection (OSD) only.
1. Automatic page segmentation with OSD.
2. Automatic page segmentation, but no OSD, or OCR.
3. Fully automatic page segmentation, but no OSD. (Default)
4. Assume a single column of text of variable sizes.
5. Assume a single uniform block of vertically aligned text.
6. Assume a single uniform block of text.
7. Treat the image as a single text line.
8. Treat the image as a single word.
9. Treat the image as a single word in a circle.
10. Treat the image as a single character.
11. Sparse text. Find as much text as possible in no particular order.
12. Sparse text with OSD.
13. Raw line. Treat the image as a single text line.
ControlParams · tesseract-ocr/tesseract Wiki · GitHub
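The command-line options described above can also be assembled and run from a script. This is a rough sketch that uses only the standard library; `build_tesseract_command` and `run_tesseract` are hypothetical helper names, and the range checks simply mirror the option lists in this post:

```python
import subprocess


def build_tesseract_command(image_path, lang="eng", oem=1, psm=6):
    """Build the argument list for a tesseract CLI call that prints to stdout."""
    if oem not in range(4):
        raise ValueError("oem must be between 0 and 3")
    if psm not in range(14):
        raise ValueError("psm must be between 0 and 13")
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,
        "--oem", str(oem),
        "--psm", str(psm),
    ]


def run_tesseract(image_path, **options):
    """Run tesseract on image_path and return the recognized text."""
    command = build_tesseract_command(image_path, **options)
    # check=True raises CalledProcessError if tesseract exits with an error.
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout.strip()
```

For example, `run_tesseract('foobar.jpg', psm=7)` would treat the image as a single line of text, which is often a closer match for a CAPTCHA than a full block.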
Using it in Python:
from PIL import Image
import pytesseract
im = Image.open('foobar.jpg')
im = im.convert('RGB')  # convert to RGB channels
print(pytesseract.image_to_string(im))  # run OCR and print the recognized text
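The same `-l`, `--oem` and `--psm` options used in the terminal can be passed through pytesseract's `image_to_string` via its `lang` and `config` arguments. A small sketch, where `tesseract_config` and `recognize` are hypothetical helper names and the default values are just the ones from the terminal example:

```python
def tesseract_config(oem=1, psm=6):
    """Format engine and segmentation modes the way the tesseract CLI expects them."""
    return f"--oem {oem} --psm {psm}"


def recognize(image_path, lang="eng", oem=1, psm=6):
    """Return the text pytesseract recognizes in the image at image_path."""
    # Imported here so tesseract_config can be used without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as im:
        return pytesseract.image_to_string(
            im.convert("RGB"), lang=lang, config=tesseract_config(oem, psm)
        )
```

Calling `print(recognize('foobar.jpg'))` should then behave much like the terminal command shown earlier.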
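2 Improving recognition accuracy
Preprocessing the image
Running OCR directly on a CAPTCHA has a very low chance of producing the right answer,
especially when the image contains various kinds of interference.
Preprocessing the image first can improve the recognition success rate.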
ImproveQuality · tesseract-ocr/tesseract Wiki · GitHub
- Rescaling
- Binarization
- Noise removal
- Rotation / deskewing
- Border removal
- Removing transparency / the alpha channel
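As an illustration of one item from this list, noise removal on an already binarized image can be sketched as dropping black pixels that have no black neighbors. The version below works on a plain list of rows rather than on a PIL image; the function name and the `min_neighbors` default are assumptions for illustration, not tuned values:

```python
def remove_isolated_pixels(bitmap, min_neighbors=1):
    """Return a copy of a 0/255 bitmap with isolated black pixels turned white.

    bitmap is a list of rows; 0 is black (ink) and 255 is white (background).
    A black pixel is kept only if at least min_neighbors of its 8 neighbors
    are also black.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    cleaned = [row[:] for row in bitmap]  # leave the input untouched
    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != 0:
                continue  # only black pixels are candidates for removal
            black_neighbors = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dx == 0 and dy == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width and bitmap[ny][nx] == 0:
                        black_neighbors += 1
            if black_neighbors < min_neighbors:
                cleaned[y][x] = 255
    return cleaned
```

With PIL, such a bitmap could be filled from the pixel access object returned by `image.load()` and written back the same way. Thin strokes are easily eroded if `min_neighbors` is raised, so the value is worth checking against real samples.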
Binarization with PIL:
def convert_Image(img, standard=127.5):
    image = img.convert('L')  # convert to grayscale
    # Binarization:
    # using the threshold, set every pixel to 0 (black) or 255 (white),
    # which makes the later segmentation easier
    pixels = image.load()
    for x in range(image.width):
        for y in range(image.height):
            if pixels[x, y] > standard:
                pixels[x, y] = 255
            else:
                pixels[x, y] = 0
    return image
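The per-pixel loop can be slow on larger images. One alternative, assuming Pillow's `Image.point` method is given a 256-entry lookup table, applies the same threshold in a single call. The names `threshold_table` and `binarize` are hypothetical:

```python
def threshold_table(standard=127.5):
    """Return a 256-entry lookup table: values above standard map to 255, the rest to 0."""
    return [255 if value > standard else 0 for value in range(256)]


def binarize(img, standard=127.5):
    """Binarize a PIL image by converting to grayscale and applying the lookup table."""
    return img.convert("L").point(threshold_table(standard))
```

Because `binarize` only calls methods on the image it receives, it can be used in place of the loop-based function, for example `binarize(Image.open('foobar.jpg'))`.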
Adjusting parameters and configuration
By default Tesseract is optimized to recognize sentences of words. If you're trying to recognize something else, like receipts, price lists, or codes, there are a few things you can do to improve the accuracy of your results, as well as double-checking that the appropriate segmentation method is selected.
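- Choosing the Page Segmentation Mode to match the kind of text you expect to recognize can improve the recognition rate.
- If most of the text is not dictionary words, disabling the dictionaries Tesseract uses should improve recognition. They can be disabled by setting the two configuration variables load_system_dawg and load_freq_dawg to false. (ControlParams · tesseract-ocr/tesseract Wiki · GitHub)
- You can load custom dictionaries to help recognize words. (tesseract/tesseract.1.asc at master · tesseract-ocr/tesseract · GitHub)
- Use tessedit_char_whitelist to restrict the set of characters that can be recognized. For a CAPTCHA, for example, you can specify a-z and 0-9.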
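These settings can be combined into a single config string for pytesseract. The sketch below rests on several assumptions: that the variables are accepted through the CLI's `-c name=value` syntax, that page segmentation mode 7 (single text line) suits the CAPTCHA at hand, and that the installed Tesseract version honors the whitelist for the engine in use (some versions may ignore it). The helper names are hypothetical:

```python
import string


def captcha_config(psm=7, whitelist=string.ascii_lowercase + string.digits,
                   use_dictionaries=False):
    """Build a pytesseract config string for short, non-dictionary text."""
    parts = [f"--psm {psm}", f"-c tessedit_char_whitelist={whitelist}"]
    if not use_dictionaries:
        # Turn off the word lists, since CAPTCHA strings are rarely real words.
        parts.append("-c load_system_dawg=false")
        parts.append("-c load_freq_dawg=false")
    return " ".join(parts)


def recognize_captcha(img):
    """Run pytesseract on a PIL image using the CAPTCHA-oriented config."""
    import pytesseract  # imported lazily so captcha_config works without it

    return pytesseract.image_to_string(img, config=captcha_config()).strip()
```

If the CAPTCHA also uses uppercase letters, the whitelist can be widened, for example `captcha_config(whitelist=string.ascii_letters + string.digits)`.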
Training Tesseract
Medium: posts tagged "Training Tesseract"
python+tesseract: training and cracking CAPTCHAs - Zhihu
Official guide:
https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
jTessboxEditor:
http://vietocr.sourceforge.net/training.html