tesseract-ocr-demo

Tesseract-OCR captcha recognition

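1. Download Tesseract. The latest version at the time of writing is tesseract-ocr-setup-3.05.01.exe.
2. Install it. During installation, tick the Chinese language data if you need to recognize Chinese.
3. Configure the environment variables.
Add the installation directory to PATH, and set the TESSDATA_PREFIX environment variable to the tessdata directory.
4. Restart the computer. (IDEA only picked up the new environment variables after a full restart.)
5. Test from the command line.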

Recognition result:

The output contains some unrelated characters.
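
Before moving on to the Java code, the setup from steps 3 to 5 can be sanity-checked programmatically. The following is only a minimal sketch: the class name TesseractEnvCheck and its methods are hypothetical, and it only checks that TESSDATA_PREFIX names an existing directory and that a tesseract executable can be found on PATH. Which directory TESSDATA_PREFIX should point to may depend on the Tesseract version, so the sketch does not try to validate its contents.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original post.
class TesseractEnvCheck {

    /** Returns a list of problems; an empty list means the setup looks plausible. */
    static List<String> findProblems(Map<String, String> env) {
        List<String> problems = new ArrayList<>();
        String prefix = lookup(env, "TESSDATA_PREFIX");
        if (prefix == null || prefix.isEmpty()) {
            problems.add("TESSDATA_PREFIX is not set");
        } else if (!new File(prefix).isDirectory()) {
            problems.add("TESSDATA_PREFIX is not a directory: " + prefix);
        }
        if (findOnPath(env, "tesseract.exe") == null && findOnPath(env, "tesseract") == null) {
            problems.add("tesseract was not found on PATH");
        }
        return problems;
    }

    /** Case-insensitive lookup, because Windows often stores PATH as "Path". */
    static String lookup(Map<String, String> env, String name) {
        for (Map.Entry<String, String> entry : env.entrySet()) {
            if (entry.getKey().equalsIgnoreCase(name)) {
                return entry.getValue();
            }
        }
        return null;
    }

    /** Searches every PATH entry for a regular file with the given name. */
    static File findOnPath(Map<String, String> env, String exeName) {
        String path = lookup(env, "PATH");
        if (path == null) {
            return null;
        }
        for (String dir : path.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue; // skip empty entries such as a trailing separator
            }
            File candidate = new File(dir, exeName);
            if (candidate.isFile()) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> problems = findProblems(System.getenv());
        if (problems.isEmpty()) {
            System.out.println("Environment looks OK");
        } else {
            for (String problem : problems) {
                System.out.println(problem);
            }
        }
    }
}
```

If the check reports a missing variable right after it was configured, the process was probably started before the change took effect, which matches the restart in step 4.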

6. Test code (Java):

import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

/**
 * Captcha recognition
 * @author j.tommy
 * @version 1.0
 * @date 2018/1/11
 */
public class LJValideCode {
    private static final String LANG_OPTION = "-l";

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 10; i++) {
            String code = recognizeText(new File("D:/Work/helloworld/resources/validate/download/" + i + ".jpg"));
            System.out.println(code);
        }
    }

    /**
     * @param imageFile the image file to recognize
     * @return the digits recognized in the image
     */
    public static String recognizeText(File imageFile) throws Exception {
        // Output base name; tesseract appends ".txt", so the result lands in output.txt next to the image.
        File outputFile = new File(imageFile.getParentFile(), "output");
        List<String> cmd = new ArrayList<String>();
        cmd.add("tesseract");
        cmd.add(imageFile.getName());
        cmd.add(outputFile.getName());
        cmd.add(LANG_OPTION);
        // cmd.add("chi_sim"); // use instead of "eng" to recognize Simplified Chinese
        cmd.add("eng");

        ProcessBuilder pb = new ProcessBuilder(cmd);
        // Run in the image's directory so the relative file names above resolve.
        pb.directory(imageFile.getParentFile());
        // Merge stderr into stdout so a single stream carries all messages.
        pb.redirectErrorStream(true);
        // Print the command being run, e.g. tesseract 1.jpg output -l eng
        System.out.println(String.join(" ", cmd));
        Process process = pb.start();

        // Drain the process output before waiting, so a full pipe buffer cannot block tesseract.
        StringBuilder processOutput = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                processOutput.append(line).append('\n');
            }
        }
        // By convention, an exit value of 0 indicates normal termination.
        if (process.waitFor() != 0) {
            throw new RuntimeException(processOutput.toString());
        }

        // Read the recognized text, then delete the temporary result file.
        File resultFile = new File(outputFile.getAbsolutePath() + ".txt");
        String text = new String(Files.readAllBytes(resultFile.toPath()), StandardCharsets.UTF_8);
        resultFile.delete();

        // The captchas here are numeric. The raw output contains some unrelated
        // characters, so keep only the digits.
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isDigit(c)) {
                result.append(c);
            }
        }
        return result.toString();
    }
}
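
The code above removes the unrelated characters after recognition. Another option that may be worth trying is to restrict recognition to digits up front by appending the `digits` config file name to the command line. The sketch below assumes that this config file is present under tessdata/configs in the installation. The class TesseractCommand and its `build` method are hypothetical names used only for illustration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original post.
class TesseractCommand {

    /**
     * Builds: tesseract <image> <outputBase> -l <lang> [digits]
     * Config file names such as "digits" go after the options.
     */
    static List<String> build(File imageFile, String outputBase, String lang, boolean digitsOnly) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(imageFile.getName());
        cmd.add(outputBase);
        cmd.add("-l");
        cmd.add(lang);
        if (digitsOnly) {
            // Assumption: the "digits" config exists in tessdata/configs.
            cmd.add("digits");
        }
        return cmd;
    }

    public static void main(String[] args) {
        // Prints: tesseract 1.jpg output -l eng digits
        System.out.println(String.join(" ", build(new File("1.jpg"), "output", "eng", true)));
    }
}
```

The returned list can be passed to ProcessBuilder in the same way as the cmd list in recognizeText. Keeping the digit filter in place as well is a reasonable safeguard, since this kind of config may also let through a few non-digit characters such as "-" or ".".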

OCR training

http://blog.csdn.net/why200981317/article/details/48265621

Related resources

Tesseract-OCR Chinese recognition and font library training example
https://www.cnblogs.com/mafeng/p/8124159.html
https://www.cnblogs.com/wzben/p/5930538.html

Configuring Tesseract in Android Studio
http://www.cnblogs.com/wzben/p/5932331.html

An Android ID card recognition app based on Tesseract
http://www.cnblogs.com/wzben/p/5945071.html

Recognizing text in images with tess4j
http://blog.csdn.net/u012386311/article/details/60135355

Recognizing ID card numbers with Java and H5 using tesseract-ocr (part 1)
http://blog.csdn.net/hiredme/article/details/50894814

OCR text recognition with tess4j: text in images and captchas
https://www.cnblogs.com/cmyxn/p/6993422.html

Commercial:
http://leadtools.gcpowertools.com.cn/products/ocr/

Captcha recognition with machine learning
http://blog.csdn.net/Alis_xt/article/details/65627303

tesseract-ocr:
Wiki: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files

FAQ: https://github.com/tesseract-ocr/tesseract/wiki/FAQ

Improving recognition quality:
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
That page notes: "Tesseract works best on images which have a DPI of at least 300 dpi, so it may be beneficial to resize images."
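
Since small captcha images are usually far below that resolution, one simple thing to try is enlarging the image before passing it to Tesseract. The following is a minimal sketch with plain Java 2D. The class name CaptchaUpscaler and the choice of bicubic interpolation are assumptions for illustration, not something taken from the Tesseract wiki. Note that this changes the pixel dimensions, not the DPI metadata stored in the file.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of the original post.
class CaptchaUpscaler {

    /** Returns a copy of src enlarged by an integer factor using bicubic interpolation. */
    static BufferedImage upscale(BufferedImage src, int factor) {
        if (factor < 1) {
            throw new IllegalArgumentException("factor must be >= 1");
        }
        int width = src.getWidth() * factor;
        int height = src.getHeight() * factor;
        BufferedImage dst = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            // Draw the source stretched over the whole destination image.
            g.drawImage(src, 0, 0, width, height, null);
        } finally {
            g.dispose();
        }
        return dst;
    }
}
```

In practice the source would be loaded with javax.imageio.ImageIO.read, and the enlarged image written back with ImageIO.write before running tesseract on it. Whether a factor of 2, 3, or more helps is something to check against the actual captchas.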

Changing image DPI with Java:
http://blog.csdn.net/shakalin2008/article/details/78799671
http://blog.csdn.net/chenweionline/article/details/2026855

OCR study notes and some tests of Tesseract
http://blog.csdn.net/viewcode/article/details/7784600
