Tesseract-OCR captcha recognition

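1. Download tesseract; the latest version at the time of writing is tesseract-ocr-setup-3.05.01.exe.
2. Install it. To recognize Chinese, tick the Chinese language data during installation.
3. Configure the environment variables.
Add the installation directory to PATH, and set the TESSDATA_PREFIX environment variable to the tessdata directory.
4. Restart the computer. (IDEA only picked up the new environment variables after a restart.)
5. Test from the command line.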

Recognition result:

The output contains some unrelated characters in addition to the captcha text.
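
The setup from steps 3 to 5 can also be checked from Java before running any OCR code. The following is only a minimal sketch and not part of the original setup: the class `TesseractEnvCheck` and the method `findOnPath` are hypothetical names introduced here, and the executable names `tesseract.exe` and `tesseract` are assumptions about how the installer names the binary.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper (not from the original post) that checks the environment setup. */
final class TesseractEnvCheck {

    /** Returns every directory listed in pathValue that contains one of the given executable names. */
    static List<File> findOnPath(String pathValue, String... exeNames) {
        List<File> hits = new ArrayList<>();
        if (pathValue == null) {
            return hits;
        }
        // PATH entries are separated by ';' on Windows and ':' elsewhere.
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String exe : exeNames) {
                if (new File(dir, exe).isFile()) {
                    hits.add(new File(dir));
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumed executable names; adjust if the installation uses a different one.
        List<File> dirs = findOnPath(System.getenv("PATH"), "tesseract.exe", "tesseract");
        System.out.println("tesseract found in: " + dirs);
        System.out.println("TESSDATA_PREFIX = " + System.getenv("TESSDATA_PREFIX"));
    }
}
```

If the list is empty or TESSDATA_PREFIX prints as null when run from the IDE, the IDE process has probably not picked up the new variables yet.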

6. Test code (Java):

import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

/** Captcha recognition
 * @author j.tommy
 * @version 1.0
 * @date 2018/1/11
 */
public class LJValideCode {
    private final static String LANG_OPTION = "-l";

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 10; i++) {
            String code = recognizeText(new File("D:/Work/helloworld/resources/validate/download/" + i + ".jpg"));
            System.out.println(code);
        }
    }

    /**
     * @param imageFile
     *            the image file to recognize
     * @return the digits recognized in the image
     */
    public static String recognizeText(File imageFile) throws Exception
    {
        // Output base name, placed next to the image; tesseract appends ".txt" to it.
        File outputFile = new File(imageFile.getParentFile(), "output");
        File resultFile = new File(outputFile.getAbsolutePath() + ".txt");
        List<String> cmd = new ArrayList<String>();
        cmd.add("tesseract");
        cmd.add(imageFile.getName());
        cmd.add(outputFile.getName());
        cmd.add(LANG_OPTION);
//      cmd.add("chi_sim");
        cmd.add("eng");
        // Print the command for debugging, e.g. "tesseract 1.jpg output -l eng".
        System.out.println(String.join(" ", cmd));
        ProcessBuilder pb = new ProcessBuilder(cmd);
        // Run in the image's directory so the relative file names above resolve.
        pb.directory(imageFile.getParentFile());
        // Merge stderr into stdout so a single reader sees all messages.
        pb.redirectErrorStream(true);
        Process process = pb.start();
        // Drain the process output before waitFor(); otherwise a full pipe
        // buffer could block tesseract and waitFor() would never return.
        StringBuilder processOutput = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                processOutput.append(line).append('\n');
            }
        }
        // By convention, exit value 0 indicates normal termination.
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new RuntimeException(processOutput.toString());
        }
        String text;
        try {
            text = new String(Files.readAllBytes(resultFile.toPath()), StandardCharsets.UTF_8);
        } finally {
            // Remove the temporary result file whether or not reading succeeded.
            resultFile.delete();
        }
        // The captchas here are numeric, and the raw output contains some
        // unrelated characters, so keep only the digits. Character.isDigit
        // stands in for StringUtils.isNumeric so that only the JDK is needed.
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isDigit(c)) {
                result.append(c);
            }
        }
        return result.toString();
    }
}
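
Since the captchas are numeric, another option is to ask tesseract itself to restrict its output to digits, rather than filtering only afterwards. The sketch below is an assumption layered on top of the test code, not something the original code does: it builds a command line that adds the `-c tessedit_char_whitelist=...` option. The class `DigitCommandSketch` and the method `buildCommand` are hypothetical names, and whether the whitelist is honored depends on the installed tesseract version and engine, so keeping the digit filter as a fallback is reasonable.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not from the original post) that builds a tesseract command line. */
final class DigitCommandSketch {

    /**
     * Builds "tesseract <image> <outputBase> -l <lang>" and, when a whitelist
     * is given, appends "-c tessedit_char_whitelist=<whitelist>".
     */
    static List<String> buildCommand(String imageName, String outputBase, String lang, String whitelist) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(imageName);
        cmd.add(outputBase);
        cmd.add("-l");
        cmd.add(lang);
        if (whitelist != null && !whitelist.isEmpty()) {
            // Assumption: the installed tesseract accepts this config variable.
            cmd.add("-c");
            cmd.add("tessedit_char_whitelist=" + whitelist);
        }
        return cmd;
    }

    public static void main(String[] args) {
        // Prints the command that would be passed to ProcessBuilder.
        System.out.println(String.join(" ", buildCommand("1.jpg", "output", "eng", "0123456789")));
    }
}
```

The returned list can be passed to `new ProcessBuilder(cmd)` in the same way as the command list in the test code.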

OCR training

http://blog.csdn.net/why200981317/article/details/48265621

Related resources

Tesseract-OCR Chinese recognition and font library training example
https://www.cnblogs.com/mafeng/p/8124159.html
https://www.cnblogs.com/wzben/p/5930538.html

Configuring Tesseract in Android Studio
http://www.cnblogs.com/wzben/p/5932331.html

A Tesseract-based ID card recognition app for Android
http://www.cnblogs.com/wzben/p/5945071.html

Recognizing text in images with tess4j
http://blog.csdn.net/u012386311/article/details/60135355

Recognizing ID card numbers in Java and H5 with tesseract-ocr (part 1)
http://blog.csdn.net/hiredme/article/details/50894814

OCR image and text recognition with tess4j: text in images and captchas
https://www.cnblogs.com/cmyxn/p/6993422.html

Commercial:
http://leadtools.gcpowertools.com.cn/products/ocr/

Captcha recognition with machine learning
http://blog.csdn.net/Alis_xt/article/details/65627303

tesseract-ocr:
wiki:https://github.com/tesseract-ocr/tesseract/wiki/Data-Files

FAQ: https://github.com/tesseract-ocr/tesseract/wiki/FAQ

Improving recognition quality:
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
That page notes: Tesseract works best on images which have a DPI of at least 300 dpi, so it may be beneficial to resize images.
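
One way to act on the resizing advice is to enlarge small captcha images before handing them to tesseract. The following is a minimal sketch using only the JDK's `java.awt` classes. The class `UpscaleSketch`, the method `upscale`, and the scale factor are illustrative assumptions, and the right factor for a given captcha has to be found by experiment. Note that this changes the pixel dimensions only; it does not write DPI metadata into the file.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

/** Hypothetical helper (not from the original post) that enlarges an image before OCR. */
final class UpscaleSketch {

    /** Returns a copy of src scaled by the given factor, using bicubic interpolation. */
    static BufferedImage upscale(BufferedImage src, double factor) {
        int w = Math.max(1, (int) Math.round(src.getWidth() * factor));
        int h = Math.max(1, (int) Math.round(src.getHeight() * factor));
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            // Bicubic interpolation keeps character edges smoother than the default.
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return dst;
    }
}
```

A typical use would be to load the captcha with `ImageIO.read(file)`, call `upscale(image, 3.0)`, save the result with `ImageIO.write(scaled, "png", outFile)`, and then run tesseract on the enlarged file.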

Changing image DPI in Java:
http://blog.csdn.net/shakalin2008/article/details/78799671
http://blog.csdn.net/chenweionline/article/details/2026855

OCR study notes and some tesseract tests
http://blog.csdn.net/viewcode/article/details/7784600