33. OCR-ing Japanese

■PRECISION OF OCR API

One of the most common kinds of projects my company helps with is Natural Language Processing (NLP). This is because digital transformation is becoming the highest priority for most companies, and NLP is almost inevitable when it comes to going paperless.

The most recent one was to convert handwritten resumes to text data so that the client could train an AI using NLP to find the right applicant. We were given 10 samples to examine the precision of the OCR (Optical Character Recognition) APIs available right now. Here are the results.

API \ Precision                    C (~30%)    B (31~80%)    A (81%~)
Vision API (Google)                2/10        6/10          2/10
Document AI using pdf (Google)     2/10        6/10          2/10
Document AI using jpeg (Google)    2/10        6/10          2/10
CLOVA (LINE)                       2/10        6/10          2/10
※Vision API does the preprocessing for you, so the preprocessing YOU do before passing the image to Vision API doesn't matter much. Don't waste your time on that like I did.
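
To make the ranks in the table concrete, here is a minimal sketch of how one sample could be scored and bucketed. The post does not say how precision was actually measured, so the character-matching score below is an assumption, and the function names `char_accuracy` and `rank` are made up for illustration.

```python
from difflib import SequenceMatcher


def char_accuracy(expected: str, recognized: str) -> float:
    """Percentage of the expected characters that also appear, in order, in the OCR output.

    ASSUMPTION: this is one possible scoring rule, not the one used in the test above.
    """
    if not expected:
        return 0.0
    # autojunk=False keeps frequently repeated characters from being ignored
    matcher = SequenceMatcher(None, expected, recognized, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(expected)


def rank(accuracy: float) -> str:
    """Map a percentage to the ranks used in the table: C (~30%), B (31~80%), A (81%~)."""
    if accuracy >= 81:
        return "A"
    if accuracy >= 31:
        return "B"
    return "C"


if __name__ == "__main__":
    # "外" read as "タ" + "ト": two of the three expected characters still match
    score = char_accuracy("外国語", "タト国語")
    print(rank(score))
```

Scoring each of the 10 samples this way and counting the ranks would give one row of the table per API.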

Sample output at RANK C:

WHLGESKTÉLBANITV.XaykamuMARITALLOnouaplaadidCabanyesuda15.8%.HladěZETEuanveritas7.studentoftheirdeU27128kethartzenduteSubmenu{=199prikazanMigu.IL#betterdestobestBETTETMicamentlatioEle&emailit:PelantikanAUGÓMEDELEURereinhverenesteinverledeellerFavoriteis31:28BuceoAudièdiale=-2citroenDŽARITZT
■RESULTS

As you can CLEARLY see, they were all the same. (Some of you may be thinking, "What is the meaning of this table?" I totally agree.)

Before starting the test, we assumed that CLOVA by LINE would outperform the others because it specializes in Japanese. It turned out that it is still hard for any API to understand the context when the resume is free-format.

We found that the most critical factor in getting high-quality OCR was the spacing around each character. Unlike English, Japanese doesn't put spaces between words. Also, a KANJI can sometimes be separated into two parts, so if the applicant has wide-spaced-hand-writing (I made that word up), the computer will probably recognize that KANJI as two characters. (For example, the KANJI "外" can also look like "タ" and "ト".) The opposite can happen as well. If the applicant tends to jam every possible word into every open space in the resume, 2 words might get detected as 1, and the result ends up as an ALIEN-like language like RANK C. (Yeah, that used to be Japanese.)
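
As a rough illustration of the split-KANJI problem, the sketch below re-joins two neighboring characters when they form a known split pair and sit close together. Everything here is hypothetical: the `SPLIT_KANJI` table (only the "外" entry comes from this post), the `(character, left, right)` box format, and the `max_gap` threshold are assumptions, not something any of the APIs above provide in this shape.

```python
from typing import List, Tuple

# (character, left x, right x) in pixels; a simplified stand-in for OCR bounding boxes
Box = Tuple[str, int, int]

# Hypothetical table of pairs that are really one KANJI written with wide spacing.
# A real table would have to be built from observed OCR errors.
SPLIT_KANJI = {
    ("タ", "ト"): "外",
}


def merge_split_kanji(boxes: List[Box], max_gap: int) -> List[Box]:
    """Merge adjacent boxes that match a known split pair and are at most max_gap apart."""
    merged: List[Box] = []
    i = 0
    while i < len(boxes):
        if i + 1 < len(boxes):
            (c1, l1, r1), (c2, l2, r2) = boxes[i], boxes[i + 1]
            whole = SPLIT_KANJI.get((c1, c2))
            # Only merge when the two parts are close enough to be one character
            if whole is not None and l2 - r1 <= max_gap:
                merged.append((whole, l1, r2))
                i += 2
                continue
        merged.append(boxes[i])
        i += 1
    return merged


if __name__ == "__main__":
    line = [("タ", 0, 10), ("ト", 14, 22), ("国", 40, 60)]
    print("".join(ch for ch, _, _ in merge_split_kanji(line, max_gap=5)))
```

The gap check matters because "タ" followed by "ト" can also be legitimate katakana, so a lookup alone would over-correct. Choosing `max_gap` per writer (for example from the typical gap on the same line) is left open here.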

■IS IT PRACTICAL?

Looking at the options available right now, I think OCR for Japanese is worth a try ONLY IF the resume uses a format that makes the spacing between words mandatory, like the format below.

This project is currently on hold because of the low precision, but if there are any updates I'll share them here.