OCR using Tesseract openDesktop.org



OCR using Tesseract

0.3

KDE Service Menu

Score 56%

dgvirtual

Downloads: 753

Submitted: Mar 9 2010
Updated: Oct 3 2010

Description:

This Dolphin/Konqueror service menu lets you OCR images conveniently from your file manager window.

This is a very simple program. It OCRs a document and writes the text to a file that has the same name as the OCRed image file but with a txt extension.
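The naming rule can be sketched in shell roughly as follows. The function name ocr_output_name is made up for illustration and is not taken from the service menu script; this only shows one way to get "same name, txt extension", including for uppercase extensions.

```shell
# Hypothetical helper (not taken from the service menu script): derive the
# name of the text file from the name of the image being OCRed.
ocr_output_name() {
    # Keep the directory part, if any, so dots in directory names are left alone.
    case "$1" in
        */*) dir="${1%/*}/" ;;
        *)   dir="" ;;
    esac
    base="${1##*/}"
    # Drop the last extension, whatever its case, and append .txt.
    printf '%s%s.txt\n' "$dir" "${base%.*}"
}

ocr_output_name "scan.tif"              # prints scan.txt
ocr_output_name "My Scans/page 1.PNG"   # prints My Scans/page 1.txt
```

Note that tesseract appends .txt to the output base name by itself, so a script calling tesseract would pass the name without the extension.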

For the menu to appear and provide basic functionality (OCR of tif files), tesseract-ocr must be installed and in your path, along with the language packages you need. (The menu has been tested against tesseract-ocr v. 2.03 and 2.04.)

To OCR png and jpeg images you need imagemagick installed. To OCR pdf files you need ghostscript installed.
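A quick way to see which of these programs are missing from your path is sketched below. The function name missing_tools is made up for illustration; the sketch assumes the usual binary names: tesseract, convert (from imagemagick) and gs (from ghostscript).

```shell
# Print the name of every tool given as an argument that is not found in PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# tesseract is always needed; convert for png/jpeg; gs for pdf.
missing=$(missing_tools tesseract convert gs)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names end up on one line.
    echo "Missing from PATH:" $missing
else
    echo "All OCR dependencies found."
fi
```

This only checks the programs themselves; it says nothing about which tesseract language packages are installed.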

TRANSLATION: Find the translatable strings at: http://pastebin.com/QV7vV7jn, do the translation and forward those to me via a personal message or email.

INSTALLATION: Install through the Dolphin settings menu. If it does not work, LET ME KNOW and see the readme.txt file for alternative installation methods.

KNOWN PROBLEMS: none at present.

TROUBLESHOOTING: If you experience problems, and you get no output:

1. Ensure that you have tesseract installed, as well as the language packs you use. Also check that imagemagick and ghostscript are installed if you want it to work with files other than plain tif. In Debian/Ubuntu, the commands "dpkg -l | grep tesseract", "dpkg -l | grep imagemagick" and "dpkg -l | grep ghostscript" will tell you what you have installed.

2. Check whether the problem is in the tesseract engine itself. To do this, download this image (http://ftp.akl.lt/users/dgvirtual/ocr_using_tesseract/testimage.tif) and run tesseract against it in a console, first using English ("tesseract testimage.tif testoutput") and then the language you have problems with (say, Spanish: "tesseract testimage.tif testoutput -l spa"). If you get the file testoutput.txt with text in both cases (the text does not have to be perfect), then the problem is not in tesseract or your tesseract installation. Otherwise, consult the tesseract-ocr website and forums for a solution.

3. To troubleshoot problems related to the service menu, test it against the image you downloaded in the previous step. If it works, test it against the other images in http://ftp.akl.lt/users/dgvirtual/ocr_using_tesseract/ and, if it fails on those, there must be a problem with the imagemagick (png, gif, jpg) or ghostscript (pdf) installation.

4. To figure out what problems there could be in the script, and to get help from me, please run the shell script ocr_using_tesseract.sh from the service menu archive against the problematic image, like this: "ocr_using_tesseract.sh en image.png" (note the "en" and not "eng" here), and send the image, the output txt file and the output produced by the command to me (my email is in the readme.txt file).
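The engine check from the troubleshooting steps can be scripted roughly like this. It is only a sketch: the function name check_engine and the TESSERACT variable are made up for illustration and are not part of the service menu, and it assumes testimage.tif has already been downloaded into the current directory. Unlike the manual commands, it writes a separate output file per language.

```shell
# Command used to run the OCR engine; overridable, e.g. for testing.
TESSERACT="${TESSERACT:-tesseract}"

# check_engine IMAGE LANG OUTBASE
# Succeeds only if tesseract runs and leaves a non-empty OUTBASE.txt behind.
check_engine() {
    rm -f "$3.txt"
    "$TESSERACT" "$1" "$3" -l "$2" >/dev/null 2>&1 || return 1
    [ -s "$3.txt" ]
}

# Try English first, then the language that gives trouble (Spanish here).
if [ -f testimage.tif ]; then
    for code in eng spa; do
        if check_engine testimage.tif "testoutput_$code" "$code"; then
            echo "$code: OK"
        else
            echo "$code: FAILED (check tesseract and its $code language pack)"
        fi
    done
fi
```

If both languages report OK, the problem is more likely in the service menu or in the image conversion than in tesseract itself.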

Changelog:

v. 0.3 – 2010-09-30
- The service menu is now fully localizable
- Limitation for uppercase extensions removed

v. 0.2.1 – 2010-09-27
- fixed a bug that prevented the service menu from being displayed for PDF files
- fixed the accidental overwriting of identically named files
- got the readme.txt file back
- added German translation to the service menu thanks to Rettich

v. 0.2 – 2010-09-24
- attempted to make it knewstuff3 compatible; it should be installable through the Dolphin services.
- simplified operation: a dialog asks you to choose the language, and there is now only one service menu entry.
- fixed progress bar error.
- it seems that the problem with directory names with spaces is gone.

v. 0.1 – 2010-03-10
- Initial creation of the service menu.

License: GPL


Nice!

by molecule-eye on: Mar 10 2010

Is it possible to add the text as a "layer" to the pdf file it was extracted from, just like Acrobat's OCR feature does? I imagine not, but it's worth asking.

Re: Nice!

by dgvirtual on: Mar 10 2010

In general it is possible on Linux: I hear you can use a GTK+ program, gscan2pdf (http://gscan2pdf.sourceforge.net). It has Tesseract-OCR as a possible dependency.

However, I am not sure how it is done on command line, therefore – not sure this function can be integrated into these service menus.

Donatas

Re: Re: Nice!

by dgvirtual on: Oct 29 2010

Here is how it is done on the command line: use the hocr2pdf utility from the exactimage package. However, you have to use Cuneiform OCR to get the text file. You can also google for two utilities that automate this process (again, using Cuneiform): pdfsandwich and pdfocr.
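A rough sketch of that two-step process is below. The function name make_searchable_pdf is made up, and the flags are the ones I know from the usage text of cuneiform (-l, -f hocr, -o) and hocr2pdf (-i, -o, hOCR on standard input), so please verify them against the versions you have installed.

```shell
# make_searchable_pdf IMAGE LANG OUTPUT.pdf
# Assumes cuneiform and hocr2pdf (exactimage package) are installed.
make_searchable_pdf() {
    hocr="${1%.*}.hocr"
    # Step 1: let Cuneiform recognise the text and write it as hOCR.
    cuneiform -l "$2" -f hocr -o "$hocr" "$1" || return 1
    # Step 2: combine the image and the hOCR text layer into a PDF.
    hocr2pdf -i "$1" -o "$3" < "$hocr"
}

# Example (not run here):
# make_searchable_pdf scan.bmp eng scan.pdf
```

The intermediate .hocr file is left next to the image so you can inspect what was recognised.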

Unfortunately, not all is well with international characters: the embedded Lithuanian text is effectively stripped of all its national characters... But English pretty much works!

it doesn't work

by stalin2000 on: Sep 15 2010

It doesn't work with Debian squeeze/Kanotix... Tesseract is installed... the file name doesn't have spaces... The tiff file is produced, but there is no txt file in the end.

Re: it doesn't work

by dgvirtual on: Sep 16 2010

OK, let's troubleshoot...

First, check whether imagemagick is installed; it provides the "convert" tool that the script uses. In Debian you can check with "dpkg -l | grep imagemagick". If that prints the name and version of imagemagick, then this is more than a missing-dependency problem...

Could you then send me an image that does not work? And also:

1. The list of installed tesseract packages (in Debian it can be obtained by running "dpkg -l | grep tesseract")

2. The output of running ocr_using_tesseract.sh from the console on the image you send me

The command should be run this way:

"ocr_using_tesseract.sh eng your_image.png"

my email is in every file of the service menu...

Donatas G.

sorry I don't see the email

by stalin2000 on: Sep 16 2010

Output is:

123@kanotixkde4:/media/sdb1/Dokumente/Papiere$ ocr_using_tesseract.sh ger diplomanmeldung.jpg

Unable to load unicharset file /usr/share/tesseract-ocr/tessdata/ger.unicharset

File is:

http://www.christopherstark.de/extern/diplomanmeldung.jpg

Ok

by stalin2000 on: Sep 16 2010

with

123@kanotixkde4:/media/sdb1/Dokumente/Papiere$ ocr_using_tesseract.sh deu diplomanmeldung.jpg

it worked in the console, but it doesn't work if I do it via right-click menu

Re: Ok

by dgvirtual on: Sep 16 2010

Right, there is no language "ger" in tesseract (the code for German is "deu"), but tesseract is still so immature that it does not tell you so.
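Since tesseract does not say so itself, a script could check for the language data before calling it. The sketch below is an assumption based on the error message quoted in this thread: it supposes that tesseract 2.x keeps a file named after the language code with a .unicharset suffix in the tessdata directory. The names have_lang and TESSDATA_DIR are made up for illustration.

```shell
# Directory with the language data; the default is the path from the error
# message in this thread and may differ on other systems (assumption).
TESSDATA_DIR="${TESSDATA_DIR:-/usr/share/tesseract-ocr/tessdata}"

# have_lang CODE: succeeds if data for that language code seems installed.
have_lang() {
    [ -f "$TESSDATA_DIR/$1.unicharset" ]
}

for code in ger deu; do
    if have_lang "$code"; then
        echo "$code: language data found"
    else
        echo "$code: no language data in $TESSDATA_DIR"
    fi
done
```

With this check in place, a wrong code such as "ger" would be reported before tesseract is even started.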

And for me it works both in the console and through the menu... So I am not sure what the problem is. Maybe you made a mistake when modifying the *.desktop file. Try it using the English language (the one, I assume, you did not modify); if it works, then there is a mistake in the German part :) Could you please review it and, if you do not find any fault, post it here?

By the way, the output you get on that file is not the best since the image is not sharp enough for tesseract. I opened it with GIMP and applied a threshold; I immediately got a sharper picture and, consequently, better results.

You can also modify the image using this command:
convert diplomanmeldung.jpg -threshold 45% -monochrome diplomanmeldung45.jpg

The threshold percentages will vary depending on the image...
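Because the right percentage has to be found by trial, it may help to produce several variants at once and compare them. This is only a sketch built around the convert command above; the function names threshold_name and sweep_thresholds are made up for illustration.

```shell
# threshold_name IMAGE PERCENT: name of the thresholded copy,
# e.g. diplomanmeldung.jpg and 45 give diplomanmeldung45.jpg.
threshold_name() {
    printf '%s%s.%s\n' "${1%.*}" "$2" "${1##*.}"
}

# sweep_thresholds IMAGE PERCENT...: write one thresholded copy per value.
sweep_thresholds() {
    img=$1
    shift
    for pct in "$@"; do
        convert "$img" -threshold "${pct}%" -monochrome "$(threshold_name "$img" "$pct")"
    done
}

# Example (needs imagemagick and the image from this thread):
# sweep_thresholds diplomanmeldung.jpg 35 45 55
```

You can then run tesseract on each copy and keep the one that gives the best text.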

Here are results with your original file: http://pastebin.ca/1942257

Here is the modified image:
http://imagebin.ca/view/srt5cD.html

And the results with the modified image:
http://pastebin.ca/1942262

Ver 0.3 doesn't work with spa

by conar on: Oct 4 2010

I have found the following issues:
1.- On Debian amd64 (Squeeze) it cannot be installed via Dolphin.
2.- After installing manually, only English works. Adding the spa language produces the following errors:

Tesseract Open Source OCR Engine
tesseract: unicharset.cpp:76: const UNICHAR_ID UNICHARSET::unichar_to_id(const char*, int) const: Assertion `ids.contains(unichar_repr, length)' failed.
/usr/bin/ocr_using_tesseract.sh: line 90: 14225 Aborted tesseract "$INPUTFILE.$time.tif" "$OUTPUTFILE" -l "$LANGUAGE"

forgot to mention

by conar on: Oct 4 2010

Forgot to mention:

dpkg -l | grep tess
ii tesseract-ocr 2.04-2+b1 Command line OCR tool
ii tesseract-ocr-eng 2.00-2 tesseract-ocr language files for English text
ii tesseract-ocr-spa 2.00-2 tesseract-ocr language files for Spanish text

Re: forgot to mention

by dgvirtual on: Oct 4 2010

Thanks for reporting this. I really do not know whether I got the installation part right: I cannot install ANY service menus on my Kubuntu 10.04 with KDE 4.5.1, so I cannot troubleshoot that. All I know is that it installs once you unpack it and run ./install-it.sh, and uninstalls when you run ./uninstall.sh, so it should more or less work.

Re the other problem: could you try simply running "tesseract yourfile.tif youroutput -l spa" (if it is not a tif, convert it to tif somehow) and see if it works? Maybe the problem is in tesseract. If it works on a simple tif with Spanish, does my service menu work on a simple tif? If not, could you please email me the modifications you made in the script and your test image? My email is in readme.txt in the package.
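One way to do the "convert it to tif somehow" part and then run the test is sketched here. The function name ocr_via_tif and the .ocr.tif temporary name are made up for illustration; the "-compress none" option is added on the assumption that tesseract 2.x copes best with uncompressed tif files.

```shell
# ocr_via_tif IMAGE LANG
# Converts IMAGE to tif first unless it already is one, then runs tesseract.
ocr_via_tif() {
    src=$1
    case "$src" in
        *.[Tt][Ii][Ff]|*.[Tt][Ii][Ff][Ff])
            tif=$src ;;
        *)
            # Temporary tif next to the source; it is not cleaned up here.
            tif="${src%.*}.ocr.tif"
            convert "$src" -compress none "$tif" || return 1 ;;
    esac
    # tesseract adds .txt to the output base name itself.
    tesseract "$tif" "${src%.*}" -l "$2"
}

# Example (not run here):
# ocr_via_tif yourfile.png spa
```

If this works from the console but the service menu does not, the problem is more likely in the menu's .desktop file than in tesseract or imagemagick.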

By the way, what is your locale? Does it work if you change your locale to english?

Re: Re: forgot to mention

by conar on: Oct 14 2010

Problem solved with Tesseract 3.0.

