TextRipper (aka T-Rip) openDesktop.org



TextRipper (aka T-Rip)

2.0

CLI other text tool

Score 56%

kickass

Downloads: 188	Submitted: Sep 18 2010 Updated: Jan 14 2011
Description: An OCR (Optical Character Recognition) GUI application or CLI script.

- Supports the Tesseract engine by default!
- Optionally supports the Ocrad engine for multi-column text.
- These recognition engines have a very high character recognition success rate compared to other OCR tools, including proprietary software.
- New: multi-page and multiple file selection support!
- Enhanced XSANE output and TIFF compatibility.
- New: now handles nearly any format out there!
- This script will convert any image of text into editable and indexable text (for a full list of compatible file formats, see the first filter below).

Note: The cleaner, higher-contrast, and higher-resolution your image or scan is, the better the results.

Dependencies: libtiff-dev (or -devel; install it FIRST), tesseract-2.04 (latest stable version), your chosen language data for Tesseract (2.00 and up) [1], ImageMagick, ghostscript, Zenity, and OpenOffice or another text editor [2].
This version of Tesseract can be downloaded from here: http://code.google.com/p/tesseract-ocr/downloads/list
Warning: This script will not work with the latest beta version (tesseract 3.00 pre-release) due to database structure modifications.

Optional dependencies: ocrad, an alternate recognition engine. If initial results are unsatisfactory, maybe this engine will do better. Most importantly, it supports basic page format recognition [3].
The latest version of ocrad can be downloaded from the GNU mirror list here: http://www.gnu.org/software/ocrad/

Also: Make sure to select Unicode UTF-8 in OpenOffice's pop-up window (or in the text editor of your choice).

[1] Install Tesseract after libtiff-dev. Then extract all the language databases you need into the "wherever_you_installed/tesseract-2.04/tessdata" directory. This is done automatically if you extract the language databases from WITHIN the "tesseract-2.04" directory (and allow overwriting). This script allows the use of multiple language databases. The default is English and French. For adding others, see comments below. You NEED at least one language database or Tesseract will not work.

[2] Simply change the occurrence of "soffice -writer" below to a text editor of your choice, e.g. gedit, KWrite, etc. Some systems call OpenOffice Writer differently. If unsure, check the properties tab of your Writer launcher. For example, on customized versions of OOo (such as the ones provided by Linux Mandrake or Gentoo), you start Writer with: oowriter

[3] If you also install ocrad, TextRipper will recognize this and prompt you to choose between the two, offering better recognition or page format support.

Troubleshooting:
If the script ends by saying your text editor can't open "OCR output-editable text.txt", or if, when run from the CLI, it reports "Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset", run (as superuser):
    echo /usr/local/share /usr/share | xargs -n 1 cp -R wherever_you_installed/tesseract-2.04/tessdata
Explanation: Tesseract may look for the tessdata directory under the /share directory of your filesystem, so you need to make your language databases available from there.

License: GPL
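The troubleshooting one-liner above copies the tessdata directory into each share directory in turn. A slightly more defensive sketch of the same idea is shown here; the function names install_tessdata and has_language are made up for illustration and are not part of TextRipper, and the sketch assumes only a POSIX shell with cp and mkdir.

```shell
#!/bin/sh
# Hypothetical helpers (not part of TextRipper) illustrating the
# troubleshooting step: make language databases available under share dirs.

# Usage: install_tessdata SRC_TESSDATA_DIR DEST_SHARE_DIR...
# Copies the contents of SRC_TESSDATA_DIR into DEST_SHARE_DIR/tessdata.
install_tessdata() {
    src=$1
    shift
    if [ ! -d "$src" ]; then
        echo "no such tessdata directory: $src" >&2
        return 1
    fi
    for dest in "$@"; do
        mkdir -p "$dest/tessdata" || return 1
        # "src/." copies the directory contents, merging with existing files
        cp -R "$src"/. "$dest/tessdata"/ || return 1
    done
}

# Usage: has_language SHARE_DIR LANG
# Succeeds if LANG.unicharset exists under SHARE_DIR/tessdata,
# which is the file named in the "Unable to load unicharset" error.
has_language() {
    [ -f "$1/tessdata/$2.unicharset" ]
}
```

Run as superuser, a call such as `install_tessdata wherever_you_installed/tesseract-2.04/tessdata /usr/local/share /usr/share` would mirror the one-liner, and `has_language /usr/local/share eng` could then confirm that the English database is in place.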


Bad recognition

by YAFU on: Sep 19 2010

Score 50%


I tried it on an image that contains only text, with:
./script.sh image_name.png

and selected UTF-8 in OpenOffice, but I get bad recognition: unintelligible text and symbols. I am using the latest version of ocrad. Am I doing something wrong?
Thanks.


Re: Bad recognition

by kickass on: Sep 19 2010

Score 50%


If your text editor ends up showing something, anything, you're doing it right. That means the problem is the source file. Try using GIMP to save it in .pnm format. If that fails, here are a few pointers:
1) Latin characters, right? Not Korean or something like that.
2) Resolution? Contrast? The higher the better. Grayscale is better than color.
3) Clean original, not smudged or spotted.
4) One column and one page. If you have, say, a 2-column format, use GIMP to crop it into 2 separate pages and save each as .pnm.
5) I'm converting non-pnm images with ImageMagick. Do you have that installed?
If you want to, you may send me your image and I'll see what I can do with it.
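For anyone who prefers the command line to GIMP, pointers 2 and 4 can be sketched with ImageMagick, which is already a dependency. This is only an illustration, not what the script itself does: the function names to_gray_pnm and split_columns are made up, it assumes ImageMagick's convert command is on the PATH, and the column split is a plain cut down the middle, which only suits evenly laid-out pages.

```shell
#!/bin/sh
# Hypothetical helpers (not part of TextRipper) for preparing an image
# before OCR. Assumes ImageMagick's "convert" command is installed.

# Usage: to_gray_pnm INPUT OUTPUT.pnm
# Converts any format ImageMagick can read to a grayscale PNM file.
to_gray_pnm() {
    convert "$1" -colorspace Gray "$2"
}

# Usage: split_columns INPUT OUTPUT_PREFIX
# Cuts the page into a left and a right half and writes them as
# OUTPUT_PREFIX_0.pnm and OUTPUT_PREFIX_1.pnm.
split_columns() {
    # "2x1@" asks for a grid of 2 columns by 1 row;
    # +repage resets each tile's canvas offset after cropping
    convert "$1" -crop 2x1@ +repage "${2}_%d.pnm"
}
```

For example, `to_gray_pnm scan.jpg scan.pnm` followed by `./script.sh scan.pnm`, or `split_columns scan.jpg col` for a two-column page, feeding each half to the script separately.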


Re: Re: Bad recognition

by YAFU on: Sep 19 2010

Score 50%


Hello.
I have installed ImageMagick and zenity.
I have used this screen capture:
http://img819.imageshack.us/img819/6227/testvo.jpg

and with both OpenOffice and Kate I get similar results. I use Latin characters; my locale is es_AR.
I have also tested the same image as PNG and PNM.


Re: Bad recognition

by kickass on: Sep 20 2010

Score 50%


Hello, YAFU.
Okay, the source file has a very low resolution. You will notice this when zooming in (either Ctrl+scroll or in GIMP).
I did, however, improve the app.
Ver. 1.1 will give you a workable output even from this low-res image. Sorry about wasting your time with installing a new recognition engine, but the results are even better. The thing is, this app is being very successful and I'm being swarmed with emails, many of which contain useful comments.
d.


Re: Re: Bad recognition

by YAFU on: Sep 21 2010

Score 50%


Much better with this engine. No waste of time for me; I like to try new programs or scripts. Thanks for the time you have taken in developing the script.
I also believe that in Linux we are far from the recognition capabilities of other software such as Acrobat. If you have some knowledge and are interested, you can collaborate with the project "gscan2pdf".
Regards.


Re: Ver. 2.0

by kickass on: Sep 21 2010

Score 50%


Just to let you know, I'm working on doing the nearly impossible: extracting text from multi-page PDFs using my same fast and easy do-it-all script.
Thanks for your feedback. I'll check out the project.
d.


Re: Re: Re: Bad recognition

by kickass on: Dec 10 2010

Score 50%


Just to let you know that I uploaded TextRipper, the new improved version of Text Recognition. Hope you like it.
d.
