
Summary: extracting text from PDF files (GhostScript query)

Dear Framers:

Wow, what a crew! I got dozens of helpful responses and suggestions to
my query about ways and means of pulling the content out of a PDF file
(so I can reconstruct the FrameMaker files). 

Final result: I used GhostView 3.6/Windows. Command: Edit | Text Extract...

It's a lot better than the mainframe ASCII dumps I used to clean up
in the mid-1980s (no margin spaces to delete, for instance).

Dov checked in first with pointers to Acrobat plug-ins. I experimented
with Magellan, Drake, and Jade (all from BCL Computers) and Iceni's Gemini,
and found that Datawatch's Redwing didn't offer a demo download. Several of
these would be interesting *if* I had to do this chore frequently.
(Thank goodness I don't!)

Al provided a method for doing it straight from PDF:
>If you Distill the postscript file you can capture the text from the
>generated PDF using Acrobat. When you open the PDF, set the View to
>Continuous. You can then do an Edit->Select All. Three gotchas: 1) this
>captures text ONLY; if there are illustrations, sorry; 2) tables will come
>out reading across in left-to-right rows, so they will have to be rebuilt;
>3) check all PDF page breaks carefully, as I have experienced missing text
>(maybe a line or two, nothing real serious) where the PDF breaks
>across pages.

And I was planning to try that when I finally unearthed the correct
export command in GhostView. Many thanks to Colin and Scott, who both
went to the trouble of testing solutions and sending me roadmaps!

I opened timesfax.pdf in gsview, and used the "Text Extract..." option 
in the edit menu. This allows you to select a page or pages to extract,
then prompts for an output file. <snip>
There is also a command line version, gswin32c: I haven't tried that.
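
For anyone who wants to script the command-line route, here is a minimal
sketch in Python. It only builds and runs a Ghostscript command line. It
assumes a Ghostscript release that includes the "txtwrite" text-output
device, which may well be newer than the version that shipped alongside
GSview 3.6. The function names and the output filename timesfax.txt are
made up for illustration.

```python
import subprocess


def build_extract_command(pdf_path, out_path, first_page=None, last_page=None,
                          gs_exe="gswin32c"):
    """Return the argument list for a Ghostscript text-extraction run.

    Assumes the installed Ghostscript provides the txtwrite device.
    """
    cmd = [gs_exe, "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=txtwrite"]
    # Optionally restrict extraction to a page range, as the GSview dialog does.
    if first_page is not None:
        cmd.append(f"-dFirstPage={first_page}")
    if last_page is not None:
        cmd.append(f"-dLastPage={last_page}")
    cmd += [f"-sOutputFile={out_path}", pdf_path]
    return cmd


def extract_text(pdf_path, out_path, **kwargs):
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_extract_command(pdf_path, out_path, **kwargs), check=True)
```

Calling extract_text("timesfax.pdf", "timesfax.txt", first_page=1,
last_page=2) would then write the text of the first two pages to
timesfax.txt, provided gswin32c is on the PATH.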

>As an experiment, I verified that I could do it.
>I opened a PDF in Acrobat and exported it to PS, not including embedded
>I opened MacGS and used the ps2ascii Device, and voilà!
>A perfect ASCII dump.
>You can basically do the same in regular GhostScript with GhostView.
>I found that the headers and footers are placed at the top of each 
>page. That is the only way you can tell what went where. Of course, 
>graphics are lost.
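
Since the running headers and footers land in the dump on every page, the
clean-up could be partly automated. The sketch below is one possible
approach, not something any of the respondents described: it drops any line
that recurs on most pages. It assumes the dump has already been split into
one string per page (for example on form-feed characters, if the extractor
emits them), and the function name and the 0.8 threshold are arbitrary
choices for illustration.

```python
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.8):
    """Remove lines that appear on at least min_fraction of the pages."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-blank line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    # Require at least two occurrences so a one-page dump is left alone.
    threshold = max(2, min_fraction * len(pages))
    repeated = {line for line, n in counts.items() if n >= threshold}
    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines() if line.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned
```

Footers that carry a page number differ from page to page, so this will
not catch them; those would still need a manual pass or a pattern match.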

Thanks again, to all of you,

Deborah Snavely
Document Architect, Technical Publications, 
Aurigin Systems, Inc. http://www.aurigin.com/ 
