To: <framers@xxxxxxxxx>
Subject: Summary: extracting text from PDF files (GhostScript query)
From: "Deborah Snavely" <dsnavely@xxxxxxxxxxx>
Date: Fri, 19 Jan 2001 16:30:12 -0800
Sender: owner-framers@xxxxxxxxx

Dear Framers:

Wow, what a crew! I got dozens of helpful responses and suggestions to my query about ways and means of pulling the content out of a PDF file (so I can reconstruct the FrameMaker files).

Final result: I used GhostView 3.6/Windows. Command: Edit | Text Extract... It's a lot better than the ASCII dumps from the mainframe that I used to clean up in the mid-1980s (no margin spaces to delete, for instance).

Dov checked in first with pointers to Acrobat plug-ins. I experimented with Magellan, Drake, and Jade (all from BCL Computers) and Iceni's Gemini, and found that Datawatch's Redwing didn't offer a demo download. Several of these would be interesting *if* I had to do this chore frequently. (Thank goodness I don't!)

Al provided a method for doing it straight from PDF:

>If you Distill the postscript file you can capture the text from the
>generated PDF using Acrobat. When you open the PDF, set the View to
>Continuous. You can then do an Edit->Select All. Three gotchas; 1) this will
>capture text ONLY, if there are illustrations, sorry; 2) tables will come
>out reading across in left-to-right rows, they will have to be rebuilt; 3)
>check all PDF page breaks carefully, I have experienced missing information
>(maybe a line or two of text, nothing real serious) where the PDF breaks
>across pages.

And I was planning to try that when I finally unearthed the correct export command in GhostView.

Many thanks to Colin and Scott, who both went to the trouble of testing solutions and sending me roadmaps!

Colin:
I opened timesfax.pdf in gsview, and used the "Text Extract..." option in the edit menu. This allows you to select a page or pages to extract, then prompts for an output file.
<snip>
There is also a command line version, gswin32c: I haven't tried that.

Scott:
>As an experiment, I verified that I could do it.
>I opened a PDF in Acrobat and exported it to PS, not including embedded fonts.
>I opened MacGS and used the ps2ascii Device, and voila!
>A perfect ASCII dump.
>
>You can basically do the same in regular GhostScript with GhostView.
>I found that the headers and footers are placed at the top of each
>page. That is the only way you can tell what went where. Of course,
>graphics are lost.

Thanks again, to all of you,

Deborah Snavely
Document Architect, Technical Publications, Aurigin Systems, Inc.
http://www.aurigin.com/
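
P.S. For anyone who would rather script the command-line route Colin mentions than click through the menu, here is a minimal sketch in Python. It is not what any of the respondents ran. It assumes a current Ghostscript release that includes the txtwrite output device (the ps2ascii route Scott describes is a different mechanism), and that the executable is on your PATH as "gs" (or "gswin32c" on Windows). The function names build_gs_command and extract_text are made up for illustration.

```python
import subprocess
from typing import List, Optional


def build_gs_command(pdf_path: str,
                     txt_path: str,
                     first_page: Optional[int] = None,
                     last_page: Optional[int] = None,
                     executable: str = "gs") -> List[str]:
    """Build a Ghostscript command line that dumps a PDF's text to a file.

    Assumption: the installed Ghostscript provides the txtwrite device.
    """
    cmd = [
        executable,           # e.g. "gs", or "gswin32c" on Windows
        "-q",                 # suppress the startup banner
        "-dNOPAUSE",          # do not pause between pages
        "-dBATCH",            # exit when the input has been processed
        "-dSAFER",            # restrict file access from the document
        "-sDEVICE=txtwrite",  # text-extraction output device
    ]
    # Optional page range, like choosing pages in the Text Extract dialog.
    if first_page is not None:
        cmd.append(f"-dFirstPage={first_page}")
    if last_page is not None:
        cmd.append(f"-dLastPage={last_page}")
    cmd.append(f"-sOutputFile={txt_path}")
    cmd.append(pdf_path)
    return cmd


def extract_text(pdf_path: str, txt_path: str, **options) -> None:
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_gs_command(pdf_path, txt_path, **options), check=True)
```

Calling extract_text("timesfax.pdf", "timesfax.txt", first_page=1, last_page=3, executable="gswin32c") would be the scripted counterpart of picking pages 1 to 3 in the dialog. The same caveats from the replies still apply: graphics are lost, and tables and page breaks need checking by hand.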