This answer will omit step 3, since that's not programmable.

If you don't want a programmatic way to split documents, the modern way is to use stapler. In your favorite shell:

    stapler burst file.pdf

This generates one PDF file per page, numbered 1 to N, where N is the number of pages.

Stapler itself uses PyPDF2, and the code for splitting a PDF file is not that complex. The following function splits a file and saves the individual pages in the current directory (shamelessly copying from the commands.py file):

    import math
    import os
    from PyPDF2 import PdfFileWriter, PdfFileReader

    def split(filename):
        with open(filename, 'rb') as inputfp:
            inputpdf = PdfFileReader(inputfp)
            base, ext = os.path.splitext(os.path.basename(filename))

            # Prefix the output template with zeros so that ordering is preserved
            digits = int(math.log10(inputpdf.getNumPages())) + 1
            output_template = ''.join([base, '_', '%0', str(digits), 'd', ext])

            for page in range(inputpdf.getNumPages()):
                # One writer per page, so each output file holds a single page
                outputpdf = PdfFileWriter()
                outputpdf.addPage(inputpdf.getPage(page))
                outputname = output_template % (page + 1)
                with open(outputname, 'wb') as outputfp:
                    outputpdf.write(outputfp)

Now to convert the PDFs to editable files, I'd probably use pdf2svg. If we take a look at the pdf2svg.c file, we can see that the code in principle is not that complex (assuming the input file name is in the inputname variable and the output file name is in the outputname variable). A minimal working example in Python follows. It requires the pycairo and pypoppler libraries:

    import os
    import cairo
    import poppler

    def convert(inputname, outputname):
        # Convert the input file name to a URI to please poppler
        uri = 'file://' + os.path.abspath(inputname)
        pdffile = poppler.document_new_from_file(uri, None)

        # We only have one page, since we split prior to converting.
        page = pdffile.get_page(0)
        width, height = page.get_size()
        surface = cairo.SVGSurface(outputname, width, height)
        context = cairo.Context(surface)

        # Now we finally can render the PDF to SVG
        page.render_for_printing(context)
        context.show_page()
        surface.finish()

At this point you should have an SVG in which all text has been converted to paths, and you will be able to edit it with Inkscape without rendering issues.

You can also call pdf2svg in a for loop to convert a multi-page PDF, but you would need to know the number of pages beforehand. The code below figures out the number of pages and does the conversion in a single step. It requires only pycairo and pypoppler:

    import os, math
    import cairo
    import poppler

    def convert(inputname, base=None):
        '''Converts a multi-page PDF to multiple SVG files.

        :param inputname: Name of the PDF to be converted
        :param base: Base name for the SVG files (optional)
        '''
        if base is None:
            base, ext = os.path.splitext(os.path.basename(inputname))

        uri = 'file://' + os.path.abspath(inputname)
        pdffile = poppler.document_new_from_file(uri, None)
        pages = pdffile.get_n_pages()

        # Zero-pad the page number so that ordering is preserved
        digits = int(math.log10(pages)) + 1
        output_template = ''.join([base, '_', '%0', str(digits), 'd', '.svg'])

        for nthpage in range(pages):
            page = pdffile.get_page(nthpage)
            outputname = output_template % (nthpage + 1)
            width, height = page.get_size()
            surface = cairo.SVGSurface(outputname, width, height)
            context = cairo.Context(surface)
            page.render_for_printing(context)
            context.show_page()
            surface.finish()

To reassemble, you can use the pair inkscape / stapler to convert the files manually. But it is not hard to write code that does this. To convert from SVG and merge everything into a single PDF:

    import rsvg
    import cairo

    def convert_merge(inputfiles, outputname):
        # We have to create a PDF surface and inform a size. The size is
        # irrelevant, though, as we will define the sizes of each page
        outputsurface = cairo.PDFSurface(outputname, 1, 1)
        outputcontext = cairo.Context(outputsurface)

        for inputfile in inputfiles:
            # Assumes the old python-rsvg binding (rsvg.Handle)
            svg = rsvg.Handle(file=inputfile)
            # Resize the current page to match the SVG, then draw it
            outputsurface.set_size(svg.props.width, svg.props.height)
            svg.render_cairo(outputcontext)
            outputcontext.show_page()

        outputsurface.finish()

PS: It should be possible to use the command pdftocairo, but it doesn't seem to call render_for_printing(), which makes the output SVG maintain the font information.

Here's what you really want: font substitution. You want some code or app that can go through the file and make appropriate changes to the embedded fonts. This task is doable and is anywhere from easy to non-trivial. It's easy when you have a font that matches the metrics of the font in the file and the encoding used for the font is sane. You could probably do this with iText or DotPdf (the latter is not free beyond the evaluation, and is my company's product). If you modified pdf2ps, you could probably manage changing the fonts on the way through too. If the fonts used in the file are font subsets that have creative reencoding, then you are in hell and will likely have all manner of pain doing the change.

PostScript was designed at a point when there was no Unicode. Adobe used a single byte for characters, and whenever you rendered a string, the glyph to draw was taken from a 256-entry table called the encoding vector. If a standard encoding didn't have what you wanted, you were encouraged to make fonts on the fly, based on the standard font, that differed only in encoding. When Adobe created Acrobat, they wanted to make the transition from PostScript as easy as possible, so the PDF font mechanism was modeled on PostScript's. When the ability to embed fonts into PDFs was added, it was clear that this would bloat the files, so PDF also included the ability to have font subsets. Font subsets are made by taking an existing font, removing all the glyphs that won't be used, and re-encoding it into the PDF.
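To make the point about encoding vectors and re-encoded font subsets more concrete, here is a toy sketch in plain Python. It does not read real fonts or PDFs, and it is not how any PDF library works internally. The names make_encoding, glyphs_for and subset_reencode are made up for illustration, and renumbering glyphs from 1 is only one plausible scheme a subsetter might use.

```python
NOTDEF = ".notdef"  # conventional name for "no glyph at this code"


def make_encoding(assignments):
    """Build a 256-entry encoding vector from a {byte_code: glyph_name} dict."""
    vector = [NOTDEF] * 256
    for code, glyph in assignments.items():
        vector[code] = glyph
    return vector


def glyphs_for(data, encoding):
    """Return the glyph name drawn for each byte of a string."""
    return [encoding[code] for code in bytearray(data)]


def subset_reencode(text):
    """Keep only the glyphs used by text and renumber them starting at 1.

    Hypothetical scheme: returns the re-encoded bytes and the matching
    256-entry encoding vector.
    """
    codes = {}
    for ch in text:
        if ch not in codes:
            codes[ch] = len(codes) + 1
    encoded = bytes(bytearray(codes[ch] for ch in text))
    vector = make_encoding({code: ch for ch, code in codes.items()})
    return encoded, vector


# A "sane" encoding: byte values match ASCII for the letters we need.
standard = make_encoding({ord(c): c for c in "Helo"})

encoded, subset = subset_reencode("Hello")

print(glyphs_for(b"Hello", standard))  # readable bytes, expected glyphs
print(list(bytearray(encoded)))        # bytes no longer resemble the text
print(glyphs_for(encoded, subset))     # still draws the right glyphs
print(glyphs_for(b"Hello", subset))    # plain ASCII now hits empty slots
```

With the sane encoding, the bytes in the file are the text itself, so swapping in another font with matching metrics is straightforward. With the re-encoded subset, the bytes only make sense together with that particular encoding vector, which is why substituting fonts in such files is painful.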