@slavniyteo
Last active October 26, 2018 11:21
[Title image is fetched successfully]
Bad news about Word docs, PDFs, etc.
The aforementioned tools work really well with visual and audio media, but text documents are unfortunately much more complex. Documents like .docx, .xlsx, .pdf, .ppt, and others usually contain multiple embedded images, videos, and other media files. They’re kind of like nesting dolls. So, while it’s possible to scrub basic metadata tags from any of these documents, the objects embedded within them have so much metadata of their own that each can be individually scrutinized. This makes the idea of software-based redaction somewhat foolish.
Here’s an example: Using another open source tool called Peepdf, we’re able to see all the different objects (like images) embedded into any .pdf file. So, even if we were to strip the metadata from the document itself, anyone can extract any of its individual embedded images and parse their metadata for more identifying context using any of the aforementioned methods. (Also, did I forget to mention that embedded images could be extremely tiny and invisible to the naked eye?)
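To get a rough feel for how many embedded images a PDF carries, here is a minimal Python sketch. It is not how Peepdf works: it only scans the raw bytes for the `/Subtype /Image` marker that image objects declare in their dictionaries, so images stored inside compressed object streams will be missed. The function names are made up for this illustration.

```python
import re

# An image object's dictionary contains "/Subtype /Image" (whitespace optional).
IMAGE_MARKER = re.compile(rb"/Subtype\s*/Image\b")


def count_image_objects(pdf_bytes: bytes) -> int:
    """Count image markers visible in the raw, uncompressed bytes of a PDF."""
    return len(IMAGE_MARKER.findall(pdf_bytes))


def count_images_in_file(path: str) -> int:
    """Read a PDF from disk and count the image markers it exposes."""
    with open(path, "rb") as handle:
        return count_image_objects(handle.read())
```

A count above zero means there are embedded images worth inspecting; a count of zero proves nothing, since this naive scan cannot see compressed content.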
Instead, it’s best to recreate the document by flattening all the embedded objects before exporting and sharing it. For these types of documents, that means either printing them out, then rescanning; or exporting them to a different format altogether.
First Look Media’s PDF Redact Tools is a great PDF flattening tool. It automates metadata removal by creating an image of each page within a document, and gluing them back together into a brand new PDF. While this is a fabulous tool, there are two downsides: the resulting PDF is usually a lot larger than its original, which might make export and sharing more cumbersome; and it relies upon a library, ImageMagick, with a somewhat buggy history. That said, PDF Redact Tools is incredibly easy to work with, and does an excellent job at metadata removal. If you can install it on a dedicated, sandboxed machine, it makes a great tool to have in your toolkit.
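If you want to script the flattening step, a small Python wrapper might look like the sketch below. It assumes the `pdf-redact-tools` command is on your PATH and that its `--sanitize` option performs the page-to-image-and-back round trip described above; check `pdf-redact-tools --help` on your own install before relying on it. The wrapper function names are hypothetical.

```python
import shutil
import subprocess
from typing import List


def flatten_command(pdf_path: str) -> List[str]:
    """Build the command line that asks PDF Redact Tools to flatten one PDF."""
    # Assumption: --sanitize is the one-step flatten option of the installed version.
    return ["pdf-redact-tools", "--sanitize", pdf_path]


def flatten(pdf_path: str) -> None:
    """Run the flattening command, failing loudly if the tool is missing."""
    if shutil.which("pdf-redact-tools") is None:
        raise RuntimeError("pdf-redact-tools is not installed or not on PATH")
    # check=True raises CalledProcessError if the tool exits with an error.
    subprocess.run(flatten_command(pdf_path), check=True)
```

Passing the arguments as a list, rather than a shell string, keeps odd characters in file names from being interpreted by a shell.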
If you’re interested in doing named entity recognition (NER), word frequencies, or just better searching within text, a flattened PDF file will be hard to work with because all the text will now be image-based. Thankfully, tools exist to “read” images into workable text, like Tesseract. You can explode a flattened PDF into individual images of the pages using PDF Redact Tools, then feed the pages into Tesseract to create a text document that can be worked on with any language processing tool. Beware, however, that the optical character recognition is imperfect, and you might have to comb through the resulting text to fix typos. The English dictionary data is installed by default, but other language packs are available. Visit the Tesseract wiki for more information.
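The explode-then-OCR workflow can be sketched in Python roughly as follows. It assumes the page images are PNG files sitting in one directory with a page number in each file name, and that the `tesseract` command is installed; the helper names are invented for this example. Tesseract's command line takes an image, an output base name, and an optional `-l` language code, and writes its result to `<output base>.txt`.

```python
import re
import subprocess
from pathlib import Path
from typing import List


def page_number(image: Path) -> int:
    """Extract the last number in a file name so page-10 sorts after page-2."""
    digits = re.findall(r"\d+", image.stem)
    return int(digits[-1]) if digits else 0


def ocr_command(image: Path, lang: str = "eng") -> List[str]:
    """Build a tesseract call that writes <image without suffix>.txt."""
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


def ocr_pages(pages_dir: str, lang: str = "eng") -> str:
    """OCR every PNG in pages_dir in page order and return the joined text."""
    texts = []
    for image in sorted(Path(pages_dir).glob("*.png"), key=page_number):
        subprocess.run(ocr_command(image, lang), check=True)
        texts.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

The resulting string still needs a human pass for OCR typos before you feed it to any language processing tool.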
Other redaction tools
If you don’t have Photoshop, you might enjoy GIMP, its open source alternative, which can be used for performing visual redactions to PDFs and other documents.
Audacity is an audio toolkit that allows you to splice audio to your liking. I find it’s the perfect tool for editing interviews that may contain off-the-record statements.
Be aware that these types of edits are non-destructive, meaning that metadata, project history, and original artifacts can be uncovered by forensic analysis. Using GIMP and Audacity is a great way to perform audio and visual redactions, but you should still take care to flatten your media before publishing by “jumping the analog hole” and using Exiftool to verify you’ve done it correctly.
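One way to make that Exiftool check repeatable is to have a script read Exiftool's JSON output (`exiftool -j <file>`) and flag any tags you care about. The sketch below is only an illustration: the watch list is a hypothetical starting point, not a complete inventory of identifying tags, and the function names are made up.

```python
import json
import subprocess
from typing import Dict, Iterable

# Illustrative watch list only; extend it for your own threat model.
WATCHED_TAGS = (
    "GPSLatitude", "GPSLongitude", "Make", "Model",
    "Author", "Creator", "Software", "SerialNumber",
)


def leftover_tags(exiftool_json: str,
                  watched: Iterable[str] = WATCHED_TAGS) -> Dict[str, object]:
    """Return the watched tags still present in `exiftool -j` output."""
    watched_set = set(watched)
    found: Dict[str, object] = {}
    # exiftool -j prints a JSON array with one object per inspected file.
    for record in json.loads(exiftool_json):
        for tag, value in record.items():
            if tag in watched_set:
                found[tag] = value
    return found


def check_file(path: str) -> Dict[str, object]:
    """Run exiftool on one file and report any watched tags that survived."""
    result = subprocess.run(
        ["exiftool", "-j", path], capture_output=True, text=True, check=True
    )
    return leftover_tags(result.stdout)
```

An empty result only means none of the watched tags are present; it is still worth reading the full Exiftool output yourself before publishing.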
Although there are a number of really great software tools to help understand, manipulate, and scrub metadata, nothing is perfect. As we explored in the previous section, digital forensic specialists might still be able to uncover bits of history from the bytes in any digital artifact. One creative way to be sure that original metadata is inaccessible is to recreate the original through “the analog hole”.
Have you ever bought a bootleg movie? (It’s ok, no judgements!) If you have, you might remember that those movies were created by someone sneaking into the theater with their own camcorder, and simply taping the entire movie from their seat. That’s an example of the analog hole; and you can use similar tactics to create unattributable copies of your original media.
Some ideas for jumping through the analog hole
[2018-10-26 11:02:28] request.INFO: Matched route "api_post_entries". {"route":"api_post_entries","route_parameters":{"_controller":"Wallabag\\ApiBundle\\Controller\\EntryRestController::postEntriesAction","_format":"json","_route":"api_post_entries"},"request_uri":"http://10.8.0.1:8081/api/entries.json","method":"POST"} []
[2018-10-26 11:02:29] app.DEBUG: Restricted access config enabled? {"enabled":0} []
[2018-10-26 11:02:29] graby.DEBUG: Graby is ready to fetch [] []
[2018-10-26 11:02:29] graby.DEBUG: . looking for site config for freedom.press in primary folder {"host":"freedom.press"} []
[2018-10-26 11:02:29] graby.DEBUG: Appending site config settings from global.txt [] []
[2018-10-26 11:02:29] graby.DEBUG: . looking for site config for global in primary folder {"host":"global"} []
[2018-10-26 11:02:29] graby.DEBUG: ... found site config global.txt {"host":"global.txt"} []
[2018-10-26 11:02:29] graby.DEBUG: Cached site config with key: freedom.press {"key":"freedom.press"} []
[2018-10-26 11:02:29] graby.DEBUG: . looking for site config for global in primary folder {"host":"global"} []
[2018-10-26 11:02:29] graby.DEBUG: ... found site config global.txt {"host":"global.txt"} []
[2018-10-26 11:02:29] graby.DEBUG: Appending site config settings from global.txt [] []
[2018-10-26 11:02:29] graby.DEBUG: Cached site config with key: global {"key":"global"} []
[2018-10-26 11:02:29] graby.DEBUG: Cached site config with key: freedom.press.merged {"key":"freedom.press.merged"} []
[2018-10-26 11:02:29] graby.DEBUG: Fetching url: https://freedom.press/training/everything-you-wanted-know-about-media-metadata-were-afraid-ask/ {"url":"https://freedom.press/training/everything-you-wanted-know-about-media-metadata-were-afraid-ask/"} []
[2018-10-26 11:02:29] graby.DEBUG: Trying using method "get" on url "https://freedom.press/training/everything-you-wanted-know-about-media-metadata-were-afraid-ask/" {"method":"get","url":"https://freedom.press/training/everything-you-wanted-know-about-media-metadata-were-afraid-ask/"} []
[2018-10-26 11:02:29] graby.DEBUG: Use default user-agent "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2" for url "https://freedom.press/training/everything-you-wanted-know-about-media-metadata-were-afraid-ask/" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://freedom.press/training/everything-you-wanted-know-about-media-metadata-were-afraid-ask/"} []
[2018-10-26 11:02:29] graby.DEBUG: Use default referer "http://www.google.co.uk/url?sa=t&source=web&cd=1" for url "https://freedom.press/training/everything-you-wanted-know-about-media-metadata-were-afraid-ask/" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://freedom.press/training/everything-you-wanted-know-about-media-metadata-were-afraid-ask/"} []
[2018-10-26 11:02:29] graby.DEBUG: Data fetched: [array] {"data":{"effective_url":"https://freedom.press/training/everything-you-wanted-know-about-media-metadata-were-afraid-ask/","body":"(only length for debug): 49416","headers":"text/html; charset=utf-8","all_headers":{"date":"Fri, 26 Oct 2018 11:02:29 GMT","content-type":"text/html; charset=utf-8","transfer-encoding":"chunked","connection":"keep-alive","set-cookie":"__cfduid=d57891acf6ec6dc307b5b1cc07c122d921540551749; expires=Sat, 26-Oct-19 11:02:29 GMT; path=/; domain=.freedom.press; HttpOnly","vary":"Cookie","x-frame-options":"SAMEORIGIN, SAMEORIGIN","content-security-policy":"style-src 'self' 'unsafe-inline'; img-src 'self' https://*.stripe.com https://analytics.freedom.press https://freedom.press https://piglet.freedom.press; connect-src 'self' https://checkout.stripe.com; script-src 'self' 'unsafe-eval' 'unsafe-inline' http://ajax.googleapis.com/ajax/libs/jquery/2.2.4/jquery.min.js https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://checkout.stripe.com https://analytics.freedom.press; frame-src 'self' https://www.google.com/ https://checkout.stripe.com; default-src 'self'; report-uri https://freedomofpress.report-uri.com/r/d/csp/enforce","strict-transport-security":"max-age=63072000; includeSubdomains; preload","x-content-type-options":"nosniff","x-xss-protection":"1; mode=block","referrer-policy":"same-origin","cf-cache-status":"HIT","expires":"Fri, 26 Oct 2018 13:02:29 GMT","cache-control":"public, max-age=7200","expect-ct":"max-age=604800, report-uri=\"https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct\"","server":"cloudflare","cf-ray":"46fc6e51bd53be25-MXP"},"status":200}} []
[2018-10-26 11:02:29] graby.DEBUG: Treating as UTF-8 {"encoding":"utf-8"} []
[2018-10-26 11:02:29] graby.DEBUG: Opengraph data: [array] {"ogData":{"og_title":"Everything you wanted to know about media metadata, but were afraid to ask","og_url":"https://freedom.press/training/everything-you-wanted-know-about-media-metadata-were-afraid-ask/","og_type":"website","og_image":"https://freedom.press/media/images/twitter_banner.width-1600.png","og_image_width":"840","og_image_height":"493","og_description":"Freedom of the Press Foundation protects and defends adversarial journalism in the 21st century."}} []
[2018-10-26 11:02:29] graby.DEBUG: Looking for site config files to see if single page link exists [] []
[2018-10-26 11:02:29] graby.DEBUG: Returning cached and merged site config for freedom.press {"host":"freedom.press"} []
[2018-10-26 11:02:29] graby.DEBUG: No "single_page_link" config found [] []
[2018-10-26 11:02:29] graby.DEBUG: Attempting to extract content [] []
[2018-10-26 11:02:29] graby.DEBUG: Returning cached and merged site config for freedom.press {"host":"freedom.press"} []
[2018-10-26 11:02:29] graby.DEBUG: Strings replaced: 0 (find_string and/or replace_string) {"count":0} []
[2018-10-26 11:02:29] graby.DEBUG: Attempting to parse HTML with libxml {"parser":"libxml"} []
[2018-10-26 11:02:29] graby.DEBUG: Trying //meta[@property="og:title"]/@content for title {"pattern":"//meta[@property=\"og:title\"]/@content"} []
[2018-10-26 11:02:29] graby.DEBUG: title matched: Everything you wanted to know about media metadata, but were afraid to ask {"title":"Everything you wanted to know about media metadata, but were afraid to ask"} []
[2018-10-26 11:02:29] graby.DEBUG: ...XPath match: {pattern} ["pattern","//meta[@property=\"og:title\"]/@content"] []
[2018-10-26 11:02:29] graby.DEBUG: Trying //meta[@property="article:published_time"]/@content for date {"pattern":"//meta[@property=\"article:published_time\"]/@content"} []
[2018-10-26 11:02:29] graby.DEBUG: Trying //html[@lang]/@lang for language {"pattern":"//html[@lang]/@lang"} []
[2018-10-26 11:02:29] graby.DEBUG: Trying //meta[@name="DC.language"]/@content for language {"pattern":"//meta[@name=\"DC.language\"]/@content"} []
[2018-10-26 11:02:29] graby.DEBUG: Using Readability [] []
[2018-10-26 11:02:29] graby.DEBUG: Detecting body [] []
[2018-10-26 11:02:29] graby.DEBUG: Pruning content [] []
[2018-10-26 11:02:29] graby.DEBUG: Success ? 1 {"is_success":true} []
[2018-10-26 11:02:29] graby.DEBUG: Filtering HTML to remove XSS [] []
[2018-10-26 11:02:29] graby.DEBUG: Returning data (most interesting ones): [array] {"data":{"status":200,"html":"<h3>Bad news about Word docs, PDFs, etc.</h3><p>The aforementioned tools work really well with visual and audio media, but text documents are unfortunately much more complex. Documents like .docx, .xlsx, .pdf, .ppt, and others usually contain multiple embedded images, videos, and other media files. They’re kind of like nesting dolls. So, while it’s possible to scrub basic metadata tags from any of these documents, the objects embedded within them have a so much metadata of their own that can be individually scrutinized. This makes the idea of software-based retraction somewhat foolish.</p><p>Here’s an example: Using another open source tool called <a href=\"https://github.com/jesparza/peepdf\">Peepdf</a>, we’re able to see all the different objects (like images) embedded into any.pdf file. So, even if we were to strip the metadata from the document itself, anyone can extract any of its individual embedded images, and parse their metadata for more identifying context using any of the aforementioned methods. (Also, did I forget to mention, that embedded images could be extremely tiny, and non-visible to the naked eye?)</p><p>Instead, it’s best to recreate the document by flattening all the embedded objects before exporting and sharing it. For these types of documents, that means either printing them out, then rescanning; or exporting them to a different format altogether.</p><p>First Look Media’s <a href=\"https://github.com/firstlookmedia/pdf-redact-tools\">PDF Redact Tools</a> is a great PDF flattening tool. It automates metadata removal by creating an image of each page within a document, and gluing them back together into a brand new PDF. 
While this is a fabulous tool, here are two downsides: the resulting PDF is usually a lot larger than its original, which might make export and sharing more cumbersome; and it relies upon a library, ImageMagick, with a somewhat buggy history. That said, PDF Redact Tools is incredibly easy to work with, and does an excellent job at metadata removal. If you can install it on a dedicated, sandboxed machine, it makes a great tool to have in your toolkit.</p><p>If you’re interested in doing named entity recognition (NER), word frequencies, or just better searching within text, a flattened PDF file will be hard to work with because all the text will now be image-based. Thankfully, tools exist to “read” images into workable text, like <a href=\"https://github.com/tesseract-ocr/tesseract\">Tesseract</a>. You can explode a flattened PDF into individual images of the pages using PDF Redact Tools, then feed the pages into Tesseract to create a text document that can be worked on with any language processing tool. Beware, however, that the optical character recognition is imperfect, and you might have to comb through the resulting text to fix typos. The english dictionary data is installed by default, but other language packs are available. Visit the Tesseract wiki for more information.</p><h3>Other redaction tools</h3><p>If you don’t have Photoshop, you might enjoy <a href=\"https://gimp.org\">GIMP</a>, its open source alternative, which can be used for performing visual redactions to PDFs and other documents.</p><p><a href=\"https://www.audacityteam.org\">Audacity</a> is an audio toolkit that allows you to splice audio to your liking. I find it’s the perfect tool for editing interviews that may contain off-the-record statements.</p><p>Be aware that these types of edits are non-destructive, meaning that metadata, project history, and original artifacts can be uncovered by forensic analysis. 
Using GIMP and Audacity is a great way to perform audio and visual redactions, but you should still take care to flatten your media before publishing by “jumping the analog hole” and using Exiftool to verify you’ve done it correctly.</p><p>Although there are a number of really great software tools to help understand, manipulate, and scrub metadata, nothing is perfect. As we explored in the previous section, digital forensic specialists might still be able to uncover bits of history from the bytes in any digital artifact. One creative way to be sure that original metadata is inaccessible is to recreate the original through “the analog hole”.</p><p>Have you ever bought a bootleg movie? (It’s ok, no judgements!) If you have, you might remember that those movies were created by someone sneaking into the theater with their own camcorder, and simply taping the entire movie from their seat. That’s an example of the analog hole; and you can use similar tactics to create unattributable copies of your original media.</p><h3>Some ideas for jumping through the analog hole</h3>","title":"Everything you wanted to know about media metadata, but were afraid to ask","language":null,"date":null,"authors":[],"url":"https://freedom.press/training/everything-you-wanted-know-about-media-metadata-were-afraid-ask/","content_type":"text/html","open_graph":{"og_title":"Everything you wanted to know about media metadata, but were afraid to ask","og_url":"https://freedom.press/training/everything-you-wanted-know-about-media-metadata-were-afraid-ask/","og_type":"website","og_image":"https://freedom.press/media/images/twitter_banner.width-1600.png","og_image_width":"840","og_image_height":"493","og_description":"Freedom of the Press Foundation protects and defends adversarial journalism in the 21st century."},"native_ad":false,"all_headers":{"date":"Fri, 26 Oct 2018 11:02:29 GMT","content-type":"text/html; 
charset=utf-8","transfer-encoding":"chunked","connection":"keep-alive","set-cookie":"__cfduid=d57891acf6ec6dc307b5b1cc07c122d921540551749; expires=Sat, 26-Oct-19 11:02:29 GMT; path=/; domain=.freedom.press; HttpOnly","vary":"Cookie","x-frame-options":"SAMEORIGIN, SAMEORIGIN","content-security-policy":"style-src 'self' 'unsafe-inline'; img-src 'self' https://*.stripe.com https://analytics.freedom.press https://freedom.press https://piglet.freedom.press; connect-src 'self' https://checkout.stripe.com; script-src 'self' 'unsafe-eval' 'unsafe-inline' http://ajax.googleapis.com/ajax/libs/jquery/2.2.4/jquery.min.js https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://checkout.stripe.com https://analytics.freedom.press; frame-src 'self' https://www.google.com/ https://checkout.stripe.com; default-src 'self'; report-uri https://freedomofpress.report-uri.com/r/d/csp/enforce","strict-transport-security":"max-age=63072000; includeSubdomains; preload","x-content-type-options":"nosniff","x-xss-protection":"1; mode=block","referrer-policy":"same-origin","cf-cache-status":"HIT","expires":"Fri, 26 Oct 2018 13:02:29 GMT","cache-control":"public, max-age=7200","expect-ct":"max-age=604800, report-uri=\"https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct\"","server":"cloudflare","cf-ray":"46fc6e51bd53be25-MXP"}}} []
[2018-10-26 11:02:29] app.DEBUG: DownloadImagesSubscriber: disabled. [] []
[2018-10-26 11:02:29] request.INFO: Matched route "api_patch_entries". {"route":"api_patch_entries","route_parameters":{"_controller":"Wallabag\\ApiBundle\\Controller\\EntryRestController::patchEntriesAction","_format":"json","entry":"32","_route":"api_patch_entries"},"request_uri":"http://10.8.0.1:8081/api/entries/32.json","method":"PATCH"} []
[2018-10-26 11:02:30] app.DEBUG: Restricted access config enabled? {"enabled":0} []
[2018-10-26 11:02:30] app.DEBUG: DownloadImagesSubscriber: disabled. [] []