@vdavez
Last active June 17, 2024 19:40

Revisions

  1. vzvenyach revised this gist Nov 2, 2013. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion docx2md.md
    @@ -10,7 +10,7 @@ As it turns out, there are several open-source tools that allow for conversion b

    Then I found [unoconv](http://dag.wieers.com/home-made/unoconv/). This little tool takes advantage of OpenOffice's ability to convert a Word document into a bunch of different formats. But, unoconv too has a bit of a downside. Specifically, unoconv tries to keep a lot of the formatting that Word has embedded in a document. The output is, well, messy.

    But, by using unoconv and pandoc in combination, you can get a pretty clean output.
    But, by using unoconv and pandoc in combination, you can get a pretty clean output. And, the best part is that it retains footnotes and other key syntax (italics, etc.).

    ## Example

  2. vzvenyach revised this gist Nov 2, 2013. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion docx2md.md
    @@ -19,4 +19,4 @@ Say you have the Council Rules in a Word Document named "test.docx." [(For a rea
    unoconv -f html test.docx
    pandoc -f html -t markdown -o test.md test.html

    Out comes a beautiful markdown file. Admittedly, there's a bit of junk at the top from the Table of Contents. I deleted this when I rendered it nicely with strapdown.js. [In the end, here's my nicely rendered version of the Rules.](http://vzvenyach.github.io/Council_Rules/Rules.html).
    Out comes a beautiful markdown file. Admittedly, there's a bit of junk at the top from the Table of Contents. I deleted this when I rendered it nicely with strapdown.js. [In the end, here's my nicely rendered version of the Rules.](http://vzvenyach.github.io/Council_Rules/Rules.html)
  3. vzvenyach revised this gist Nov 2, 2013. 1 changed file with 5 additions and 1 deletion.
    6 changes: 5 additions & 1 deletion docx2md.md
    @@ -14,5 +14,9 @@ But, by using unconv and pandoc in combination, you can get a pretty clean outpu

    ## Example

    Say you have the Council Rules in a Word Document named "test.docx." [(For a real-life example, visit http://github.com/vzvenyach/Council_Rules/).](http://github.com/vzvenyach/Council_Rules/) Now, you run the following at the command line:

    unoconv -f html test.docx
    pandoc -f html -t markdown -o test.md test.html
    pandoc -f html -t markdown -o test.md test.html

    Out comes a beautiful markdown file. Admittedly, there's a bit of junk at the top from the Table of Contents. I deleted this when I rendered it nicely with strapdown.js. [In the end, here's my nicely rendered version of the Rules.](http://vzvenyach.github.io/Council_Rules/Rules.html).
  4. vzvenyach revised this gist Nov 2, 2013. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions docx2md.md
    @@ -14,5 +14,5 @@ But, by using unconv and pandoc in combination, you can get a pretty clean outpu

    ## Example

    unoconv -f html test.docx
    pandoc -f html -t markdown -o test.md test.html
    unoconv -f html test.docx
    pandoc -f html -t markdown -o test.md test.html
  5. vzvenyach created this gist Nov 2, 2013.
    18 changes: 18 additions & 0 deletions docx2md.md
    @@ -0,0 +1,18 @@
    # Converting a Word Document to Markdown in Two Moves

    ## The Problem

    A lot of important government documents are created and saved in Microsoft Word (*.docx). But Microsoft Word is a proprietary format, and it's not really useful for presenting documents on the web. So, I wanted to find a way to convert a .docx file into markdown.

    ## The Solution

    As it turns out, there are several open-source tools that allow for conversion between file types. [Pandoc](http://johnmacfarlane.net/pandoc/) is one of them, and it's powerful. In fact, pandoc's website says "If you need to convert files from one markup format into another, pandoc is your swiss-army knife." But, although pandoc can convert from markdown into .docx, at the time of writing it doesn't work in the other direction.

    Then I found [unoconv](http://dag.wieers.com/home-made/unoconv/). This little tool takes advantage of OpenOffice's ability to convert a Word document into a bunch of different formats. But, unoconv too has a bit of a downside. Specifically, unoconv tries to keep a lot of the formatting that Word has embedded in a document. The output is, well, messy.

    But, by using unoconv and pandoc in combination, you can get a pretty clean output.

    ## Example

    unoconv -f html test.docx
    pandoc -f html -t markdown -o test.md test.html
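
If you run the two commands above often, they can be wrapped in a small shell function. This is only a sketch: the name `docx_to_md` is made up for illustration and is not part of the original gist, and it assumes that `unoconv -f html` writes the `.html` file next to the input, as in the example above.

```shell
#!/bin/sh
# Hypothetical helper (the name docx_to_md is an assumption, not from the gist).
# Converts FILE.docx to FILE.md using the same two commands as the example.
docx_to_md() {
    input="${1:-}"

    # Only accept paths ending in .docx, so the output names can be derived
    case "$input" in
        *.docx) ;;
        *) echo "usage: docx_to_md FILE.docx" >&2; return 2 ;;
    esac
    base="${input%.docx}"

    # Move 1: unoconv produces (messy) HTML, assumed to land at "$base.html"
    unoconv -f html "$input" || return 1

    # Move 2: pandoc turns that HTML into markdown
    pandoc -f html -t markdown -o "$base.md" "$base.html" || return 1

    # Print the path of the markdown file that was written
    echo "$base.md"
}
```

Calling `docx_to_md test.docx` should leave `test.html` and `test.md` beside the original file. The intermediate HTML is kept on purpose, since it helps when checking where stray formatting came from.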