About Me

Ireland
Hello, my name is Cathal Coffey. I am best described as a hybrid between a developer and an adventurer. When I am not behind a keyboard coding, I am hiking and climbing the beautiful mountains of my home country, Ireland. I am a full-time student studying Computer Science & Software Engineering at the National University of Ireland Maynooth. I am finishing the final year of a four-year degree in September 2009. I am the creator of an open source project on codeplex.com called DocX. At the moment I spend a lot of my free time advancing DocX, and I enjoy this very much. My aim is to build a community around DocX and add features based on requests from this community. I really enjoy hearing about how people are using DocX in their work or personal projects, so if you are one of these people, please send me an email. Cathal (coffey.cathal@gmail.com)

Saturday, October 31, 2009

Converting .docx into (.doc, .pdf, .html)

Introduction

A DocX user asked me during the week when I was going to support converting Word 2007 documents (.docx) into other useful formats such as .doc, .pdf, and .html. I would love to add this functionality to DocX; however, there is a problem.

The Problem

The only easy way to do this conversion is to use Microsoft’s Office interop libraries. For anyone who doesn't know what Microsoft’s Office interop libraries are, I envy you.

The Microsoft Office interop libraries are available in the Add Reference dialog.


The Code

Once you have added a reference to Microsoft.Office.Interop.Word, you can use the project below to convert a Word 2007 .docx into .doc, .pdf, and .html.

Code Snippet
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Word = Microsoft.Office.Interop.Word;
using Microsoft.Office.Interop.Word;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            // Convert Input.docx into Output.doc
            Convert(@"C:\users\cathal\Desktop\Input.docx", @"c:\users\cathal\Desktop\Output.doc", WdSaveFormat.wdFormatDocument);

            /*
             * Convert Input.docx into Output.pdf
             * Please note: You must have the Microsoft Office 2007 Add-in: Microsoft Save as PDF or XPS installed
             * http://www.microsoft.com/downloads/details.aspx?FamilyId=4D951911-3E7E-4AE6-B059-A2E79ED87041&displaylang=en
             */
            Convert(@"c:\users\cathal\Desktop\Input.docx", @"c:\users\cathal\Desktop\Output.pdf", WdSaveFormat.wdFormatPDF);

            // Convert Input.docx into Output.html
            Convert(@"c:\users\cathal\Desktop\Input.docx", @"c:\users\cathal\Desktop\Output.html", WdSaveFormat.wdFormatHTML);
        }

        // Convert a Word 2007 .docx into the format given by the format parameter.
        public static void Convert(string input, string output, WdSaveFormat format)
        {
            // Interop requires objects.
            object oMissing = System.Reflection.Missing.Value;
            object isVisible = true;
            object readOnly = false;
            object oInput = input;
            object oOutput = output;
            object oFormat = format;

            // Create an instance of Word.exe
            Word._Application oWord = new Word.Application();

            try
            {
                // Make this instance of Word invisible (it can still be seen in taskmgr).
                oWord.Visible = false;

                // Load a document into our instance of Word.exe
                Word._Document oDoc = oWord.Documents.Open(ref oInput, ref oMissing, ref readOnly, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref isVisible, ref oMissing, ref oMissing, ref oMissing, ref oMissing);

                // Make this document the active document.
                oDoc.Activate();

                // Save this document in the requested format.
                oDoc.SaveAs(ref oOutput, ref oFormat, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing);

                // Close the document before quitting Word.
                oDoc.Close(ref oMissing, ref oMissing, ref oMissing);
            }
            finally
            {
                // Always close Word.exe, even if the conversion failed.
                oWord.Quit(ref oMissing, ref oMissing, ref oMissing);
            }
        }
    }
}
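
Each of the three calls in Main passes a format that the output file's extension already implies. As a minimal sketch (not part of DocX or the interop libraries), a helper could derive the format from the path. The names SaveFormatPicker and FromPath are hypothetical. So that the sketch compiles without the interop reference, it returns the numeric values of the matching WdSaveFormat members rather than the enum itself.

```csharp
using System;
using System.IO;

// Hypothetical helper; not part of DocX or Microsoft.Office.Interop.Word.
static class SaveFormatPicker
{
    // Numeric values of the matching WdSaveFormat members.
    public const int WdFormatDocument = 0;  // WdSaveFormat.wdFormatDocument
    public const int WdFormatHtml = 8;      // WdSaveFormat.wdFormatHTML
    public const int WdFormatPdf = 17;      // WdSaveFormat.wdFormatPDF

    // Pick a save format from the extension of the output path.
    public static int FromPath(string outputPath)
    {
        // GetExtension returns null for a null path, so fall back to "".
        string extension = (Path.GetExtension(outputPath) ?? string.Empty).ToLowerInvariant();

        switch (extension)
        {
            case ".doc":
                return WdFormatDocument;
            case ".pdf":
                return WdFormatPdf;
            case ".htm":
            case ".html":
                return WdFormatHtml;
            default:
                throw new NotSupportedException("No save format known for extension '" + extension + "'.");
        }
    }
}
```

In a project that references the interop library, the result could be cast before it is passed on, for example Convert(input, output, (WdSaveFormat)SaveFormatPicker.FromPath(output)).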

The result

 

[Images: Input.docx and the three converted files, Output.doc, Output.pdf, and Output.html]

Please note

This code will only execute on a machine that has Microsoft Office installed on it. The Office interop libraries actually run a “hidden” instance of Office. If you run the above code and then take a look at taskmgr, you will see the following.

[Image: taskmgr showing the hidden Word process]
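
The same check can be made from code instead of taskmgr. The sketch below counts the processes named WINWORD, which is the process name of Word.exe. The class and method names are hypothetical. It only reports what is running and does not try to close anything.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper; counts running instances of Word (WINWORD.EXE).
static class WordProcessCheck
{
    public static int CountRunningWordInstances()
    {
        // GetProcessesByName expects the name without the .exe extension.
        Process[] processes = Process.GetProcessesByName("WINWORD");
        try
        {
            return processes.Length;
        }
        finally
        {
            // Process objects hold OS handles, so release them.
            foreach (Process process in processes)
            {
                process.Dispose();
            }
        }
    }
}
```

Calling WordProcessCheck.CountRunningWordInstances() before and after a conversion gives a rough indication of whether a hidden instance was left behind.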

If you want to convert to .pdf, you must also have the Microsoft Office 2007 Add-in: Microsoft Save as PDF or XPS installed.

It is for this reason that I have not included conversion functionality in my DocX library. I do not want DocX to have a dependency on Word.exe.

The future

Is there no way to do conversions without having Word.exe installed on my machine? I didn’t say that; I said there is no easy way. This looks very promising, if only I could find the time.

Donation?

As always, I offer this code to you for free. I am, however, a student, and if you would like to say thank you, you can buy me lunch by sending a €5 donation via PayPal.

23 comments:

  1. Cool trick to export to PDF. I had been looking for this for quite some time.

    Thanks for sharing

    Replies
    1. Hi, I am a software engineering student working on my final-year project, and I need help. I want to convert PDF files into different formats on the client side, that is, in JavaScript. Could you please help me?

      engg.nashib@gmail.com
  2. I just started looking at your open source project "DocX" and then saw this blog. I would really suggest you look at OpenXmlPowerTools.HtmlConverter and iTextSharp. You can use the two in combination to generate HTML, or a PDF from that HTML. It is not perfect, but it does the job pretty nicely without the overhead of the PIAs. The HTML is also pretty clean.

    Thanks,

  5. It works well. Excellent article. Thanks.

  6. It's pretty good and very easy to understand.

    Thanks

  7. oDoc.Activate(); error "Object reference not set to an instance of an object." for iss. Help me plssssss :(

  9. Same error: oDoc.Activate(); error "Object reference not set to an instance of an object." for issue. Help me plssssss :( That's why I tried your DLL, but it doesn't have this functionality...

    Replies
    1. Check the setting below:

      Start -> dcomcnfg.exe
      Computers
      My Computer
      DCOM Config
      Find Microsoft Word 97-2003 Document -> Properties
      On the Identity tab, change from The launching user to The interactive user
  10. Hi Cathal,

    This library appears to be optimised for writing and editing documents. If there are good interfaces to enumerate the document, then I would suggest coding up iTextSharp to output to many different potential formats. (From memory, iTextSharp has a generic interface to output to many formats.)

    I would be willing to donate towards such a project. At the moment there is a lockup of commercial products. Docx to PDF in particular would be great to have in the open source realm, as even commercial products can have bugs which you can't fix yourself.

  11. Hi,
    it is not working in IIS.
    Can you suggest how to do this?

  12. Can we convert .doc to .pdf without installing Microsoft Office or OpenOffice, and also without using third-party DLLs?

  13. Not working: you need to have Microsoft Office installed to use Microsoft’s Office interop libraries.

  14. I had some errors with this when creating .doc and .html files. I found that I had to use oDoc.Close() to get everything to work properly.

  15. If DocX uses the Microsoft.Office.Interop DLL, then why don't we convert the document to PDF using only that?

  16. Thanks for sharing wonderful tips.
    I am very curious to know how a .doc file can be converted into PDF without installing MS Office. The problem is that there is a restriction on installing MS Office on the production server. Please suggest the best way to convert Word to PDF, or any third-party tools.
