About Me

Ireland
Hello, my name is Cathal Coffey. I am best described as a hybrid between a developer and an adventurer. When I am not behind a keyboard coding, I am hiking and climbing the beautiful mountains of my home country, Ireland. I am a full-time student studying Computer Science & Software Engineering at the National University of Ireland Maynooth. I am finishing the final year of a four-year degree in September 2009. I am the creator of an open source project on codeplex.com called DocX. At the moment I spend a lot of my free time advancing DocX, and I enjoy this very much. My aim is to build a community around DocX and add features based on requests from this community. I really enjoy hearing about how people are using DocX in their work or personal projects, so if you are one of these people, please send me an email. Cathal (coffey.cathal@gmail.com)

Thursday, December 23, 2010

Replace text across many documents in Parallel

.NET 4.0 makes parallel programming easy. Below is an example of how to replace text across many .docx documents in parallel.

This example contains four functions.

1) Replace: This function opens a document, replaces the text and saves the changes.

2) NonParallel_ReplaceText: This is how you would replace text across multiple documents without using parallel execution. It is included for comparison's sake.

3) Parallel_ReplaceText: This is how you would replace text across multiple documents in parallel.

4) Main: This function does the work sequentially and then in parallel, and prints the time taken for both.

Before running this code replace the line
DirectoryInfo di = new DirectoryInfo(@"C:\Users\Cathal\Desktop\multiple");
with a directory on your machine that contains many .docx documents.

Notes:

1) There is overhead when executing code in parallel. Make sure you're doing enough work to justify parallel execution. For example: if you run this code on 4 small documents, the function NonParallel_ReplaceText may run faster than its parallel equivalent.

2) Run this example without the debugger. The debugger adds overhead which makes this code run significantly slower.

3) You can download and build the latest version of DocX.dll from http://docx.codeplex.com/SourceControl/list/changesets#.
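Note 1 can be turned into a small guard. The following is only a sketch: the class name BatchRunner, the method Run and the default threshold of 16 files are hypothetical, made up for illustration, and the right threshold depends on your documents and machine. It runs the work sequentially for small batches and falls back to Parallel.ForEach (with the degree of parallelism capped at the processor count) for larger ones.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static class BatchRunner
{
    // Hypothetical helper: runs `work` once per file.
    // Uses a plain loop when there are fewer than `parallelThreshold` files,
    // otherwise uses Parallel.ForEach. Returns true if the parallel path was taken.
    public static bool Run(IEnumerable<string> files, Action<string> work, int parallelThreshold = 16)
    {
        List<string> list = files.ToList();

        if (list.Count < parallelThreshold)
        {
            // Small batch: the overhead of parallel execution is unlikely to pay off.
            foreach (string file in list)
            {
                work(file);
            }
            return false;
        }

        // Large batch: allow at most one concurrent worker per processor.
        var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };
        Parallel.ForEach(list, options, work);
        return true;
    }
}
```

With the Replace function from this post, a call might look like BatchRunner.Run(Directory.GetFiles(path, "*.docx"), f => Replace(f, "pear", "raep")); where path is your own directory.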

Code Snippet
using System;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;
using Novacode;

namespace testDocX
{
    class Program
    {
        static void Main(string[] args)
        {
            // Directory containing many .docx documents.
            DirectoryInfo di = new DirectoryInfo(@"C:\Users\Cathal\Desktop\multiple");

            // Replace sequentially and print the time taken in milliseconds.
            Console.WriteLine("Non-Parallel took " + NonParallel_ReplaceText(di, "pear", "raep") + " milliseconds.");

            // Replace in parallel (undoing the first replacement) and print the time taken in milliseconds.
            Console.WriteLine("Parallel took " + Parallel_ReplaceText(di, "raep", "pear") + " milliseconds.");

            // Wait until the user presses a key before exiting.
            Console.ReadKey();
        }

        // Replace text across multiple documents sequentially.
        private static long NonParallel_ReplaceText(DirectoryInfo di, string a, string b)
        {
            // Create and start a new Stopwatch, we will use this to time execution.
            Stopwatch sw = Stopwatch.StartNew();

            // Loop through each .docx document in the specified directory.
            // The "*.docx" filter skips other file types that DocX cannot load.
            foreach (FileInfo fi in di.GetFiles("*.docx"))
            {
                // Replace text in this document.
                Replace(fi.FullName, a, b);
            }

            sw.Stop(); // Stop the stopwatch.

            // Return the time taken in milliseconds.
            return sw.ElapsedMilliseconds;
        }

        // Replace text across multiple documents in parallel.
        private static long Parallel_ReplaceText(DirectoryInfo di, string a, string b)
        {
            // Create and start a new Stopwatch, we will use this to time execution.
            Stopwatch sw = Stopwatch.StartNew();

            // Process each .docx document in the specified directory in parallel.
            Parallel.ForEach
            (
                di.GetFiles("*.docx"),
                currentFile =>
                {
                    Replace(currentFile.FullName, a, b);
                }
            );

            sw.Stop(); // Stop the stopwatch.

            // Return the time taken in milliseconds.
            return sw.ElapsedMilliseconds;
        }

        // Replace the string a with the string b in filename and save the changes.
        static void Replace(string filename, string a, string b)
        {
            // Load the document.
            using (DocX document = DocX.Load(filename))
            {
                // Replace text in this document.
                document.ReplaceText(a, b);

                // Save changes made to this document.
                document.Save();
            } // Release this document from memory.
        }
    }
}
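One thing to keep in mind with Parallel.ForEach: if the loop body throws, the loop stops scheduling new work and the exceptions surface as an AggregateException, so a single unreadable document can cut the whole batch short. The sketch below is one possible way around that, not part of DocX or of the example above: the names SafeBatch and RunAll are hypothetical. It catches exceptions per file and returns them, so the remaining files are still processed.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static class SafeBatch
{
    // Hypothetical helper: runs `work` on every file in parallel and
    // returns a map from file name to the exception it raised, if any.
    public static IDictionary<string, Exception> RunAll(IEnumerable<string> files, Action<string> work)
    {
        // ConcurrentDictionary is safe to write to from several threads at once.
        var failures = new ConcurrentDictionary<string, Exception>();

        Parallel.ForEach(files, file =>
        {
            try
            {
                work(file);
            }
            catch (Exception ex)
            {
                // Record the failure and carry on with the other files.
                failures[file] = ex;
            }
        });

        return failures;
    }
}
```

With the Replace function above, a call might look like SafeBatch.RunAll(Directory.GetFiles(path, "*.docx"), f => Replace(f, "pear", "raep")); followed by a loop that prints each failed file name and its exception message.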

18 comments:

  1. Thanks a lot, you saved my life :)

  2. Hi cathal,

    Thanks for the great library.
    Can you please also tell me whether this library supports converting docx to PDF, or reading the entire document and its settings so that I can convert it to HTML? I basically need to display the docx file as HTML or PDF. Thanks in advance for the help.

  4. Hey, I just want to know whether it is possible with the DocX library to find a specific word in a document and add a string just after that word. I don't want to replace any text, only to find some text and add other text after it, separated by a space.

    Thanks And Regards
    Toshim Shaikh

  5. I'm stuck on this error:

    Could not load type 'Novacode.DocX' from assembly 'docx, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null'.

  6. Hi, can you please tell me how to replace text with an entire document? Something like sandwiching one document inside another at a particular position, depending on the occurrence of some string or text.

  7. Can you help me input a string (about 1000 words) with format in to MsWord. I tried use Docx Library but I can insert chr(13) for split multi paragraphs. Thanks!

  8. How can we do find and replace in a header or footer?

    Replies
    1. Can you use a table to do that?

  10. Hi,
    I created a console application and added the DocX library; however, when I try to run the application I get the following error:
    "An unhandled exception of type 'System.TypeLoadException' occurred in mscorlib.dll

    Additional information: Could not load type 'Novacode.DocX' from assembly 'DOCX, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null'."

    Please help.

    Thank you

  11. Hi,

    While replacing text in a Word document, I can't replace text in the default header and footer.

  12. How do I insert an HTML string (text and image) into Word?
