Sunday, September 29, 2013

Performance Tips for All Applications

There are a few tips to remember when working on the CLR in any language. These are relevant to everyone, and should be the first line of defense when dealing with performance issues.

Throw Fewer Exceptions

Throwing exceptions can be very expensive, so make sure that you don't throw a lot of them. Use Perfmon to see how many exceptions your application is throwing. It may surprise you to find that certain areas of your application throw more exceptions than you expected. For better granularity, you can also check the exception count programmatically by using Performance Counters.
Finding and designing away exception-heavy code can result in a decent perf win. Bear in mind that this has nothing to do with try/catch blocks: you only incur the cost when the actual exception is thrown. You can use as many try/catch blocks as you want. Using exceptions gratuitously is where you lose performance. For example, you should stay away from things like using exceptions for control flow.
Here's a simple example of how expensive exceptions can be: we'll simply run through a For loop, generating thousands of exceptions and then terminating. Try commenting out the throw statement to see the difference in speed: those exceptions result in tremendous overhead.
public class ExceptionCost{
  public static void Main(string[] args){
    int j = 0;
    for(int i = 0; i < 10000; i++){
      try{
        j = i;
        // Comment out the next line to see the loop without exceptions.
        throw new System.Exception();
      } catch {}
    }
    System.Console.Write(j);
    return;
  }
}
  • Beware! The run time can throw exceptions on its own! For example, Response.Redirect() throws a ThreadAbortException. Even if you don't explicitly throw exceptions, you may use functions that do. Make sure you check Perfmon to get the real story, and use the debugger to find the source.
  • To Visual Basic developers: Visual Basic turns on integer checking by default, so that things like overflow and divide-by-zero throw exceptions. You may want to turn this off to gain performance.
  • If you use COM, you should keep in mind that HRESULTs can be returned as exceptions. Make sure you keep track of these carefully.
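As an illustration of designing away exceptions used for control flow, here is a minimal sketch that validates input two ways. It assumes a framework version that provides int.TryParse; the class and method names are hypothetical and chosen only for this example.

```csharp
using System;

public static class ParseDemo
{
    // Exception-driven version: every bad input costs a throw and a catch.
    public static int CountValidWithExceptions(string[] inputs)
    {
        int valid = 0;
        foreach (string s in inputs)
        {
            try { int.Parse(s); valid++; }
            catch (FormatException) { /* bad input handled as control flow */ }
        }
        return valid;
    }

    // Exception-free version: TryParse reports failure through its return value.
    public static int CountValidWithTryParse(string[] inputs)
    {
        int valid = 0;
        foreach (string s in inputs)
        {
            int ignored;
            if (int.TryParse(s, out ignored)) valid++;
        }
        return valid;
    }

    public static void Main()
    {
        string[] inputs = { "1", "two", "3", "", "5" };
        Console.WriteLine(CountValidWithExceptions(inputs)); // 3
        Console.WriteLine(CountValidWithTryParse(inputs));   // 3
    }
}
```

Both methods return the same count, but the second one never throws, so its cost does not grow with the number of bad inputs.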

Make Chunky Calls

A chunky call is a function call that performs several tasks, such as a method that initializes several fields of an object. Contrast this with chatty calls, which do very simple tasks and require multiple calls to get things done (such as setting every field of an object with a different call). It's important to make chunky, rather than chatty, calls across boundaries where the overhead is higher than for simple, intra-AppDomain method calls. P/Invoke, interop and remoting calls all carry overhead, and you want to use them sparingly. In each of these cases, you should try to design your application so that it doesn't rely on small, frequent calls that carry so much overhead.
A transition occurs whenever managed code is called from unmanaged code, and vice versa. The run time makes it extremely easy for the programmer to do interop, but this comes at a performance price. When a transition happens, the following steps need to be taken:
  • Perform data marshalling
  • Fix calling convention
  • Protect callee-saved registers
  • Switch thread mode so that GC won't block unmanaged threads
  • Erect an Exception Handling frame on calls into managed code
  • Take control of thread (optional)
To speed up transition time, try to make use of P/Invoke when you can. The overhead is as little as 31 instructions plus the cost of marshalling if data marshalling is required, and only 8 otherwise. COM interop is much more expensive, taking upwards of 65 instructions.
Data marshalling isn't always expensive. Primitive types require almost no marshalling at all, and classes with explicit layout are also cheap. The real slowdown occurs during data translation, such as text conversion from ASCII to Unicode. Make sure that data that gets passed across the managed boundary is only converted if it needs to be: it may turn out that simply by agreeing on a certain datatype or format across your program you can cut out a lot of marshalling overhead.
The following types are called blittable, meaning they can be copied directly across the managed/unmanaged boundary with no marshalling whatsoever: sbyte, byte, short, ushort, int, uint, long, ulong, float and double. You can pass these for free, as well as ValueTypes and single-dimensional arrays containing blittable types. The gritty details of marshalling can be explored further on the MSDN Library. I recommend reading it carefully if you spend a lot of your time marshalling.
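The chatty versus chunky distinction can be sketched as follows. RemoteCustomer is a hypothetical stand-in for an object that lives across an expensive boundary (P/Invoke, COM interop or remoting); it only counts calls and does not perform any real interop.

```csharp
using System;

// Hypothetical stand-in for an object behind an expensive boundary.
// Every public setter call is counted as one transition.
public class RemoteCustomer
{
    public int Transitions;
    private string name;
    private string city;
    private int age;

    // Chatty interface: one transition per field.
    public void SetName(string value) { Transitions++; name = value; }
    public void SetCity(string value) { Transitions++; city = value; }
    public void SetAge(int value) { Transitions++; age = value; }

    // Chunky interface: one transition sets everything.
    public void Initialize(string newName, string newCity, int newAge)
    {
        Transitions++;
        name = newName;
        city = newCity;
        age = newAge;
    }

    public string Describe() { return name + ", " + city + ", " + age; }
}

public static class ChunkyDemo
{
    public static void Main()
    {
        RemoteCustomer chatty = new RemoteCustomer();
        chatty.SetName("Ann");
        chatty.SetCity("Oslo");
        chatty.SetAge(30);

        RemoteCustomer chunky = new RemoteCustomer();
        chunky.Initialize("Ann", "Oslo", 30);

        Console.WriteLine("chatty transitions: " + chatty.Transitions); // 3
        Console.WriteLine("chunky transitions: " + chunky.Transitions); // 1
    }
}
```

Both objects end up in the same state, but the chunky version pays the per-call overhead once instead of three times.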

Design with ValueTypes

Use simple structs when you can, and when you don't do a lot of boxing and unboxing. Here's a simple example to demonstrate the speed difference:
using System;
namespace ConsoleApplication{
  public struct foo{
    public foo(double arg){ this.y = arg; }
    public double y;
  }
  public class bar{
    public bar(double arg){ this.y = arg; }
    public double y;
  }
  class Class1{
    static void Main(string[] args){
      System.Console.WriteLine("starting struct loop...");
      for(int i = 0; i < 50000000; i++)
      {foo test = new foo(3.14);}
      System.Console.WriteLine("struct loop complete. starting object loop...");
      for(int i = 0; i < 50000000; i++)
      {bar test2 = new bar(3.14); }
      System.Console.WriteLine("All done");
    }
  }
}
When you run this example, you'll see that the struct loop is orders of magnitude faster. However, beware of using ValueTypes when you treat them like objects. This adds extra boxing and unboxing overhead to your program, and can end up costing you more than it would if you had stuck with objects! To see this in action, modify the code above to store the foos and bars in a collection that holds object references, such as an ArrayList or a Hashtable. You'll find that the performance is more or less equal.
Tradeoffs   ValueTypes are far less flexible than Objects, and end up hurting performance if used incorrectly. You need to be very careful about when and how you use them.
A single boxing and unboxing operation per item is enough to make the speed gain disappear. Note that a typed array such as foo[] stores the structs inline and does not box them, so it will not show this effect.
You can keep track of how heavily you box and unbox by looking at GC allocations and collections. This can be done using either Perfmon externally or Performance Counters in your code.
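To make the boxing cost concrete, here is a minimal sketch that stores the same struct in an ArrayList (which boxes) and in a generic List<T> (which does not). It assumes a framework version with generics; the Sample struct and method names are hypothetical.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public struct Sample
{
    public double Y;
    public Sample(double y) { Y = y; }
}

public static class BoxingDemo
{
    // ArrayList stores object references, so each struct is boxed on Add
    // and unboxed by the cast on the way out.
    public static double SumBoxed(int count)
    {
        ArrayList list = new ArrayList(count);
        for (int i = 0; i < count; i++) list.Add(new Sample(1.0));
        double sum = 0;
        foreach (object o in list) sum += ((Sample)o).Y;
        return sum;
    }

    // List<Sample> stores the structs inline, so no boxing takes place.
    public static double SumUnboxed(int count)
    {
        List<Sample> list = new List<Sample>(count);
        for (int i = 0; i < count; i++) list.Add(new Sample(1.0));
        double sum = 0;
        foreach (Sample s in list) sum += s.Y;
        return sum;
    }

    public static void Main()
    {
        Console.WriteLine(SumBoxed(1000));
        Console.WriteLine(SumUnboxed(1000));
    }
}
```

Both methods compute the same sum; the difference shows up in the allocation counters, since the boxed version creates one heap object per element.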

Use AddRange to Add Groups

Use AddRange to add a whole collection, rather than adding each item in the collection iteratively. Nearly all Windows controls and collections have both Add and AddRange methods, and each is optimized for a different purpose. Add is useful for adding a single item, whereas AddRange has some extra overhead but wins out when adding multiple items. Here are just a few of the classes that support Add and AddRange:
  • StringCollection, TraceCollection, etc.
  • HttpWebRequest
  • UserControl
  • ColumnHeader
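Here is a small sketch of the two approaches using ArrayList, which is not in the list above but offers the same Add and AddRange pair. The helper names are hypothetical.

```csharp
using System;
using System.Collections;

public static class AddRangeDemo
{
    // One call per item: the list may have to grow its backing store several times.
    public static ArrayList AddOneByOne(int[] items)
    {
        ArrayList list = new ArrayList();
        foreach (int item in items) list.Add(item);
        return list;
    }

    // One call for the whole group: the list can see the incoming count up front.
    public static ArrayList AddAllAtOnce(int[] items)
    {
        ArrayList list = new ArrayList();
        list.AddRange(items);
        return list;
    }

    public static void Main()
    {
        int[] items = new int[10000];
        for (int i = 0; i < items.Length; i++) items[i] = i;

        Console.WriteLine(AddOneByOne(items).Count);  // 10000
        Console.WriteLine(AddAllAtOnce(items).Count); // 10000
    }
}
```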

Trim Your Working Set

Minimize the number of assemblies you use to keep your working set small. If you load an entire assembly just to use one method, you're paying a tremendous cost for very little benefit. See if you can duplicate that method's functionality using code that you already have loaded.
Keeping track of your working set is difficult, and could probably be the subject of an entire paper. Here are some tips to help you out:
  • Use vadump.exe to track your working set. This is discussed in another white paper covering various tools for the managed environment.
  • Look at Perfmon or Performance Counters. They can give you detailed feedback about the number of classes that you load, or the number of methods that get JITed. You can get readouts for how much time you spend in the loader, or what percent of your execution time is spent paging.
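For a quick look from inside the process, the sketch below reads the number of loaded assemblies and the current working set. It is a rough complement to the tools above, not a replacement for them, and it assumes a framework version where Process.WorkingSet64 is available. The class and method names are hypothetical.

```csharp
using System;
using System.Diagnostics;

public static class WorkingSetDemo
{
    // Number of assemblies currently loaded into this AppDomain.
    public static int LoadedAssemblyCount()
    {
        return AppDomain.CurrentDomain.GetAssemblies().Length;
    }

    // Physical memory, in bytes, currently mapped to this process.
    public static long WorkingSetBytes()
    {
        using (Process current = Process.GetCurrentProcess())
        {
            return current.WorkingSet64;
        }
    }

    public static void Main()
    {
        Console.WriteLine("assemblies loaded: " + LoadedAssemblyCount());
        Console.WriteLine("working set (bytes): " + WorkingSetBytes());
    }
}
```

Printing these values before and after the first use of a library gives a rough idea of what loading it costs.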

Use For Loops for String Iteration—version 1

In C#, the foreach keyword allows you to walk across items in a list, string, etc. and perform operations on each item. This is a very powerful tool, since it acts as a general-purpose enumerator over many types. The tradeoff for this generalization is speed, and if you rely heavily on string iteration you should use a For loop instead. Since strings are simple character arrays, they can be walked using much less overhead than other structures. The JIT is smart enough (in many cases) to optimize away bounds-checking and other things inside a For loop, but is prohibited from doing this on foreach walks. The end result is that in version 1, a For loop on strings is up to five times faster than using foreach. This will change in future versions, but for version 1 this is a definite way to increase performance.
Here's a simple test method to demonstrate the difference in speed. Try running it, then removing the For loop and uncommenting the foreach statement. On my machine, the For loop took about a second, with about 3 seconds for the foreach statement.
public class StringIteration {
  public static void Main(string[] args) {
    string s = "monkeys!";
    int dummy = 0;

    // Build a long string (about 8 million characters) to walk.
    System.Text.StringBuilder sb = new System.Text.StringBuilder(s);
    for(int i = 0; i < 1000000; i++)
      sb.Append(s);
    s = sb.ToString();
    //foreach (char c in s) dummy++;
    for (int i = 0; i < s.Length; i++)
      { char c = s[i]; dummy++; }
    return;
  }
}
Tradeoffs   Foreach is far more readable, and in the future it will become as fast as a For loop for special cases like strings. Unless string manipulation is a real performance hog for you, the slightly messier code may not be worth it.

Use StringBuilder for Complex String Manipulation

When a string is modified, the run time will create a new string and return it, leaving the original to be garbage collected. Most of the time this is a fast and simple way to do it, but when a string is being modified repeatedly it begins to be a burden on performance: all of those allocations eventually get expensive. Here's a simple example of a program that appends to a string 50,000 times, followed by one that uses a StringBuilder object to modify the string in place. The StringBuilder code is much faster, and if you run them it becomes immediately obvious.
namespace ConsoleApplication1.Feedback{
  using System;
  
  public class Feedback{
    public Feedback(){
      text = "You have ordered: \n";
    }

    public string text;

    public static int Main(string[] args) {
      Feedback test = new Feedback();
      String str = test.text;
      for(int i=0;i<50000;i++){
        str = str + "blue_toothbrush";
      }
      System.Console.Out.WriteLine("done");
      return 0;
    }
  }
}
namespace ConsoleApplication1.Feedback{
  using System;

  public class Feedback{
    public Feedback(){
      text = "You have ordered: \n";
    }

    public string text;

    public static int Main(string[] args) {
      Feedback test = new Feedback();
      System.Text.StringBuilder SB = 
        new System.Text.StringBuilder(test.text);
      for(int i=0;i<50000;i++){
        SB.Append("blue_toothbrush");
      }
      System.Console.Out.WriteLine("done");
      return 0;
    }
  }
}
Try looking at Perfmon to see how much time is saved without allocating thousands of strings. Look at the "% time in GC" counter under the .NET CLR Memory list. You can also track the number of allocations you save, as well as collection statistics.
Tradeoffs   There is some overhead associated with creating a StringBuilder object, both in time and memory. On a machine with fast memory, a StringBuilder becomes worthwhile if you're doing about five operations. As a rule of thumb, I would say 10 or more string operations is a justification for the overhead on any machine, even a slower one.

Precompile Windows Forms Applications

Methods are JITed when they are first used, which means that you pay a larger startup penalty if your application does a lot of method calling during startup. Windows Forms use a lot of shared libraries in the OS, and the overhead in starting them can be much higher than other kinds of applications. While not always the case, precompiling Windows Forms applications usually results in a performance win. In other scenarios it's usually best to let the JIT take care of it, but if you are a Windows Forms developer you might want to take a look.
Microsoft allows you to precompile an application by calling ngen.exe. You can choose to run ngen.exe during install time or before you distribute your application. It definitely makes the most sense to run ngen.exe during install time, since you can make sure that the application is optimized for the machine on which it is being installed. If you run ngen.exe before you ship the program, you limit the optimizations to the ones available on your machine. To give you an idea of how much precompiling can help, I've run an informal test on my machine. Below are the cold startup times for ShowFormComplex, a winforms application with roughly a hundred controls.
Code State                                            Time
Framework JITed, ShowFormComplex JITed                3.4 sec
Framework Precompiled, ShowFormComplex JITed          2.5 sec
Framework Precompiled, ShowFormComplex Precompiled    2.1 sec
Each test was performed after a reboot. As you can see, Windows Forms applications use a lot of methods up front, making it a substantial performance win to precompile.

Use Jagged Arrays—Version 1

The v1 JIT optimizes jagged arrays (simply 'arrays-of-arrays') more efficiently than rectangular arrays, and the difference is quite noticeable. Here is a table demonstrating the performance gain resulting from using jagged arrays in place of rectangular ones in both C# and Visual Basic (higher numbers are better):
Benchmark                     C#       Visual Basic 7
Assignment (jagged)           14.16    12.24
Assignment (rectangular)       8.37     8.62
Neural Net (jagged)            4.48     4.58
Neural Net (rectangular)       3.00     3.13
Numeric Sort (jagged)          4.88     5.07
Numeric Sort (rectangular)     2.05     2.06
The assignment benchmark is a simple assignment algorithm, adapted from the step-by-step guide found in Quantitative Decision Making for Business (Gordon, Pressman, and Cohn; Prentice-Hall; out of print). The neural net test runs a series of patterns over a small neural network, and the numeric sort is self-explanatory. Taken together, these benchmarks represent a good indication of real-world performance.
As you can see, using jagged arrays can result in fairly dramatic performance increases. The optimizations made to jagged arrays will be added to future versions of the JIT, but for v1 you can save yourself a lot of time by using jagged arrays.
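For readers unfamiliar with the two shapes, here is a minimal sketch that fills and sums the same grid as a rectangular array and as a jagged array. It shows only the syntax difference and makes no timing claims; the class and method names are hypothetical.

```csharp
using System;

public static class ArrayShapes
{
    // Rectangular array: one block of memory indexed with [row, column].
    public static int SumRectangular(int rows, int cols)
    {
        int[,] grid = new int[rows, cols];
        int sum = 0;
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
            {
                grid[r, c] = r + c;
                sum += grid[r, c];
            }
        return sum;
    }

    // Jagged array: an array of row arrays indexed with [row][column].
    public static int SumJagged(int rows, int cols)
    {
        int[][] grid = new int[rows][];
        int sum = 0;
        for (int r = 0; r < rows; r++)
        {
            grid[r] = new int[cols]; // each row is allocated separately
            for (int c = 0; c < cols; c++)
            {
                grid[r][c] = r + c;
                sum += grid[r][c];
            }
        }
        return sum;
    }

    public static void Main()
    {
        Console.WriteLine(SumRectangular(3, 4)); // 30
        Console.WriteLine(SumJagged(3, 4));      // 30
    }
}
```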

Keep IO Buffer Size Between 4KB and 8KB

For nearly every application, a buffer between 4KB and 8KB will give you the maximum performance. For very specific instances, you may be able to get an improvement from a larger buffer (loading large images of a predictable size, for example), but in 99.99% of cases it will only waste memory. All buffers derived from BufferedStream allow you to set the size to anything you want, but in most cases a size between 4KB and 8KB will give you the best performance.
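As a sketch of setting the buffer size explicitly, the example below wraps a source stream in a BufferedStream with an 8KB buffer and copies it in 8KB chunks. MemoryStream stands in for a file or network stream so that the example is self-contained; the helper name and the choice of 8KB are assumptions made for illustration.

```csharp
using System;
using System.IO;

public static class BufferDemo
{
    public const int BufferSize = 8 * 1024; // 8KB, the upper end of the suggested range

    // Copies source to destination through a BufferedStream of BufferSize bytes
    // and returns the number of bytes copied. Disposing the BufferedStream also
    // closes the source stream.
    public static long CopyThroughBuffer(Stream source, Stream destination)
    {
        byte[] chunk = new byte[BufferSize];
        long total = 0;
        using (BufferedStream buffered = new BufferedStream(source, BufferSize))
        {
            int read;
            while ((read = buffered.Read(chunk, 0, chunk.Length)) > 0)
            {
                destination.Write(chunk, 0, read);
                total += read;
            }
        }
        return total;
    }

    public static void Main()
    {
        byte[] payload = new byte[100000];
        using (MemoryStream destination = new MemoryStream())
        {
            long copied = CopyThroughBuffer(new MemoryStream(payload), destination);
            Console.WriteLine(copied); // 100000
        }
    }
}
```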

Be on the Lookout for Asynchronous IO Opportunities

In rare cases, you may be able to benefit from Asynchronous IO. One example might be downloading and decompressing a series of files: you can read the bits in from one stream, decode them on the CPU and write them out to another. It takes a lot of effort to use Asynchronous IO effectively, and it can result in a performance loss if it's not done right. The advantage is that when applied correctly, Asynchronous IO can give you as much as ten times the performance.
An excellent example of a program using Asynchronous IO is available on the MSDN Library.
  • One thing to note is that there is a small security overhead for asynchronous calls: upon invoking an async call, the security state of the caller's stack is captured and transferred to the thread that will actually execute the request. This may not be a concern if the callback executes a lot of code, or if async calls aren't used excessively.
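The read, decode and write pipeline described above can be sketched with the task-based stream methods. This assumes a framework version that provides Stream.ReadAsync and async/await, which is newer than the version this article was written for. The bit-flipping loop is a hypothetical stand-in for real CPU work such as decompression.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class AsyncPipelineDemo
{
    // Reads the next chunk while the current chunk is being processed and written.
    public static async Task<long> ProcessAsync(Stream source, Stream destination)
    {
        byte[] current = new byte[8192];
        byte[] next = new byte[8192];
        long total = 0;

        int read = await source.ReadAsync(current, 0, current.Length);
        while (read > 0)
        {
            // Start the next read before touching the data already in hand.
            Task<int> pendingRead = source.ReadAsync(next, 0, next.Length);

            // Stand-in for CPU work (hypothetical transform: invert every byte).
            for (int i = 0; i < read; i++) current[i] = (byte)(current[i] ^ 0xFF);

            await destination.WriteAsync(current, 0, read);
            total += read;

            // Wait for the overlapped read, then swap the two buffers.
            read = await pendingRead;
            byte[] swap = current;
            current = next;
            next = swap;
        }
        return total;
    }

    public static void Main()
    {
        byte[] payload = new byte[20000];
        using (MemoryStream source = new MemoryStream(payload))
        using (MemoryStream destination = new MemoryStream())
        {
            long processed = ProcessAsync(source, destination).GetAwaiter().GetResult();
            Console.WriteLine(processed); // 20000
        }
    }
}
```

With in-memory streams the reads complete immediately, so there is nothing to overlap; the structure pays off only when the source is a slow device such as a disk or a network connection.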
Source MSDN
