Sneaking a Peek inside the Chart Advisor Add-in

Friday, August 29, 2008

I recently wrote a post about the new Chart Advisor add-in from Microsoft’s Office Labs, and Dave Lee (one of the developers) was very kind to post a couple of comments that added to my understanding of the add-in. If you don’t already have it, I suggest that you download the Chart Advisor and give it a test run. This add-in has the potential to be a big help to those who struggle to choose the correct type of chart to represent their data.

In this post, I’m going to write about a few interesting things that I’ve learned by poking around in the XML files that configure the add-in. Please note that you should not make any changes to the files that are mentioned here, unless you are prepared to live with the consequences. If you are going to make changes, at least make a backup of the original files first.

Warning: A lot of what follows is my speculation based on a cursory reading of the XML files. I may be wrong on the details, or wrong on everything. Still, I thought that you might find this an interesting first look. Also note that Dave Lee advises against changing these files.

Description of Files

There are three XML files that are used with Chart Advisor. You don’t need to know anything about these files to use the add-in, but the curious among us just can’t help but have a look. You can find the files using Windows Explorer (or in any editor that has a File Open dialog box) by pasting the following into the address bar:

%AppData%Chart Advisor from Office Labs

That will take you to the directory where you will find three XML files:

  1. ChartAdvisor.config.xml
  2. ChartAdvisor.Rules.xml
  3. ChartAdvisor.schema.xml

The first file contains a small amount of configuration data (e.g., the location of the rules and schema files). The second file contains the rules. Finally, the third file contains the XML schema, which describes the structure of the other XML files.

I am far from an expert on XML, so I can only scratch the surface. In fact, I’m a rank amateur with XML and need to learn a lot more about it (particularly XBRL, which will soon be the required format for companies filing with the SEC), but the files are basically tagged text files. So, they are easy to read, but understanding them requires more effort.

Excel 2007 will open the files and parse them into columns, though this probably isn’t the easiest way to read the files. Instead, I am using the free Notepad++ editor, which knows how to color-code the contents for easier understanding. I’m sure that there are better tools available, but I don’t know enough to recommend one.

The Config File

ChartAdvisor.config.xml contains some basic configuration details. It has the location of the rules file and the location of the schema file, both of which are on your hard drive. It also tells Chart Advisor whether or not to collect usage statistics and show an overview of Chart Advisor (at startup?). Finally, it also has a Minimum Recommendation Score, which I assume is the smallest chart score that will be presented to the user in the gallery. Currently this is 5, which is very low from what I have seen. I’m not sure, but I believe that charts are scored from 0 to 100 (shown as a percentage) with higher scores being more appropriate according to the logic of the add-in.

The Rules File

ChartAdvisor.Rules.xml contains, well, the rules for choosing and scoring the charts. I’ve never written a program that pulls its rules from an XML file (though this seems to be all the rage these days), so I’m not sure exactly how this works. I assume that this file is read by the add-in and then the contents are used to create a bunch of If-Then-Else and/or Select-Case statements. These statements would then be followed to determine the chart scores.

For example, the first rules in the file seem to test whether a series is a category series. If so, then it tests to see if the series is comprised of dates, numbers, boolean values (true or false), or strings. Depending on what it determines, the add-in will then attempt to further classify the data. It performs these tests for each type (or family) of chart. Here is a snippet of the first set of rules (there are more, and you would find the end of this set of rules by finding the </DataMap> tag):

<br />
<!-- 
DataMap Name="LineAreaChart" -->
 <
DataMap Name="LineAreaChart" Description="Excel Line Area Charts"></p>

 <
Rule Series="Category">
   <!-- 
Tests Independent of Series Type -->
   <If 
condition="COLUMN.IsLeftmostSeries">1</If>
   <If 
condition="COLUMN.NullCount == 0">1</If>

    <
p><!-- Nested DateTime Series Tests -->
 <If 
condition="COLUMN.IsDateSeries">
  <If 
condition="COLUMN.IsLinearSeries">
   <
Select>
    <Case 
condition="COLUMN.CountDistinctValues LT 3">0</Case>
    <Case 
condition=&#8220;<span class="caps">COLUMN</span>.CountDistinctValues 
        
<span class="caps">BETWEEN</span> ( 310)&#8221;>7</Case>
    
<Case condition=&#8220;<span class="caps">COLUMN</span>.CountDistinctValues 
        
<span class="caps">BETWEEN</span> (1150)&#8221;>9</Case>
    
<Case condition=&#8220;<span class="caps">COLUMN</span>.CountDistinctValues 
        
<span class="caps">BETWEEN</span> (51,100)&#8221;>8</Case>
    
<Default>6</Default>
   </
Select>
   </If>
   <If 
condition=&#8220;<span class="caps">NOT</span> <span class="caps">COLUMN</span>.IsLinearSeries <span class="caps">AND</span> 
        
<span class="caps">COLUMN</span>.IsSortedSeries&#8221;>-4</If>
   
<If condition=&#8220;<span class="caps">NOT</span> <span class="caps">COLUMN</span>.IsLinearSeries <span class="caps">AND</span>
        
<span class="caps">NOT</span> <span class="caps">COLUMN</span>.IsSortedSeries&#8221;>0</If>
  
</If><br /> 

If I understand this correctly, and please correct me if I am wrong, then this is trying to determine a score for a Line or Area chart. It first looks to see if the series is in the leftmost column and if there are any Null values. Then it tests to see if the data in the series are dates. If it does contain dates, then if checks to see if it is a linear series. If so, it then assigns "points" based on the number of entries in the series (e.g., less than 3 gets no points, between 3 and 10 gets 7 points, and so on).

If the series contains dates, but they are not linear, and they are sorted then it gets -4 points. If it contains dates that are not linear and not sorted, then it gets 0 points.

I don’t understand the rules well enough to really get the logic, but that is roughly what I think that snippet is doing. I’ll leave it to others to actually work out all of the logic. This post is just intended to give you an idea of what the rules look like.

The Schema File

As noted above, ChartAdvisor.schema.xml contains a description of the elements and data types that can appear in the rules file. For example, a Line chart is defined by the following elements:

<br />
<
Chart Name=&#8220;Excel Line Chart&#8221; 
        
Description=&#8220;Summary Line Chart Rules&#8221; ScoreWeight=&#8220;1.05&#8221;><br />
<Element Name="XlChartType">xlLineMarkers</Element><br />
<
Element Name="ChartStyle">26</Element><br />
<
SeriesDefinitions>
  <!&
#8212; Category is optional because auto-generated 
        
category will be linear &#8212;>
  
<Series Name="Category" Optional="True" Multiple="False" />
  <
Series Name="ValueSeries" Optional="False" Multiple="True" /><br />
<!-- <
Series Name="SeriesDefinition" Optional="True" Multiple="False" /> &#8212;><br />
</SeriesDefinitions><br />
</
Chart><br /> 

Here you can see that the chart type is xlLineMarkers (a Line chart with markers) and is style number 26. It has an optional category series (X-axis labels), but one or more value series are required. Note that anything between <!— —> tags is a comment.

Popularity Scores

One of the things that I initially found interesting in the chart scoring algorithm was the "popularity score." In the snippet above, you can see that a Line chart is assigned a popularity score (ScoreWeight) of 1.05. I believe that this means that Line charts are more popular than average. I am assuming that this is an index, with 1.00 being exactly average. Apparently, these scores are based on Microsoft Research data.

According to my reading of the schema, here are the popularity scores for each of the chart types:

Chart Type Popularity Score
Clustered Column 1.10
Clustered Bar 1.10
Line 1.05
XY Scatter 1.05
Pie 1.00
Stacked Column 0.95
Stacked Bar 0.95
Area 0.90
Bubble 0.85
Doughnut 0.85
Radar 0.80
Surface 0.75
100% Stacked Column 0.50
100% Stacked Bar 0.50
Stock Chart (HLC) 0.10
Animated Bubble N/A

Hopefully, that list of popularity scores will be helpful to somebody. Interestingly, the Animated Bubble is listed but not given a score. As far as I know, Excel doesn’t currently offer any animated chart types. However, Jon Peltier has created one in his post Gapminder for Excel.

I hope that you have found this look at Chart Advisor interesting and useful. Be aware, though, that this post contains a lot of speculation on my part based on my viewing of the XML files. I may very well have made some mistakes. I hope that you will correct me where I’ve gone wrong. I’ll make corrections as necessary, and may remove the post entirely if I have screwed up badly.

Posted by .(JavaScript must be enabled to view this email address) on 08/29 at 02:04 PM

Permalink Tell-a-Friend

Comments

Tim -

A good first assessment of the rules, especially for someone new to XML. What makes it particularly difficult to try to edit the XML files is not knowing how the Advisor uses the XML in its internal algorithms.

But for a simpler advisor, where you leave out less effective chart types (radar, donut, pie), you could probably also come up with a simpler schema. Particularly for specific applications with more narrowly defined data. Doing this on any arbitrary freeform data is very tricky.


Posted by Jon Peltier on 08/30 at 08:00 AM | #

Jon,

Thanks. I think that you have a very good idea there. If they had started with just the major chart types (say, those with a popularity score of 1.00 or higher) then it would certainly have been easier.

BTW, did you notice that Pie charts have a popularity score of 1.00? I would have guessed that it was higher. Anyway, I almost put a link on Pie in the table. wink


Posted by Tim on 08/30 at 11:31 AM | #

The idea of a “popularity” score is a little scary. Not knowing where it’s derived from, I’d be afraid it’s the popularity of charts among people who most need this Advisor.

I would include many fewer chart types than the Advisor: XY, Line, Bar, Column, Area. I’d keep all 3D charts out of the options, as well as radars, pies, and donuts. I’d offer the choice of stacked bar and column, but clustered would appear higher in the preferences.

The included data would drive most of the options. Numerical X and Y leads to XY. Categorical X leads to the other types. Date X leads to the others as well, unless the values are date-time values, in which case we revert to XY.

The included data would also drive such options as log scales. No secondary axes. No combo charts, either, because that’s too case-specific and advanced a decision than at least the first generation of Advisor should be expected to handle.


Posted by Jon Peltier on 08/30 at 12:58 PM | #

Jon, they really should have hired you to work at the Labs. Those are great ideas. Maybe somebody will write a VBA add-in that works that way.

I’m going to try a test later to see if I can give a 0 popularity score to a chart type. I think that will make it so that it doesn’t show up in the recommendations. I’ll update the post after the test.


Posted by Tim on 08/30 at 01:13 PM | #

Ok, I tried editing the schema file (where I got the popularity scores) with no luck. I changed the score for bubble charts to 0, but I’m still getting bubble charts with high scores and normal popularity multipliers in the results.

Interestingly, it seems that at least once the score got changed back to 0.85. Either I am seeing things, or Chart Advisor is smart enough to reset it, or Vista’s file system isn’t actually saving the change. It just happened again, even though I ran my editor in Administrator mode.

For now, I think I’ll just follow Dave’s advice and not muck around in these files.


Posted by Tim on 08/30 at 02:18 PM | #

Didn’t Dave say they overwrite the file when they start up? I interpreted that to mean it started out compressed in the utility, and when the utility was launched, it deleted the existing rules file and recreated the original one.

Dave - if you’re watching, could you clarify your earlier comment?


Posted by Jon Peltier on 08/30 at 07:22 PM | #

Yes, he did say that:

“btw, I wouldn’t spend time trying to change the rules file—the syntax is fairly rigid and modifications by hand can cause it to be easily broken, so we overwrite the file when we start up.

He seemed to be talking specifically about the rules file, and I was modifying the schema. I searched the rules file for “ScoreWeight” (which is in the schema) and couldn’t find it. So, I assumed that it was pulling that multiplier from the schema.

You might be right that the add-in has a copy of the files within it, but why would they also include them externally? Just a backup?


Posted by Tim on 08/30 at 08:14 PM | #

I don’t know how the Advisor deals with its XML files. I took a comment by Dave and developed a whole imaginary world about it, which may or may not resemble the real world.


Posted by Jon Peltier on 08/31 at 06:24 AM | #

Commenting is not available in this channel entry.