Sketch of Me Excel Blog @ TVMCalcs.com

Testing a String for Proper Case

Posted on March 08, 2008

Have you ever needed to check a string in a worksheet cell to see if it is in proper case? By "proper case" I mean that the first letter of each word in the string is capitalized and the other characters are lowercase. For example, you may have a list of people’s names (maybe for a mailing list) and you need to be sure that the names are written properly (e.g., John Smith, not john Smith or John sMith). It seems that this should be easy in Excel, but it isn’t obvious (at least there is no easy, built-in function that you can use to validate the string). In this post I will demonstrate three methods. Two are pretty straightforward, and the third uses an array formula. The array formula is complex, but it is interesting at least, and parts of that solution may have other uses. I also provide a VBA function at the end, so I guess that makes it four ways to solve this problem.

This question arose on Yahoo! Answers where the questioner asked:

Cell A1 has data, I want to create a formula that would look into the cell, and if its case is "Proper", then I get TRUE, otherwise FALSE. any suggestions would be highly appreciated?

The Proper() Function

Now, Excel has a text function that you might expect to work: Proper(). However, the Proper() function takes a string argument and converts it, if necessary, into proper case. As an example, if A1 contains "john smith" (without the quotes) then:

=Proper(A1)

would return "John Smith," exactly what you want. What the function doesn’t do is return a boolean (True/False) result to indicate that the string is in proper case. Furthermore, you cannot directly use the function in a string comparison because comparisons in Excel are case-insensitive. So, using the same example as above, this formula will always return True:

=A1 = Proper(A1)

no matter what you type into A1. So, ="john smith"=Proper("john smith") will be true, even though it really isn’t. Again, Excel ignores the case of the letters in the comparison.

Solution 1: Using the Exact() Function

The simplest method that I have found for this problem is to use the Exact() function. This string function is case-sensitive, so it can be used to test if the letter cases are identical to the proper-case version of the string. Still assuming that A1 contains the string to be tested ("john smith" or some variant with different cases) then:

=EXACT(A1,PROPER(A1))

will always return False if the string is not in proper case. Simple and to the point. Problem solved.

Solution 2: Using the Substitute() Function

A slightly more complex version, with no advantage over using Exact() that I can see, is to use the Substitute() function. This function substitutes new text for existing text in a string. So, we can substitute a short string (say, "x" or just " ") into the proper-case version and then check to see if the function returns the short string:

=SUBSTITUTE(A1,PROPER(A1),"x")="x"

That will also return False if the original string (in A1) is not in the proper case, and True if it is.

Solution 3: An Array Function Solution

Using an array function for this problem is overkill, but I admit that it was the first technique that I thought of. While this is far from the easiest way to solve this particular problem, I think that the process has some valuable lessons. In the text below, you will learn about ASCII character codes, how to "explode" a string into its individual characters (kind of like the PHP Explode function, but we aren’t using a delimiter), and how to generate a dynamic array of consecutive numbers (you could also use that technique to generate many other arrays, such as every other number), and how to compare two arrays to see if they are identical.

The First Step Towards a Solution: The Code() Function

What we really need to do is to compare the ASCII values of the characters (that link has a good explanation of ASCII values and a list of them for each letter). The key is to realize that the ASCII character codes are different for the uppercase and lowercase versions of a letter. So, "J" is 74 while "j" is 106.

The Code() function will return the ASCII character code for a letter, or for the first letter of a string (it ignores all characters after the first). Take a look at the output of the Code() function in the following examples:

=Code("J") returns 74

=Code("j") returns 106

=Code("John Smith") returns 74 because it only sees the capital J at the beginning.

Great! All we need then is a way to compare the ASCII codes of each letter in the string to its proper-case version.

Splitting a String into Its Individual Characters

In order to compare the ASCII values of each character in the string, we need to separate the string into its individual characters. Ideally, we will do this all within a single formula rather than entering each character into its own cell. So, we need to use an array formula. This will require the use of the Mid() function and a dynamic array. I’ve dealt with string functions before in the Dealing with Unusual Date Formats post.

The Mid() function is defined as:

MID(text, start_num, num_chars)

and it extracts the number of characters that you specify (num_chars) from a string (text) starting at the position (start_num) that you specify. For example:

=MID("john",1,1)

will return "j" because that is the first character in the string.

We need the Mid() function to extract each character in the string, not just the first one. If we enter the function above as an array formula (Shift+Ctrl+Enter, not just Enter) it will still return a single "j." So, we need to specify an array for the start_num argument:

=MID("john",{1,2,3,4},1)

That will give us {"j","o","h","n"} as the result. You won’t see that array in your worksheet cell (you’ll just see "j"), but you can check the result after entering the formula by highlighting it in the Formula Bar and pressing F9.

Now we are getting somewhere, but we still have a bit of a journey.

Converting the Characters into their ASCII Codes

Remember that we have to compare the ASCII codes of the characters, not the letters themselves. Again, this is because a string comparison would be case-insensitive. So, we will change our array formula from above so that it also uses the Code() function that was mentioned earlier. We want to feed the output of our Mid() function into the Code() function. Take a look at the results for two of these formulas:

=CODE(MID("john",{1,2,3,4},1)) returns {106,111,104,110}

=CODE(MID("John",{1,2,3,4},1)) returns {74,111,104,110}

Note that we get a different array of ASCII codes from "john" and "John." If the strings were identical, the arrays would be identical. The second version is the proper case version.

We can now make use of the Proper() function and see where this is going:

=CODE(MID("john",{1,2,3,4},1)) returns {106,111,104,110}

=CODE(MID(PROPER("john"),{1,2,3,4},1)) returns {74,111,104,110}

In this case, I used the exact same string ("john"), but in the second version I first converted it to the proper case using the Proper() function. All that we need to do is to compare the arrays from these two versions for any arbitrary string. If they are the same for every character code, then the original string is in proper case.

A Dynamic Array from 1 to the Length of the String

One remaining problem is that the formula above will only work for a four-character-long string, or the first four characters of a string. So, we need to change the {1,2,3,4} part into a dynamic array from 1 to the length of the string. For example, "John Smith" has 10 characters, so we need that array to be {1,2,3,4,5,6,7,8,9,10} if we want to evaluate every character.

I’ve mentioned before that the Row() function can be used to create a dynamic array of sequential numbers, and that is exactly what we need. For example:

=ROW(A1:A4)

when entered as an array formula will return {1,2,3,4}. We could use =ROW(A1:A10) to get the 10 sequential numbers that we would need to evaluate every character in "John Smith". Unfortunately, that formula isn’t dynamic because the range (A1:A10) is hard-coded.

To create a dynamic range, we need to use the Offset() function. So, we can do this to get our 10 sequential numbers:

=ROW(A1:OFFSET(A1,9,1))

That creates a range from A1:A10, which is then fed into the Row() function to produce {1,2,3,4,5,6,7,8,9,10}. Note that we are offsetting from A1 by 9 rows, not 10. If you offset from A1 by 1 row, you get to A2. If you offset by 2 rows, you get to A3. So, we need to use an offset that is 1 less than the last row that we want. Also, note that offsetting from A1 is arbitrary. You could offset from any other column (e.g., B1, C1, etc). However, if you were to offset from A2, then the array would start at 2 instead of 1.

Finally, remember that we really want an array from 1 to the length of the string. So, replace the 9 in the above formula with the Len() function minus 1. For example,

=ROW(A1:OFFSET(A1,LEN("John Smith")-1,1))

will return {1,2,3,4,5,6,7,8,9,10}, which is exactly what we want. If we change the string from "John Smith" to "John" (or anything else) then our array will change. It will be exactly as long as the string. It is a dynamic range.

We are getting close. Let’s plug our dynamic range into our Code(Mid()) formula from above, and assume that the string is in A2:

=CODE(MID(A2,ROW(A1:OFFSET(A1,LEN(A2)-1,0)),1))

returns {106,111,104,110} for "john" in A2. If we change the string to "John Smith" we get {74;111;104;110;32;83;109;105;116;104}. It works automatically because the array is dynamic.

Comparing Two Arrays to Test for Equivalence

The last formula above will give us the array of ASCII character codes for each character in the string in A2. We need to compare that array with one formed by the proper-case version of that same string:

=CODE(MID(PROPER(A2),ROW(A1:OFFSET(A1,LEN(A2)-1,0)),1))

That formula is identical to the previous one, except that we first convert the string in A2 to proper-case using the Proper() function. If the string in A2 is "john smith" (all lowercase characters) then the two formulas would return:

=CODE(MID(A2,ROW(A1:OFFSET(A1,LEN(A2)-1,0)),1)) returns {106;111;104;110;32;115;109;105;116;104}

=CODE(MID(PROPER(A2),ROW(A1:OFFSET(A1,LEN(A2)-1,0)),1)) returns {74;111;104;110;32;83;109;105;116;104}

Notice that the arrays are different in the first and sixth positions, so we know that the string cannot be in proper case. If instead, the string in A2 was "John Smith" then the first version of the formula would return {74;111;104;110;32;83;109;105;116;104}, which is identical in every position to the proper-case result.

So, how do we do this comparison in a formula? If we divide one array by another, Excel will produce a third array that contains the result of dividing the corresponding array positions. In other words, position 1 is divided by position 1, position 2 is divided by position 2, an so on.

If the string in A2 is "john smith" then dividing the two arrays will give us {1.43243243243243;1;1;1;1;1.3855421686747;1;1;1;1}. Note that we have a 1 in every position except for the first and sixth. Since not every position is a 1, we know that the two arrays are not the same.

The formula to do this comparison is:

=CODE(MID(A2,ROW(A1:OFFSET(A1,LEN(A2)-1,0)),1))/ CODE(MID(PROPER(A2), ROW(A1:OFFSET(A1,LEN(A2)-1,0)),1))

The only difference between the numerator and denominator is that the denominator is working with the proper-case of the string. If the original string is in the proper-case, then the result array would be {1;1;1;1;1;1;1;1;1;1}. If it isn’t in the proper case, then the result array would contain one or more positions that are not 1’s.

Wrapping It Up

The seeds of our solution are now sown. We just need to harvest the final answer. Realize that the sum of the numbers in the result array will be equal to the length of the string if, and only if, the string is in proper-case. Using the above examples:

=SUM({1.43243243243243;1;1;1;1;1.3855421686747;1;1;1;1}) returns 10.8179746011071 for the string "john smith"

=SUM({1;1;1;1;1;1;1;1;1;1}) returns 10 for the string "John Smith"

So, all we need to do is to compare the length of the original string to the sum of our result array. If they are the same then the string is in proper-case. So, the formula would be (enter it using Shift+Ctrl+Enter):

=LEN(A2) = SUM(CODE(MID(A2,ROW($A$1:OFFSET($A$1,LEN(A2)-1,0)),1)) / CODE(MID(PROPER(A2),ROW($A$1:OFFSET($A$1,LEN(A2)-1,0)),1)))

For the string "John Smith" in A2 the formula would return TRUE. If the string in A2 was "john smith" it would return FALSE. Any version of the string other than "John Smith" will return FALSE, which is exactly what we set out to do.

That is a pretty ugly formula, but it does the job. If you can’t use VBA (many companies don’t allow it), then you really don’t have a choice but to construct such a beast. Read on for a VBA solution.

Solution 4: A VBA Function to Replace the Worksheet Formula

If you are able to use macros at your company, then VBA provides a simple way to solve the same problem. Here is an IsProper() function that you can use in your workbook:

Public Function IsProper(S As String) As Boolean
Returns True if the string is in the proper case
If StrComp(S, StrConv(S, vbProperCase), vbBinaryCompare) = 0 Then
    IsProper = True
Else
    IsProper = False
End If
End Function

To use the IsProper() function, press Alt-F11 to open the VBA editor. Find your workbook in the project window and right-click it. Insert a Module and then paste in the code. Now, the IsProper() function can be used in your workbook in place of the monstrosity of a worksheet function that we created earlier.

You can download a sample workbook that has both the worksheet formula and the VBA version. The example workbook is a macro-enabled Excel 2007 file, but you can open it in Excel 2003, or earlier, using the Compatibility Pack.


Posted by on March 08, 2008 at 07:28 PM

Categories: Array Formulas Functions Offset String Functions Code left len mid Proper VBA

Comments:

That was very helpful. But can anybody tell me how to change each letter of a particular word to another letter using a code.

For example in the word there will only four letters A, T, G, C and the number of the letters can be anything. It should be converted to T for A, A for T, G for C and C for G.

Posted by Mainak  on  June 23, 2008  at  03:16 AM | #

Name:

Email (will not be displayed):

Location:

URL:

Smileys

Remember my personal information

Notify me of follow-up comments?

Submit the word you see below:


<< Back to main