Convert Hebrew Unicode text to old EBCDIC (code page 803)
Foreword:
We .NET programmers don't deal much with the various encodings. We have the Encoding class in the Framework, which solves all our issues... for the most part, but not always. There are some very old and rarely used encodings which are not natively supported by the .NET Framework, and because they are so rare there may be little or no information about them in the public domain. It seems like a trivial task, but you google it and don't find a solution.
This happened to me a couple of months ago: I needed to convert normal Unicode strings to EBCDIC (the encoding used on mainframes). The problem was that, for some reason, the mainframes in my organization use an old code page for Hebrew, code page 803, which unlike the newer code page 424 is not supported by the Framework. Remapping characters from one code page to the other was not such a big deal, though there was an issue with mapping 'alef'. I relied on this resource:
http://www.tachyonsoft.com/cp00803.htm
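To illustrate what the remapping amounts to, here is a minimal sketch that translates bytes between the two code pages with 256-entry lookup tables. The class name CodePageRemapSketch and its methods are made up for this example, and only four of the 27 letter pairs from the table in the listing further down are included; every byte without a pair is passed through unchanged.

```csharp
using System;

static class CodePageRemapSketch
{
    // A few { old code page byte, code page 424 byte } pairs taken from the
    // table in this post. The full table has 27 rows; this subset is only
    // for illustration.
    private static readonly byte[,] Pairs =
    {
        { 121, 65 },  // alef
        { 129, 66 },  // bet
        { 130, 67 },  // gimel
        { 169, 113 }  // tav
    };

    // Lookup tables indexed by the source byte value.
    private static readonly byte[] NewToOld = BuildTable(1, 0);
    private static readonly byte[] OldToNew = BuildTable(0, 1);

    private static byte[] BuildTable(int fromColumn, int toColumn)
    {
        byte[] table = new byte[256];
        for (int i = 0; i < 256; i++)
        {
            table[i] = (byte)i; // bytes without a pair keep their value
        }
        for (int row = 0; row < Pairs.GetLength(0); row++)
        {
            table[Pairs[row, fromColumn]] = Pairs[row, toColumn];
        }
        return table;
    }

    private static byte[] Translate(byte[] input, byte[] table)
    {
        byte[] output = new byte[input.Length];
        for (int i = 0; i < input.Length; i++)
        {
            output[i] = table[input[i]];
        }
        return output;
    }

    // Code page 424 bytes -> old code page bytes.
    public static byte[] ToOld(byte[] cp424Bytes)
    {
        return Translate(cp424Bytes, NewToOld);
    }

    // Old code page bytes -> code page 424 bytes.
    public static byte[] ToNew(byte[] oldBytes)
    {
        return Translate(oldBytes, OldToNew);
    }
}
```

With these pairs, ToOld turns the bytes 65, 66 into 121, 129 and leaves a byte such as 64 (the EBCDIC space) as it is; ToNew reverses that.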
But the biggest problem is not the mapping; the biggest problem is combining a right-to-left (RTL) language with a left-to-right (LTR) language within one string. To save RTL text on the mainframe (MF) one needs to reverse the order of the characters (I'm not really sure why, but I was told this is in order to support printing from the terminal and so that the text is rendered properly in MF emulators), while LTR text should not be reversed. So once you have combined text like:
יונדאי יוצאת במבצע תחת הסלוגן " I WANT" בו יוענקו הנחות והטבות שונות לכל דגמי יונדאי 2011.
in order to reverse only the Hebrew but not the English, numbers or special characters, you need at least to recognize Hebrew text as such. Naturally, reading from the MF requires the opposite process.
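To make the idea concrete, here is a minimal sketch (not the code from this post) that works on plain Unicode strings: it reverses the whole string and then reverses each run of Latin letters and digits back, so that those runs read left to right again. The class name MixedDirectionSketch, its method names and the run pattern are assumptions made for illustration, and punctuation handling is deliberately simplified.

```csharp
using System;
using System.Text.RegularExpressions;

static class MixedDirectionSketch
{
    // A run of Latin letters and digits, optionally joined by spaces,
    // dots, commas or percent signs (hypothetical pattern, adjust as needed).
    private static readonly Regex LtrRun =
        new Regex(@"[0-9A-Za-z]+(?:[ .,%]+[0-9A-Za-z]+)*");

    // True if the text contains at least one Hebrew letter (alef to tav).
    public static bool ContainsHebrew(string text)
    {
        return Regex.IsMatch(text, "[\u05D0-\u05EA]");
    }

    // Reverses Hebrew text while keeping Latin/digit runs readable.
    public static string ToStorageOrder(string text)
    {
        if (!ContainsHebrew(text))
        {
            return text; // pure LTR text is stored as it is
        }

        // Step 1: reverse everything.
        char[] chars = text.ToCharArray();
        Array.Reverse(chars);
        string reversed = new string(chars);

        // Step 2: reverse each LTR run back into reading order.
        return LtrRun.Replace(reversed, m =>
        {
            char[] run = m.Value.ToCharArray();
            Array.Reverse(run);
            return new string(run);
        });
    }
}
```

For a Hebrew word followed by "ABC 123", ToStorageOrder returns "ABC 123" followed by the Hebrew word with its letters in reverse order. Applying the same two steps again restores the original, which is the "opposite process" needed when reading.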
To make a long story short, take a look at the code I came up with in the end. It works for English, Hebrew and combined (English inside Hebrew) text.
WARNING: this code is not absolutely generic, as it was developed for the humble needs of our department. We usually work with rather short strings, no longer than a couple of sentences, so I allowed myself to use regular expressions, as they won't seriously affect our applications' performance; for longer strings you might need to optimize, otherwise it might be nastily slow. The good part is that this code doesn't use any third-party components, which could be a problem in financial organizations (where MFs are used); it uses only the most basic libraries from the .NET Framework.
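There are basically two public functions:
ConvertToOldEbcdic - converts a Unicode string to an EBCDIC encoded string
ConvertFromEbcdic - converts an EBCDIC encoded string to a Unicode string
All other functions are helpers used by the two above.
The code is provided "AS IS"; you may use it at your own responsibility only.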
----------------------------------
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

namespace SomeNameSpace
{
    class EbcdicAdapter
    {
        /// <summary>
        /// Converts a Unicode string to an old-EBCDIC encoded string.
        /// </summary>
        /// <param name="normalString">Unicode text: English, Hebrew or both.</param>
        /// <returns>The converted string, trimmed.</returns>
        public static string ConvertToOldEbcdic(string normalString)
        {
            // Brackets only need to be mirrored if the text includes Hebrew.
            if (Regex.IsMatch(normalString, "([א-ת])"))
            {
                normalString = SwapBrackets(normalString);
            }

            // Lowercase English letters cannot be handled correctly: in the
            // mapping table the old Hebrew bytes sit on the EBCDIC lowercase
            // letter positions.
            string inputString = normalString.ToUpperInvariant();

            // Encode with code page 424, then remap each byte to the old code page.
            Encoding ebcEnc = Encoding.GetEncoding(20424);
            byte[] outBytes = ebcEnc.GetBytes(inputString);
            for (int i = 0; i < outBytes.Length; i++)
            {
                outBytes[i] = AdaptToOldEbcdic(outBytes[i]);
            }

            // Reverse everything, then put the English and number runs back in order.
            Array.Reverse(outBytes);
            string oldEbcdic = ebcEnc.GetString(outBytes);
            oldEbcdic = ArrangeEnHeString(oldEbcdic);
            return oldEbcdic.Trim();
        }
        /// <summary>
        /// Converts an old-EBCDIC encoded string to a Unicode string.
        /// </summary>
        /// <param name="oldEbcdicString">Text as read from the mainframe, e.g. "wipfx ilhp xear zixara my".</param>
        /// <returns>The converted Unicode string, trimmed.</returns>
        public static string ConvertFromEbcdic(string oldEbcdicString)
        {
            // Encode with code page 424, then remap each old byte to its 424 value.
            Encoding ebcEnc = Encoding.GetEncoding(20424);
            byte[] outBytes = ebcEnc.GetBytes(oldEbcdicString);
            for (int i = 0; i < outBytes.Length; i++)
            {
                outBytes[i] = AdaptToNewEbcdic(outBytes[i]);
            }

            Array.Reverse(outBytes);
            string outputString = ebcEnc.GetString(outBytes);

            if (Regex.IsMatch(outputString, "([א-ת])"))
            {
                // Includes Hebrew: restore the English runs and mirror the brackets.
                outputString = ArrangeEnHeString(outputString);
                outputString = SwapBrackets(outputString);
            }
            else
            {
                // No Hebrew: just undo the reversal of the whole string.
                char[] chars = outputString.Trim().ToCharArray();
                Array.Reverse(chars);
                outputString = new string(chars);
            }
            return outputString.Trim();
        }
        /// <summary>
        /// Reverses only the English text within the Hebrew text.
        /// Digits, whitespace and the characters . , % are treated as English text;
        /// any other punctuation is treated as Hebrew text.
        /// </summary>
        /// <param name="inputStr">e.g. "זה טקסט לבדיקה עם ENGLISH ומספרים 123456"</param>
        /// <returns>The input with every English/number run reversed in place.</returns>
        private static string ArrangeEnHeString(string inputStr)
        {
            // English letters, digits, whitespace and . , %
            // This pattern is what you might want to adjust to your needs.
            string pattern = @"[0-9A-Z\s.,%]+";

            // Each match is rewritten where it was found, so a run that occurs
            // more than once in the text cannot disturb the other occurrences.
            return Regex.Replace(inputStr, pattern, delegate(Match match)
            {
                string value = match.Value;
                string core = value.Trim();
                if (core.Length == 0)
                {
                    return value; // whitespace only: nothing to reverse
                }

                // Retain leading and trailing whitespace, reverse only the core.
                int lead = value.Length - value.TrimStart().Length;
                char[] chars = core.ToCharArray();
                Array.Reverse(chars);
                return value.Substring(0, lead) + new string(chars) + value.Substring(lead + core.Length);
            });
        }
        /// <summary>
        /// Replaces every bracket with its mirror image: ( with ), [ with ], { with }.
        /// </summary>
        public static string SwapBrackets(string outputString)
        {
            char[] chars = outputString.ToCharArray();
            for (int i = 0; i < chars.Length; i++)
            {
                switch (chars[i])
                {
                    case ')': chars[i] = '('; break;
                    case '(': chars[i] = ')'; break;
                    case '[': chars[i] = ']'; break;
                    case ']': chars[i] = '['; break;
                    case '}': chars[i] = '{'; break;
                    case '{': chars[i] = '}'; break;
                }
            }
            return new string(chars);
        }
        /// <summary>
        /// Maps a byte of the old code page to its code page 424 value.
        /// </summary>
        private static byte AdaptToNewEbcdic(byte oldEbcdic)
        {
            byte[,] dictionary = GetEbcdicDictionary();
            for (int i = 0; i < dictionary.GetLength(0); i++)
            {
                if (dictionary[i, 0] == oldEbcdic)
                {
                    return dictionary[i, 1];
                }
            }
            return oldEbcdic; // not a Hebrew letter: keep the existing mapping
        }

        /// <summary>
        /// Maps a code page 424 byte to its value in the old code page.
        /// </summary>
        private static byte AdaptToOldEbcdic(byte newEbcdic)
        {
            byte[,] dictionary = GetEbcdicDictionary();
            for (int i = 0; i < dictionary.GetLength(0); i++)
            {
                if (dictionary[i, 1] == newEbcdic)
                {
                    return dictionary[i, 0];
                }
            }
            return newEbcdic; // not a Hebrew letter: keep the existing mapping
        }
        /// <summary>
        /// Maps Hebrew letters between the old code page (supposedly 803) and code page 424.
        /// Each row is { old byte, code page 424 byte }.
        /// All letters except for א fit the 803 code page.
        /// </summary>
        /// <returns>A 27 x 2 table, one row per Hebrew letter.</returns>
        private static byte[,] GetEbcdicDictionary()
        {
            byte[,] dictionary = new byte[,] {
                {121, 65},  // א // different from 803 code page
                {129, 66},  // ב
                {130, 67},  // ג
                {131, 68},  // ד
                {132, 69},  // ה
                {133, 70},  // ו
                {134, 71},  // ז
                {135, 72},  // ח
                {136, 73},  // ט
                {137, 81},  // י
                {145, 82},  // ך
                {146, 83},  // כ
                {147, 84},  // ל
                {148, 85},  // ם
                {149, 86},  // מ
                {150, 87},  // ן
                {151, 88},  // נ
                {152, 89},  // ס
                {153, 98},  // ע
                {162, 99},  // ף
                {163, 100}, // פ
                {164, 101}, // ץ
                {165, 102}, // צ
                {166, 103}, // ק
                {167, 104}, // ר
                {168, 105}, // ש
                {169, 113}  // ת
            };
            return dictionary;
        }
    }
}
-------------------------------------------------------------
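The warning before the listing suggests optimizing for longer strings. One possible direction, shown here only as a sketch, is to replace the regular expression in ArrangeEnHeString with a single pass over a character array. The class LtrRunScanner and its helper names are hypothetical; the sketch assumes the same character classes as the listing (uppercase A-Z and digits form a run, and whitespace or . , % may join them) and has not been measured against the regex version.

```csharp
using System;

static class LtrRunScanner
{
    // Characters that make up English/number text (uppercase only,
    // as in the listing above).
    private static bool IsLtrChar(char c)
    {
        return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z');
    }

    // Characters allowed inside a run, between two LTR characters.
    private static bool IsJoiner(char c)
    {
        return char.IsWhiteSpace(c) || c == '.' || c == ',' || c == '%';
    }

    // Reverses every English/number run in place, in one pass.
    public static string ReverseLtrRuns(string text)
    {
        char[] chars = text.ToCharArray();
        int i = 0;
        while (i < chars.Length)
        {
            if (!IsLtrChar(chars[i]))
            {
                i++;
                continue;
            }

            int start = i;
            int end = i + 1; // exclusive end, just past the last LTR char seen
            int j = i + 1;
            while (j < chars.Length && (IsLtrChar(chars[j]) || IsJoiner(chars[j])))
            {
                if (IsLtrChar(chars[j]))
                {
                    end = j + 1;
                }
                j++;
            }

            // Trailing joiners stay where they are; only the core is reversed.
            Array.Reverse(chars, start, end - start);
            i = j;
        }
        return new string(chars);
    }
}
```

Unlike the pattern in the listing, this sketch never starts a run on a joiner, so a leading dot or comma is left untouched; whether that difference matters depends on your data.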
Obviously, before I wrote this utility I searched the web for a solution, which I didn't find, but some pieces of code from different sites gave me tips. Unfortunately, I cannot recall where I saw the code that I integrated into my solution. So if you feel that I used a piece of code that you wrote and that you deserve credit, please write me a message and I will gladly add it.