FilipDominec/wrong_charset_detection.py

FilipDominec · 2022-06-28T15:09:22Z

Example output:

REALLY ENCODED: \  BUT INTERPRETED AS:
                   abbccccccccccccccccccccccccccccccccccccccccceeeeggghhiiiiiiiiiiiiiiiiiiiiiijkkkklmmmmmmmmmmmpprssstuuuuuuuuuu
                   siihppppppppppppppppppppppppppppppppppppppppuuuubbbpzssssssssssssssssssssssoooozaaaaaaaaaaaaatahhhinttttttttt
                   cgga0111111111111124457778888888888888888999cccc12k- oooooooooooooooooooooohiii1tccccccccccclcwiiisifffffffff
                   i55r3001122222222272302375555556666666677345----83 r 2222222888888888888888a8880i-----------mp-fff-c---------
                   i hm7022455555555534700750256780123456945290jjjk01 o 0000000888888888888888b---4nacccfgilrrto1uttt6o111333788
                     ka 6650012345678                          iipr32 m 2222222555555555555555 rtu8-reryarcaoous5n---2d666222  -
                     sp                                        ss  0  a 2222222999999999999999     1anorreetmmr 4ijjj0e -- --  s
                     c                                         -x     n ----------------------      btaiseliaak  ciii - bl bl  i
                     s                                         20     8 jjjjjjk111111123456789      ietlikannni  osss e ee ee  g
                                                               02       ppppppr 013456              cuil  n2 is  d -x s         
                                                               01        -----                       rai  d  ah  e 20 c         
                                                               43        1223e                       onc     n   - 02 a         
                                                                           0 x                                   e 01 p         
                                                                           0 t                                   s 43 e         
                                                                           4                                     c              
                                                                                                                 a              
                                                                                                                 p              
                                                                                                                 e              
                                                                                                                                
ascii               ············································································································
big5               · ···········································································································
big5hkscs          ·· ··········································································································
charmap            ··· ·········································································································
cp037              ···· ········································································································
cp1006             ····· ·······································································································
cp1026             ······ ······································································································
cp1125             ······· ·····································································································
cp1140             ········ ····································································································
cp1250             ········· ···································································································
cp1251             ·········· ··································································································
cp1252             ··········· ·································································································
cp1253             ············ ································································································
cp1254             ············· ·······························································································
cp1255             ·············· ······························································································
cp1256             ··············· ·····························································································
cp1257             ················ ····························································································
cp1258             ················· ···························································································
cp273              ·················· ··························································································
cp424              ··················· ·························································································
cp437              ···················· ························································································
cp500              ····················· ·······················································································
cp720              ······················ ······················································································
cp737              ······················· ·····················································································
cp775              ························ ····················································································
cp850              ························· ···················································································
cp852              ····················X····· ····XXX··X········································································
cp855              ··························· ·················································································
cp856              ···························· ················································································
cp857              ····························· ···············································································
cp858              ······························ ··············································································
cp860              ······························· ·············································································
cp861              ································ ············································································
cp862              ································· ···········································································
cp863              ·································· ··········································································
cp864              ··································· ·········································································
cp865              ···································· ········································································
cp866              ····································· ·······································································
cp869              ······································ ······································································
cp874              ······································· ·····································································
cp875              ········································ ····································································
cp932              ········································· ···································································
cp949              ·········································· ··································································
cp950              ··········································· ·································································
euc-jis-2004       ············································ ································································
euc-jisx0213       ············································· ·······························································
euc-jp             ·············································· ······························································
euc-kr             ··············································· ·····························································
gb18030            ················································ ····························································
gb2312             ················································· ···························································
gbk                ·················································· ··························································
hp-roman8          ··················································· ·························································
hz                 ···················································· ························································
iso2022-jp         ····················································· ·······················································
iso2022-jp-1       ······················································ ······················································
iso2022-jp-2       ······················································· ·····················································
iso2022-jp-2004    ························································ ····················································
iso2022-jp-3       ························································· ···················································
iso2022-jp-ext     ·························································· ··················································
iso2022-kr         ··························································· ·················································
iso8859-1          ···························································· ················································
iso8859-10         ····························································· ···············································
iso8859-11         ······························································ ··············································
iso8859-13         ······························································· ·············································
iso8859-14         ································································ ············································
iso8859-15         ································································· ···········································
iso8859-16         ·································································· ··········································
iso8859-2          ··································································· ·········································
iso8859-3          ···································································· ········································
iso8859-4          ····································································· ·······································
iso8859-5          ······································································ ······································
iso8859-6          ······································································· ·····································
iso8859-7          ········································································ ····································
iso8859-8          ········································································· ···································
iso8859-9          ·········································································· ··································
johab              ··········································································· ·································
koi8-r             ············································································ ································
koi8-t             ············································································· ·······························
koi8-u             ·············································································· ······························
kz1048             ··············································································· ·····························
latin-1            ················································································ ····························
mac-arabic         ················································································· ···························
mac-centeuro       ·················································································· ··························
mac-croatian       ··················································································· ·························
mac-cyrillic       ···················································································· ························
mac-farsi          ····················································································· ·······················
mac-greek          ······················································································ ······················
mac-iceland        ······················································································· ·····················
mac-latin2         ························································································ ····················
mac-roman          ························································································· ···················
mac-romanian       ·························································································· ··················
mac-turkish        ··························································································· ·················
palmos             ···························································································· ················
ptcp154            ····························································································· ···············
raw-unicode-escape ······························································································ ··············
shift-jis          ······························································································· ·············
shift-jis-2004     ································································································ ············
shift-jisx0213     ································································································· ···········
tis-620            ·································································································· ··········
unicode-escape     ··································································································· ·········
utf-16             ···································································································· ········
utf-16-be          ····································································································· ·······
utf-16-le          ······································································································ ······
utf-32             ······································································································· ·····
utf-32-be          ········································································································ ····
utf-32-le          ········································································································· ···
utf-7              ·········································································································· ··
utf-8              ··········································································································· ·
utf-8-sig          ············································································································ 
Conclusion: when 'Měření' is encoded as:
	{'cp852'}
but (mis)interpreted as:
	{'cp437', 'cp860', 'cp862', 'cp861', 'cp865'},
 it may appear as 'M╪²ení'
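
The conclusion above can be checked directly with a round trip through the two code pages. This is a minimal sketch, not part of the original script: the variable names `garbled` and `repaired` are chosen only for illustration, and `cp437` stands in for any of the five listed misinterpreting charsets.

```python
# Reproduce the garbling reported above: text stored as cp852,
# but read back as if it were cp437.
correct = "Měření"
garbled = correct.encode("cp852").decode("cp437")
print(garbled)

# Undo it by reversing the two steps in the opposite order.
repaired = garbled.encode("cp437").decode("cp852")
print(repaired)
```

Once the pair of charsets is known, the same two-step reversal should repair any other text that went through the same mix-up, provided no bytes were dropped along the way.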

	#!/usr/bin/python3
	# -*- coding: utf-8 -*-

	# Searches for the pair of charsets whose mix-up turns a known correct string into a given garbled one
	# Public domain, written by Filip Dominec 2022

	# EXAMPLES:

	#wrong, correct = "╪ konstrukЯnб ¤eчenб", "ě konstrukční řešení"
	#wrong, correct = "slouÄeninovĂ˝ch", "sloučeninových"
	#wrong, correct = "pøípravu slouèeninových polovodièù", "přípravu sloučeninových polovodičů"
	#wrong, correct = "ý", "ý"
	#wrong, correct = "Pro přípravu sloučeninových polovodičů vyuľívá jako zdrojové materiály", "Pro přípravu sloučeninových polovodičů využívá jako zdrojové materiály"
	#wrong, correct = "à", "ů"
	#wrong, correct = "v∞m╪r", "výměr"
	#wrong, correct = "slouèeninových", "sloučeninových"
	#wrong, correct = "vyuľívá","využívá"
	#wrong, correct = "vyu¾ívá", "využívá"
	wrong, correct = "M╪²ení", "Měření"


	import os


	## Try all encodings (big table!)
	def encodinglist():  # https://stackoverflow.com/questions/1728376/get-a-list-of-all-the-encodings-python-can-encode-to
	    r = []
	    # Every module in the stdlib "encodings" package is a candidate codec name
	    for i in os.listdir(os.path.split(__import__("encodings").__file__)[0]):
	        name = os.path.splitext(i)[0]
	        try:
	            "".encode(name)
	        except Exception:  # not a text codec (helper module, bytes-to-bytes codec, ...)
	            pass
	        else:
	            if name not in ("idna", "punycode"):
	                r.append(name.replace("_", "-"))
	    r.sort()
	    return r
	enclist = encodinglist()

	## Narrow list of likely encodings
	#enclist = ['ascii', 'utf8', 'latin-1']
	#win_encs = [f'Windows-125{n}' for n in range(8)]
	#iso_encs = [f'ISO-8859-{n}' for n in range(1,10) ]
	#enclist = enclist + win_encs + iso_encs



	possible_froms = []
	possible_tos = []
	possible_solutions = []

	enclen = max(len(c) for c in enclist)
	enclist_aligned = [f"{enc:{enclen}} " for enc in enclist]

	# Table header: encoding names written vertically, one column per encoding
	print("REALLY ENCODED: \\ BUT INTERPRETED AS:")
	for ll in ("".join(j) for j in zip(*enclist_aligned)):
	    print(" "*enclen + " " + ll)

	# One row per real encoding f, one column per wrongly assumed encoding t
	for f,a in zip(enclist, enclist_aligned):
	    print(a, end="")
	    for t in enclist:
	        try:
	            # Reverse the suspected mix-up: wrong text -> bytes via t -> text via f
	            co = wrong.encode(t,"ignore").decode(f,"ignore")
	            if co == correct:
	                #print(f,t)
	                print("X", end="")
	                possible_froms.append(f)
	                possible_tos.append(t)
	                possible_solutions.append((f,t))
	            else:
	                print("·" if f!=t else " ", end="")

	            #print(co, end="")
	            #if "ý" in co: print(f,t)
	        except Exception:
	            print("E",end="")
	    print()

	#for f,t in possible_solutions:
	print(f"Conclusion: when '{correct}' is encoded as:\n\t{set(possible_froms)}\nbut (mis)interpreted as:\n\t{set(possible_tos)},\n it may appear as '{wrong}'")
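
For use from other code, the same search can be wrapped in a function that returns the matching pairs instead of printing a table. The sketch below is an assumption-laden variant, not the script's own method: the name `find_charset_pairs` and the short `candidates` list are hypothetical, and it works in the forward direction (encode the correct string, decode it wrongly) with strict error handling, whereas the script above reverses the garbled string with `"ignore"`. Strict mode will therefore miss cases where characters were lost during the mix-up.

```python
def find_charset_pairs(wrong, correct, encodings):
    """Return (real, assumed) pairs for which
    correct.encode(real).decode(assumed) == wrong."""
    pairs = []
    for real in encodings:
        try:
            raw = correct.encode(real)
        except (UnicodeError, LookupError):
            continue  # this charset cannot represent the correct text
        for assumed in encodings:
            if assumed == real:
                continue  # decoding with the right charset cannot garble
            try:
                if raw.decode(assumed) == wrong:
                    pairs.append((real, assumed))
            except (UnicodeError, LookupError):
                continue  # bytes are not valid in the assumed charset
    return pairs


# Hypothetical short list of candidates; the full list can be used instead.
candidates = ["utf-8", "cp1250", "cp852", "cp437", "iso8859-2", "latin-1"]
pairs = find_charset_pairs("M╪²ení", "Měření", candidates)
print(pairs)
```

With this candidate list, the only pair found should be `cp852` read as `cp437`, which agrees with the table above.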
