From radev at umich.edu Tue Apr 15 12:02:34 2008
From: radev at umich.edu (radev@umich.edu)
Date: Tue Apr 15 11:55:51 2008
Subject: [Mead] Re: French MEAD
In-Reply-To: <200804151456.41663.jorge.garcia-flores@cea.fr>
Message-ID: <20080415160234.DBA1D60083127@belobog.si.umich.edu>

Let us know how it goes.

> It works! Thanx a lot for your help.
> I will now try to make a french IDF database!
>
> Greetings
>
> Jorge
>
> On Tuesday 15 April 2008 03:41, Thuy Vu wrote:
> > Hello,
> > I spent some more time trying to get mead to work with French. Pardon my
> > inexperience; I was not able to fix the problem. However, the information
> > below may be helpful to you.
> >
> > To fix utf8 I/O error:
> > - In the ".docsent" file created, replace with replace
> > - Also edit lib/MEAD/Document.pm file such that:
> >   1) add "use utf8" on top,
> >   2) replace line ~40 where "open (INSTREAM, "iconv -f BIG5 -t UTF-8
> >      $document_filename |" with " open (INSTREAM, "<:utf8 ",
> >      $document_filename);"
> >   3) Comment out line ~80, where " $text =
> >      $UTF_8_to_Big5->convert($text);"
> > - Same changes should be done on lib/MEAD/Query.pm such that:
> >   1) On line ~30, replace " open(UNICODE_VERSION, "iconv -f BIG5 -t
> >      UTF-8 $query_filename |");" with " open(UNICODE_VERSION, "<:utf8 ",
> >      "$query_filename");"
> >   2) On line ~50, replace " $UTF_8_to_Big5->convert($text);" with
> >      "$text"
> > - One last change on lib/Essence/Text.pm: In line ~23, replace
> >   A_BUNCH_OF_CHAR in "my @words = split /A_BUNCH_OF_CHAR/, $text;" with
> >   the appropriate characters that would help split words in French. See
> >   example non-English split characters in bin/make-CHIN-docsent.pl
> >
> > According to my variable dump trace, the above changes would guarantee
> > that mead data types (file stream, scalar, and hash) can correctly store
> > French text. What is still broken is the IDF operation on utf8 characters.
> >
> > Using English IDF, this is the produced summary for pekin data: "Cybermétho
> > Banque de ressources en ligne pour la formation à la recherche
> > sociocommunautaire et universitaire en sciences sociales DOCUMENT Reporters
> > sans frontières, Solidarité Chine et le Comité de soutien au peuple
> > tibétain, « Au nom des droits de l'homme, non à la candidature de Pékin aux
> > J.O en 2008 », mémoire soumis au Comité international olympique, Paris,
> > reproduit avec l'autorisation de Reporters sans frontières."
> >
> > P/S: If you want to perform a variable trace dump, make sure that the dump
> > is also utf8 compatible. For example, writing to a log file like this:
> > "open (TLOG, ">>:utf8", "/tmp/thuy.log");" is ok.
> >
> > Good luck.
> > _________________________________________________________________
> > Thuy Vu
> > GSRA, University of Michigan
> > ttvu@umich.edu
> >
> > > -----Original Message-----
> > > From: GARCIA FLORES Jorge 704360 IRSN [mailto:jorge.garcia-flores@cea.fr]
> > > Sent: Monday, April 14, 2008 10:05 AM
> > > To: Thuy Vu
> > > Cc: radev@umich.edu; 'Bryan Gibson'; 'Joshua Gerrish'; 'Anthony Fader'
> > > Subject: Re: French MEAD
> > >
> > > Hi. We tweak it away by running MEAD with non ASCII characters, and
> > > then "reconstructing" a summary with MEAD's selected sentences,
> > > extracted from a UTF-8 friendly file (that is, more or less the
> > > solution proposed by Thuy Vu).
> > > However, I'm very interested in building a french IDF file... could you
> > > recommend me an article where you describe the building method for IDF
> > > files and DBM files?
> > >
> > > Thanx a lot
> > >
> > > Jorge
> > >
> > > On Thursday 10 April 2008 22:01, Thuy Vu wrote:
> > > > Hello,
> > > >
> > > > From what I understand, since mead version 305 with Chinese
> > > > capability, BIG5 encoding is hardcoded within the library to handle
> > > > irregular characters.
> > > > The user correctly commented out any line that deals with Iconv().
> > > > If input is already UTF8, mead should not try to convert it back and
> > > > forth between UTF8 and BIG5. Therefore, by commenting these
> > > > conversions out like the end-user and Bryan did, the iconv crash
> > > > stops. However, the observed result is empty for sentences with
> > > > irregular characters.
> > > >
> > > > More internal problems:
> > > > 1) The main one is Perl I/O. By default Perl is Unicode compatible.
> > > > However, when encoding is not specified by the user, Perl would guess
> > > > and convert it to utf8. In our case, the characters are doubly encoded
> > > > to UTF8, which is gibberish. This means that for all types of I/O, we
> > > > need to clarify with Perl that they are UTF8 already.
> > > > - For file I/O, modify open(STREAM, $filename) to open(STREAM,
> > > >   "<:utf8", $filename)
> > > > - For hash, key cannot be utf8 implicitly
> > > > - Same special handling for DB query.
> > > > I started modifying code for the file I/O case, but realized that
> > > > there are too many changes to be done. Instead, I set PERL_UNICODE=SDA
> > > > and LANG=$LANG:utf8 to tell Perl to treat all input as UTF-8. However,
> > > > this broke the Essence script that splits strings based on non-UTF8
> > > > characters. In addition, it does not treat UTF-8 output, so the
> > > > summary result still only shows non-irregular text.
> > > >
> > > > 2) Another problem I foresee is IDF. Even if encoding is fixed, it
> > > > doesn't make sense to run French texts against an English or Chinese
> > > > IDF file. They need to generate their own frnidf files for accurate
> > > > summary.
> > > >
> > > > Conclusion:
> > > > - To fix the encoding problem, we have to write a special handler just
> > > >   like we did for Chinese characters. It should create docsent files
> > > >   and store strings in UTF-8 explicitly. This requires a big effort
> > > >   for Clair.
> > > > - To patch the encoding problem, the end-user can add option "-c" to
> > > >   iconv(), like this: iconv( -c -f original_encoding -t new_encoding).
> > > >   This will strip irregular characters, so "Chào" becomes "Cho". This
> > > >   text does not make sense to the end-user, but they can use it to run
> > > >   mead, then re-generate the original text using some sort of sentence
> > > >   ID. But again, IDF needs to be created.
> > > >
> > > > Recommended resource for the end-user:
> > > > - Mead for Chinese: Read "Summarizing Chinese Documents with MEAD"
> > > >   section of mead manual (available online).
> > > > - Encoding: http://www.ahinea.com/en/tech/perl-unicode-struggle.html
> > > >
> > > > Kind regards,
> > > > _______________________________________________________________________
> > > > Thuy Vu
> > > > GSRA, University of Michigan
> > > > ttvu@umich.edu
> > > >
> > > > > -----Original Message-----
> > > > > From: radev@umich.edu [mailto:radev@umich.edu]
> > > > > Sent: Tuesday, April 08, 2008 7:50 PM
> > > > > To: Thuy Vu
> > > > > Cc: 'Bryan Gibson'
> > > > > Subject: Re: French MEAD
> > > > >
> > > > > Please try to solve this problem and email the people who asked.
> > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: radev@umich.edu [mailto:radev@umich.edu]
> > > > > > > Sent: Monday, April 07, 2008 8:14 PM
> > > > > > > To: Bryan Gibson
> > > > > > > Cc: Thuy Vu
> > > > > > > Subject: Re: French MEAD
> > > > > > >
> > > > > > > Thuy, can you please figure this out?
> > > > > > > I need to let Bryan work on the AAN project.
> > > > > >
> > > > > > [Thuy] Ok. I won't get much done by tomorrow because of class, but
> > > > > > hopefully by Wednesday I will have some ideas.
> > > > > >
> > > > > > > Quoting Bryan Gibson :
> > > > > > > > Do you know if all of the french characters should be covered
> > > > > > > > in UTF-8?
> > > > > > > >
> > > > > > > > I'm afraid I don't know much about character encodings.
> > > > > > > >
> > > > > > > > Bryan
> > > > > > > >
> > > > > > > > On Mon, Apr 07, 2008 at 08:01:30PM -0400, radev@umich.edu wrote:
> > > > > > > >> I believe that they were trying to use UTF-8.
> > > > > > > >>
> > > > > > > >> Quoting Bryan Gibson :
> > > > > > > >>> Hi Thuy,
> > > > > > > >>>
> > > > > > > >>> It looks like it's an issue in MEAD/Document.pm line 42:
> > > > > > > >>> open (INSTREAM, "iconv -f BIG5 -t UTF-8 $document_filename |");
> > > > > > > >>> This is always trying to convert files from BIG5 to UTF-8. Do
> > > > > > > >>> you know what the conversion would be for french characters?
> > > > > > > >>> Is it something other than UTF-8?
> > > > > > > >>> Everything I've tried just produces blank output or
> > > > > > > >>> an error:
> > > > > > > >>>
> > > > > > > >>> gibsonb@belobog:/data0/projects/meadfrench> ./mead/bin/mead.pl MORCAS
> > > > > > > >>> Using system rc-file:
> > > > > > > >>> /data0/projects/mead311/mead-belobog/bin/../.meadrc
> > > > > > > >>> Warning: Can't find user rc-file
> > > > > > > >>> Cluster: /data0/projects/meadfrench/MORCAS/MORCAS.cluster
> > > > > > > >>> iconv: illegal input sequence at position 251
> > > > > > > >>>
> > > > > > > >>> Any ideas would be helpful.
> > > > > > > >>>
> > > > > > > >>> Bryan
> > > > > > > >>>
> > > > > > > >>> On Mon, Apr 07, 2008 at 05:23:21PM -0400, radev@umich.edu wrote:
> > > > > > > >>>> Hi, Thuy,
> > > > > > > >>>>
> > > > > > > >>>> Can you please figure out the problem with the French mead?
> > > > > > > >>>>
> > > > > > > >>>> Drago
> > > > >
> > > > > --
> > > > > Dragomir R. Radev                    Associate Professor
> > > > > SI, CSE, Ling                        U. Michigan, Ann Arbor
> > > > > http://www.eecs.umich.edu/~radev     radev@umich.edu

--
Dragomir R. Radev                    Associate Professor
SI, CSE, Ling                        U. Michigan, Ann Arbor
http://www.eecs.umich.edu/~radev     radev@umich.edu
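
The file I/O fix discussed in this thread comes down to telling Perl explicitly that input is already UTF-8, so that text is not encoded a second time. Below is a minimal sketch of that idea. It is not MEAD's actual code: the sample string, the temporary file, and the split pattern are assumptions chosen for illustration, and it uses the stricter ":encoding(UTF-8)" layer where the thread uses ":utf8".

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Sample text (assumption): "non à la candidature de Pékin", written with
# escapes so the script does not depend on the source file's encoding.
my $text = "non \x{e0} la candidature de P\x{e9}kin";

# Encoding once gives the correct UTF-8 bytes; encoding those bytes again
# reproduces the double-encoding problem described in the thread.
my $once  = encode('UTF-8', $text);
my $twice = encode('UTF-8', $once);

# Write the text to a temporary file as UTF-8.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ':encoding(UTF-8)');
print {$out} "$text\n";
close($out) or die "close failed: $!";

# Read it back with an explicit layer: Perl decodes bytes into characters.
open(my $in, '<:encoding(UTF-8)', $path) or die "open failed: $!";
my $decoded = <$in>;
close($in);
chomp $decoded;

# Read it back as raw bytes: no decoding happens.
open(my $raw_fh, '<:raw', $path) or die "open failed: $!";
my $raw = <$raw_fh>;
close($raw_fh);
chomp $raw;

# Split on anything that is not a Unicode letter or digit (a hypothetical
# stand-in for the A_BUNCH_OF_CHAR pattern in lib/Essence/Text.pm).
my @words = split /[^\p{L}\p{N}]+/, $decoded;

printf "characters: %d\n", length $text;
printf "bytes encoded once: %d\n", length $once;
printf "bytes encoded twice: %d\n", length $twice;
printf "decoded read matches original: %s\n", ($decoded eq $text ? 'yes' : 'no');
printf "raw read length: %d\n", length $raw;
printf "words: %d\n", scalar @words;
```

In Document.pm and Query.pm the same kind of explicit layer would go on the open calls that currently pipe through iconv. Whether further changes are needed depends on the rest of MEAD's code, which this sketch does not cover.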