[Mead] Re: French MEAD
radev at umich.edu
Tue Apr 15 12:02:34 EDT 2008
Let us know how it goes.
>
> It works! Thanks a lot for your help.
> I will now try to make a French IDF database!
>
> Greetings
>
> Jorge
>
> On Tuesday 15 April 2008 03:41, Thuy Vu wrote:
> > Hello,
> > I spent some more time trying to get mead to work with French. Pardon my
> > inexperience; I was not able to fix the problem. However, the information
> > below may be helpful to you.
> >
> > To fix the UTF-8 I/O error:
> > - In the ".docsent" file created, replace <?xml version='1.0'?> with
> > <?xml version='1.0' encoding='UTF-8'?>
> > - Also edit the lib/MEAD/Document.pm file as follows:
> > 1) Add "use utf8" at the top.
> > 2) On line ~40, replace "open (INSTREAM, "iconv -f BIG5 -t UTF-8
> > $document_filename |");" with "open (INSTREAM, "<:utf8",
> > $document_filename);"
> > 3) Comment out line ~80, where "$text =
> > $UTF_8_to_Big5->convert($text);"
> > - The same changes should be made in lib/MEAD/Query.pm:
> > 1) On line ~30, replace "open(UNICODE_VERSION, "iconv -f BIG5 -t
> > UTF-8 $query_filename |");" with "open(UNICODE_VERSION, "<:utf8",
> > "$query_filename");"
> > 2) On line ~50, replace "$UTF_8_to_Big5->convert($text);" with
> > "$text"
> > - One last change in lib/Essence/Text.pm: on line ~23, replace
> > A_BUNCH_OF_CHAR in "my @words = split /A_BUNCH_OF_CHAR/, $text;" with the
> > appropriate characters for splitting words in French. See the example
> > non-English split characters in bin/make-CHIN-docsent.pl
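For reference, the read-and-split steps above can be sketched in a few lines of standalone Perl. This is only an illustration, not MEAD code: the sample sentence, the temporary file, and the split pattern are assumptions of mine, and the pattern is not the one used in bin/make-CHIN-docsent.pl. It also uses the stricter ":encoding(UTF-8)" layer in place of the ":utf8" layer mentioned above.

```perl
use strict;
use warnings;
use utf8;                      # this source file itself contains UTF-8 literals
use File::Temp qw(tempfile);

# Write a small French sample to a temporary file, encoded as UTF-8.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ':encoding(UTF-8)');
print {$out} "Au nom des droits de l'homme, non à la candidature de Pékin.\n";
close($out) or die "close failed: $!";

# Read it back through a decoding layer so $text holds characters, not bytes.
open(my $in, '<:encoding(UTF-8)', $path) or die "cannot open $path: $!";
my $text = do { local $/; <$in> };
close($in);

# Hypothetical splitter: break on anything that is not a letter, a digit,
# or an apostrophe, so accented letters stay inside their words.
my @words = grep { length } split /[^\p{L}\p{N}']+/, $text;

binmode(STDOUT, ':encoding(UTF-8)');
print join('|', @words), "\n";
```

Whether the apostrophe should be a separator ("l'homme" as one word or two) is a choice that would need to match whatever tokenization the IDF file is built with.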
> >
> > According to my variable dump trace, the above changes would guarantee that
> > mead data types (file stream, scalar, and hash) can correctly store French
> > text. What is still broken is the IDF operation on UTF-8 characters.
> >
> > Using English IDF, this is the produced summary for pekin data: "Cybermétho
> > Banque de ressources en ligne pour la formation à la recherche
> > sociocommunautaire et universitaire en sciences sociales DOCUMENT Reporters
> > sans frontières, Solidarité Chine et le Comité de soutien au peuple
> > tibétain, « Au nom des droits de l'homme, non à la candidature de Pékin aux
> > J.O en 2008 », mémoire soumis au Comité international olympique, Paris,
> > reproduit avec l'autorisation de Reporters sans frontières."
> >
> > P.S.: If you want to perform a variable trace dump, make sure that the dump
> > is also UTF-8 compatible. For example, writing to a log file like this is
> > OK: "open (TLOG, ">>:utf8", "/tmp/thuy.log");"
> >
> > Good luck.
> > _________________________________________________________________
> > Thuy Vu
> > GSRA, University of Michigan
> > ttvu at umich.edu
> >
> >
> >
> > > -----Original Message-----
> > > From: GARCIA FLORES Jorge 704360 IRSN [mailto:jorge.garcia-flores at cea.fr]
> > > Sent: Monday, April 14, 2008 10:05 AM
> > > To: Thuy Vu
> > > Cc: radev at umich.edu; 'Bryan Gibson'; 'Joshua Gerrish'; 'Anthony Fader'
> > > Subject: Re: French MEAD
> > >
> > > Hi. We worked around it by running MEAD with non-ASCII characters, and
> > > then "reconstructing" a summary with MEAD's selected sentences, extracted
> > > from a UTF-8 friendly file (that is, more or less the solution proposed by
> > > Thuy Vu).
> > > However, I'm very interested in building a French IDF file... could you
> > > recommend an article where you describe the building method for IDF files
> > > and DBM files?
> > >
> > > Thanx a lot
> > >
> > > Jorge
> > >
> > > On Thursday 10 April 2008 22:01, Thuy Vu wrote:
> > > > Hello,
> > > >
> > > > From what I understand, since mead version 305 with Chinese capability,
> > > > BIG5 encoding is hardcoded within the library to handle irregular
> > > > characters. The user correctly commented out any line that deals with
> > > > Iconv(). If the input is already UTF-8, mead should not try to convert it
> > > > back and forth between UTF-8 and BIG5. Therefore, by commenting these
> > > > conversions out like the end-user and Bryan did, the iconv crash stops.
> > > > However, the observed result is empty for sentences with irregular
> > > > characters.
> > > >
> > > > More internal problems:
> > > > 1) The main one is Perl I/O. By default Perl is Unicode compatible.
> > > > However, when the encoding is not specified by the user, Perl would guess
> > > > and convert it to UTF-8. In our case, the characters are doubly encoded
> > > > to UTF-8, which yields gibberish. This means that for all types of I/O,
> > > > we need to make clear to Perl that the data is UTF-8 already.
> > > > - For file I/O, modify open(STREAM, $filename) to open(STREAM, "<:utf8",
> > > > $filename)
> > > > - For hashes, keys cannot be UTF-8 implicitly
> > > > - The same special handling applies to DB queries.
> > > > I started modifying the code for the file I/O case, but realized that
> > > > there are too many changes to be made. Instead, I set PERL_UNICODE=SDA
> > > > and LANG=$LANG:utf8 to tell Perl to treat all input as UTF-8. However,
> > > > this broke the Essence script that splits strings based on non-UTF-8
> > > > characters. In addition, it does not handle UTF-8 output, so the summary
> > > > result still only shows non-irregular text.
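The double encoding described above can be reproduced in isolation. The following is a minimal sketch using only the core Encode module; the variable names are mine and nothing here is taken from the MEAD sources.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The UTF-8 bytes for "é" (U+00E9) as they would sit in a file: 0xC3 0xA9.
my $bytes = "\xC3\xA9";

# Correct: decode once, giving a single character.
my $char = decode('UTF-8', $bytes);

# Wrong: treat the raw bytes as Latin-1 characters and encode them again.
# Each original byte becomes its own two-byte UTF-8 sequence.
my $double = encode('UTF-8', decode('ISO-8859-1', $bytes));

printf "decoded once:   %d character(s)\n", length $char;
printf "double encoded: %d byte(s)\n", length $double;
```

Decoding at the input boundary and encoding at the output boundary, once each, is what keeps a single "é" from turning into the familiar "Ã©" on the way out.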
> > > >
> > > > 2) Another problem I foresee is IDF. Even if the encoding is fixed, it
> > > > doesn't make sense to run French texts against an English or Chinese IDF
> > > > file. They need to generate their own frnidf files for an accurate
> > > > summary.
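As a rough idea of what building a French IDF table involves, here is a toy sketch of the textbook computation, idf = log(N / df). It is an assumption on my part that this is close enough to be useful: MEAD's own IDF files may use a different formula and are stored as DBM files, which this sketch does not attempt. The three-document corpus and the split pattern are made up for illustration.

```perl
use strict;
use warnings;
use utf8;

# Toy corpus: each entry is one already-decoded French document.
my @docs = (
    "la candidature de Pékin",
    "la formation à la recherche",
    "le comité de soutien",
);

# Document frequency: in how many documents does each word appear?
my %df;
for my $doc (@docs) {
    my %seen;
    $seen{lc $_}++ for grep { length } split /[^\p{L}\p{N}']+/, $doc;
    $df{$_}++ for keys %seen;
}

# Textbook IDF, log(N / df). MEAD's own IDF files may use another formula.
my $n = @docs;
my %idf = map { $_ => log($n / $df{$_}) } keys %df;

binmode(STDOUT, ':encoding(UTF-8)');
printf "%s\t%.4f\n", $_, $idf{$_} for sort keys %idf;
```

A real table would need a much larger French corpus, and the same tokenization as the one used when summarizing.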
> > > >
> > > > Conclusion:
> > > > - To fix the encoding problem, we have to write a special handler just
> > > > like we did for Chinese characters. It should create docsent files and
> > > > store strings in UTF-8 explicitly. This requires a big effort for Clair.
> > > > - To patch the encoding problem, the end-user can add the option "-c" to
> > > > iconv(), like this: iconv( -c -f original_encoding -t new_encoding). This
> > > > will strip irregular characters, so "Chào" becomes "Cho". This text does
> > > > not make sense to the end-user, but they can use it to run mead, then
> > > > re-generate the original text using some sort of sentence ID. But again,
> > > > an IDF needs to be created.
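The strip-then-restore workaround in the second point could look roughly like the sketch below. It is hedged heavily: the regex is only a stand-in for "iconv -c", the sentence IDs and the selected list are hypothetical, and no MEAD call is made.

```perl
use strict;
use warnings;
use utf8;

# Original sentences, keyed by a hypothetical sentence ID.
my %original = (
    1 => "Chào, MEAD!",
    2 => "Non à la candidature de Pékin.",
);

# Stand-in for "iconv -c": drop every character outside ASCII.
my %stripped;
for my $id (keys %original) {
    (my $ascii = $original{$id}) =~ s/[^\x00-\x7F]//g;
    $stripped{$id} = $ascii;
}

# Pretend the summarizer, run on the stripped text, picked sentence 2.
my @selected_ids = (2);

# Rebuild the summary from the untouched originals using the IDs.
my @summary = map { $original{$_} } @selected_ids;

binmode(STDOUT, ':encoding(UTF-8)');
print "$stripped{1}\n";
print "$_\n" for @summary;
```

The summary is readable again, but sentence selection was still driven by the stripped text and a non-French IDF, so the ranking itself is not improved by this.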
> > >
> > > > Recommended resources for the end-user:
> > > > - Mead for Chinese: Read the "Summarizing Chinese Documents with MEAD"
> > > > section of the mead manual (available online).
> > > > - Encoding: http://www.ahinea.com/en/tech/perl-unicode-struggle.html
> > > >
> > > > Kind regards,
> > > > _______________________________________________________________________
> > > > Thuy Vu
> > > > GSRA, University of Michigan
> > > > ttvu at umich.edu
> > > >
> > > > > -----Original Message-----
> > > > > From: radev at umich.edu [mailto:radev at umich.edu]
> > > > > Sent: Tuesday, April 08, 2008 7:50 PM
> > > > > To: Thuy Vu
> > > > > Cc: 'Bryan Gibson'
> > > > > Subject: Re: French MEAD
> > > > >
> > > > > Please try to solve this problem and email the people who asked.
> > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: radev at umich.edu [mailto:radev at umich.edu]
> > > > > > > Sent: Monday, April 07, 2008 8:14 PM
> > > > > > > To: Bryan Gibson
> > > > > > > Cc: Thuy Vu
> > > > > > > Subject: Re: French MEAD
> > > > > > >
> > > > > > > Thuy, can you please figure this out? I need to let Bryan work on the
> > > > > > > AAN project.
> > > > > >
> > > > > > [Thuy] Ok. I won't get much done by tomorrow because of class, but
> > > > > > hopefully by Wednesday I will have some ideas.
> > > > > >
> > > > > > > Quoting Bryan Gibson <gibsonb at umich.edu>:
> > > > > > > > Do you know if all of the French characters should be covered in
> > > > > > > > UTF-8?
> > > > > > > >
> > > > > > > > I'm afraid I don't know much about character encodings.
> > > > > > > >
> > > > > > > > Bryan
> > > > > > > >
> > > > > > > > On Mon, Apr 07, 2008 at 08:01:30PM -0400, radev at umich.edu wrote:
> > > > > > > >> I believe that they were trying to use UTF-8.
> > > > > > > >>
> > > > > > > >> Quoting Bryan Gibson <gibsonb at umich.edu>:
> > > > > > > >>> Hi Thuy,
> > > > > > > >>>
> > > > > > > >>> It looks like it's an issue in MEAD/Document.pm line 42:
> > > > > > > >>> open (INSTREAM, "iconv -f BIG5 -t UTF-8 $document_filename |");
> > > > > > > >>> This is always trying to convert files from BIG5 to UTF-8. Do you
> > > > > > > >>> know what the conversion would be for French characters? Is it
> > > > > > > >>> something other than UTF-8? Everything I've tried just produces
> > > > > > > >>> blank output or an error:
> > > > > > > >>>
> > > > > > > >>> gibsonb at belobog:/data0/projects/meadfrench> ./mead/bin/mead.pl MORCAS
> > > > > > > >>> Using system rc-file:
> > > > > > > >>> /data0/projects/mead311/mead-belobog/bin/../.meadrc
> > > > > > > >>> Warning: Can't find user rc-file
> > > > > > > >>> Cluster: /data0/projects/meadfrench/MORCAS/MORCAS.cluster
> > > > > > > >>> iconv: illegal input sequence at position 251
> > > > > > > >>>
> > > > > > > >>> Any ideas would be helpful.
> > > > > > > >>>
> > > > > > > >>> Bryan
> > > > > > > >>>
> > > > > > > >>> On Mon, Apr 07, 2008 at 05:23:21PM -0400, radev at umich.edu wrote:
> > > > > > > >>>> Hi, Thuy,
> > > > > > > >>>>
> > > > > > > >>>> Can you please figure out the problem with the French mead?
> > > > > > > >>>> Drago
> > > > >
> > > > > --
> > > > > Dragomir R. Radev Associate Professor
> > > > > SI, CSE, Ling U. Michigan, Ann Arbor
> > > > > http://www.eecs.umich.edu/~radev radev at umich.edu
>
>
--
Dragomir R. Radev Associate Professor
SI, CSE, Ling U. Michigan, Ann Arbor
http://www.eecs.umich.edu/~radev radev at umich.edu