[Mead] Re: Italian characters stripped
radev at umich.edu
Fri Oct 17 11:55:04 EDT 2003
Valerio Santinelli wrote:
> ----- Original Message -----
> From: <radev at umich.edu>
> > > The second question is how to make a language db file for Italian. In
> > > chapter 11.3 of the manual there's a mention to a script used to do that
> > > work but that has been stripped from v3.07. Can you give me some
> > > directions as to how to compute IDF? Is there any way to do that through
> > > the analysis of a large set of documents?
> > There is a script build-idf.pl or something like this that takes as
> > input a text file (e.g., enidf.txt) to build an IDF file.
> Yes, I've seen that. Is there a description of the textfile format anywhere?
> It would be helpful to understand what each value is. :)
These are IDF values. The formula is something like this:
idf(i) = log(N/n_i)
where N is the number of documents in a large corpus and n_i is the
number of documents among them containing a given word i.
A low IDF indicates a stop word.
For example, for a stop word:
N = collection size = 1,000,000 documents
n = number of documents with the word "the" = 900,000
N/n = 1.11
log10(N/n) = 0.046 (slightly higher than 0)
And for a rare word:
N = 1,000,000 documents
n = 10 (for the word "serendipity")
N/n = 100,000
log10(N/n) = 5 (quite high)
These examples use base 10; you can also use a different log base (e.g., 2).
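As a rough illustration of computing these values from a set of documents, here is a minimal Python sketch. It is not part of MEAD and is not the build-idf.pl script. The function name compute_idf, the tokenizer, and the tab-separated output are all assumptions made for this example; the file format MEAD expects is not described here, so the printed layout would need to be adapted. It uses log(N/n) with base 10 by default, matching the worked examples above.

```python
import math
import re
from collections import Counter


def compute_idf(documents, base=10):
    """Return {word: log_base(N / n_word)} for a list of document strings.

    Hypothetical helper for illustration only; not MEAD code.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for text in documents:
        # Count each word once per document: this is document frequency,
        # not term frequency. In Python 3, \w matches Unicode letters,
        # so accented Italian characters stay inside their words.
        words = set(re.findall(r"\w+", text.lower()))
        doc_freq.update(words)
    return {word: math.log(n_docs / n, base) for word, n in doc_freq.items()}


# Tiny made-up corpus; a real IDF table needs a large document collection.
docs = [
    "il gatto dorme",
    "il cane corre",
    "il gatto mangia",
    "la città è bella",
]

idf = compute_idf(docs)
# Print words from lowest IDF (most stop-word-like) to highest.
for word in sorted(idf, key=idf.get):
    print(f"{word}\t{idf[word]:.3f}")
```

On this toy corpus "il" appears in three of four documents and gets the lowest IDF, while words seen in a single document get the highest. Passing base=2 switches the log base.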
Note: the CIDR addon to MEAD may have some code for computing IDF. It
may need some work to use it for other purposes though.
> > To build the IDF file, you may have to write your own little
> > program. I hadn't thought of including one with MEAD. (Feel free to
> > contribute it back to MEAD if you want :)
> This won't be a problem if I can understand the structure of the file
> itself. :)
Dragomir R. Radev radev at umich.edu
Assistant Professor of Information, Electrical Engineering and
Computer Science, and Linguistics, the University of Michigan, Ann Arbor
Phone: 734-615-5225 Fax: 734-764-2475 http://www.si.umich.edu/~radev