python-catalin: nltk


Tuesday, June 18, 2019

Python 3.7.3 : Stemming with nltk.

Today I will start another tutorial about the nltk Python module and stemming.
Stemming is the process of reducing the morphological variants of a word to a common root/base form.
Stemming programs are commonly referred to as stemming algorithms or stemmers.
The two kinds of stemming errors are overstemming and understemming.
Overstemming occurs when two words that have different stems are reduced to the same root.
Understemming occurs when two words that share a stem are not reduced to the same root.
Stemming is used in information retrieval systems such as search engines, and to determine domain vocabularies in domain analysis.
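
To make the two error types concrete, here is a minimal sketch with a toy suffix-stripping stemmer. It is not the Porter algorithm and not part of nltk; the suffix list and the `toy_stem` name are my own assumptions, chosen only to show how both errors can appear.

```python
# Hypothetical suffix list for illustration only; real stemmers use far richer rules.
SUFFIXES = ("ities", "ity", "ing", "al", "es", "e", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Overstemming: three words with different meanings collapse to one root.
for w in ("universe", "university", "universal"):
    print(w, " : ", toy_stem(w))

# Understemming: two forms of the same word end up with different roots.
for w in ("alumnus", "alumni"):
    print(w, " : ", toy_stem(w))
```

The first group shares one root even though the words mean different things, while the second group keeps two roots for what is really one word.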
Let's install this Python module, named nltk, with the pip tool:

C:\Python373\Scripts>pip install nltk
Collecting nltk
...
Successfully installed nltk-3.4.1 six-1.12.0

The nltk Python module works with human language data for statistical natural language processing (NLP).
It contains text processing libraries for tokenization, parsing, classification, stemming, tagging, and semantic reasoning, along with graphical demonstrations and sample data sets.
The next step is to download the models and data; see more at the official webpage.
Run these lines of code to open the NLTK downloader and fetch the data:

import nltk
nltk.download()

Let's test a simple implementation of stemming words using the nltk Python module:

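from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# word_tokenize needs the tokenizer data fetched with nltk.download()
my_porter = PorterStemmer()

quote = "Deep in the human unconscious is a pervasive need for a logical universe that makes sense."

# split the quote into word and punctuation tokens
words = word_tokenize(quote)

# print each token next to its Porter stem
for w in words:
    print(w, " : ", my_porter.stem(w))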

The result is something like this:

C:\Users\catafest>python stemming_001.py
Deep  :  deep
in  :  in
the  :  the
human  :  human
unconscious  :  unconsci
is  :  is
a  :  a
pervasive  :  pervas
need  :  need
for  :  for
a  :  a
logical  :  logic
universe  :  univers
that  :  that
makes  :  make
sense  :  sens
.  :  .

C:\Users\catafest>
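
A stemmer is often used to group word forms under one root, for example when building a search index. The sketch below shows one way to do that; the `group_by_stem` helper and the word list are my own assumptions, and the helper accepts any stemming function. The import is guarded so the script still runs if nltk is not installed.

```python
from collections import defaultdict

try:
    from nltk.stem import PorterStemmer  # the stemmer itself needs no downloaded data
except ImportError:  # nltk is not installed
    PorterStemmer = None

def group_by_stem(words, stem):
    """Map each root returned by `stem` to the list of words that produced it."""
    groups = defaultdict(list)
    for w in words:
        groups[stem(w)].append(w)
    return dict(groups)

# Hypothetical word list for illustration.
words = ["makes", "making", "make", "logical", "logic", "universe", "universal"]

if PorterStemmer is not None:
    groups = group_by_stem(words, PorterStemmer().stem)
    for root, members in sorted(groups.items()):
        print(root, " : ", members)
```

Roots that collect words with unrelated meanings are a quick way to spot overstemming in your own data.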

You can read more about stemming at Wikipedia.

Tuesday, May 2, 2017

The nltk python module - part 001.

About the nltk Python module.
NLTK is a leading platform for building Python programs to work with human language data. It provides Natural Language Processing techniques for analyzing text. You can read the NLTK 3.0 documentation here.
How to install the nltk Python module under Windows 10 and the Fedora 26 distro.
Install under Windows 10 by using the pip command:

C:\Python27\Scripts>pip install --trusted-host pypi.python.org nltk
Collecting nltk
Downloading nltk-3.2.2.tar.gz (1.2MB)
100% |################################| 1.2MB 2.6MB/s
Requirement already satisfied: six in c:\python27\lib\site-packages (from nltk)
Building wheels for collected packages: nltk
...
Successfully built nltk
Installing collected packages: nltk
Successfully installed nltk-3.2.2

Download all packages on Windows 10 with this Python code:

C:\Python27>python
Python 2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:42:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download()
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
True

Under Linux you can install it by using the pip command; I used the Fedora 26 distro:

[root@localhost mythcat]# pip install nltk
WARNING: Running pip install with root privileges is generally not a good idea.
 Try `pip install --user` instead.
Collecting nltk
  Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken
 by 'ProtocolError('Connection aborted.', error(104, 'Connection reset by peer'))': /simple/nltk/
  Downloading nltk-3.2.2.tar.gz (1.2MB)
    100% |████████████████████████████████| 1.2MB 1.1MB/s 
Requirement already satisfied: six in /usr/lib/python2.7/site-packages (from nltk)
Installing collected packages: nltk
  Running setup.py install for nltk ... done
Successfully installed nltk-3.2.2

Download all packages on your Fedora 26 distro with this Python code:

[mythcat@localhost ~]$ python 
Python 2.7.13 (default, Feb 21 2017, 12:00:39) 
[GCC 7.0.1 20170219 (Red Hat 7.0.1-0.9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> l
Packages:
  [ ] abc................. Australian Broadcasting Commission 2006
  [ ] alpino.............. Alpino Dutch Treebank
...
Collections:
  [ ] all-corpora......... All the corpora
  [ ] all................. All packages
  [ ] book................ Everything used in the NLTK Book

([*] marks installed packages)

Download which package (l=list; x=cancel)?
  Identifier> all
    Downloading collection u'all'
       | 
       | Downloading package abc to /home/mythcat/nltk_data...
       |   Unzipping corpora/abc.zip.
       | Downloading package alpino to /home/mythcat/nltk_data...
       |   Unzipping corpora/alpino.zip.
       | Downloading package biocreative_ppi to
...

Let's start with a simple example that loads the sample books:


>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>> ...

The next example uses the texts imported from the sample books:

# count() returns the number of times a word occurs in the text
>>> print text1.count("white")
191
# concordance() shows every occurrence of a given word, together with some context (it prints the result itself and returns None, hence the trailing None)
>>> print text3.concordance("white")
Displaying 5 of 5 matches:
potted , and every one that had some white in it , and all the brown among the 
 hazel and chesnut tree ; and pilled white strakes in them , and made the white
white strakes in them , and made the white appear which was in the rods . And h
y dream , and , behold , I had three white baskets on my he And in the uppermos
all be red with wine , and his teeth white with milk . Zebulun shall dwell at t
None
# similar() lists other words that appear in the same contexts as the given word
>>> print text3.similar("white")
None
>>> print text3.similar("got")
named set arrayed bound brought see embraced kissed slew unto curse
built shewed laid digged sent gave offer offered blessed
None
# common_contexts() shows the contexts shared by two or more words
>>> text3.common_contexts(["white","blue"])
(u'The following word(s) were not found:', u'white blue')
>>> text3.common_contexts(["man","men"])
old_of the_and the_said the_that the_took young_and the_s

This is all for today.
