{"id":234,"date":"2018-05-17T16:20:33","date_gmt":"2018-05-17T16:20:33","guid":{"rendered":"http:\/\/python.wp.w3.pt\/?p=234"},"modified":"2018-05-17T16:38:59","modified_gmt":"2018-05-17T16:38:59","slug":"stopwords-em-portugues","status":"publish","type":"post","link":"http:\/\/python.w3.pt\/?p=234","title":{"rendered":"Portuguese stopwords and punctuation"},"content":{"rendered":"<p>Python has <a href=\"http:\/\/www.nltk.org\/howto\/portuguese_en.html\">a package<\/a>, NLTK, that provides a list of Portuguese stopwords, which can be used to remove them from text. Here is how to use it:<\/p>\n<pre># Portuguese stopwords:\r\n\r\n>>> import nltk\r\n>>> nltk.download('stopwords')\r\n\r\n>>> stopwords = nltk.corpus.stopwords.words('portuguese')\r\n\r\n>>> stopwords[:10]\r\n[u'de', u'a', u'o', u'que', u'e', u'do', u'da', u'em', u'um', u'para']\r\n\r\n>>> len(stopwords)\r\n203\r\n\r\n>>> 'a' in stopwords\r\nTrue\r\n>>> 'xico' in stopwords\r\nFalse\r\n<\/pre>\n<p>At the moment the list has only 203 words, which seems very few to me. There is a list of these words on <a href=\"https:\/\/gist.github.com\/alopes\/5358189\">GitHub<\/a>.<\/p>\n<p>To check whether a character is punctuation, the following code can be used:<\/p>\n<pre>\r\n>>> import string\r\n>>> for c in string.punctuation:\r\n...     print(\"[\" + c + \"]\")\r\n...\r\n[!]\r\n[\"]\r\n[#]\r\n[$]\r\n[%]\r\n[&]\r\n[']\r\n[(]\r\n[)]\r\n[*]\r\n[+]\r\n[,]\r\n[-]\r\n[.]\r\n[\/]\r\n[:]\r\n[;]\r\n[<]\r\n[=]\r\n[>]\r\n[?]\r\n[@]\r\n[[]\r\n[\\]\r\n[]]\r\n[^]\r\n[_]\r\n[`]\r\n[{]\r\n[|]\r\n[}]\r\n[~]\r\n\r\n>>> ',' in string.punctuation\r\nTrue\r\n>>> 'a' in string.punctuation\r\nFalse\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Python has a package, NLTK, that provides a list of Portuguese stopwords, which can be used to remove them from text. 
Here is how to use it: # Portuguese stopwords: >>> import nltk >>> nltk.download(&#8216;stopwords&#8217;) >>> stopwords = nltk.corpus.stopwords.words(&#8216;portuguese&#8217;) >>> stopwords[:10] [u&#8217;de&#8217;, u&#8217;a&#8217;, u&#8217;o&#8217;, u&#8217;que&#8217;, u&#8217;e&#8217;, u&#8217;do&#8217;, u&#8217;da&#8217;, u&#8217;em&#8217;, u&#8217;um&#8217;, u&#8217;para&#8217;] >>> len(stopwords) 203 >>> &#8216;a&#8217; in stopwords True >>> &#8216;xico&#8217; in stopwords False &hellip; <\/p>\n<p class=\"link-more\"><a href=\"http:\/\/python.w3.pt\/?p=234\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">&#8220;Portuguese stopwords and punctuation&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"http:\/\/python.w3.pt\/index.php?rest_route=\/wp\/v2\/posts\/234"}],"collection":[{"href":"http:\/\/python.w3.pt\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/python.w3.pt\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/python.w3.pt\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/python.w3.pt\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=234"}],"version-history":[{"count":3,"href":"http:\/\/python.w3.pt\/index.php?rest_route=\/wp\/v2\/posts\/234\/revisions"}],"predecessor-version":[{"id":237,"href":"http:\/\/python.w3.pt\/index.php?rest_route=\/wp\/v2\/posts\/234\/revisions\/237"}],"wp:attachment":[{"href":"http:\/\/python.w3.pt\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=234"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/python.w3.pt\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=234"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/python.w3.pt\/index.php?rest_route=%2Fwp%2Fv2%2Fta
gs&post=234"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}