{"id":159,"date":"2016-12-08T12:06:12","date_gmt":"2016-12-08T12:06:12","guid":{"rendered":"http:\/\/python.wp.w3.pt\/?p=159"},"modified":"2016-12-08T12:22:12","modified_gmt":"2016-12-08T12:22:12","slug":"expressoes-regulares-iii","status":"publish","type":"post","link":"http:\/\/python.w3.pt\/?p=159","title":{"rendered":"Regular expressions III"},"content":{"rendered":"<p>The goal is to find URLs embedded in text, in particular http and https links, with regular expressions, while allowing for Unicode names: file names, domain names and top-level domains (TLDs). On top of that, TLD names, once limited to a small, well-known set, are now created every week, with varying lengths. Apparently, <a href=\"https:\/\/en.wikipedia.org\/wiki\/Domain_Name_System#cite_ref-rfc1034_1-2\">there is a limit<\/a> on that length, but for how long?<\/p>\n<p>At present, the limit is defined in <a class=\"external mw-magiclink-rfc\" href=\"https:\/\/tools.ietf.org\/html\/rfc1035\" rel=\"nofollow\">RFC 1035<\/a>, <a class=\"external mw-magiclink-rfc\" href=\"https:\/\/tools.ietf.org\/html\/rfc1123\" rel=\"nofollow\">RFC 1123<\/a>, and <a class=\"external mw-magiclink-rfc\" href=\"https:\/\/tools.ietf.org\/html\/rfc2181\" rel=\"nofollow\">RFC 2181<\/a>. It is 63 characters for each label (segment) of a domain name and 253 for the whole name. For names that use characters beyond ASCII or, more precisely, beyond the LDH set (letters, digits, hyphen: a-z, A-Z, 0-9 and the hyphen), these limits apply to the textual representation, <a href=\"https:\/\/en.wikipedia.org\/wiki\/Punycode\">Punycode<\/a>.<\/p>\n<p>URLs also contain directory and file names, as well as parameters and variables for dynamic pages. 
Building a URL parser is no easy task. It is better to take one that has already been written and tested and, if need be, improve or fix it.<\/p>\n<p>That is what I did, and here is the example Python program with the regular expression for hunting down URLs:<\/p>\n<pre>#!\/usr\/bin\/python\r\n# -*- coding: utf-8 -*-\r\nimport re\r\n\r\n# Sample text: URLs mixed with e-mail addresses and Unicode domain names\r\nline = \"#lol avi\u00e3o https:\/\/gist.github.com\/HenkPoley\/8899766 com o t\u00f3ino a bordo test@google.co.il, person@amazon.co.uk https:\/\/www.vinicius.atongadamirongadokabulet\u00ea\/fotos\/pauta.gif e outro http:\/\/www.pagina.\u79fb\u52a8\/index.php e mais um outro link: https:\/\/\/w3.\u6e38\u620f\/ http:\/\/w3.pt e ainda https:\/\/stat.cool\/index.php \"\r\n\r\n# Liberal URL pattern, based on the patterns linked at the end of this post\r\np = re.compile(r\"\"\"(?i)\\b((?:https?:(?:\/{1,3}|[\\w0-9%])|[\\w0-9.\\-]+[.](?:\\w{2,63})\/)(?:[^\\s()&lt;&gt;{}\\[\\]]+|\\([^\\s()]*?\\([^\\s()]+\\)[^\\s()]*?\\)|\\([^\\s]+?\\))+(?:\\([^\\s()]*?\\([^\\s()]+\\)[^\\s()]*?\\)|\\([^\\s]+?\\)|[^\\s`!()\\[\\]{};:'\".,&lt;&gt;?\u00ab\u00bb\u201c\u201d\u2018\u2019])|(?:\\b(?&lt;![@.])[\\w0-9]+(?:[.\\-][\\w0-9]+)*[.](?:\\w{2,63})\\b\/?(?!@)))\"\"\", re.U)\r\n\r\nr = p.findall(line)\r\nprint r  # Python 2 print statement\r\n<\/pre>\n<p>The result is the following:<\/p>\n<pre>['https:\/\/gist.github.com\/HenkPoley\/8899766', 'https:\/\/www.vinicius.atongadamirongadokabulet\\xc3\\xaa\/fotos\/pauta.gif', 'http:\/\/www.pagina.\\xe7\\xa7\\xbb\\xe5\\x8a\\xa8\/index.php', 'https:\/\/\/w3.\\xe6\\xb8\\xb8\\xe6\\x88\\x8f\/', 'http:\/\/w3.pt', 'https:\/\/stat.cool\/index.php']\r\n<\/pre>\n<p>Links used to arrive at this result, in chronological order:<br \/>\n<a href=\"http:\/\/stackoverflow.com\/questions\/6718633\/python-regular-expression-again-match-url\">Python regular expression again &#8211; match url<\/a><br \/>\n<a href=\"http:\/\/daringfireball.net\/2010\/07\/improved_regex_for_matching_urls\">An Improved Liberal, Accurate Regex Pattern for Matching URLs<\/a><br \/>\n<a 
href=\"https:\/\/gist.github.com\/gruber\/8891611\">gruber\/Liberal Regex Pattern for Web URLs<\/a><br \/>\n<a href=\"https:\/\/gist.github.com\/winzig\/8894715\">winzig\/Liberal Regex Pattern for URLs<\/a><br \/>\n<a href=\"https:\/\/en.wikipedia.org\/wiki\/Domain_Name_System#cite_ref-rfc1034_1-2\">Domain Name System<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The goal is to find URLs embedded in text, in particular http and https links, with regular expressions, while allowing for Unicode names: file names, domain names and top-level domains (TLDs). On top of that, TLD names, once limited to a small, well-known set, are now created every week, &hellip; <\/p>\n<p class=\"link-more\"><a href=\"http:\/\/python.w3.pt\/?p=159\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">&#8220;Regular expressions III&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"http:\/\/python.w3.pt\/index.php?rest_route=\/wp\/v2\/posts\/159"}],"collection":[{"href":"http:\/\/python.w3.pt\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/python.w3.pt\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/python.w3.pt\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/python.w3.pt\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=159"}],"version-history":[{"count":5,"href":"http:\/\/python.w3.pt\/index.php?rest_route=\/wp\/v2\/posts\/159\/revisions"}],"predecessor-version":[{"id":164,"href":"http:\/\/python.w3.pt\/index.php?rest_route=\/wp\/v2\/posts\/159\/revisions\/164"}],"wp:attachment":[{"href":"http:\/\/python.w3.pt\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=159"}],"wp:term":[{
"taxonomy":"category","embeddable":true,"href":"http:\/\/python.w3.pt\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=159"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/python.w3.pt\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=159"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}