Converting Relative Paths to Absolute Paths in PHP with Regular Expressions: A Worked Example
2024-01-10

Preface

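Anyone who has written a web crawler knows that the hyperlinks it collects usually need post-processing so that they all end up as absolute paths. This article presents a set of regular expressions for cleaning up such links. Without further ado, here are the details.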

A crawler will typically come across links like the following:

<!-- Empty hyperlink -->
<a href=""></a>
<!-- Whitespace -->
<a href=" "> </a>
<!-- <a> tags with other attributes -->
<a href="index.html" alt="hyperlink"> index.html </a>
<a href="/" target="_blank"> / target="_blank" </a>
<a target="_blank" href="/" alt="hyperlink"> target="_blank" / alt="hyperlink" </a>
<a target="_blank" title="hyperlink" href="/" alt="hyperlink"> target="_blank" title="hyperlink" / alt="hyperlink" </a>
<!-- Root directory -->
<a href="/"> / </a>
<a href="a"> a </a>
<!-- With query parameters -->
<a href="/index.html?id=1"> /index.html?id=1 </a>
<a href="?id=2"> ?id=2 </a>
<!-- // -->
<a href="//index.html"> //index.html </a>
<a href="//www.mafutian.net"> //www.mafutian.net </a>
<!-- Internal links -->
<a href="http://www.hole_1.com/index.html"> http://www.hole_1.com/index.html </a>
<!-- External links -->
<a href="http://www.mafutian.net"> http://www.mafutian.net </a>
<a href="http://www.numberer.net"> http://www.numberer.net </a>
<!-- Links to images and text files -->
<a href="1.jpg"> 1.jpg </a>
<a href="1.jpeg"> 1.jpeg </a>
<a href="1.gif"> 1.gif </a>
<a href="1.png"> 1.png </a>
<a href="1.txt"> 1.txt </a>
<!-- Ordinary links -->
<a href="index.html"> index.html </a>
<a href="index.html"> index.html </a>
<a href="./index.html"> ./index.html </a>
<a href="../index.html"> ../index.html </a>
<a href=".../"> .../ </a>
<a href="..."> ... </a>
<!-- Not links, but containing a colon -->
<a href="javascript:void(0)"> javascript:void(0) </a>
<a href="a:b"> a:b </a>
<a href="/a#a:b"> /a#a:b </a>
<a href="mailto:'mafutian@126.com'"> mailto:'mafutian@126.com' </a>
<a href="/tencent://message/?uin=335134463"> /tencent://message/?uin=335134463 </a>
<!-- Relative paths -->
<a href="."> . </a>
<a href=".."> .. </a>
<a href="../"> ../ </a>
<a href="/a/b/.."> /a/b/.. </a>
<a href="/a"> /a </a>
<a href="./b"> ./b </a>
<a href="./././././././././b"> ./././././././././b </a> <!-- equivalent to ./b -->
<a href="../c"> ../c </a>
<a href="../../d"> ../../d </a>
<a href="../a/../b/c/../d"> ../a/../b/c/../d </a>
<a href="./../e"> ./../e </a>
<a href="http://www.hole_1.org/./../e"> http://www.hole_1.org/./../e </a>
<a href="./.././f"> ./.././f </a>
<a href="http://www.hole_1.org/../a/.../../b/c/../d/.."> http://www.hole_1.org/../a/.../../b/c/../d/.. </a>
<!-- With a port number -->
<a href=":8081/index.html"> :8081/index.html </a>
<a href="http://www.mafutian.net:80/index.html"> :80/index.html </a>
<a href="http://www.mafutian.net:8081/index.html"> http://www.mafutian.net:8081/index.html </a>
<a href="http://www.mafutian.net:8082/index.html"> http://www.mafutian.net:8082/index.html </a>
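Before links like the samples above can be processed, their href values have to be pulled out of the markup. The following is only a minimal sketch of that step and is not part of the original article: the function name extract_hrefs and the pattern are assumptions made for illustration, and the pattern only recognises double-quoted href attributes. For messy real-world HTML, a parser such as PHP's DOMDocument is usually the more robust choice.

```php
<?php
// Hypothetical helper (not from the original article): collect the raw
// href values of <a> tags with a regular expression.
function extract_hrefs(string $html): array
{
    // "<a", optional other attributes, whitespace, then href="...".
    // Group 1 captures the value between the double quotes, which may be empty.
    $pattern = '/<a\b[^>]*?\shref\s*=\s*"([^"]*)"/i';
    preg_match_all($pattern, $html, $matches);
    return $matches[1];
}

$html = '<a href="./b"> ./b </a>'
      . '<a target="_blank" href="/" alt="x"> / </a>'
      . '<a href="javascript:void(0)"> javascript:void(0) </a>';

// Prints the three captured values: ./b, / and javascript:void(0)
print_r(extract_hrefs($html));
```

The captured values are still raw at this point. Entries such as javascript:void(0) are not crawlable links and have to be filtered out in a later step.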

The first step is to turn every link into an absolute path, which yields a URL of the form:

http:// ... / ../ ../
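The article does not show code for this first step, so here is a rough sketch of one way it could be done. The function name prepend_base and the page URL on www.example.com are hypothetical, and the logic is a simplification: it does not follow every rule of RFC 3986 (for instance, a query-only href such as ?id=2 is simply appended to the page's directory). The './' and '../' segments are deliberately left in place for the clean-up pass.

```php
<?php
// Hypothetical helper (an assumption, not the author's code): attach the
// page's scheme, host and directory to an href, without resolving dots yet.
function prepend_base(string $href, string $pageUrl): ?string
{
    $href = trim($href);
    // An empty href points back at the page itself.
    if ($href === '') {
        return $pageUrl;
    }
    // Already absolute (http:// or https://): keep unchanged.
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }
    // Any other "scheme:" prefix (javascript:, mailto:, a:b) is not crawlable.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
        return null;
    }
    $parts  = parse_url($pageUrl);
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');
    // Protocol-relative link (//host/path): reuse the page's scheme.
    if (strpos($href, '//') === 0) {
        return $parts['scheme'] . ':' . $href;
    }
    // Root-relative link (/path): attach to scheme and host only.
    if (strpos($href, '/') === 0) {
        return $origin . $href;
    }
    // Everything else is relative to the directory of the current page.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

$page = 'http://www.example.com/t/1805/page.html'; // hypothetical page URL
echo prepend_base('../c', $page), PHP_EOL; // http://www.example.com/t/1805/../c
echo prepend_base('/a', $page), PHP_EOL;   // http://www.example.com/a
```

Returning null for non-HTTP schemes is one possible convention. A crawler could equally keep those values and filter them elsewhere.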

The rest of this article shows code that removes './', '../' and '/..' from such an absolute path:

function url_to_absolute($relative)
{
    $absolute = '';
    // Remove every './' (a single dot followed by a slash, not the tail of '../')
    $absolute = preg_replace('/(?<!\.)\.\//', '', $relative);
    $count = preg_match_all('/(?<!\/)\/([^\/]{1,}?)\/\.\.\//', $absolute, $res);
    // Repeatedly collapse every '/abc/../' into '/'
    do
    {
        $absolute = preg_replace('/(?<!\/)\/([^\/]{1,}?)\/\.\.\//', '/', $absolute);
        $count = preg_match_all('/(?<!\/)\/([^\/]{1,}?)\/\.\.\//', $absolute, $res);
    } while ($count >= 1);
    // Remove a trailing '/..' together with the directory before it, if any
    $absolute = preg_replace('/(?<!\/)\/([^\/]{1,}?)\/\.\.$/', '/', $absolute);
    $absolute = preg_replace('/\/\.\.$/', '', $absolute);
    // Remove any remaining '../'
    $absolute = preg_replace('/(?<!\.)\.\.\//', '', $absolute);
    return $absolute;
}
$relative = 'http://www.mytest.org/../a/.../../b/c/../d/..';
var_dump(url_to_absolute($relative));
// Output: string 'http://www.mytest.org/a/b/' (length=26)
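For comparison, the same clean-up can also be done without regular expressions by walking the path segments with a stack. This is an alternative technique, not the method used in this article. The function name remove_dot_segments_url is hypothetical, and the sketch assumes the URL has no query string or fragment. It also collapses repeated slashes, which the regex version does not do.

```php
<?php
// Alternative sketch (not the article's regex method): resolve '.' and '..'
// by pushing and popping path segments on a stack.
function remove_dot_segments_url(string $url): string
{
    // Split off "scheme://host[:port]" so that only the path is rewritten.
    if (!preg_match('#^([a-z][a-z0-9+.-]*://[^/]+)(/.*)?$#i', $url, $m)) {
        return $url; // not an absolute URL: leave it untouched
    }
    $origin   = $m[1];
    $path     = $m[2] ?? '/';
    $segments = explode('/', $path);
    $last     = end($segments);

    $stack = [];
    foreach ($segments as $segment) {
        if ($segment === '' || $segment === '.') {
            continue;           // skip empty and current-directory segments
        }
        if ($segment === '..') {
            array_pop($stack);  // go up one level; a no-op at the root
            continue;
        }
        $stack[] = $segment;    // ordinary name, which includes '...'
    }

    $result = '/' . implode('/', $stack);
    // Keep a trailing slash when the path ended in '/', '.' or '..'.
    if ($stack !== [] && in_array($last, ['', '.', '..'], true)) {
        $result .= '/';
    }
    return $origin . $result;
}

echo remove_dot_segments_url('http://www.mytest.org/../a/.../../b/c/../d/..'), PHP_EOL;
// http://www.mytest.org/a/b/
```

For the sample input used in this article, both approaches give the same result. They are not guaranteed to agree on every edge case, so each should be tested against the kinds of links the crawler actually encounters.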

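Summary

That is all for this article. I hope it proves useful for your study or work. If you have any questions, feel free to leave a comment. Thank you for supporting 绿夏网.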


