Splash standalone (without the Scrapy framework): how to set a proxy IP

Scenario

In some scenarios, for convenience and efficiency, it is easier to use Splash on its own, without the Scrapy framework.

Configuration

Proxy: a tunnel proxy works best.

Pick a location on the host machine and create the file /root/splash/proxy-files/cip.ini. Note: unlike the official documentation, the .ini extension should be lowercase.

[proxy]

; required
host=<your proxy host>
port=<your proxy port>

; optional, default is no auth
username=<your username>
password=<your password>

; optional, default is HTTP. Allowed values are HTTP and SOCKS5
type=HTTP

[rules]
; optional, default ".*"
whitelist=
    .*cip\.cc.*

; optional, default is no blacklist
blacklist=
    .*\.js.*
    .*\.css.*
    .*\.png

Start Splash with Docker

[root@host proxy-files]# docker run -p 8050:8050 -v /root/splash/proxy-files:/etc/splash/proxy-profiles scrapinghub/splash

# Start in the background and restart automatically
docker run -d -p 8050:8050 --restart=always -v /root/splash/proxy-files:/etc/splash/proxy-profiles scrapinghub/splash
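
Optionally, confirm that the profile directory was mounted into the container (get the container ID from docker ps; cip.ini should be listed):

docker exec <container_id> ls /etc/splash/proxy-profiles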

Logs after startup:

2022-10-13 04:22:24+0000 [-] Log opened.
2022-10-13 04:22:24.608906 [-] Xvfb is started: ['Xvfb', ':322410884', '-screen', '0', '1024x768x24', '-nolisten', 'tcp']
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-splash'
2022-10-13 04:22:24.718768 [-] Splash version: 3.5
2022-10-13 04:22:24.774528 [-] Qt 5.14.1, PyQt 5.14.2, WebKit 602.1, Chromium 77.0.3865.129, sip 4.19.22, Twisted 19.7.0, Lua 5.2
2022-10-13 04:22:24.774753 [-] Python 3.6.9 (default, Jul 17 2020, 12:50:27) [GCC 8.4.0]
2022-10-13 04:22:24.774878 [-] Open files limit: 1048576
2022-10-13 04:22:24.774946 [-] Can't bump open files limit
2022-10-13 04:22:24.806401 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
2022-10-13 04:22:24.806578 [-] memory cache: enabled, private mode: enabled, js cross-domain access: disabled
2022-10-13 04:22:24.969678 [-] verbosity=1, slots=20, argument_cache_max_entries=500, max-timeout=90.0
2022-10-13 04:22:24.970001 [-] Web UI: enabled, Lua: enabled (sandbox: enabled), Webkit: enabled, Chromium: enabled
2022-10-13 04:22:24.970452 [-] Site starting on 8050
2022-10-13 04:22:24.970553 [-] Starting factory <twisted.web.server.Site object at 0x7f729c0f4550>
2022-10-13 04:22:24.970848 [-] Server listening on http://0.0.0.0:8050

The log line "proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles" indicates that the proxy profile configuration has taken effect.

Usage

Visit cip.cc to check the current exit IP. The proxy=cip argument selects the cip.ini profile created above:

curl 'http://10.0.19.90:8050/render.html?url=http://cip.cc&proxy=cip'

Returned result (the first attempt came back with "Too Many Request", so the same command was run again):


<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">Too Many Request
</pre></body></html>%
➜ ~ curl 'http://10.0.19.90:8050/render.html?url=http://cip.cc&proxy=cip'
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head>
<title>IP查询 - 查IP(www.cip.cc)</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="description" content="查IP(www.cip.cc)网站, 提供免费的IP查询服务,命令行查询IP, 并且支持'PC网站, 手机网站, 命令行(Windows/UNIX/Linux)' 三大平台, 是个多平台的IP查询网站, 更新即使, 数据准确是我们的目标">
<meta name="keywords" content="IP, 查IP, IP查询">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta content="width=device-width,initial-scale=1" name="viewport">
<link rel="icon" href="data:;base64,=">
<link href="//static.cip.cc/static/styles.min.css?v=15" rel="stylesheet">
<script src="https://hm.baidu.com/hm.js?6c34da399cbcfbb71d86c72215942759"></script><script type="text/javascript" src="//static.cip.cc/static/js.min.js?v=6"></script>
</head>
<body>

<div class="wrapper">
<div class="page">
<div class="logo">
<h1>
<strong>多平台的命令行IP查询</strong>
<a href="//www.cip.cc/" title="手机, 命令行IP查询"><img src="//static.cip.cc/static/img/logo.png?v=2" alt="手机, 命令行IP查询"></a>
</h1>
</div>
<div class="search">
<form action="/" onsubmit="return query();">
<table>
<tbody>
<tr>
<td style=" width: 75%; ">
<input id="data-input" placeholder="请输入要查询的 IP 地址" size="26" type="text">
</td>
<td>
<input id="data-submit" type="submit" class="kq-button" value="查询">
</td>
</tr>
</tbody>
</table>
</form>
</div>
<div class="data kq-well">
<pre>IP : 182.87.15.14
地址 : 中国 江西 鹰潭
运营商 : 电信

数据二 : 江西省鹰潭市 | 电信

数据三 : 中国江西省鹰潭市 | 电信

URL : http://www.cip.cc/182.87.15.14
</pre>

As you can see, the IP has been changed.
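
For comparison, rendering the same page without the proxy argument should return the host machine's own exit IP; a quick check against the same Splash instance:

curl 'http://10.0.19.90:8050/render.html?url=http://cip.cc'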

Python requests demo

import requests

splash_host = "10.0.19.90"  # replace with your Splash server address
target_url = "http://cip.cc"
url = f'http://{splash_host}:8050/render.html?url={target_url}&proxy=cip'

response = requests.get(url)
print(response.text)
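
The Splash arguments can also be passed through the params dict of requests, which handles the URL encoding; a minimal sketch, with illustrative wait and timeout values:

import requests

# render.html arguments sent as query parameters; requests URL-encodes them.
params = {
    "url": "http://cip.cc",
    "proxy": "cip",   # proxy profile name, i.e. cip.ini
    "wait": 2,        # seconds to wait after the page loads
    "timeout": 30,    # overall render timeout on the Splash side
}
response = requests.get("http://10.0.19.90:8050/render.html", params=params)
print(response.text)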

An example use case:

import time
from urllib.parse import quote

import requests
from concurrent.futures import ThreadPoolExecutor
from retrying import retry


@retry(stop_max_attempt_number=3)
def req_taobao():
    target_url = "https://shop551707528.taobao.com/search.htm?search=y&orderType=hotsell_desc&&pageNo=1"
    # URL-encode the target URL so that its own query string is not mixed up
    # with the Splash arguments.
    url = f'http://10.0.19.90:8050/render.html?url={quote(target_url, safe="")}&proxy=taobao&wait=5'

    # The requests timeout must exceed the Splash wait time, otherwise every
    # request times out and gets retried.
    response = requests.get(url, timeout=30)
    print(response.text)


# Start crawling
s_time = time.time()
executor = ThreadPoolExecutor(5)
for i in range(10):
    executor.submit(req_taobao)
executor.shutdown()
print(time.time() - s_time)
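
Note the proxy=taobao argument: it refers to a second proxy profile, i.e. a taobao.ini file in the same /root/splash/proxy-files directory. A minimal sketch, with placeholder credentials and an illustrative whitelist rule:

[proxy]
host=<your proxy host>
port=<your proxy port>
username=<your username>
password=<your password>
type=HTTP

[rules]
whitelist=
    .*taobao\.com.*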

Related documentation

Splash proxy profiles: https://splash.readthedocs.io/en/stable/api.html#proxy-profiles
