An expert's presence means that while serving a billion people, you feel as if the service is for you alone...

A network bandwidth bottleneck kept TPS from rising

Contents
  1. Background
  2. Environment topology
  3. Resin RAC configuration
  4. Load test
    4.1. Script
  5. Follow-up

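I recently ran a load test for the insurance project team. No matter how many concurrent users or application servers we added, TPS would not go up.
The cause turned out to be insufficient network bandwidth between the application servers and the DB server.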

Background:

The DBA's email:
[Screenshot: the DBA's email]

Taobao's email:
[Screenshot: Taobao's email]

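The two emails show that the people involved understood the performance terms differently.
The DBA's email says 35 concurrent, while Taobao's email says 35 QPS. The Taobao staff had not talked to me directly and I had only seen the DBA's email, so I assumed 35 was the number of concurrent users rather than requests per second (QPS, which can be read as TPS here).
As a result, for the next several days of testing I kept believing that Taobao's load test data was wrong.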
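The difference between "35 concurrent" and "35 QPS" can be made concrete with Little's Law (in-flight requests = throughput × average response time). The sketch below is only an illustration: the 0.5 s response time is an assumed value, not a measurement from this test, and it ignores think time between requests.

```python
def concurrency_for(qps: float, avg_response_s: float) -> float:
    """Little's Law: in-flight requests = throughput * average response time."""
    return qps * avg_response_s


def qps_for(concurrent_users: float, avg_response_s: float) -> float:
    """Throughput produced by N users who each send requests back to back."""
    return concurrent_users / avg_response_s


# Hypothetical response time, chosen only to show how far apart the readings are.
ASSUMED_RESPONSE_S = 0.5

# Reading "35" as QPS: how many requests are in flight at once?
print(concurrency_for(35, ASSUMED_RESPONSE_S))  # 17.5

# Reading "35" as concurrent users: how many requests per second is that?
print(qps_for(35, ASSUMED_RESPONSE_S))  # 70.0
```

With this assumed response time the two readings differ by a factor of four in load, which is why the unit has to be agreed on before comparing results.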


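Environment topology

To mimic the production machine topology, I set up the application services as follows.
There are two physical machines, each hosting two virtual machines, for a total of 4 virtual machines acting as app servers.
The NICs on the physical machines and on the DB side are all gigabit full duplex.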


Resin RAC configuration

Host name entries configured in /etc/hosts:
ph-mic-test-db1-vip and ph-mic-test-db2-vip
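Before starting Resin it can help to confirm that both VIP names actually resolve on each app server. The following is a minimal sketch, not part of the original setup; the helper name `unresolved_hosts` is hypothetical, and the resolver is injectable so the check can be exercised without touching the real /etc/hosts.

```python
import socket
from typing import Callable, Iterable, List

# Host names taken from the /etc/hosts entries described above.
RAC_VIP_HOSTS = ["ph-mic-test-db1-vip", "ph-mic-test-db2-vip"]


def unresolved_hosts(
    names: Iterable[str],
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> List[str]:
    """Return the names that the resolver cannot turn into an address."""
    missing = []
    for name in names:
        try:
            resolver(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            missing.append(name)
    return missing


if __name__ == "__main__":
    # An empty list means both VIP names resolve on this machine.
    print(unresolved_hosts(RAC_VIP_HOSTS))
```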

<database>
  <jndi-name>jdbc/InsDatabase</jndi-name>
  <driver type="oracle.jdbc.driver.OracleDriver">
    <url>jdbc:oracle:thin:@(DESCRIPTION =(LOAD_BALANCE = on)(FAILOVER=on)(ADDRESS_LIST =(ADDRESS = (PROTOCOL = TCP)(HOST = ph-mic-test-db1-vip)(PORT = 1521))(ADDRESS = (PROTOCOL = TCP)(HOST = ph-mic-test-db2-vip)(PORT = 1521)))(CONNECT_DATA =(SERVER = DEDICATED)(SERVICE_NAME = ins.host185)(FAILOVER_MODE =(TYPE = SELECT)(METHOD = BASIC))))</url>
    <user>ins</user>
    <password>instest007</password>
    <init-param QTO="F"/>
  </driver>
  <prepared-statement-cache-size>20</prepared-statement-cache-size>
  <max-connections>1024</max-connections>
  <max-idle-time>30s</max-idle-time>
</database>

Load test

Script

web_custom_request("underwriting", 
"URL=http://192.168.42.162:9001/api/tb/underwriting?com_id=1757208157&sign=c9f56705bb1380f32a2121994680e30e",
"Method=POST",
"Resource=0",
"RecContentType=text/xml",
"Referer=",
"Mode=HTML",
"EncType=text/xml; charset=utf-8",
"Body="
"<?xml version=\"1.0\" encoding=\"GBK\" standalone=\"yes\"?> \n"
"<PackageList xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"> \n"
"<Package> \n"
"<Header> \n"
"<RequestType>01</RequestType> \n"
"<UUID>2e2830d0-bf96-460e-805a-e4d5cdafbebe</UUID> \n"
"<ComId>1757208157</ComId> \n"
"<From>taobao</From> \n"
"<SendTime>2016-11-07 08:08:17</SendTime> \n"
"<TaoBaoSerial>{serial}{serial2}</TaoBaoSerial> \n"
"<ComSerial xsi:nil=\"true\"/> \n"
"<Asyn>0</Asyn> \n"
"<ReturnUrl>http://service.baoxian.taobao.com/baoxian/cooperation</ReturnUrl> \n"
"<ProductCode>1851</ProductCode> \n"
"</Header> \n"
"<Request> \n"
"<Order> \n"
"<TBOrderId>{serial}{serial2}</TBOrderId> \n"
"<TotalPremium>3000</TotalPremium> \n"
"<PostFee xsi:nil=\"true\"/> \n"
"<InsBeginDate>2016-11-19 00:00:00</InsBeginDate> \n"
"<InsEndDate>2017-11-19 00:00:00</InsEndDate> \n"
"<InsPeriod>1Y</InsPeriod> \n"
"<ApplyNum>1</ApplyNum> \n"
"<Item> \n"
"<ItemId>526471633873</ItemId> \n"
"<SkuRiskCode>1851</SkuRiskCode> \n"
"<ProductCode>1851</ProductCode> \n"
"<ProductName>???</ProductName> \n"
"<Amount xsi:nil=\"true\"/> \n"
"<Premium>3000</Premium> \n"
"<ActualPremium>3000</ActualPremium> \n"
"<DiscountRate>10000</DiscountRate> \n"
"</Item> \n"
"<PolicyNo xsi:nil=\"true\"/> \n"
"</Order> \n"
"<ApplyInfo> \n"
"<Holder> \n"
"<CustomList> \n"
"<Custom key=\"HolderBirthday\">1984-08-15</Custom> \n"
"<Custom key=\"HolderName\">{name1}{name}</Custom> \n"
"<Custom key=\"HolderMobile\">{tel}{tel2}</Custom> \n"
"<Custom key=\"HolderSex\">1</Custom> \n"
"<Custom key=\"HolderCardType\">1</Custom> \n"
"<Custom key=\"HolderCardNo\">320323198804227051</Custom> \n"
"</CustomList> \n"
"</Holder> \n"
"<InsuredInfo> \n"
"<IsHolder>0</IsHolder> \n"
"<InsuredList> \n"
"<Insured> \n"
"<CustomList> \n"
"<Custom key=\"InsuredName\">{name1}{name}</Custom> \n"
"<Custom key=\"InsuredCardType\">1</Custom> \n"
"<Custom key=\"InsuredRelation\">1</Custom> \n"
"<Custom key=\"InsuredSex\">1</Custom> \n"
"<Custom key=\"InsuredBirthday\">1984-08-15</Custom> \n"
"<Custom key=\"InsuredCardNo\">320323198804227051</Custom> \n"
"</CustomList> \n"
"<BenefitInfo> \n"
"<IsLegal>1</IsLegal> \n"
"<BenefitList/> \n"
"</BenefitInfo> \n"
"</Insured> \n"
"</InsuredList> \n"
"</InsuredInfo> \n"
"<OtherInfo> \n"
"<CustomList> \n"
"<Custom key=\"bxcifid\">238810000071764631481</Custom> \n"
"</CustomList> \n"
"</OtherInfo> \n"
"<RefundInfo> \n"
"<CustomList/> \n"
"</RefundInfo> \n"
"</ApplyInfo> \n"
"</Request> \n"
"</Package> \n"
"</PackageList> \n",
LAST);

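While ramping up the load, the load generator ran fine, TPS peaked at 28, and CPU and load were not high. I then added 2 more app servers, and TPS still stayed at 28.
Because the test ran on the test network and only 28 orders were being submitted per second, I did not consider network bandwidth at all and assumed it could not be the problem.
Later I analyzed how a single order request interacts with the DB and was shocked: one order triggers 379 DB operations!
[Screenshot: DB operations for a single order request]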

Next, I sent order requests and captured the packets:
[Screenshot: packet capture of the order request]

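The requests were sent about once per second and the capture shows 19.359 Mbit per second. In other words, for a single order request the database returns about 19.359 Mbit of data!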

Let's do the math:
TPS = 28, at 19.359 Mbit per order,
per second: 28 × 19.359 ≈ 542 Mbit/s
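The same arithmetic as a small sketch, which also derives the average payload per DB operation from the 379 operations observed per order. It assumes 1 Mbit = 10^6 bits and that every order moves the same amount of data as the captured one; the function names are illustrative only.

```python
MBIT_PER_ORDER = 19.359   # DB traffic captured for one order request
TPS = 28                  # throughput ceiling seen in the load test
DB_OPS_PER_ORDER = 379    # DB operations triggered by one order


def required_mbit_per_s(tps: float, mbit_per_order: float) -> float:
    """DB-to-app bandwidth needed to sustain the given TPS."""
    return tps * mbit_per_order


def avg_kbyte_per_db_op(mbit_per_order: float, ops: int) -> float:
    """Average data per DB operation in kB (1 Mbit = 10**6 bits assumed)."""
    bytes_per_order = mbit_per_order * 1_000_000 / 8
    return bytes_per_order / ops / 1000


print(f"{required_mbit_per_s(TPS, MBIT_PER_ORDER):.0f} Mbit/s")               # 542 Mbit/s
print(f"{avg_kbyte_per_db_op(MBIT_PER_ORDER, DB_OPS_PER_ORDER):.2f} kB/op")   # 6.38 kB/op
```

Each individual DB operation is small, at roughly 6.4 kB; it is the 379 round trips per order that add up to the large per-order volume.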

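OK, so we now know the test needs about 542 Mbit/s of throughput. Next, let's measure the maximum network transfer rate between the app server and the DB: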

netperf -H 192.168.52.185 -l 30
[Screenshot: netperf result]

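The app server, the gateway, and the DB server all have gigabit NICs, yet the measured maximum is 646.81 Mbit/s.
On a network in poor condition, utilization should not exceed 30%. The 542 Mbit/s needed during the load test is about 84% of the measured maximum, so heavy packet loss is to be expected.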
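A rough sketch of what these numbers imply for the TPS ceiling. It treats the 30% figure as a fraction of the measured 646.81 Mbit/s and assumes every order moves 19.359 Mbit; both are simplifications made for illustration.

```python
MEASURED_MAX_MBIT_S = 646.81   # netperf result between app server and DB
REQUIRED_MBIT_S = 542.0        # traffic needed at 28 TPS
MBIT_PER_ORDER = 19.359        # DB traffic per order request
SAFE_UTILIZATION = 0.30        # rule of thumb quoted above


def utilization(required: float, capacity: float) -> float:
    """Fraction of the link capacity that the workload would consume."""
    return required / capacity


def tps_ceiling(capacity: float, mbit_per_order: float, max_util: float = 1.0) -> float:
    """Highest TPS the link can carry at the given utilization cap."""
    return capacity * max_util / mbit_per_order


print(f"utilization at 28 TPS: {utilization(REQUIRED_MBIT_S, MEASURED_MAX_MBIT_S):.1%}")
print(f"TPS ceiling at 100%:   {tps_ceiling(MEASURED_MAX_MBIT_S, MBIT_PER_ORDER):.1f}")
print(f"TPS ceiling at 30%:    {tps_ceiling(MEASURED_MAX_MBIT_S, MBIT_PER_ORDER, SAFE_UTILIZATION):.1f}")
```

Under these assumptions the link tops out at roughly 33 TPS even at full utilization, and only about 10 TPS stays within the 30% guideline, which is consistent with TPS stalling at 28 regardless of how many app servers were added.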

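Network traffic on the DB side during the actual load test:
[Screenshot: DB-side network traffic during the load test]
This is where the bottleneck is.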

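Follow-up

However, the traffic records from Taobao's load test show a DB-side peak of only 250 Mbit/s, nowhere near the volume we generated. This is very strange...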
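To see how large the discrepancy is, the figures can be cross-checked against each other. This is only a sketch under a strong assumption: that each order in Taobao's test moved the same 19.359 Mbit as in our capture, which may well not hold for their environment.

```python
MBIT_PER_ORDER = 19.359      # per-order DB traffic from our packet capture
TAOBAO_PEAK_MBIT_S = 250.0   # DB-side peak in Taobao's traffic records
TAOBAO_QPS = 35              # figure quoted in Taobao's email


def implied_tps(peak_mbit_s: float, mbit_per_order: float) -> float:
    """TPS that a given DB traffic peak would correspond to."""
    return peak_mbit_s / mbit_per_order


def implied_traffic(qps: float, mbit_per_order: float) -> float:
    """DB traffic in Mbit/s that a given QPS would generate."""
    return qps * mbit_per_order


print(f"250 Mbit/s corresponds to {implied_tps(TAOBAO_PEAK_MBIT_S, MBIT_PER_ORDER):.1f} TPS")
print(f"35 QPS would need {implied_traffic(TAOBAO_QPS, MBIT_PER_ORDER):.0f} Mbit/s")
```

If the per-order volume were the same, 250 Mbit/s would correspond to about 13 TPS, while 35 QPS would need about 678 Mbit/s. The two Taobao figures cannot both be true under that assumption, so the per-order traffic in their test was probably different from ours.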