01
vhost-user
Why DPDK was proposed, and its design philosophy
With the continuous emergence of Internet applications, the bandwidth of network devices and servers has grown rapidly, evolving quickly from 1G to 10G and on to 25G, 40G, and 100G.
At the same time, driven by the rapid development of cloud computing and system virtualization, the network-processing capability demanded of virtual machines has also grown year by year.
Moreover, this is not limited to the server space: many network services are moving toward virtualization, resource pooling, and the cloud. Routers, switches, firewalls, base stations, and other common network functions that used to be hardware appliances are gradually being virtualized, and the industry urgently needs software solutions suited to high-performance networking.
Traditional Linux-based network services are kernel-based, meaning every packet sent or received passes through the kernel protocol stack. Under heavy traffic and high concurrency, frequent hardware interrupts degrade the kernel's packet-processing capacity, and copying data between kernel space and user space imposes a further significant load.
To push packet-processing capacity further, the Linux UIO (Userspace I/O, for user-space drivers) mechanism maps hardware access into user space, making it possible to drive the NIC hardware directly from user-space code.
Packets can then be processed without entering the Linux kernel at all, avoiding the interrupt storms and bulk data copies that plague high-speed packet processing in the kernel, so network-processing capability improves dramatically.
Bypassing the kernel and handling packets in user space not only removes the kernel's performance bottleneck; it is also easier to develop and maintain, and more flexible than a kernel-module implementation.
In addition, because the program runs in user space, a failure in it does not affect the Linux kernel or other parts of the system, which improves overall system availability.
Meeting the demand for a high-performance network software framework by processing packets at the application layer on top of Linux UIO: this is precisely the core design idea of Intel DPDK.
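The user-space access pattern that UIO enables can be sketched in a few lines of C: open the device file, mmap its register window, and read or write registers through the mapping. The sketch below is illustrative, not DPDK code; a real driver would open /dev/uioN and map a NIC register BAR, while here an ordinary temp file stands in for the device so the sketch can run anywhere, and all function names are made up.

```c
/* Sketch of the UIO access pattern used by user-space drivers:
 * open the device file, mmap one page of it, then access
 * "registers" through the mapping.  A real driver opens /dev/uioN;
 * here an ordinary temp file stands in so the sketch runs anywhere. */
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one page of the given "device" file read/write. */
static volatile uint32_t *
uio_map_registers(const char *path, int *out_fd)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;
    void *base = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    *out_fd = fd;
    return (volatile uint32_t *)base;
}

/* Write then read back a "register" at a 32-bit word offset. */
static uint32_t
uio_rw_register(volatile uint32_t *regs, unsigned int idx, uint32_t val)
{
    regs[idx] = val;   /* on real hardware this would hit the NIC BAR */
    return regs[idx];
}

/* Create a page-sized temp file standing in for /dev/uio0. */
static int
make_fake_uio(char *path_tmpl)
{
    int fd = mkstemp(path_tmpl);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, 4096) < 0) {
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}
```

Once the mapping is established, register access is plain memory access with no system call on the data path, which is what makes the poll-mode approach described below feasible.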
A first look at DPDK applications
Figure 1: the Linux kernel processing path (slow path) versus the DPDK processing path (fast path)
Figure 1 contrasts the Linux kernel-based data processing path (slow path) with the DPDK user-space data processing path (fast path).
It is worth stressing that Intel DPDK, once compiled, is really a set of library files for applications to call and link against. After a DPDK-based application is compiled and linked with the DPDK libraries, it can perform high-speed packet processing.
These DPDK applications all run in user space. Through the Linux UIO mechanism plus a small number of kernel modules (on Intel x86, for example, the igb_uio kernel driver must be loaded in advance), such user-space DPDK applications bypass the Linux kernel and drive the physical NIC directly to send and receive network data at high speed.
To date, DPDK supports multiple hardware architectures, including Intel x86, Arm, and PowerPC. It has been widely adopted by network-equipment vendors and Internet companies, and many DPDK-based products and services are already in production use.
One of Intel DPDK's core techniques is the PMD (poll mode driver, running in user space). By avoiding interrupts and moving data in and out of application buffers with zero copies, it further raises send and receive efficiency: the user-space PMD removes interrupts and kernel/user-space memory copies, reducing system overhead and increasing I/O throughput.
Analyzing the l2fwd example program gives a good feel for the basic structure of a DPDK application.
The code is in dpdk-stable-18.11.11/examples/l2fwd/main.c. l2fwd is a simple DPDK application: in main, after rte_eal_init initializes the DPDK runtime environment, l2fwd_launch_one_lcore is launched so that l2fwd_main_loop runs on every CPU core.
l2fwd_main_loop sits in an infinite loop, constantly polling all ports of the NIC bound to its CPU core and calling rte_eth_tx_buffer_flush and rte_eth_rx_burst to send and receive packets and complete the forwarding.
A CPU core claimed by a DPDK application is no longer scheduled by the Linux kernel; it is monopolized by l2fwd_main_loop. This is why packet-processing capability improves so much compared with the original kernel-based approach. For the concrete implementation of DPDK's PMD technique, see the function eth_igb_dev_init in dpdk-stable-18.11.11/drivers/net/e1000/igb_ethdev.c; the PMD internals are part of DPDK itself and, for reasons of space, are not covered here.
int
main(int argc, char **argv)
{
	……
	/* init EAL */
	ret = rte_eal_init(argc, argv);
	if (ret < 0)
		rte_exit(EXIT_FAILURE, "Invalid EAL arguments\n");
	……
	/* launch per-lcore init on every lcore */
	rte_eal_mp_remote_launch(l2fwd_launch_one_lcore, NULL, CALL_MASTER);
	……
}

static void
l2fwd_main_loop(void)
{
	……
	if (qconf->n_rx_port == 0) {
		RTE_LOG(INFO, L2FWD, "lcore %u has nothing to do\n", lcore_id);
		return;
	}

	RTE_LOG(INFO, L2FWD, "entering main loop on lcore %u\n", lcore_id);

	for (i = 0; i < qconf->n_rx_port; i++) {
		portid = qconf->rx_port_list[i];
		RTE_LOG(INFO, L2FWD, " -- lcoreid=%u portid=%u\n", lcore_id,
			portid);
	}

	while (!force_quit) {
		cur_tsc = rte_rdtsc();

		/*
		 * TX burst queue drain
		 */
		diff_tsc = cur_tsc - prev_tsc;
		if (unlikely(diff_tsc > drain_tsc)) {
			for (i = 0; i < qconf->n_rx_port; i++) {
				portid = l2fwd_dst_ports[qconf->rx_port_list[i]];
				buffer = tx_buffer[portid];
				sent = rte_eth_tx_buffer_flush(portid, 0, buffer);
				if (sent)
					port_statistics[portid].tx += sent;
			}

			/* if timer is enabled */
			if (timer_period > 0) {
				/* advance the timer */
				timer_tsc += diff_tsc;

				/* if timer has reached its timeout */
				if (unlikely(timer_tsc >= timer_period)) {
					/* do this only on master core */
					if (lcore_id == rte_get_master_lcore()) {
						print_stats();
						/* reset the timer */
						timer_tsc = 0;
					}
				}
			}

			prev_tsc = cur_tsc;
		}

		/*
		 * Read packet from RX queues
		 */
		for (i = 0; i < qconf->n_rx_port; i++) {
			portid = qconf->rx_port_list[i];
			nb_rx = rte_eth_rx_burst(portid, 0,
						 pkts_burst, MAX_PKT_BURST);

			port_statistics[portid].rx += nb_rx;

			for (j = 0; j < nb_rx; j++) {
				m = pkts_burst[j];
				rte_prefetch0(rte_pktmbuf_mtod(m, void *));
				l2fwd_simple_forward(m, portid);
			}
		}
	}
}
In the guest, the virtio-net frontend is DPDK's virtio PMD. Its PCI probe entry is eth_virtio_pci_probe in dpdk-stable-18.11.11/drivers/net/virtio/virtio_ethdev.c; note that it skips probing when the device is configured to work in vDPA mode.
static int
eth_virtio_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
	struct rte_pci_device *pci_dev)
{
	……
	/* virtio pmd skips probe if device needs to work in vdpa mode */
	if (vdpa_mode_selected(pci_dev->device.devargs))
		return 1;

	return rte_eth_dev_pci_generic_probe(pci_dev, sizeof(struct virtio_hw),
		eth_virtio_dev_init);
}
eth_virtio_dev_init in turn calls virtio_init_device, which resets the device and negotiates features with the backend.
/* reset device and renegotiate features if needed */
static int
virtio_init_device(struct rte_eth_dev *eth_dev, uint64_t req_features)
{
	struct virtio_hw *hw = eth_dev->data->dev_private;
	struct virtio_net_config *config;
	struct virtio_net_config local_config;
	struct rte_pci_device *pci_dev = NULL;
	int ret;

	/* Reset the device although not necessary at startup */
	vtpci_reset(hw);

	if (hw->vqs) {
		virtio_dev_free_mbufs(eth_dev);
		virtio_free_queues(hw);
	}

	/* Tell the host we've noticed this device. */
	vtpci_set_status(hw, VIRTIO_CONFIG_STATUS_ACK);

	/* Tell the host we've known how to drive the device. */
	vtpci_set_status(hw, VIRTIO_CONFIG_STATUS_DRIVER);
	if (virtio_negotiate_features(hw, req_features) < 0)
		return -1;

	if (!hw->virtio_user_dev) {
		pci_dev = RTE_ETH_DEV_TO_PCI(eth_dev);
		rte_eth_copy_pci_info(eth_dev, pci_dev);
	}

	/* If host does not support both status and MSI-X then disable LSC */
	if (vtpci_with_feature(hw, VIRTIO_NET_F_STATUS) &&
	    hw->use_msix != VIRTIO_MSIX_NONE)
		eth_dev->data->dev_flags |= RTE_ETH_DEV_INTR_LSC;
	else
		eth_dev->data->dev_flags &= ~RTE_ETH_DEV_INTR_LSC;

	/* Setting up rx_header size for the device */
	if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF) ||
	    vtpci_with_feature(hw, VIRTIO_F_VERSION_1))
		hw->vtnet_hdr_size = sizeof(struct virtio_net_hdr_mrg_rxbuf);
	else
		hw->vtnet_hdr_size = sizeof(struct virtio_net_hdr);

	/* Copy the permanent MAC address to: virtio_hw */
	virtio_get_hwaddr(hw);
	ether_addr_copy((struct ether_addr *)hw->mac_addr,
			&eth_dev->data->mac_addrs[0]);
	PMD_INIT_LOG(DEBUG,
		     "PORT MAC: %02X:%02X:%02X:%02X:%02X:%02X",
		     hw->mac_addr[0], hw->mac_addr[1], hw->mac_addr[2],
		     hw->mac_addr[3], hw->mac_addr[4], hw->mac_addr[5]);

	if (vtpci_with_feature(hw, VIRTIO_NET_F_CTRL_VQ)) {
		config = &local_config;

		vtpci_read_dev_config(hw,
			offsetof(struct virtio_net_config, mac),
			&config->mac, sizeof(config->mac));

		if (vtpci_with_feature(hw, VIRTIO_NET_F_STATUS)) {
			vtpci_read_dev_config(hw,
				offsetof(struct virtio_net_config, status),
				&config->status, sizeof(config->status));
		} else {
			PMD_INIT_LOG(DEBUG,
				     "VIRTIO_NET_F_STATUS is not supported");
			config->status = 0;
		}

		if (vtpci_with_feature(hw, VIRTIO_NET_F_MQ)) {
			vtpci_read_dev_config(hw,
				offsetof(struct virtio_net_config, max_virtqueue_pairs),
				&config->max_virtqueue_pairs,
				sizeof(config->max_virtqueue_pairs));
		} else {
			PMD_INIT_LOG(DEBUG,
				     "VIRTIO_NET_F_MQ is not supported");
			config->max_virtqueue_pairs = 1;
		}

		hw->max_queue_pairs = config->max_virtqueue_pairs;

		if (vtpci_with_feature(hw, VIRTIO_NET_F_MTU)) {
			vtpci_read_dev_config(hw,
				offsetof(struct virtio_net_config, mtu),
				&config->mtu,
				sizeof(config->mtu));

			/*
			 * MTU value has already been checked at negotiation
			 * time, but check again in case it has changed since
			 * then, which should not happen.
			 */
			if (config->mtu < ETHER_MIN_MTU) {
				PMD_INIT_LOG(ERR, "invalid max MTU value (%u)",
						config->mtu);
				return -1;
			}

			hw->max_mtu = config->mtu;
			/* Set initial MTU to maximum one supported by vhost */
			eth_dev->data->mtu = config->mtu;
		} else {
			hw->max_mtu = VIRTIO_MAX_RX_PKTLEN - ETHER_HDR_LEN -
				VLAN_TAG_LEN - hw->vtnet_hdr_size;
		}

		PMD_INIT_LOG(DEBUG, "config->max_virtqueue_pairs=%d",
				config->max_virtqueue_pairs);
		PMD_INIT_LOG(DEBUG, "config->status=%d", config->status);
		PMD_INIT_LOG(DEBUG,
				"PORT MAC: %02X:%02X:%02X:%02X:%02X:%02X",
				config->mac[0], config->mac[1],
				config->mac[2], config->mac[3],
				config->mac[4], config->mac[5]);
	} else {
		PMD_INIT_LOG(DEBUG, "config->max_virtqueue_pairs=1");
		hw->max_queue_pairs = 1;
		hw->max_mtu = VIRTIO_MAX_RX_PKTLEN - ETHER_HDR_LEN -
			VLAN_TAG_LEN - hw->vtnet_hdr_size;
	}

	ret = virtio_alloc_queues(eth_dev);
	if (ret < 0)
		return ret;

	if (eth_dev->data->dev_conf.intr_conf.rxq) {
		if (virtio_configure_intr(eth_dev) < 0) {
			PMD_INIT_LOG(ERR, "failed to configure interrupt");
			virtio_free_queues(hw);
			return -1;
		}
	}

	vtpci_reinit_complete(hw);

	if (pci_dev)
		PMD_INIT_LOG(DEBUG, "port %d vendorID=0x%x deviceID=0x%x",
			eth_dev->data->port_id, pci_dev->id.vendor_id,
			pci_dev->id.device_id);

	return 0;
}
When a vhost-user network interface is added in OVS-DPDK, OVS-DPDK calls rte_vhost_driver_register. This first creates a unix domain socket at the file path passed in as a parameter; the socket is used for all subsequent communication between the virtio-net frontend (the DPDK application inside the QEMU virtual machine) and the virtio-net backend (the OVS-DPDK application).
The relevant source is around line 824 of dpdk-stable-18.11.11/lib/librte_vhost/socket.c; the main function is rte_vhost_driver_register.
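What the socket-creation step boils down to can be sketched with plain POSIX socket calls. This is an illustrative sketch, not the DPDK implementation: the function names vhost_listen_socket and vhost_connect_socket are made up, and the real create_unix_socket additionally distinguishes server mode from client mode and hooks the fd into an event loop.

```c
/* Sketch of a vhost-user style unix domain socket rendezvous:
 * the backend listens on a path, the frontend connects to it.
 * Names are illustrative, not DPDK API. */
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Backend/server side: bind and listen on `path`. */
static int
vhost_listen_socket(const char *path)
{
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_un un;
    memset(&un, 0, sizeof(un));
    un.sun_family = AF_UNIX;
    strncpy(un.sun_path, path, sizeof(un.sun_path) - 1);

    unlink(path);  /* remove a stale socket file, if any */
    if (bind(fd, (struct sockaddr *)&un, sizeof(un)) < 0 ||
        listen(fd, 1) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Frontend/client side: what QEMU does when it connects
 * to the backend's socket path. */
static int
vhost_connect_socket(const char *path)
{
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_un un;
    memset(&un, 0, sizeof(un));
    un.sun_family = AF_UNIX;
    strncpy(un.sun_path, path, sizeof(un.sun_path) - 1);

    if (connect(fd, (struct sockaddr *)&un, sizeof(un)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

In the real protocol, vhost-user control messages (feature negotiation, memory-region tables, virtqueue addresses) then flow over this connection, with file descriptors passed as ancillary data.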
/*
 * Register a new vhost-user socket; here we could act as server
 * (the default case), or client (when RTE_VHOST_USER_CLIENT) flag
 * is set.
 */
int
rte_vhost_driver_register(const char *path, uint64_t flags)
{
	int ret = -1;
	struct vhost_user_socket *vsocket;

	if (!path)
		return -1;

	pthread_mutex_lock(&vhost_user.mutex);

	if (vhost_user.vsocket_cnt == MAX_VHOST_SOCKET) {
		RTE_LOG(ERR, VHOST_CONFIG,
			"error: the number of vhost sockets reaches maximum\n");
		goto out;
	}

	vsocket = malloc(sizeof(struct vhost_user_socket));
	if (!vsocket)
		goto out;
	memset(vsocket, 0, sizeof(struct vhost_user_socket));
	vsocket->path = strdup(path);
	if (vsocket->path == NULL) {
		RTE_LOG(ERR, VHOST_CONFIG,
			"error: failed to copy socket path string\n");
		vhost_user_socket_mem_free(vsocket);
		goto out;
	}
	TAILQ_INIT(&vsocket->conn_list);
	ret = pthread_mutex_init(&vsocket->conn_mutex, NULL);
	if (ret) {
		RTE_LOG(ERR, VHOST_CONFIG,
			"error: failed to init connection mutex\n");
		goto out_free;
	}
	vsocket->vdpa_dev_id = -1;
	vsocket->dequeue_zero_copy = flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY;

	if (vsocket->dequeue_zero_copy &&
	    (flags & RTE_VHOST_USER_IOMMU_SUPPORT)) {
		RTE_LOG(ERR, VHOST_CONFIG,
			"error: enabling dequeue zero copy and IOMMU features "
			"simultaneously is not supported\n");
		goto out_mutex;
	}

	/*
	 * Set the supported features correctly for the builtin vhost-user
	 * net driver.
	 *
	 * Applications know nothing about features the builtin virtio net
	 * driver (virtio_net.c) supports, thus it's not possible for them
	 * to invoke rte_vhost_driver_set_features(). To workaround it, here
	 * we set it unconditionally. If the application want to implement
	 * another vhost-user driver (say SCSI), it should call the
	 * rte_vhost_driver_set_features(), which will overwrite following
	 * two values.
	 */
	vsocket->use_builtin_virtio_net = true;
	vsocket->supported_features = VIRTIO_NET_SUPPORTED_FEATURES;
	vsocket->features           = VIRTIO_NET_SUPPORTED_FEATURES;
	vsocket->protocol_features  = VHOST_USER_PROTOCOL_FEATURES;

	/*
	 * Dequeue zero copy can't assure descriptors returned in order.
	 * Also, it requires that the guest memory is populated, which is
	 * not compatible with postcopy.
	 */
	if (vsocket->dequeue_zero_copy) {
		if ((flags & RTE_VHOST_USER_CLIENT) != 0)
			RTE_LOG(WARNING, VHOST_CONFIG,
				"zero copy may be incompatible with vhost client mode\n");

		vsocket->supported_features &= ~(1ULL << VIRTIO_F_IN_ORDER);
		vsocket->features &= ~(1ULL << VIRTIO_F_IN_ORDER);

		vsocket->protocol_features &=
			~(1ULL << VHOST_USER_PROTOCOL_F_PAGEFAULT);
	}

	……

	if ((flags & RTE_VHOST_USER_CLIENT) != 0) {
		vsocket->reconnect = !(flags & RTE_VHOST_USER_NO_RECONNECT);
		if (vsocket->reconnect && reconn_tid == 0) {
			if (vhost_user_reconnect_init() != 0)
				goto out_mutex;
		}
	} else {
		vsocket->is_server = true;
	}
	ret = create_unix_socket(vsocket);
	if (ret < 0)
		goto out_mutex;

	vhost_user.vsockets[vhost_user.vsocket_cnt++] = vsocket;

	pthread_mutex_unlock(&vhost_user.mutex);
	return ret;

out_mutex:
	……
out_free:
	……
out:
	pthread_mutex_unlock(&vhost_user.mutex);
	return ret;
}
Once a frontend connects, vhost-user messages arriving on the connection are handled by vhost_user_read_cb, which calls vhost_user_msg_handler and, when the connection breaks, tears the device down and reconnects in client mode.
static void
vhost_user_read_cb(int connfd, void *dat, int *remove)
{
	struct vhost_user_connection *conn = dat;
	struct vhost_user_socket *vsocket = conn->vsocket;
	int ret;

	ret = vhost_user_msg_handler(conn->vid, connfd);
	if (ret < 0) {
		struct virtio_net *dev = get_device(conn->vid);

		close(connfd);
		*remove = 1;

		if (dev)
			vhost_destroy_device_notify(dev);

		if (vsocket->notify_ops->destroy_connection)
			vsocket->notify_ops->destroy_connection(conn->vid);

		vhost_destroy_device(conn->vid);

		if (vsocket->reconnect) {
			create_unix_socket(vsocket);
			vhost_user_start_client(vsocket);
		}

		pthread_mutex_lock(&vsocket->conn_mutex);
		TAILQ_REMOVE(&vsocket->conn_list, conn, next);
		pthread_mutex_unlock(&vsocket->conn_mutex);

		free(conn);
	}
}
The receive and send functions are eth_vhost_rx and eth_vhost_tx respectively; their implementation is in dpdk-stable-18.11.11/drivers/net/vhost/rte_eth_vhost.c.
When eth_dev_vhost_create creates the vhost-user backend device, it registers eth_vhost_rx and eth_vhost_tx as the receive and send callbacks.
Both the send and receive paths are implemented inside DPDK, and the code is quite readable. Just bear in mind that the same DPDK code runs in different places: the frontend is the DPDK application inside the QEMU virtual machine, while the backend runs inside OVS-DPDK.
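The registration of eth_vhost_rx and eth_vhost_tx follows a plain function-pointer pattern. The sketch below mimics it with a made-up fake_eth_dev struct (not DPDK's rte_eth_dev): a create step wires in the burst callbacks once, and the fast path then calls through the pointers without knowing which driver sits behind them.

```c
/* Illustrative sketch (not DPDK API): an ethdev-like struct whose
 * rx/tx burst entry points are function pointers, mirroring
 * eth_dev->rx_pkt_burst = eth_vhost_rx in eth_dev_vhost_create. */
#include <stdint.h>

struct fake_eth_dev;
typedef uint16_t (*burst_fn)(struct fake_eth_dev *dev,
                             void **pkts, uint16_t nb);

struct fake_eth_dev {
    burst_fn rx_pkt_burst;   /* filled in by the driver at create time */
    burst_fn tx_pkt_burst;
    uint64_t rx_total;       /* per-device statistics */
    uint64_t tx_total;
};

/* Stand-ins for eth_vhost_rx / eth_vhost_tx: just count packets. */
static uint16_t
demo_rx_burst(struct fake_eth_dev *dev, void **pkts, uint16_t nb)
{
    (void)pkts;
    dev->rx_total += nb;
    return nb;
}

static uint16_t
demo_tx_burst(struct fake_eth_dev *dev, void **pkts, uint16_t nb)
{
    (void)pkts;
    dev->tx_total += nb;
    return nb;
}

/* "Device create": assign the burst callbacks once; generic code
 * then drives any driver through the same two pointers. */
static void
demo_dev_create(struct fake_eth_dev *dev)
{
    dev->rx_pkt_burst = demo_rx_burst;
    dev->tx_pkt_burst = demo_tx_burst;
    dev->rx_total = 0;
    dev->tx_total = 0;
}
```

This indirection is what lets the same polling loop work unchanged whether the port is a physical NIC, a vhost-user backend, or a virtio frontend.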
static int
eth_dev_vhost_create(struct rte_vdev_device *dev, char *iface_name,
	int16_t queues, const unsigned int numa_node, uint64_t flags)
{
	const char *name = rte_vdev_device_name(dev);
	struct rte_eth_dev_data *data;
	struct pmd_internal *internal = NULL;
	struct rte_eth_dev *eth_dev = NULL;
	struct ether_addr *eth_addr = NULL;
	struct rte_vhost_vring_state *vring_state = NULL;
	struct internal_list *list = NULL;

	VHOST_LOG(INFO, "Creating VHOST-USER backend on numa socket %u\n",
		numa_node);

	list = rte_zmalloc_socket(name, sizeof(*list), 0, numa_node);
	if (list == NULL)
		goto error;

	/* reserve an ethdev entry */
	eth_dev = rte_eth_vdev_allocate(dev, sizeof(*internal));
	if (eth_dev == NULL)
		goto error;
	data = eth_dev->data;

	eth_addr = rte_zmalloc_socket(name, sizeof(*eth_addr), 0, numa_node);
	if (eth_addr == NULL)
		goto error;
	data->mac_addrs = eth_addr;
	*eth_addr = base_eth_addr;
	eth_addr->addr_bytes[5] = eth_dev->data->port_id;

	vring_state = rte_zmalloc_socket(name,
			sizeof(*vring_state), 0, numa_node);
	if (vring_state == NULL)
		goto error;

	/* now put it all together
	 * - store queue data in internal,
	 * - point eth_dev_data to internals
	 * - and point eth_dev structure to new eth_dev_data structure
	 */
	internal = eth_dev->data->dev_private;
	internal->dev_name = strdup(name);
	if (internal->dev_name == NULL)
		goto error;
	internal->iface_name = rte_malloc_socket(name, strlen(iface_name) + 1,
						 0, numa_node);
	if (internal->iface_name == NULL)
		goto error;
	strcpy(internal->iface_name, iface_name);

	list->eth_dev = eth_dev;
	pthread_mutex_lock(&internal_list_lock);
	TAILQ_INSERT_TAIL(&internal_list, list, next);
	pthread_mutex_unlock(&internal_list_lock);

	rte_spinlock_init(&vring_state->lock);
	vring_states[eth_dev->data->port_id] = vring_state;

	data->nb_rx_queues = queues;
	data->nb_tx_queues = queues;
	internal->max_queues = queues;
	internal->vid = -1;
	data->dev_link = pmd_link;
	data->dev_flags = RTE_ETH_DEV_INTR_LSC;

	eth_dev->dev_ops = &ops;

	/* finally assign rx and tx ops */
	eth_dev->rx_pkt_burst = eth_vhost_rx;
	eth_dev->tx_pkt_burst = eth_vhost_tx;

	if (rte_vhost_driver_register(iface_name, flags))
		goto error;

	if (rte_vhost_driver_callback_register(iface_name, &vhost_ops) < 0) {
		VHOST_LOG(ERR, "Can't register callbacks\n");
		goto error;
	}

	if (rte_vhost_driver_start(iface_name) < 0) {
		VHOST_LOG(ERR, "Failed to start driver for %s\n",
			iface_name);
		goto error;
	}

	rte_eth_dev_probing_finish(eth_dev);
	return data->port_id;

error:
	if (internal) {
		rte_free(internal->iface_name);
		free(internal->dev_name);
	}
	rte_free(vring_state);
	rte_eth_dev_release_port(eth_dev);
	rte_free(list);

	return -1;
}
02
vDPA
In the DPDK world, vhost-user is a user-space implementation of virtio-net. Compared with the kernel implementation it improves packet-processing capability, but it is still a pure software solution, and software alone cannot match a hardware implementation on performance.
To push performance further, Intel also introduced vDPA (vHost Data Path Acceleration), a hardware solution: the NIC interacts with the virtio virtqueues inside the virtual machine directly and DMAs packets straight into the guest's buffers, achieving true zero copy while remaining compatible with the virtio standard.
DPDK has supported the vDPA feature since release 18.05. The vDPA data path is implemented mainly in the NIC, which effectively moves the virtio backend into the NIC hardware.
So here we only look at how the virtio frontend is associated with a vDPA device. The relevant code is in dpdk-stable-18.11.11/examples/vdpa/main.c: in the start_vdpa function (line 143), rte_vhost_driver_attach_vdpa_device associates the vhost-user vsocket data structure with a vDPA device id.
static int
start_vdpa(struct vdpa_port *vport)
{
	int ret;
	char *socket_path = vport->ifname;
	int did = vport->did;

	if (client_mode)
		vport->flags |= RTE_VHOST_USER_CLIENT;

	if (access(socket_path, F_OK) != -1 && !client_mode) {
		RTE_LOG(ERR, VDPA,
			"%s exists, please remove it or specify another file and try again.\n",
			socket_path);
		return -1;
	}
	ret = rte_vhost_driver_register(socket_path, vport->flags);
	if (ret != 0)
		rte_exit(EXIT_FAILURE,
			"register driver failed: %s\n",
			socket_path);
	ret = rte_vhost_driver_callback_register(socket_path,
		&vdpa_sample_devops);
	if (ret != 0)
		rte_exit(EXIT_FAILURE,
			"register driver ops failed: %s\n",
			socket_path);
	ret = rte_vhost_driver_attach_vdpa_device(socket_path, did);
	if (ret != 0)
		rte_exit(EXIT_FAILURE,
			"attach vdpa device failed: %s\n",
			socket_path);
	if (rte_vhost_driver_start(socket_path) < 0)
		rte_exit(EXIT_FAILURE,
			"start vhost driver failed: %s\n",
			socket_path);
	return 0;
}
rte_vhost_driver_attach_vdpa_device itself, in dpdk-stable-18.11.11/lib/librte_vhost/socket.c, simply looks up the vsocket for the given path and records the vDPA device id in it.
int
rte_vhost_driver_attach_vdpa_device(const char *path, int did)
{
	struct vhost_user_socket *vsocket;

	if (rte_vdpa_get_device(did) == NULL || path == NULL)
		return -1;

	pthread_mutex_lock(&vhost_user.mutex);
	vsocket = find_vhost_user_socket(path);
	if (vsocket)
		vsocket->vdpa_dev_id = did;
	pthread_mutex_unlock(&vhost_user.mutex);

	return vsocket ? 0 : -1;
}
Original title: virtio技术的演进和发展 (2/2)
Source: WeChat public account "Linux阅码场". You are welcome to follow the account; please credit the source when reprinting.