NFSv3 Linux RDMA Server Design

Jonathan Bauman (baumanj@umich.edu), CITI


Goals

Primary

Secondary

Evaluation Criteria

Dependencies

Platform

Development will be based on a dual-processor Dell PowerEdge 2650 running SuSE Linux Professional version 9.1 and a stock Linux 2.6.9 kernel. This platform was chosen to provide a common reference among CITI, Network Appliance, and Mellanox, in the hope of minimizing the difficulty of resolving issues with the supporting software and hardware. Since this design is concerned only with the NFS version 3 server, Chuck Lever's RPC client transport switch patches and the NFS version 4 patches from Trond Myklebust and CITI will not be included.

Design

This section describes the initial design for the modifications to the Linux NFSv3 server necessary to support RDMA transport. Since NFSD and the underlying RPC server code were not initially designed to support non-socket transport types, significant modification will be required to achieve a suitably robust and general solution. As much as possible, modification to NFS and NFSv3-specific code is avoided in favor of modification to the shared RPC subsystem. This approach should serve to ease the development of RDMA-enabled versions of other RPC programs, particularly the other versions of the NFS program. Furthermore, as inclusion of RDMA transport code in the Linux kernel is a goal, appropriate code structure and style will drive design decisions. In particular, modularity will be preferred to quickness of implementation. The goal of this development is a functioning server suitable for eventual inclusion in the mainline kernel rather than a prototype.

There are two basic approaches envisioned for adding RDMA transport support to the Linux NFSv3 server. The first is to add new RDMA-specific data structures and functions without perturbing the socket-oriented code, enabling the RDMA code paths via conditional switching. However, due to the level of integration of socket-specific code in the RPC layer, this would allow virtually no reuse of existing RPC code and would necessitate significant modifications at the NFS layer. Furthermore, this approach is unlikely to be acceptable to the Linux kernel maintainers. The other approach is to add another layer of abstraction in the RPC layer, dividing it into a unified state management layer and an abstract transport layer. To an extent, this division already exists to allow the RPC socket interface to use both TCP and UDP transports. This design proposes to isolate all socket-specific code and replace it with a generalized interface that can be implemented by an RDMA transport as well as by sockets. In this respect it is similar to Chuck Lever's RPC client transport switch. Rather than attempting to create a fully abstracted transport interface before beginning development, RPC functionality will be abstracted and RDMA versus socket implementations will be isolated as needed during development. The final goal is a completely abstract interface, devoid of socket-specific code and suitable for implementing new transport types. However, in order to speed development toward a working implementation and to minimize the creation of unnecessary abstraction, this incremental approach is preferred.

In order to quantify development progress as well as simplify the design and development tasks, the implementation has been divided into three stages. Though development work will certainly overlap between them, each stage is characterized by the level of RDMA functionality provided.

  1. Listening for and accepting a connection

    This stage will involve no transfer of RPC or NFS data. It is simply concerned with configuring the RDMA hardware and software to listen for and accept a connection from an RDMA peer.

    The extent of modification to NFS-specific code will be to replace calls to

    int svc_makesock(struct svc_serv *serv, int protocol, unsigned short port)
    with
    int svc_makexprt(struct svc_serv *serv, int protocol, unsigned short port)
    and add another call with the RDMA transport as the parameter. The NFS server should then accept connections on the designated port via TCP, UDP, and RDMA. The new svc_makexprt function will operate in largely the same way as svc_makesock did for socket-based transports, continuing to call svc_create_socket; for RDMA, it will begin the sequence of calls that open the interface adapter, register memory, and create an endpoint listening for connections. All RDMA-specific transport code will be isolated in a separate file. Socket-specific code will likewise be gradually migrated to a separate file, leaving only transport-agnostic code in svc.c, and a new file, svcxprt.c, will be created to handle the transport interface.
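    As a rough illustration of the intended dispatch, the sketch below shows one possible shape for svc_makexprt. The RPC_XPRT_RDMA identifier and the svc_rdma_create_listener helper are placeholders invented for this sketch; only svc_makesock exists in the stock 2.6.9 tree, and the final form will depend on how the transport interface evolves.

        #include <linux/in.h>                  /* IPPROTO_TCP, IPPROTO_UDP */
        #include <linux/errno.h>
        #include <linux/sunrpc/svc.h>          /* struct svc_serv */
        #include <linux/sunrpc/svcsock.h>      /* svc_makesock() */

        /* Hypothetical: a new protocol identifier and an RDMA listener setup
         * helper; neither exists in the stock 2.6.9 tree. */
        #define RPC_XPRT_RDMA  256
        extern int svc_rdma_create_listener(struct svc_serv *serv,
                                            unsigned short port);

        /* Sketch only: possible dispatch logic for svc_makexprt(). */
        int svc_makexprt(struct svc_serv *serv, int protocol, unsigned short port)
        {
            switch (protocol) {
            case IPPROTO_TCP:
            case IPPROTO_UDP:
                /* Socket transports keep following the existing
                 * svc_create_socket() path, as svc_makesock() does today. */
                return svc_makesock(serv, protocol, port);
            case RPC_XPRT_RDMA:
                /* Open the interface adapter, register memory, and
                 * create an endpoint listening for connections. */
                return svc_rdma_create_listener(serv, port);
            default:
                return -EINVAL;
            }
        }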

    In order to integrate the RDMA implementation into the existing sockets-based code, significant reorganization of RPC structures is needed. svc_sock will be replaced by an abstract svc_xprt structure. As much of the existing structure as possible will be retained, and either a union of structures or a pointer to a structure holding transport-specific data will be added. Eventually, support will be added for registering new transport implementations loaded as kernel modules; registration of a new transport implementation will trigger a call to svc_makexprt for the appropriate protocol. svcsock.h will be reorganized into svcxprt.h, svcsock.h, and svcrdma.h. Less invasive modifications will also be made to the svc_serv and svc_rqst structures where they reference socket-specific structures. Finally, minor modifications will be required at all points in the RPC code that reference the old svc_sock structure or socket-specific fields of the other structures.
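    To make the intended reorganization concrete, the fragment below sketches what the abstract structure and a module registration interface might look like. Every name here (svc_xprt, xpt_private, svc_register_xprt_class, and so on) is a placeholder chosen for illustration; none of them exists in the current kernel, and the real layout will be settled during development.

        #include <linux/list.h>
        #include <linux/sunrpc/svc.h>

        /* Illustrative only; all names below are placeholders. */
        struct svc_xprt_ops;                    /* per-transport entry points (see below) */

        struct svc_xprt {                       /* replaces struct svc_sock */
            struct svc_xprt_ops *xpt_ops;       /* transport entry points */
            struct svc_serv     *xpt_server;    /* owning RPC service */
            struct list_head     xpt_ready;     /* queued for a server thread */
            unsigned long        xpt_flags;     /* BUSY, CONN, CLOSE, ... */
            void                *xpt_private;   /* socket- or RDMA-specific state */
        };

        /* A transport module (e.g. an svcrdma module) would register itself;
         * registration would trigger svc_makexprt() for the new protocol. */
        int  svc_register_xprt_class(const char *name, struct svc_xprt_ops *ops);
        void svc_unregister_xprt_class(const char *name);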

    The new svc_xprt structure will likely need to contain additional function pointers to provide the increased control required for RDMA. The exact details of the interface modifications are not specified at this time and will be driven by the needs of development. However, in all cases, minimizing divergence from the original RPC code is preferred.
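    Continuing the sketch above, the hypothetical svc_xprt_ops table might initially mirror the existing socket routines, with room to grow as RDMA demands more control (for example, completion-queue handling). Again, these names are illustrative only.

        /* Possible per-transport entry points; names are illustrative. */
        struct svc_xprt_ops {
            int  (*xpo_recvfrom)(struct svc_rqst *rqstp);    /* cf. svc_tcp_recvfrom */
            int  (*xpo_sendto)(struct svc_rqst *rqstp);      /* cf. svc_tcp_sendto */
            int  (*xpo_has_wspace)(struct svc_xprt *xprt);   /* may we send a reply now? */
            void (*xpo_detach)(struct svc_xprt *xprt);       /* stop accepting requests */
            void (*xpo_free)(struct svc_xprt *xprt);         /* release transport resources */
        };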

    Though the RDMA functionality achieved by completing this stage is relatively modest, it represents a significant reorganization of much of the underlying RPC code, as well as RDMA initialization routines that, while simple in function, are considerable in volume. As such, though the development time may seem long, this stage represents a major step forward in the implementation of the full RDMA server.

    Initial estimate of basic development time: 5 weeks

  2. Processing of inline NFSv3 requests

    This stage will call for the creation of RDMA-specific send and receive routines similar to svc_{tcp,udp}_{recvfrom,sendto}. Data for all requests and replies will be sent inline. This is similar to standard TCP operation, but will utilize RDMA Send operations. No major performance gain over TCP is expected at this stage, since no RDMA chunk operations are being performed; in fact, a moderate performance degradation may occur. The purpose of this stage is to provide further validation of the success of stage one and a solid framework of flow control on which to base stage three.
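    The skeletons below suggest how the RDMA routines might slot into the transport interface from stage one. svc_rdma_recvfrom and svc_rdma_sendto are hypothetical names patterned after their TCP counterparts, and the bodies are outlines rather than working code.

        #include <linux/sunrpc/svc.h>    /* struct svc_rqst, rq_arg, rq_res */

        /* Hypothetical inline-only RDMA transport routines (stage two). */
        static int svc_rdma_recvfrom(struct svc_rqst *rqstp)
        {
            /* 1. Dequeue a completed RDMA Receive from the hardware.
             * 2. Strip and validate the RPC-over-RDMA header.
             * 3. Place the inline RPC call data in rqstp->rq_arg, just as
             *    the socket receive paths do.
             * Returns the length of the RPC message, or 0 if none is ready. */
            return 0;
        }

        static int svc_rdma_sendto(struct svc_rqst *rqstp)
        {
            /* 1. Prepend an RPC-over-RDMA header to the reply in rqstp->rq_res.
             * 2. Post the reply inline as a single RDMA Send from a
             *    pre-registered buffer, subject to the peer's credit limit.
             * Returns the number of bytes sent, or a negative errno. */
            return 0;
        }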

    The main tasks in this stage will be registering the memory buffers used by RDMA Send operations and ensuring their proper management by both the RDMA hardware and the RPC/NFS software layers. Also, new code must be added to process RDMA headers; this does not appear to pose significant difficulty.
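    For reference, the header that this new code must parse has roughly the following shape. The struct and field names are illustrative placeholders; the authoritative layout is defined by the NFS/RDMA protocol specification, and all fields are big-endian on the wire.

        #include <linux/types.h>

        /* Simplified, illustrative view of the RPC-over-RDMA transport header. */
        struct rpcrdma_hdr {
            u32 rm_xid;      /* matches the XID of the RPC message that follows */
            u32 rm_vers;     /* RPC-over-RDMA protocol version */
            u32 rm_credit;   /* credits requested/granted for flow control */
            u32 rm_type;     /* RDMA_MSG, RDMA_NOMSG, RDMA_ERROR, ... */
            /* Followed by the read, write, and reply chunk lists; for the
             * inline-only stage these lists are expected to be empty. */
        };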

    This stage appears to be relatively simple to implement, yet it should yield a functional NFSv3-over-RDMA server.

    Initial estimate of basic development time: 3 weeks

  3. Full NFSv3 RDMA mode

    This stage will enable the use of RDMA Read and Write operations for large data transfers. At this point, socket-specific functionality should be completely abstracted out of the RPC interface, so any interface changes will be solely for the purpose of increasing the level of control for RDMA.

    The majority of this stage of development will involve encoding and decoding chunk lists and managing the memory associated with the RDMA Read/Write operations. Two factors will hopefully serve to simplify this implementation:

    1. RPC Page Management

      RPC manages request and response data with an xdr_buf structure, which contains an initial kvec structure followed by an array of contiguous pages. The initial kvec is used for RPC header data, as well as the data payload for short messages, while the list of pages is used exclusively for large data movement operations such as READ, READDIR and WRITE. George Feinberg proposed taking advantage of this fact in his design for the NFSv4 RDMA client. This allows the server to transparently determine when to use write chunks versus RDMA Send operations for RPC replies. A sketch of the xdr_buf layout follows this list.

    2. Server Memory Registration

      Since only the NFS RDMA server performs RDMA Read/Write operations, there is no perceived increase in security risk from pre-registering all server memory. This allows any desired memory region to be used for RDMA operations, eliminating the need to specialize the page buffer allocation schemes used by the RPC layer. However, before inclusion in the Linux kernel, the potential for spoofing of RDMA steering tags and the consequences of this memory registration strategy should be reconsidered.
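    The xdr_buf layout referred to above is, approximately, as declared in include/linux/sunrpc/xdr.h in the 2.6 tree (comments paraphrased here):

        /* Abridged from include/linux/sunrpc/xdr.h. */
        struct xdr_buf {
            struct kvec   head[1];    /* RPC header and other inline data */
            struct kvec   tail[1];    /* appended after the page data */

            struct page **pages;      /* array of contiguous pages */
            unsigned int  page_base;  /* offset of data within the first page */
            unsigned int  page_len;   /* length of the page data */

            unsigned int  buflen;     /* total length of the storage buffer */
            unsigned int  len;        /* length of the XDR-encoded message */
        };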

    The primary challenge in implementing RDMA operations is handling the different types of chunks and chunk lists. As mentioned previously, the xdr_buf structure is designed in a way that allows the server to separate large data payloads from inline data. The three chunk types currently in use will be handled as follows.

    Read chunks
    In client write operations, the server will perform RDMA Read using client-provided read chunks. The transport-specific receive function called by svc_recv will be responsible for interpreting the chunk list in the RDMA header and performing RDMA Read operations into the xdr_buf structure's page list. The result will be the same as if the data were received via TCP or UDP: the RPC and upper layer protocols should be unaffected.
    Write chunks
    In client read operations, the server will perform RDMA Write into client buffers directed by client-supplied write chunks. In this case, the transport-specific receive function called by svc_recv will be responsible for interpreting the chunk list in the RDMA header and storing the information in the transport-specific data structure attached to the svc_rqst structure. Subsequently, the transport-specific function called by svc_send will access the stored write chunk data and perform RDMA Write operations with the contents of the xdr_buf structure's page list. Again, there should be no effect on the rest of the RPC and upper layer protocols.
    Reply chunks
    For client requests whose inline replies are too large for RDMA Send operations, requests may be made with accompanying reply chunks, directing the server to RDMA Write the entire RPC reply into client buffers. As with write chunks, reply chunks in the RDMA header will be stored in the transport-specific svc_rqst substructure. The transport-specific function called by svc_send will check for the presence of reply chunks and, if present, use them to send the contents of the xdr_buf structure to the client via RDMA Write operations, followed by a null RDMA Send operation to indicate completion.
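    To tie these cases together, the fragment below sketches the decision logic the transport-specific send routine might apply. rqstp->rq_res is the existing xdr_buf reply; the rdma_req structure, its fields, and the svc_rdma_* helpers are placeholders invented for this sketch.

        /* Illustrative send-side chunk handling; all helper names and the
         * rdma_req fields (reply_chunk, write_chunks) are placeholders. */
        static int svc_rdma_send_reply(struct svc_rqst *rqstp, struct rdma_req *req)
        {
            struct xdr_buf *res = &rqstp->rq_res;

            if (req->reply_chunk) {
                /* Inline reply would be too large: RDMA Write the entire
                 * xdr_buf into the client's reply chunk, then issue a null
                 * RDMA Send to indicate completion. */
                svc_rdma_write_xdr_buf(req->reply_chunk, res);
                return svc_rdma_send_null(req);
            }

            if (req->write_chunks) {
                /* READ-style reply: RDMA Write the page-list payload into
                 * the client-supplied write chunks... */
                svc_rdma_write_pages(req->write_chunks, res->pages, res->page_len);
            }

            /* ...and send the inline portion (head/tail) with an RDMA Send. */
            return svc_rdma_send_inline(req, res);
        }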

    In determining where in the protocol stack to place the modifications for handling RDMA chunks, minimal collateral code impact and opacity to RPC upper layer protocols were the primary concerns. These concerns, along with the insight into the use of the xdr_buf structure, led to managing chunks at the RPC transport level. Placing control at the XDR layer was also investigated, but proved impractical due to differences in the XDR handling of read/write operations among NFS versions. For example, the NFSv3 server performs the read operation during XDR request decoding, whereas the NFSv4 server performs the same operation during XDR reply encoding. Handling chunks at the RPC transport layer should obviate the need for modifications to different NFS versions (or other RPC programs) while still achieving optimal performance.

    The experience of Network Appliance engineers indicates that this is the most difficult stage of RDMA development. That, coupled with CITI's lack of experience with RDMA, results in this stage having the largest development estimate.

    Initial estimate of basic development time: 6 weeks

Initial estimate of total development time: 14 weeks.

Note:

Estimates of development time in this section assume that all CITI work will be completed by one developer working alone; changes in personnel allocation may affect the schedule. Other factors that may affect the schedule include changes to the development platform and any delays incurred due to the dependencies described in the previous section. Finally, it should be emphasized that this is the first use of Linux as an RDMA server platform. This work is experimental in nature, and unforeseen complications should be expected. Though these estimates attempt to allow time for addressing the known difficulties of implementing an RDMA transport, they should still be treated as rough estimates. Insofar as possible, estimates will be revised as work progresses.

Additional Concerns

  1. As of yet, Mellanox has not been able to provide a kDAPL implementation that has been verified as operational on CITI hardware. As such, NetApp's NFSv3 RDMA client cannot yet be run. A working version is expected soon, but since this is a necessary component for testing the RDMA server, all CITI time spent installing and configuring new software from Mellanox and NetApp should be added to the raw server implementation estimate. For scheduling purposes, 1 week will be assumed for now.

  2. Network Appliance has requested that CITI give a presentation at Connectathon 2005 regarding experiences implementing the NFSv4.1 sessions work on Linux. Creating this presentation will require 1 week of work, and it must be completed early enough to allow for feedback from Network Appliance (1 week suggested) and timely submission to the Connectathon organizers (no date posted).

Schedule

Weeks 1-6 (24 January - 6 March): set up test client; Stage 1 (continuing into the following period); create and revise the Connectathon presentation. The OpenIB Developers Workshop and Connectathon fall within this period.

Weeks 7-12 (7 March - 17 April): Stage 1 (continued); Stage 2; Stage 3.

Weeks 13-18 (18 April - 29 May): Stage 3 (continued).