SC09 – Interesting Tech – Filesystems/Storage

All the usual suspects were visible in Portland this year. Including Panasas, Data Direct, Isilon, Lustre and IBM/GPFS. But we’ve all seen those all before. Two storage related technologies caught my eye at SC09 because I’d never seen them before.

I caught a technical session from a Korean company called Pspace. They developed a parallel filesystem called Infinistor for a couple of big Telco/ISPs in Korea. It’s a pretty straight-forward parallel filesystem with metadata and object data handled by separate servers. Object servers are always at least N+1 so that you can lose a whole object server without losing access to your data.

The neat things about infinistor are that it keeps track of how often data is accessed and it understands the concept that some storage is faster than others. So you could have some smaller servers based on SSD and Infinistor will replicate frequently accessed content to the fast disks. It can even handle multiple speeds of storage within one object server.

As you might expect from a project born in ISP-land it has a lot of support for replication across multiple sites. Since it’s always good to serve your client data from a node close to them on the network. Infinistor can replicate synchronously or asynchronously. With the latter prioritised for frequently accessed content.

File access is POSIX filesystem (It will do NFS or CIFS) or a REST API.

As ever with big conferences not everything you learn comes from the sessions or the Exhibition hall. I got chatting to an engineer from Pittsburgh Supercomputing Centre about the parallel filesystem they wrote called ZEST. The best thing about this filesystem is that you can’t read from it.

So I should back up for a second here and describe the problem ZEST is trying to solve since most of you are probably thinking “what use is a filesystem you can’t read from”. Here in HPC land we have all these big machines with thousands of very fast cores and big, fast interconnects. All this cost money. Unfortunately the more nodes you are running across the more likely you are to hit a problem (e.g dividing the current day in Mayan Long Count by the least significant digit in the firmware revision of your firmware cause your HCA to turn into a pumpkin or one of the million other failure modes that a wearyingly familiar to HPC ops people around the world). When this happens you don’t want to lose all the time you’ve spent up until the fault happened. And Lo unto the world did come Checkpointing.

Which is basically to say a lot of big codes will periodically dump their running state to disk so that in the event of a problem they can pick up from the last checkpoint. Now obviously this can be Terabytes of data and it takes a while to write it to disk. While you are doing that all those shiny, shiny CPUs are sitting idle. This makes the Intel salesman happy, but make your funding agencies cry.

So the approach in ZEST is to remove all the complexity involved in making a filesystem that you can read from in order to allow clients to write as fast as possible. There are a number of design decisions here that are interesting. ZEST storage servers don’t use RAID but assign write queues to each individual disk. All the checksumming and parity calculations are done on the client ( because these are over-endowed HPC nodes we are talking about here). By stripping away all this complexity ZEST aims to give each write client the the full bandwidth of the disk it’s writing to. Because most codes will be doing checkpointing from multiple nodes at once this is going to add up to significant aggregate bandwidth.

As an offline process the files that have been dumped to disk are re-aggreagated and copied onto a Lustre filesystem from where they can be read. So I kind of lied when I said it was read only. More technical detail can be found in the ZEST paper.

SC09 – Interesting Tech – Shared Memory

We are beginning to approach the end of the conference formally known as SuperComputing, so I thought it was about time that I started to write up some of the copious volumes of notes that have begun to clutter up the hard-drive of my netbook.

One of the problems we had when we performed our last procurement was that real shared-memory systems couldn’t be fitted into the budget so we had to make do with a set of 16-core commoditty boxes. We have some codes that could do with scaling-out a little bit further than that.

Which brings me nicely to 3Leaf who are building technology to hook multiple commodity boxes together so that the OS (a normal Linux build plus some proprietary kernel modules) sees them as one machine. All hardware on the individual nodes should be visible to the OS just like it would on a single machine. So you can do weird things like software RAID across all the single SATA disks in a bunch of nodes. 3Leaf caution that it’s possible that there is some funky hardware out there that wouldn’t interact well with their setup but they haven’t met it yet. The interconnect is InfiniBand DDR. While it’s not stated up-front by 3Leaf conversations with them indicate that the ASIC is implementing some kind of vitualisation layer which makes it sound sort of like ScaleMP in hardware.

A stack of 3Leaf nodes is essentially a set of AMD boxes with the 3Leaf ASIC sitting in an extra AMD CPU socket. The on-board IB is then used to carry communications traffic between the separate nodes. The manager node (a separate lower spec box) controls the booting and partitioning of the nodes such that a stack can be brought up as one big box or several smaller units.

My favourite thing about the 3Leaf solution is that you can add extra IB cards which will behave normally for the OS. This means you interface the stack to things like Lustre or NFS/RDMA over IB which many HPC facilities will already have in operation.
While currently AMD only 3Leaf claim they will have a product ready for the release of the next version of Intel’s QPI.

And in case you think this might be vapourware apparently Florida State have just bought one.

On a more traditional note SGI announced the availability their new UV shared-memory machine. Essentially an ALTIX 4700 with uprated numalink and x86_64 chips rather than Itanium. The SGI folks swear that there is no proprietary code necessary to make these machines work and that all the kernel support is in mainline. If so that is a very positive step for SGI to take. Hardware MPI acceleration is supported by the SGI MPI hardware stack. It wasn’t clear to me whether SGI are expecting other MPIs to be able to take advantage of this capability. Depending on the price-point UV might be a very interesting machine.

Speaking of all things NUMA I had an interesting chat to the chaps at Numacale. It turns out they are a spin-off from Dolphin. They are making an interconnect card on HTX that will do ccNUMA on commodity AMD kit. The ccNUMA engine is a direct descendant of the one in the Dolphin SCI system (I should note that we still have a Dolphin cluster in operation back home). Like SCI this interconnect is wired together in a loop/torus/3d torus topology without a switch.

Numascale have evaluation kit built on FPGAs at the moment and expect tot tape-out the real ASICs early next year. Like 3Leaf they claim to be working on version for the next version of Intel’s QPI.

And now we move from shared-memory to memory-sharing. Portland’s own RNA Networks have a software technology for sharing memory over IB. You can take chunks of memory on several nodes and hook them together as a block device to use as fast cache. If you stack mount this over another networked-filesystem it acts as an extra layer of caching. So access will go to the local page cache then over IB to the RNA cache and finally over the network the original filesystem. I can see a number of use cases where this could be used to add parallel scaling to a single network filesystem. Although at roughly $1000 per node plus IB I’m not sure it works out cheaper than some of the cheaper ethernet based clustered storage systems.

You can also use this memory-based block device to run a local paralell filesystem if you want although I can’t quite see the use case.

One thing I forgot to ask was whether the cache can be used a straight physical RAM for those really naive codes that just run a whole bunch of data into memory and could do with access to extra space.