- Author: Xuanwo <github@xuanwo.io>
- Start Date: 2021-07-13
- RFC PR: [beyondstorage/beyond-fs#5](https://github.com/beyondstorage/beyond-fs/issues/5)
- Tracking Issue: [beyondstorage/beyond-fs#6](https://github.com/beyondstorage/beyond-fs/issues/6)

# RFC-5: BeyondFS Design

## Background

[FUSE](https://en.wikipedia.org/wiki/Filesystem_in_Userspace) (Filesystem in Userspace) has a wide range of applications in different scenarios, from lightweight data viewing to big-data analytics over massive amounts of data. FUSE is a bridge that allows users to mount a storage service as a local file system, so applications can use it without refactoring their code.
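For example, once a remote bucket is mounted at a local path (the path `/mnt/bfs` below is only a placeholder for illustration), an application keeps using the ordinary file API and never needs to know that its reads are served by a remote service:

```go
package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	// /mnt/bfs is a hypothetical mount point where a FUSE filesystem exposes a
	// remote bucket; the program only uses the standard file API.
	data, err := os.ReadFile("/mnt/bfs/reports/2021-07.csv")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("read %d bytes through the mounted filesystem\n", len(data))
}
```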
Many filesystems have already been implemented with FUSE:

- [s3fs-fuse](https://github.com/s3fs-fuse/s3fs-fuse): allows mounting an S3 bucket.
- [goofys](https://github.com/kahing/goofys): a high-performance, POSIX-ish Amazon S3 file system.
- [gcsfuse](https://github.com/GoogleCloudPlatform/gcsfuse/): a user-space file system for interacting with Google Cloud Storage.
- [juicefs](https://github.com/juicedata/juicefs): a distributed POSIX file system built on top of Redis and S3.
- [glusterfs](https://glusterdocs-beta.readthedocs.io/en/latest/overview-concepts/fuse.html): GlusterFS is a userspace filesystem, a choice its developers made early on because getting modules into the Linux kernel is a very long and difficult process.
- [ntfs-3g](https://github.com/tuxera/ntfs-3g): NTFS-3G provides NTFS support via FUSE.
- [Lustre](https://git.whamcloud.com/fs/lustre-release.git): a parallel distributed file system, generally used for large-scale cluster computing.
- [moosefs](https://github.com/moosefs/moosefs): an open-source, petabyte-scale, fault-tolerant, high-performance, scalable network distributed file system (software-defined storage).
- [sshfs](https://github.com/libfuse/sshfs): a network filesystem client to connect to SSH servers.
- [hdfs](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html): the Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
- ...

Those file systems are designed for different purposes. Here we focus on the FUSE services that mount remote storage services locally.

To design such a FUSE filesystem, the following decisions have to be made:
### Metadata maintenance

There are mainly two kinds of design:

**No metadata maintenance**

Only the metadata that is used locally is cached; all data is stored in the underlying storage service.

This design is adopted by s3fs and goofys.
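As a minimal, hypothetical sketch of what "only cache metadata" means in practice (the `Cache` type and its `stat` callback below are invented for illustration, not taken from an existing codebase): attribute lookups hit a local cache first and fall back to a stat call against the underlying service on a miss.

```go
package metacache

import (
	"sync"
	"time"
)

// Attr is the small amount of metadata a getattr-style lookup needs to answer.
type Attr struct {
	Size    int64
	ModTime time.Time
	IsDir   bool
}

// Cache keeps metadata locally; the underlying storage service stays the
// source of truth and is only consulted on a miss.
type Cache struct {
	mu      sync.Mutex
	entries map[string]Attr
	// stat asks the underlying service, e.g. an object-storage HEAD request.
	stat func(path string) (Attr, error)
}

func New(stat func(string) (Attr, error)) *Cache {
	return &Cache{entries: make(map[string]Attr), stat: stat}
}

func (c *Cache) Getattr(path string) (Attr, error) {
	c.mu.Lock()
	if a, ok := c.entries[path]; ok {
		c.mu.Unlock()
		return a, nil
	}
	c.mu.Unlock()

	// Cache miss: fetch from the underlying storage service and remember it.
	a, err := c.stat(path)
	if err != nil {
		return Attr{}, err
	}

	c.mu.Lock()
	c.entries[path] = a
	c.mu.Unlock()
	return a, nil
}
```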
Under this design, the user cannot do the following things:

- Write to the same path from different nodes.
- Read data that has been written by another node.

**Standalone metadata**

Maintain metadata in a separate service, and use the underlying storage service only for data.

This design is adopted by juicefs.

Under this design, all file metadata is stored in a separate metadata service, which makes metadata operations far faster than storing metadata in the underlying storage service.

But if the metadata service is down or broken, the user may fail to read data, or even lose data.

### POSIX Compatibility

POSIX (the Portable Operating System Interface) is a family of standards specified by the IEEE Computer Society for maintaining compatibility between operating systems. For a FUSE filesystem, we only care about the API for files and directories, more specifically the FUSE API defined in [libfuse](https://github.com/libfuse/libfuse) / [macFUSE](https://osxfuse.github.io/). Most FUSE filesystems implement only part of the interface: for example, HDFS only supports append writes, and s3fs doesn't support atomic renames of files or directories.
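To make that scope concrete, the interface below lists the kind of file and directory operations such a filesystem has to serve. It is an illustrative sketch only; the names are not the real libfuse or FUSE binding API, just the POSIX-ish surface discussed here.

```go
package fsapi

import (
	"io"
	"os"
)

// FileSystem is a hypothetical, simplified view of the file/dir operations a
// FUSE-backed filesystem answers; real implementations bind to libfuse/macFUSE.
type FileSystem interface {
	// Metadata operations.
	Getattr(path string) (os.FileInfo, error)
	Readdir(path string) ([]os.FileInfo, error)

	// Data operations.
	Open(path string, flags int) (io.ReadWriteCloser, error)
	Create(path string, mode os.FileMode) (io.WriteCloser, error)

	// Namespace operations; how well these work depends on the underlying
	// service (e.g. s3fs cannot rename atomically).
	Mkdir(path string, mode os.FileMode) error
	Rmdir(path string) error
	Rename(oldPath, newPath string) error
	Unlink(path string) error
}
```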
### Underlying Storage Service

Most FUSE filesystems choose to use existing underlying storage services instead of writing their own. The [BeyondStorage](https://beyondstorage.io/) community has built a vendor-neutral storage library, [go-storage](https://github.com/beyondstorage/go-storage), which allows operating on data in various storage services, from s3, gcs, and oss to ftp, Google Drive, and ipfs.
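As a rough illustration of how BeyondFS could sit on top of go-storage, the sketch below initializes an S3-backed `Storager` and uploads a local file. It assumes go-storage v4's connection-string initializer (`services.NewStoragerFromString`) and the `Storager.Write` method; the bucket, credentials, import versions, and connection-string format are placeholders, so check the go-storage documentation for the exact form.

```go
package main

import (
	"log"
	"os"

	// Each service driver registers itself on import; versions are illustrative.
	_ "github.com/beyondstorage/go-service-s3/v2"
	"github.com/beyondstorage/go-storage/v4/services"
)

func main() {
	// Placeholder connection string: the bucket name, credentials and work dir
	// would normally come from configuration.
	store, err := services.NewStoragerFromString("s3://example-bucket/workdir/")
	if err != nil {
		log.Fatalf("init storager: %v", err)
	}

	f, err := os.Open("hello.txt")
	if err != nil {
		log.Fatalf("open local file: %v", err)
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		log.Fatalf("stat local file: %v", err)
	}

	// Upload the whole file as a single object, roughly what BeyondFS would do
	// when a file written through FUSE is flushed.
	if _, err := store.Write("hello.txt", f, fi.Size()); err != nil {
		log.Fatalf("write object: %v", err)
	}
}
```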
## Proposal

I propose to build a POSIX-ish file system, called BeyondFS, on top of [go-storage](https://github.com/beyondstorage/go-storage).

**BeyondFS only caches metadata**

BeyondFS will not maintain metadata itself; all metadata will eventually be persisted in the underlying storage services.

**BeyondFS is sharable**

BeyondFS only caches metadata, but the cache can be stored in a distributed key/value store and shared between different nodes.
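One way to realize this is to hide the metadata cache behind a small key/value interface, with an in-memory implementation for a single node and a distributed store (Redis, etcd, TiKV, ...) plugged in when several nodes need to share metadata. The `KV` interface and `memKV` type below are a hypothetical sketch, not an existing BeyondFS API.

```go
package metakv

import "sync"

// KV is the minimal contract the metadata cache needs from its backend. A
// distributed key/value store can implement the same interface to make the
// cache sharable across nodes.
type KV interface {
	Get(key string) (value []byte, ok bool, err error)
	Set(key string, value []byte) error
	Delete(key string) error
}

// memKV is the zero-dependency, single-node default backend.
type memKV struct {
	mu sync.RWMutex
	m  map[string][]byte
}

// NewMemKV returns an in-memory backend suitable for local-only deployments.
func NewMemKV() KV { return &memKV{m: make(map[string][]byte)} }

func (s *memKV) Get(key string) ([]byte, bool, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.m[key]
	return v, ok, nil
}

func (s *memKV) Set(key string, value []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[key] = value
	return nil
}

func (s *memKV) Delete(key string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.m, key)
	return nil
}
```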
**BeyondFS is POSIX-ish**

BeyondFS will try its best to satisfy POSIX requirements but won't commit to being fully POSIX-compatible. That means BeyondFS will give up some features that are hard to implement, and won't implement features that depend on capabilities the underlying storage service doesn't have.

BeyondFS is designed for these scenarios:

- Write Once Read Many: data is written once and read by other clients many times.
- Rarely Random Write: data is written in one pass instead of being randomly rewritten in place.
- Small Amount List: services keep their own index and don't depend on `readdir` to fetch file lists.
## Rationale

### Why not maintain metadata?

Maintaining metadata requires an extra metadata service, and that service can go down or become corrupted.

It also forces the user to read and write data through this service: data stored in the underlying storage services is no longer directly readable by users.

[BeyondStorage](https://beyondstorage.io/) focuses on providing cross-cloud data services and intends to build a world where data flows freely. Maintaining metadata in an extra service is another form of vendor lock-in, so it doesn't fit our route.
### Why not focus on a standalone machine?

Making BeyondFS sharable doesn't mean we will sacrifice users who only have a single node. Distributed caching is a natural step forward for BeyondFS.

Users who don't care about a distributed cache can use the local memory mode without extra cost. For example, users may run BeyondFS on thousands of nodes where every node only writes to its own unique path and never reads other nodes' data.

### Why not fully POSIX-compatible?

It's hard, and in some cases impossible, in our current design.

For example, without a separate metadata service and data slicing, it's impossible to implement random writes on top of s3, because s3 objects can only be replaced as a whole rather than updated in place.

We choose to forgo these requirements in exchange for higher throughput and better concurrency.
## Compatibility

Say `Hello, World!` instead.

## Implementation

First, we will implement a POSIX-ish file system that only caches metadata locally.

Then, we will focus on performance improvements, including prefetching and caching logic.

Finally, we will extend our metadata cache logic to other key/value systems.