From 236c009d20125a52012a751f7acbfc0e4078a8c3 Mon Sep 17 00:00:00 2001
From: Xuanwo
Date: Wed, 14 Jul 2021 15:51:36 +0800
Subject: [PATCH] RFC-5: BeyondFS Design (#5)

* Add proposal beyond fs design

Signed-off-by: Xuanwo

* Assign number

Signed-off-by: Xuanwo

* Rename

Signed-off-by: Xuanwo

* Update rfc

Signed-off-by: Xuanwo

* Update implement

Signed-off-by: Xuanwo

* Update retional

Signed-off-by: Xuanwo

* Update tracking issue

Signed-off-by: Xuanwo

* code format

Signed-off-by: Xuanwo
---
 Makefile                        |   9 +--
 cmd/{aofs => beyondfs}/main.go  |   2 +-
 docs/rfcs/0-example.md          |  41 +++++++++++
 docs/rfcs/5-beyond-fs-design.md | 117 ++++++++++++++++++++++++++++++++
 go.mod                          |   4 +-
 5 files changed, 163 insertions(+), 10 deletions(-)
 rename cmd/{aofs => beyondfs}/main.go (96%)
 create mode 100644 docs/rfcs/0-example.md
 create mode 100644 docs/rfcs/5-beyond-fs-design.md

diff --git a/Makefile b/Makefile
index e33c54c..524e87f 100644
--- a/Makefile
+++ b/Makefile
@@ -20,9 +20,9 @@ vet:
 	@go vet ./...
 	@echo "ok"
 
-build: tidy check
+build: tidy format check
 	@echo "build storage"
-	@go build -o bin/aofs ./cmd/aofs
+	@go build -o bin/beyondfs ./cmd/beyondfs
 	@echo "ok"
 
 test:
@@ -34,8 +34,3 @@ test:
 tidy:
 	@go mod tidy
 	@go mod verify
-
-clean:
-	@echo "clean generated files"
-	@find . -type f -name 'generated.go' -delete
-	@echo "Done"
\ No newline at end of file
diff --git a/cmd/aofs/main.go b/cmd/beyondfs/main.go
similarity index 96%
rename from cmd/aofs/main.go
rename to cmd/beyondfs/main.go
index bc134f5..cfd7a0a 100644
--- a/cmd/aofs/main.go
+++ b/cmd/beyondfs/main.go
@@ -2,4 +2,4 @@ package main
 
 func main() {
 	print("Hello, world!")
-}
\ No newline at end of file
+}
diff --git a/docs/rfcs/0-example.md b/docs/rfcs/0-example.md
new file mode 100644
index 0000000..ff9b292
--- /dev/null
+++ b/docs/rfcs/0-example.md
@@ -0,0 +1,41 @@
+- Author: (fill me in with `name `, e.g., Xuanwo )
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [beyondstorage/beyond-fs#0](https://github.com/beyondstorage/beyond-fs/issues/0)
+- Tracking Issue: [beyondstorage/beyond-fs#0](https://github.com/beyondstorage/beyond-fs/issues/0)
+
+# RFC-0:
+
+- Updates: (delete this part if not applicable)
+  - [GSP-20](./20-abc): Deletes something
+- Updated By: (delete this part if not applicable)
+  - [GSP-10](./10-do-be-do-be-do): Adds something
+  - [GSP-1000](./1000-lalala): Deprecates this RFC
+
+## Background
+
+Explain why we are doing this.
+
+Related issues and early discussions can be linked, but the RFC should try to be self-contained if possible.
+
+## Proposal
+
+
+
+## Rationale
+
+
+
+Possible content:
+
+- Design Principles
+- Drawbacks
+- Alternative implementations and comparison
+- Possible Q&As
+
+## Compatibility
+
+
+
+## Implementation
+
+Explain what steps should be done to implement this proposal.
diff --git a/docs/rfcs/5-beyond-fs-design.md b/docs/rfcs/5-beyond-fs-design.md
new file mode 100644
index 0000000..916d0ff
--- /dev/null
+++ b/docs/rfcs/5-beyond-fs-design.md
@@ -0,0 +1,117 @@
+- Author: Xuanwo
+- Start Date: 2021-07-13
+- RFC PR: [beyondstorage/beyond-fs#5](https://github.com/beyondstorage/beyond-fs/issues/5)
+- Tracking Issue: [beyondstorage/beyond-fs#6](https://github.com/beyondstorage/beyond-fs/issues/6)
+
+# RFC-5: BeyondFS Design
+
+## Background
+
+[FUSE](https://en.wikipedia.org/wiki/Filesystem_in_Userspace), a.k.a. Filesystem in Userspace, has a wide range of applications in different scenarios, from lightweight data viewing to big data analytics over massive amounts of data. FUSE is a bridge that allows users to mount a storage service as a local file system so that applications don't need to be refactored.
+
+Many filesystems have already been implemented with FUSE:
+
+- [s3fs-fuse](https://github.com/s3fs-fuse/s3fs-fuse): allows mounting an S3 bucket.
+- [goofys](https://github.com/kahing/goofys): a high-performance, POSIX-ish Amazon S3 file system.
+- [gcsfuse](https://github.com/GoogleCloudPlatform/gcsfuse/): A user-space file system for interacting with Google Cloud Storage
+- [juicefs](https://github.com/juicedata/juicefs): a distributed POSIX file system built on top of Redis and S3.
+- [glusterfs](https://glusterdocs-beta.readthedocs.io/en/latest/overview-concepts/fuse.html): GlusterFS is a userspace filesystem. This was a decision made by the GlusterFS developers initially as getting the modules into the Linux kernel is a very long and difficult process.
+- [ntfs-3g](https://github.com/tuxera/ntfs-3g): NTFS-3G provides NTFS support via FUSE
+- [Lustre](https://git.whamcloud.com/fs/lustre-release.git): a type of parallel distributed file system, generally used for large-scale cluster computing
+- [moosefs](https://github.com/moosefs/moosefs): Open Source, Petabyte, Fault-Tolerant, Highly Performing, Scalable Network Distributed File System (Software-Defined Storage)
+- [sshfs](https://github.com/libfuse/sshfs): A network filesystem client to connect to SSH servers
+- [hdfs](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html): The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
+- ...
+
+Those file systems are designed for different purposes. Today we will focus on those FUSE services that mount remote storage services locally.
+
+To design a FUSE filesystem, the following decisions have to be made:
+
+### Metadata maintenance
+
+There are mainly two kinds of design:
+
+**No metadata maintenance**
+
+Only metadata that is used locally is cached; all data is stored in the underlying storage services.
+
+This design is adopted by s3fs and goofys.
+
+Under this design, the user cannot do the following things:
+
+- Write the same path from different nodes.
+- Read data that has been written by another node.
+
+**Standalone metadata**
+
+Metadata is maintained in separate services; the underlying storage services are used only as data storage.
+
+This design is adopted by juicefs.
+
+Under this design, all file metadata is stored in separate metadata services, which makes metadata operations far quicker than storing metadata in an underlying storage service.
+
+But if the metadata service is down or broken, users could fail to read data or even lose data.
+
+### POSIX Compatibility
+
+POSIX (the Portable Operating System Interface) is a family of standards specified by the IEEE Computer Society for maintaining compatibility between operating systems. For a FUSE filesystem, we only care about the API for files and directories, more specifically the FUSE API that is defined in [libfuse](https://github.com/libfuse/libfuse) / [macFUSE](https://osxfuse.github.io/). Most FUSEs implement only part of the interface. For example, HDFS only supports append writes, and s3fs doesn't support atomic renames of files or directories.
+
+### Underlying Storage Service
+
+Most FUSEs choose to use existing underlying storage services instead of writing their own. The [BeyondStorage](https://beyondstorage.io/) community has built a vendor-neutral storage library, [go-storage](https://github.com/beyondstorage/go-storage), which allows operating on data across various storage services, from s3, gcs, and oss to ftp, google drive, and ipfs.
+
+## Proposal
+
+I propose to build a POSIX-ish file system based on [go-storage](https://github.com/beyondstorage/go-storage), called BeyondFS.
+
+**BeyondFS only caches metadata**
+
+BeyondFS will not maintain metadata itself; all of it will eventually be persisted on the underlying storage services.
+
+**BeyondFS is sharable**
+
+BeyondFS only caches metadata, but the cache could be stored in a distributed key/value store which can be shared between different nodes.
+
+**BeyondFS is POSIX-ish**
+
+BeyondFS will try its best to satisfy POSIX requirements but won't commit to being fully POSIX-compatible. That means BeyondFS will give up some hard-to-implement features and won't implement some features that depend on the abilities of the underlying storage services.
+
+BeyondFS is designed for these scenarios:
+
+- Write Once Read Many: data is written only once and read by other clients many times.
+- Rarely Random Write: data is written all at once instead of through random writes.
+- Small Amount List: services have their own index and don't depend on `readdir` to fetch file lists.
+
+## Rationale
+
+### Why not maintain metadata?
+
+Maintaining metadata requires an extra metadata service. The service could be down or broken.
+
+It also forces users to read/write data through this service. Data stored in the underlying storage services is no longer directly readable by users.
+
+[BeyondStorage](https://beyondstorage.io/) focuses on providing cross-cloud data services and intends to build a world where data flows freely. Maintaining metadata in extra services is another form of vendor lock-in, so it doesn't fit our route.
+
+### Why not focus on a standalone machine?
+
+Making BeyondFS sharable doesn't mean we will sacrifice users that only have a single node. Distributed caching is a natural step forward for BeyondFS.
+
+Users who don't care about distributed caching can use local memory mode without extra cost. For example, users may run BeyondFS on thousands of nodes, each of which only writes its own unique path and doesn't read other nodes' data.
+
+### Why not fully POSIX-compatible?
+
+It's hard in general and impossible in our current design.
+
+For example, without separate metadata services and data slicing, it's impossible to implement random writes on s3.
+
+We choose to forgo these requirements in exchange for higher throughput and better concurrency.
+
+## Compatibility
+
+Say `Hello, World!` instead.
+
+## Implementation
+
+Firstly, we will implement a POSIX-ish file system that only caches metadata locally.
+Then, we will focus on performance improvements, including prefetch and cache logic.
+Finally, we will extend our metadata cache logic to other key/value systems.
diff --git a/go.mod b/go.mod
index c9e61e3..20d2d1f 100644
--- a/go.mod
+++ b/go.mod
@@ -1,3 +1,3 @@
-module github.com/aos-dev/go-fs
+module github.com/beyondstorage/beyond-fs
 
-go 1.16
+go 1.15