weed-fs: an open-source distributed file system written in Go
weed-fs: http://code.google.com/p/weed-fs/
Written in Go, with very little code and just three executables. Impressive.
[root@ghost-rider weedfs]# ls
weedclient weedmaster weedvolume
Deployment test
Server: 192.168.2.100
1. Start the master server:
[root@ghost-rider weedfs]# ./weedmaster
2012/07/25 15:10:15 Volume Size Limit is 32768 MB
2012/07/25 15:10:15 Setting file id sequence 10000
2012/07/25 15:10:15 Start directory service at http://127.0.0.1:9333
2012/07/25 15:13:09 Saving file id sequence 20000 to /tmp/directory.seq
2. Start the volume server, using the local /tmp directory as its storage directory:
weedvolume -dir="/tmp" -volumes=0-4 -mserver="localhost:9333" -port=8080 -publicUrl="localhost:8080" &
[root@ghost-rider weedfs]# 2012/07/25 15:11:03 Store started on dir: /tmp with 5 volumes
2012/07/25 15:11:03 store joined at localhost:9333
2012/07/25 15:11:03 Start storage service at http://127.0.0.1:8080 public url localhost:8080
On the client:
1. First, get an automatically assigned fid, the unique identifier of a file:
A:\>curl http://192.168.2.100:9333/dir/assign
{"count":"1","fid":"3,2711f0c5341e","publicUrl":"localhost:8080","url":"127.0.0.
1:8080"}
A:\>ls
12.png 2012-07-25_114343.png despath output_1.jpg temp
2012-07-25_103150.png 2012-07-25_121223.png logo.jpg srcpath
2. Upload 12.png from the current directory to the server:
A:\>curl -F file=@12.png http://192.168.2.100:8080/3,2711f0c5341e
{"size":1049185}
3. The file just uploaded can now be accessed directly in a browser:
http://192.168.2.100:8080/3,2711f0c5341e
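For reference, here is a minimal Go sketch of the same assign-and-upload flow, using only the standard library. It assumes the JSON field names shown in the curl output above; the addresses and the file name 12.png are simply the ones from this test and would differ in your setup.

```go
// assign_and_upload.go - a minimal sketch of the assign + upload flow shown above.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"mime/multipart"
	"net/http"
	"os"
	"path/filepath"
)

type assignResult struct {
	Fid       string `json:"fid"`
	URL       string `json:"url"`
	PublicURL string `json:"publicUrl"`
}

func main() {
	// Step 1: ask the master for a file id and a volume server address.
	resp, err := http.Get("http://192.168.2.100:9333/dir/assign")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	var a assignResult
	if err := json.NewDecoder(resp.Body).Decode(&a); err != nil {
		log.Fatal(err)
	}

	// Step 2: POST the file to the volume server as multipart form data.
	f, err := os.Open("12.png")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var body bytes.Buffer
	w := multipart.NewWriter(&body)
	part, err := w.CreateFormFile("file", filepath.Base(f.Name()))
	if err != nil {
		log.Fatal(err)
	}
	if _, err := io.Copy(part, f); err != nil {
		log.Fatal(err)
	}
	w.Close()

	uploadURL := "http://" + a.URL + "/" + a.Fid
	resp2, err := http.Post(uploadURL, w.FormDataContentType(), &body)
	if err != nil {
		log.Fatal(err)
	}
	defer resp2.Body.Close()
	out, _ := io.ReadAll(resp2.Body)
	fmt.Printf("uploaded to %s: %s\n", uploadURL, out)

	// Step 3: the file is now readable in a browser at http://<publicUrl>/<fid>.
	fmt.Println("read it at http://" + a.PublicURL + "/" + a.Fid)
}
```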
Add another storage directory (the -volumes=5-7 flag sets the range of volume ids this server owns):
[root@ghost-rider weedfs]# mkdir /var/weedfs
[root@ghost-rider weedfs]# ./weedvolume -dir="/var/weedfs" -volumes=5-7 -mserver="localhost:9333" -port=8081 -publicUrl="localhost:8081" &
[3] 31467
[root@ghost-rider weedfs]# 2012/07/25 15:23:35 Store started on dir: /var/weedfs with 3 volumes
2012/07/25 15:23:35 store joined at localhost:9333
2012/07/25 15:23:35 Start storage service at http://127.0.0.1:8081 public url localhost:8081
Check which files are in that directory:
[root@ghost-rider weedfs]# cd /var/weedfs/
[root@ghost-rider weedfs]# ls
5.dat 5.idx 6.dat 6.idx 7.dat 7.idx
If I change the port in the URL to http://192.168.2.100:8081/3,2711f0c5341e,
the server throws an exception outright. This error handling clearly isn't polished yet. Also, is there a single point of failure? What about a replication strategy?
2012/07/25 15:26:49 http: panic serving 192.168.2.151:10935: runtime error: invalid memory address or nil pointer dereference
/home/chris/apps/go/src/pkg/net/http/server.go:576 (0x44e357)
/home/chris/apps/go/src/pkg/runtime/proc.c:1443 (0x411327)
/home/chris/apps/go/src/pkg/runtime/runtime.c:128 (0x411df3)
/home/chris/apps/go/src/pkg/runtime/thread_linux.c:209 (0x414ce6)
/home/chris/apps/go/src/pkg/sync/atomic/asm_amd64.s:12 (0x4e0e6c)
/home/chris/apps/go/src/pkg/sync/mutex.go:40 (0x48b2d2)
/home/chris/dev/workspace/home/weed-fs/src/pkg/storage/volume.go:87 (0x453cb1)
/home/chris/dev/workspace/home/weed-fs/src/pkg/storage/store.go:101 (0x453386)
/home/chris/dev/workspace/home/weed-fs/src/cmd/weedvolume/weedvolume.go:56 (0x4010f9)
/home/chris/dev/workspace/home/weed-fs/src/cmd/weedvolume/weedvolume.go:39 (0x400e3e)
/home/chris/apps/go/src/pkg/net/http/server.go:690 (0x442303)
/home/chris/apps/go/src/pkg/net/http/server.go:926 (0x443185)
/home/chris/apps/go/src/pkg/net/http/server.go:656 (0x442116)
/home/chris/apps/go/src/pkg/runtime/proc.c:271 (0x40f42d)
It turns out my request was wrong: port 8081 serves volumes 5-7, while I asked for volume 3. Still, the error should be more explicit than a panic.
A file id consists of three parts: the first number is the volume id (an unsigned 32-bit integer), the second is the file key (an unsigned 64-bit integer), and the third is a cookie (an unsigned 32-bit integer), generated randomly to prevent guessing.
So an fid is at most 8 + 1 + 16 + 8 = 33 characters long.
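As an illustration only (this is not weed-fs code), the following Go snippet splits an fid of the form shown above into its three parts, assuming the last 8 hex characters after the comma are the cookie and the rest is the file key.

```go
// Illustrative only: split an fid like "3,2711f0c5341e" into its three parts,
// assuming the layout described above (volume id, then file key + 8-hex-char cookie).
package main

import (
	"fmt"
	"strconv"
	"strings"
)

func parseFid(fid string) (volumeID uint32, fileKey uint64, cookie uint32, err error) {
	parts := strings.SplitN(fid, ",", 2)
	if len(parts) != 2 || len(parts[1]) <= 8 {
		return 0, 0, 0, fmt.Errorf("invalid fid %q", fid)
	}
	v, err := strconv.ParseUint(parts[0], 10, 32)
	if err != nil {
		return 0, 0, 0, err
	}
	keyHex, cookieHex := parts[1][:len(parts[1])-8], parts[1][len(parts[1])-8:]
	k, err := strconv.ParseUint(keyHex, 16, 64)
	if err != nil {
		return 0, 0, 0, err
	}
	c, err := strconv.ParseUint(cookieHex, 16, 32)
	if err != nil {
		return 0, 0, 0, err
	}
	return uint32(v), k, uint32(c), nil
}

func main() {
	v, k, c, err := parseFid("3,2711f0c5341e")
	if err != nil {
		panic(err)
	}
	fmt.Printf("volume=%d key=%x cookie=%x\n", v, k, c)
}
```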
Because servers can change over time, the actual address where a file is stored may change; the lookup API returns the current address of a volume:
[root@ghost-rider weedfs]# curl http://localhost:9333/dir/lookup?volumeId=3
{"Url":"127.0.0.1:8080","PublicUrl":"localhost:8080"}
Besides getting a unique id from the assign API, you can also choose your own; ideally it should not be guessable:
A:\>curl -F file=@12.png http://192.168.2.100:8080/3,123
{"size":1049185}
The file can still be accessed normally, but the server logs an error:
2012/07/25 15:54:32 Invalid fid 123 length 3
A self-generated id should still follow the fid format: the part after the comma needs a file key plus an 8-hex-character cookie.
A:\>curl -F file=@12.png http://192.168.2.100:8080/3,123412345678
{"size":1049185}
You can also append an arbitrary file extension to make the URL easier to work with:
http://192.168.2.100:8080/3,123412345678.png
OK, that wraps up the test. Here is a brief introduction to weed-fs.
weed-fs Architecture (translated)
Typical distributed file systems split each file into chunks, and a central server keeps the mapping from file names to chunk indexes, plus metadata such as which servers hold which chunks.
As a result, such a central master cannot handle large numbers of small files efficiently: every request has to go through the chunk master, so under heavy concurrency response times inevitably degrade.
The Weed-FS master server manages data volumes instead of data chunks. Each data volume is 32GB and can hold a large number of (small) files, and each storage node can own many data volumes. The master node only needs to keep the metadata about the volumes, which is a small amount of data that rarely changes.
The actual file metadata is stored in each volume on the volume servers, so each volume server manages only the metadata of its own files. Each file's metadata is only 16 bytes, so all of it fits in memory, and every file access needs only a single disk operation.
For comparison, consider that the xfs_inode_t structure in Linux takes 536 bytes.
Master Server and Volume Server
The architecture is extremely simple. The actual data lives in volumes on the storage nodes; one volume server can contain multiple volumes, and both reads and writes support basic authentication.
All volumes are managed by the master server, which holds the mapping from volume id to volume server. This information is basically static and can be cached easily.
For every write request, the master server generates a file key, a growing 64-bit unsigned integer. Since write requests are generally less frequent than reads, a single master server can handle the concurrency well.
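The "Setting file id sequence 10000" and "Saving file id sequence 20000" lines in the weedmaster log above suggest that keys are reserved in batches. Below is a hedged sketch, not the actual weed-fs code, of how such a growing 64-bit key might be handed out in batches.

```go
// Sketch of a batched, monotonically increasing file key generator, similar in
// spirit to the "file id sequence" lines in the weedmaster log above.
package main

import (
	"fmt"
	"sync"
)

type sequencer struct {
	mu      sync.Mutex
	next    uint64 // next key to hand out
	ceiling uint64 // keys below this value are already reserved
	step    uint64 // how many keys to reserve at a time
}

func (s *sequencer) NextKey() uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.next >= s.ceiling {
		// Reserve another batch; a real server would persist the new ceiling
		// (e.g. to a file such as /tmp/directory.seq) so keys survive restarts.
		s.ceiling += s.step
		fmt.Printf("Saving file id sequence %d\n", s.ceiling)
	}
	k := s.next
	s.next++
	return k
}

func main() {
	s := &sequencer{next: 10000, ceiling: 10000, step: 10000}
	for i := 0; i < 3; i++ {
		fmt.Println("assigned key:", s.NextKey())
	}
}
```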
Write and Read Files
When a client sends a write request, the master returns an id of the form <volume id, file key, file cookie> together with the volume server URL.
The client then contacts the volume node itself and uploads the file content via REST.
When a client reads a file by its <volume id, file key, file cookie> identifier, it asks the master (by volume id) for the actual volume server address, or takes it from a cache, then fetches the content directly, or simply hands the URL to the browser.
Storage Size
In the current implementation each volume is 8 x 2^32 = 32GB, because weed-fs aligns content to 8 bytes. By changing two lines of code, at the cost of some padding space, this could easily be extended to 64GB, 128GB, or more. Up to 2^32 volumes are supported, so the theoretical total capacity is 8 x 2^32 x 2^32 = 8 x 4G x 4G = 128GG bytes (roughly 128 exabytes). A single file cannot be larger than its volume.
Saving Memory
All file metadata on a volume server is served from memory with no disk access; each file costs a 16-byte map entry (<64bit key, 32bit offset, 32bit size>). You don't really need to worry about this: the disk will fill up long before memory does.
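To make that 16 bytes per file concrete, here is a small illustrative Go program (not the actual weed-fs types) that models the <64bit key, 32bit offset, 32bit size> entry and estimates index memory for one full 32GB volume, assuming an average file size of 100KB.

```go
// A sketch (not the actual weed-fs types) of the 16-byte per-file index entry
// described above, plus a rough memory estimate for a full 32GB volume.
package main

import "fmt"

// needleEntry mirrors the <64bit key, 32bit offset, 32bit size> layout:
// 16 bytes of payload per file.
type needleEntry struct {
	Key    uint64 // file key
	Offset uint32 // offset of the file inside the volume's .dat file
	Size   uint32 // file size
}

func main() {
	const entryBytes int64 = 16
	const volumeBytes int64 = 8 << 32 // 32GB volume (8-byte alignment x 2^32 offsets)
	avgFileSize := int64(100 * 1024)  // assumed average file size of 100KB

	files := volumeBytes / avgFileSize
	fmt.Printf("~%d files per full volume -> ~%d MB of index memory\n",
		files, files*entryBytes/(1<<20))
}
```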
Comparison with Other File Systems
HDFS:
HDFS uses chunks and suits large files; weed-fs is ideal storage for small files, serving them fast and with high concurrency.
MogileFS:
WeedFS has only 2 components: directory server, storage nodes.
MogileFS has 3 components: trackers, database, storage nodes.
More layers mean slower access, more operational complexity, and a higher failure rate.
GlusterFS:
weed-fs is not POSIX compliant and is deliberately a simple implementation; GlusterFS is POSIX compliant and much more complex.
Mongo's GridFS splits files into chunks and stores the metadata in a central mongodb. Every read and write has to query the metadata first, so concurrency cannot scale, and then nothing else matters.
TODO:
weed-fs will provide a fail-over master server node (similar to Hadoop's secondary namenode).
Weed-FS will support multiple copies of the data; right now there is only one copy. Depending on demand, multiple copies and even data-center awareness and optimization will follow.
In short: small but powerful.
The original text is attached below; my translation is rough, so compare for yourself. :)
Weed-FS is a simple and highly scalable distributed file system. There are two objectives:
to store billions of files!
to serve the files fast!
Instead of supporting full POSIX file system semantics, Weed-FS chooses to implement only a key~file mapping. Similar to the word "NoSQL", you can call it "NoFS".
Instead of managing all file metadata in a central master, Weed-FS chooses to manage file volumes in the central master, and lets volume servers manage files and their metadata. This relieves concurrency pressure from the central master and spreads file metadata into volume servers' memories, allowing faster file access with just one disk read operation!
Weed-FS models after Facebook's Haystack design paper.
By default, the master node runs on port 9333, and the volume nodes run on port 8080. Here I will start one master node, and two volume nodes on ports 8080 and 8081. Ideally, they should be started from different machines. Here I just use localhost as an example.
Weed-FS uses HTTP REST operations to write, read, delete. The return results are JSON or JSONP format.
Start Master Server
> ./weedmaster
Start Volume Servers
> weedvolume -dir="/tmp" -volumes=0-4 -mserver="localhost:9333" -port=8080 -publicUrl="localhost:8080" &
> weedvolume -dir="/tmp/data2" -volumes=5-7 -mserver="localhost:9333" -port=8081 -publicUrl="localhost:8081" &
Here is a simple usage on how to save a file:
> curl http://localhost:9333/dir/assign
{"fid":"3,01637037d6","url":"127.0.0.1:8080","publicUrl":"localhost:8080"}
First, send a HTTP request to get an fid and a volume server url.
> curl -F file=@/home/chris/myphoto.jpg http://127.0.0.1:8080/3,01637037d6
{"size": 43234}
Second, send a HTTP multipart POST request to the volume server url+'/'+fid, to really store the file content.
Now you can save the fid, 3,01637037d6 in this case, to some database field.
The number 3 here, is a volume id. After the comma, it's one file key, 01, and a file cookie, 637037d6.
The volume id is an unsigned 32 bit integer. The file key is an unsigned 64bit integer. The file cookie is an unsigned 32bit integer, used to prevent URL guessing.
The file key and file cookie are both coded in hex. You can store the tuple in your own format, or simply store the fid as a string; in theory, you would need 8+1+16+8=33 bytes. A char(33) would be enough, if not more than enough, since most usage would not need 2^32 volumes.
Here is the example on how to render the URL.
> curl http://localhost:9333/dir/lookup?volumeId=3
{"Url":"127.0.0.1:8080","PublicUrl":"localhost:8080"}
First look up the volume server's URL by the file's volumeId. However, since usually there are not too many volume servers, and volumes do not move often, you can cache the results most of the time.
Now you can take the public url, render the url or directly read from the volume server via url:
http://localhost:8080/3,01637037d6.jpg
Notice we add a file extension ".jpg" here. It's optional and just one way for the client to specify the file content type.
Usually distributed file systems split each file into chunks, and a central master keeps a mapping of a filename and a chunk index to chunk handles, and also which chunks each chunk server has.
This has the drawback that the central master cannot handle many small files efficiently, and since all read requests need to go through the chunk master, responses would be slow for many concurrent web users.
Instead of managing chunks, Weed-FS chooses to manage data volumes in the master server. Each data volume is 32GB in size, and can hold a lot of files. And each storage node can have many data volumes. So the master node only needs to store the metadata about the volumes, which is a fairly small amount of data and pretty stale most of the time.
The actual file metadata is stored in each volume on volume servers. Since each volume server only manages metadata of files on its own disk, with only 16 bytes for each file, all file access can read file metadata just from memory and only needs one disk operation to actually read file data.
For comparison, consider that an xfs_inode_t structure in Linux is 536 bytes.
Master Server and Volume Server
The architecture is fairly simple. The actual data is stored in volumes on storage nodes. One volume server can have multiple volumes, and supports both read and write access with basic authentication.
All volumes are managed by a master server. The master server contains volume id to volume server mapping. This is fairly static information, and could be cached easily.
On each write request, the master server also generates a file key, which is a growing 64bit unsigned integer. Since the write requests are not as busy as read requests, one master server should be able to handle the concurrency well.
Write and Read files
When a client sends a write request, the master server returns <volume id, file key, file cookie> and the volume node url for the file. The client then contacts the volume node and POSTs the file content via REST.
When a client needs to read a file based on <volume id, file key, file cookie>, it can ask the master server by the volume id for the volume node url, or get it from cache. Then the client can HTTP GET the content via REST, or just render the URL on web pages and let browsers fetch the content.
Please see the example for details on write-read process.
In the current implementation, each volume can be 8 x 2^32 = 32G bytes in size. This is because of aligning contents to 8 bytes. It can easily be increased to 64G, or 128G, or more, by changing 2 lines of code, at the cost of some wasted padding space due to alignment.
There can be 2^32 volumes. So the total system size is 8 x 2^32 x 2^32 = 8 x 4G x 4G = 128GG bytes. (Sorry, I don't know the word for giga of giga bytes.)
Each individual file size is limited to the volume size.
All file meta information on the volume server is readable from memory without disk access. Each file just takes a 16-byte map entry of <64bit key, 32bit offset, 32bit size>. Of course, each map entry has its own space cost in the map. But usually the disk runs out before the memory does.
Compared to Other File Systems
Frankly, I don't use other distributed file systems too often. All seems more complicated than necessary. Please correct me if anything here is wrong.
HDFS uses the chunk approach for each file, and is ideal for streaming large files.
WeedFS is ideal for serving relatively smaller files quickly and concurrently.
Compared to MogileFS
WeedFS has 2 components: directory server, storage nodes.
MogileFS has 3 components: trackers, database, storage nodes.
One more layer means slower access, more operation complexity, more failure possibility.
Compared to GlusterFS
WeedFS is not POSIX compliant, and has simple implementation.
GlusterFS is POSIX compliant, much more complex.
Compared to Mongo's GridFS
Mongo's GridFS splits files into chunks and manages chunks in the central mongodb. For every read or write request, the database needs to query the metadata. It's OK if this is not a bottleneck yet, but for a lot of concurrent reads this unnecessary query could slow things down.
On the contrary, Weed-FS uses large file volumes of 32G size to store lots of files, and only manages file volumes in the master server. Each volume manages its own file metadata. So all the file metadata is spread onto the volume nodes' memories, and just one disk read is needed.
Weed-FS will support fail-over master server.
Weed-FS will support multiple copies of the data. Right now, data has just one copy. Depending on demands, multiple copy, and even data-center awareness and optimization will be implemented.
Weed-FS may add more optimization for pictures. For example, automatically resizing pictures when storing them.
WeedFS does not plan to add namespaces.
To use WeedFS, the namespace is supposed to be managed by the clients. Many use cases, like a user's avatar picture, do not really need namespaces. Actually, it takes some effort to create and maintain the file path in order to avoid too many files under a directory.
Advanced users can actually create the namespace layer on top of the Key-file store, just like how the common file system creates the namespace on top of inode for each file.
./weedmaster
./weedvolume -dir="/var/weedfs1" -volumes=0-4 -mserver="localhost:9333" -port=8080 -publicUrl="localhost:8080" &
./weedvolume -dir="/var/weedfs2" -volumes=5-7 -mserver="localhost:9333" -port=8081 -publicUrl="localhost:8081" &