1.4 The ObjectStore Interface

Flag bits

// Flag bits
typedef uint32_t osflagbits_t;
const int SKIP_JOURNAL_REPLAY = 1 << 0;
const int SKIP_MOUNT_OMAP = 1 << 1;

The interesting part here is the engineering idiom: to set bit x, shift 1 left by x places:
1 << x
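
The shift idiom above can be sketched as a few small helpers. This is only an illustration: set_bit, clear_bit and test_bit are hypothetical names that do not exist in ObjectStore, and the constants are redeclared with unsigned literals so that the snippet stands on its own.

```cpp
#include <cassert>
#include <cstdint>

// Same shape as the flag declarations above; unsigned literals avoid
// shifting into the sign bit of an int.
typedef uint32_t osflagbits_t;
const osflagbits_t SKIP_JOURNAL_REPLAY = 1u << 0;
const osflagbits_t SKIP_MOUNT_OMAP = 1u << 1;

// Hypothetical helpers (not part of ObjectStore).

// Return flags with bit x set.
inline osflagbits_t set_bit(osflagbits_t flags, unsigned x) {
  assert(x < 32);
  return flags | (osflagbits_t(1) << x);
}

// Return flags with bit x cleared.
inline osflagbits_t clear_bit(osflagbits_t flags, unsigned x) {
  assert(x < 32);
  return flags & ~(osflagbits_t(1) << x);
}

// Return true if bit x is set in flags.
inline bool test_bit(osflagbits_t flags, unsigned x) {
  assert(x < 32);
  return (flags & (osflagbits_t(1) << x)) != 0;
}
```

Combining flags is then a plain bitwise OR, for example SKIP_JOURNAL_REPLAY | SKIP_MOUNT_OMAP.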

Members and functions

Everything below is a data member or a member function of ObjectStore.

Path

The ObjectStore's context and path.

// Path of the ObjectStore
string path;
CephContext* cct;

Factory method

Abstract classes with many derived classes often seem to favor a create factory method like this one.

/**
* create - create an ObjectStore instance.
*
* This is invoked once at initialization time.
*
* @param type type of store. This is a string from the configuration file.
* @param data path (or other descriptor) for data
* @param journal path (or other descriptor) for journal (optional)
* @param flags which filestores should check if applicable
*/
static ObjectStore *create(CephContext *cct,
const string& type,
const string& data,
const string& journal,
osflagbits_t flags = 0);

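As we can see, the caller has to specify:

  • the type of ObjectStore wanted
  • the data area of the ObjectStore
  • the journal area of the ObjectStore
  • the flag bits of the ObjectStore

There are mainly two flag bits here:

  1. whether the journal needs to be replayed
  2. whether the omap needs to be mounted

Compare the two SKIP_ flags listed earlier.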
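
The factory pattern described above can be sketched in a few lines. Everything here is hypothetical: ToyStore, ToyFileStore and ToyMemStore are stand-ins invented for this note, and the type strings are only examples. The real create also takes a CephContext, which is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>

typedef uint32_t osflagbits_t;
const osflagbits_t SKIP_JOURNAL_REPLAY = 1u << 0;
const osflagbits_t SKIP_MOUNT_OMAP = 1u << 1;

// Hypothetical abstract store; stands in for ObjectStore.
class ToyStore {
public:
  ToyStore(const std::string& data, osflagbits_t f) : data_path(data), flags(f) {}
  virtual ~ToyStore() {}
  virtual std::string get_type() const = 0;

  // A set SKIP_ bit means the corresponding step is skipped.
  bool replays_journal() const { return (flags & SKIP_JOURNAL_REPLAY) == 0; }
  bool mounts_omap() const { return (flags & SKIP_MOUNT_OMAP) == 0; }

  // Factory: picks the concrete class from the type string and
  // returns nullptr for an unknown type.
  static std::unique_ptr<ToyStore> create(const std::string& type,
                                          const std::string& data,
                                          const std::string& journal,
                                          osflagbits_t flags = 0);

protected:
  std::string data_path;
  osflagbits_t flags;
};

class ToyFileStore : public ToyStore {
  std::string journal_path;
public:
  ToyFileStore(const std::string& data, const std::string& journal, osflagbits_t f)
    : ToyStore(data, f), journal_path(journal) {}
  std::string get_type() const override { return "filestore"; }
};

class ToyMemStore : public ToyStore {
public:
  ToyMemStore(const std::string& data, osflagbits_t f) : ToyStore(data, f) {}
  std::string get_type() const override { return "memstore"; }
};

std::unique_ptr<ToyStore> ToyStore::create(const std::string& type,
                                           const std::string& data,
                                           const std::string& journal,
                                           osflagbits_t flags) {
  if (type == "filestore")
    return std::unique_ptr<ToyStore>(new ToyFileStore(data, journal, flags));
  if (type == "memstore")
    return std::unique_ptr<ToyStore>(new ToyMemStore(data, flags));
  return nullptr;
}
```

The caller only ever sees the base class pointer, so adding a new backend means touching the factory and nothing else on the calling side.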

Reading the fsid

/**
* probe a block device to learn the uuid of the owning OSD
*
* @param cct cct
* @param path path to device
* @param fsid [out] osd uuid
*/
static int probe_block_device_fsid(
CephContext *cct,
const string& path,
uuid_d *fsid);

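This reads the fsid stored on an ObjectStore device. According to the comment above, it is the uuid of the OSD that owns the device.
Each ObjectStore implementation reads it in its own way.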

int ObjectStore::probe_block_device_fsid(
CephContext *cct,
const string& path,
uuid_d *fsid)
{
int r;
#if defined(WITH_BLUESTORE)
// first try bluestore -- it has a crc on its header and will fail
// reliably.
r = BlueStore::get_block_device_fsid(cct, path, fsid);
if (r == 0) {
lgeneric_dout(cct, 0) << __func__ << " " << path << " is bluestore, "
<< *fsid << dendl;
return r;
}
#endif
// okay, try FileStore (journal).
r = FileStore::get_block_device_fsid(cct, path, fsid);
if (r == 0) {
lgeneric_dout(cct, 0) << __func__ << " " << path << " is filestore, "
<< *fsid << dendl;
return r;
}
return -EINVAL;
}

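From a design standpoint, it would be better to push this logic down into each implementation instead of dispatching here with if/else. Let's expand the FileStore path a little.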

int FileStore::get_block_device_fsid(CephContext* cct, const string& path,
uuid_d *fsid)
{
// make sure we don't try to use aio or direct_io (and get annoying
// error messages from failing to do so); performance implications
// should be irrelevant for this use
FileJournal j(cct, *fsid, 0, 0, path.c_str(), false, false);
return j.peek_fsid(*fsid);
}
// This can not be used on an active journal
int FileJournal::peek_fsid(uuid_d& fsid)
{
assert(fd == -1);
int r = _open(false, false);
if (r)
return r;
r = read_header(&header);
if (r < 0)
goto out;
fsid = header.fsid;
out:
close();
return r;
}

As the code shows, this simply opens the journal, reads the journal header, and takes the fsid out of that header.
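
The design remark earlier in this section (each backend owning its probe logic instead of an if/else chain in the base class) could look roughly like the sketch below. ProbeRegistry and ProbeFn are hypothetical names made up for this note and are not part of Ceph. The fsid is modeled as a plain string.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical probe signature: returns 0 and fills *fsid on success,
// a negative errno otherwise.
typedef std::function<int(const std::string& path, std::string* fsid)> ProbeFn;

class ProbeRegistry {
  std::vector<std::pair<std::string, ProbeFn>> probes;

public:
  // Each backend registers its own probe; order of registration is
  // the order in which probes are tried.
  void add(const std::string& backend, ProbeFn fn) {
    probes.emplace_back(backend, std::move(fn));
  }

  // Try every registered probe in turn. On the first success, report
  // which backend matched and return 0; otherwise return -EINVAL.
  int probe(const std::string& path, std::string* fsid, std::string* backend) const {
    for (const auto& p : probes) {
      if (p.second(path, fsid) == 0) {
        *backend = p.first;
        return 0;
      }
    }
    return -EINVAL;
  }
};
```

The probing order still matters, as the original comment about trying bluestore first points out, so a registry like this would have to preserve registration order, which the vector does.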

Getting ObjectStore performance data

As the comment says, this mainly fetches the ObjectStore's commit/apply latency figures.

/**
* Fetch Object Store statistics.
*
* Currently only latency of write and apply times are measured.
*
* This appears to be called with nothing locked.
*/
virtual objectstore_perf_stat_t get_cur_stats() = 0;

We can expand get_cur_stats and look at how FileStore.h handles it.

struct FSPerfTracker {
PerfCounters::avg_tracker<uint64_t> os_commit_latency_ns;
PerfCounters::avg_tracker<uint64_t> os_apply_latency_ns;
objectstore_perf_stat_t get_cur_stats() const {
objectstore_perf_stat_t ret;
ret.os_commit_latency_ns = os_commit_latency_ns.current_avg();
ret.os_apply_latency_ns = os_apply_latency_ns.current_avg();
return ret;
}
void update_from_perfcounters(PerfCounters &logger);
} perf_tracker;
objectstore_perf_stat_t get_cur_stats() override {
perf_tracker.update_from_perfcounters(*logger);
return perf_tracker.get_cur_stats();
}

As we can see, the values updated here are:

  • os_commit_latency_ns
  • os_apply_latency_ns
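
To make the current_avg() calls above more concrete, here is a minimal sketch of a tracker that averages over the delta between two consecutive (count, sum) samples. That this is how PerfCounters::avg_tracker behaves is an assumption of this note, and ToyAvgTracker is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical tracker: keeps the previous and the current
// (sample count, running sum) pair and averages over the difference.
template <typename T>
struct ToyAvgTracker {
  std::pair<uint64_t, T> last;
  std::pair<uint64_t, T> cur;

  ToyAvgTracker() : last(0, 0), cur(0, 0) {}

  // Feed the latest cumulative (count, sum) read from the counters.
  void consume_next(uint64_t count, T sum) {
    last = cur;
    cur = std::make_pair(count, sum);
  }

  // Average of the samples that arrived between the last two reads;
  // 0 when no new sample arrived.
  T current_avg() const {
    if (cur.first == last.first)
      return 0;
    return (cur.second - last.second) / (cur.first - last.first);
  }
};
```

With this shape, the reported latency reflects only the most recent interval rather than the lifetime average.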

Getting the performance counters

/**
* Fetch Object Store performance counters.
*
*
* This appears to be called with nothing locked.
*/
virtual const PerfCounters* get_perf_counters() const = 0;

In most subclasses that have performance counters at all, this is basically a one-liner:

const PerfCounters* get_perf_counters() const override {
return logger;
}

Collection

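Ceph normally assumes that transactions are unrelated to each other, i.e. transaction A and transaction B may be committed in any order. But what if the user needs certain transactions to run in a specific order?

Say the order must be transaction A, then B, then C. In that case a Collection is used to keep A, B and C in order and grouped together.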

/**
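* A collection also imposes an order on transactions.
* Transactions queued in the same collection must be applied one by
* one, in the order in which they were queued.
* Transactions in different collections may be committed in parallel.
*
* ObjectStore users can obtain a collection handle in two ways:
* - open_collection()
* - create_new_collection()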
*/
struct CollectionImpl : public RefCountedObject {
const coll_t cid;
CollectionImpl(const coll_t& c)
: RefCountedObject(NULL, 0),
cid(c) {}
/// wait for any queued transactions to apply
// block until any previous transactions are visible. specifically,
// collection_list and collection_empty need to reflect prior operations.
// i.e. flush() applies the queued transactions one by one; later
// transactions can proceed only after the earlier ones have taken
// effect.
virtual void flush() = 0;

For now we set aside what exactly collection_list and collection_empty do.

/**
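* Async flush_commit
*
* There are two cases:
* 1. The collection is currently idle: flush_commit returns true and
* c is not touched.
* 2. The collection is not idle: the method returns false, and c is
* called asynchronously with a value of 0 once all transactions
* queued on this collection before the flush_commit call have been
* applied and committed.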
virtual bool flush_commit(Context *c) = 0;
const coll_t &get_cid() { return cid; }
};
// Handle type for a Collection
typedef boost::intrusive_ptr<CollectionImpl> CollectionHandle;
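
The ordering rule and the two flush_commit cases can be modeled with a toy queue. This is a single-threaded sketch under simplifying assumptions: ToyCollection is a hypothetical class, transactions are plain callables, and "applied and committed" collapses into a single step.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical single-threaded model of one collection's queue.
class ToyCollection {
  std::deque<std::function<void()>> queued;       // transactions, oldest first
  std::vector<std::function<void(int)>> waiters;  // pending flush_commit callbacks

public:
  void queue(std::function<void()> txn) { queued.push_back(std::move(txn)); }

  // Apply every queued transaction in submission order, then call the
  // waiting flush_commit callbacks with 0.
  void flush() {
    while (!queued.empty()) {
      std::function<void()> txn = std::move(queued.front());
      queued.pop_front();
      txn();
    }
    std::vector<std::function<void(int)>> ready;
    ready.swap(waiters);
    for (auto& w : ready)
      w(0);
  }

  // Case 1: idle, return true and never call c.
  // Case 2: busy, remember c for later and return false.
  bool flush_commit(std::function<void(int)> c) {
    if (queued.empty())
      return true;
    waiters.push_back(std::move(c));
    return false;
  }
};
```

Two ToyCollection instances share no state, which mirrors the statement that transactions in different collections may proceed in parallel.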

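Object contents and semantics

All objects in an ObjectStore are uniquely identified, whether by ghobject_t or by hobject_t.
ObjectStore operations support creating, modifying, deleting and listing the objects in a collection.

The listing, however, is sorted by object key. Every object name is unique
across the whole Ceph system.

Each object has four distinct parts: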

  • 数据
  • xattrs
  • omap_header
  • omap_entries

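For background on omap, see this link: http://bean-li.github.io/ceph-omap/

A brief summary follows.

FileStore's omap holds object attribute information as key-value pairs. So how should the key be defined for each attribute of an object?
The most direct idea is (object_id + xattr_key): combine the two to form the key. The problem is that object_id can be long, and when one object has many attributes the object_id is repeated in every key, which wastes storage space.
Ceph's FileStore does it in two steps:
Step 1: generate a fairly short seq from the object_id and store that seq in the omap_header.
Step 2: use seq + xattr_key as the key of each attribute of the object.

How the seq is generated

If omap is implemented on LevelDB, a single OSD-wide global state entry is stored in LevelDB.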

key: SYS_PREFIX + GLOBAL_STATE_KEY
value: state

To allocate a seq, take the lock that protects it and increment it. The seq lives inside state, and state itself is stored in LevelDB.

struct State {
__u8 v;
uint64_t seq;
};

From object_id to seq

struct _Header {
uint64_t seq;
uint64_t parent;
uint64_t num_children;
coll_t c;
ghobject_t oid;
SequencerPosition spos;
};

As soon as a seq has been generated, a header structure is created and stored in LevelDB.

key: HOBJECT_TO_SEQ + ghobject_key(oid)
value: header
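
The two-step key scheme (object_id to seq, then seq + xattr_key) can be sketched with std::map standing in for LevelDB. ToyOmap is a hypothetical class, and the "<seq>.<attr>" key layout is an assumption made for this example, not the on-disk format. Locking around the counter is omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

class ToyOmap {
  std::map<std::string, std::string> kv;          // stands in for LevelDB
  std::map<std::string, uint64_t> object_to_seq;  // object_id -> seq
  uint64_t next_seq;                              // global counter, like State::seq

public:
  ToyOmap() : next_seq(1) {}

  // Step 1: return the seq of an object, allocating one on first use.
  uint64_t seq_for(const std::string& object_id) {
    std::map<std::string, uint64_t>::iterator it = object_to_seq.find(object_id);
    if (it != object_to_seq.end())
      return it->second;
    uint64_t seq = next_seq++;
    object_to_seq[object_id] = seq;
    return seq;
  }

  // Step 2: the short seq, not the long object_id, prefixes every key.
  static std::string make_key(uint64_t seq, const std::string& attr) {
    return std::to_string(seq) + "." + attr;
  }

  void set(const std::string& object_id, const std::string& attr,
           const std::string& value) {
    kv[make_key(seq_for(object_id), attr)] = value;
  }

  bool get(const std::string& object_id, const std::string& attr,
           std::string* value) const {
    std::map<std::string, uint64_t>::const_iterator it = object_to_seq.find(object_id);
    if (it == object_to_seq.end())
      return false;
    std::map<std::string, std::string>::const_iterator v =
        kv.find(make_key(it->second, attr));
    if (v == kv.end())
      return false;
    *value = v->second;
    return true;
  }

  bool has_key(const std::string& key) const { return kv.count(key) != 0; }
};
```

However long the object_id is, it is stored once in the object_id to seq mapping, and every attribute key carries only the short seq.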

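Object data

The data part of an object is conceptually equivalent to a file in a file system. Random and partial reads and writes must both be supported. Sparse handling of the data part is not a hard requirement. In general a single object should not be too large; a large one is typically around 100MB.

Object xattrs

xattrs are mainly stored in the file system's extended attributes, while omap is generally stored in leveldb.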

/*********************************
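* Transaction
*
* A transaction contains a sequence of mutation operations.
*
* Three events in the life of a transaction trigger callbacks. Any
* transaction can carry the following callbacks:
*
* on_applied_sync, on_applied, and on_commit.
*
* Both `on_applied` and `on_applied_sync` fire only after the
* modifications have taken effect, meaning that they are visible to
* subsequent operations.
*
* The only conceptual difference between `on_applied` and
* `on_applied_sync` is the thread and locking environment in which the
* callback runs. `on_applied_sync` is called directly by the execution
* thread; it is expected to finish quickly and must not acquire locks
* of the calling environment (acquiring a lock could mean waiting).
* `on_applied`, by contrast, is called from a separate Finisher thread,
* so it is allowed to contend for locks (that is, it may wait while
* acquiring them).
* Note that on_applied and on_applied_sync are sometimes also called
* on_readable and on_readable_sync.
*
* The on_commit callback is always called from a separate Finisher
* thread, after all the modifications have been written to the journal,
* i.e. made durable.
*
* As for how the journal is written: each primitive mutation (together
* with its data) can be serialized into a single buffer. The
* serialization does not copy the data itself but references the
* original buffers. The original data therefore has to stay unchanged
* until the on_commit callback completes. In practice the buffer layer
* handles all of this; it matters mainly when the data buffer is
* referenced through bufferlist::raw_static.
*
* Some ObjectStore implementations implement their own form of journal
* and use the serialized form of a transaction for it. In that case the
* encode/decode logic has to version itself properly and handle
* upgrades.
*
*
* TRANSACTION ISOLATION
*
* In short: transactions are independent of one another, and any
* relationship between them has to be guaranteed by the caller. While a
* transaction that modifies or deletes one of the four parts of an
* object is pending (that is, until its `on_applied_sync` callback has
* run), the caller must not read that part. ObjectStore does not
* police violations of this rule and raises no error for them. The
* original comment follows.
*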
* Except as noted above, isolation is the responsibility of the
* caller. In other words, if any storage element (storage element
* == any of the four portions of an object as described above) is
* altered by a transaction (including deletion), the caller
* promises not to attempt to read that element while the
* transaction is pending (here pending means from the time of
* issuance until the "on_applied_sync" callback has been
* received). Violations of isolation need not be detected by
* ObjectStore and there is no corresponding error mechanism for
* reporting an isolation violation (crashing would be the
* appropriate way to report an isolation violation if detected).
*
* Enumeration operations may violate transaction isolation as
* described above when a storage element is being created or
* deleted as part of a transaction. In this case, ObjectStore is
* allowed to consider the enumeration operation to either precede
* or follow the violating transaction element. In other words, the
* presence/absence of the mutated element in the enumeration is
* entirely at the discretion of ObjectStore. The arbitrary ordering
* applies independently to each transaction element. For example,
* if a transaction contains two mutating elements "create A" and
* "delete B". And an enumeration operation is performed while this
* transaction is pending. It is permissible for ObjectStore to
* report any of the four possible combinations of the existence of
* A and B.
*
*/
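
Where the three callbacks run can be illustrated with a toy model. This is a single-threaded sketch under stated assumptions: ToyFinisher, ToyTxn and submit are hypothetical names, the "Finisher thread" is replaced by a queue that is drained by hand, and the journal-then-apply order is chosen only for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical finisher: callbacks are queued here instead of being
// run inline; drain() plays the role of the Finisher thread.
class ToyFinisher {
  std::deque<std::function<void()>> q;
public:
  void queue(std::function<void()> f) { q.push_back(std::move(f)); }
  std::size_t pending() const { return q.size(); }
  void drain() {
    while (!q.empty()) {
      std::function<void()> f = std::move(q.front());
      q.pop_front();
      f();
    }
  }
};

struct ToyTxn {
  std::string record;                     // the serialized mutation
  std::function<void()> on_applied_sync;  // runs inline, must be quick
  std::function<void()> on_applied;       // runs on the finisher
  std::function<void()> on_commit;        // runs on the finisher
};

void submit(const ToyTxn& t,
            std::vector<std::string>& journal,
            std::vector<std::string>& applied,
            ToyFinisher& fin) {
  journal.push_back(t.record);  // the mutation is now "durable"
  if (t.on_commit)
    fin.queue(t.on_commit);
  applied.push_back(t.record);  // the mutation is now readable
  if (t.on_applied_sync)
    t.on_applied_sync();        // called directly by the applying thread
  if (t.on_applied)
    fin.queue(t.on_applied);
}
```

Only on_applied_sync runs before submit returns, which is why it must not block on locks held by the caller.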

Transaction

class Transaction {
public:
// This is a bit like designing a small instruction set.
enum {
OP_NOP = 0,
OP_TOUCH = 9, // cid, oid
OP_WRITE = 10, // cid, oid, offset, len, bl
OP_ZERO = 11, // cid, oid, offset, len
OP_TRUNCATE = 12, // cid, oid, len
OP_REMOVE = 13, // cid, oid
OP_SETATTR = 14, // cid, oid, attrname, bl
OP_SETATTRS = 15, // cid, oid, attrset
OP_RMATTR = 16, // cid, oid, attrname
OP_CLONE = 17, // cid, oid, newoid
OP_CLONERANGE = 18, // cid, oid, newoid, offset, len
OP_CLONERANGE2 = 30, // cid, oid, newoid, srcoff, len, dstoff
OP_TRIMCACHE = 19, // cid, oid, offset, len **DEPRECATED**
OP_MKCOLL = 20, // cid
OP_RMCOLL = 21, // cid
OP_COLL_ADD = 22, // cid, oldcid, oid
OP_COLL_REMOVE = 23, // cid, oid
OP_COLL_SETATTR = 24, // cid, attrname, bl
OP_COLL_RMATTR = 25, // cid, attrname
OP_COLL_SETATTRS = 26, // cid, attrset
OP_COLL_MOVE = 8, // newcid, oldcid, oid
OP_RMATTRS = 28, // cid, oid
OP_COLL_RENAME = 29, // cid, newcid
OP_OMAP_CLEAR = 31, // cid
OP_OMAP_SETKEYS = 32, // cid, attrset
OP_OMAP_RMKEYS = 33, // cid, keyset
OP_OMAP_SETHEADER = 34, // cid, header
OP_SPLIT_COLLECTION = 35, // cid, bits, destination
OP_SPLIT_COLLECTION2 = 36, /* cid, bits, destination
doesn't create the destination */
OP_OMAP_RMKEYRANGE = 37, // cid, oid, firstkey, lastkey
OP_COLL_MOVE_RENAME = 38, // oldcid, oldoid, newcid, newoid
OP_SETALLOCHINT = 39, // cid, oid, object_size, write_size
OP_COLL_HINT = 40, // cid, type, bl
OP_TRY_RENAME = 41, // oldcid, oldoid, newoid
OP_COLL_SET_BITS = 42, // cid, bits
};
// Transaction hint type
enum {
COLL_HINT_EXPECTED_NUM_OBJECTS = 1,
};
// The actual operation
struct Op {
__le32 op; // numeric operation type; can also be read as the instruction opcode
__le32 cid;
__le32 oid;
__le64 off;
__le64 len;
__le32 dest_cid;
__le32 dest_oid; //OP_CLONE, OP_CLONERANGE
__le64 dest_off; //OP_CLONERANGE
union {
struct {
__le32 hint_type; //OP_COLL_HINT
};
struct {
__le32 alloc_hint_flags; //OP_SETALLOCHINT
};
};
__le64 expected_object_size; //OP_SETALLOCHINT
__le64 expected_write_size; //OP_SETALLOCHINT
__le32 split_bits; //OP_SPLIT_COLLECTION2,OP_COLL_SET_BITS,
//OP_MKCOLL
__le32 split_rem; //OP_SPLIT_COLLECTION2
} __attribute__ ((packed)) ;
//
struct TransactionData {
__le64 ops; // presumably the number of operations
__le32 largest_data_len;
__le32 largest_data_off;
__le32 largest_data_off_in_data_bl;
__le32 fadvise_flags;
} __attribute__ ((packed)) ;
private:
TransactionData data;
map<coll_t, __le32> coll_index;
map<ghobject_t, __le32> object_index;
__le32 coll_id {0};
__le32 object_id {0};
bufferlist data_bl;
bufferlist op_bl;
bufferptr op_ptr;
list<Context *> on_applied;
list<Context *> on_commit;
list<Context *> on_applied_sync;
public:
void _update_op(Op* op,
vector<__le32> &cm,
vector<__le32> &om) {
// Whether the collection id and/or the object ids need to be
// remapped depends on the type of the op.
op->cid = cm[op->cid];
op->oid = om[op->oid];
op->dest_oid = om[op->dest_oid];
}
// bl holds a list of buffers,
// and each buffer is a packed array of Op structs;
// every Op is remapped in place
// through _update_op(op, cm, om).
void _update_op_bl(
bufferlist& bl,
vector<__le32> &cm,
vector<__le32> &om)
{
list<bufferptr> list = bl.buffers();
std::list<bufferptr>::iterator p;
for(p = list.begin(); p != list.end(); ++p) {
assert(p->length() % sizeof(Op) == 0);
char* raw_p = p->c_str();
char* raw_end = raw_p + p->length();
while (raw_p < raw_end) {
_update_op(reinterpret_cast<Op*>(raw_p), cm, om);
raw_p += sizeof(Op);
}
}
}
/// Append the operations of the parameter to this Transaction.
// Those operations are removed from the parameter Transaction
// This is more like merging two transactions. Note:
// other.op_bl is deep-copied,
// while other.data_bl is not deep-copied,
// perhaps because other may still be needed elsewhere.
void append(Transaction& other) {
data.ops += other.data.ops;
if (other.data.largest_data_len > data.largest_data_len) {
data.largest_data_len = other.data.largest_data_len;
data.largest_data_off = other.data.largest_data_off;
data.largest_data_off_in_data_bl = data_bl.length() + other.data.largest_data_off_in_data_bl;
}
data.fadvise_flags |= other.data.fadvise_flags;
// splice appends the other transaction's callback lists to the end of
// on_applied / on_commit / on_applied_sync.
// list::splice(position, other_list) inserts other_list at position by
// moving its elements over, so other_list is empty afterwards.
on_applied.splice(on_applied.end(), other.on_applied);
on_commit.splice(on_commit.end(), other.on_commit);
on_applied_sync.splice(on_applied_sync.end(), other.on_applied_sync);
//append coll_index & object_index
// cm is built fresh here and used further down for remapping
vector<__le32> cm(other.coll_index.size());
map<coll_t, __le32>::iterator coll_index_p;
for (coll_index_p = other.coll_index.begin();
coll_index_p != other.coll_index.end();
++coll_index_p) {
// fill in the cm vector: old collection index -> new collection index
cm[coll_index_p->second] = _get_coll_id(coll_index_p->first);
}
vector<__le32> om(other.object_index.size());
map<ghobject_t, __le32>::iterator object_index_p;
for (object_index_p = other.object_index.begin();
object_index_p != other.object_index.end();
++object_index_p) {
// fill in the om vector: old object index -> new object index
om[object_index_p->second] = _get_object_id(object_index_p->first);
}
// other.op_bl must not be modified here:
// the other.op_bl SHOULD NOT be changed during the append operation,
// so an additional bufferlist is used to avoid this problem.
// Allocate a new buffer of length other.op_bl.length()
bufferptr other_op_bl_ptr(other.op_bl.length());
// and copy the contents of other.op_bl into it
other.op_bl.copy(0, other.op_bl.length(), other_op_bl_ptr.c_str());
bufferlist other_op_bl;
// a bufferlist is a list<bufferptr>, so append() puts the buffer above into it
other_op_bl.append(other_op_bl_ptr);
//update other_op_bl with cm & om
//When the other is appended to current transaction, all coll_index and
//object_index in other.op_buffer should be updated by new index of the
//combined transaction
// then remap the copied ops using cm and om
_update_op_bl(other_op_bl, cm, om);
//append op_bl
// append the remapped copy of other's op_bl to op_bl,
// which completes the merge of the two transactions' ops
op_bl.append(other_op_bl);
//append data_bl
// data_bl has to be merged as well
data_bl.append(other.data_bl);
}
/** Inquires about the Transaction as a whole. */
/// How big is the encoded Transaction buffer?
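// Total encoded length of the whole transaction.
// It feels wasteful to recompute this on every call;
// it would be nice to find a way to optimize it.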
uint64_t get_encoded_bytes() {
//layout: data_bl + op_bl + coll_index + object_index + data
// coll_index size, object_index size and sizeof(transaction_data)
// all here, so they may be computed at compile-time
size_t final_size = sizeof(__u32) * 2 + sizeof(data);
// coll_index second and object_index second
final_size += (coll_index.size() + object_index.size()) * sizeof(__le32);
// coll_index first
for (auto p = coll_index.begin(); p != coll_index.end(); ++p) {
final_size += p->first.encoded_size();
}
// object_index first
for (auto p = object_index.begin(); p != object_index.end(); ++p) {
final_size += p->first.encoded_size();
}
return data_bl.length() +
op_bl.length() +
final_size;
}
uint64_t get_num_bytes() {
return get_encoded_bytes();
}
/// Size of largest data buffer to the "write" operation encountered so far
uint32_t get_data_length() {
return data.largest_data_len;
}
/// offset within the encoded buffer to the start of the largest data buffer that's encoded
uint32_t get_data_offset()
{
if (data.largest_data_off_in_data_bl) {
return data.largest_data_off_in_data_bl +
sizeof(__u8) + // encode struct_v
sizeof(__u8) + // encode compat_v
sizeof(__u32) + // encode len
sizeof(__u32); // data_bl len
}
return 0; // none
}
/// offset of buffer as aligned to destination within object.
int get_data_alignment()
{
if (!data.largest_data_len)
return 0;
return (0 - get_data_offset()) & ~CEPH_PAGE_MASK;
}
/// Is the Transaction empty (no operations)
bool empty()
{
// data.ops counts the number of operations
return !data.ops;
}
/// Number of operations in the transaction
int get_num_ops()
{
return data.ops;
}
/**
* iterator
*
* Helper object to parse Transactions.
*
* ObjectStore instances use this object to step down the encoded
* buffer decoding operation codes and parameters as we go.
*
*/
class iterator
{
Transaction *t;
uint64_t ops;
char* op_buffer_p;
bufferlist::const_iterator data_bl_p;
public:
vector<coll_t> colls;
vector<ghobject_t> objects;
private:
explicit iterator(Transaction *t)
: t(t),
data_bl_p(t->data_bl.cbegin()),
colls(t->coll_index.size()),
objects(t->object_index.size())
{
ops = t->data.ops;
op_buffer_p = t->op_bl.get_contiguous(0, t->data.ops * sizeof(Op));
map<coll_t, __le32>::iterator coll_index_p;
for (coll_index_p = t->coll_index.begin();
coll_index_p != t->coll_index.end();
++coll_index_p) {
colls[coll_index_p->second] = coll_index_p->first;
}
map<ghobject_t, __le32>::iterator object_index_p;
for (object_index_p = t->object_index.begin();
object_index_p != t->object_index.end();
++object_index_p) {
objects[object_index_p->second] = object_index_p->first;
}
}
friend class Transaction;
public:
bool have_op()
{
return ops > 0;
}
Op* decode_op()
{
assert(ops > 0);
Op* op = reinterpret_cast<Op*>(op_buffer_p);
op_buffer_p += sizeof(Op);
ops--;
return op;
}
string decode_string()
{
using ceph::decode;
string s;
decode(s, data_bl_p);
return s;
}
void decode_bp(bufferptr& bp)
{
using ceph::decode;
decode(bp, data_bl_p);
}
void decode_bl(bufferlist& bl)
{
using ceph::decode;
decode(bl, data_bl_p);
}
void decode_attrset(map<string,bufferptr>& aset)
{
using ceph::decode;
decode(aset, data_bl_p);
}
void decode_attrset(map<string,bufferlist>& aset)
{
using ceph::decode;
decode(aset, data_bl_p);
}
void decode_attrset_bl(bufferlist *pbl)
{
decode_str_str_map_to_bl(data_bl_p, pbl);
}
void decode_keyset(set<string> &keys)
{
using ceph::decode;
decode(keys, data_bl_p);
}
void decode_keyset_bl(bufferlist *pbl)
{
decode_str_set_to_bl(data_bl_p, pbl);
}
const ghobject_t &get_oid(__le32 oid_id)
{
assert(oid_id < objects.size());
return objects[oid_id];
}
const coll_t &get_cid(__le32 cid_id)
{
assert(cid_id < colls.size());
return colls[cid_id];
}
uint32_t get_fadvise_flags() const
{
return t->get_fadvise_flags();
}
};
iterator begin()
{
return iterator(this);
}
private:
void _build_actions_from_tbl();
/**
* Helper functions to encode the various mutation elements of a
* transaction. These are 1:1 with the operation codes (see
* enumeration above). These routines ensure that the
* encoder/creator of a transaction gets the right data in the
* right place. Sadly, there's no corresponding version nor any
* form of seat belts for the decoder.
*/
Op* _get_next_op()
{
if (op_ptr.length() == 0 || op_ptr.offset() >= op_ptr.length()) {
op_ptr = bufferptr(sizeof(Op) * OPS_PER_PTR);
}
bufferptr ptr(op_ptr, 0, sizeof(Op));
op_bl.append(ptr);
op_ptr.set_offset(op_ptr.offset() + sizeof(Op));
char* p = ptr.c_str();
memset(p, 0, sizeof(Op));
return reinterpret_cast<Op*>(p);
}
__le32 _get_coll_id(const coll_t& coll)
{
map<coll_t, __le32>::iterator c = coll_index.find(coll);
if (c != coll_index.end())
return c->second;
__le32 index_id = coll_id++;
coll_index[coll] = index_id;
return index_id;
}
__le32 _get_object_id(const ghobject_t& oid)
{
map<ghobject_t, __le32>::iterator o = object_index.find(oid);
if (o != object_index.end())
return o->second;
__le32 index_id = object_id++;
object_index[oid] = index_id;
return index_id;
}
public:
// What follows builds the various operations (instructions) of a transaction and their parameters
/// noop. 'nuf said
void nop()
{
Op* _op = _get_next_op();
_op->op = OP_NOP;
data.ops++;
}
/**
* touch
*
* Ensure the existence of an object in a collection. Create an
* empty object if necessary
*/
void touch(const coll_t& cid, const ghobject_t& oid)
{
Op* _op = _get_next_op();
_op->op = OP_TOUCH;
_op->cid = _get_coll_id(cid);
_op->oid = _get_object_id(oid);
data.ops++;
}
/**
* Write data to an offset within an object. If the object is too
* small, it is expanded as needed. It is possible to specify an
* offset beyond the current end of an object and it will be
* expanded as needed. Simple implementations of ObjectStore will
* just zero the data between the old end of the object and the
* newly provided data. More sophisticated implementations of
* ObjectStore will omit the untouched data and store it as a
* "hole" in the file.
*
* Note that a 0-length write does not affect the size of the object.
*/
void write(const coll_t& cid, const ghobject_t& oid, uint64_t off, uint64_t len,
const bufferlist& write_data, uint32_t flags = 0)
{
using ceph::encode;
uint32_t orig_len = data_bl.length();
Op* _op = _get_next_op();
_op->op = OP_WRITE;
_op->cid = _get_coll_id(cid);
_op->oid = _get_object_id(oid);
_op->off = off;
_op->len = len;
encode(write_data, data_bl);
assert(len == write_data.length());
data.fadvise_flags = data.fadvise_flags | flags;
if (write_data.length() > data.largest_data_len) {
data.largest_data_len = write_data.length();
data.largest_data_off = off;
data.largest_data_off_in_data_bl = orig_len + sizeof(__u32); // we are about to
}
data.ops++;
}
/**
* zero out the indicated byte range within an object. Some
* ObjectStore instances may optimize this to release the
* underlying storage space.
*
* If the zero range extends beyond the end of the object, the object
* size is extended, just as if we were writing a buffer full of zeros.
* EXCEPT if the length is 0, in which case (just like a 0-length write)
* we do not adjust the object size.
*/
void zero(const coll_t& cid, const ghobject_t& oid, uint64_t off, uint64_t len)
{
Op* _op = _get_next_op();
_op->op = OP_ZERO;
_op->cid = _get_coll_id(cid);
_op->oid = _get_object_id(oid);
_op->off = off;
_op->len = len;
data.ops++;
}
/// Discard all data in the object beyond the specified size.
void truncate(const coll_t& cid, const ghobject_t& oid, uint64_t off)
{
Op* _op = _get_next_op();
_op->op = OP_TRUNCATE;
_op->cid = _get_coll_id(cid);
_op->oid = _get_object_id(oid);
_op->off = off;
data.ops++;
}
/// Remove an object. All four parts of the object are removed.
void remove(const coll_t& cid, const ghobject_t& oid)
{
Op* _op = _get_next_op();
_op->op = OP_REMOVE;
_op->cid = _get_coll_id(cid);
_op->oid = _get_object_id(oid);
data.ops++;
}
/// Set an xattr of an object
void setattr(const coll_t& cid, const ghobject_t& oid, const char* name, bufferlist& val)
{
string n(name);
setattr(cid, oid, n, val);
}
/// Set an xattr of an object
void setattr(const coll_t& cid, const ghobject_t& oid, const string& s, bufferlist& val)
{
using ceph::encode;
Op* _op = _get_next_op();
_op->op = OP_SETATTR;
_op->cid = _get_coll_id(cid);
_op->oid = _get_object_id(oid);
encode(s, data_bl);
encode(val, data_bl);
data.ops++;
}
/// Set multiple xattrs of an object
void setattrs(const coll_t& cid, const ghobject_t& oid, const map<string,bufferptr>& attrset)
{
using ceph::encode;
Op* _op = _get_next_op();
_op->op = OP_SETATTRS;
_op->cid = _get_coll_id(cid);
_op->oid = _get_object_id(oid);
encode(attrset, data_bl);
data.ops++;
}
/// Set multiple xattrs of an object
void setattrs(const coll_t& cid, const ghobject_t& oid, const map<string,bufferlist>& attrset)
{
using ceph::encode;
Op* _op = _get_next_op();
_op->op = OP_SETATTRS;
_op->cid = _get_coll_id(cid);
_op->oid = _get_object_id(oid);
encode(attrset, data_bl);
data.ops++;
}
/// remove an xattr from an object
void rmattr(const coll_t& cid, const ghobject_t& oid, const char *name)
{
string n(name);
rmattr(cid, oid, n);
}
/// remove an xattr from an object
void rmattr(const coll_t& cid, const ghobject_t& oid, const string& s)
{
using ceph::encode;
Op* _op = _get_next_op();
_op->op = OP_RMATTR;
_op->cid = _get_coll_id(cid);
_op->oid = _get_object_id(oid);
encode(s, data_bl);
data.ops++;
}
/// remove all xattrs from an object
void rmattrs(const coll_t& cid, const ghobject_t& oid)
{
Op* _op = _get_next_op();
_op->op = OP_RMATTRS;
_op->cid = _get_coll_id(cid);
_op->oid = _get_object_id(oid);
data.ops++;
}
/**
* Clone an object into another object.
*
* Low-cost (e.g., O(1)) cloning (if supported) is best, but
* fallback to an O(n) copy is allowed. All four parts of the
* object are cloned (data, xattrs, omap header, omap
* entries).
*
* The destination named object may already exist, in
* which case its previous contents are discarded.
*/
void clone(const coll_t& cid, const ghobject_t& oid,
const ghobject_t& noid)
{
Op* _op = _get_next_op();
_op->op = OP_CLONE;
_op->cid = _get_coll_id(cid);
_op->oid = _get_object_id(oid);
_op->dest_oid = _get_object_id(noid);
data.ops++;
}
/**
* Clone a byte range from one object to another.
*
* The data portion of the destination object receives a copy of a
* portion of the data from the source object. None of the other
* three parts of an object is copied from the source.
*
* The destination object size may be extended to the dstoff + len.
*
* The source range *must* overlap with the source object data. If it does
* not the result is undefined.
*/
void clone_range(const coll_t& cid, const ghobject_t& oid,
const ghobject_t& noid,
uint64_t srcoff, uint64_t srclen, uint64_t dstoff)
{
Op* _op = _get_next_op();
_op->op = OP_CLONERANGE2;
_op->cid = _get_coll_id(cid);
_op->oid = _get_object_id(oid);
_op->dest_oid = _get_object_id(noid);
_op->off = srcoff;
_op->len = srclen;
_op->dest_off = dstoff;
data.ops++;
}
/// Create the collection
void create_collection(const coll_t& cid, int bits)
{
Op* _op = _get_next_op();
_op->op = OP_MKCOLL;
_op->cid = _get_coll_id(cid);
_op->split_bits = bits;
data.ops++;
}
/**
* Give the collection a hint.
*
* @param cid - collection id.
* @param type - hint type.
* @param hint - the hint payload, which contains the customized
* data along with the hint type.
*/
void collection_hint(const coll_t& cid, uint32_t type, const bufferlist& hint)
{
using ceph::encode;
Op* _op = _get_next_op();
_op->op = OP_COLL_HINT;
_op->cid = _get_coll_id(cid);
_op->hint_type = type;
encode(hint, data_bl);
data.ops++;
}
/// remove the collection, the collection must be empty
void remove_collection(const coll_t& cid)
{
Op* _op = _get_next_op();
_op->op = OP_RMCOLL;
_op->cid = _get_coll_id(cid);
data.ops++;
}
void collection_move(const coll_t& cid, const coll_t &oldcid, const ghobject_t& oid)
__attribute__ ((deprecated))
{
// NOTE: we encode this as a fixed combo of ADD + REMOVE. they
// always appear together, so this is effectively a single MOVE.
Op* _op = _get_next_op();
_op->op = OP_COLL_ADD;
_op->cid = _get_coll_id(oldcid);
_op->oid = _get_object_id(oid);
_op->dest_cid = _get_coll_id(cid);
data.ops++;
_op = _get_next_op();
_op->op = OP_COLL_REMOVE;
_op->cid = _get_coll_id(oldcid);
_op->oid = _get_object_id(oid);
data.ops++;
}
void collection_move_rename(const coll_t& oldcid, const ghobject_t& oldoid,
const coll_t &cid, const ghobject_t& oid)
{
Op* _op = _get_next_op();
_op->op = OP_COLL_MOVE_RENAME;
_op->cid = _get_coll_id(oldcid);
_op->oid = _get_object_id(oldoid);
_op->dest_cid = _get_coll_id(cid);
_op->dest_oid = _get_object_id(oid);
data.ops++;
}
void try_rename(const coll_t &cid, const ghobject_t& oldoid,
const ghobject_t& oid)
{
Op* _op = _get_next_op();
_op->op = OP_TRY_RENAME;
_op->cid = _get_coll_id(cid);
_op->oid = _get_object_id(oldoid);
_op->dest_oid = _get_object_id(oid);
data.ops++;
}
/// Remove omap from oid
void omap_clear(
const coll_t &cid, ///< [in] Collection containing oid
const ghobject_t &oid ///< [in] Object from which to remove omap
)
{
Op* _op = _get_next_op();
_op->op = OP_OMAP_CLEAR;
_op->cid = _get_coll_id(cid);
_op->oid = _get_object_id(oid);
data.ops++;
}
/// Set keys on oid omap. Replaces duplicate keys.
void omap_setkeys(
const coll_t& cid, ///< [in] Collection containing oid
const ghobject_t &oid, ///< [in] Object to update
const map<string, bufferlist> &attrset ///< [in] Replacement keys and values
)
{
using ceph::encode;
Op* _op = _get_next_op();
_op->op = OP_OMAP_SETKEYS;
_op->cid = _get_coll_id(cid);
_op->oid = _get_object_id(oid);
encode(attrset, data_bl);
data.ops++;
}
/// Set keys on an oid omap (bufferlist variant).
void omap_setkeys(
const coll_t &cid, ///< [in] Collection containing oid
const ghobject_t &oid, ///< [in] Object to update
const bufferlist &attrset_bl ///< [in] Replacement keys and values
)
{
Op* _op = _get_next_op();
_op->op = OP_OMAP_SETKEYS;
_op->cid = _get_coll_id(cid);
_op->oid = _get_object_id(oid);
data_bl.append(attrset_bl);
data.ops++;
}
/// Remove keys from oid omap
void omap_rmkeys(
const coll_t &cid, ///< [in] Collection containing oid
const ghobject_t &oid, ///< [in] Object from which to remove the omap
const set<string> &keys ///< [in] Keys to clear
)
{
using ceph::encode;
Op* _op = _get_next_op();
_op->op = OP_OMAP_RMKEYS;
_op->cid = _get_coll_id(cid);
_op->oid = _get_object_id(oid);
encode(keys, data_bl);
data.ops++;
}
/// Remove keys from oid omap
void omap_rmkeys(
const coll_t &cid, ///< [in] Collection containing oid
const ghobject_t &oid, ///< [in] Object from which to remove the omap
const bufferlist &keys_bl ///< [in] Keys to clear
)
{
Op* _op = _get_next_op();
_op->op = OP_OMAP_RMKEYS;
_op->cid = _get_coll_id(cid);
_op->oid = _get_object_id(oid);
data_bl.append(keys_bl);
data.ops++;
}
/// Remove key range from oid omap
void omap_rmkeyrange(
const coll_t &cid, ///< [in] Collection containing oid
const ghobject_t &oid, ///< [in] Object from which to remove the omap keys
const string& first, ///< [in] first key in range
const string& last ///< [in] first key past range, range is [first,last)
)
{
using ceph::encode;
Op* _op = _get_next_op();
_op->op = OP_OMAP_RMKEYRANGE;
_op->cid = _get_coll_id(cid);
_op->oid = _get_object_id(oid);
encode(first, data_bl);
encode(last, data_bl);
data.ops++;
}
/// Set omap header
void omap_setheader(
const coll_t &cid, ///< [in] Collection containing oid
const ghobject_t &oid, ///< [in] Object
const bufferlist &bl ///< [in] Header value
)
{
using ceph::encode;
Op* _op = _get_next_op();
_op->op = OP_OMAP_SETHEADER;
_op->cid = _get_coll_id(cid);
_op->oid = _get_object_id(oid);
encode(bl, data_bl);
data.ops++;
}
/// Split collection based on given prefixes, objects matching the specified bits/rem are
/// moved to the new collection
void split_collection(
const coll_t &cid,
uint32_t bits,
uint32_t rem,
const coll_t &destination)
{
Op* _op = _get_next_op();
_op->op = OP_SPLIT_COLLECTION2;
_op->cid = _get_coll_id(cid);
_op->dest_cid = _get_coll_id(destination);
_op->split_bits = bits;
_op->split_rem = rem;
data.ops++;
}
void collection_set_bits(
const coll_t &cid,
int bits)
{
Op* _op = _get_next_op();
_op->op = OP_COLL_SET_BITS;
_op->cid = _get_coll_id(cid);
_op->split_bits = bits;
data.ops++;
}
/// Set allocation hint for an object
/// make 0 values(expected_object_size, expected_write_size) noops for all implementations
void set_alloc_hint(
const coll_t &cid,
const ghobject_t &oid,
uint64_t expected_object_size,
uint64_t expected_write_size,
uint32_t flags
)
{
Op* _op = _get_next_op();
_op->op = OP_SETALLOCHINT;
_op->cid = _get_coll_id(cid);
_op->oid = _get_object_id(oid);
_op->expected_object_size = expected_object_size;
_op->expected_write_size = expected_write_size;
_op->alloc_hint_flags = flags;
data.ops++;
}
};
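
The least obvious part of the class above is the index remapping in append(): ops refer to collections and objects by small integer index, so merging two transactions means translating the other transaction's indexes through the cm and om vectors. Below is a much reduced sketch of that idea. ToyTransaction and ToyOp are hypothetical, names are plain strings, and unlike the real append() this one leaves the other transaction untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

const uint32_t TOY_OP_TOUCH = 9;
const uint32_t TOY_OP_REMOVE = 13;

struct ToyOp {
  uint32_t op;   // opcode
  uint32_t cid;  // index into coll_index
  uint32_t oid;  // index into object_index
};

class ToyTransaction {
  typedef std::map<std::string, uint32_t> Index;
  Index coll_index;
  Index object_index;
  std::vector<ToyOp> ops;

  // Return the index of name, assigning the next free one if needed.
  static uint32_t get_id(Index& index, const std::string& name) {
    Index::iterator it = index.find(name);
    if (it != index.end())
      return it->second;
    uint32_t id = static_cast<uint32_t>(index.size());
    index[name] = id;
    return id;
  }

  static std::string name_of(const Index& index, uint32_t id) {
    for (Index::const_iterator it = index.begin(); it != index.end(); ++it)
      if (it->second == id)
        return it->first;
    return "?";
  }

  void add(uint32_t opcode, const std::string& cid, const std::string& oid) {
    ToyOp op;
    op.op = opcode;
    op.cid = get_id(coll_index, cid);
    op.oid = get_id(object_index, oid);
    ops.push_back(op);
  }

public:
  void touch(const std::string& cid, const std::string& oid) { add(TOY_OP_TOUCH, cid, oid); }
  void remove(const std::string& cid, const std::string& oid) { add(TOY_OP_REMOVE, cid, oid); }
  std::size_t num_ops() const { return ops.size(); }

  // Merge other's ops into this transaction, remapping its indexes.
  void append(const ToyTransaction& other) {
    assert(&other != this);
    // cm[old collection index] = index in the combined transaction
    std::vector<uint32_t> cm(other.coll_index.size());
    for (Index::const_iterator it = other.coll_index.begin();
         it != other.coll_index.end(); ++it)
      cm[it->second] = get_id(coll_index, it->first);
    // om[old object index] = index in the combined transaction
    std::vector<uint32_t> om(other.object_index.size());
    for (Index::const_iterator it = other.object_index.begin();
         it != other.object_index.end(); ++it)
      om[it->second] = get_id(object_index, it->first);
    // Copy each op and rewrite its indexes; other.ops stays unchanged.
    for (std::size_t i = 0; i < other.ops.size(); ++i) {
      ToyOp op = other.ops[i];
      op.cid = cm[op.cid];
      op.oid = om[op.oid];
      ops.push_back(op);
    }
  }

  // Resolve op i back to names: "<opcode> <collection>/<object>".
  std::string describe(std::size_t i) const {
    const ToyOp& op = ops.at(i);
    return std::to_string(op.op) + " " + name_of(coll_index, op.cid) +
           "/" + name_of(object_index, op.oid);
  }
};
```

The same name can have different indexes in the two transactions, which is exactly why the ops cannot simply be concatenated.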

Queueing transactions

int queue_transaction(CollectionHandle& ch,
Transaction&& t,
TrackedOpRef op = TrackedOpRef(),
ThreadPool::TPHandle *handle = NULL)
{
vector<Transaction> tls;
tls.push_back(std::move(t));
return queue_transactions(ch, tls, op, handle);
}
virtual int queue_transactions(
CollectionHandle& ch, vector<Transaction>& tls,
TrackedOpRef op = TrackedOpRef(),
ThreadPool::TPHandle *handle = NULL) = 0;
public:
// versioning
virtual int upgrade() {
return 0;
}
virtual void get_db_statistics(Formatter *f) { }
virtual void generate_db_histogram(Formatter *f) { }
virtual void flush_cache() { }
virtual void dump_perf_counters(Formatter *f) {}
virtual string get_type() = 0;
// mgmt
virtual bool test_mount_in_use() = 0;
virtual int mount() = 0;
virtual int umount() = 0;
virtual int fsck(bool deep)
{
return -EOPNOTSUPP;
}
virtual int repair(bool deep)
{
return -EOPNOTSUPP;
}
virtual void set_cache_shards(unsigned num) { }
/**
* Returns 0 if the hobject is valid, -error otherwise
*
* Errors:
* -ENAMETOOLONG: locator/namespace/name too large
*/
virtual int validate_hobject_key(const hobject_t &obj) const = 0;
virtual unsigned get_max_attr_name_length() = 0;
virtual int mkfs() = 0; // wipe
virtual int mkjournal() = 0; // journal only
virtual bool needs_journal() = 0; //< requires a journal
virtual bool wants_journal() = 0; //< prefers a journal
virtual bool allows_journal() = 0; //< allows a journal
/// enumerate hardware devices (by 'devname', e.g., 'sda' as in /sys/block/sda)
virtual int get_devices(std::set<string> *devls)
{
return -EOPNOTSUPP;
}
/// true if a txn is readable immediately after it is queued.
virtual bool is_sync_onreadable() const
{
return true;
}
/**
* is_rotational
*
* Check whether store is backed by a rotational (HDD) or non-rotational
* (SSD) device.
*
* This must be usable *before* the store is mounted.
*
* @return true for HDD, false for SSD
*/
virtual bool is_rotational()
{
return true;
}
/**
* is_journal_rotational
*
* Check whether journal is backed by a rotational (HDD) or non-rotational
* (SSD) device.
*
*
* @return true for HDD, false for SSD
*/
virtual bool is_journal_rotational()
{
return true;
}
virtual string get_default_device_class()
{
return is_rotational() ? "hdd" : "ssd";
}
virtual bool can_sort_nibblewise()
{
return false; // assume a backend cannot, unless it says otherwise
}
virtual int statfs(struct store_statfs_t *buf) = 0;
virtual void collect_metadata(map<string,string> *pm) { }
/**
* write_meta - write a simple configuration key out-of-band
*
* Write a simple key/value pair for basic store configuration
* (e.g., a uuid or magic number) to an unopened/unmounted store.
* The default implementation writes this to a plaintext file in the
* path.
*
* A newline is appended.
*
* @param key key name (e.g., "fsid")
* @param value value (e.g., a uuid rendered as a string)
* @returns 0 for success, or an error code
*/
virtual int write_meta(const std::string& key,
const std::string& value);
/**
* read_meta - read a simple configuration key out-of-band
*
* Read a simple key/value pair from an unopened/unmounted store.
*
* Trailing whitespace is stripped off.
*
* @param key key name
* @param value pointer to value string
* @returns 0 for success, or an error code
*/
virtual int read_meta(const std::string& key,
std::string *value);
/**
* get ideal max value for collection_list()
*
* default to some arbitrary values; the implementation will override.
*/
virtual int get_ideal_list_max()
{
return 64;
}
/**
* get a collection handle
*
* Provide a trivial handle as a default to avoid converting legacy
* implementations.
*/
virtual CollectionHandle open_collection(const coll_t &cid) = 0;
/**
* get a collection handle for a soon-to-be-created collection
*
* This handle must be used by queue_transaction that includes a
* create_collection call in order to become valid. It will become the
* reference to the created collection.
*/
virtual CollectionHandle create_new_collection(const coll_t &cid) = 0;
/**
* Synchronous read operations
*/
/**
* exists -- Test for existence of object
*
* @param cid collection for object
* @param oid oid of object
* @returns true if object exists, false otherwise
*/
virtual bool exists(CollectionHandle& c, const ghobject_t& oid) = 0;
/**
* set_collection_opts -- set pool options for a collection
*
* @param cid collection
* @param opts new collection options
* @returns 0 on success, negative error code on failure.
*/
virtual int set_collection_opts(
CollectionHandle& c,
const pool_opts_t& opts) = 0;
/**
* stat -- get information for an object
*
* @param cid collection for object
* @param oid oid of object
* @param st output information for the object
* @param allow_eio if false, assert on -EIO operation failure
* @returns 0 on success, negative error code on failure.
*/
virtual int stat(
CollectionHandle &c,
const ghobject_t& oid,
struct stat *st,
bool allow_eio = false) = 0;
/**
* read -- read a byte range of data from an object
*
* Note: if reading from an offset past the end of the object, we
* return 0 (not, say, -EINVAL).
*
* @param cid collection for object
* @param oid oid of object
* @param offset location offset of first byte to be read
* @param len number of bytes to be read
* @param bl output bufferlist
* @param op_flags is CEPH_OSD_OP_FLAG_*
* @returns number of bytes read on success, or negative error code on failure.
*/
virtual int read(
CollectionHandle &c,
const ghobject_t& oid,
uint64_t offset,
size_t len,
bufferlist& bl,
uint32_t op_flags = 0) = 0;
/**
* fiemap -- get extent map of data of an object
*
* Returns an encoded map of the extents of an object's data portion
* (map<offset,size>).
*
* A non-enlightened implementation is free to return the extent (offset, len)
* as the sole extent.
*
* @param cid collection for object
* @param oid oid of object
* @param offset location offset of first byte to be read
* @param len number of bytes to be read
* @param bl output bufferlist for extent map information.
* @returns 0 on success, negative error code on failure.
*/
virtual int fiemap(CollectionHandle& c, const ghobject_t& oid,
uint64_t offset, size_t len, bufferlist& bl) = 0;
virtual int fiemap(CollectionHandle& c, const ghobject_t& oid,
uint64_t offset, size_t len, map<uint64_t, uint64_t>& destmap) = 0;
/**
* getattr -- get an xattr of an object
*
* @param cid collection for object
* @param oid oid of object
* @param name name of attr to read
* @param value place to put output result.
* @returns 0 on success, negative error code on failure.
*/
virtual int getattr(CollectionHandle &c, const ghobject_t& oid,
const char *name, bufferptr& value) = 0;
/**
* getattr -- get an xattr of an object
*
* @param cid collection for object
* @param oid oid of object
* @param name name of attr to read
* @param value place to put output result.
* @returns 0 on success, negative error code on failure.
*/
int getattr(
CollectionHandle &c, const ghobject_t& oid,
const string& name, bufferlist& value)
{
bufferptr bp;
int r = getattr(c, oid, name.c_str(), bp);
value.push_back(bp);
return r;
}
/**
* getattrs -- get all of the xattrs of an object
*
* @param cid collection for object
* @param oid oid of object
* @param aset place to put output result.
* @returns 0 on success, negative error code on failure.
*/
virtual int getattrs(CollectionHandle &c, const ghobject_t& oid,
map<string,bufferptr>& aset) = 0;
/**
* getattrs -- get all of the xattrs of an object
*
* @param cid collection for object
* @param oid oid of object
* @param aset place to put output result.
* @returns 0 on success, negative error code on failure.
*/
int getattrs(CollectionHandle &c, const ghobject_t& oid,
map<string,bufferlist>& aset)
{
map<string,bufferptr> bmap;
int r = getattrs(c, oid, bmap);
for (map<string,bufferptr>::iterator i = bmap.begin();
i != bmap.end();
++i) {
aset[i->first].append(i->second);
}
return r;
}
// collections
/**
* list_collections -- get all of the collections known to this ObjectStore
*
* @param ls list of the collections in sorted order.
* @returns 0 on success, negative error code on failure.
*/
virtual int list_collections(vector<coll_t>& ls) = 0;
/**
* does a collection exist?
*
* @param c collection
* @returns true if it exists, false otherwise
*/
virtual bool collection_exists(const coll_t& c) = 0;
/**
* is a collection empty?
*
* @param c collection
* @param empty true if the specified collection is empty, false otherwise
* @returns 0 on success, negative error code on failure.
*/
virtual int collection_empty(CollectionHandle& c, bool *empty) = 0;
/**
* return the number of significant bits of the coll_t::pgid.
*
* This should return what the last create_collection or split_collection
* set. A legacy backend may return -EAGAIN if the value is unavailable
* (because we upgraded from an older version, e.g., FileStore).
*/
virtual int collection_bits(CollectionHandle& c) = 0;
/**
* list contents of a collection that fall in the range [start, end), returning no more than max results
*
* @param c collection
* @param start list objects that sort >= this value
* @param end list objects that sort < this value
* @param max return no more than this many results
* @param ls [out] result
* @param next [out] next item sorts >= this value
* @return zero on success, or negative error
*/
virtual int collection_list(CollectionHandle &c,
const ghobject_t& start, const ghobject_t& end,
int max,
vector<ghobject_t> *ls, ghobject_t *next) = 0;
/// OMAP
/// Get omap contents
virtual int omap_get(
CollectionHandle &c, ///< [in] Collection containing oid
const ghobject_t &oid, ///< [in] Object containing omap
bufferlist *header, ///< [out] omap header
map<string, bufferlist> *out ///< [out] Key to value map
) = 0;
/// Get omap header
virtual int omap_get_header(
CollectionHandle &c, ///< [in] Collection containing oid
const ghobject_t &oid, ///< [in] Object containing omap
bufferlist *header, ///< [out] omap header
bool allow_eio = false ///< [in] don't assert on eio
) = 0;
/// Get keys defined on oid
virtual int omap_get_keys(
CollectionHandle &c, ///< [in] Collection containing oid
const ghobject_t &oid, ///< [in] Object containing omap
set<string> *keys ///< [out] Keys defined on oid
) = 0;
/// Get key values
virtual int omap_get_values(
CollectionHandle &c, ///< [in] Collection containing oid
const ghobject_t &oid, ///< [in] Object containing omap
const set<string> &keys, ///< [in] Keys to get
map<string, bufferlist> *out ///< [out] Returned keys and values
) = 0;
/// Filters keys into out which are defined on oid
virtual int omap_check_keys(
CollectionHandle &c, ///< [in] Collection containing oid
const ghobject_t &oid, ///< [in] Object containing omap
const set<string> &keys, ///< [in] Keys to check
set<string> *out ///< [out] Subset of keys defined on oid
) = 0;
/**
* Returns an object map iterator
*
* Warning! The returned iterator is an implicit lock on filestore
* operations in c. Do not use filestore methods on c while the returned
* iterator is live. (Filling in a transaction is no problem).
*
* @return iterator, null on error
*/
virtual ObjectMap::ObjectMapIterator get_omap_iterator(
CollectionHandle &c, ///< [in] collection
const ghobject_t &oid ///< [in] object
) = 0;
virtual int flush_journal() {
return -EOPNOTSUPP;
}
virtual int dump_journal(ostream& out) {
return -EOPNOTSUPP;
}
virtual int snapshot(const string& name) {
return -EOPNOTSUPP;
}
/**
* Set and get internal fsid for this instance. No external data is modified
*/
virtual void set_fsid(uuid_d u) = 0;
virtual uuid_d get_fsid() = 0;
/**
* Estimates the additional disk space used by the specified number of objects, caused by file allocation granularity and the metadata store
* - num_objects - total object count (including whiteouts) to measure used space for.
*/
virtual uint64_t estimate_objects_overhead(uint64_t num_objects) = 0;
virtual void compact() {}
virtual bool has_builtin_csum() const
{
return false;
}
};
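
The collection_list contract above (half-open range [start, end), at most max results, and a next cursor) is meant to be called in a loop. Here is a minimal sketch of that paging pattern over a sorted vector of names. list_range and list_all are hypothetical helpers, strings stand in for ghobject_t, and setting next to end once the range is exhausted is an assumption of this sketch rather than something the excerpt specifies.

```cpp
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <string>
#include <vector>

// Hypothetical model of collection_list: the collection is a sorted vector.
// Appends objects o with start <= o < end to *ls, at most max of them, and
// sets *next to the first object not returned (or to end when done).
int list_range(const std::vector<std::string>& sorted_objects,
               const std::string& start, const std::string& end,
               int max, std::vector<std::string>* ls, std::string* next) {
  if (max < 0 || ls == nullptr || next == nullptr) return -EINVAL;
  auto it = std::lower_bound(sorted_objects.begin(), sorted_objects.end(), start);
  auto stop = std::lower_bound(it, sorted_objects.end(), end);
  int n = 0;
  for (; it != stop && n < max; ++it, ++n) ls->push_back(*it);
  *next = (it != stop) ? *it : end;
  return 0;
}

// Page through the whole range, feeding next back in as the new start.
std::vector<std::string> list_all(const std::vector<std::string>& sorted_objects,
                                  const std::string& start,
                                  const std::string& end, int page) {
  std::vector<std::string> all;
  std::string cursor = start;
  while (cursor < end) {
    std::vector<std::string> batch;
    std::string next;
    if (list_range(sorted_objects, cursor, end, page, &batch, &next) != 0) break;
    all.insert(all.end(), batch.begin(), batch.end());
    if (batch.empty() && next == cursor) break;  // no progress, e.g. page == 0
    cursor = next;
  }
  return all;
}
```

Because next sorts at or after the first unreturned object, passing it back as start never skips or repeats an entry. The page size is the kind of value get_ideal_list_max() is meant to supply.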