2018-07-31

1.4 The ObjectStore interface

Flag bits

// Flag bits
typedef uint32_t osflagbits_t;
const int SKIP_JOURNAL_REPLAY = 1 << 0;
const int SKIP_MOUNT_OMAP = 1 << 1;

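What is interesting here is mostly the engineering idiom: to build the mask for bit x, write 1 << x.
Each flag then occupies its own bit, so several flags can be OR-ed together in a single osflagbits_t.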
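
To make the idiom concrete, here is a minimal sketch of setting, testing, and clearing such flags. The two constants are copied from the excerpt; the helper names set_flag, has_flag, and clear_flag are hypothetical and only serve the illustration.

```cpp
#include <cstdint>

typedef uint32_t osflagbits_t;
const int SKIP_JOURNAL_REPLAY = 1 << 0;
const int SKIP_MOUNT_OMAP = 1 << 1;

// Hypothetical helpers (not part of Ceph) showing the usual bit operations.

// Turn a flag on with bitwise OR.
inline osflagbits_t set_flag(osflagbits_t flags, int bit) {
    return flags | static_cast<osflagbits_t>(bit);
}

// Test a flag with bitwise AND.
inline bool has_flag(osflagbits_t flags, int bit) {
    return (flags & static_cast<osflagbits_t>(bit)) != 0;
}

// Turn a flag off by AND-ing with the inverted mask.
inline osflagbits_t clear_flag(osflagbits_t flags, int bit) {
    return flags & ~static_cast<osflagbits_t>(bit);
}
```

Because each constant is a distinct power of two, combining flags never disturbs the other bits.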

Members and functions

Everything below is a member variable or member function of ObjectStore.

Path

The ObjectStore's context and path.

// Path of the ObjectStore
string path;
CephContext* cct;

Factory method

Abstract classes with many derived classes often expose a static create factory method like this one.


/**
 * create - create an ObjectStore instance.
 *
 * This is invoked once at initialization time.
 *
 * @param type type of store. This is a string from the configuration file.
 * @param data path (or other descriptor) for data
 * @param journal path (or other descriptor) for journal (optional)
 * @param flags which filestores should check if applicable
 */
static ObjectStore *create(CephContext *cct,
                           const string& type,
                           const string& data,
                           const string& journal,
                           osflagbits_t flags = 0);

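As the signature shows, the caller has to specify:

the type of ObjectStore wanted
the data area of the ObjectStore
the journal area of the ObjectStore
the flag bits of the ObjectStore

Two flag bits matter here:

whether to skip journal replay
whether to skip mounting the omap

These correspond to the two SKIP_ flags defined earlier.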
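
The factory pattern described above can be sketched as follows. This is not Ceph's implementation: Store, ToyFileStore, ToyMemStore, and create_store are hypothetical stand-ins, and the only things taken from the excerpt are the flag constants and the idea of dispatching on a type string from the configuration.

```cpp
#include <cstdint>
#include <memory>
#include <string>

typedef uint32_t osflagbits_t;
const int SKIP_JOURNAL_REPLAY = 1 << 0;
const int SKIP_MOUNT_OMAP = 1 << 1;

// Hypothetical abstract base class standing in for ObjectStore.
struct Store {
    virtual ~Store() = default;
    virtual std::string name() const = 0;
};

// Hypothetical backend that interprets the two SKIP_ flags.
struct ToyFileStore : Store {
    std::string data, journal;
    bool replay_journal;
    bool mount_omap;
    ToyFileStore(const std::string& d, const std::string& j, osflagbits_t flags)
        : data(d), journal(j),
          replay_journal(!(flags & SKIP_JOURNAL_REPLAY)),
          mount_omap(!(flags & SKIP_MOUNT_OMAP)) {}
    std::string name() const override { return "filestore"; }
};

// Hypothetical backend that needs neither a journal nor flags.
struct ToyMemStore : Store {
    std::string data;
    explicit ToyMemStore(const std::string& d) : data(d) {}
    std::string name() const override { return "memstore"; }
};

// Factory: map the configured type string to a concrete subclass.
// Returns nullptr when the type is unknown.
std::unique_ptr<Store> create_store(const std::string& type,
                                    const std::string& data,
                                    const std::string& journal,
                                    osflagbits_t flags = 0) {
    if (type == "filestore")
        return std::unique_ptr<Store>(new ToyFileStore(data, journal, flags));
    if (type == "memstore")
        return std::unique_ptr<Store>(new ToyMemStore(data));
    return nullptr;
}
```

The caller only ever sees the base-class pointer, so adding a backend means touching the factory and nothing else.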

Reading the fsid

/**
 * probe a block device to learn the uuid of the owning OSD
 *
 * @param cct cct
 * @param path path to device
 * @param fsid [out] osd uuid
 */
static int probe_block_device_fsid(
    CephContext *cct,
    const string& path,
    uuid_d *fsid);

This reads the fsid stored on a block device, that is, the uuid of the OSD that owns it.
Each ObjectStore implementation reads it in its own way.

int ObjectStore::probe_block_device_fsid(
    CephContext *cct,
    const string& path,
    uuid_d *fsid)
{
    int r;
#if defined(WITH_BLUESTORE)
    // first try bluestore -- it has a crc on its header and will fail
    // reliably.
    r = BlueStore::get_block_device_fsid(cct, path, fsid);
    if (r == 0) {
        lgeneric_dout(cct, 0) << __func__ << " " << path << " is bluestore, "
                              << *fsid << dendl;
        return r;
    }
#endif
    // okay, try FileStore (journal).
    r = FileStore::get_block_device_fsid(cct, path, fsid);
    if (r == 0) {
        lgeneric_dout(cct, 0) << __func__ << " " << path << " is filestore, "
                              << *fsid << dendl;
        return r;
    }
    return -EINVAL;
}

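From a design standpoint, it would be cleaner to push this logic into each implementation instead of chaining the calls with if/else here. Let's expand the FileStore branch a little.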

int FileStore::get_block_device_fsid(CephContext* cct, const string& path,
                                     uuid_d *fsid)
{
    // make sure we don't try to use aio or direct_io (and get annoying
    // error messages from failing to do so); performance implications
    // should be irrelevant for this use
    FileJournal j(cct, *fsid, 0, 0, path.c_str(), false, false);
    return j.peek_fsid(*fsid);
}
// This can not be used on an active journal
int FileJournal::peek_fsid(uuid_d& fsid)
{
    assert(fd == -1);
    int r = _open(false, false);
    if (r)
        return r;
    r = read_header(&header);
    if (r < 0)
        goto out;
    fsid = header.fsid;
out:
    close();
    return r;
}

As the code shows, this simply opens the journal, reads the journal header, and takes the fsid out of that header.
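
One way to avoid the hard-coded if/else chain criticized above is to let each backend contribute a probe and have the caller loop over them. The sketch below is only an illustration of that idea under assumed names: Probe and probe_first_match are hypothetical, and a plain string stands in for uuid_d.

```cpp
#include <cerrno>
#include <functional>
#include <string>
#include <vector>

// A probe tries to read an id from `path`.
// It returns 0 on success and a negative errno value otherwise.
using Probe = std::function<int(const std::string& path, std::string* id)>;

// Try each probe in order and stop at the first one that succeeds.
// Mirrors the fall-through in the excerpt: if nobody recognizes the
// device, report -EINVAL.
int probe_first_match(const std::vector<Probe>& probes,
                      const std::string& path, std::string* id) {
    for (const auto& probe : probes) {
        if (probe(path, id) == 0)
            return 0;
    }
    return -EINVAL;
}
```

The ordering of the vector still matters: as in the excerpt, the most reliable probe (the one that fails cleanly on foreign data) should come first.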

Getting ObjectStore performance statistics

As the comment says, this mainly fetches the ObjectStore's commit and apply latency.

/**
 * Fetch Object Store statistics.
 *
 * Currently only latency of write and apply times are measured.
 *
 * This appears to be called with nothing locked.
 */
virtual objectstore_perf_stat_t get_cur_stats() = 0;

Let's expand get_cur_stats and see how FileStore.h handles it.

struct FSPerfTracker {
    PerfCounters::avg_tracker<uint64_t> os_commit_latency_ns;
    PerfCounters::avg_tracker<uint64_t> os_apply_latency_ns;
    objectstore_perf_stat_t get_cur_stats() const {
        objectstore_perf_stat_t ret;
        ret.os_commit_latency_ns = os_commit_latency_ns.current_avg();
        ret.os_apply_latency_ns = os_apply_latency_ns.current_avg();
        return ret;
    }
    void update_from_perfcounters(PerfCounters &logger);
} perf_tracker;
objectstore_perf_stat_t get_cur_stats() override {
    perf_tracker.update_from_perfcounters(*logger);
    return perf_tracker.get_cur_stats();
}

As we can see, the values refreshed here are mainly:

os_commit_latency_ns
os_apply_latency_ns
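
The excerpt does not show how avg_tracker computes current_avg(). A minimal sketch of one plausible approach follows, assuming the tracker is fed cumulative (sum, count) totals and reports the average over the most recent interval. AvgTracker and consume_next are hypothetical names, and this is not Ceph's PerfCounters::avg_tracker.

```cpp
#include <cstdint>

// Toy average tracker (hypothetical). It remembers the previous
// (sum, count) sample so that the reported average covers only what
// was recorded between the last two updates.
class AvgTracker {
    uint64_t last_sum = 0, last_count = 0;
    uint64_t cur_sum = 0, cur_count = 0;
public:
    // Feed the latest cumulative totals read from a counter.
    void consume_next(uint64_t sum, uint64_t count) {
        last_sum = cur_sum;
        last_count = cur_count;
        cur_sum = sum;
        cur_count = count;
    }
    // Average over the most recent interval; 0 if nothing new arrived.
    uint64_t current_avg() const {
        if (cur_count == last_count)
            return 0;
        return (cur_sum - last_sum) / (cur_count - last_count);
    }
};
```

Reporting the interval average rather than the lifetime average keeps the number responsive to recent load, which is what a latency statistic is usually wanted for.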

Getting the performance counters

/**
 * Fetch Object Store performance counters.
 *
 *
 * This appears to be called with nothing locked.
 */
virtual const PerfCounters* get_perf_counters() const = 0;

Most subclasses that have performance counters implement this as a one-liner:

const PerfCounters* get_perf_counters() const override {
    return logger;
}

Collection

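Ceph normally assumes that transactions are unrelated to one another, which means transaction A and transaction B may be submitted in any order. What if the user needs certain transactions to happen in a particular order?

Say the order must be transaction A, then B, then C. This is where a Collection comes in: A, B, and C are queued in order on the same Collection.

/**
 * A collection holds a sequence of transactions that have a defined order.
 * Transactions queued on the same collection must be applied one by one,
 * in the order in which they were queued.
 * Transactions in different collections may be committed in parallel.
 *
 * ObjectStore users obtain a collection pointer in one of two ways:
 * - open_collection()
 * - create_new_collection()
 */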
struct CollectionImpl : public RefCountedObject {
    const coll_t cid;
    CollectionImpl(const coll_t& c)
        : RefCountedObject(NULL, 0),
          cid(c) {}
    /// wait for any queued transactions to apply
    // block until any previous transactions are visible.  specifically,
    // collection_list and collection_empty need to reflect prior operations.
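    // In other words, flush() blocks until the queued transactions have
    // been applied one by one: later transactions cannot proceed until
    // the earlier ones have taken effect, and collection_list() and
    // collection_empty() must reflect the earlier operations.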
    virtual void flush() = 0;

For now we set aside what collection_list and collection_empty actually do.

    /**
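     * Async flush_commit
     *
     * There are two cases:
     * 1. The collection is currently idle: flush_commit returns true
     *    and c is not touched.
     * 2. The collection is not idle: the method returns false and c is
     *    called asynchronously with a value of 0 once all transactions
     *    queued on this collection before the flush_commit call have
     *    been applied and committed.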
    virtual bool flush_commit(Context *c) = 0;
    const coll_t &get_cid() { return cid; }
};
// Handle type for a Collection
typedef boost::intrusive_ptr<CollectionImpl> CollectionHandle;
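
The ordering guarantee and the two flush_commit cases can be modeled with a small toy. Everything here is hypothetical (ToyCollection, queue, and the use of std::function in place of Context); it runs on a single thread and only illustrates the contract described in the comments above.

```cpp
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy collection (hypothetical): transactions queued on it are applied
// strictly in FIFO order.
class ToyCollection {
    std::deque<std::string> queued;                 // pending transactions
    std::vector<std::function<void(int)>> waiters;  // flush_commit callbacks
public:
    std::vector<std::string> applied;               // apply order, for inspection

    void queue(const std::string& txn) { queued.push_back(txn); }

    // Idle: return true and never touch c.
    // Busy: return false; c is called with 0 once the queue has drained.
    bool flush_commit(std::function<void(int)> c) {
        if (queued.empty())
            return true;
        waiters.push_back(std::move(c));
        return false;
    }

    // Apply everything that is queued, in order, then notify the waiters.
    void flush() {
        while (!queued.empty()) {
            applied.push_back(queued.front());
            queued.pop_front();
        }
        for (auto& w : waiters)
            w(0);
        waiters.clear();
    }
};
```

A real store applies transactions on worker threads, so flush() would block rather than do the work itself; the toy collapses that into one call to keep the ordering visible.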

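Object contents and semantics

Every object in an ObjectStore is uniquely identified, as both a ghobject_t and an hobject_t.
ObjectStore operations support creating, modifying, deleting, and enumerating the objects in a collection.

Enumeration returns the objects sorted by object key. Object names are unique across the whole Ceph system.

Each object has four distinct parts:

data
xattrs
omap_header
omap_entries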

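For more on omap, see this link: http://bean-li.github.io/ceph-omap/

In brief:

FileStore's omap stores object attributes as key-value pairs. How should the key be formed for the different attributes of an object?
The most direct idea is (object_id + xattr_key): combine the two into the key. The problem is that object_id can be long, and when a single object has many attributes the object_id is repeated in every key, which wastes storage space.

Ceph's FileStore does it in two steps:
Step 1: generate a short seq from the object_id and store the seq in the omap_header.
Step 2: use seq + xattr_key as the key for each attribute of the object.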

How the seq is generated

When omap is implemented on LevelDB, LevelDB stores one OSD-wide global state value under a single key.

key: SYS_PREFIX + GLOBAL_STATE_KEY
value: state

To allocate a seq, take the lock that guards it and increment it. The seq lives inside state, and state is persisted in LevelDB.

struct State {
    __u8 v;
    uint64_t seq;
};

From object_id to seq

struct _Header {
    uint64_t seq;
    uint64_t parent;
    uint64_t num_children;
    coll_t c;
    ghobject_t oid; 
    SequencerPosition spos;
};

As soon as a seq is generated, a header structure is created and stored in LevelDB.

key: HOBJECT_TO_SEQ + ghobject_key(oid)
value: header
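
The two-step key scheme can be sketched end to end. This is a toy under stated assumptions: std::map stands in for LevelDB, a plain integer stands in for the header value, and the textual key layout (the "HOBJECT_TO_SEQ" prefix glued to the name, and seq + "." + xattr_key) is made up for readability rather than taken from the real on-disk encoding. ToyOmapIndex and its methods are hypothetical names.

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <string>

// Toy omap index (hypothetical).
class ToyOmapIndex {
    std::mutex lock;
    uint64_t next_seq = 1;                       // plays the role of State::seq
    std::map<std::string, uint64_t> oid_to_seq;  // HOBJECT_TO_SEQ + oid -> seq
    std::map<std::string, std::string> kv;       // seq + xattr_key -> value

    static std::string make_key(uint64_t seq, const std::string& xattr_key) {
        return std::to_string(seq) + "." + xattr_key;
    }
public:
    // Step 1: look up the object's seq, allocating one under the lock
    // the first time the object is seen.
    uint64_t seq_for(const std::string& oid) {
        std::lock_guard<std::mutex> g(lock);
        const std::string key = "HOBJECT_TO_SEQ" + oid;
        auto it = oid_to_seq.find(key);
        if (it != oid_to_seq.end())
            return it->second;
        uint64_t seq = next_seq++;
        oid_to_seq[key] = seq;
        return seq;
    }
    // Step 2: store the attribute under the short key seq + xattr_key.
    void set(const std::string& oid, const std::string& k, const std::string& v) {
        uint64_t seq = seq_for(oid);
        std::lock_guard<std::mutex> g(lock);
        kv[make_key(seq, k)] = v;
    }
    // Look up an attribute; returns an empty string if the object or
    // the attribute is unknown. Never allocates a seq.
    std::string get(const std::string& oid, const std::string& k) {
        std::lock_guard<std::mutex> g(lock);
        auto s = oid_to_seq.find("HOBJECT_TO_SEQ" + oid);
        if (s == oid_to_seq.end())
            return std::string();
        auto it = kv.find(make_key(s->second, k));
        return it == kv.end() ? std::string() : it->second;
    }
};
```

The long object name appears once, in the name-to-seq entry; every attribute key after that carries only the short seq.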

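Object data

The data part of an object is conceptually equivalent to a file in a file system. Random and partial reads and writes of an object must both be supported. Sparse handling of the data part is not a hard requirement. In general a single object should not be too large; the upper end is typically around 100 MB.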
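
The file-like semantics can be illustrated with a toy that follows the "simple implementation" behavior documented for write() later in this section: a write past the current end zero-fills the gap, and a zero-length write leaves the size alone. ToyObjectData is a hypothetical name, and holding the bytes in a std::string is purely for illustration.

```cpp
#include <cstddef>
#include <string>

// Toy object data (hypothetical): a byte string that grows on demand.
struct ToyObjectData {
    std::string bytes;

    // Write `buf` at offset `off`. A gap between the old end and `off`
    // is filled with zero bytes. A zero-length write changes nothing.
    void write(std::size_t off, const std::string& buf) {
        if (buf.empty())
            return;
        if (off + buf.size() > bytes.size())
            bytes.resize(off + buf.size(), '\0');
        bytes.replace(off, buf.size(), buf);
    }

    // Read up to `len` bytes starting at `off`. The result is shorter
    // than `len` (possibly empty) when the range runs past the end.
    std::string read(std::size_t off, std::size_t len) const {
        if (off >= bytes.size())
            return std::string();
        return bytes.substr(off, len);
    }
};
```

A store with sparse support would record the gap as a hole instead of materializing the zeros, but readers would see the same bytes.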

Object xattrs

xattrs are mainly stored in the file system's extended attributes, whereas omap is generally stored in LevelDB.

/*********************************
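 * Transactions
 *
 * A transaction contains a series of modification operations.
 *
 * Three events in the life of a transaction trigger callbacks, and
 * every transaction may carry the following callbacks:
 *
 *    on_applied_sync, on_applied, and on_commit.
 *
 * `on_applied` and `on_applied_sync` are both triggered only after the modifications have taken effect, meaning they are visible to subsequent operations.
 *
 * The only conceptual difference between `on_applied` and `on_applied_sync` is the thread and locking environment in which the callback runs. `on_applied_sync` is invoked directly by the executing thread, so it should run quickly and must not acquire locks in the calling environment (acquiring a lock could make it wait).
 * `on_applied`, by contrast, is invoked from a separate Finisher thread, which means the calling environment is one where acquiring locks, and possibly waiting for them, is fine.
 * Note that on_applied and on_applied_sync are sometimes also called on_readable and on_readable_sync.
 *
 * The on_commit callback is always invoked from a separate Finisher thread, after all the modifications have been written to the journal, that is, made durable.
 *
 * As for how the journal is written: each original modification (including its data) can be serialized into a single buffer. The serialization does not copy the data itself but references the original data, so the original data must stay unchanged until the on_commit callback completes. In practice bufferlist handles all of this; the subtlety only matters when an existing data buffer is referenced through buffer::raw_static.
 *
 * Some ObjectStore implementations choose to implement their own form of journal and use this serialization to represent a transaction. In that case the encode/decode logic must handle versioning properly, including upgrades.
 *
 *
 * TRANSACTION ISOLATION
 *
 * In short: transactions are independent of one another, and it is the caller's job to enforce any relationship between them. ObjectStore neither detects nor reports violations of this rule. The original comment reads:
 *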
 * Except as noted above, isolation is the responsibility of the
 * caller. In other words, if any storage element (storage element
 * == any of the four portions of an object as described above) is
 * altered by a transaction (including deletion), the caller
 * promises not to attempt to read that element while the
 * transaction is pending (here pending means from the time of
 * issuance until the "on_applied_sync" callback has been
 * received). Violations of isolation need not be detected by
 * ObjectStore and there is no corresponding error mechanism for
 * reporting an isolation violation (crashing would be the
 * appropriate way to report an isolation violation if detected).
 *
 * Enumeration operations may violate transaction isolation as
 * described above when a storage element is being created or
 * deleted as part of a transaction. In this case, ObjectStore is
 * allowed to consider the enumeration operation to either precede
 * or follow the violating transaction element. In other words, the
 * presence/absence of the mutated element in the enumeration is
 * entirely at the discretion of ObjectStore. The arbitrary ordering
 * applies independently to each transaction element. For example,
 * if a transaction contains two mutating elements "create A" and
 * "delete B", and an enumeration operation is performed while this
 * transaction is pending, it is permissible for ObjectStore to
 * report any of the four possible combinations of the existence of
 * A and B.
 *
 */

Transactions
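
The Transaction class that follows stores a small integer index in each Op instead of the full collection or object name, and append() has to translate the other transaction's indices into its own. A reduced sketch of that idea, covering collections only, is given here. ToyTxn is hypothetical: ops are reduced to a bare collection id, names are plain strings, and unlike the real append() it leaves the other transaction untouched.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy transaction (hypothetical).
struct ToyTxn {
    std::map<std::string, uint32_t> coll_index;  // name -> local id
    std::vector<uint32_t> ops;                   // each op is just a coll id

    // Return the existing id for `coll`, or assign the next free one.
    uint32_t get_coll_id(const std::string& coll) {
        auto it = coll_index.find(coll);
        if (it != coll_index.end())
            return it->second;
        uint32_t id = static_cast<uint32_t>(coll_index.size());
        coll_index[coll] = id;
        return id;
    }

    void touch(const std::string& coll) { ops.push_back(get_coll_id(coll)); }

    // Merge `other` into this transaction. Ids are only meaningful
    // inside one transaction, so build a translation table cm
    // (other's id -> our id) and push other's ops through it.
    void append(const ToyTxn& other) {
        std::vector<uint32_t> cm(other.coll_index.size());
        for (const auto& kv : other.coll_index)
            cm[kv.second] = get_coll_id(kv.first);
        for (uint32_t id : other.ops)
            ops.push_back(cm[id]);
    }
};
```

For example, if one transaction knows pg1 and pg2 as ids 0 and 1 and another knows pg3 and pg2 as ids 0 and 1, then after the merge pg3 receives the new id 2 while pg2 keeps id 1, and the appended ops are rewritten accordingly.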

class Transaction {
public:
    // This is somewhat like designing an instruction set.
    enum {
        OP_NOP =          0,
        OP_TOUCH =        9,   // cid, oid
        OP_WRITE =        10,  // cid, oid, offset, len, bl
        OP_ZERO =         11,  // cid, oid, offset, len
        OP_TRUNCATE =     12,  // cid, oid, len
        OP_REMOVE =       13,  // cid, oid
        OP_SETATTR =      14,  // cid, oid, attrname, bl
        OP_SETATTRS =     15,  // cid, oid, attrset
        OP_RMATTR =       16,  // cid, oid, attrname
        OP_CLONE =        17,  // cid, oid, newoid
        OP_CLONERANGE =   18,  // cid, oid, newoid, offset, len
        OP_CLONERANGE2 =  30,  // cid, oid, newoid, srcoff, len, dstoff
        OP_TRIMCACHE =    19,  // cid, oid, offset, len  **DEPRECATED**
        OP_MKCOLL =       20,  // cid
        OP_RMCOLL =       21,  // cid
        OP_COLL_ADD =     22,  // cid, oldcid, oid
        OP_COLL_REMOVE =  23,  // cid, oid
        OP_COLL_SETATTR = 24,  // cid, attrname, bl
        OP_COLL_RMATTR =  25,  // cid, attrname
        OP_COLL_SETATTRS = 26,  // cid, attrset
        OP_COLL_MOVE =    8,   // newcid, oldcid, oid
        OP_RMATTRS =      28,  // cid, oid
        OP_COLL_RENAME =       29,  // cid, newcid
        OP_OMAP_CLEAR = 31,   // cid
        OP_OMAP_SETKEYS = 32, // cid, attrset
        OP_OMAP_RMKEYS = 33,  // cid, keyset
        OP_OMAP_SETHEADER = 34, // cid, header
        OP_SPLIT_COLLECTION = 35, // cid, bits, destination
        OP_SPLIT_COLLECTION2 = 36, /* cid, bits, destination
    doesn't create the destination */
        OP_OMAP_RMKEYRANGE = 37,  // cid, oid, firstkey, lastkey
        OP_COLL_MOVE_RENAME = 38,   // oldcid, oldoid, newcid, newoid
        OP_SETALLOCHINT = 39,  // cid, oid, object_size, write_size
        OP_COLL_HINT = 40, // cid, type, bl
        OP_TRY_RENAME = 41,   // oldcid, oldoid, newoid
        OP_COLL_SET_BITS = 42, // cid, bits
    };
    // Transaction hint type
    enum {
        COLL_HINT_EXPECTED_NUM_OBJECTS = 1,
    };
    // The actual operation
    struct Op {
        __le32 op;  // numeric operation type, which can also be seen as the instruction opcode
        __le32 cid;
        __le32 oid;
        __le64 off;
        __le64 len;
        __le32 dest_cid;
        __le32 dest_oid;                  //OP_CLONE, OP_CLONERANGE
        __le64 dest_off;                  //OP_CLONERANGE
        union {
            struct {
                __le32 hint_type;             //OP_COLL_HINT
            };
            struct {
                __le32 alloc_hint_flags;      //OP_SETALLOCHINT
            };
        };
        __le64 expected_object_size;      //OP_SETALLOCHINT
        __le64 expected_write_size;       //OP_SETALLOCHINT
        __le32 split_bits;                //OP_SPLIT_COLLECTION2,OP_COLL_SET_BITS,
        //OP_MKCOLL
        __le32 split_rem;                 //OP_SPLIT_COLLECTION2
    } __attribute__ ((packed)) ;
    // Summary data for the whole transaction
    struct TransactionData {
        __le64 ops;  // the number of operations
        __le32 largest_data_len;
        __le32 largest_data_off;
        __le32 largest_data_off_in_data_bl;
        __le32 fadvise_flags;
    } __attribute__ ((packed)) ;
private:
    TransactionData data;
    map<coll_t, __le32> coll_index;
    map<ghobject_t, __le32> object_index;
    __le32 coll_id {0};
    __le32 object_id {0};
    bufferlist data_bl;
    bufferlist op_bl;
    bufferptr op_ptr;
    list<Context *> on_applied;
    list<Context *> on_commit;
    list<Context *> on_applied_sync;
public:
    void _update_op(Op* op,
                    vector<__le32> &cm,
                    vector<__le32> &om) {
        // Remap the collection id and/or the object ids
        // to the indices of the merged transaction,
        // as required by the op type.
        op->cid = cm[op->cid];
        op->oid = om[op->oid];
        op->dest_oid = om[op->dest_oid];
    }
    // bl holds a list of buffers,
    // and each buffer holds a whole number of Op structures.
    // Each Op is updated in place through
    // _update_op(op, cm, om).
    void _update_op_bl(
        bufferlist& bl,
        vector<__le32> &cm,
        vector<__le32> &om)
    {
        list<bufferptr> list = bl.buffers();
        std::list<bufferptr>::iterator p;
        for(p = list.begin(); p != list.end(); ++p) {
            assert(p->length() % sizeof(Op) == 0);
            char* raw_p = p->c_str();
            char* raw_end = raw_p + p->length();
            while (raw_p < raw_end) {
                _update_op(reinterpret_cast<Op*>(raw_p), cm, om);
                raw_p += sizeof(Op);
            }
        }
    }
    /// Append the operations of the parameter to this Transaction.
    // Those operations are removed from the parameter Transaction
    // This works more like merging two transactions. Note:
    // other.op_bl is deep-copied.
    // other.data_bl is not deep-copied,
    // possibly because other may still be used elsewhere.
    void append(Transaction& other) {
        data.ops += other.data.ops;
        if (other.data.largest_data_len > data.largest_data_len) {
            data.largest_data_len = other.data.largest_data_len;
            data.largest_data_off = other.data.largest_data_off;
            data.largest_data_off_in_data_bl = data_bl.length() + other.data.largest_data_off_in_data_bl;
        }
        data.fadvise_flags |= other.data.fadvise_flags;
        // splice appends the other list to the end of on_applied/on_commit.
        // list::splice(iterator position, list<T>& l) moves the elements
        // of l to position, so l is empty afterwards.
        on_applied.splice(on_applied.end(), other.on_applied);
        on_commit.splice(on_commit.end(), other.on_commit);
        on_applied_sync.splice(on_applied_sync.end(), other.on_applied_sync);
        //append coll_index & object_index
        // cm is built fresh here and is used below to remap the ids
        vector<__le32> cm(other.coll_index.size());
        map<coll_t, __le32>::iterator coll_index_p;
        for (coll_index_p = other.coll_index.begin();
             coll_index_p != other.coll_index.end();
             ++coll_index_p) {
            // fill in the cm vector
            cm[coll_index_p->second] = _get_coll_id(coll_index_p->first);
        }
        vector<__le32> om(other.object_index.size());
        map<ghobject_t, __le32>::iterator object_index_p;
        for (object_index_p = other.object_index.begin();
             object_index_p != other.object_index.end();
             ++object_index_p) {
            // fill in the om vector
            om[object_index_p->second] = _get_object_id(object_index_p->first);
        }
        // other.op_bl SHOULD NOT be changed during the append operation,
        // so we use an additional bufferlist to avoid this problem.
        // Allocate a new buffer of length other.op_bl.length().
        bufferptr other_op_bl_ptr(other.op_bl.length());
        // Copy the contents of other.op_bl into the newly allocated buffer.
        other.op_bl.copy(0, other.op_bl.length(), other_op_bl_ptr.c_str());
        bufferlist other_op_bl;
        // A bufferlist is a list<bufferptr>, so append() adds the buffer above to it.
        other_op_bl.append(other_op_bl_ptr);
        //update other_op_bl with cm & om
        //When the other is appended to current transaction, all coll_index and
        //object_index in other.op_buffer should be updated by new index of the
        //combined transaction
        // Rewrite the copied ops so that they use the new indices.
        _update_op_bl(other_op_bl, cm, om);
        //append op_bl
        // Append other's (copied) op_bl to our op_bl,
        // which completes the merge of the two transactions' ops.
        op_bl.append(other_op_bl);
        //append data_bl
        // data_bl has to be merged as well
        data_bl.append(other.data_bl);
    }
    /** Inquires about the Transaction as a whole. */
    /// How big is the encoded Transaction buffer?
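    // Returns the encoded length of the whole transaction.
    // It seems wasteful to recompute this on every call;
    // ideally it could be optimized.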
    uint64_t get_encoded_bytes() {
        //layout: data_bl + op_bl + coll_index + object_index + data
        // coll_index size, object_index size and sizeof(transaction_data)
        // all here, so they may be computed at compile-time
        size_t final_size = sizeof(__u32) * 2 + sizeof(data);
        // coll_index second and object_index second
        final_size += (coll_index.size() + object_index.size()) * sizeof(__le32);
        // coll_index first
        for (auto p = coll_index.begin(); p != coll_index.end(); ++p) {
            final_size += p->first.encoded_size();
        }
        // object_index first
        for (auto p = object_index.begin(); p != object_index.end(); ++p) {
            final_size += p->first.encoded_size();
        }
        return data_bl.length() +
               op_bl.length() +
               final_size;
    }
    uint64_t get_num_bytes() {
        return get_encoded_bytes();
    }
    /// Size of largest data buffer to the "write" operation encountered so far
    uint32_t get_data_length() {
        return data.largest_data_len;
    }
    /// offset within the encoded buffer to the start of the largest data buffer that's encoded
    uint32_t get_data_offset()
    {
        if (data.largest_data_off_in_data_bl) {
            return data.largest_data_off_in_data_bl +
                   sizeof(__u8) +      // encode struct_v
                   sizeof(__u8) +      // encode compat_v
                   sizeof(__u32) +     // encode len
                   sizeof(__u32);      // data_bl len
        }
        return 0;  // none
    }
    /// offset of buffer as aligned to destination within object.
    int get_data_alignment()
    {
        if (!data.largest_data_len)
            return 0;
        return (0 - get_data_offset()) & ~CEPH_PAGE_MASK;
    }
    /// Is the Transaction empty (no operations)
    bool empty()
    {
        // data.ops counts the number of operations
        return !data.ops;
    }
    /// Number of operations in the transaction
    int get_num_ops()
    {
        return data.ops;
    }
    /**
     * iterator
     *
     * Helper object to parse Transactions.
     *
     * ObjectStore instances use this object to step down the encoded
     * buffer decoding operation codes and parameters as we go.
     *
     */
    class iterator
    {
        Transaction *t;
        uint64_t ops;
        char* op_buffer_p;
        bufferlist::const_iterator data_bl_p;
    public:
        vector<coll_t> colls;
        vector<ghobject_t> objects;
    private:
        explicit iterator(Transaction *t)
            : t(t),
              data_bl_p(t->data_bl.cbegin()),
              colls(t->coll_index.size()),
              objects(t->object_index.size())
        {
            ops = t->data.ops;
            op_buffer_p = t->op_bl.get_contiguous(0, t->data.ops * sizeof(Op));
            map<coll_t, __le32>::iterator coll_index_p;
            for (coll_index_p = t->coll_index.begin();
                 coll_index_p != t->coll_index.end();
                 ++coll_index_p) {
                colls[coll_index_p->second] = coll_index_p->first;
            }
            map<ghobject_t, __le32>::iterator object_index_p;
            for (object_index_p = t->object_index.begin();
                 object_index_p != t->object_index.end();
                 ++object_index_p) {
                objects[object_index_p->second] = object_index_p->first;
            }
        }
        friend class Transaction;
    public:
        bool have_op()
        {
            return ops > 0;
        }
        Op* decode_op()
        {
            assert(ops > 0);
            Op* op = reinterpret_cast<Op*>(op_buffer_p);
            op_buffer_p += sizeof(Op);
            ops--;
            return op;
        }
        string decode_string()
        {
            using ceph::decode;
            string s;
            decode(s, data_bl_p);
            return s;
        }
        void decode_bp(bufferptr& bp)
        {
            using ceph::decode;
            decode(bp, data_bl_p);
        }
        void decode_bl(bufferlist& bl)
        {
            using ceph::decode;
            decode(bl, data_bl_p);
        }
        void decode_attrset(map<string,bufferptr>& aset)
        {
            using ceph::decode;
            decode(aset, data_bl_p);
        }
        void decode_attrset(map<string,bufferlist>& aset)
        {
            using ceph::decode;
            decode(aset, data_bl_p);
        }
        void decode_attrset_bl(bufferlist *pbl)
        {
            decode_str_str_map_to_bl(data_bl_p, pbl);
        }
        void decode_keyset(set<string> &keys)
        {
            using ceph::decode;
            decode(keys, data_bl_p);
        }
        void decode_keyset_bl(bufferlist *pbl)
        {
            decode_str_set_to_bl(data_bl_p, pbl);
        }
        const ghobject_t &get_oid(__le32 oid_id)
        {
            assert(oid_id < objects.size());
            return objects[oid_id];
        }
        const coll_t &get_cid(__le32 cid_id)
        {
            assert(cid_id < colls.size());
            return colls[cid_id];
        }
        uint32_t get_fadvise_flags() const
        {
            return t->get_fadvise_flags();
        }
    };
    iterator begin()
    {
        return iterator(this);
    }
private:
    void _build_actions_from_tbl();
 
    /**
     * Helper functions to encode the various mutation elements of a
     * transaction.  These are 1:1 with the operation codes (see
     * enumeration above).  These routines ensure that the
     * encoder/creator of a transaction gets the right data in the
     * right place. Sadly, there's no corresponding version nor any
     * form of seat belts for the decoder.
     */
    Op* _get_next_op()
    {
        if (op_ptr.length() == 0 || op_ptr.offset() >= op_ptr.length()) {
            op_ptr = bufferptr(sizeof(Op) * OPS_PER_PTR);
        }
        bufferptr ptr(op_ptr, 0, sizeof(Op));
        op_bl.append(ptr);
        op_ptr.set_offset(op_ptr.offset() + sizeof(Op));
        char* p = ptr.c_str();
        memset(p, 0, sizeof(Op));
        return reinterpret_cast<Op*>(p);
    }
    __le32 _get_coll_id(const coll_t& coll)
    {
        map<coll_t, __le32>::iterator c = coll_index.find(coll);
        if (c != coll_index.end())
            return c->second;
        __le32 index_id = coll_id++;
        coll_index[coll] = index_id;
        return index_id;
    }
    __le32 _get_object_id(const ghobject_t& oid)
    {
        map<ghobject_t, __le32>::iterator o = object_index.find(oid);
        if (o != object_index.end())
            return o->second;
        __le32 index_id = object_id++;
        object_index[oid] = index_id;
        return index_id;
    }
public:
    // The methods below encode the various operations and their parameters.
    /// noop. 'nuf said
    void nop()
    {
        Op* _op = _get_next_op();
        _op->op = OP_NOP;
        data.ops++;
    }
    /**
     * touch
     *
     * Ensure the existence of an object in a collection. Create an
     * empty object if necessary
     */
    void touch(const coll_t& cid, const ghobject_t& oid)
    {
        Op* _op = _get_next_op();
        _op->op = OP_TOUCH;
        _op->cid = _get_coll_id(cid);
        _op->oid = _get_object_id(oid);
        data.ops++;
    }
    /**
     * Write data to an offset within an object. If the object is too
     * small, it is expanded as needed.  It is possible to specify an
     * offset beyond the current end of an object and it will be
     * expanded as needed. Simple implementations of ObjectStore will
     * just zero the data between the old end of the object and the
     * newly provided data. More sophisticated implementations of
     * ObjectStore will omit the untouched data and store it as a
     * "hole" in the file.
     *
     * Note that a 0-length write does not affect the size of the object.
     */
    void write(const coll_t& cid, const ghobject_t& oid, uint64_t off, uint64_t len,
               const bufferlist& write_data, uint32_t flags = 0)
    {
        using ceph::encode;
        uint32_t orig_len = data_bl.length();
        Op* _op = _get_next_op();
        _op->op = OP_WRITE;
        _op->cid = _get_coll_id(cid);
        _op->oid = _get_object_id(oid);
        _op->off = off;
        _op->len = len;
        encode(write_data, data_bl);
        assert(len == write_data.length());
        data.fadvise_flags = data.fadvise_flags | flags;
        if (write_data.length() > data.largest_data_len) {
            data.largest_data_len = write_data.length();
            data.largest_data_off = off;
            data.largest_data_off_in_data_bl = orig_len + sizeof(__u32);  // we are about to
        }
        data.ops++;
    }
    /**
     * zero out the indicated byte range within an object. Some
     * ObjectStore instances may optimize this to release the
     * underlying storage space.
     *
     * If the zero range extends beyond the end of the object, the object
     * size is extended, just as if we were writing a buffer full of zeros.
     * EXCEPT if the length is 0, in which case (just like a 0-length write)
     * we do not adjust the object size.
     */
    void zero(const coll_t& cid, const ghobject_t& oid, uint64_t off, uint64_t len)
    {
        Op* _op = _get_next_op();
        _op->op = OP_ZERO;
        _op->cid = _get_coll_id(cid);
        _op->oid = _get_object_id(oid);
        _op->off = off;
        _op->len = len;
        data.ops++;
    }
    /// Discard all data in the object beyond the specified size.
    void truncate(const coll_t& cid, const ghobject_t& oid, uint64_t off)
    {
        Op* _op = _get_next_op();
        _op->op = OP_TRUNCATE;
        _op->cid = _get_coll_id(cid);
        _op->oid = _get_object_id(oid);
        _op->off = off;
        data.ops++;
    }
    /// Remove an object. All four parts of the object are removed.
    void remove(const coll_t& cid, const ghobject_t& oid)
    {
        Op* _op = _get_next_op();
        _op->op = OP_REMOVE;
        _op->cid = _get_coll_id(cid);
        _op->oid = _get_object_id(oid);
        data.ops++;
    }
    /// Set an xattr of an object
    void setattr(const coll_t& cid, const ghobject_t& oid, const char* name, bufferlist& val)
    {
        string n(name);
        setattr(cid, oid, n, val);
    }
    /// Set an xattr of an object
    void setattr(const coll_t& cid, const ghobject_t& oid, const string& s, bufferlist& val)
    {
        using ceph::encode;
        Op* _op = _get_next_op();
        _op->op = OP_SETATTR;
        _op->cid = _get_coll_id(cid);
        _op->oid = _get_object_id(oid);
        encode(s, data_bl);
        encode(val, data_bl);
        data.ops++;
    }
    /// Set multiple xattrs of an object
    void setattrs(const coll_t& cid, const ghobject_t& oid, const map<string,bufferptr>& attrset)
    {
        using ceph::encode;
        Op* _op = _get_next_op();
        _op->op = OP_SETATTRS;
        _op->cid = _get_coll_id(cid);
        _op->oid = _get_object_id(oid);
        encode(attrset, data_bl);
        data.ops++;
    }
    /// Set multiple xattrs of an object
    void setattrs(const coll_t& cid, const ghobject_t& oid, const map<string,bufferlist>& attrset)
    {
        using ceph::encode;
        Op* _op = _get_next_op();
        _op->op = OP_SETATTRS;
        _op->cid = _get_coll_id(cid);
        _op->oid = _get_object_id(oid);
        encode(attrset, data_bl);
        data.ops++;
    }
    /// remove an xattr from an object
    void rmattr(const coll_t& cid, const ghobject_t& oid, const char *name)
    {
        string n(name);
        rmattr(cid, oid, n);
    }
    /// remove an xattr from an object
    void rmattr(const coll_t& cid, const ghobject_t& oid, const string& s)
    {
        using ceph::encode;
        Op* _op = _get_next_op();
        _op->op = OP_RMATTR;
        _op->cid = _get_coll_id(cid);
        _op->oid = _get_object_id(oid);
        encode(s, data_bl);
        data.ops++;
    }
    /// remove all xattrs from an object
    void rmattrs(const coll_t& cid, const ghobject_t& oid)
    {
        Op* _op = _get_next_op();
        _op->op = OP_RMATTRS;
        _op->cid = _get_coll_id(cid);
        _op->oid = _get_object_id(oid);
        data.ops++;
    }
    /**
     * Clone an object into another object.
     *
     * Low-cost (e.g., O(1)) cloning (if supported) is best, but
     * fallback to an O(n) copy is allowed.  All four parts of the
     * object are cloned (data, xattrs, omap header, omap
     * entries).
     *
     * The destination named object may already exist, in
     * which case its previous contents are discarded.
     */
    void clone(const coll_t& cid, const ghobject_t& oid,
               const ghobject_t& noid)
    {
        Op* _op = _get_next_op();
        _op->op = OP_CLONE;
        _op->cid = _get_coll_id(cid);
        _op->oid = _get_object_id(oid);
        _op->dest_oid = _get_object_id(noid);
        data.ops++;
    }
    /**
     * Clone a byte range from one object to another.
     *
     * The data portion of the destination object receives a copy of a
     * portion of the data from the source object. None of the other
     * three parts of an object is copied from the source.
     *
     * The destination object size may be extended to the dstoff + len.
     *
     * The source range *must* overlap with the source object data. If it does
     * not the result is undefined.
     */
    void clone_range(const coll_t& cid, const ghobject_t& oid,
                     const ghobject_t& noid,
                     uint64_t srcoff, uint64_t srclen, uint64_t dstoff)
    {
        Op* _op = _get_next_op();
        _op->op = OP_CLONERANGE2;
        _op->cid = _get_coll_id(cid);
        _op->oid = _get_object_id(oid);
        _op->dest_oid = _get_object_id(noid);
        _op->off = srcoff;
        _op->len = srclen;
        _op->dest_off = dstoff;
        data.ops++;
    }
    /// Create the collection
    void create_collection(const coll_t& cid, int bits)
    {
        Op* _op = _get_next_op();
        _op->op = OP_MKCOLL;
        _op->cid = _get_coll_id(cid);
        _op->split_bits = bits;
        data.ops++;
    }
    /**
     * Give the collection a hint.
     *
     * @param cid  - collection id.
     * @param type - hint type.
     * @param hint - the hint payload, which contains the customized
     *               data along with the hint type.
     */
    void collection_hint(const coll_t& cid, uint32_t type, const bufferlist& hint)
    {
        using ceph::encode;
        Op* _op = _get_next_op();
        _op->op = OP_COLL_HINT;
        _op->cid = _get_coll_id(cid);
        _op->hint_type = type;
        encode(hint, data_bl);
        data.ops++;
    }
    /// remove the collection, the collection must be empty
    void remove_collection(const coll_t& cid)
    {
        Op* _op = _get_next_op();
        _op->op = OP_RMCOLL;
        _op->cid = _get_coll_id(cid);
        data.ops++;
    }
    void collection_move(const coll_t& cid, const coll_t &oldcid, const ghobject_t& oid)
    __attribute__ ((deprecated))
    {
        // NOTE: we encode this as a fixed combo of ADD + REMOVE.  they
        // always appear together, so this is effectively a single MOVE.
        Op* _op = _get_next_op();
        _op->op = OP_COLL_ADD;
        _op->cid = _get_coll_id(oldcid);
        _op->oid = _get_object_id(oid);
        _op->dest_cid = _get_coll_id(cid);
        data.ops++;
        _op = _get_next_op();
        _op->op = OP_COLL_REMOVE;
        _op->cid = _get_coll_id(oldcid);
        _op->oid = _get_object_id(oid);
        data.ops++;
    }
    void collection_move_rename(const coll_t& oldcid, const ghobject_t& oldoid,
                                const coll_t &cid, const ghobject_t& oid)
    {
        Op* _op = _get_next_op();
        _op->op = OP_COLL_MOVE_RENAME;
        _op->cid = _get_coll_id(oldcid);
        _op->oid = _get_object_id(oldoid);
        _op->dest_cid = _get_coll_id(cid);
        _op->dest_oid = _get_object_id(oid);
        data.ops++;
    }
    void try_rename(const coll_t &cid, const ghobject_t& oldoid,
                    const ghobject_t& oid)
    {
        Op* _op = _get_next_op();
        _op->op = OP_TRY_RENAME;
        _op->cid = _get_coll_id(cid);
        _op->oid = _get_object_id(oldoid);
        _op->dest_oid = _get_object_id(oid);
        data.ops++;
    }
    /// Remove omap from oid
    void omap_clear(
        const coll_t &cid,           ///< [in] Collection containing oid
        const ghobject_t &oid  ///< [in] Object from which to remove omap
    )
    {
        Op* _op = _get_next_op();
        _op->op = OP_OMAP_CLEAR;
        _op->cid = _get_coll_id(cid);
        _op->oid = _get_object_id(oid);
        data.ops++;
    }
    /// Set keys on oid omap.  Replaces duplicate keys.
    void omap_setkeys(
        const coll_t& cid,                           ///< [in] Collection containing oid
        const ghobject_t &oid,                ///< [in] Object to update
        const map<string, bufferlist> &attrset ///< [in] Replacement keys and values
    )
    {
        using ceph::encode;
        Op* _op = _get_next_op();
        _op->op = OP_OMAP_SETKEYS;
        _op->cid = _get_coll_id(cid);
        _op->oid = _get_object_id(oid);
        encode(attrset, data_bl);
        data.ops++;
    }
    /// Set keys on an oid omap (bufferlist variant).
    void omap_setkeys(
        const coll_t &cid,                           ///< [in] Collection containing oid
        const ghobject_t &oid,                ///< [in] Object to update
        const bufferlist &attrset_bl          ///< [in] Replacement keys and values
    )
    {
        Op* _op = _get_next_op();
        _op->op = OP_OMAP_SETKEYS;
        _op->cid = _get_coll_id(cid);
        _op->oid = _get_object_id(oid);
        data_bl.append(attrset_bl);
        data.ops++;
    }
    /// Remove keys from oid omap
    void omap_rmkeys(
        const coll_t &cid,             ///< [in] Collection containing oid
        const ghobject_t &oid,  ///< [in] Object from which to remove the omap
        const set<string> &keys ///< [in] Keys to clear
    )
    {
        using ceph::encode;
        Op* _op = _get_next_op();
        _op->op = OP_OMAP_RMKEYS;
        _op->cid = _get_coll_id(cid);
        _op->oid = _get_object_id(oid);
        encode(keys, data_bl);
        data.ops++;
    }
    /// Remove keys from oid omap (bufferlist variant).
    void omap_rmkeys(
        const coll_t &cid,             ///< [in] Collection containing oid
        const ghobject_t &oid,  ///< [in] Object from which to remove the omap
        const bufferlist &keys_bl ///< [in] Keys to clear
    )
    {
        Op* _op = _get_next_op();
        _op->op = OP_OMAP_RMKEYS;
        _op->cid = _get_coll_id(cid);
        _op->oid = _get_object_id(oid);
        data_bl.append(keys_bl);
        data.ops++;
    }
    /// Remove key range from oid omap
    void omap_rmkeyrange(
        const coll_t &cid,             ///< [in] Collection containing oid
        const ghobject_t &oid,  ///< [in] Object from which to remove the omap keys
        const string& first,    ///< [in] first key in range
        const string& last      ///< [in] first key past range, range is [first,last)
    )
    {
        using ceph::encode;
        Op* _op = _get_next_op();
        _op->op = OP_OMAP_RMKEYRANGE;
        _op->cid = _get_coll_id(cid);
        _op->oid = _get_object_id(oid);
        encode(first, data_bl);
        encode(last, data_bl);
        data.ops++;
    }
    /// Set omap header
    void omap_setheader(
        const coll_t &cid,             ///< [in] Collection containing oid
        const ghobject_t &oid,  ///< [in] Object
        const bufferlist &bl    ///< [in] Header value
    )
    {
        using ceph::encode;
        Op* _op = _get_next_op();
        _op->op = OP_OMAP_SETHEADER;
        _op->cid = _get_coll_id(cid);
        _op->oid = _get_object_id(oid);
        encode(bl, data_bl);
        data.ops++;
    }
    /// Split collection based on given prefixes; objects matching the specified
    /// bits/rem are moved to the new collection.
    void split_collection(
        const coll_t &cid,
        uint32_t bits,
        uint32_t rem,
        const coll_t &destination)
    {
        Op* _op = _get_next_op();
        _op->op = OP_SPLIT_COLLECTION2;
        _op->cid = _get_coll_id(cid);
        _op->dest_cid = _get_coll_id(destination);
        _op->split_bits = bits;
        _op->split_rem = rem;
        data.ops++;
    }
    void collection_set_bits(
        const coll_t &cid,
        int bits)
    {
        Op* _op = _get_next_op();
        _op->op = OP_COLL_SET_BITS;
        _op->cid = _get_coll_id(cid);
        _op->split_bits = bits;
        data.ops++;
    }
    /// Set allocation hint for an object
    /// Implementations should treat 0 values (expected_object_size, expected_write_size) as no-ops.
    void set_alloc_hint(
        const coll_t &cid,
        const ghobject_t &oid,
        uint64_t expected_object_size,
        uint64_t expected_write_size,
        uint32_t flags
    )
    {
        Op* _op = _get_next_op();
        _op->op = OP_SETALLOCHINT;
        _op->cid = _get_coll_id(cid);
        _op->oid = _get_object_id(oid);
        _op->expected_object_size = expected_object_size;
        _op->expected_write_size = expected_write_size;
        _op->alloc_hint_flags = flags;
        data.ops++;
    }
};
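
Every Transaction method above follows the same recording pattern: take the next Op slot, fill in the opcode, translate the collection and object names into small integer ids, and count the op. Nothing is executed at this point. The following is a minimal sketch of that pattern, not the real Ceph code. ToyTransaction, ToyOp, and intern() are hypothetical stand-ins, and the assumption that _get_coll_id/_get_object_id hand out one index per distinct name is mine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical opcodes; the real values live in ObjectStore::Transaction.
enum ToyOpCode : uint32_t { TOY_OP_CLONE = 1, TOY_OP_RMCOLL = 2 };

// Fixed-size op record, loosely modelled on Transaction::Op.
struct ToyOp {
    uint32_t op = 0;
    uint32_t cid = 0;       // index into the collection table
    uint32_t oid = 0;       // index into the object table
    uint32_t dest_oid = 0;  // index of the destination object, if any
};

class ToyTransaction {
public:
    // Record a clone of oid into noid within collection cid.
    void clone(const std::string& cid, const std::string& oid,
               const std::string& noid) {
        ToyOp& op = next_op();
        op.op = TOY_OP_CLONE;
        op.cid = intern(colls_, cid);
        op.oid = intern(objects_, oid);
        op.dest_oid = intern(objects_, noid);
    }
    // Record the removal of collection cid.
    void remove_collection(const std::string& cid) {
        ToyOp& op = next_op();
        op.op = TOY_OP_RMCOLL;
        op.cid = intern(colls_, cid);
    }
    std::size_t op_count() const { return ops_.size(); }
    const std::vector<ToyOp>& ops() const { return ops_; }

private:
    // Append an empty op record and return it for the caller to fill in.
    ToyOp& next_op() {
        ops_.emplace_back();
        return ops_.back();
    }
    // Return the index of name in table, adding it on first use.
    static uint32_t intern(std::vector<std::string>& table,
                           const std::string& name) {
        for (uint32_t i = 0; i < table.size(); ++i) {
            if (table[i] == name) return i;
        }
        table.push_back(name);
        return static_cast<uint32_t>(table.size() - 1);
    }
    std::vector<ToyOp> ops_;
    std::vector<std::string> colls_;
    std::vector<std::string> objects_;
};
```

In this sketch, recording two ops against the same collection stores the collection name once, and both ops carry the same small index. This keeps each op record at a fixed size.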

Queueing transactions

int queue_transaction(CollectionHandle& ch,
                      Transaction&& t,
                      TrackedOpRef op = TrackedOpRef(),
                      ThreadPool::TPHandle *handle = NULL)
{
    vector<Transaction> tls;
    tls.push_back(std::move(t));
    return queue_transactions(ch, tls, op, handle);
}
virtual int queue_transactions(
    CollectionHandle& ch, vector<Transaction>& tls,
    TrackedOpRef op = TrackedOpRef(),
    ThreadPool::TPHandle *handle = NULL) = 0;
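
The single-transaction overload is only a convenience wrapper: it moves the transaction into a one-element vector and forwards to the pure virtual queue_transactions, which each backend implements. The following is a minimal sketch of that shape under simplified assumptions. ToyTxn, ToyStore, and CountingStore are hypothetical, and the collection handle, TrackedOpRef, and TPHandle arguments are left out.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical transaction: just a list of op names.
struct ToyTxn {
    std::vector<std::string> ops;
};

class ToyStore {
public:
    virtual ~ToyStore() = default;

    // Convenience wrapper: wrap one transaction and forward to the batch call.
    int queue_transaction(ToyTxn&& t) {
        std::vector<ToyTxn> tls;
        tls.push_back(std::move(t));
        return queue_transactions(tls);
    }

    // Backends only need to implement the batch variant.
    virtual int queue_transactions(std::vector<ToyTxn>& tls) = 0;
};

// Hypothetical backend that only counts the ops it was handed.
class CountingStore : public ToyStore {
public:
    int applied_ops = 0;

    int queue_transactions(std::vector<ToyTxn>& tls) override {
        for (const ToyTxn& t : tls) {
            applied_ops += static_cast<int>(t.ops.size());
        }
        return 0;
    }
};
```

Taking the transaction by rvalue reference lets the wrapper move it into the vector instead of copying the encoded payload.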

public:
    // versioning
    virtual int upgrade() {
        return 0;
    }
    virtual void get_db_statistics(Formatter *f) { }
    virtual void generate_db_histogram(Formatter *f) { }
    virtual void flush_cache() { }
    virtual void dump_perf_counters(Formatter *f) {}
    virtual string get_type() = 0;
    // mgmt
    virtual bool test_mount_in_use() = 0;
    virtual int mount() = 0;
    virtual int umount() = 0;
    virtual int fsck(bool deep)
    {
        return -EOPNOTSUPP;
    }
    virtual int repair(bool deep)
    {
        return -EOPNOTSUPP;
    }
    virtual void set_cache_shards(unsigned num) { }
    /**
     * Returns 0 if the hobject is valid, -error otherwise
     *
     * Errors:
     * -ENAMETOOLONG: locator/namespace/name too large
     */
    virtual int validate_hobject_key(const hobject_t &obj) const = 0;
    virtual unsigned get_max_attr_name_length() = 0;
    virtual int mkfs() = 0;  // wipe
    virtual int mkjournal() = 0; // journal only
    virtual bool needs_journal() = 0;  ///< requires a journal
    virtual bool wants_journal() = 0;  ///< prefers a journal
    virtual bool allows_journal() = 0; ///< allows a journal
    /// enumerate hardware devices (by 'devname', e.g., 'sda' as in /sys/block/sda)
    virtual int get_devices(std::set<string> *devls)
    {
        return -EOPNOTSUPP;
    }
    /// true if a txn is readable immediately after it is queued.
    virtual bool is_sync_onreadable() const
    {
        return true;
    }
    /**
     * is_rotational
     *
     * Check whether store is backed by a rotational (HDD) or non-rotational
     * (SSD) device.
     *
     * This must be usable *before* the store is mounted.
     *
     * @return true for HDD, false for SSD
     */
    virtual bool is_rotational()
    {
        return true;
    }
    /**
     * is_journal_rotational
     *
     * Check whether journal is backed by a rotational (HDD) or non-rotational
     * (SSD) device.
     *
     *
     * @return true for HDD, false for SSD
     */
    virtual bool is_journal_rotational()
    {
        return true;
    }
    virtual string get_default_device_class()
    {
        return is_rotational() ? "hdd" : "ssd";
    }
    virtual bool can_sort_nibblewise()
    {
        return false;   // assume a backend cannot, unless it says otherwise
    }
    virtual int statfs(struct store_statfs_t *buf) = 0;
    virtual void collect_metadata(map<string,string> *pm) { }
    /**
     * write_meta - write a simple configuration key out-of-band
     *
     * Write a simple key/value pair for basic store configuration
     * (e.g., a uuid or magic number) to an unopened/unmounted store.
     * The default implementation writes this to a plaintext file in the
     * path.
     *
     * A newline is appended.
     *
     * @param key key name (e.g., "fsid")
     * @param value value (e.g., a uuid rendered as a string)
     * @returns 0 for success, or an error code
     */
    virtual int write_meta(const std::string& key,
                           const std::string& value);
    /**
     * read_meta - read a simple configuration key out-of-band
     *
     * Read a simple key value from an unopened/mounted store.
     *
     * Trailing whitespace is stripped off.
     *
     * @param key key name
     * @param value pointer to value string
     * @returns 0 for success, or an error code
     */
    virtual int read_meta(const std::string& key,
                          std::string *value);
    /**
     * get ideal max value for collection_list()
     *
     * default to some arbitrary values; the implementation will override.
     */
    virtual int get_ideal_list_max()
    {
        return 64;
    }
    /**
     * get a collection handle
     *
     * Provide a trivial handle as a default to avoid converting legacy
     * implementations.
     */
    virtual CollectionHandle open_collection(const coll_t &cid) = 0;
    /**
     * get a collection handle for a soon-to-be-created collection
     *
     * This handle must be used by queue_transaction that includes a
     * create_collection call in order to become valid.  It will become the
     * reference to the created collection.
     */
    virtual CollectionHandle create_new_collection(const coll_t &cid) = 0;
    /**
     * Synchronous read operations
     */
    /**
     * exists -- Test for existence of object
     *
     * @param cid collection for object
     * @param oid oid of object
     * @returns true if object exists, false otherwise
     */
    virtual bool exists(CollectionHandle& c, const ghobject_t& oid) = 0;
    /**
     * set_collection_opts -- set pool options for a collection
     *
     * @param cid collection
     * @param opts new collection options
     * @returns 0 on success, negative error code on failure.
     */
    virtual int set_collection_opts(
        CollectionHandle& c,
        const pool_opts_t& opts) = 0;
    /**
     * stat -- get information for an object
     *
     * @param cid collection for object
     * @param oid oid of object
     * @param st output information for the object
     * @param allow_eio if false, assert on -EIO operation failure
     * @returns 0 on success, negative error code on failure.
     */
    virtual int stat(
        CollectionHandle &c,
        const ghobject_t& oid,
        struct stat *st,
        bool allow_eio = false) = 0;
    /**
     * read -- read a byte range of data from an object
     *
     * Note: if reading from an offset past the end of the object, we
     * return 0 (not, say, -EINVAL).
     *
     * @param cid collection for object
     * @param oid oid of object
     * @param offset location offset of first byte to be read
     * @param len number of bytes to be read
     * @param bl output bufferlist
     * @param op_flags is CEPH_OSD_OP_FLAG_*
     * @returns number of bytes read on success, or negative error code on failure.
     */
    virtual int read(
        CollectionHandle &c,
        const ghobject_t& oid,
        uint64_t offset,
        size_t len,
        bufferlist& bl,
        uint32_t op_flags = 0) = 0;
    /**
     * fiemap -- get extent map of data of an object
     *
     * Returns an encoded map of the extents of an object's data portion
     * (map<offset,size>).
     *
     * A non-enlightened implementation is free to return the extent (offset, len)
     * as the sole extent.
     *
     * @param cid collection for object
     * @param oid oid of object
     * @param offset location offset of first byte to be read
     * @param len number of bytes to be read
     * @param bl output bufferlist for extent map information.
     * @returns 0 on success, negative error code on failure.
     */
    virtual int fiemap(CollectionHandle& c, const ghobject_t& oid,
                       uint64_t offset, size_t len, bufferlist& bl) = 0;
    virtual int fiemap(CollectionHandle& c, const ghobject_t& oid,
                       uint64_t offset, size_t len, map<uint64_t, uint64_t>& destmap) = 0;
    /**
     * getattr -- get an xattr of an object
     *
     * @param cid collection for object
     * @param oid oid of object
     * @param name name of attr to read
     * @param value place to put output result.
     * @returns 0 on success, negative error code on failure.
     */
    virtual int getattr(CollectionHandle &c, const ghobject_t& oid,
                        const char *name, bufferptr& value) = 0;
    /**
     * getattr -- get an xattr of an object (string/bufferlist convenience overload)
     *
     * @param cid collection for object
     * @param oid oid of object
     * @param name name of attr to read
     * @param value place to put output result.
     * @returns 0 on success, negative error code on failure.
     */
    int getattr(
        CollectionHandle &c, const ghobject_t& oid,
        const string& name, bufferlist& value)
    {
        bufferptr bp;
        int r = getattr(c, oid, name.c_str(), bp);
        value.push_back(bp);
        return r;
    }
    /**
     * getattrs -- get all of the xattrs of an object
     *
     * @param cid collection for object
     * @param oid oid of object
     * @param aset place to put output result.
     * @returns 0 on success, negative error code on failure.
     */
    virtual int getattrs(CollectionHandle &c, const ghobject_t& oid,
                         map<string,bufferptr>& aset) = 0;
    /**
     * getattrs -- get all of the xattrs of an object (bufferlist convenience overload)
     *
     * @param cid collection for object
     * @param oid oid of object
     * @param aset place to put output result.
     * @returns 0 on success, negative error code on failure.
     */
    int getattrs(CollectionHandle &c, const ghobject_t& oid,
                 map<string,bufferlist>& aset)
    {
        map<string,bufferptr> bmap;
        int r = getattrs(c, oid, bmap);
        for (map<string,bufferptr>::iterator i = bmap.begin();
             i != bmap.end();
             ++i) {
            aset[i->first].append(i->second);
        }
        return r;
    }
    // collections
    /**
     * list_collections -- get all of the collections known to this ObjectStore
     *
     * @param ls list of the collections in sorted order.
     * @returns 0 on success, negative error code on failure.
     */
    virtual int list_collections(vector<coll_t>& ls) = 0;
    /**
     * does a collection exist?
     *
     * @param c collection
     * @returns true if it exists, false otherwise
     */
    virtual bool collection_exists(const coll_t& c) = 0;
    /**
     * is a collection empty?
     *
     * @param c collection
     * @param empty true if the specified collection is empty, false otherwise
     * @returns 0 on success, negative error code on failure.
     */
    virtual int collection_empty(CollectionHandle& c, bool *empty) = 0;
    /**
     * return the number of significant bits of the coll_t::pgid.
     *
     * This should return what the last create_collection or split_collection
     * set.  A legacy backend may return -EAGAIN if the value is unavailable
     * (because we upgraded from an older version, e.g., FileStore).
     */
    virtual int collection_bits(CollectionHandle& c) = 0;
    /**
     * list the contents of a collection that fall in the range [start, end),
     * returning no more than a specified number of results
     *
     * @param c collection
     * @param start list objects that sort >= this value
     * @param end list objects that sort < this value
     * @param max return no more than this many results
     * @param seq return no objects with snap < seq
     * @param ls [out] result
     * @param next [out] next item sorts >= this value
     * @return zero on success, or negative error
     */
    virtual int collection_list(CollectionHandle &c,
                                const ghobject_t& start, const ghobject_t& end,
                                int max,
                                vector<ghobject_t> *ls, ghobject_t *next) = 0;
    /// OMAP
    /// Get omap contents
    virtual int omap_get(
        CollectionHandle &c,     ///< [in] Collection containing oid
        const ghobject_t &oid,   ///< [in] Object containing omap
        bufferlist *header,      ///< [out] omap header
        map<string, bufferlist> *out ///< [out] Key to value map
    ) = 0;
    /// Get omap header
    virtual int omap_get_header(
        CollectionHandle &c,     ///< [in] Collection containing oid
        const ghobject_t &oid,   ///< [in] Object containing omap
        bufferlist *header,      ///< [out] omap header
        bool allow_eio = false ///< [in] don't assert on eio
    ) = 0;
    /// Get keys defined on oid
    virtual int omap_get_keys(
        CollectionHandle &c,   ///< [in] Collection containing oid
        const ghobject_t &oid, ///< [in] Object containing omap
        set<string> *keys      ///< [out] Keys defined on oid
    ) = 0;
    /// Get key values
    virtual int omap_get_values(
        CollectionHandle &c,         ///< [in] Collection containing oid
        const ghobject_t &oid,       ///< [in] Object containing omap
        const set<string> &keys,     ///< [in] Keys to get
        map<string, bufferlist> *out ///< [out] Returned keys and values
    ) = 0;
    /// Filters keys into out which are defined on oid
    virtual int omap_check_keys(
        CollectionHandle &c,     ///< [in] Collection containing oid
        const ghobject_t &oid,   ///< [in] Object containing omap
        const set<string> &keys, ///< [in] Keys to check
        set<string> *out         ///< [out] Subset of keys defined on oid
    ) = 0;
    /**
     * Returns an object map iterator
     *
     * Warning!  The returned iterator is an implicit lock on filestore
     * operations in c.  Do not use filestore methods on c while the returned
     * iterator is live.  (Filling in a transaction is no problem).
     *
     * @return iterator, null on error
     */
    virtual ObjectMap::ObjectMapIterator get_omap_iterator(
        CollectionHandle &c,   ///< [in] collection
        const ghobject_t &oid  ///< [in] object
    ) = 0;
    virtual int flush_journal() {
        return -EOPNOTSUPP;
    }
    virtual int dump_journal(ostream& out) {
        return -EOPNOTSUPP;
    }
    virtual int snapshot(const string& name) {
        return -EOPNOTSUPP;
    }
    /**
     * Set and get internal fsid for this instance. No external data is modified
     */
    virtual void set_fsid(uuid_d u) = 0;
    virtual uuid_d get_fsid() = 0;
    /**
     * Estimates the additional disk space used by the specified number of objects
     * due to file allocation granularity and the metadata store.
     * - num_objects - total object count (including whiteouts) to measure used space for.
     */
    virtual uint64_t estimate_objects_overhead(uint64_t num_objects) = 0;
    virtual void compact() {}
    virtual bool has_builtin_csum() const
    {
        return false;
    }
};
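
The collection_list contract above (half-open range [start, end), at most max results, and a next cursor for the following call) is easiest to see as a paging loop. The following is a minimal sketch over a sorted std::set of strings, not the real interface. toy_collection_list and toy_list_all are hypothetical names, and setting next to end once the range is exhausted is an assumption of this sketch.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for collection_list over a sorted set of names.
// Appends up to max entries from [start, end) to *ls and stores in *next
// the first entry not yet returned, or end when the range is exhausted.
int toy_collection_list(const std::set<std::string>& coll,
                        const std::string& start, const std::string& end,
                        int max,
                        std::vector<std::string>* ls, std::string* next)
{
    auto it = coll.lower_bound(start);
    while (it != coll.end() && *it < end &&
           static_cast<int>(ls->size()) < max) {
        ls->push_back(*it);
        ++it;
    }
    *next = (it != coll.end() && *it < end) ? *it : end;
    return 0;
}

// Caller-side paging: keep listing from the returned cursor until it
// reaches end.
std::vector<std::string> toy_list_all(const std::set<std::string>& coll,
                                      const std::string& start,
                                      const std::string& end,
                                      int page)
{
    std::vector<std::string> all;
    if (page <= 0) return all;  // a non-positive page size would never advance
    std::string cursor = start;
    while (cursor < end) {
        std::vector<std::string> batch;
        std::string next;
        toy_collection_list(coll, cursor, end, page, &batch, &next);
        all.insert(all.end(), batch.begin(), batch.end());
        cursor = next;
    }
    return all;
}
```

Because next always sorts after the last returned entry, each call makes progress, so the caller does not need to track an offset or count.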