Hive中的ObjectInspector設計

ObjectInspector是Hive中一個咋一看比較令人困惑的概念，當初讀Hive源代碼時，花了很長時間才理解。當讀懂之后，發(fā)現(xiàn)ObjectInspector作用相當大，它解耦了數(shù)據(jù)使用和數(shù)據(jù)格式，從而提高了代碼的復用程度。簡單的說，ObjectInspector接口使得Hive可以不拘泥于一種特定數(shù)據(jù)格式，使得數(shù)據(jù)流 1）在輸入端和輸出端切換不同的輸入/輸出格式 2）在不同的Operator上使用不同的數(shù)據(jù)格式。

創(chuàng)新互聯(lián)致力于互聯(lián)網(wǎng)品牌建設與網(wǎng)絡營銷，包括成都做網(wǎng)站、網(wǎng)站建設、SEO優(yōu)化、網(wǎng)絡推廣、整站優(yōu)化營銷策劃推廣、電子商務、移動互聯(lián)網(wǎng)營銷等。創(chuàng)新互聯(lián)為不同類型的客戶提供良好的互聯(lián)網(wǎng)應用定制及解決方案，創(chuàng)新互聯(lián)核心團隊十年專注互聯(lián)網(wǎng)開發(fā)，積累了豐富的網(wǎng)站經(jīng)驗，為廣大企業(yè)客戶提供一站式企業(yè)網(wǎng)站建設服務，在網(wǎng)站建設行業(yè)內(nèi)樹立了良好口碑。

這是ObjectInspector interface
public interface ObjectInspector extends Cloneable {
public static enum Category {
    PRIMITIVE, LIST, MAP, STRUCT, UNION
};

String getTypeName();

Category getCategory();
}

這個interface提供了最一般的方法 getTypeName 和 getCategory。我們再來看它的子抽象類和interface：
StructObjectInspector
MapObjectInspector
ListObjectInspector
PrimitiveObjectInspector
UnionObjectInspector

其中，PrimitiveObjectInspector用來完成對基本數(shù)據(jù)類型的解析，而StructObjectInspector用了完成對一行數(shù)據(jù)的解析，它本身有一組ObjectInspector組成。由于Hive支持Nested Data Structure，所以，在StructObjectInspector中又可以（一層或多層的）嵌套任意的ObjectInspector。 Struct, Map, List, Union是Hive支持的4種集合數(shù)據(jù)類型，比如某一列的數(shù)據(jù)可以被聲明為Struct類型，這樣解析這一列的StructObjectInspector中就會嵌套了另一個StructObjectInspector。

現(xiàn)在我們可以從一個小例子看看ObjectInspector是如何工作的，這是一個Hive SerDe的測試用例代碼：

/**
   * Test the LazySimpleSerDe class.
   */
public void testLazySimpleSerDe() throws Throwable {
    try {
      // Create the SerDe
      LazySimpleSerDe serDe = new LazySimpleSerDe();
      Configuration conf = new Configuration();
      Properties tbl = createProperties();
      //用Properties初始化serDe
      serDe.initialize(conf, tbl);

      // Data
      Text t = new Text("123\t456\t789\t1000\t5.3\thive and hadoop\t1.\tNULL");
      String s = "123\t456\t789\t1000\t5.3\thive and hadoop\tNULL\tNULL";
      Object[] expectedFieldsData = {new ByteWritable((byte) 123),
          new ShortWritable((short) 456), new IntWritable(789),
          new LongWritable(1000), new DoubleWritable(5.3),
          new Text("hive and hadoop"), null, null};

      // Test
      deserializeAndSerialize(serDe, t, s, expectedFieldsData);
    } catch (Throwable e) {
      e.printStackTrace();
      throw e;
    }
}

   private void deserializeAndSerialize(LazySimpleSerDe serDe, Text t, String s,
      Object[] expectedFieldsData) throws SerDeException {
    // Get the row ObjectInspector
    StructObjectInspector oi = (StructObjectInspector) serDe
        .getObjectInspector();
    // 獲取列信息
    List<? extends StructField> fieldRefs = oi.getAllStructFieldRefs();
    assertEquals(8, fieldRefs.size());

    // Deserialize
    Object row = serDe.deserialize(t);
    for (int i = 0; i < fieldRefs.size(); i++) {
      Object fieldData = oi.getStructFieldData(row, fieldRefs.get(i));
      if (fieldData != null) {
        fieldData = ((LazyPrimitive) fieldData).getWritableObject();
      }
      assertEquals("Field " + i, expectedFieldsData[i], fieldData);
    }
    // Serialize
    assertEquals(Text.class, serDe.getSerializedClass());
    Text serializedText = (Text) serDe.serialize(row, oi);
    assertEquals("Serialized data", s, serializedText.toString());
}

//創(chuàng)建schema，保存在Properties中
private Properties createProperties() {
    Properties tbl = new Properties();

    // Set the configuration parameters
    tbl.setProperty(Constants.SERIALIZATION_FORMAT, "9");
    tbl.setProperty("columns",
        "abyte,ashort,aint,along,adouble,astring,anullint,anullstring");
    tbl.setProperty("columns.types",
        "tinyint:smallint:int:bigint:double:string:int:string");
    tbl.setProperty(Constants.SERIALIZATION_NULL_FORMAT, "NULL");
    return tbl;
}

從這個例子中，不難出，Hive將對行中列的讀取和行的存儲方式解耦和了，只有ObjectInspector清楚行和行中的列是怎樣存取的，但使用者并不知道存儲的細節(jié)。對于數(shù)據(jù)的使用者來說，只需要行的Object和相應的ObjectInspector，就能讀取出每一列的對象。

這段代碼再清晰不過了，ObjectInspector oi控制了對列的Access
for (int i = 0; i < fieldRefs.size(); i++) {
      Object fieldData = oi.getStructFieldData(row, fieldRefs.get(i));
      if (fieldData != null) {
        fieldData = ((LazyPrimitive) fieldData).getWritableObject();
      }
      assertEquals("Field " + i, expectedFieldsData[i], fieldData);
}

這段代碼的作用是把一行deserialize，然后再serialize
    Object row = serDe.deserialize(t);
    Text serializedText = (Text) serDe.serialize(row, oi);
由此不難看出，只要有了不同的SerDe對象，可以很容易的將一條數(shù)據(jù)deserialize，然后再serialize成不同的格式，從而非常方便的實現(xiàn)數(shù)據(jù)格式的切換。

理解了上面的例子，就不難理解為什么所有的Hive ExprNodeEvaluator 和 UDF，UDAF, UDTF 都需要 (Object, ObjectInspector) pair了。數(shù)據(jù)存儲細節(jié)和使用的分離，使得Hive不需要針對不同的數(shù)據(jù)格式對同一個UDF, UDAF 或UDTF實現(xiàn)不同的版本，這些函數(shù)看到的只是WritableObject！

下面是表達式evaluator的interface：
/**
* ExprNodeEvaluator.
*
*/
public abstract class ExprNodeEvaluator {

/**
   * Initialize should be called once and only once. Return the ObjectInspector
   * for the return value, given the rowInspector.
   */
public abstract ObjectInspector initialize(ObjectInspector rowInspector) throws HiveException;

/**
   * Evaluate the expression given the row. This method should use the
   * rowInspector passed in from initialize to inspect the row object. The
   * return value will be inspected by the return value of initialize.
   */
public abstract Object evaluate(Object row) throws HiveException;

}

initialize中需要初始化ObjectInspector，返回輸出數(shù)據(jù)的ObjectInspector（它負責解析evaluate method返回的對象）；而每次evaluate call傳進來一條Object數(shù)據(jù)，它的解析由ObjectInspector負責。

接下來是GenericUDF抽象類：
public abstract class GenericUDF {

/**
   * A Defered Object allows us to do lazy-evaluation and short-circuiting.
   * GenericUDF use DeferedObject to pass arguments.
   */
public static interface DeferredObject {
    Object get() throws HiveException;
};

/**
   * The constructor.
   */
public GenericUDF() {
}

/**
   * Initialize this GenericUDF. This will be called once and only once per
   * GenericUDF instance.
   *
   * @param arguments
   *          The ObjectInspector for the arguments
   * @throws UDFArgumentException
   *           Thrown when arguments have wrong types, wrong length, etc.
   * @return The ObjectInspector for the return value
   */
public abstract ObjectInspector initialize(ObjectInspector[] arguments)
      throws UDFArgumentException;

/**
   * Evaluate the GenericUDF with the arguments.
   *
   * @param arguments
   *          The arguments as DeferedObject, use DeferedObject.get() to get the
   *          actual argument Object. The Objects can be inspected by the
   *          ObjectInspectors passed in the initialize call.
   * @return The
   */
public abstract Object evaluate(DeferredObject[] arguments)
      throws HiveException;

/**
   * Get the String to be displayed in explain.
   */
public abstract String getDisplayString(String[] children);

}

它的機制與evaluator非常類似，初始化中敲定ObjectInspector數(shù)組，它們負責解析輸入，返回output數(shù)據(jù)(即evaluator method返回的Object)的ObjectInspector；每次evaluate call傳進一個Object數(shù)組，返回一條數(shù)據(jù)。

Hive支持LazySimple, LazyBinary，Thrift等不同的數(shù)據(jù)格式，同一個查詢計劃中，可以在operator上切換數(shù)據(jù)流的格式。比較常見的是在Mapper端使用LazySimpleSerDe，Mapper輸出的數(shù)據(jù)使用LazyBinarySerDe，因為binary格式比較節(jié)省空間，從而減少repartition時的網(wǎng)絡傳輸。如果你想看查詢計劃的每一步到底使用了哪一種SerDe格式，只要用"Explain Extended"就可以查清楚了。

當前標題：Hive中的ObjectInspector設計
網(wǎng)址分享：http://www.muchs.cn/article12/pieidc.html

成都網(wǎng)站建設公司_創(chuàng)新互聯(lián)，為您提供App開發(fā)、域名注冊、微信公眾號、企業(yè)建站、微信小程序、云服務器

聲明：本網(wǎng)站發(fā)布的內(nèi)容（圖片、視頻和文字）以用戶投稿、用戶轉(zhuǎn)載內(nèi)容為主，如果涉及侵權請盡快告知，我們將會在第一時間刪除。文章觀點不代表本網(wǎng)站立場，如需處理請聯(lián)系客服。電話：028-86922220；郵箱：631063699@qq.com。內(nèi)容未經(jīng)允許不得轉(zhuǎn)載，或轉(zhuǎn)載時需注明來源：創(chuàng)新互聯(lián)

猜你還喜歡下面的內(nèi)容