Hive Tutorial 閱讀記錄

Hive Tutorial

1、Concepts

1.1、What Is Hive

Hive is a data warehousing infrastructure based on Apache Hadoop. Hadoop provides massive scale out and fault tolerance capabilities for data storage and processing on commodity hardware.

Hive是一個基於 Apache Hadoop 的數據倉庫基礎設施。Hadoop 為在商業硬體上存儲和處理數據提供了大規模的向外擴展和容錯能力。

Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides SQL which enables users to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive’s SQL gives users multiple places to integrate their own functionality to do custom analysis, such as User Defined Functions (UDFs).

Hive 可以方便地對大量數據進行匯總、ad-hoc 查詢和分析。

它提供了 SQL,使用戶能夠輕鬆地進行 ad-hoc 查詢、匯總和數據分析。同時,Hive 的 SQL 給用戶提供了多個地方來集成他們自己的功能來做自定義分析,比如用戶定義函數(UDFs)。

1.2、What Hive Is NOT

Hive is not designed for online transaction processing. It is best used for traditional data warehousing tasks.

Hive 不是為在線事務處理而設計的。它最適合用於傳統的數據倉庫任務。

1.3、Getting Started

For details on setting up Hive, HiveServer2, and Beeline, please refer to the GettingStarted guide.

關於 Hive、HiveServer2 和 Beeline 的詳細設置,請參考 GettingStarted guide。

Books about Hive lists some books that may also be helpful for getting started with Hive.

關於 Hive 的書中列出了一些可能對開始使用 Hive 有幫助的書。

In the following sections we provide a tutorial on the capabilities of the system. We start by describing the concepts of data types, tables, and partitions (which are very similar to what you would find in a traditional relational DBMS) and then illustrate the capabilities of Hive with the help of some examples.

在下面的章節中,我們將提供關於該系統功能的教程。我們首先描述數據類型、表和分區的概念(與傳統關係 DBMS 非常相似),然後通過一些例子說明 Hive 的功能。

1.4、Data Units

In the order of granularity – Hive data is organized into:

按照粒度從粗到細的順序,Hive 數據被組織為:

Databases: Namespaces function to avoid naming conflicts for tables, views, partitions, columns, and so on. Databases can also be used to enforce security for a user or group of users.

  • 資料庫:命名空間,其作用是避免表、視圖、分區、列等的命名衝突。資料庫還可以用於為一個用戶或一組用戶實施安全控制。
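A minimal sketch of using a database as a namespace (the database name demo_db below is only an illustrative assumption, not part of the original tutorial):

CREATE DATABASE IF NOT EXISTS demo_db
COMMENT 'Namespace for the page view examples';

USE demo_db;      -- subsequent statements resolve table names inside demo_db
SHOW TABLES;      -- lists only the tables of the current database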

Tables: Homogeneous units of data which have the same schema. An example of a table could be page_views table, where each row could comprise of the following columns (schema):

timestamp—which is of INT type that corresponds to a UNIX timestamp of when the page was viewed.
userid —which is of BIGINT type that identifies the user who viewed the page.
page_url—which is of STRING type that captures the location of the page.
referer_url—which is of STRING type that captures the location of the page from where the user arrived at the current page.
IP—which is of STRING type that captures the IP address from where the page request was made.

  • 表:具有相同 schema 的同種數據單元。一個表的示例就是 page_views 表,表中每行都由下面的列(schema)組成:

    • timestamp:INT 類型,對應頁面被瀏覽時間的 UNIX 時間戳
    • userid:BIGINT 類型,標識瀏覽該頁面的用戶
    • page_url:STRING 類型,記錄頁面的位置
    • referer_url:STRING 類型,記錄用戶跳轉到當前頁面之前所在頁面的位置
    • IP:STRING 類型,記錄發出頁面請求的 IP 地址

Partitions: Each Table can have one or more partition Keys which determines how the data is stored. Partitions—apart from being storage units—also allow the user to efficiently identify the rows that satisfy a specified criteria; for example, a date_partition of type STRING and country_partition of type STRING. Each unique value of the partition keys defines a partition of the Table. For example, all “US” data from “2009-12-23” is a partition of the page_views table. Therefore, if you run analysis on only the “US” data for 2009-12-23, you can run that query only on the relevant partition of the table, thereby speeding up the analysis significantly. Note however, that just because a partition is named 2009-12-23 does not mean that it contains all or only data from that date; partitions are named after dates for convenience; it is the user’s job to guarantee the relationship between partition name and data content! Partition columns are virtual columns, they are not part of the data itself but are derived on load.

  • 分區:

每個表有一個或多個分區 key ,這些 key 決定了數據如何被存儲。

除了作為存儲單元之外,分區還允許用戶有效地標識滿足指定條件的行。例如,STRING 類型的 date_partition 和 STRING 類型的 country_partition。

分區 key 的每個唯一值定義表的一個分區。例如,"2009-12-23" 這一天的所有 "US" 數據就是 page_views 表的一個分區。

因此,如果只對 "US" 在 2009-12-23 這一天的數據做分析,你可以只在表的相關分區上執行查詢,從而顯著提高分析效率。

然而要注意,一個分區被命名為 2009-12-23,並不意味著它包含該日期的全部數據,或僅包含該日期的數據;以日期命名分區只是為了方便。

保證分區名和數據內容之間的關係是用戶的工作!分區列是虛擬列,它們不是數據本身的一部分,而是在載入時派生的。

Buckets (or Clusters): Data in each partition may in turn be divided into Buckets based on the value of a hash function of some column of the Table. For example the page_views table may be bucketed by userid, which is one of the columns, other than the partitions columns, of the page_view table. These can be used to efficiently sample the data.

  • 分桶:

通過計算表的某些列的 hash 值,分區中的數據再被劃分到桶中。這可以被用來高效地抽樣數據。

例如,page_views 表可以根據 userid 分桶,userid 是 page_view 表中除分區列之外的一個列。

Note that it is not necessary for tables to be partitioned or bucketed, but these abstractions allow the system to prune large quantities of data during query processing, resulting in faster query execution.

分區或分桶並不是必要的。但是這些抽象允許系統在查詢處理期間刪除大量數據,從而加快查詢的執行。
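As a rough illustration of the pruning described above, a query that restricts the partition columns only needs to read the matching partition. The sketch below reuses the date_partition and country_partition columns from the example in section 1.4 and assumes a page_views table with that layout:

SELECT pv.page_url, pv.userid
FROM page_views pv
WHERE pv.date_partition = '2009-12-23'    -- predicates on partition columns let Hive
  AND pv.country_partition = 'US';        -- skip every other partition entirely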

1.5、Type System

Hive supports primitive and complex data types, as described below. See Hive Data Types for additional information.

Hive 支援基本和複雜數據類型,如下所述。有關更多資訊,請參閱 Hive Data Types。

1.5.1、Primitive Types

  • Types are associated with the columns in the tables. The following Primitive types are supported:【類型和表中的列相關,下面是支援的基本數據類型:】

  • Integers

    • TINYINT—1 byte integer
    • SMALLINT—2 byte integer
    • INT—4 byte integer
    • BIGINT—8 byte integer
  • Boolean type

    • BOOLEAN—TRUE/FALSE
  • Floating point numbers

    • FLOAT—single precision
    • DOUBLE—Double precision
  • Fixed point numbers

    • DECIMAL—a fixed point value of user defined scale and precision
  • String types

    • STRING—sequence of characters in a specified character set
    • VARCHAR—sequence of characters in a specified character set with a maximum length
    • CHAR—sequence of characters in a specified character set with a defined length
  • Date and time types

    • TIMESTAMP — A date and time without a timezone (“LocalDateTime” semantics)
    • TIMESTAMP WITH LOCAL TIME ZONE — A point in time measured down to nanoseconds (“Instant” semantics)
    • DATE—a date
  • Binary types

    • BINARY—a sequence of bytes

The Types are organized in the following hierarchy (where the parent is a super type of all the children instances):

這些類型按以下層次結構組織(父實例是所有子實例的超類型):

  • Type
    • Primitive Type
      • Number
        • DOUBLE
          • FLOAT
            • BIGINT
              • INT
                • SMALLINT
                  • TINYINT
          • STRING
      • BOOLEAN

This type hierarchy defines how the types are implicitly converted in the query language. Implicit conversion is allowed for types from child to an ancestor. So when a query expression expects type1 and the data is of type2, type2 is implicitly converted to type1 if type1 is an ancestor of type2 in the type hierarchy. Note that the type hierarchy allows the implicit conversion of STRING to DOUBLE.

這種類型層次結構定義了如何在查詢語言中隱式地轉換類型。

允許從子類型到祖先類型的隱式轉換。

因此,當查詢表達式期望類型1且數據為類型2時,如果在類型層次結構中,類型1是類型2的祖先,則類型2將隱式轉換為類型1。

請注意,類型層次結構允許隱式地將 STRING 轉換為 DOUBLE。

Explicit type conversion can be done using the cast operator as shown in the #Built In Functions section below.

顯式類型轉換可以使用強制轉換操作符完成,如下面的 #Built in Functions 一節所示。
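A small sketch of an explicit conversion with the cast operator (the literal value and the page_views table are only illustrative):

SELECT CAST('123.45' AS DOUBLE) + 1,    -- explicit STRING to DOUBLE conversion
       CAST(pv.userid AS STRING)        -- explicit BIGINT to STRING conversion
FROM page_views pv;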

1.5.2、Complex Types

Complex Types can be built up from primitive types and other composite types using:

複雜類型可以從基本類型和其他組合類型構建:

Structs: the elements within the type can be accessed using the DOT (.) notation. For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a

  • Structs:類型中的元素可以使用點號(.)訪問。例如,c 列的類型是 STRUCT {a INT; b INT},可以通過表達式 c.a 訪問其中的 a 欄位。

Maps (key-value tuples): The elements are accessed using [‘element name’] notation. For example in a map M comprising of a mapping from ‘group’ -> gid the gid value can be accessed using M[‘group’]

  • Maps(鍵值元組):使用 ['element name'] 訪問元素。例如,映射 M 由 'group' -> gid 組成,可以通過 M['group'] 訪問 gid 值。

Arrays (indexable lists): The elements in the array have to be in the same type. Elements can be accessed using the [n] notation where n is an index (zero-based) into the array. For example, for an array A having the elements [‘a’, ‘b’, ‘c’], A[1] returns ‘b’.

  • Arrays(可索引的列表):數組中的元素必須具有相同的類型。可以使用 [n] 來訪問元素,n 是索引(從0開始)。例如,數組 A 有元素 ['a', 'b', 'c'],那麼 A[1] 將返回 ‘b’。

Using the primitive types and the constructs for creating complex types, types with arbitrary levels of nesting can be created. For example, a type User may comprise of the following fields:

使用基本類型和用於創建複雜類型的構造,可以創建具有任意嵌套級別的類型。例如,一個 User 類型可能包含以下欄位:

  • gender—which is a STRING.
  • active—which is a BOOLEAN.
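A hedged sketch of declaring and querying such nested types; the table and column names below are assumptions made for illustration only:

CREATE TABLE user_profiles (
  userid     BIGINT,
  info       STRUCT<gender: STRING, active: BOOLEAN>,    -- the User type sketched above
  properties MAP<STRING, STRING>,
  friends    ARRAY<BIGINT>
);

SELECT info.gender,                -- struct field via DOT notation
       properties['page type'],    -- map element via ['key']
       friends[0]                  -- array element via zero-based index
FROM user_profiles;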

1.5.3、Timestamp

Timestamps have been the source of much confusion, so we try to document the intended semantics of Hive.

Timestamps 一直是很多困惑的根源,所以我們試圖記錄 Hive 的語義。

Timestamp (“LocalDateTime” semantics)

Java’s “LocalDateTime” timestamps record a date and time as year, month, date, hour, minute, and seconds without a timezone. These timestamps always have those same values regardless of the local time zone.

Java 的 「LocalDateTime」 時間戳將日期和時間記錄為年、月、日、時、分和秒,而沒有時區。

無論本地時區是什麼,這些時間戳總是具有相同的值。

For example, the timestamp value of “2014-12-12 12:34:56” is decomposed into year, month, day, hour, minute and seconds fields, but with no time zone information available. It does not correspond to any specific instant. It will always be the same value regardless of the local time zone. Unless your application uses UTC consistently, timestamp with local time zone is strongly preferred over timestamp for most applications. When users say an event is at 10:00, it is always in reference to a certain timezone and means a point in time, rather than 10:00 in an arbitrary time zone.

例如,「2014-12-12 12:34:56」 的時間戳值被分解為年、月、日、小時、分鐘和秒欄位,但是沒有時區資訊可用。

它不對應於任何特定的時刻。它將始終是相同的值,無論當地時區是什麼。除非你的應用程式一致使用 UTC,否則具有當地時區的時間戳對於大多數應用程式來說都比時間戳更受歡迎。當用戶說一個事件在 10:00 時,它總是與某個時區有關,意思是一個時間點,而不是任意時區中的 10:00。

Timestamp with local time zone (“Instant” semantics)

Java’s “Instant” timestamps define a point in time that remains constant regardless of where the data is read. Thus, the timestamp will be adjusted by the local time zone to match the original point in time.

Java 的 “Instant” 時間戳定義了一個無論從何處讀取數據都保持不變的時間點。因此,時間戳將根據當地時區調整,以匹配原始的時間點。

Type                              Value in America/Los_Angeles    Value in America/New York
timestamp                         2014-12-12 12:34:56             2014-12-12 12:34:56
timestamp with local time zone    2014-12-12 12:34:56             2014-12-12 15:34:56
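A minimal sketch of declaring both kinds of timestamp columns; the table name is an assumption, and TIMESTAMP WITH LOCAL TIME ZONE requires a Hive release that supports it:

CREATE TABLE event_log (
  event_id   BIGINT,
  local_ts   TIMESTAMP,                        -- "LocalDateTime" semantics, no time zone
  instant_ts TIMESTAMP WITH LOCAL TIME ZONE    -- "Instant" semantics, adjusted on read
);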

Comparisons with other tools

見原文://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions

1.6、Built In Operators and Functions

The operators and functions listed below are not necessarily up to date. (Hive Operators and UDFs has more current information.) In Beeline or the Hive CLI, use these commands to show the latest documentation:

下面列出的操作符和函數不一定是最新的。在 Beeline 或 Hive 命令行中,使用這些命令顯示最新的文檔:

SHOW FUNCTIONS;
DESCRIBE FUNCTION <function_name>;
DESCRIBE FUNCTION EXTENDED <function_name>;

Case-insensitive. All Hive keywords are case-insensitive, including the names of Hive operators and functions.

不區分大小寫。所有 Hive 關鍵字不區分大小寫,包括 Hive 操作符和函數的名稱。

1.6.1、Built In Operators

Relational Operators:The following operators compare the passed operands and generate a TRUE or FALSE value, depending on whether the comparison between the operands holds or not.

  • 關係操作符:下面的操作符比較傳入的操作數,生成一個 TRUE 或 FALSE 值,這取決於操作數之間的比較是否有效。

操作符表格見原文://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions

Arithmetic Operators:The following operators support various common arithmetic operations on the operands. All of them return number types.

  • 算術操作符:以下操作符支援對操作數進行各種常見的算術操作。它們都返回數字類型。

操作符表格見原文://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions

Logical Operators:The following operators provide support for creating logical expressions. All of them return boolean TRUE or FALSE depending upon the boolean values of the operands.

  • 邏輯操作符:以下操作符提供了對創建邏輯表達式的支援。它們都返回布爾值 TRUE 或 FALSE,這取決於操作數的布爾值。

操作符表格見原文://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions

Operators on Complex Types:The following operators provide mechanisms to access elements in Complex Types

  • 在複雜類型上的操作符:下面的操作符提供了訪問複雜類型中元素的機制。

操作符表格見原文://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions
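Since the operator tables are omitted here, the following is a small combined sketch of the four operator families, assuming the page_view columns used elsewhere in this tutorial:

SELECT (pv.userid % 10) + 1           AS bucket_no,        -- arithmetic operators
       pv.page_url = pv.referrer_url  AS self_referral,    -- relational operator
       pv.friends[0]                  AS first_friend,     -- A[n] on an array column
       pv.properties['page type']     AS page_type         -- M['key'] on a map column
FROM page_view pv
WHERE pv.ip IS NOT NULL AND pv.userid > 0;                 -- logical operators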

1.6.2、Built In Functions

Hive supports the following built in functions:(Function list in source code: FunctionRegistry.java)

  • Hive 支援下面的內建函數:

函數表格見原文://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions

The following built in aggregate functions are supported in Hive:

  • Hive 中支援下面內建的聚合函數:

函數表格見原文://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions
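Likewise, a brief hedged sketch of a few built-in aggregate functions over the page_view table described above:

SELECT pv.country,
       count(1)                  AS page_views,        -- row count per group
       count(DISTINCT pv.userid) AS distinct_users,    -- distinct aggregate
       min(pv.viewTime)          AS first_view,
       max(pv.viewTime)          AS last_view
FROM page_view pv
GROUP BY pv.country;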

1.7、Language Capabilities

Hive’s SQL provides the basic SQL operations. These operations work on tables or partitions. These operations are:

Hive 的 SQL 提供了基本的 SQL 操作。這些操作在表或分區上工作。這些操作是:

  • 能夠使用 WHERE 子句從表中過濾行。

  • 能夠使用 SELECT 子句從表中選擇某些列。

  • 能夠在兩個表之間進行 equi-joins。

  • 能夠對存儲在表中的數據按多個 「group by」 列計算聚合。

  • 能夠將查詢結果存儲到另一個表中。

  • 能夠將表的內容下載到本地目錄(例如,nfs)。

  • 能夠將查詢結果存儲在 hadoop dfs 目錄中。

  • 能夠管理表和分區(創建、刪除和修改)。

  • 能夠用自選語言編寫自定義腳本,插入到自定義的 map/reduce 作業中。

Ability to filter rows from a table using a WHERE clause.
Ability to select certain columns from the table using a SELECT clause.
Ability to do equi-joins between two tables.
Ability to evaluate aggregations on multiple “group by” columns for the data stored in a table.
Ability to store the results of a query into another table.
Ability to download the contents of a table to a local (for example, nfs) directory.
Ability to store the results of a query in a hadoop dfs directory.
Ability to manage tables and partitions (create, drop and alter).
Ability to plug in custom scripts in the language of choice for custom map/reduce jobs.

2、Usage and Examples

NOTE: Many of the following examples are out of date. More up to date information can be found in the LanguageManual.

注意:下面的許多例子是過時的。可以在 LanguageManual 查找到更多資訊。

The following examples highlight some salient features of the system. A detailed set of query test cases can be found at Hive Query Test Cases and the corresponding results can be found at Query Test Case Results.

下面的例子突出了該系統的一些顯著特徵。

詳細的查詢測試用例集可以在 Hive Query Test Cases 中找到,相應的結果可以在 Query Test Case Results 中找到。

  • Creating, Showing, Altering, and Dropping Tables
  • Loading Data
  • Querying and Inserting Data

2.1、Creating, Showing, Altering, and Dropping Tables

See Hive Data Definition Language for detailed information about creating, showing, altering, and dropping tables.

見 Hive Data Definition Language 查看更多關於創建、展示、修改和刪除表的資訊。

2.1.1、Creating Tables

An example statement that would create the page_view table mentioned above would be like:

創建上面提到的 page_view 表的示例語句如下:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

In this example, the columns of the table are specified with the corresponding types. Comments can be attached both at the column level as well as at the table level. Additionally, the partitioned by clause defines the partitioning columns which are different from the data columns and are actually not stored with the data. When specified in this way, the data in the files is assumed to be delimited with ASCII 001(ctrl-A) as the field delimiter and newline as the row delimiter.

在本例中,表的列使用相應的類型指定。

可以在列級別和表級別添加註釋。

此外,partitioned by 子句定義了分區列,分區列與數據列不同,實際上並不與數據一起存儲。

當以這種方式指定時,假設文件中的數據以 ASCII 001(ctrl-A)作為欄位分隔符,以換行分隔符作為行分隔符。

The field delimiter can be parametrized if the data is not in the above format as illustrated in the following example:

如果數據不是上述格式,則可以參數化欄位分隔符:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\001'
STORED AS SEQUENCEFILE;

The row delimiter currently cannot be changed since it is determined not by Hive but by Hadoop delimiters.

行分隔符目前不能更改,因為它不是由 Hive 決定的,而是 Hadoop 決定的。

It is also a good idea to bucket the tables on certain columns so that efficient sampling queries can be executed against the data set. If bucketing is absent, random sampling can still be done on the table but it is not efficient as the query has to scan all the data. The following example illustrates the case of the page_view table that is bucketed on the userid column:

在某些列上對錶進行分桶是個好主意,以便對數據集執行有效的抽樣查詢。

如果不分桶,仍然可以對表做隨機抽樣,但效率不高,因為查詢必須掃描所有數據。

下面的例子說明了 page_view 表在 userid 列上分桶的情況:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\001'
        COLLECTION ITEMS TERMINATED BY '\002'
        MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;

In the example above, the table is clustered by a hash function of userid into 32 buckets. Within each bucket the data is sorted in increasing order of viewTime. Such an organization allows the user to do efficient sampling on the clustered column—in this case userid. The sorting property allows internal operators to take advantage of the better-known data structure while evaluating queries with greater efficiency.

在上面的例子中,表被 userid 的哈希函數聚集到 32 個桶中。在每個桶中,數據按 viewTime 的遞增順序排序。

這樣的組織允許用戶在聚集列上進行有效的抽樣,在本例中為 userid。

排序屬性允許內部操作符利用已知的數據結構,同時以更高的效率計算查詢。

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                friends ARRAY<BIGINT>, properties MAP<STRING, STRING>,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\001'
        COLLECTION ITEMS TERMINATED BY '\002'
        MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;

In this example, the columns that comprise the table row are specified in a similar way as the definition of types. Comments can be attached both at the column level as well as at the table level. Additionally, the partitioned by clause defines the partitioning columns which are different from the data columns and are actually not stored with the data. The CLUSTERED BY clause specifies which column to use for bucketing as well as how many buckets to create. The delimited row format specifies how the rows are stored in the hive table. In the case of the delimited format, this specifies how the fields are terminated, how the items within collections (arrays or maps) are terminated, and how the map keys are terminated. STORED AS SEQUENCEFILE indicates that this data is stored in a binary format (using hadoop SequenceFiles) on hdfs. The values shown for the ROW FORMAT and STORED AS clauses in the above example represent the system defaults.

在本例中,組成表的行的各列以與類型定義類似的方式指定。

可以在列級和表級附加註釋。

此外,partitioned by 子句定義了分區列,分區列與數據列不同,實際上並不與數據一起存儲。

CLUSTERED BY 子句指定使用哪個列進行分桶以及創建多少個桶。

分隔的行格式指定了行在 hive 表中的存儲方式。

對於分隔格式,它指定如何終止欄位,集合(數組或映射)中的項如何終止,以及如何終止映射鍵。

STORED AS SEQUENCEFILE 表示該數據以二進位格式(使用hadoop SequenceFiles)存儲在 hdfs 上。

上面示例中顯示的 ROW FORMAT 和 STORED AS 子句的值表示系統默認值。

Table names and column names are case insensitive.

表名和列名不區分大小寫。

2.1.2、Browsing Tables and Partitions

SHOW TABLES;

To list existing tables in the warehouse; there are many of these, likely more than you want to browse.

列出倉庫中已存在的表;有很多,可能比你想瀏覽的要多。

SHOW TABLES 'page.*';

To list tables with prefix ‘page’. The pattern follows Java regular expression syntax (so the period is a wildcard).

列出以 「page」 為前綴的表。該模式遵循 Java 正則表達式語法(因此句點是通配符)。

SHOW PARTITIONS page_view;

To list partitions of a table. If the table is not a partitioned table then an error is thrown.

列出一個表的分區。如果表不是一個分區表,那麼就拋出一個錯誤。

DESCRIBE page_view;

To list columns and column types of table.

列出表的列和列類型。

DESCRIBE EXTENDED page_view;

To list columns and all other properties of table. This prints lot of information and that too not in a pretty format. Usually used for debugging.

列出表的列和所有其他屬性。這會列印很多資訊,而且格式不太美觀。通常用於調試。

DESCRIBE EXTENDED page_view PARTITION (ds='2008-08-08');

To list columns and all other properties of a partition. This also prints lot of information which is usually used for debugging.

列出一個分區的列和所有其他屬性。這同樣會列印很多資訊,通常用於調試。

2.1.3、Altering Tables

To rename existing table to a new name. If a table with new name already exists then an error is returned:

將已存在的表重命名為一個新名字。如果具有新名字的表已存在,就返回一個錯誤:

ALTER TABLE old_table_name RENAME TO new_table_name;

To rename the columns of an existing table. Be sure to use the same column types, and to include an entry for each preexisting column:

重命名一個已存在表的列。確保使用相同的列類型,並為每個已存在的列都包含一個對應的條目:

ALTER TABLE old_table_name REPLACE COLUMNS (col1 TYPE, ...);

To add columns to an existing table:

給一個已存在的表添加列:

ALTER TABLE tab1 ADD COLUMNS (c1 INT COMMENT 'a new int column', c2 STRING DEFAULT 'def val');

Note that a change in the schema (such as the adding of the columns), preserves the schema for the old partitions of the table in case it is a partitioned table. All the queries that access these columns and run over the old partitions implicitly return a null value or the specified default values for these columns.

注意,模式的更改(如添加列)會保留表的舊分區的模式(如果它是一個分區表)。

訪問這些列,並在舊分區上運行的所有查詢都會隱式地為這些列返回空值或指定的默認值。

In later versions, the behavior of assuming certain values, as opposed to throwing an error when the column is not found in a particular partition, can be made configurable.

在以後的版本中,當在某個特定分區中找不到該列時,是假定某個默認值還是拋出錯誤,這一行為可以做成可配置的。

2.1.4、Dropping Tables and Partitions

Dropping tables is fairly trivial. A drop on the table would implicitly drop any indexes(this is a future feature) that would have been built on the table. The associated command is:

刪除表非常簡單。對錶進行刪除操作將隱式地刪除在表上構建的任何索引(這是將來的特性)。相關的命令是:

DROP TABLE pv_users;

To drop a partition, alter the table to drop that partition:

要刪除一個分區,可以通過修改表來刪除該分區:

ALTER TABLE pv_users DROP PARTITION (ds='2008-08-08')

Note that any data for this table or partitions will be dropped and may not be recoverable.

注意,該表或分區的任何數據都將被刪除,並且可能無法恢復。

2.2、Loading Data

There are multiple ways to load data into Hive tables. The user can create an external table that points to a specified location within HDFS. In this particular usage, the user can copy a file into the specified location using the HDFS put or copy commands and create a table pointing to this location with all the relevant row format information. Once this is done, the user can transform the data and insert them into any other Hive table. For example, if the file /tmp/pv_2008-06-08.txt contains comma separated page views served on 2008-06-08, and this needs to be loaded into the page_view table in the appropriate partition, the following sequence of commands can achieve this:

將數據載入到 Hive 表中有多種方式。

用戶可以創建一個外部表,該表指向 HDFS 中的指定位置。在這種特殊的用法中,用戶可以使用 HDFS 的 put 或 copy 命令將文件複製到指定的位置,並創建一個指向該位置的表,其中包含所有相關的行格式資訊。

一旦完成,用戶就可以轉換數據,並將它們插入到任何其他 Hive 表中。例如,如果文件 /tmp/pv_2008-06-08.txt 包含了 2008-06-08 上的以逗號分隔的頁面視圖,並且需要將其載入到 page_view 表的相應分區中,下面的命令序列可以實現這一點:

CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User',
                country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';
 
hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view
 
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';

This code results in an error due to LINES TERMINATED BY limitation

由於 LINES TERMINATED BY 限制,這個程式碼會導致錯誤

FAILED: SemanticException 6:67 LINES TERMINATED BY only supports newline '\n' right now. Error encountered near token ''12''

See HIVE-5999 (Allow other characters for LINES TERMINATED BY, //issues.apache.org/jira/browse/HIVE-5999) and HIVE-11996 (Row Delimiter other than '\n' throws error in Hive, //issues.apache.org/jira/browse/HIVE-11996).

In the example above, nulls are inserted for the array and map types in the destination tables but potentially these can also come from the external table if the proper row formats are specified.

在上面的示例中,將為目標表中的 array 和 map 類型插入 nulls,但如果指定了適當的行格式,這些 nulls 也可能來自外部表。

This method is useful if there is already legacy data in HDFS on which the user wants to put some metadata so that the data can be queried and manipulated using Hive.

如果 HDFS 中已經有一些遺留數據,用戶想要在這些數據上放一些元數據,這樣就可以使用 Hive 來查詢和操作這些數據,那麼這種方法非常有用。

Additionally, the system also supports syntax that can load the data from a file in the local files system directly into a Hive table where the input data format is the same as the table format. If /tmp/pv_2008-06-08_us.txt already contains the data for US, then we do not need any additional filtering as shown in the previous example. The load in this case can be done using the following syntax:

此外,系統還支援將本地文件系統中文件的數據直接載入到 Hive 表中,並且輸入的數據格式與表格式相同。

如果 /tmp/pv_2008-06-08_us.txt 已經包含了 US 的數據,那麼我們不需要像前面的例子中所示的任何額外的過濾。在這種情況下,可以使用以下語法進行載入:

LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')

The path argument can take a directory (in which case all the files in the directory are loaded), a single file name, or a wildcard (in which case all the matching files are uploaded). If the argument is a directory, it cannot contain subdirectories. Similarly, the wildcard must match file names only.

path 參數可以接受一個目錄(在這種情況下,該目錄中的所有文件都被載入)、一個單個文件名或一個通配符(在這種情況下,所有匹配的文件都被上傳)。

如果參數是目錄,則不能包含子目錄。類似地,通配符必須只匹配文件名。
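A hedged sketch of the directory and wildcard path forms described above (the concrete paths are illustrative assumptions):

-- a directory: every file directly inside it is loaded; it must not contain subdirectories
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08/'
INTO TABLE page_view PARTITION(date='2008-06-08', country='US');

-- a wildcard: all files whose names match the pattern are loaded
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_*.txt'
INTO TABLE page_view PARTITION(date='2008-06-08', country='US');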

In the case that the input file /tmp/pv_2008-06-08_us.txt is very large, the user may decide to do a parallel load of the data (using tools that are external to Hive). Once the file is in HDFS – the following syntax can be used to load the data into a Hive table:

如果輸入文件 /tmp/pv_2008-06-08_us.txt 非常大,用戶可能會決定並行載入數據(使用 Hive 外部的工具)。一旦文件在 HDFS 中,可以使用以下語法將數據載入到 Hive 表中:

LOAD DATA INPATH '/user/data/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')

It is assumed that the array and map fields in the input.txt files are null fields for these examples.

對於這些例子,假設 input.txt 文件中的 array 和 map 欄位是空欄位。

See Hive Data Manipulation Language for more information about loading data into Hive tables, and see External Tables for another example of creating an external table.

2.3、Querying and Inserting Data

The Hive query operations are documented in Select, and the insert operations are documented in Inserting data into Hive Tables from queries and Writing data into the filesystem from queries.

Hive 查詢操作記錄在 Select 中,insert 操作記錄在 Inserting data into Hive Tables from queries 和 Writing data into the filesystem from queries 中。

2.3.1、Simple Query

For all the active users, one can use the query of the following form:

對於所有活躍用戶,可以使用下面形式的查詢:

INSERT OVERWRITE TABLE user_active
SELECT user.*
FROM user
WHERE user.active = 1;

Note that unlike SQL, we always insert the results into a table. We will illustrate later how the user can inspect these results and even dump them to a local file. You can also run the following query in Beeline or the Hive CLI:

注意,與 SQL 不同,我們總是將結果插入表中。我們稍後將演示用戶如何檢查這些結果,甚至將它們轉儲到本地文件。

你也可以在 Beeline 或 Hive 命令行中執行如下查詢:

SELECT user.*
FROM user
WHERE user.active = 1;

This will be internally rewritten to some temporary file and displayed to the Hive client side.

這將在內部被重寫到一些臨時文件,並顯示到 Hive 客戶端。

2.3.2、Partition Based Query

What partitions to use in a query is determined automatically by the system on the basis of where clause conditions on partition columns. For example, in order to get all the page_views in the month of 03/2008 referred from domain xyz.com, one could write the following query:

在查詢中,使用什麼分區由系統根據分區列上的 where 子句條件自動決定。

例如,為了從域名 xyz.com 獲得所有在 2008 年 3 月的 page_views,可以寫以下查詢:

INSERT OVERWRITE TABLE xyz_com_page_views
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' AND
      page_views.referrer_url like '%xyz.com';

Note that page_views.date is used here because the table (above) was defined with PARTITIONED BY(date DATETIME, country STRING) ; if you name your partition something different, don’t expect .date to do what you think!

注意,這裡使用 page_views.date 是因為上面的表是用 PARTITIONED BY(date DATETIME, country STRING) 定義的;如果你給你的分區起了不同的名字,別指望 .date 會像你想的那樣!

2.3.3、Joins

In order to get a demographic breakdown (by gender) of page_view of 2008-03-03 one would need to join the page_view table and the user table on the userid column. This can be accomplished with a join as shown in the following query:

為了獲得 2008-03-03 的 page_view 的人口統計分類(按性別),需要在 userid 列上連接 page_view 表和 user 表。這可以通過 join 來實現,如下面的查詢所示:

INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM user u JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

In order to do outer joins the user can qualify the join with LEFT OUTER, RIGHT OUTER or FULL OUTER keywords in order to indicate the kind of outer join (left preserved, right preserved or both sides preserved). For example, in order to do a full outer join in the query above, the corresponding syntax would look like the following query:

為了進行外部連接,用戶可以使用 LEFT OUTER、RIGHT OUTER 或 FULL OUTER 關鍵字來限定連接,以表明外部連接的類型(左保留、右保留或兩邊保留)。

例如,為了在上面的查詢中執行 full outer join,相應的語法看起來像下面的查詢:

INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM user u FULL OUTER JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

In order to check the existence of a key in another table, the user can use LEFT SEMI JOIN as illustrated by the following example.

為了檢查另一個表中是否存在一個鍵,用戶可以使用 LEFT SEMI JOIN,如下面的例子所示。

INSERT OVERWRITE TABLE pv_users
SELECT u.*
FROM user u LEFT SEMI JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

In order to join more than one table, the user can use the following syntax:

為了連接多個表,用戶可以使用以下語法:

INSERT OVERWRITE TABLE pv_friends
SELECT pv.*, u.gender, u.age, f.friends
FROM page_view pv JOIN user u ON (pv.userid = u.id) JOIN friend_list f ON (u.id = f.uid)
WHERE pv.date = '2008-03-03';

Note that Hive only supports equi-joins. Also it is best to put the largest table on the rightmost side of the join to get the best performance.

Hive 只支援 equi-joins。另外,最好將最大的表放在連接的最右邊,以獲得最佳性能。

2.3.4、Aggregations

In order to count the number of distinct users by gender one could write the following query:

為了按性別計算不同用戶的數量,可以編寫以下查詢:

INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count (DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

Multiple aggregations can be done at the same time; however, no two aggregations can have different DISTINCT columns. For example, while the following is possible:

多個聚合可以同時進行,但是,任何兩個聚合不能使用不同的 DISTINCT 列。例如,下面的查詢是允許的:

INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

however, the following query is not allowed

但是,下面的查詢是不允許的

INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
FROM pv_users
GROUP BY pv_users.gender;

2.3.5、Multi Table/File Inserts

The output of the aggregations or simple selects can be further sent into multiple tables or even to hadoop dfs files (which can then be manipulated using hdfs utilities). For example, if along with the gender breakdown, one needed to find the breakdown of unique page views by age, one could accomplish that with the following query:

聚合或簡單選擇的輸出可以進一步發送到多個表中,甚至發送到 hadoop dfs 文件中(然後可以使用 hdfs 實用程式操作這些文件)。

例如,如果在性別分類的同時,需要找到按年齡劃分的唯一頁面瀏覽量,可以通過以下查詢來實現:

FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.gender
 
INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
    SELECT pv_users.age, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.age;

The first insert clause sends the results of the first group by to a Hive table while the second one sends the results to a hadoop dfs files.

第一個 insert 子句將第一個 group by 的結果發送到 Hive 表中,而第二個 insert 子句將結果發送到 hadoop dfs 文件中。

2.3.6、Dynamic-Partition Insert

In the previous examples, the user has to know which partition to insert into and only one partition can be inserted in one insert statement. If you want to load into multiple partitions, you have to use multi-insert statement as illustrated below.

在前面的例子中,用戶必須知道要插入哪個分區,並且在一條 insert 語句中只能插入一個分區。

如果想要載入到多個分區,必須使用如下所示的多插入(multi-insert)語句。

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'US'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA')
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'CA'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK')
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'UK';

In order to load data into all country partitions in a particular day, you have to add an insert statement for each country in the input data. This is very inconvenient since you have to have the priori knowledge of the list of countries exist in the input data and create the partitions beforehand. If the list changed for another day, you have to modify your insert DML as well as the partition creation DDLs. It is also inefficient since each insert statement may be turned into a MapReduce Job.

為了將某一天的數據載入到所有 country 分區中,必須為輸入數據中的每個國家添加一條 insert 語句。

這非常不方便,因為你必須事先知道輸入數據中存在的國家列表,並預先創建分區。

如果某天國家列表發生了變化,則必須修改 insert DML 以及創建分區的 DDL。這也是低效的,因為每條 insert 語句都可能被轉換為一個 MapReduce job。

Dynamic-partition insert (or multi-partition insert) is designed to solve this problem by dynamically determining which partitions should be created and populated while scanning the input table. This is a newly added feature that is only available from version 0.6.0. In the dynamic partition insert, the input column values are evaluated to determine which partition this row should be inserted into. If that partition has not been created, it will create that partition automatically. Using this feature you need only one insert statement to create and populate all necessary partitions. In addition, since there is only one insert statement, there is only one corresponding MapReduce job. This significantly improves performance and reduce the Hadoop cluster workload comparing to the multiple insert case.

動態分區插入(或多分區插入)是為了解決這個問題而設計的,它在掃描輸入表時動態地確定應該創建和填充哪些分區。

這是一個新添加的特性,從 0.6.0 版本開始才可用。

在動態分區插入中,將計算輸入列的值,以確定應該將該行插入哪個分區。如果還沒有創建該分區,它將自動創建該分區。

使用這個特性,你只需要一條 insert 語句就可以創建和填充所有必要的分區。另外,由於只有一條 insert 語句,所以對應的 MapReduce job 也只有一個。

與多次插入相比,這顯著提高了性能並減少了 Hadoop 集群的工作負載。

Below is an example of loading data to all country partitions using one insert statement:

下面是一個使用 insert 語句將數據載入到所有 country 分區的例子:

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country

There are several syntactic differences from the multi-insert statement:

與多插入(multi-insert)語句相比,動態分區插入有幾個語法上的區別:

country appears in the PARTITION specification, but with no value associated. In this case, country is a dynamic partition column. On the other hand, ds has a value associated with it, which means it is a static partition column. If a column is dynamic partition column, its value will be coming from the input column. Currently we only allow dynamic partition columns to be the last column(s) in the partition clause because the partition column order indicates its hierarchical order (meaning dt is the root partition, and country is the child partition). You cannot specify a partition clause with (dt, country=’US’) because that means you need to update all partitions with any date and its country sub-partition is ‘US’.

  • country 出現在 PARTITION 規範中,但沒有關聯的值。

在本例中,country 是一個動態分區列。另一方面,ds 有一個與之相關聯的值,這意味著它是一個靜態分區列。

如果一個列是動態分區列,那麼它的值將來自輸入列。

目前,我們只允許動態分區列是 partition 子句中的最後一列,因為分區列的順序表明了它的層次順序(意味著 dt 是根分區,而 country 是子分區)。

你不能用 (dt, country='US') 這樣的分區子句,因為這意味著你需要更新所有日期下 country 子分區為 'US' 的分區。

An additional pvs.country column is added in the select statement. This is the corresponding input column for the dynamic partition column. Note that you do not need to add an input column for the static partition column because its value is already known in the PARTITION clause. Note that the dynamic partition values are selected by ordering, not name, and taken as the last columns from the select clause.

  • 在 select 語句中添加一個額外的 pvs.country 列。

這是動態分區列的相應輸入列。

注意,你不需要為靜態 partition 列添加輸入列,因為它的值在 PARTITION 子句中已知。

注意,動態分區值是按照順序選擇的,而不是名稱,並作為 select 子句的最後一列。

Semantics of the dynamic partition insert statement:

動態分區插入語句的語義:

When there are already non-empty partitions exists for the dynamic partition columns, (for example, country=’CA’ exists under some ds root partition), it will be overwritten if the dynamic partition insert saw the same value (say ‘CA’) in the input data. This is in line with the ‘insert overwrite’ semantics. However, if the partition value ‘CA’ does not appear in the input data, the existing partition will not be overwritten.

  • 當動態分區列已經存在非空的分區時(例如,country=’CA’ 存在於某些 ds 根分區下),如果動態分區插入在輸入數據中遇到相同的值(比如’CA’),那麼它將被覆蓋。這符合 ‘insert overwrite’ 語義。但是,如果分區值 ‘CA’ 沒有出現在輸入數據中,則不會覆蓋現有分區。

Since a Hive partition corresponds to a directory in HDFS, the partition value has to conform to the HDFS path format (URI in Java). Any character having a special meaning in URI (for example, ‘%’, ‘:’, ‘/’, ‘#’) will be escaped with ‘%’ followed by 2 bytes of its ASCII value.

  • 一個 Hive 分區對應 HDFS 中的一個目錄,分區值必須符合 HDFS 的路徑格式(Java為URI)。任何在 URI 中有特殊含義的字元(例如’%’,’:’,’/’,’#’)將用 ‘%’ 後跟它的 ASCII 值的 2 個位元組進行轉義。

If the input column is a type different than STRING, its value will be first converted to STRING to be used to construct the HDFS path.

  • 如果輸入列不是 STRING 類型,則首先將其值轉換為 STRING,用於構建 HDFS 路徑。

If the input column value is NULL or empty string, the row will be put into a special partition, whose name is controlled by the hive parameter hive.exec.default.partition.name. The default value is __HIVE_DEFAULT_PARTITION__. Basically this partition will contain all “bad” rows whose value are not valid partition names. The caveat of this approach is that the bad value will be lost and is replaced by __HIVE_DEFAULT_PARTITION__ if you select them in Hive. JIRA HIVE-1309 is a solution to let user specify “bad file” to retain the input partition column values as well.

  • 如果輸入的列值為 NULL 或空字元串,行將被放入一個特殊的分區,分區的名稱由 hive 參數 hive.exec.default.partition.name 控制。默認值為 __HIVE_DEFAULT_PARTITION__。基本上,這個分區將包含所有值不是有效分區名的「壞」行。這種方法的缺點是,「壞」值會丟失,在 Hive 中查詢時會被 __HIVE_DEFAULT_PARTITION__ 替換。JIRA HIVE-1309 是一個讓用戶指定「壞文件」來保留輸入分區列值的解決方案。

Dynamic partition insert could potentially be a resource hog in that it could generate a large number of partitions in a short time. To get yourself buckled, we define three parameters:

  • 動態分區插入可能會大量消耗資源,因為它可能在短時間內生成大量分區。為了加以限制,我們定義了三個參數:

    • hive.exec.max.dynamic.partitions.pernode(默認值是100):每個 mapper 或 reducer 可以創建的最大動態分區數。如果一個 mapper 或 reducer 創建的分區數超過這個閾值,將從該 mapper/reducer(通過計數器)拋出一個致命錯誤,整個 job 將被殺死。

    • hive.exec.max.dynamic.partitions(默認值為1000):是一個 DML 可以創建的動態分區的總數。如果每個 mapper/reducer 都沒有超過這個限制,但是動態分區的總數卻超過了,那麼在 job 結束時,中間數據被移動到最終目的地之前,會引發一個異常。

    • hive.exec.max.created.files(默認值為100000)是所有 mappers 和 reducers 創建的最大文件總數。這是通過在創建新文件時由每個 mapper/reducer 更新 Hadoop 計數器來實現的。如果總數超過了 hive.exec.max.created.files 時,一個致命錯誤將被拋出,job 將被終止。

hive.exec.max.dynamic.partitions.pernode (default value being 100) is the maximum dynamic partitions that can be created by each mapper or reducer. If one mapper or reducer created more than the threshold, a fatal error will be raised from the mapper/reducer (through counter) and the whole job will be killed.

hive.exec.max.dynamic.partitions (default value being 1000) is the total number of dynamic partitions could be created by one DML. If each mapper/reducer did not exceed the limit but the total number of dynamic partitions does, then an exception is raised at the end of the job before the intermediate data are moved to the final destination.

hive.exec.max.created.files (default value being 100000) is the maximum total number of files created by all mappers and reducers. This is implemented by updating a Hadoop counter by each mapper/reducer whenever a new file is created. If the total number is exceeding hive.exec.max.created.files, a fatal error will be thrown and the job will be killed.

Another situation we want to protect against dynamic partition insert is that the user may accidentally specify all partitions to be dynamic partitions without specifying one static partition, while the original intention is to just overwrite the sub-partitions of one root partition. We define another parameter hive.exec.dynamic.partition.mode=strict to prevent the all-dynamic partition case. In the strict mode, you have to specify at least one static partition. The default mode is strict. In addition, we have a parameter hive.exec.dynamic.partition=true/false to control whether to allow dynamic partition at all. The default value is false prior to Hive 0.9.0 and true in Hive 0.9.0 and later.

我們希望防止動態分區插入的另一種情況是,用戶可能會意外地將所有分區指定為動態分區,而沒有指定一個靜態分區,而最初的目的只是覆蓋一個根分區的子分區。

我們定義另一個參數 hive.exec.dynamic.partition.mode=strict,防止出現全部動態分區的情況。在嚴格模式下,必須指定至少一個靜態分區。默認模式為嚴格的。

此外,我們還有一個參數 hive.exec.dynamic.partition=true/false 來控制是否允許動態分區。Hive 0.9.0 之前的默認值為 false, Hive 0.9.0 之後的默認值為 true。
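A hedged sketch of a session setup that enables dynamic partitioning and raises the limits discussed above; the values are illustrative only, not recommendations:

set hive.exec.dynamic.partition=true;               -- allow dynamic partitions at all
set hive.exec.dynamic.partition.mode=nonstrict;     -- do not require a static partition
set hive.exec.max.dynamic.partitions.pernode=200;   -- per mapper/reducer limit
set hive.exec.max.dynamic.partitions=2000;          -- limit per DML statement
set hive.exec.max.created.files=200000;             -- total files across all tasks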

In Hive 0.6, dynamic partition insert does not work with hive.merge.mapfiles=true or hive.merge.mapredfiles=true, so it internally turns off the merge parameters. Merging files in dynamic partition inserts are supported in Hive 0.7 (see JIRA HIVE-1307 for details).

在 Hive 0.6 中,動態分區插入不能與 hive.merge.mapfiles=truehive.merge.mapredfiles=true 一起使用。因此它在內部關閉了合併參數。

Hive 0.7 支援在動態分區插入中合併文件(詳見 JIRA HIVE-1307)。

Troubleshooting and best practices:

故障排除和最佳實踐:

As stated above, when there are too many dynamic partitions created by a particular mapper/reducer, a fatal error could be raised and the job will be killed. The error message looks something like:

如上所述,一個特定的 mapper/reducer 創建的動態分區太多了,可能會引發一個致命錯誤,job 將被終止。錯誤資訊是這樣的:

    beeline> set hive.exec.dynamic.partition.mode=nonstrict;
    beeline> FROM page_view_stg pvs
          INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
                 SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip,
                        from_unixtime(pvs.viewTime, 'yyyy-MM-dd') ds, pvs.country;
...
2010-05-07 11:10:19,816 Stage-1 map = 0%,  reduce = 0%
[Fatal Error] Operator FS_28 (id=41): fatal error. Killing the job.
Ended Job = job_201005052204_28178 with errors
...

The problem with this is that one mapper will take a random set of rows and it is very likely that the number of distinct (dt, country) pairs will exceed the limit of hive.exec.max.dynamic.partitions.pernode. One way around it is to group the rows by the dynamic partition columns in the mapper and distribute them to the reducers where the dynamic partitions will be created. In this case the number of distinct dynamic partitions will be significantly reduced. The above example query could be rewritten to:

這樣做的問題是,一個 mapper 將取一個隨機的行集,並且很可能不同的(dt, country)對的數量將超過 hive.exec.max.dynamic.partitions.pernode 的限制。

解決這個問題的一種方法是按照 mapper 中的動態分區列對行進行分組,並將它們分發到將在其中創建動態分區的 reducers 中。

在這種情況下,不同動態分區的數量將顯著減少。上面的示例查詢可以重寫為:

beeline> set hive.exec.dynamic.partition.mode=nonstrict;
beeline> FROM page_view_stg pvs
      INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
             SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip,
                    from_unixtime(pvs.viewTime, 'yyyy-MM-dd') ds, pvs.country
             DISTRIBUTE BY ds, country;

This query will generate a MapReduce job rather than Map-only job. The SELECT-clause will be converted to a plan to the mappers and the output will be distributed to the reducers based on the value of (ds, country) pairs. The INSERT-clause will be converted to the plan in the reducer which writes to the dynamic partitions.

該查詢將生成一個 MapReduce job,而不是 Map-only job。

SELECT 子句將被轉換為 mappers 的計劃,並根據 (ds, country)對的值將輸出分配給 reducers。

INSERT 子句將被轉換為 reducer 中的計劃,用於寫入動態分區。


2.3.7、Inserting into Local Files

In certain situations you would want to write the output into a local file so that you could load it into an excel spreadsheet. This can be accomplished with the following command:

在某些情況下,可能希望將輸出寫入到本地文件,以便將其載入到 excel 表格中。這可以通過以下命令完成:

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_gender_sum'
SELECT pv_gender_sum.*
FROM pv_gender_sum;

2.3.8、Sampling

The sampling clause allows the users to write queries for samples of the data instead of the whole table. Currently the sampling is done on the columns that are specified in the CLUSTERED BY clause of the CREATE TABLE statement. In the following example we choose 3rd bucket out of the 32 buckets of the pv_gender_sum table:

抽樣子句允許用戶對數據的樣本(而不是整個表)編寫查詢。

目前,抽樣是在 CREATE TABLE 語句的 CLUSTERED BY 子句中指定的列上進行的。

在下面的例子中,我們從 pv_gender_sum 表的 32 個桶中選擇第三個桶:

INSERT OVERWRITE TABLE pv_gender_sum_sample
SELECT pv_gender_sum.*
FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 32);

In general the TABLESAMPLE syntax looks like:

一般情況下,TABLESAMPLE 語法如下:

TABLESAMPLE(BUCKET x OUT OF y)

y has to be a multiple or divisor of the number of buckets in that table as specified at the table creation time. The buckets chosen are determined if bucket_number modulo y is equal to x. So in the above example the following tablesample clause

y 必須是該表在創建時指定的桶數的倍數或除數。

如果 bucket_number 對 y 取模等於 x,則該桶被選中。所以在上面的例子中,下面的 tablesample 子句

TABLESAMPLE(BUCKET 3 OUT OF 16)

would pick out the 3rd and 19th buckets. The buckets are numbered starting from 0.

選出第3和第19個桶。桶從0開始編號。

On the other hand the tablesample clause

另一方面,tablesample 子句

TABLESAMPLE(BUCKET 3 OUT OF 64 ON userid)

would pick out half of the 3rd bucket.

會取出第三個桶的一半。

2.3.9、Union All

The language also supports union all, for example, if we suppose there are two different tables that track which user has published a video and which user has published a comment, the following query joins the results of a union all with the user table to create a single annotated stream for all the video publishing and comment publishing events:

語言還支援 union all,

例如,如果我們假設有兩個不同的表,跟蹤用戶發布的一個影片,和用戶發表的評論。

下面的查詢使用 user 表 join 了 union all 的結果,來創建一個帶注釋的所有影片發布和評論發布事件流:

INSERT OVERWRITE TABLE actions_users
SELECT u.id, actions.date
FROM (
    SELECT av.uid AS uid
    FROM action_video av
    WHERE av.date = '2008-06-03'
 
    UNION ALL
 
    SELECT ac.uid AS uid
    FROM action_comment ac
    WHERE ac.date = '2008-06-03'
    ) actions JOIN users u ON(u.id = actions.uid);

2.3.10、Array Operations

Array columns in tables can be as follows:

表中的數組列可以如下:

CREATE TABLE array_table (int_array_column ARRAY<INT>);

Assuming that pv.friends is of the type ARRAY (i.e. it is an array of integers), the user can get a specific element in the array by its index as shown in the following command:

假設 pv.friends 的類型為 ARRAY(即它是一個整型數組),用戶可以通過數組的索引獲取數組中的特定元素,如下所示:

SELECT pv.friends[2]
FROM page_views pv;

The select expression gets the third item in the pv.friends array.

select 表達式獲得 pv.friends 數組中的第三項。

The user can also get the length of the array using the size function as shown below:

用戶也可以使用 size 函數獲取數組的長度,如下所示:

SELECT pv.userid, size(pv.friends)
FROM page_view pv;

2.3.11、Map (Associative Arrays) Operations

Maps provide collections similar to associative arrays. Such structures can only be created programmatically currently. We will be extending this soon. For the purpose of the current example assume that pv.properties is of the type map<String, String> i.e. it is an associative array from strings to string. Accordingly, the following query:

映射提供了類似於關聯數組的集合。目前只能以編程方式創建此類結構。我們很快就會擴展這個。

對於當前的例子,假設 pv.properties 類型為 map<String, String>,即它是一個從字元串到字元串的關聯數組。因此,以下查詢:

INSERT OVERWRITE TABLE page_views_map
SELECT pv.userid, pv.properties['page type']
FROM page_views pv;

can be used to select the ‘page_type’ property from the page_views table.

可用於從 page_views 表中選擇 ‘page_type’ 屬性。

Similar to arrays, the size function can also be used to get the number of elements in a map as shown in the following query:

與數組類似,size 函數也可以用於獲取 map 中元素的數量,如下所示:

SELECT size(pv.properties)
FROM page_view pv;

2.3.12、Custom Map/Reduce Scripts

Users can also plug in their own custom mappers and reducers in the data stream by using features natively supported in the Hive language. For example, in order to run a custom mapper script – map_script – and a custom reducer script – reduce_script – the user can issue the following command which uses the TRANSFORM clause to embed the mapper and the reducer scripts.

用戶還可以使用 Hive 語言本地支援的特性,在數據流中插入自己的自定義 mappers 和 reducers。

例如,為了運行一個自定義 mapper 腳本 map_script 和一個自定義 reducer 腳本 reduce_script,用戶可以發出以下命令,該命令使用 TRANSFORM 子句來嵌入 mapper 和 reducer 腳本。

Note that columns will be transformed to string and delimited by TAB before feeding to the user script, and the standard output of the user script will be treated as TAB-separated string columns. User scripts can output debug information to standard error which will be shown on the task detail page on hadoop.

注意,在提供給用戶腳本之前,列將被轉換為字元串,並以製表符分隔,用戶腳本的標準輸出將被視為製表符分隔的字元串列。

用戶腳本可以將調試資訊輸出到標準錯誤,這些錯誤將顯示在 hadoop 上的 task detail 頁面上。

FROM (
     FROM pv_users
     MAP pv_users.userid, pv_users.date
     USING 'map_script'
     AS dt, uid
     CLUSTER BY dt) map_output
 
 INSERT OVERWRITE TABLE pv_users_reduced
     REDUCE map_output.dt, map_output.uid
     USING 'reduce_script'
     AS date, count;

Sample map script (weekday_mapper.py)

示例 map 腳本(weekday_mapper.py):

import sys
import datetime

for line in sys.stdin:
  # Each input line from the MAP clause is a TAB-separated (userid, unixtime) pair.
  line = line.strip()
  userid, unixtime = line.split('\t')
  # Convert the UNIX timestamp to an ISO weekday number (1 = Monday ... 7 = Sunday).
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  # Emit the userid and weekday as a comma-separated line on standard output.
  print(','.join([userid, str(weekday)]))

Of course, both MAP and REDUCE are “syntactic sugar” for the more general select transform. The inner query could also have been written as such:

當然,MAP 和 REDUCE 對於更一般的 select 轉換來說都是「語法糖」。內部查詢也可以這樣寫:

SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script' AS dt, uid CLUSTER BY dt FROM pv_users;

Schema-less map/reduce: If there is no “AS” clause after “USING map_script”, Hive assumes the output of the script contains 2 parts: key which is before the first tab, and value which is the rest after the first tab. Note that this is different from specifying “AS key, value” because in that case value will only contains the portion between the first tab and the second tab if there are multiple tabs.

無模式的 map/reduce:

如果 USING map_script 後沒有 AS 子句,Hive 假設腳本輸出包含兩個部分:key 在第一個製表符前,value 在第一個製表符後。

注意,這與指定 「AS key, value」 不同,因為在這種情況下,如果有多個製表符,value 將只包含第一個製表符和第二個製表符之間的部分。

In this way, we allow users to migrate old map/reduce scripts without knowing the schema of the map output. User still needs to know the reduce output schema because that has to match what is in the table that we are inserting to.

通過這種方式,我們允許用戶在不知道 map 輸出模式的情況下遷移舊的 map/reduce 腳本。

用戶仍然需要知道 reduce 輸出模式,因為它必須與我們要插入的表中的內容相匹配。

FROM (
    FROM pv_users
    MAP pv_users.userid, pv_users.date
    USING 'map_script'
    CLUSTER BY key) map_output
 
INSERT OVERWRITE TABLE pv_users_reduced
 
    REDUCE map_output.dt, map_output.uid
    USING 'reduce_script'
    AS date, count;

Distribute By and Sort By: Instead of specifying “cluster by”, the user can specify “distribute by” and “sort by”, so the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix of sort columns, but that is not required.

Distribute By 和 Sort By:用戶可以指定 Distribute By 和 Sort By,而不是指定 cluster By,因此分區列和排序列可以不同。

通常情況下,分區列是排序列的前綴,但這不是必需的。

FROM (
    FROM pv_users
    MAP pv_users.userid, pv_users.date
    USING 'map_script'
    AS c1, c2, c3
    DISTRIBUTE BY c2
    SORT BY c2, c1) map_output
 
INSERT OVERWRITE TABLE pv_users_reduced
 
    REDUCE map_output.c1, map_output.c2, map_output.c3
    USING 'reduce_script'
    AS date, count;

2.3.13、Co-Groups

Amongst the user community using map/reduce, cogroup is a fairly common operation wherein the data from multiple tables are sent to a custom reducer such that the rows are grouped by the values of certain columns on the tables. With the UNION ALL operator and the CLUSTER BY specification, this can be achieved in the Hive query language in the following way. Suppose we wanted to cogroup the rows from the actions_video and action_comments table on the uid column and send them to the ‘reduce_script’ custom reducer, the following syntax can be used by the user:

在使用 map/reduce 的用戶社區中,cogroup 是一種相當常見的操作,其中來自多個表的數據被發送到一個自定義的 reducer,這樣行就按表上某些列的值分組。

通過 UNION ALL 操作符和 CLUSTER BY 規範,Hive 查詢語言可以通過以下方式實現這一點。

假設我們想在 uid 列上對 actions_video 和 action_comments 表中的行進行分組,並將它們發送到 ‘reduce_script’ 自定義 reducer,用戶可以使用以下語法:

FROM (
     FROM (
             FROM action_video av
             SELECT av.uid AS uid, av.id AS id, av.date AS date
 
            UNION ALL
 
             FROM action_comment ac
             SELECT ac.uid AS uid, ac.id AS id, ac.date AS date
     ) union_actions
     SELECT union_actions.uid, union_actions.id, union_actions.date
     CLUSTER BY union_actions.uid) map
 
 INSERT OVERWRITE TABLE actions_reduced
     SELECT TRANSFORM(map.uid, map.id, map.date) USING 'reduce_script' AS (uid, id, reduced_val);