42Fire Developer

2021년 4월 8일 목요일

[ Hive ] Oozie와 Sqoop을 통해 가져온 데이터시간 RDB!=HIVE 불일치시

[ 파이프라인 ]

Oozie -> Sqoop -> Hive

[ 문제 ]

RDB의 타임존이 UTC라 한국시간보다 9시간 느리다.

그래서 Oozie XML에서 Sqoop Query작성을 아래와 같이

SELECT convert_tz('${created_date} 00:00:00', '+00:00', '+09:00') as DATE FROM ...

시간을 +9하여 한국시간에 맞춰 가져온다면

Hive에서는 +9시간 더 더해져 가져오는 경우가 있다.

날짜로 파티션을 하는 하이브테이블일 경우 날짜자체가 달라져

데이터 정합성에 오류가 일어날 수 있다.

RDB 시간 : 2021-04-07 14:00:00 (UTC)
예상한 HIVE 시간 : 2021-04-07 23:00:00 (KST)
결과 HIVE 시간 : 2021-04-08 08:00:00

[ 해결 ]

Ooize Scheduler의 시간을 서울로 지정했다면

Sqoop실행시 연결된 RDB의 시간을 비교하여

자동적으로 한국시간에 맞게 +9시간을 해준다.

따라서 Sqoop쿼리에서 convert_tz를 사용하지 않고 쿼리를 작성한다.

Sqoop쿼리의 Where절과는 상관없다.

최종결과 아웃풋이 나오고 SELECT된 컬럼(DATE관련 타입)들에게 Oozie시간대에 맞게 조정하기 때문이다.

2021년 4월 7일 수요일

[ Hive ] 일주일 단위로 Group By 쿼리

Hive파티션 yymmdd=20210407

주별로 그룹핑하기 위해서 yyyy-mm-dd형식의 데이터 포맷이여야한다.

아래 함수를 써 변경

from_unixtime(unix_timestamp(orr.yymmdd,'yyyymmdd'),'yyyy-mm-dd')

WEEKOFYEAR(yyyy-mm-dd value) 를 사용해 주별로 그룹핑을 한다

Ex )

SELECT
WEEKOFYEAR(from_unixtime(unix_timestamp(yymmdd,'yyyymmdd'),'yyyy-mm-dd')) as week,
sum(col1) as `컬럼1`,
sum(col2) as `컬럼2`,

sum(col3) as `컬럼3`
FROM test_db
GROUP BY

WEEKOFYEAR(from_unixtime(unix_timestamp(yymmdd,'yyyymmdd'),'yyyy-mm-dd'))
;

week라는 컬럼이 현재 년도에서 몇번째 주인지를 나타내는 int값으로 리턴을 하기에,

특정주의 특정요일로 날짜를 표시하려면 아래와 같은 쿼리를 Group by절에 사용한다.

(일요일 기준)

...

GROUP BY

date_sub(from_unixtime(unix_timestamp(yymmdd,'yyyymmdd'),'yyyy-mm-dd'), pmod(datediff(from_unixtime(unix_timestamp(yymmdd,'yyyymmdd'),'yyyy-mm-dd'),'1900-01-07'),7))

[ Spark] 로컬환경에서 Hive Thrift접속 예시

import org.apache.spark.SparkConf

import org.apache.spark.SparkContext._

import org.apache.spark.SparkContext

import org.apache.spark.sql.SparkSession

object SimpleApp {

def main(args: Array[String]): Unit = {

val conf = new SparkConf()

.setAppName("HiveToPhoenix")

.setMaster("local[*]")

val sc = new SparkContext(conf)

val spark = SparkSession.builder()

.appName("Spark Hive Example")

.config("hive.metastore.uris","thrift://11.22.333.444:10000")

.enableHiveSupport()

.getOrCreate()

val jdbcDF = spark.read.format("jdbc")

.option("url", "jdbc:hive2://11.22.333.444:10000")

.option("dbtable", "temp.test_db")

.option("user", "hive")

.option("password", "1234")

.option("driver", "org.apache.hive.jdbc.HiveDriver")

.option("numberPartitons",5)

.load()

println("able to connect------------------")

jdbcDF.show()

jdbcDF.printSchema

spark.sql("SELECT * FROM temp.test_dbwhere yymmdd=20210322").show()

sc.stop()

}

2021년 4월 4일 일요일

[ Spark ] CDH phoenix 연동관련 설정

spark -> 구성 -> 범위(Gateway) -> 범주(고급) -> spark-conf/spark-defaults.conf에 대한 Spark클라이언트 고급구성스니펫 에

spark.executor.extraClassPath=/opt/cloudera/parcels/PHOENIX-5.0.0-cdh6.2.0.p0.1308267/lib/phoenix/phoenix-5.0.0-cdh6.2.0-client.jar

spark.driver.extraClassPath=/opt/cloudera/parcels/PHOENIX-5.0.0-cdh6.2.0.p0.1308267/lib/phoenix/phoenix-5.0.0-cdh6.2.0-client.jar

와 같이 외부 jar파일 Classpath에 인식하도록 설정

https://docs.cloudera.com/documentation/enterprise/6/6.2/topics/phoenix_spark_connector.html

[ Practice Scala ] List.map() 예시 (Update List)

문제 ]

리스트 절대값 반환

Sample Input

2
-4
3
-1
23
-4
-54

Sample Output

제출 ]

def f(arr:List[Int]):List[Int] = return arr.map(Math.abs(_))

풀이 ]

List.map을 이용하여 배열의 모든 요소에 function을 적용한다.

final def map[B](f: (A) => B): List[B]

Builds a new list by applying a function to all elements of this list.

B: the element type of the returned list.
f: the function to apply to each element.
returns: a new list resulting from applying the given function f to each element of this list and collecting the results.

Math함수의 abs()함수를 이용해여 배열의 모든 요소의 값을 절대값으로 바꾼다.

def f(arr:List[Int]):List[Int] = return arr.map(Math.abs(_))

또는 Math함수를 쓰지 않고 따로 함수를 만들어 인자값으로 사용할 수도 있다.

def f(arr:List[Int]):List[Int] = return arr.map(k)

def k(s:Int) : Int = {

if(s < 0) return s * -1

else return s

}

2021년 4월 3일 토요일

[ Practice Scala ] List합계 (Sum of Odd Elements)

문제 ]

배열 요소 중 홀수 값의 합계

Sample Input

3, 2, 4, 6, 5, 7, 8, 0, 1

Sample Output

Explanation

Sum of odd elements is 3+5+7+1 = 16

제출 ]

def f(arr:List[Int]):Int = arr.filter(_ % 2 != 0).sum

풀이 ]

filter와 sum 함수를 써서

num / 2의 나머지값이 0보다 크면 더하는 작업을 수행한다

2021년 4월 1일 목요일

[ Practice Scala ] List선언 - Range (Array Of N Elements)

문제 ]

Create an array of integers, where the value of is passed as an argument to the pre-filled function in your editor. This challenge uses a custom checker, so you can create any array of integers. For example, if , you could return , , or any other array of equal length.

Note: Code stubs are provided for almost every language in which you must either fill in a blank (i.e., ____) or write your code in the area specified by comments.

Method Signature

Number Of Parameters: 1
Parameters: [n]
Returns: List or Vector

Input Format

A single integer, .

Constraints

The members returned by the list/vector/array must be integers.

Output Format

The function must return an array, list, or vector of integers. Stub code in the editor prints this to stdout as a space, comma, or semicolon-separated list (depending on your submission language).

Note: Your output need not match the Expected Output exactly; the size of your printed list is confirmed by a custom checker, which determines whether or not you passed each test case.

Sample Input 0

Sample Output 0

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Sample Input 1

Sample Output 1

[1, 2, 3]

제출 ]

object Solution extends App {

import scala.io.StdIn.readInt

def f(num:Int) : List[Int] = {

return List.fill(num)(1)

return List.range(0,num)

}

println(f(readInt))

}

풀이 ]

num = 4

return List.fill(num)(1) : 리스트 선언시 1로 첫번째인자(num) 만큼 채운다.

ex List(1,1,1,1)

return List.range(0,num) : 리스트 선언시 0~num까지의 수로 채운다.

ex List(0,1,2,3)

세번자 인자 추가시 배열의 스텝을 정한다.

List.range(0,10,2)

ex List(0,2,4,6,8)