42Fire Developer

Wednesday, April 14, 2021

[ Practice Scala ] Recursive Functions (Evaluating e^x)

Problem ]

The series expansion of e^x is given by:

e^x = 1 + x + x^2/2! + x^3/3! + x^4/4! + ...

Evaluate e^x for given values of x by using the above expansion for the first 10 terms.

Input Format

The first line contains an integer N, the number of test cases.
N lines follow. Each line contains a value of x for which you need to output the value of e^x using the above series expansion. These input values have exactly 4 decimal places each.

Output Format

Output N lines, each containing the value of e^x, computed by your program.

Constraints

Var, Val in Scala and def and defn in Clojure are blocked keywords. The challenge is to accomplish this without either mutable state or direct declaration of local variables.

Sample Input

4
20.0000
5.0000
0.5000
-0.5000

Sample Output

2423600.1887
143.6895
1.6487
0.6065

Submission ]

import java.io._
import java.math._
import java.security._
import java.text._
import java.util._
import java.util.concurrent._
import java.util.function._
import java.util.regex._
import java.util.stream._

object Solution {

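    // Evaluate e^x using the first 10 terms of the series (i = 0 to 9).
    def f(x: Double): Double = {
      f2(x, 9)
    }

    // Recursive factorial i! (fits in an Int for the i <= 9 used here).
    def fact(x: Int): Int = if (x <= 1) 1 else x * fact(x - 1)

    // Recursively sums x^i / i! from i down to 0; the i == 0 term is 1.
    def f2(x: Double, i: Int): Double = {
      if (i == 0) 1 else Math.pow(x, i) / fact(i) + f2(x, i - 1)
    }

    def main(args: Array[String]): Unit = {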
        val stdin = scala.io.StdIn

        val n = stdin.readLine.trim.toInt

        for (nItr <- 1 to n) {
            val x = stdin.readLine.trim.toDouble
            println(f(x))
        }
    }
}


Explanation ]
The solution uses two recursive functions: one that sums the terms x^i / i!, and one that computes the factorial i!.
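
The two recursions can also be folded into one. The following is a minimal sketch, not the submitted solution: ExpSeries, exp, and loop are names made up for illustration. It relies on the identity x^i / i! = (x^(i-1) / (i-1)!) * x / i, so each term is derived from the previous one and neither Math.pow nor a separate factorial function is needed. Like the submission, it declares nothing with val or var.

```scala
object ExpSeries {
  // Sum of the first `terms` terms of the series for e^x.
  // `exp` and `loop` are illustrative names, not part of the original submission.
  def exp(x: Double, terms: Int = 10): Double = {
    // i    : index of the term about to be added
    // term : x^i / i!
    // acc  : sum of all terms with index below i
    @scala.annotation.tailrec
    def loop(i: Int, term: Double, acc: Double): Double =
      if (i >= terms) acc
      else loop(i + 1, term * x / (i + 1), acc + term)

    loop(0, 1.0, 0.0)
  }
}
```

With the default of 10 terms, ExpSeries.exp(5.0) agrees with the sample output 143.6895 to four decimal places. Because the recursion is in tail position, the compiler turns it into a loop, so a larger number of terms would not deepen the call stack.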

Monday, April 12, 2021

[ Spark ] The Relationship Between Partitions and Shuffles

Determining the number of partitions in the shuffle stage based on executor memory

https://jaemunbro.medium.com/apache-spark-partition-%EA%B0%9C%EC%88%98%EC%99%80-%ED%81%AC%EA%B8%B0-%EC%A0%95%ED%95%98%EA%B8%B0-3a790bd4675d

On Spark optimization and tuning

https://nephtyws.github.io/data/spark-optimization-part-1/

Thursday, April 8, 2021

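[ Hive ] When timestamps imported via Oozie and Sqoop differ between the RDB and Hive

[ Pipeline ]

Oozie -> Sqoop -> Hive

[ Problem ]

The RDB's time zone is UTC, which is 9 hours behind Korean time (KST).

Suppose the Sqoop query in the Oozie XML is therefore written as follows, adding 9 hours to match Korean time:

SELECT convert_tz('${created_date} 00:00:00', '+00:00', '+09:00') as DATE FROM ...

In that case, the values sometimes arrive in Hive with a further 9 hours added.

For a Hive table partitioned by date, the date itself can change, which can break data consistency.

RDB time: 2021-04-07 14:00:00 (UTC)
Expected Hive time: 2021-04-07 23:00:00 (KST)
Actual Hive time: 2021-04-08 08:00:00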
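
The three times listed here can be reproduced with plain java.time arithmetic. The sketch below only models the symptom as two successive UTC-to-KST conversions; it is an assumption about the effect, not a description of what Sqoop or Oozie do internally, and TzShift and shift are hypothetical names.

```scala
import java.time.{LocalDateTime, ZoneId, ZoneOffset}

object TzShift {
  val utc: ZoneId = ZoneOffset.UTC
  val kst: ZoneId = ZoneId.of("Asia/Seoul")

  // Interpret wall-clock time `t` in zone `from` and express the same instant in zone `to`.
  def shift(t: LocalDateTime, from: ZoneId, to: ZoneId): LocalDateTime =
    t.atZone(from).withZoneSameInstant(to).toLocalDateTime

  // Value stored in the RDB (UTC).
  val rdb: LocalDateTime = LocalDateTime.of(2021, 4, 7, 14, 0)

  // One conversion: the expected Hive value (KST).
  val once: LocalDateTime = shift(rdb, utc, kst)

  // Assumed failure mode: the already-converted value is treated as UTC and converted again.
  val twice: LocalDateTime = shift(once, utc, kst)
}
```

Here once falls on 2021-04-07 at 23:00 and twice on 2021-04-08 at 08:00, so the second conversion pushes the row into the next day's partition.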

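[ Solution ]

If the Oozie scheduler's time zone is set to Seoul, Sqoop compares it with the time of the connected RDB at run time and automatically adds 9 hours to match Korean time.

Therefore, write the Sqoop query without convert_tz.

This has nothing to do with the WHERE clause of the Sqoop query, because the adjustment to the Oozie time zone is applied to the SELECTed columns (date-related types) after the final result set has been produced.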

Wednesday, April 7, 2021

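[ Hive ] GROUP BY Query by Week

Hive partition: yymmdd=20210407

To group by week, the date must be in yyyy-MM-dd format.

Convert it with the following expression (in these patterns MM is the month; lowercase mm means minutes):

from_unixtime(unix_timestamp(orr.yymmdd,'yyyyMMdd'),'yyyy-MM-dd')

Then use WEEKOFYEAR(yyyy-MM-dd value) to group by week.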

Ex )

SELECT
WEEKOFYEAR(from_unixtime(unix_timestamp(yymmdd,'yyyyMMdd'),'yyyy-MM-dd')) as week,
sum(col1) as `column1`,
sum(col2) as `column2`,
sum(col3) as `column3`
FROM test_db
GROUP BY
WEEKOFYEAR(from_unixtime(unix_timestamp(yymmdd,'yyyyMMdd'),'yyyy-MM-dd'))
;

The week column is returned as an int giving the week number within the current year. To show each week as the date of a specific weekday instead, use an expression like the following in the GROUP BY clause.

(Sunday-based)

...

GROUP BY

date_sub(from_unixtime(unix_timestamp(yymmdd,'yyyyMMdd'),'yyyy-MM-dd'), pmod(datediff(from_unixtime(unix_timestamp(yymmdd,'yyyyMMdd'),'yyyy-MM-dd'),'1900-01-07'),7))
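
The date_sub / pmod / datediff expression works because 1900-01-07 was a Sunday: the day count from that anchor modulo 7 is the number of days since the most recent Sunday. The same arithmetic is sketched below in Scala with java.time so it can be checked outside Hive. WeekBucket and its methods are hypothetical names, and isoWeek assumes that Hive's WEEKOFYEAR follows ISO-style weeks (Monday start).

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.time.temporal.{ChronoUnit, IsoFields}

object WeekBucket {
  // Same anchor as in the Hive expression; 1900-01-07 was a Sunday.
  private val anchorSunday: LocalDate = LocalDate.of(1900, 1, 7)

  // Parse a partition value such as "20210407" (BASIC_ISO_DATE is the yyyyMMdd layout).
  def parse(yymmdd: String): LocalDate =
    LocalDate.parse(yymmdd, DateTimeFormatter.BASIC_ISO_DATE)

  // Mirrors date_sub(d, pmod(datediff(d, '1900-01-07'), 7)).
  def weekStartSunday(d: LocalDate): LocalDate = {
    val daysSinceSunday = Math.floorMod(ChronoUnit.DAYS.between(anchorSunday, d), 7L)
    d.minusDays(daysSinceSunday)
  }

  // ISO week number, assumed here to correspond to WEEKOFYEAR.
  def isoWeek(d: LocalDate): Int = d.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR)
}
```

For the partition 20210407 (a Wednesday), weekStartSunday returns 2021-04-04. Note that under the ISO assumption that Sunday belongs to the previous week number (13 rather than 14), so the two grouping keys should not be mixed in one query.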

[ Spark ] Example of Connecting to Hive Thrift from a Local Environment

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("HiveToPhoenix")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)

    // getOrCreate() reuses the SparkContext created above
    val spark = SparkSession.builder()
      .appName("Spark Hive Example")
      .config("hive.metastore.uris", "thrift://11.22.333.444:10000")
      .enableHiveSupport()
      .getOrCreate()

    // Read a Hive table through the Hive JDBC driver
    val jdbcDF = spark.read.format("jdbc")
      .option("url", "jdbc:hive2://11.22.333.444:10000")
      .option("dbtable", "temp.test_db")
      .option("user", "hive")
      .option("password", "1234")
      .option("driver", "org.apache.hive.jdbc.HiveDriver")
      .option("numPartitions", 5)
      .load()

    println("able to connect------------------")
    jdbcDF.show()
    jdbcDF.printSchema()

    // Query the same table through the Hive-enabled session
    spark.sql("SELECT * FROM temp.test_db WHERE yymmdd=20210322").show()

    sc.stop()
  }
}

Sunday, April 4, 2021

[ Spark ] Configuration for CDH Phoenix Integration

Go to Spark -> Configuration -> Scope (Gateway) -> Category (Advanced) and add the following to the Spark Client Advanced Configuration Snippet for spark-conf/spark-defaults.conf:

spark.executor.extraClassPath=/opt/cloudera/parcels/PHOENIX-5.0.0-cdh6.2.0.p0.1308267/lib/phoenix/phoenix-5.0.0-cdh6.2.0-client.jar

spark.driver.extraClassPath=/opt/cloudera/parcels/PHOENIX-5.0.0-cdh6.2.0.p0.1308267/lib/phoenix/phoenix-5.0.0-cdh6.2.0-client.jar

This puts the external Phoenix client JAR on the executor and driver classpaths so that Spark can find it.

https://docs.cloudera.com/documentation/enterprise/6/6.2/topics/phoenix_spark_connector.html

[ Practice Scala ] List.map() Example (Update List)

Problem ]

Return the absolute value of every element of a list.

Sample Input

2
-4
3
-1
23
-4
-54

Sample Output

2
4
3
1
23
4
54

Submission ]

def f(arr: List[Int]): List[Int] = arr.map(Math.abs(_))

Explanation ]

List.map applies a function to every element of the list.

final def map[B](f: (A) => B): List[B]

Builds a new list by applying a function to all elements of this list.

B: the element type of the returned list.
f: the function to apply to each element.
returns: a new list resulting from applying the given function f to each element of this list and collecting the results.
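
As the signature shows, the result type B does not have to match the element type of the input list. The following is a small sketch with hypothetical names (MapDemo, absAll, describe) that maps the same List[Int] once to List[Int] and once to List[String].

```scala
object MapDemo {
  // B = Int: the result has the same element type as the input.
  def absAll(arr: List[Int]): List[Int] = arr.map(Math.abs(_))

  // B = String: map may change the element type.
  def describe(arr: List[Int]): List[String] =
    arr.map(n => s"$n -> ${Math.abs(n)}")
}
```

In both cases map returns a new list of the same length and leaves the original list unchanged.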

Using Math.abs(), every element of the list is replaced with its absolute value.

def f(arr: List[Int]): List[Int] = arr.map(Math.abs(_))

Alternatively, instead of Math.abs, you can write your own function and pass it as the argument.

def f(arr: List[Int]): List[Int] = arr.map(k)

def k(s: Int): Int = {
  if (s < 0) s * -1
  else s
}