Tuesday, October 20, 2020

[Algorithm] Comparing stream input values as lists ( Compare the Triplets )

Compare the Triplets

[ Problem ]

Alice and Bob each created one problem for HackerRank. A reviewer rates the two challenges, awarding points on a scale from 1 to 100 for three categories: problem clarity, originality, and difficulty.

The rating for Alice's challenge is the triplet a = (a[0], a[1], a[2]), and the rating for Bob's challenge is the triplet b = (b[0], b[1], b[2]).

The task is to find their comparison points by comparing a[0] with b[0], a[1] with b[1], and a[2] with b[2].

If a[i] > b[i], then Alice is awarded 1 point.
If a[i] < b[i], then Bob is awarded 1 point.
If a[i] = b[i], then neither person receives a point.

Comparison points is the total points a person earned.

Given a and b, determine their respective comparison points.

Example

a = [1, 2, 3]
b = [3, 2, 1]

For elements 0, Bob is awarded a point because a[0] < b[0].

For the equal elements a[1] and b[1], no points are earned.

Finally, for elements 2, a[2] > b[2] so Alice receives a point.

The return array is [1, 1] with Alice's score first and Bob's second.

Function Description

Complete the function compareTriplets in the editor below.

compareTriplets has the following parameter(s):

int a[3]: Alice's challenge rating
int b[3]: Bob's challenge rating

Return

int[2]: Alice's score is in the first position, and Bob's score is in the second.

Input Format

The first line contains 3 space-separated integers, a[0], a[1], and a[2], the respective values in triplet a.
The second line contains 3 space-separated integers, b[0], b[1], and b[2], the respective values in triplet b.

Constraints

1 ≤ a[i] ≤ 100
1 ≤ b[i] ≤ 100

Sample Input 0

5 6 7
3 6 10

Sample Output 0

1 1

Explanation 0

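In this example:

a = (a[0], a[1], a[2]) = (5, 6, 7)
b = (b[0], b[1], b[2]) = (3, 6, 10)

Now, let's compare each individual score:

a[0] > b[0], so Alice receives 1 point.
a[1] = b[1], so nobody receives a point.
a[2] < b[2], so Bob receives 1 point.

Alice's comparison score is 1, and Bob's comparison score is 1. Thus, we return the array [1, 1].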

Sample Input 1

17 28 30
99 16 8

Sample Output 1

2 1

Explanation 1

Comparing the 0th elements, 17 < 99, so Bob receives a point.
Comparing the 1st and 2nd elements, 28 > 16 and 30 > 8, so Alice receives two points.
The return array is [2, 1].

[ Submission: JAVA ]

import java.io.*;
import java.math.*;
import java.security.*;
import java.text.*;
import java.util.*;
import java.util.concurrent.*;
import java.util.function.*;
import java.util.regex.*;
import java.util.stream.*;
import static java.util.stream.Collectors.joining;
import static java.util.stream.Collectors.toList;

public class Solution {

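    // Complete the compareTriplets function below.
    static List<Integer> compareTriplets(List<Integer> a, List<Integer> b) {
        // result.get(0) holds Alice's score, result.get(1) holds Bob's score.
        List<Integer> result = new ArrayList<Integer>();
        result.add(0);
        result.add(0);
        for (int i = 0; i < a.size(); i++) {
            // Compare by value; == on two Integer objects compares references.
            int cmp = Integer.compare(a.get(i), b.get(i));
            if (cmp > 0) result.set(0, result.get(0) + 1);
            else if (cmp < 0) result.set(1, result.get(1) + 1);
            // Equal values award no point to either side.
        }
        return result;
    }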

    public static void main(String[] args) throws IOException {
        BufferedReader bufferedReader = 
            new BufferedReader(new InputStreamReader(System.in));
        BufferedWriter bufferedWriter = 
            new BufferedWriter(new OutputStreamWriter(System.out));

        List<Integer> a = Stream.of(bufferedReader.readLine()
            .replaceAll("\\s+$", "").split(" "))
            .map(Integer::parseInt).collect(toList());

        List<Integer> b = Stream.of(bufferedReader.readLine()
            .replaceAll("\\s+$", "").split(" "))
            .map(Integer::parseInt).collect(toList());

        List<Integer> result = compareTriplets(a, b);

        bufferedWriter.write(
            result.stream().map(Object::toString).collect(joining(" "))+ "\n"
        );

        bufferedReader.close();
        bufferedWriter.close();
    }
}

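[ Solution ]

1. BufferedReader and BufferedWriter handle the input and output data.

InputStreamReader(System.in) and OutputStreamWriter(System.out) are there to read data from and write data to the console.

2. Stream, a feature added in Java 8, lets you refer to the elements of an Array/Collection one at a time and process them repeatedly with lambdas.

The split, map, and collect calls split the String read from the console using " " as the delimiter, convert each token to an Integer, and then turn the stream into a list that is stored in a variable of a generic type.

That is, entering 11 22 33 on the console produces the List [11, 22, 33].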
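The parsing step can be tried on its own without the console. The sketch below is a minimal illustration that feeds a fixed String through the same split / map / collect chain; the class name TripletLineParser is hypothetical and is not part of the submitted solution.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class; the name is illustrative and not part of the submission.
final class TripletLineParser {

    // Splits one input line on spaces and converts every token to an Integer.
    static List<Integer> parse(String line) {
        return Stream.of(line.replaceAll("\\s+$", "").split(" ")) // drop trailing whitespace, then tokenize
                .map(Integer::parseInt)                            // String -> Integer
                .collect(Collectors.toList());                     // Stream<Integer> -> List<Integer>
    }

    public static void main(String[] args) {
        System.out.println(parse("11 22 33")); // prints [11, 22, 33]
    }
}
```

Passing a String instead of reading System.in keeps the example deterministic; the chain itself is the same one used in main.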

3. compareTriplets(a, b) compares a[i] with b[i] (i is the loop variable) and stores the outcome in a new List variable, result.

When a[i] is larger, result[0] is incremented by 1; when b[i] is larger, result[1] is incremented by 1.

Sunday, October 18, 2020

Apache Hive [2] - Tuning the CDH Hive (using Spark) configuration

[ Prerequisites ]

1. All settings are changed under CDH -> Yarn / Hive -> Configuration.

2. Each spark executor runs in one yarn container, and each spark task runs on one core (yarn vcore). A single spark executor can run several spark tasks concurrently.

3. To use spark as the Hive engine, the engine setting must be changed to spark explicitly. The default is MapReduce.

[ Example environment ]

The example environment is a YARN cluster with 40 hosts, each assumed to have 32 cores and 120GB of memory.

[ YARN Configuration ]

1. yarn.nodemanager.resource.cpu-vcores

Typically one core each is reserved for the Yarn NodeManager and the HDFS DataNode, and 2 more cores are reserved for the OS, which leaves at most 28 cores (32 - 1 - 1 - 2) for Yarn.

2. yarn.nodemanager.resource.memory-mb

Each host has 120GB of memory, so set this to 100GB (102400 MB) to leave comfortable headroom.

[ Spark Configuration ]

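- Considerations

When sizing executor memory, more memory improves query and join performance, but garbage collection overhead can grow with it.

More executor cores also improve performance, but with too many the spark job may not start until enough cores and memory have been allocated (Race condition), or other applications may slow down.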

1. spark.executor.cores

To minimize the number of unused cores, the recommendation is to set this to 3, 4, 5, or 6 depending on the number of cores allocated to YARN. With the example's 28 YARN cores, a value of 4 leaves no core unused: 28 % 4 = 0
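The choice above can be checked with a few lines of arithmetic. This is only a sketch of the reasoning: the class and method names are hypothetical, and breaking ties toward the smaller value is an arbitrary assumption of the example, not a Cloudera rule.

```java
// Hypothetical helper; the class and method names are illustrative only.
final class ExecutorCoresChooser {

    // Returns the candidate in [3, 6] that leaves the fewest YARN vcores unused on a host.
    // Ties go to the smaller candidate (an arbitrary choice made for this sketch).
    static int choose(int yarnVcores) {
        int best = 3;
        for (int cores = 3; cores <= 6; cores++) {
            if (yarnVcores % cores < yarnVcores % best) {
                best = cores;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int yarnVcores = 32 - 1 - 1 - 2; // NodeManager, DataNode, and 2 cores for the OS
        System.out.println(yarnVcores + " vcores -> spark.executor.cores = " + choose(yarnVcores));
    }
}
```

For 28 vcores the remainders are 1, 0, 3, and 4 for the candidates 3, 4, 5, and 6, so 4 is selected.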

2. spark.executor.memory

With spark.executor.cores set to 4, a host can run at most 7 executors at the same time (28 / 4). Each executor can therefore be allocated about 14GB of memory (100GB / 7).

3. spark.executor.memoryOverhead

spark.executor.memoryOverhead is used for VM overhead, interned strings, and other overhead.

The total memory allocated to an executor includes this overhead memory.

That is, executor memory = spark.executor.memory + spark.executor.memoryOverhead

The default value of spark.executor.memoryOverhead is executor memory * 0.1, with a minimum of 384 (MB).

So when each executor is allocated 14GB of memory, the settings can be

spark.executor.memory = 12GB

spark.executor.memoryOverhead = 2GB

as one possible split.

In addition, the sum of spark.executor.memory and spark.executor.memoryOverhead must not exceed yarn.scheduler.maximum-allocation-mb.
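The executor sizing steps can be written out as plain arithmetic. The following is a rough sketch under the example's numbers; the class and method names are hypothetical, and the 16GB value used for yarn.scheduler.maximum-allocation-mb is an assumption chosen only to make the check concrete.

```java
// Hypothetical helper for the sizing arithmetic; names are illustrative only.
final class ExecutorMemorySizing {

    // Executors that fit on one host, limited by vcores.
    static int executorsPerHost(int yarnVcores, int executorCores) {
        return yarnVcores / executorCores;
    }

    // Whole GB available to each executor (heap plus overhead), rounded down.
    static int memoryPerExecutorGb(int yarnMemoryGb, int executorsPerHost) {
        return yarnMemoryGb / executorsPerHost;
    }

    // Default overhead as described in the text: 10% of executor memory, at least 384 MB.
    static long defaultOverheadMb(long executorMemoryMb) {
        return Math.max(384L, executorMemoryMb / 10);
    }

    // True when heap + overhead fits in the largest container YARN will grant.
    static boolean fitsInContainer(long heapMb, long overheadMb, long maxAllocationMb) {
        return heapMb + overheadMb <= maxAllocationMb;
    }

    public static void main(String[] args) {
        int perHost = executorsPerHost(28, 4);           // 28 / 4
        int totalGb = memoryPerExecutorGb(100, perHost); // 100 / 7, rounded down
        long heapMb = 12L * 1024;                        // spark.executor.memory = 12GB
        long overheadMb = 2L * 1024;                     // spark.executor.memoryOverhead = 2GB
        long maxAllocationMb = 16L * 1024;               // assumed yarn.scheduler.maximum-allocation-mb
        System.out.println(perHost + " executors per host, " + totalGb + "GB each");
        System.out.println("default overhead for a 12GB heap: " + defaultOverheadMb(heapMb) + "MB");
        System.out.println("fits in container: " + fitsInContainer(heapMb, overheadMb, maxAllocationMb));
    }
}
```

The 2GB overhead in the example is more generous than the 10% default, which is a safe direction to round when the 14GB budget allows it.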

[ Spark Driver Memory Configuring ]

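The spark driver memory has to be configured as well. The relevant settings are listed below.

spark.driver.memory : the maximum Java heap size allocated to the spark driver when Hive is running on spark.

spark.yarn.driver.memoryOverhead : the additional off-heap memory that can be requested from YARN per driver.

Let yarn.nodemanager.resource.memory-mb = x and the spark driver memory = y. Then:

y = 12GB when x is greater than 50GB
y = 4GB when x is between 12GB and 50GB
y = 1GB when x is between 1GB and 12GB
y = 256MB when x is less than 1GB

Since yarn.nodemanager.resource.memory-mb = 100GB, the spark driver memory is 12GB. As a result,

spark.driver.memory=10.5GB,

spark.yarn.driver.memoryOverhead=1.5GB (set to 10-15% of the total spark driver memory)

is one possible configuration.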
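The tier table reads naturally as a small lookup function. This is a sketch only: the names are hypothetical, and because the text does not say which tier the exact boundary values (12GB, 1GB) belong to, assigning them to the higher tier is an assumption of the example.

```java
// Hypothetical helper; boundary handling at exactly 12GB and 1GB is an assumption.
final class DriverMemoryTiers {

    private static final long GB = 1024L; // sizes are expressed in MB

    // Maps yarn.nodemanager.resource.memory-mb (x) to the total spark driver memory (y), in MB.
    static long driverTotalMb(long nodeManagerMemoryMb) {
        if (nodeManagerMemoryMb > 50 * GB) return 12 * GB;
        if (nodeManagerMemoryMb >= 12 * GB) return 4 * GB;
        if (nodeManagerMemoryMb >= GB) return GB;
        return 256L;
    }

    public static void main(String[] args) {
        long totalMb = driverTotalMb(100 * GB);  // the example cluster: x = 100GB
        long overheadMb = totalMb * 125 / 1000;  // 12.5%, inside the 10-15% range
        long heapMb = totalMb - overheadMb;
        System.out.println("driver total = " + totalMb + "MB");
        System.out.println("spark.driver.memory = " + heapMb + "MB");   // 10752MB = 10.5GB
        System.out.println("driver overhead = " + overheadMb + "MB");   // 1536MB = 1.5GB
    }
}
```

Taking 12.5% of 12GB as overhead reproduces the 10.5GB / 1.5GB split used in the example.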

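[ Choosing the number of executors ]

The number of executors in the cluster is determined by the number of executors on each host and the number of worker hosts. With 40 worker hosts in the cluster, the maximum number of executors is 160 (40 * 4 (number of cores)). The actual maximum is slightly lower because the driver uses 1 core and 12GB of memory.

Hive performance is directly related to the number of executors used to run a query. A reasonable starting point is therefore about half the maximum number of executors.

spark.executor.instances=80

Fixing spark.executor.instances at a tuned value maximizes performance for a single query, but in a production environment where several users run Hive queries it degrades overall performance. Cloudera therefore recommends letting the number of executors be allocated dynamically. ( By default, spark.executor.instances is allocated dynamically )

References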

: https://docs.cloudera.com/documentation/enterprise/latest/topics/admin_hos_tuning.html

: https://clouderatemp.wpengine.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/

: https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-2/

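Apache Sqoop [4] - User Guide 3 [ sqoop-import 2 ]

1. Importing only the newest rows

Sqoop can import only the newest rows. For example, it can be used to bring in only the rows added after 2020-10-14.

Argument	Description
`--check-column (col)`	Specifies the column examined to decide which rows to import (CHAR/NCHAR/VARCHAR/VARNCHAR/LONGVARCHAR/LONGNVARCHAR columns cannot be used as the check column)
`--incremental (mode)`	Specifies how the rows to import are determined
`--last-value (value)`	Specifies the maximum value of the check column from the previous import

--incremental accepts one of two modes: append or lastmodified.

[append]

With append, specify --check-column (id in this example) and set --last-value to 500. The import then brings in the rows whose id is greater than 500.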

sqoop import --connect jdbc:mysql://localhost:3306/dbname --table tt1 --username root -P --check-column id --incremental append --last-value 500

[lastmodified]

With lastmodified, the rows whose check column holds a timestamp more recent than the one given by --last-value are imported.

sqoop import --connect jdbc:mysql://localhost:3306/dbname --table tt1 --username root -P --check-column update_date --incremental lastmodified --last-value '2020-08-24 22:04:56.0'
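The row selection done by the append mode can be modeled in a few lines. The sketch below is not Sqoop code: it is a toy model over an in-memory list of id values, with hypothetical class and method names, showing which rows pass the --last-value check and which value would be carried into the next run.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Toy model of append-mode selection; not part of Sqoop, names are illustrative only.
final class IncrementalAppendModel {

    // Keeps the check-column values that are strictly greater than lastValue.
    static List<Long> selectNewIds(List<Long> ids, long lastValue) {
        return ids.stream()
                .filter(id -> id > lastValue)
                .collect(Collectors.toList());
    }

    // Largest imported value, to be used as the next --last-value; unchanged if nothing was new.
    static long nextLastValue(List<Long> ids, long lastValue) {
        return ids.stream()
                .mapToLong(Long::longValue)
                .filter(id -> id > lastValue)
                .max()
                .orElse(lastValue);
    }

    public static void main(String[] args) {
        List<Long> ids = Arrays.asList(499L, 500L, 501L, 750L);
        System.out.println(selectNewIds(ids, 500L));  // prints [501, 750]
        System.out.println(nextLastValue(ids, 500L)); // prints 750
    }
}
```

The row with id 500 is excluded because the comparison is strict, which matches the "greater than 500" behavior described for --last-value 500.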

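2. File formats

In general, the 'delimited text' and 'SequenceFiles' formats are supported.

[ Delimited text ]

Delimited text is the default format and can also be requested explicitly with the --as-textfile argument. By setting delimited through the 'ROW FORMAT' option, the data can be used from HIVE as well.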

1,here is a message,2010-05-01
2,happy new year!,2010-01-01
3,another message,2009-11-12

[ SequenceFiles ]

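SequenceFiles are a binary storage format. They store all data precisely and can be represented as Java classes, which makes them well suited to holding data used by MapReduce programs. Avro data files are another binary option: a compact, efficient format that also interoperates with applications written in other programming languages.

[ Compression ]

Data is not compressed by default. Use the --compress argument to compress it, and --compression-codec to choose a Hadoop compression codec.

3. Handling Large Data

Sqoop can handle large objects (BLOB, CLOB). A BLOB is a data type for large binary data, and a CLOB is a type for large character data.

To handle them, Sqoop avoids loading them fully into memory and processes them in a streaming fashion. Large objects can also be stored inline with the rest of the data, in which case the whole value is available on every access.

The arguments available for formatting the output are listed below.

Argument	Description
`--enclosed-by <char>`	Sets a required field-enclosing character
`--escaped-by <char>`	Sets the escape character
`--fields-terminated-by <char>`	Sets the field separator character
`--lines-terminated-by <char>`	Sets the end-of-line character (newline)
`--mysql-delimiters`	Uses MySQL's default delimiters ( lines: \n, escape: \, optionally enclosed by: ' )
`--optionally-enclosed-by <char>`	Sets an optional field-enclosing character

Supported escape characters: \b, \n, \r, \t, \", \\', \\

Example dataset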

Some string, with a comma.
Another "string with quotes"

Sqoop command

$ sqoop import --fields-terminated-by , --escaped-by \\ --enclosed-by '\"' ...

Result

"Some string, with a comma.","1","2","3"...
"Another \"string with quotes\"","4","5","6"...

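The effect of the enclosing and escaping characters can be reproduced with a short formatter. This is a minimal sketch, not Sqoop's implementation: the class and method names are hypothetical, and escaping the escape character itself is an assumption of the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Toy formatter that mimics required enclosing plus escaping; not Sqoop's own code.
final class DelimitedLineFormatter {

    // Prefixes every enclosing or escape character inside the field with the escape character.
    static String escape(String field, char enclosedBy, char escapedBy) {
        StringBuilder sb = new StringBuilder();
        for (char c : field.toCharArray()) {
            if (c == enclosedBy || c == escapedBy) {
                sb.append(escapedBy);
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Encloses every field and joins the fields with the separator.
    static String format(List<String> fields, char separator, char enclosedBy, char escapedBy) {
        return fields.stream()
                .map(f -> enclosedBy + escape(f, enclosedBy, escapedBy) + enclosedBy)
                .collect(Collectors.joining(String.valueOf(separator)));
    }

    public static void main(String[] args) {
        System.out.println(format(Arrays.asList("Some string, with a comma.", "1", "2", "3"), ',', '"', '\\'));
        System.out.println(format(Arrays.asList("Another \"string with quotes\"", "4", "5", "6"), ',', '"', '\\'));
    }
}
```

Because every field is enclosed, the comma inside the first string stays unambiguous, and the quotes inside the second string are escaped so they are not mistaken for the end of the field.
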
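Monday, October 12, 2020

SQL - Oracle[1] SELECT statement

1. Overview

sql statements are not case-sensitive.
sql statements can be entered on one line or across several lines.
sql statements may optionally end with a semicolon (;). The semicolon is required when running multiple sql statements.

2. SELECT

[ Using arithmetic operators ]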

SELECT last_name, salary, salary + 300

FROM employees;

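Multiplication and division are performed before addition and subtraction
Operators with the same priority are evaluated from left to right
Parentheses are used to override the default precedence or to make a statement clearer

[ Null values ]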

SELECT last_name, job_id, salary, commission_pct

FROM employees;

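Null is a value that is unavailable, unassigned, unknown, or inapplicable.
Null is not the same as 0 or a blank space. 0 is a number and a blank space is a character.
Columns of any data type can contain nulls, but a primary key column cannot.
An arithmetic expression that contains a null value evaluates to null. null*500 = null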
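The null propagation rule can be mimicked outside the database. The following sketch is only an analogy in Java, with hypothetical names: a nullable Integer stands in for a nullable column such as commission_pct, and the helper returns null whenever either operand is null.

```java
// Analogy for SQL null arithmetic; not Oracle behavior reproduced exactly, names are illustrative.
final class SqlNullArithmetic {

    // Returns null when either operand is null, otherwise the product.
    static Integer multiply(Integer a, Integer b) {
        if (a == null || b == null) {
            return null;
        }
        return a * b;
    }

    public static void main(String[] args) {
        System.out.println(multiply(null, 500)); // prints null
        System.out.println(multiply(3, 500));    // prints 1500
    }
}
```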

[ Concatenation operator ]

SELECT last_name || job_id AS "Employees"

FROM employees;

If last_name = kim and job_id is 123, the output is kim123.

[ Using literal character strings ]

SELECT last_name || ' is a ' || job_id AS "Employee Details"

FROM employees;

Output : kim is a 123

[ Duplicate rows ]

SELECT DISTINCT department_id

FROM employees;

[ Displaying the table structure ]

DESCRIBE employees;

Use the DESCRIBE command to display the structure of a table.