"Big Data Analysis Project" - Resolving Illegal Parking in Resident-Priority Parking Zones

2022. 11. 1. 11:11 · Big Data Analysis Project

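Topic: Selecting guidance patrol sections to resolve illegal parking in resident-priority parking zones

Overview: Promote stable daily life for residents by selecting efficient guidance patrol routes, so that resident-priority parking zones are used as intended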

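▶ Data analysis steps

Define the analysis process
Collect data for Jung-gu, Ulsan Metropolitan City
Clean the collected data
Extract the key variables from the cleaned data
Run a cluster analysis on the key variables
Select the optimal cluster
Create a QGIS grid for the selected administrative districts
Map the ranked sections with Folium
Select the final guidance patrol sections and routes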

▶ Data analysis process

▶ Data collection (mainly agency data and public data, with private data collected where needed)

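Category	Data used	Period	Provider
Status data	Resident-priority parking information, Ulsan Jung-gu Urban Management Corporation	2021	Ulsan Jung-gu Urban Management Corporation
	Illegal parking enforcement cases, Ulsan Jung-gu Urban Management Corporation (by year)	2022	Ulsan Jung-gu Urban Management Corporation
	Illegal parking enforcement cases, Ulsan Jung-gu Urban Management Corporation (by month)	2022	Ulsan Jung-gu Urban Management Corporation
	Illegal parking enforcement cases, Ulsan Jung-gu Urban Management Corporation (by time of day)	2022	Ulsan Jung-gu Urban Management Corporation
	Registered vehicles in Ulsan Metropolitan City (2017~2022)	2022	KOSIS (Korean Statistical Information Service)
	Registered vehicles in Jung-gu, Ulsan by eup/myeon/dong	2022	Ulsan Metropolitan City Vehicle Registration Office
Location data	Number of public parking lots and parking spaces in Jung-gu, Ulsan	2022	Ulsan Jung-gu Urban Management Corporation
	Housing status in Jung-gu, Ulsan (.shp)	2022	National Spatial Data Infrastructure Portal (국가공간정보포털)
	Restaurants in Jung-gu, Ulsan	2022	Local Administration Licensing Data Open (지방행정인허가데이터개방)
Population characteristics	Households and population in Jung-gu, Ulsan by eup/myeon/dong	2022	KOSIS (Korean Statistical Information Service)
	Working-age population per grid cell in Jung-gu, Ulsan (.shp)	2022	National Geographic Information Platform (국토정보플랫폼)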

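▶ Data cleaning

● Data cleaning process

Geocoding
Extract the needed columns and delete the rest
Convert data to real (float) numbers
Handle duplicate and missing values
Merge data (counts, sums)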
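
The cleaning steps listed above (apart from geocoding, which depends on an external service and is left out) could look roughly like the sketch below. The table, its column names (`주차장명`, `주차면수`, `비고`) and all values are made up for illustration and are not the project's actual data.

```python
import pandas as pd

# Hypothetical raw rows; the column names and values are made up for illustration.
raw = pd.DataFrame({
    "주차장명": ["P1", "P1", "P2", "P3", None],
    "소속동": ["우정동", "우정동", "복산2동", "우정동", "복산2동"],
    "주차면수": ["1,200", "1,200", "35", "unknown", "10"],
    "비고": ["", "", "", "", ""],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps in order and return a new DataFrame."""
    # 1. Keep only the columns needed for the analysis.
    out = df[["주차장명", "소속동", "주차면수"]].copy()
    # 2. Convert text to real numbers; unparseable values become NaN.
    out["주차면수"] = pd.to_numeric(
        out["주차면수"].str.replace(",", "", regex=False), errors="coerce"
    )
    # 3. Drop exact duplicates and rows without a name, then fill numeric gaps with 0.
    out = out.drop_duplicates().dropna(subset=["주차장명"])
    out["주차면수"] = out["주차면수"].fillna(0.0)
    return out.reset_index(drop=True)


cleaned = clean(raw)
# 4. Merge by aggregation: number of lots and total spaces per dong.
by_dong = cleaned.groupby("소속동").agg(
    주차장수=("주차장명", "count"), 주차면수합=("주차면수", "sum")
).reset_index()
print(by_dong)
```

Filling unparseable numbers with 0 is only one option; dropping those rows or checking them by hand may be safer when the table is small.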

● Example code for part of the data cleaning

- Data cleaning (grouping and combining counts)

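# Required libraries
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("./단속처리(정제).csv", encoding = 'cp949', engine = 'python')
df

# Count enforcement cases per location (단속장소)
# size() counts the rows in each group; the count is named 단속건수
# so that it matches the column used in the merge step
단속위치_groupby = df.groupby('단속장소').size().rename('단속건수')
단속위치_groupby

# Convert the grouped counts to a DataFrame
df1 = pd.DataFrame(단속위치_groupby)
df1

# Reset the index so that 단속장소 becomes a regular column
단속장소별_건수 = df1.reset_index()
단속장소별_건수

# Find the location with the most enforcement cases
단속장소df = 단속장소별_건수.loc[단속장소별_건수["단속건수"].idxmax()]
단속장소df

# Save the counts per location
단속장소별_건수.to_csv("단속장소별_건수.csv", index = False, encoding = 'cp949')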

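- Merging data

# Required libraries
import pandas as pd
import numpy as np

# Load the resident-priority parking zone locations
df = pd.read_csv("./거주자우선주차구역(위치).csv", encoding = 'cp949', engine = 'python')
df

# Load the enforcement counts per location
df1 = pd.read_csv("./단속장소별_건수.csv", encoding = 'cp949', engine = 'python')
df1

# Merge on 구획; the left join keeps every parking zone
# (both files must contain a 구획 column for this to work)
전처리_주차구역 = pd.merge(df, df1, on = '구획', how = 'left')
전처리_주차구역

# Handle missing values: zones without an enforcement record get 0
전처리_주차구역.fillna(0, inplace = True)
전처리_주차구역

# Change the data type of the count column to integer
전처리_주차구역 = 전처리_주차구역.astype({'단속건수' : 'int'})
전처리_주차구역

# Save the merged data
전처리_주차구역.to_csv("주차구역_결합.csv", index = False, encoding = 'cp949')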
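
Because `fillna(0)` hides every row that found no match, it can help to count the unmatched rows before filling them. The sketch below uses the `indicator` option of `merge` on two small made-up tables; the rows are illustrative and only the column names `구획`, `소속동` and `단속건수` come from the code above.

```python
import pandas as pd

# Made-up stand-ins for the zone table and the enforcement-count table.
zones = pd.DataFrame({
    "구획": ["A-1", "A-2", "B-1"],
    "소속동": ["우정동", "우정동", "복산2동"],
})
counts = pd.DataFrame({"구획": ["A-1", "B-1"], "단속건수": [5, 2]})

# indicator=True adds a _merge column that says where each row came from.
merged = zones.merge(counts, on="구획", how="left", indicator=True)

# Zones that had no matching enforcement record.
unmatched = int((merged["_merge"] == "left_only").sum())
print(f"zones without enforcement records: {unmatched}")

# Fill only the count column, leaving any other missing values visible.
merged["단속건수"] = merged["단속건수"].fillna(0).astype(int)
merged = merged.drop(columns="_merge")
print(merged)
```

A large number of unmatched zones would suggest that the key is spelled differently in the two files rather than that those zones truly had no enforcement.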

▶ Variable extraction (correlation analysis and feature importance)

● Correlation analysis (by dong)

import pandas as pd
import numpy as np
from matplotlib import pyplot
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
import seaborn as sns

mpl.rcParams['axes.unicode_minus'] = False
mpl.rc("font", family = "Malgun Gothic")

df = pd.read_csv("./동별 전처리 데이터 개수결합.csv", engine = 'python', encoding = 'cp949')
df

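# Pearson correlation between the numeric columns
# (non-numeric columns such as 소속동 are excluded before calling corr())
df_corr = df.select_dtypes(include='number').corr(method='pearson')
df_corr

sns.heatmap(df_corr, annot=True, cmap = 'Reds')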
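
To read the heatmap as a ranking, the correlations with `단속건수` can be pulled out and sorted by absolute value. This is a small sketch on made-up numbers, not the project's data; only the column names are taken from the code above.

```python
import pandas as pd

# Made-up values; only the column names follow the project's dataset.
df_demo = pd.DataFrame({
    "소속동": ["a", "b", "c", "d", "e"],
    "단속건수": [10, 20, 30, 40, 50],
    "구획개수": [1, 2, 3, 4, 5],
    "음식점개수": [5, 3, 4, 1, 2],
})

# Correlation matrix over the numeric columns only.
corr = df_demo.select_dtypes(include="number").corr(method="pearson")

# Correlation of each variable with the target, strongest first (by magnitude).
ranked = corr["단속건수"].drop("단속건수").sort_values(key=abs, ascending=False)
print(ranked)
```

Sorting by magnitude keeps strongly negative variables near the top, which a plain descending sort would push to the bottom.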

● Feature importance (by dong)

# Split into training and test sets at an 8:2 ratio
import os
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from matplotlib import pyplot
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
mpl.rcParams['axes.unicode_minus'] = False
mpl.rc("font", family = "Malgun Gothic")

df = pd.read_csv("./동별 전처리 데이터 개수결합.csv", engine = 'python', encoding = 'cp949')
df.head()

train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

senti_prepared = train_set[["구획개수","건물개수","주차면수합","인구수","자동차대수","음식점개수"]]
senti_label = train_set["단속건수"]

test_set_X = test_set[["구획개수","건물개수","주차면수합","인구수","자동차대수","음식점개수"]]
test_set_y = test_set["단속건수"]

from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
forest_reg.fit(senti_prepared, senti_label)

# Actual values for the first 100 test rows, with a fresh 0-based index for plotting
answer = test_set_y[:100].reset_index(drop=True)

plt.plot(answer, label="answer")
plt.plot(forest_reg.predict(test_set_X[:100]), label="predict")
plt.legend()

from sklearn.metrics import mean_squared_error

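# RMSE computed on the training set, which the model has already seen;
# it therefore understates the error to expect on new data
senti_predictions = forest_reg.predict(senti_prepared)
forest_mse = mean_squared_error(senti_label, senti_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

# Output: 62.536199756620974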

def plot_feature_importance(model):
    n_features = senti_prepared.shape[1]
    # Sort the importances and reorder the feature names with the same index,
    # so that each bar keeps its own label
    order = np.argsort(model.feature_importances_)
    plt.barh(np.arange(n_features), model.feature_importances_[order], align="center")
    plt.yticks(np.arange(n_features), senti_prepared.columns[order])
    plt.xlabel("Random Forest Feature Importance")
    plt.ylabel("Feature")
    plt.ylim(-1, n_features)

plot_feature_importance(forest_reg)
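
The RMSE reported above is measured on the training rows, so comparing it with the held-out rows gives a fairer picture. The sketch below does this on synthetic data generated for the example; the relationship between the columns is an assumption, not a finding of the project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on 구획개수 only (an assumption for the demo).
rng = np.random.default_rng(42)
X = pd.DataFrame(
    rng.uniform(0, 100, size=(80, 3)),
    columns=["구획개수", "인구수", "음식점개수"],
)
y = 3.0 * X["구획개수"] + rng.normal(0, 5, size=80)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)


def rmse(y_true, y_pred) -> float:
    """Root mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


train_rmse = rmse(y_train, model.predict(X_train))
test_rmse = rmse(y_test, model.predict(X_test))
print(f"train RMSE: {train_rmse:.2f}, test RMSE: {test_rmse:.2f}")

# Importances paired with their feature names, largest first.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

With only one row per dong, the test split in the real dataset holds very few rows, so cross-validation may be a more stable choice than a single 8:2 split.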

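▶ Cluster analysis (K-means)

● Non-hierarchical cluster analysis (K-means)

=> The 6 variables are reduced to 2 dimensions for the cluster analysis, and the number of clusters is fixed at K = 2. In the code below, K-means is fit on the standardized variables and PCA provides the 2-dimensional projection for the plot.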
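
The choice of K could be cross-checked with the silhouette score in addition to the elbow plot. The sketch below runs on two synthetic, well-separated groups generated for the example, so it only illustrates the procedure and says nothing about the project's own data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Two synthetic groups of points (made up for the demo).
rng = np.random.default_rng(42)
data = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=10.0, scale=0.3, size=(20, 2)),
])
scaled = StandardScaler().fit_transform(data)

# Silhouette score for each candidate K (it is undefined for K = 1).
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(scaled)
    scores[k] = silhouette_score(scaled, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"K={k}: silhouette={s:.3f}")
print("best K:", best_k)
```

The silhouette score needs at least 2 clusters and fewer clusters than samples, so with a dong-level table the range of K has to stay small.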

import pandas as pd
import numpy as np

np.random.seed(42)

from matplotlib import pyplot
%matplotlib inline

import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm

mpl.rcParams['axes.unicode_minus'] = False
mpl.rc("font", family = "Malgun Gothic")

df = pd.read_csv("./동별 전처리 데이터 개수결합.csv", encoding = "euc-kr")
df

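# .copy() avoids a SettingWithCopyWarning when cluster labels are added to df_raw later
df_raw = df[['소속동', '단속건수', '구획개수', '건물개수', '자동차대수', '주차면수합', '인구수', '음식점개수']].copy()
df_raw

# Drop the dong name and keep the numeric variables (단속건수 is still included here)
x = df_raw.drop(["소속동"], axis=1)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

df_x = scaler.fit_transform(x) # standardized variables used for clustering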

from sklearn.cluster import KMeans

cluster_range = [i+1 for i in range(5)]
clus_error = []

for i in cluster_range:
    clus = KMeans(i)
    clus.fit(df_x)
    clus_error.append(clus.inertia_)
    
ds_error = pd.DataFrame({"NumberofCluster":cluster_range, "Error":clus_error})
ds_error

plt.figure(figsize = (10,5))
plt.plot(ds_error["NumberofCluster"], ds_error["Error"])
plt.title("Sum of squared distance")
plt.xlabel("Clusters")
plt.ylabel("Sum of squared distance")

np.random.seed(42)

kmeans = KMeans(n_clusters = 2).fit(df_x)
cluster_kmeans = [i+1 for i in kmeans.labels_]
df_raw["ClusterKmeans"] = cluster_kmeans
df_raw

print(cluster_kmeans)

kmeans.cluster_centers_

kmeans.labels_

df_raw["cluster"] = kmeans.labels_
df_raw.head()

from sklearn.decomposition import PCA
import plotly.express as px

#remember to scale your data if the ranges are too broad

# scaled_features= scaler.fit_transform(df) == df_x
# kmeans_model= KMeans(n_clusters=3, max_iter=500, random_state=42)

kmeans_model = KMeans(n_clusters = 2, random_state = 42)

y_km= kmeans_model.fit_predict(df_x)

pca_model= PCA(n_components=2, random_state=42)
transformed= pca_model.fit_transform(df_x)
centers= pca_model.transform(kmeans_model.cluster_centers_)

fig = px.scatter(x = transformed[:, 0], y = transformed[:, 1], color = y_km, text=df_raw["소속동"], title = "K-Means")
fig.add_scatter(
    x=centers[:, 0],
    y=centers[:, 1],
    marker=dict(size=10, color="red"), name="Centers",
)
fig.show()
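
How faithful the 2-dimensional plot is can be judged from the share of variance the two principal components retain. The sketch below shows the check on synthetic data; the six columns are random mixes of two hidden factors, which is an assumption made so the example has some structure.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic table: six columns built from two hidden factors plus small noise.
rng = np.random.default_rng(42)
base = rng.normal(size=(30, 2))
mixing = rng.normal(size=(2, 6))
features = base @ mixing + rng.normal(scale=0.1, size=(30, 6))

scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=2, random_state=42).fit(scaled)

# Share of the total variance carried by each component, and by both together.
ratio = pca.explained_variance_ratio_
retained = float(ratio.sum())
print("per component:", np.round(ratio, 3))
print(f"retained by the 2-D projection: {retained:.1%}")
```

If the retained share is low, points that look close in the scatter plot may still be far apart in the original variable space.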

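* In addition to the result above, 복산2동 and 우정동 were also included (a subjective decision that gives extra weight to how frequently enforcement cases occurred)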

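▶ Creating a QGIS grid for the selected administrative districts

● Grid creation by dong and selection of the 1st to 4th priority sections from the visualization results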
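
The grid itself was built in QGIS. As a rough stand-in for the same idea, the sketch below bins points into square cells with NumPy and ranks the cells by enforcement count. The coordinates, counts, grid origin and cell size are all hypothetical values chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical enforcement points (longitude, latitude, number of cases).
points = pd.DataFrame({
    "lon": [129.3012, 129.3031, 129.3074, 129.3121, 129.3128, 129.3189, 129.3033],
    "lat": [35.5521, 35.5538, 35.5512, 35.5566, 35.5571, 35.5529, 35.5544],
    "단속건수": [4, 6, 3, 9, 2, 1, 5],
})

# Hypothetical grid origin and cell size, both in degrees.
LON0, LAT0, CELL = 129.30, 35.55, 0.005

# Column and row index of the cell each point falls into.
points["gx"] = np.floor((points["lon"] - LON0) / CELL).astype(int)
points["gy"] = np.floor((points["lat"] - LAT0) / CELL).astype(int)

# Total cases per cell, then the four busiest cells ranked 1 to 4.
grid = points.groupby(["gx", "gy"], as_index=False)["단속건수"].sum()
top4 = grid.sort_values("단속건수", ascending=False).head(4).reset_index(drop=True)
top4["순위"] = top4.index + 1
print(top4)
```

Cells measured in degrees are not square on the ground, so a projected coordinate system in meters would be the more careful choice for real distances.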

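▶ Folium map by priority rank

● Folium map visualization of the 1st to 4th priority sections

▶ Final selection of guidance patrol sections

● Visualization of the 1st to 4th priority guidance patrol sections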

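▶ Because of insufficient data and low practical feasibility, the reliability of this analysis is not high.

However, if enforcement data is analyzed and organized by the number of cases, action can be taken before complaints arise,

which would improve the quality of life of Jung-gu residents and lower the rate of complaints.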
