Data Analysis Solution

📊

One line — An analysis-driven process management platform that goes beyond monitoring-only SPC: users upload raw data and run statistical analyses (normality, t-test, ANOVA) directly through the UI, powered by a Python statistical engine exposed as APIs.

Period
August 2024 – March 2025
Role
Design & Development of the Data Analysis Engine and UI/UX
Scope
Solution Enhancement  ·  (Dev result: Confidential)

Overview

Traditional SPC (Statistical Process Control) systems focus on monitoring alone. This platform lets users upload raw data and run statistical analyses on it directly. The core methods (normality tests, t-tests, ANOVA) are implemented as APIs, with a UI for visualizing the results; the Python statistical engine, backend server, and web UI are combined into one integrated architecture.

[Figure: Data sources (DB (Oracle…), CSV, JSON) → Preprocess (validation, missing values, transformation) → FastAPI backend (Process Manager, REST analysis API) → Analysis engine (subprocess: normality test, t-test, ANOVA) → Web UI (Vue.js, result visualization). Analysis features are built into the existing ifacts SPC solution.]
Integrated architecture: sources → preprocess → FastAPI (Process Manager) → analysis subprocess → Vue UI

Impact

  • Ensured consistency and automation of analysis results through standardized analysis APIs.
  • Established an automated analysis framework inside the platform without external tools (e.g., Minitab).
  • Enabled proactive response to process anomalies by embedding analysis into the existing ifacts SPC solution.
  • Supported multiple file formats (CSV/JSON) and DB inputs, letting users analyze directly via the UI.

Tech stack

Data Analytics Solution · SPC
Python 3.11 · FastAPI · JavaScript · Vue.js
Oracle · MariaDB · SQLAlchemy
Linux (Ubuntu/CentOS) · AWS · On-premise

Key roles & achievements

⊙ UI & API design for analysis

  • Benchmarked commercial tools (Minitab, JMP) to design an intuitive UI.
  • Implemented a FastAPI-based REST API server to handle analysis requests.
  • Exposed core statistical methods (normality tests, t-tests, ANOVA) as APIs.
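
As a rough illustration of putting several statistical methods behind one request interface, the sketch below dispatches a request to a Welch t statistic or a one-way ANOVA F statistic. It is framework-free and uses only the standard library. The real FastAPI routes are not shown in this document, so the request shape and the names `welch_t`, `one_way_anova_f`, `ANALYSES`, and `handle_request` are assumptions made for illustration, and p-values are left out.

```python
from math import sqrt
from statistics import fmean, variance


def welch_t(a, b):
    """Welch's two-sample t statistic and its approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (fmean(a) - fmean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return {"statistic": t, "df": df}


def one_way_anova_f(*groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    f = (ss_between / (k - 1)) / (ss_within / (n - k))
    return {"statistic": f, "df_between": k - 1, "df_within": n - k}


# Hypothetical registry: one entry per analysis the API exposes.
ANALYSES = {"t-test": welch_t, "anova": one_way_anova_f}


def handle_request(request):
    """Validate a request dict and run the named analysis on its groups."""
    method = request.get("method")
    groups = request.get("groups", [])
    if method not in ANALYSES:
        return {"status": "error", "detail": f"unknown method: {method!r}"}
    if len(groups) < 2 or any(len(g) < 2 for g in groups):
        return {"status": "error", "detail": "need at least 2 groups of 2+ values"}
    if method == "t-test" and len(groups) != 2:
        return {"status": "error", "detail": "t-test takes exactly 2 groups"}
    return {"status": "ok", "method": method, "result": ANALYSES[method](*groups)}
```

Keeping the methods in a registry means every analysis returns the same envelope (`status`, `method`, `result`), which is one way to get the consistency that standardized analysis APIs aim for.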

⊙ Stable analysis-engine architecture

  • Built a Process Manager to work around the Python GIL, combining per-user state management with multi-core utilization.
  • Ran each analysis request in a dedicated subprocess for stable parallelism.
  • Developed inter-process health-check functionality.
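
The subprocess and health-check ideas above can be sketched with the standard library alone. In this minimal example a parent process starts one child interpreter, sends it JSON requests over a pipe, and pings it to check that it is still responsive. The line-delimited JSON protocol, the `ping` and `describe` operations, and the `AnalysisWorker` class are all hypothetical; the actual Process Manager protocol is not described in this document.

```python
import json
import subprocess
import sys

# Code run by the child interpreter: one JSON request per line on stdin,
# one JSON reply per line on stdout.
WORKER_SOURCE = r"""
import json, statistics, sys
for line in sys.stdin:
    req = json.loads(line)
    if req["op"] == "ping":
        reply = {"ok": True, "pong": True}
    elif req["op"] == "describe":
        data = req["data"]
        reply = {"ok": True, "mean": statistics.fmean(data),
                 "stdev": statistics.stdev(data)}
    else:
        reply = {"ok": False, "error": "unknown op"}
    print(json.dumps(reply), flush=True)
"""


class AnalysisWorker:
    """Hypothetical wrapper around one dedicated analysis subprocess."""

    def __init__(self):
        self.proc = subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
        )

    def request(self, message):
        """Send one request and block until the worker replies."""
        self.proc.stdin.write(json.dumps(message) + "\n")
        self.proc.stdin.flush()
        return json.loads(self.proc.stdout.readline())

    def is_healthy(self):
        """Health check: the process is running and answers a ping."""
        if self.proc.poll() is not None:
            return False
        try:
            return self.request({"op": "ping"}).get("pong") is True
        except (OSError, ValueError):
            return False

    def close(self):
        """Close stdin so the worker sees EOF, then wait for it to exit."""
        if self.proc.poll() is None:
            self.proc.stdin.close()
            self.proc.wait(timeout=5)
        self.proc.stdout.close()
```

Because each worker is a separate interpreter with its own GIL, CPU-bound analyses in different workers can run on different cores, and a crashed worker is detected by `is_healthy()` instead of taking the server down.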

⊙ Flexible data ingestion & preprocessing

  • Parsers for multiple input formats (DB, JSON, CSV).
  • Automated preprocessing — validation, missing-value handling, transformation.
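
A minimal sketch of the three preprocessing steps (validation, missing-value handling, transformation) for one numeric CSV column follows. The missing-value markers, the choice to drop rather than impute missing rows, the optional log transform, and the name `preprocess_column` are assumptions for illustration; the document does not state which rules the real pipeline applies.

```python
import csv
import io
import math

MISSING_TOKENS = {"", "na", "nan", "null"}  # assumed markers for missing values


def preprocess_column(csv_text, column, min_rows=3, log_transform=False):
    """Return (values, report) for one numeric column of a CSV document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"column not found: {column!r}")
    values, missing, invalid = [], 0, 0
    for row in reader:
        raw = (row[column] or "").strip()
        if raw.lower() in MISSING_TOKENS:
            missing += 1  # missing-value handling: drop the row (assumption)
            continue
        try:
            value = float(raw)
        except ValueError:
            invalid += 1  # validation: non-numeric cells are rejected
            continue
        if not math.isfinite(value) or (log_transform and value <= 0):
            invalid += 1  # validation: infinities, or non-positive before log
            continue
        # transformation: optional natural log
        values.append(math.log(value) if log_transform else value)
    if len(values) < min_rows:
        raise ValueError(f"only {len(values)} usable rows, need {min_rows}")
    return values, {"kept": len(values), "missing": missing, "invalid": invalid}
```

Returning a small report next to the cleaned values lets the UI tell the user how many rows were dropped and why before any test is run.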

Troubleshooting 1 — Python GIL & multi-core

Problem: Python’s GIL (Global Interpreter Lock) prevents threads from running CPU-bound work in parallel, so a multi-process design is required. When many users run analyses at the same time, CPU contention becomes a bottleneck. In a multi-process setup, however, it is structurally hard to keep per-user state: Uvicorn simply adds worker processes and spreads requests across them with no session affinity, so a data upload and the analysis request that follows may land on different processes, and the user’s state is lost between them. Unlike a Java server, where threads in one JVM share memory, Python worker processes do not share in-memory session state. In addition, solution-team policy did not allow external state stores (e.g., Redis), so state handling had to be built into the solution itself.

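[Figure: Users (User A, User B, User …) → Process Manager (gateway, per-user state, import + analysis = 1 unit, health check) → one dedicated subprocess per core (subprocess · CPU core 1, subprocess · CPU core 2, subprocess · CPU core N). State is kept without an external store (Redis), and the GIL bottleneck is avoided.]
The Process Manager acts like a gateway that directly manages user state and process connections → multi-core utilization without an external store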

Solution: Implemented a Process Manager that centrally controls the worker processes and manages per-user state. Acting like a gateway, it handles state delivery and connections between processes, keeping a stateful context without external storage. Data import and analysis are managed as a single unit in the DB, keyed by user ID, which preserves session continuity. The result is a scalable, reliable structure for distributing computation that avoids the GIL bottleneck while using multiple cores.
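
To make the "import + analysis as one unit keyed by user ID" idea concrete, here is a minimal sketch that uses the standard-library `sqlite3` module as a stand-in for the project's real database (Oracle/MariaDB via SQLAlchemy). The table name `analysis_unit`, its columns, and the functions `open_store`, `import_data`, and `run_analysis` are hypothetical. An in-memory SQLite database is visible to one connection only, so real workers in separate processes would need a file path or a database server.

```python
import json
import sqlite3
from statistics import fmean


def open_store(path=":memory:"):
    """Create the table holding one import + analysis unit per user."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS analysis_unit ("
        "user_id TEXT PRIMARY KEY, data TEXT NOT NULL, result TEXT)"
    )
    return conn


def import_data(conn, user_id, values):
    """Store the uploaded data; a new upload resets any earlier result."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO analysis_unit (user_id, data, result) "
            "VALUES (?, ?, NULL)",
            (user_id, json.dumps(values)),
        )


def run_analysis(conn, user_id):
    """Any worker can serve this call: the state comes from the DB, not memory."""
    row = conn.execute(
        "SELECT data FROM analysis_unit WHERE user_id = ?", (user_id,)
    ).fetchone()
    if row is None:
        raise LookupError(f"no imported data for user {user_id!r}")
    values = json.loads(row[0])
    result = {"n": len(values), "mean": fmean(values)}
    with conn:
        conn.execute(
            "UPDATE analysis_unit SET result = ? WHERE user_id = ?",
            (json.dumps(result), user_id),
        )
    return result
```

Since the upload and the analysis read and write the same row, it no longer matters which process receives each request.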

Troubleshooting 2 — Visualization CPU overload

Problem: Rendering 100,000+ points in the browser (e.g., a scatter plot of every point) overwhelmed it. Transmitting datasets of that size risked heavy server load, and client PCs were overloaded as well. Visualization became the biggest performance challenge.

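[Figure: 100,000+ points → overload; random sample → same distribution; ECDF → CDF (Glivenko–Cantelli): almost no difference in distribution.]
Random sampling instead of plotting the full set (100k): the goal is to read trends, and the sample distribution is nearly identical, so visualization stays efficient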

Solution (theoretical, domain-specific): Commercial tools (Minitab, JMP) are Windows executables that process data locally or connect directly to a DB, so they are not bound by browser limits. This made it clear that the problem was a system-level limitation. Since a full system-level fix was infeasible, I applied a statistical-theoretical approach: by the Glivenko–Cantelli theorem, the empirical CDF (ECDF) of an i.i.d. sample converges uniformly to the true CDF almost surely as the sample size grows. Because the visualization aims to show overall trends (not individual outliers), random sampling was adopted; the sampled datasets showed no significant distributional difference from the full data, enabling efficient visualization.
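
The sampling argument can be checked numerically. The sketch below draws a uniform random sample from a synthetic 100,000-point dataset and measures the largest gap between the sample ECDF and the full-data ECDF. The synthetic normal data, the sample size of 2,000, and the names `downsample` and `ecdf_distance` are assumptions for illustration; the sample size and data used in the project are not stated in this document.

```python
import random
from bisect import bisect_right


def downsample(points, k, seed=None):
    """Uniform random sample of k points without replacement."""
    if k >= len(points):
        return list(points)
    return random.Random(seed).sample(points, k)


def ecdf_distance(sample, population):
    """Largest gap between the ECDFs of `sample` and `population`."""
    s, p = sorted(sample), sorted(population)
    # Both ECDFs are right-continuous step functions that jump only at data
    # points, so checking every data point finds the maximum gap.
    return max(
        abs(bisect_right(s, x) / len(s) - bisect_right(p, x) / len(p))
        for x in s + p
    )


# Synthetic stand-in for a large measurement set.
rng = random.Random(0)
population = [rng.gauss(50.0, 5.0) for _ in range(100_000)]
sample = downsample(population, 2_000, seed=1)
gap = ecdf_distance(sample, population)
print(f"points sent to the browser: {len(sample)}, max ECDF gap: {gap:.4f}")
```

A small gap means the sampled scatter plot or histogram shows essentially the same distribution as the full data while sending only a fraction of the points to the browser. Rare outliers may be missing from the sample, which is acceptable only because the goal is to show trends.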