About Me
Name: Shinwoo Park
Position: Ph.D. Candidate in Artificial Intelligence
Affiliation: Yonsei University, Seoul, South Korea
Expected Graduation: February 2026
Research Interests
My research focuses on ensuring the safety, transparency, and accountability of large language models (LLMs) through detection and watermarking techniques. I develop multilingual and multimodal systems that identify or trace LLM-generated content across natural language and source code domains.
Specifically, I explore two complementary directions: (1) linguistic and stylistic feature–based detection, which analyzes morphological, syntactic, and stylistic patterns to distinguish human- and LLM-generated text or code; and (2) LLM watermarking, which embeds imperceptible yet verifiable statistical or structural signals into generated outputs.
My recent work includes KatFishNet, the first linguistic feature–based detector for Korean text; LPcodedec, a coding-style-driven detector for paraphrased code; STELA, a syntactic-predictability watermark enabling model-free detection; and WaterMod, a probability-balanced modular watermarking framework supporting multi-bit payloads.
Broadly, my goal is to build trustworthy generative systems that are interpretable, regulation-compliant, and resistant to misuse.
Research Summary
My research aims to promote responsible and verifiable AI generation by developing reliable methods for detecting and attributing LLM-generated text and code. I pursue two main directions that reinforce each other:
- Linguistic/Stylistic Feature-based Detection: Models such as KatFishNet and LPcodedec analyze linguistic or coding-style cues—word spacing, part-of-speech diversity, punctuation patterns, naming and indentation consistency—to capture distributional differences between human and LLM authors.
- LLM Watermarking: Frameworks such as STELA and WaterMod embed imperceptible signals during generation using linguistically or probabilistically adaptive mechanisms, enabling publicly verifiable and multi-bit attribution without harming fluency.
These systems demonstrate strong multilingual (English, Korean) and multimodal (text + code) generalization, advancing interpretable and regulation-aligned AI provenance research.
Research Statement
My long-term research vision is to establish a unified framework for provenance-aware and interpretable AI that spans both language and programming modalities. To achieve this, I combine linguistic insight, statistical modeling, and watermark design to construct transparent interfaces between human communication and generative models.
Linguistic / Stylistic Feature-based Detection:
My work on KatFishNet introduces the first benchmark and detector for LLM-generated Korean text,
leveraging word-spacing irregularities, POS n-gram diversity, and comma usage to expose cross-morphological differences between human and machine writing.
Extending this idea to source code, LPcodedec identifies LLM-paraphrased code by quantifying coding-style features such as naming consistency, indentation regularity, and comment ratio.
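The feature families above can be illustrated with a toy extractor. This is a minimal sketch with made-up feature names, not the actual KatFishNet or LPcodedec pipeline:

```python
def stylistic_features(text: str) -> dict:
    """Toy stylistic profile: comma rate, average sentence length,
    and type-token ratio (all names and choices are illustrative)."""
    tokens = text.split()
    # Crude sentence split on terminal punctuation.
    sentences = [s for s in text.replace("?", ".").replace("!", ".").split(".")
                 if s.strip()]
    n_tokens = len(tokens)
    return {
        "comma_rate": text.count(",") / max(n_tokens, 1),
        "avg_sentence_length": n_tokens / max(len(sentences), 1),
        "type_token_ratio": len({t.lower() for t in tokens}) / max(n_tokens, 1),
    }
```

In practice such features feed a lightweight classifier; the point is that the cues are interpretable, so a detection decision can be traced back to concrete stylistic evidence.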
LLM Watermarking:
My research advances from distribution-based watermarking to linguistically adaptive and probability-balanced methods.
STELA modulates watermark strength according to syntactic predictability modeled by POS n-gram entropy, enabling model-free public detection.
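The entropy-modulated strength idea can be sketched as follows; the scaling rule and function names here are my own simplification for illustration, not STELA's actual formulation:

```python
from collections import Counter
import math

def ngram_entropy(tags, n=2):
    """Shannon entropy (bits) of the POS n-gram distribution."""
    grams = [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def watermark_strength(tags, base=2.0, n=2):
    """Scale watermark bias up in high-entropy (less predictable)
    syntactic contexts and down in low-entropy ones (illustrative only)."""
    h = ngram_entropy(tags, n)
    max_h = math.log2(max(len(tags) - n + 1, 2))
    return base * (h / max_h if max_h > 0 else 0.0)
```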
WaterMod generalizes this concept through modular token-rank partitioning that guarantees at least one high-probability token per class,
supporting zero-bit and multi-bit watermarking with minimal quality loss.
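Modular token-rank partitioning can be illustrated with a minimal sketch. This is an assumed simplification, not the WaterMod implementation:

```python
import numpy as np

def modular_watermark_logits(logits, payload_digit, modulus=2, delta=2.0):
    """Partition tokens by probability rank mod `modulus` and bias the
    class matching the payload digit. Ranks 0..modulus-1 fall in distinct
    classes, so every class contains at least one high-probability token."""
    order = np.argsort(-logits)            # token ids, most probable first
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(logits))  # rank of each token id
    biased = logits.copy()
    biased[ranks % modulus == payload_digit] += delta
    return biased
```

Because the biased class always includes a top-ranked token, sampling quality degrades gracefully even when the watermark signal is strong.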
Together, these studies form a coherent agenda for trustworthy and interpretable generative AI, bridging linguistic analysis and information-theoretic watermark design to meet emerging transparency and safety requirements.
Projects
- AI for Issue-Fact Mapping (2021–2022):
Knowledge graph entity retrieval from unstructured text using topic modeling.
- Medical Text Mining (2022–2025):
Clinical insight extraction from medical records using topic modeling.
- Human-AI Programming Lab (2023–2025):
Code search, QA, and time complexity prediction for collaborative coding systems.
- Research on Effective Watermarking Techniques for AI-generated Codes (2025–):
Watermarking technique that intervenes in the code generation process of LLMs, embedding watermarks while preserving the code's quality and functionality.
Professional Services
- Reviewer, ACL Rolling Review (2023–)
Skills
- Programming: Python, Java, C, C++, Bash
- ML Frameworks: PyTorch, scikit-learn, Hugging Face Transformers
- NLP: spaCy, NLTK, KoNLPy, kiwipiepy
- LLM APIs: OpenAI API, Gemini API, Ollama
- Data Analysis: Pandas, NumPy, SciPy
- Visualization: Matplotlib, Seaborn
- Version Control: Git
- Writing: LaTeX
- Languages: Korean (native), English (fluent)