CV
Name: Shinwoo Park
Position: Ph.D. Candidate in Artificial Intelligence
Affiliation: Yonsei University, Seoul, South Korea
Expected Graduation: February 2026
Research Interests
My research focuses on ensuring the safety, transparency, and accountability of large language models (LLMs) through detection and watermarking techniques. I develop multi-modal and multi-lingual systems that identify or trace LLM-generated content across natural language and source code domains.
Specifically, I explore two complementary directions: (1) linguistic and stylistic feature–based detection, which analyzes morphological, syntactic, and stylistic patterns to distinguish human- and LLM-generated text or code; and (2) LLM watermarking, which embeds imperceptible yet verifiable statistical or structural signals into generated outputs.
My recent works include KatFishNet, the first linguistic feature–based detector for Korean text; LPcodedec, a coding-style-driven detector for paraphrased code; STELA, a syntactic-predictability watermark enabling model-free detection; and WaterMod, a probability-balanced modular watermarking framework supporting multi-bit payloads.
Broadly, my goal is to build trustworthy generative systems that are interpretable, regulation-compliant, and resistant to misuse.
Research Summary
My research aims to promote responsible and verifiable AI generation by developing reliable methods for detecting and attributing LLM-generated text and code. I pursue two main directions that mutually reinforce each other:
- Linguistic/Stylistic Feature-based Detection: Models such as KatFishNet and LPcodedec analyze linguistic or coding-style cues—word spacing, part-of-speech diversity, punctuation patterns, naming and indentation consistency—to capture distributional differences between human and LLM authors.
- LLM Watermarking: Frameworks such as STELA and WaterMod embed imperceptible signals during generation using linguistically or probabilistically adaptive mechanisms, enabling publicly verifiable and multi-bit attribution without harming fluency.
These systems demonstrate strong multilingual (English, Korean) and multimodal (text + code) generalization, advancing interpretable and regulation-aligned AI provenance research.
Research Statement
My long-term research vision is to establish a unified framework for provenance-aware and interpretable AI that spans both language and programming modalities. To achieve this, I combine linguistic insight, statistical modeling, and watermark design to construct transparent interfaces between human communication and generative models.
Linguistic / Stylistic Feature-based Detection:
My work on KatFishNet introduces the first benchmark and detector for LLM-generated Korean text, leveraging word-spacing irregularities, POS n-gram diversity, and comma usage to expose morphology-level differences between human and machine writing. Extending this idea to source code, LPcodedec identifies LLM-paraphrased code by quantifying coding-style features such as naming consistency, indentation regularity, and comment ratio.
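To make the feature-based approach concrete, the sketch below computes a few coding-style cues of the kind LPcodedec relies on (comment ratio, indentation regularity, naming consistency). It is a minimal illustration under simplifying heuristics of my own, not the released LPcodedec code; the function name and feature definitions are assumptions for exposition.

    import re

    def coding_style_features(source: str) -> dict:
        """Illustrative coding-style cues computed from raw Python source."""
        code_lines = [l for l in source.splitlines() if l.strip()]

        # Comment ratio: fraction of non-empty lines that are comments.
        comment_ratio = sum(l.lstrip().startswith("#") for l in code_lines) / max(len(code_lines), 1)

        # Indentation regularity: share of indented lines whose indent is a multiple of 4 spaces.
        indents = [len(l) - len(l.lstrip(" ")) for l in code_lines if l.startswith(" ")]
        indent_regularity = sum(i % 4 == 0 for i in indents) / max(len(indents), 1)

        # Naming consistency: share of defined or assigned names written in snake_case.
        names = [n for pair in re.findall(r"def\s+(\w+)|(\w+)\s*=", source) for n in pair if n]
        naming_consistency = sum(bool(re.fullmatch(r"[a-z_][a-z0-9_]*", n)) for n in names) / max(len(names), 1)

        return {"comment_ratio": comment_ratio,
                "indentation_regularity": indent_regularity,
                "naming_consistency": naming_consistency}

    print(coding_style_features("def add_two(a, b):\n    # sum two values\n    return a + b\n"))

In practice, feature vectors of this kind, extracted from human-written and LLM-paraphrased code, are fed to a lightweight classifier that separates the two classes.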
LLM Watermarking:
My research advances from distribution-based watermarking to linguistically adaptive and probability-balanced methods. STELA modulates watermark strength according to syntactic predictability, modeled by POS n-gram entropy, enabling model-free public detection. WaterMod generalizes this concept through modular token-rank partitioning that guarantees at least one high-probability token per class, supporting zero-bit and multi-bit watermarking with minimal quality loss.
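As a rough illustration of the rank-partitioning idea, the sketch below splits token ids by rank modulo m and adds a small bias to the residue class selected by a payload digit. This is a simplified rendering of the concept under my own assumptions, not the WaterMod implementation; the function name, the modulus m, and the bias delta are hypothetical choices.

    import torch

    def rank_modular_bias(logits: torch.Tensor, payload_digit: int,
                          m: int = 4, delta: float = 2.0) -> torch.Tensor:
        # Rank every token id by its logit (rank 0 = most probable token).
        ranks = torch.argsort(torch.argsort(logits, descending=True))
        # Tokens whose rank falls in the residue class chosen by the payload digit.
        mask = (ranks % m) == payload_digit
        # Each residue class contains exactly one of the top-m tokens, so the
        # biased class always offers a high-probability candidate.
        return logits + delta * mask.float()

    # At each decoding step the biased logits replace the raw ones before sampling.
    vocab_logits = torch.randn(32000)    # hypothetical 32k-token vocabulary
    biased = rank_modular_bias(vocab_logits, payload_digit=1)

A detector can then recover the embedded digit by testing which residue class is over-represented among the ranks of the observed tokens; in this sketch, fixing a single class corresponds to the zero-bit setting.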
Together, these studies form a coherent agenda for trustworthy and interpretable generative AI, bridging linguistic analysis and information-theoretic watermark design to meet emerging transparency and safety requirements.
Publications
† First author, * Co-first author
To Appear / Published
DITTO: A Spoofing Attack Framework on Watermarked LLMs via Knowledge Distillation
EACL 2026 (Main Conference), to appear.
Marking Code Without Breaking It: Code Watermarking for Detecting LLM-Generated Code
Findings of EACL 2026, to appear. (* Equal contribution)
WaterMod: Modular Token-Rank Partitioning for Probability-Balanced LLM Watermarking
AAAI 2026 (Oral), to appear.
EnCur: Curriculum-Based In-Context Learning with Structural Encoding for Code Time Complexity Prediction
Expert Systems with Applications, Vol. 296, Article 129094, January 2026.
Detecting Code Paraphrased by Large Language Models using Coding Style Features
Engineering Applications of Artificial Intelligence, Vol. 162, December 2025.
Mondrian: A Framework for Logical Abstract (Re)Structuring
EMNLP 2025 (Main Conference), pp. 33663–33678.
TrapDoc: Deceiving LLM Users by Injecting Imperceptible Phantom Tokens into Documents
Findings of EMNLP 2025, pp. 18881–18897.
Advanced Code Time Complexity Prediction Approach Using Contrastive Learning
Engineering Applications of Artificial Intelligence, Vol. 151, July 2025.
KatFishNet: Detecting LLM-Generated Korean Text through Linguistic Feature Analysis
ACL 2025 (Main Conference), pp. 21189–21222.
ConPrompt: Pre-training a Language Model with Machine-Generated Data for Implicit Hate Speech Detection
Findings of EMNLP 2023, pp. 10964–10980.
Contrastive Learning with Keyword-based Data Augmentation for Code Search and Code Question Answering
EACL 2023 (Main Conference), pp. 3609–3619.
Generalizable Implicit Hate Speech Detection using Contrastive Learning
COLING 2022, pp. 6667–6679.
Under Review
From Intuition to Expertise: Rubric-Based Cognitive Calibration for Human Detection of LLM-Generated Korean Text
Steering Language Models Before They Speak: Logit-Level Interventions
A Linguistics-Aware LLM Watermarking via Syntactic Predictability
Select then MixUp: Improving Out-of-Distribution Natural Language Code Search
Projects
- AI for Issue-Fact Mapping (2021–2022):
  Knowledge graph entity retrieval from unstructured text using topic modeling.
- Medical Text Mining (2022–2025):
  Clinical insight extraction from medical records using topic modeling.
- Human-AI Programming Lab (2023–2025):
  Code search, QA, and time complexity prediction for collaborative coding systems.
- Research on Effective Watermarking Techniques for AI-generated Codes (2025–):
  Watermarking techniques that intervene in the LLM code generation process, embedding watermarks while preserving code quality and functionality.
Professional Services
- Reviewer, ACL Rolling Review (2023–)
Skills
- Programming: Python, Java, C, C++, Bash
- ML Frameworks: PyTorch, scikit-learn, Hugging Face Transformers
- NLP: SpaCy, NLTK, KoNLPy, KiwiPiePy
- LLM APIs: OpenAI API, Gemini API, Ollama
- Data Analysis: Pandas, NumPy, SciPy
- Visualization: Matplotlib, Seaborn
- Version Control: Git
- Writing: LaTeX
- Languages: Korean (native), English (fluent)