CV
Name: Shinwoo Park
Degree: Ph.D. in Artificial Intelligence
Affiliation: Yonsei University, Seoul, South Korea
Graduation Date: February 2026
Research Interests
My research focuses on ensuring the safety, transparency, and accountability of large language models (LLMs) through detection and watermarking techniques. I develop multi-modal and multi-lingual systems that identify or trace LLM-generated content across natural language and source code domains.
Specifically, I explore two complementary directions: (1) linguistic and stylistic feature–based detection, which analyzes morphological, syntactic, and stylistic patterns to distinguish human- and LLM-generated text or code; and (2) LLM watermarking, which embeds imperceptible yet verifiable statistical or structural signals into generated outputs.
My recent work includes KatFishNet, the first linguistic feature–based detector for Korean text; LPcodedec, a coding-style-driven detector for paraphrased code; STELA, a syntactic-predictability watermark enabling model-free detection; and WaterMod, a probability-balanced modular watermarking framework supporting multi-bit payloads.
Broadly, my goal is to build trustworthy generative systems that are interpretable, regulation-compliant, and resistant to misuse.
Research Summary
My research aims to promote responsible and verifiable AI generation by developing reliable methods for detecting and attributing LLM-generated text and code. I pursue two main directions that reinforce each other:
- Linguistic/Stylistic Feature-based Detection: Models such as KatFishNet and LPcodedec analyze linguistic or coding-style cues—word spacing, part-of-speech diversity, punctuation patterns, naming and indentation consistency—to capture distributional differences between human and LLM authors.
- LLM Watermarking: Frameworks such as STELA and WaterMod embed imperceptible signals during generation using linguistically or probabilistically adaptive mechanisms, enabling publicly verifiable and multi-bit attribution without harming fluency.
These systems demonstrate strong multilingual (English, Korean) and multimodal (text + code) generalization, advancing interpretable and regulation-aligned AI provenance research.
Research Statement
My long-term research vision is to establish a unified framework for provenance-aware and interpretable AI that spans both language and programming modalities. To achieve this, I combine linguistic insight, statistical modeling, and watermark design to construct transparent interfaces between human communication and generative models.
Linguistic / Stylistic Feature-based Detection:
My work on KatFishNet introduces the first benchmark and detector for LLM-generated Korean text,
leveraging word-spacing irregularities, POS n-gram diversity, and comma usage to expose cross-morphological differences between human and machine writing.
Extending this idea to source code, LPcodedec identifies LLM-paraphrased code by quantifying coding-style features such as naming consistency, indentation regularity, and comment ratio.
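As a concrete illustration of this feature-based approach, the sketch below computes three toy coding-style features (comment ratio, indentation regularity, and naming consistency) for a Python snippet. This is a minimal sketch for illustration only: the function name `style_features` and the exact feature definitions are simplifications assumed here, not the actual feature set used in LPcodedec.

```python
import re
from statistics import pstdev

def style_features(code: str) -> dict:
    """Toy coding-style features in the spirit of LPcodedec
    (illustrative simplification, not the paper's feature set)."""
    nonblank = [line for line in code.splitlines() if line.strip()]
    # Comment ratio: fraction of non-blank lines that are comments.
    comments = [line for line in nonblank if line.lstrip().startswith("#")]
    comment_ratio = len(comments) / len(nonblank) if nonblank else 0.0
    # Indentation regularity: a low std-dev of indent widths means
    # more regular indentation.
    indents = [len(line) - len(line.lstrip()) for line in nonblank]
    indent_stdev = pstdev(indents) if len(indents) > 1 else 0.0
    # Naming consistency: share of identifiers written in lowercase
    # snake_case style.
    idents = re.findall(r"\b[a-zA-Z_][a-zA-Z0-9_]*\b", code)
    snake = [ident for ident in idents if ident == ident.lower()]
    naming_consistency = len(snake) / len(idents) if idents else 0.0
    return {
        "comment_ratio": comment_ratio,
        "indent_stdev": indent_stdev,
        "naming_consistency": naming_consistency,
    }
```

In a detector, per-file feature vectors like these would feed a standard classifier (e.g. logistic regression) trained to separate human-authored from LLM-paraphrased code.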
LLM Watermarking:
My research advances from distribution-based watermarking to linguistically adaptive and probability-balanced methods.
STELA modulates watermark strength according to syntactic predictability modeled by POS n-gram entropy, enabling model-free public detection.
WaterMod generalizes this concept through modular token-rank partitioning that guarantees at least one high-probability token per class,
supporting zero-bit and multi-bit watermarking with minimal quality loss.
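The modular partitioning idea can be made concrete with a small sketch. Assuming plain Python lists of logits, the hypothetical helper below ranks tokens by logit, assigns each token to class `rank % modulus`, and boosts the class matching one payload digit; because ranks 0 through modulus-1 fall into distinct classes, every class is guaranteed at least one high-probability token. This is a simplified illustration of the principle, not WaterMod's actual algorithm.

```python
def modular_watermark_logits(logits, payload_digit, modulus=4, delta=2.0):
    """Simplified rank-based modular partitioning (illustrative only).

    Each token is assigned to class (rank % modulus), where rank 0 is
    the most probable token; the class equal to `payload_digit` gets a
    logit boost of `delta`, encoding one base-`modulus` digit."""
    # Rank tokens by logit, descending (rank 0 = most probable token).
    order = sorted(range(len(logits)), key=lambda i: -logits[i])
    rank = {tok: r for r, tok in enumerate(order)}
    # Boost every token whose rank class matches the payload digit.
    return [x + (delta if rank[i] % modulus == payload_digit else 0.0)
            for i, x in enumerate(logits)]
```

Detection would then recover the digit by checking which residue class is statistically over-represented among the sampled tokens' ranks.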
Together, these studies form a coherent agenda for trustworthy and interpretable generative AI, bridging linguistic analysis and information-theoretic watermark design to meet emerging transparency and safety requirements.
Publications
† First author; * co-first author (equal contribution)
To Appear / Published
DITTO: A Spoofing Attack Framework on Watermarked LLMs via Knowledge Distillation
EACL, Main Conference, to appear, 2026.
Marking Code Without Breaking It: Code Watermarking for Detecting LLM-Generated Code
EACL, Findings, to appear, 2026 (co-first author, equal contribution).
WaterMod: Modular Token-Rank Partitioning for Probability-Balanced LLM Watermarking
AAAI 2026 (Main Technical Track, Oral Presentation).
EnCur: Curriculum-Based In-Context Learning with Structural Encoding for Code Time Complexity Prediction
Expert Systems with Applications, Vol. 296, 129094, January 2026.
Detecting Code Paraphrased by Large Language Models using Coding Style Features
Engineering Applications of Artificial Intelligence, Vol. 162, December 2025.
Mondrian: A Framework for Logical Abstract (Re)Structuring
EMNLP 2025 (Main Conference), pp. 33663–33678.
TrapDoc: Deceiving LLM Users by Injecting Imperceptible Phantom Tokens into Documents
Findings of EMNLP 2025, pp. 18881–18897.
Advanced Code Time Complexity Prediction Approach Using Contrastive Learning
Engineering Applications of Artificial Intelligence, Vol. 151, July 2025.
KatFishNet: Detecting LLM-Generated Korean Text through Linguistic Feature Analysis
ACL 2025 (Main Conference), pp. 21189–21222.
ConPrompt: Pre-training a Language Model with Machine-Generated Data for Implicit Hate Speech Detection
Findings of EMNLP 2023, pp. 10964–10980.
Contrastive Learning with Keyword-based Data Augmentation for Code Search and Code Question Answering
EACL 2023 (Main Conference), pp. 3609–3619.
Generalizable Implicit Hate Speech Detection using Contrastive Learning
COLING 2022, pp. 6667–6679.
Under Review
From Intuition to Calibrated Judgment: A Rubric-Based Expert-Panel Study of Human Detection of LLM-Generated Korean Text
Steering Language Models Before They Speak: Logit-Level Interventions
A Linguistics-Aware LLM Watermarking via Syntactic Predictability
Select then MixUp: Improving Out-of-Distribution Natural Language Code Search
Projects
- Topic Modeling for Entity Retrieval (2021–2022, funded by the Ministry of Science and ICT, Korea): Developed a knowledge-graph entity search module over unstructured text using LDA-based topic modeling, and designed a query-aware retrieval mechanism to align user queries with semantically relevant entities.
- Medical Text Mining (2022–2024, industry-funded by Soldoc): Built an NLP-based clinical support framework from psychiatrist–patient dialogues. Applied BERT-embedding clustering to discover latent topics and to extract suspected mental health conditions, with interpretable evidence keywords, from conversational data.
- Human-AI Programming Lab (2023–2025, funded by the National Research Foundation of Korea): Conducted research on AI techniques for collaborative programming systems, including code search, code question answering, worst-case time complexity prediction, and detection of LLM-generated code.
  Publications:
  - EACL 2023 (Main Track, first author): natural language-based code search and code question answering for developer assistance.
  - Engineering Applications of Artificial Intelligence (EAAI), 2025 (first author, 2 papers): (1) contrastive learning-based worst-case time complexity prediction for source code; (2) detection of LLM-generated code using coding-style features.
  - Expert Systems with Applications (ESWA), 2025 (co-author): in-context learning approach for worst-case time complexity prediction.
- Research on Effective Watermarking Techniques for AI-Generated Code (2025, funded by the National Research Foundation of Korea): Investigated watermarking methods for LLM-generated code that embed identifiable signals during the generation process while preserving functionality and code quality.
  Publication:
  - Findings of EACL 2026 (co-first author): a watermarking framework for AI-generated code enabling reliable attribution while maintaining syntactic and semantic integrity.
Professional Services
- Reviewer, ACL Rolling Review (ARR) (2023–present)
- Reviewer, ACL Student Research Workshop (SRW) (2026): contributed to mentoring and evaluating early-stage research in NLP.
Skills
- Programming: Python, Bash
- ML Frameworks: PyTorch, scikit-learn, Hugging Face Transformers
- NLP: spaCy, NLTK, KoNLPy, kiwipiepy
- LLM APIs: OpenAI API, Gemini API, Ollama
- Data Analysis: Pandas, NumPy, SciPy
- Visualization: Matplotlib, Seaborn
- Version Control: Git
- Writing: LaTeX
- Languages: Korean (native), English (fluent)