CV
Name: Shinwoo Park
Current Position: Postdoctoral Researcher, University of Luxembourg
Affiliation: Software Verification and Validation (SVV) Group, Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg
PI: Prof. Domenico Bianculli
Degree: Ph.D. in Artificial Intelligence, Yonsei University
Graduation Date: February 2026
Appointments
- Postdoctoral Researcher, University of Luxembourg, SnT/SVV Group (June 2026–May 2027):
Working with Prof. Domenico Bianculli on a collaborative project with the European Investment Bank (EIB) at the intersection of NLP/NLU, Requirements Engineering, and financial and regulatory document analysis. The project applies language technologies to high-stakes problems in the financial and regulatory domains, including the analysis of financial contracts and legal documents.
- Ph.D. Researcher, Yonsei University, Theory of Computation Lab (2020–2026):
Conducted research on LLM-generated content detection, LLM watermarking, code intelligence, and trustworthy generative AI under the supervision of Prof. Yo-Sub Han.
Research Interests
My research focuses on ensuring the safety, transparency, and accountability of large language models (LLMs) through detection, watermarking, and domain-specific language understanding techniques. I develop multimodal and multilingual systems that identify, trace, structure, and analyze AI-generated or domain-specific content across natural language, source code, and financial/regulatory documents.
Specifically, I explore three complementary directions: (1) linguistic and stylistic feature–based detection, which analyzes morphological, syntactic, and stylistic patterns to distinguish human- and LLM-generated text or code; (2) LLM watermarking, which embeds imperceptible yet verifiable statistical or structural signals into generated outputs; and (3) NLP/NLU and Requirements Engineering for financial and regulatory domains, which aims to support the analysis, structuring, and verification of complex financial contracts and legal documents.
My recent work includes KatFishNet, the first linguistic feature–based detector for Korean text; LPcodedec, a coding-style-driven detector for paraphrased code; STELA, a syntactic-predictability watermark enabling model-free detection; and WaterMod, a probability-balanced modular watermarking framework supporting multi-bit payloads.
Broadly, my goal is to build trustworthy generative and language-understanding systems that are interpretable, regulation-compliant, and robust enough for high-stakes domains.
Research Summary
My research aims to promote responsible and verifiable AI generation by developing reliable methods for detecting, attributing, and interpreting LLM-generated or domain-specific text and code. I pursue three mutually reinforcing directions:
- Linguistic/Stylistic Feature-based Detection: Models such as KatFishNet and LPcodedec analyze linguistic or coding-style cues—word spacing, part-of-speech diversity, punctuation patterns, naming and indentation consistency—to capture distributional differences between human and LLM authors.
- LLM Watermarking: Frameworks such as STELA and WaterMod embed imperceptible signals during generation using linguistically or probabilistically adaptive mechanisms, enabling publicly verifiable and multi-bit attribution without harming fluency.
- Domain-specific NLP/NLU and Requirements Engineering: My postdoctoral research extends my work on trustworthy NLP toward high-stakes financial and regulatory settings, focusing on the analysis of financial contracts, legal documents, and regulatory texts through language understanding and requirements-oriented methods.
These systems demonstrate strong multilingual (English, Korean) and multimodal (text + code) generalization, advancing interpretable, regulation-aligned, and domain-aware AI research.
Research Statement
My long-term research vision is to establish a unified framework for provenance-aware, interpretable, and domain-grounded AI that spans language, programming, and high-stakes document analysis. To achieve this, I combine linguistic insight, statistical modeling, watermark design, and requirements-oriented analysis to construct transparent interfaces between human communication, domain knowledge, and generative models.
Linguistic / Stylistic Feature-based Detection:
My work on KatFishNet introduces the first benchmark and detector for LLM-generated Korean text, leveraging word-spacing irregularities, POS n-gram diversity, and comma usage to expose cross-morphological differences between human and machine writing. Extending this idea to source code, LPcodedec identifies LLM-paraphrased code by quantifying coding-style features such as naming consistency, indentation regularity, and comment ratio.
LLM Watermarking:
My research advances from distribution-based watermarking to linguistically adaptive and probability-balanced methods. STELA modulates watermark strength according to syntactic predictability modeled by POS n-gram entropy, enabling model-free public detection. WaterMod generalizes this concept through modular token-rank partitioning that guarantees at least one high-probability token per class, supporting zero-bit and multi-bit watermarking with minimal quality loss.
NLP/NLU and Requirements Engineering for Financial and Regulatory Domains:
My postdoctoral work at the University of Luxembourg extends this agenda toward trustworthy language technologies for high-stakes institutional settings. In collaboration with the European Investment Bank (EIB), I work on applying NLP, NLU, and Requirements Engineering techniques to financial and regulatory-domain problems, including the analysis of financial contracts, legal documents, and regulatory texts.
Together, these studies form a coherent agenda for trustworthy and interpretable AI, bridging linguistic analysis, information-theoretic watermark design, and domain-specific language understanding to meet emerging transparency, safety, and regulatory requirements.
Publications
† First author; †* Co-first author (marked with *)
To Appear / Published
A Linguistics-Aware LLM Watermarking via Syntactic Predictability
ACL 2026 (Main Conference), to appear.
DITTO: A Spoofing Attack Framework on Watermarked LLMs via Knowledge Distillation
EACL 2026 (Main Conference), pp. 4922–4936.
Marking Code Without Breaking It: Code Watermarking for Detecting LLM-Generated Code
Findings of EACL 2026, pp. 3990–4002.
* Equal contribution
WaterMod: Modular Token-Rank Partitioning for Probability-Balanced LLM Watermarking
AAAI 2026 (Main Technical Track, Oral Presentation).
EnCur: Curriculum-Based In-Context Learning with Structural Encoding for Code Time Complexity Prediction
Expert Systems with Applications, Vol. 296, 129094, January 2026.
Detecting Code Paraphrased by Large Language Models using Coding Style Features
Engineering Applications of Artificial Intelligence, Vol. 162, December 2025.
Mondrian: A Framework for Logical Abstract (Re)Structuring
EMNLP 2025 (Main Conference), pp. 33663–33678.
TrapDoc: Deceiving LLM Users by Injecting Imperceptible Phantom Tokens into Documents
Findings of EMNLP 2025, pp. 18881–18897.
Advanced Code Time Complexity Prediction Approach Using Contrastive Learning
Engineering Applications of Artificial Intelligence, Vol. 151, July 2025.
KatFishNet: Detecting LLM-Generated Korean Text through Linguistic Feature Analysis
ACL 2025 (Main Conference), pp. 21189–21222.
ConPrompt: Pre-training a Language Model with Machine-Generated Data for Implicit Hate Speech Detection
Findings of EMNLP 2023, pp. 10964–10980.
Contrastive Learning with Keyword-based Data Augmentation for Code Search and Code Question Answering
EACL 2023 (Main Conference), pp. 3609–3619.
Generalizable Implicit Hate Speech Detection using Contrastive Learning
COLING 2022, pp. 6667–6679.
Under Review
Linguistics-Aware Non-Distortionary LLM Watermarking
Sequential Behavioral Watermarking for LLM Agents
Scalable Semantic Code Clone Retrieval via Module-Level Graph Aggregation
From Intuition to Calibrated Judgment: A Rubric-Based Expert-Panel Study of Human Detection of LLM-Generated Korean Text
Steering Language Models Before They Speak: Logit-Level Interventions
Select then MixUp: Improving Out-of-Distribution Natural Language Code Search
Projects
- NLP/NLU and Requirements Engineering for Financial and Regulatory Documents (2026–2027, University of Luxembourg / EIB-collaborative project):
Conducting postdoctoral research in the SVV group at SnT, University of Luxembourg, on language technologies for financial and regulatory-domain problems. The project applies NLP, NLU, and Requirements Engineering techniques to the analysis, structuring, and verification of financial contracts, legal documents, and regulatory texts.
- Topic Modeling for Entity Retrieval (2021–2022, Funded by Ministry of Science and ICT, Korea):
Developed a knowledge-graph entity search module from unstructured text using LDA-based topic modeling. Designed a query-aware retrieval mechanism to align user queries with semantically relevant entities.
- Medical Text Mining (2022–2024, Industry-funded by Soldoc):
Built an NLP-based clinical support framework from psychiatrist–patient dialogues. Applied BERT embedding clustering to discover latent topics and extract suspected mental health conditions with interpretable evidence keywords from conversational data.
- Human-AI Programming Lab (2023–2025, Funded by National Research Foundation of Korea):
Conducted research on AI techniques for collaborative programming systems, including code search, code question answering, worst-case time complexity prediction, and detection of LLM-generated code.
Publications:
  - EACL 2023 (Main Track, First Author): Natural language-based code search and code QA for developer assistance.
  - Engineering Applications of Artificial Intelligence (EAAI), 2025 (First Author, 2 papers): (1) Contrastive learning-based worst-case time complexity prediction for source code, (2) Detection of LLM-generated code using coding-style features.
  - Expert Systems with Applications (ESWA), 2025 (Co-author): In-context learning approach for worst-case time complexity prediction.
- Research on Effective Watermarking Techniques for AI-generated Codes (2025, Funded by National Research Foundation of Korea):
Investigated watermarking methods for LLM-generated code by embedding identifiable signals during the generation process while preserving functionality and code quality.
Publication:
  - Findings of EACL 2026 (Co-first Author): A watermarking framework for AI-generated code enabling reliable attribution while maintaining syntactic and semantic integrity.
Professional Services
- Reviewer, ACL Rolling Review (ARR) (2023–)
- Reviewer, ACL Student Research Workshop (SRW) (2026–)
Contributed to mentoring and evaluating early-stage research in NLP.
- Reviewer, NeurIPS (2026–)
Skills
- Programming: Python, Bash
- ML Frameworks: PyTorch, scikit-learn, Hugging Face Transformers
- NLP: spaCy, NLTK, KoNLPy, kiwipiepy
- LLM APIs: OpenAI API, Gemini API, Ollama
- Data Analysis: Pandas, NumPy, SciPy
- Visualization: Matplotlib, Seaborn
- Version Control: Git
- Writing: LaTeX
- Languages: Korean (native), English (fluent)