Institute of Language & Information Studies | Contents

Introduction

A corpus is a large, structured collection of digitized linguistic data. Corpora are essential in many academic fields that study language because they give a comprehensive view of linguistic variation. The Yonsei Corpus project began in 1986 with the founding of the Korean Dictionary Society. In 1988 we began building a corpus for dictionary compilation. We later broadened its scope to include a wider variety of linguistic data for research in Korean linguistics, Korean education, Human Linguistics, and Teaching Korean as a Foreign Language.

List of Corpora

Number	Name	Size
1	Yonsei Corpus 1	2,900,000
2	Yonsei Corpus 2	1,100,000
3	Yonsei Corpus 3	5,980,000
4	Yonsei Corpus 4	770,000
5	Yonsei Corpus 5	8,600,000
6	Yonsei Corpus 6	7,230,000
7	Yonsei Corpus 7	13,670,000
8	Yonsei Corpus 8	870,000
9	Yonsei Corpus 9	1,500,000
10	Yonsei Corpus 10	780,000
11	Yonsei Corpus 11	730,000
12	Yonsei Corpus of Korean in the 20th Century	150,378,870
13	Corpus of Korean Textbooks (Complete)	724,856
14	Corpus of Korean Textbooks (Conversation)	119,598
15	Yonsei Korean Learner Corpus	278,542
16	Korean Elementary Textbook Corpus after Independence	1,496,280
17	The 6th and 7th Korean Elementary Textbook Corpus	1,681,769
18	Yonsei Balanced Corpus of Written Discourse	1,054,362
19	Yonsei Balanced Corpus of Spoken Discourse	998,934
20	Yonsei Corpus of Polysemy	1,165,224
21	Yonsei Corpus of Hangul Tripitaka	386,472
22	Corpus of <Tongnip Sinmun> Newspaper	144,309
23	Corpus of Popular Songs in the Modern Era	29,339
24	Yonsei Corpus of Multimodal Data	18,986
25	Twitter Corpus	945,175,620
26	Political Discourse Corpus	306,681
	Total	1,148,089,842
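
As a quick consistency check, the sizes listed above can be summed and compared with the stated total. The following is a minimal Python sketch and not part of the original listing. The variable names are illustrative, and the size of Yonsei Corpus 5 is assumed to be 8,600,000, the only value consistent with the stated total.

```python
# Sizes copied from the list above, in the units used by the table.
# Assumption: Yonsei Corpus 5 is taken as 8,600,000, the value implied
# by the stated total of 1,148,089,842.
CORPORA = [
    ("Yonsei Corpus 1", 2_900_000),
    ("Yonsei Corpus 2", 1_100_000),
    ("Yonsei Corpus 3", 5_980_000),
    ("Yonsei Corpus 4", 770_000),
    ("Yonsei Corpus 5", 8_600_000),  # assumed value, see note above
    ("Yonsei Corpus 6", 7_230_000),
    ("Yonsei Corpus 7", 13_670_000),
    ("Yonsei Corpus 8", 870_000),
    ("Yonsei Corpus 9", 1_500_000),
    ("Yonsei Corpus 10", 780_000),
    ("Yonsei Corpus 11", 730_000),
    ("Yonsei Corpus of Korean in the 20th Century", 150_378_870),
    ("Corpus of Korean Textbooks (Complete)", 724_856),
    ("Corpus of Korean Textbooks (Conversation)", 119_598),
    ("Yonsei Korean Learner Corpus", 278_542),
    ("Korean Elementary Textbook Corpus after Independence", 1_496_280),
    ("The 6th and 7th Korean Elementary Textbook Corpus", 1_681_769),
    ("Yonsei Balanced Corpus of Written Discourse", 1_054_362),
    ("Yonsei Balanced Corpus of Spoken Discourse", 998_934),
    ("Yonsei Corpus of Polysemy", 1_165_224),
    ("Yonsei Corpus of Hangul Tripitaka", 386_472),
    ("Corpus of <Tongnip Sinmun> Newspaper", 144_309),
    ("Corpus of Popular Songs in the Modern Era", 29_339),
    ("Yonsei Corpus of Multimodal Data", 18_986),
    ("Twitter Corpus", 945_175_620),
    ("Political Discourse Corpus", 306_681),
]

STATED_TOTAL = 1_148_089_842

# Sum every listed size and compare against the total row.
computed_total = sum(size for _name, size in CORPORA)
totals_match = computed_total == STATED_TOTAL

# Find the largest corpus and its share of the overall total.
largest_name, largest_size = max(CORPORA, key=lambda row: row[1])
largest_share = largest_size / computed_total

print(f"Computed total: {computed_total:,} (matches stated total: {totals_match})")
print(f"Largest corpus: {largest_name} ({largest_share:.1%} of the total)")
```

Under the stated assumption, the computed sum agrees with the total row, and the Twitter Corpus accounts for most of the overall size.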