https://reu.cs.mu.edu/api.php?action=feedcontributions&user=AdityaSubramanian&feedformat=atomREU@MU - User contributions [en]2021-10-19T03:11:51ZUser contributionsMediaWiki 1.23.13https://reu.cs.mu.edu/index.php/Text_Mining_in_Keyword_ExtractionText Mining in Keyword Extraction2016-08-05T17:25:50Z<p>AdityaSubramanian: </p>
<hr />
<div><br />
== Project Description and Goal ==<br />
'''Student:''' Phuc Nguyen and Aditya Subramanian<br />
<br />
'''Mentor:''' Dr. Thomas Kaczmarek<br />
<br />
'''Description:''' Text mining, or text analytics, refers to the use of computational techniques to discover new and previously unknown information in unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the terms that best represent an unstructured text, and those terms can then be used to classify and cluster documents. Although keywords are central to how search engines locate information, good keywords are difficult to produce, since assigning them by hand is prohibitively time-consuming given the volume of text available today. Many methods have therefore been developed over the years, and new solutions are continually proposed. Prevalent algorithms and models include TF-IDF (Term Frequency-Inverse Document Frequency), RAKE (Rapid Automatic Keyword Extraction), and TextRank, as well as less common approaches based on lexical chains or a Bayes classifier. <br />
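As a concrete illustration of the first of these models, here is a minimal TF-IDF sketch; the toy corpus, tokenization, and log base are illustrative choices, not the project's actual setup:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each term in each document as term frequency times
    inverse document frequency; high scores mark terms that are
    frequent in this document but rare across the corpus."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (tf[t] / len(doc)) * math.log(n / df[t])
                       for t in tf})
    return scores

docs = [["text", "mining", "keyword"],
        ["keyword", "extraction", "keyword"],
        ["graph", "ranking", "text"]]
scores = tf_idf(docs)
# In the second document, "extraction" outscores the repeated
# "keyword", because "keyword" also appears in another document.
```

The inverse-document-frequency factor is what separates TF-IDF from raw frequency counts: a term repeated everywhere in the corpus earns a near-zero score no matter how often it occurs in one document.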
<br />
This project analyzes several of the algorithms above to determine the strengths and weaknesses of each method, comparing their results against keyword lists generated by humans, and then proposes a new approach based on the existing methods. First, we obtain a corpus and sample documents within it. We then generate keyword lists using each algorithm under test, alongside the list supplied by each document's author. Finally, we ask an expert in the field to rank all the lists. We repeat the experiment with several documents and, based on the outcomes, gauge the potential of each method. Note that, to avoid biasing the expert, the lists are unlabelled and their order is shuffled randomly each time. <br />
<br />
Phuc and Aditya are working on two separate projects within this field. Aditya's project has two aspects. The first is a machine learning algorithm to identify keywords; while this has been done before [4], we consider many more factors than past work has. The second is a better co-occurrence matrix: the model takes the observed distance between two words and compares it to the expected distance given the number of occurrences of each word, the positions of the first word, and the length of the document. Phuc's approach, on the other hand, attempts to improve TextRank's performance (reduce runtime complexity, improve keyword scores, etc.) by integrating the model from the PageRank algorithm used by Google. His current task is to refine this approach further, using tools from numerical analysis to reduce its runtime complexity. <br />
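The TextRank-with-PageRank direction can be sketched roughly as follows: build an undirected co-occurrence graph over the words of a document and propagate PageRank-style scores on it. The window size, damping factor, and iteration count below are conventional defaults, not the project's tuned values:

```python
from collections import defaultdict

def textrank(words, window=2, damping=0.85, iters=50):
    """Rank words by PageRank-style score propagation on an undirected
    co-occurrence graph (edge = the two words appear within `window`
    positions of each other)."""
    graph = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:           # no self-loops
                graph[w].add(words[j])
                graph[words[j]].add(w)
    score = {w: 1.0 for w in graph}
    for _ in range(iters):
        score = {w: (1 - damping) + damping *
                    sum(score[v] / len(graph[v]) for v in graph[w])
                 for w in graph}
    return sorted(score, key=score.get, reverse=True)

words = "graph ranks graph words graph well".split()
ranked = textrank(words, window=1)
# "graph" is adjacent to every other word, so it ranks first.
```

Each iteration is linear in the number of edges, which is why the rate of convergence — how many iterations are needed — dominates the runtime analysis.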
<br />
'''Background References'''<br />
<br />
1) [https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf TextRank: Bringing Order into Texts] by Rada Mihalcea and Paul Tarau.<br />
<br />
2) [http://www.cs.unm.edu/~pdevineni/papers/Lott.pdf Survey of Keyword Extraction Techniques] by Brian Lott.<br />
<br />
3) [https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents Automatic Keyword Extraction from Individual Documents] by Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley.<br />
<br />
4) [http://dl.acm.org/citation.cfm?id=1119383 Improved Automatic Keyword Extraction Given More Linguistic Knowledge] by Annette Hulth<br />
<br />
5) [http://www.aaai.org/Papers/FLAIRS/2003/Flairs03-076.pdf Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information] by Yutaka Matsuo and Mitsuru Ishizuka<br />
<br />
== Weekly Log ==<br />
'''Week 1 (5/31 to 6/3)'''<br />
* Attend REU orientation activities, fill out forms and paperwork<br />
* Meet with Dr. Kaczmarek to discuss the goals and scope of the project<br />
* Read articles related to text mining to find a potential research topic<br />
* Come up with a research topic and discuss it with Dr. Kaczmarek<br />
<br />
'''Week 2 (6/6 to 6/10)'''<br />
* Continue reading about research already done in the field<br />
* Finalize research topic<br />
* Create list of factors for machine learning algorithm<br />
* Outline proposed algorithm for new co-occurrence matrix <br />
* Start looking at RAKE, TextRank, and TF-IDF<br />
<br />
'''Week 3 (6/13 to 6/17)'''<br />
* Complete code for machine learning algorithm<br />
* Complete code to test proposed algorithm with other algorithms<br />
* Review tools from linear algebra and mathematical analysis<br />
* Attempt to prove convergence and find runtime complexity for the original TextRank method<br />
<br />
'''Week 4 (6/20 to 6/24)'''<br />
* Find dataset for machine learning algorithm<br />
* Begin work on co-occurrence matrix algorithm<br />
* Meet with Dr. Kaczmarek to explain the new approach to perform TextRank<br />
* Meet with Aditya to discuss collaboration between the two projects<br />
* Begin to look at the proofs of the Perron-Frobenius Theorem and the Power Method Convergence Theorem<br />
* Review graph theory and theory of probability<br />
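The Power Method studied this week can be sketched in a few lines: repeatedly apply a column-stochastic matrix to a probability vector. By the Perron-Frobenius theorem the dominant eigenvalue of such a matrix is 1, so the iterates converge to its stationary (PageRank-style) vector. The 3-node link matrix below is a made-up toy example, not data from the project:

```python
def power_iteration(matrix, iters=100):
    """Converge to the dominant eigenvector of a column-stochastic
    matrix by repeated multiplication from a uniform start vector."""
    n = len(matrix)
    v = [1.0 / n] * n
    for _ in range(iters):
        v = [sum(matrix[i][j] * v[j] for j in range(n))
             for i in range(n)]
    return v

# Toy web of 3 pages; column j spreads page j's score over its links.
M = [[0.0, 0.5, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 0.5, 0.0]]
v = power_iteration(M)
# Converges to [4/9, 2/9, 1/3]: page 0 receives the most link mass.
```

The error shrinks geometrically with the ratio of the second eigenvalue to the first, which is the quantity a convergence-rate analysis of TextRank needs to bound.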
<br />
'''Week 5 (6/27 to 7/1)'''<br />
* Complete code for co-occurrence matrix creator<br />
* Attempt to understand prior implementations of algorithms using such matrices<br />
* Integrate PageRank to improve TextRank performance<br />
* Begin to write a formal report paper<br />
* Give a presentation on what we have come up with so far<br />
<br />
'''Week 6 (7/4 to 7/8)'''<br />
* Attempt to run machine learning algorithm on corpus<br />
* Begin gathering data<br />
* Begin to look at the rate of convergence for the new approach<br />
* Review theory of convergence from numerical analysis<br />
<br />
'''Week 7 (7/11 to 7/15)'''<br />
* Complete data collection for multiple methods<br />
* Begin data analysis<br />
* Start to look at some numerical techniques used to approximate eigenvectors<br />
* Review complexity theory involving big-O notation and the theory of Markov chain<br />
* Begin writing code to test the new approach; conclude that the theory is valid and that the approach can reasonably be implemented in practice<br />
<br />
'''Week 8 (7/18 to 7/22)'''<br />
* Finalize model<br />
* Write code to test model on other half of corpus<br />
* Prepare poster to be submitted by the end of the week<br />
* Meet with Dr. Kaczmarek to discuss the plan for the remaining two weeks and begin to wrap everything up<br />
<br />
'''Week 9 (7/25 to 7/29)'''<br />
* Run model on half of corpus<br />
* Gather results on model performance<br />
* Prepare for poster presentation<br />
* Prepare for final presentation next week<br />
* Meet with Dr. Kaczmarek to finalize report paper<br />
<br />
'''Week 10 (8/1 to 8/5)'''<br />
* Submit report paper<br />
* Deliver final presentation<br />
* Complete remaining paperwork and survey</div>AdityaSubramanianhttps://reu.cs.mu.edu/index.php/Summer_2016_ProjectsSummer 2016 Projects2016-06-15T19:59:09Z<p>AdityaSubramanian: </p>
<hr />
<div>[[Game Engine for Serious Educational Games]]. Student Researchers: <br />
[[User:Dcronce|Daniel Cronce]] and [[User:Mjbaker4|Michael Baker]]. Mentors: [http://www.marquette.edu/ctl/about/staff.shtml Dr. Shaun Longstreet], [http://www.utdallas.edu/~kcooper/ Dr. Kendra Cooper], and [[User:Brylow|Dr. Dennis Brylow]].<br />
<br />
[[Predicting Relative 'Cleanability' from Geometry]]. [[User:Asisk|Anna Sisk]]. Mentors: [http://www.mscs.mu.edu/~stevem/ Dr. Stephen Merrill] and Casey O'Brien <br />
<br />
[[Sudoku Distances]]. [[User:Jbeilke|Julia Beilke]] and [[User:Jmiller|Joel Miller]]. Mentor: Dr. Kim Factor.<br />
<br />
[[Comparing Two Models of HMPAO Uptake in the Lungs: Is a Compartmental Model Sufficient to Reliably Estimate Key Physiological Parameters]]. [[kskamp | Kim Sommerkamp]]. Mentor: Dr. Anne Clough<br />
<br />
[[Text Mining in Keyword Extraction | Text Mining in Keyword Extraction]]. Students: [[Phuc Nguyen | Phuc Nguyen]] and [[AdityaSubramanian | Aditya Subramanian]]. Mentor: [http://www.marquette.edu/mscs/facstaff-kaczmarek.shtml Dr. Thomas Kaczmarek].<br />
<br />
[[Applied Probabilistic Forecasting Methods in Energy Consumption]]. Students: [[User:Scloew|Stephen Loew]] and [[User:ARuiz|Alberto Ruiz]]. Mentor: Dr. George Corliss.<br />
<br />
[[Statistical Analysis of PARC Near West Side]]. [[ghong|Gina Hong]]. Mentor: [http://www.marquette.edu/mscs/facstaff-krenz.shtml Dr. Gary Krenz].<br />
<br />
[[Development of Authentication and Management Systems for Systems Administration Offices]]. [[User:Cmorley|Charlie Morley]]. Mentors: [http://www.marquette.edu/mscs/facstaff-staff.shtml Steve Goodman] and [[User:Brylow|Dr. Dennis Brylow]].<br />
<br />
== Mathematics and Computer Science Education ==<br />
* [[MUzECS:Chrome|A browser-based IDE for the MUzECS platform.]] [[User:Omokolade.Hunpatin|Omokolade Hunpatin]] and [[User:Rthomas|Ryan Thomas]]. Mentor: [[User:Brylow|Dr. Dennis Brylow]].</div>AdityaSubramanian