R is a popular language and environment for statistical computing and graphics (https://www.r-project.org/about.html), especially suited for working with arrays and matrices. Hundreds of manuals can be found for different levels of users. Just to name one, the GitHub repository learningRresources covers a large set of resources for different starting levels; in its cheatsheets section you can find a visual review of caret. My experience also tells me that the best way to learn is simply to play and practice.
The swirl package provides an introduction to R by means of interactive courses. Type ?InstallCourses at the R prompt for a list of functions that will help you select a course fitted to your knowledge level and specific interests.
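A minimal sketch of that route (the install_course() call and the course name are examples of mine; check ?InstallCourses for the helpers available in your swirl version):
install.packages("swirl")
library(swirl)
?InstallCourses                    # lists the course-installation helper functions
install_course("R Programming")    # example course from the swirl course repository
swirl()                            # starts the interactive session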
The language has a compact “core” and, by means of different packages programmed by the community, it is highly extensible. Thus, it nowadays offers tools (packages) for specific computation and analysis in different domains such as bioinformatics, chemistry, computational physics, etc. Among those, the tm (text mining) package [2, 1] offers an interesting set of state-of-the-art functions to process text documents in an effective manner: creating a corpus, applying preprocessing-transformation operators, creating document-term matrices, etc. This tutorial is limited to these operations. Further, over these matrices, similar terms or documents can be clustered, and specialized machine learning packages (e.g. caret [4]) allow training models to classify new documents into pre-defined classes-annotations.
The function library("packageName") loads the functions of a package. Before that, the package needs to be installed with install.packages("packageName"). In order to install the needed packages, you will need full permission access on your computer. If you are working on the computers of our Faculty, you have root-full permission in the Windows version; be careful, this is not the case in the Linux version.
R's working directory can be consulted with getwd() and set with setwd().
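As a hedged example of this setup (the path is a placeholder you should adapt to your machine):
install.packages("tm")                 # one-time installation; needs write permission
library(tm)                            # load the package in every new session
getwd()                                # consult the current working directory
setwd("C:/Users/yourName/tm-tutorial") # hypothetical working directory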
The 20 Newsgroups dataset (LINK-ClickHere) is a popular NLP benchmark collection of approximately 20,000 newsgroup documents, partitioned into 20 predefined thematic newsgroups. This tutorial is completed with the “20news-bydate.tar.gz” compressed file: you can find its link in the middle of the webpage. Unzip the compressed file: training and testing folders are created. Each of them contains 20 folders, each containing the text documents belonging to one newsgroup. We focus the tutorial on the training folders of the “sci.electronics” and “talk.religion.misc” groups, treating them as documents belonging to two different classes-topics: science or religion.
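Assuming the “20news-bydate.tar.gz” file has already been downloaded by hand, it can also be unpacked from R itself; a sketch (the folder names below follow the usual layout of this archive and may differ in your copy):
untar("20news-bydate.tar.gz", exdir = "../data")            # creates the training and testing folders
list.dirs("../data/20news-bydate-train", recursive = FALSE) # the 20 newsgroup folders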
Help for a function in R is obtained by invoking ?VCorpus or help(VCorpus). A good way to consult the different functions (and parameters) of each package is the https://www.rdocumentation.org/ community. In our case, the functions of the tm package, grouped alphabetically, can be consulted at https://www.rdocumentation.org/packages/tm/. I consider it mandatory to develop your own text-preprocessing pipeline for your text and corpus. Searching the web for the specific function name, followed by the term “R”, usually produces helpful results.
We start by reading the documents of each subdirectory (annotated document class) and loading them into a volatile corpus structure. The inspect function displays detailed information about the corpus. First, set R's working directory to the one that holds the data-corpus, as previously explained.
library(tm)
Loading required package: NLP
sci.elec <- VCorpus(DirSource("../data/sci.electronics"), readerControl = list(language = "en"))
talk.religion <- VCorpus(DirSource("../data/talk.religion.misc"), readerControl = list(language = "en"))
sci.elec # dimension of the corpus
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 591
inspect(sci.elec[1]) # first document of the corpus, or sci.elec[[1]]
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 1
[[1]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 1994
inspect(sci.elec[1:3]) # first three documents of the corpus
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 3
[[1]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 1994
[[2]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 5406
[[3]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 2788
Transformation operators are applied to the corpus via the tm_map function, which applies (maps) a function to all elements of the corpus. As the same transformations will be applied to both the “science” and “religion” newsgroups, both corpora are merged with the base function c(), which concatenates objects. This operation yields a collection of 968 documents. The list of available transformations can be obtained by consulting the help of ?getTransformations. The function content_transformer is used to apply customized transformations. We apply several transformations below. As an NLP practitioner: consult the help and parameters of each transformation function and, of course, use them in a different and richer way than I do.
sci.rel <- c(sci.elec, talk.religion) # merge, concatenate both groups-corpuses
?getTransformations
sci.rel.trans <- tm_map(sci.rel, removeNumbers)
sci.rel.trans <- tm_map(sci.rel.trans, removePunctuation)
sci.rel.trans <- tm_map(sci.rel.trans, content_transformer(tolower)) # convert to lowercase
stopwords("english") # list of english stopwords
[1] "i" "me" "my" "myself" "we"
[6] "our" "ours" "ourselves" "you" "your"
[11] "yours" "yourself" "yourselves" "he" "him"
[16] "his" "himself" "she" "her" "hers"
[21] "herself" "it" "its" "itself" "they"
[26] "them" "their" "theirs" "themselves" "what"
[31] "which" "who" "whom" "this" "that"
[36] "these" "those" "am" "is" "are"
[41] "was" "were" "be" "been" "being"
[46] "have" "has" "had" "having" "do"
[51] "does" "did" "doing" "would" "should"
[56] "could" "ought" "i'm" "you're" "he's"
[61] "she's" "it's" "we're" "they're" "i've"
[66] "you've" "we've" "they've" "i'd" "you'd"
[71] "he'd" "she'd" "we'd" "they'd" "i'll"
[76] "you'll" "he'll" "she'll" "we'll" "they'll"
[81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
[86] "haven't" "hadn't" "doesn't" "don't" "didn't"
[91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
[96] "cannot" "couldn't" "mustn't" "let's" "that's"
[101] "who's" "what's" "here's" "there's" "when's"
[106] "where's" "why's" "how's" "a" "an"
[111] "the" "and" "but" "if" "or"
[116] "because" "as" "until" "while" "of"
[121] "at" "by" "for" "with" "about"
[126] "against" "between" "into" "through" "during"
[131] "before" "after" "above" "below" "to"
[136] "from" "up" "down" "in" "out"
[141] "on" "off" "over" "under" "again"
[146] "further" "then" "once" "here" "there"
[151] "when" "where" "why" "how" "all"
[156] "any" "both" "each" "few" "more"
[161] "most" "other" "some" "such" "no"
[166] "nor" "not" "only" "own" "same"
[171] "so" "than" "too" "very"
sci.rel.trans <- tm_map(sci.rel.trans, removeWords, stopwords("english"))
sci.rel.trans <- tm_map(sci.rel.trans, stripWhitespace)
library(SnowballC) # to access Porter's word stemming algorithm
sci.rel.trans <- tm_map(sci.rel.trans, stemDocument)
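As an example of a customized transformation built with content_transformer, the toSpace helper below (my own sketch, not part of tm) maps a regular-expression pattern to a space; note that this kind of cleaning only makes sense before removePunctuation strips the “@” and “/” characters, so it would go at the top of the pipeline above:
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
cleaned <- tm_map(sci.rel, toSpace, "\\S+@\\S+")  # wipe out e-mail addresses
cleaned <- tm_map(cleaned, toSpace, "http\\S+")   # wipe out URLs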
After transforming the corpus, a common approach in text mining is to create a document-term matrix from it. Its transpose operator creates a term-document matrix. This document-term matrix is the starting point for applying machine learning techniques such as classification, clustering, etc. Different operations can be applied over this matrix. We can obtain the terms that occur at least, say, 15 times; or consult the terms that correlate with a given term (for example, “young”) by at least a 0.7 correlation degree. After all this, it is easy to note the huge degree of sparsity of this matrix: a low proportion of non-zero entries. Thus, one of the most important operations is to remove sparse terms, i.e., terms occurring in very few documents. The sparse parameter in the removeSparseTerms function refers to the maximum sparseness allowed: the smaller this proportion, the fewer (but more common) terms will be retained. A trial-and-error approach will finally return a proper number of terms. This matrix will be the starting point for building further machine learning models (in the next tutorial).
sci.rel.dtm <- DocumentTermMatrix(sci.rel.trans)
dim(sci.rel.dtm)
[1] 968 13565
inspect(sci.rel.dtm[15:25, 1040:1044]) # inspecting a subset of the matrix
<<DocumentTermMatrix (documents: 11, terms: 5)>>
Non-/sparse entries: 0/55
Sparsity : 100%
Maximal term length: 8
Weighting : term frequency (tf)
Sample :
Terms
Docs bargain bargin bargraph bari barn
52729 0 0 0 0 0
52730 0 0 0 0 0
52731 0 0 0 0 0
52732 0 0 0 0 0
52733 0 0 0 0 0
52734 0 0 0 0 0
52735 0 0 0 0 0
52736 0 0 0 0 0
52737 0 0 0 0 0
52738 0 0 0 0 0
52739 0 0 0 0 0
findFreqTerms(sci.rel.dtm, 15)
[1] "aaron" "abandon" "abil"
[4] "abl" "abort" "absolut"
[7] "accept" "access" "accord"
[10] "account" "accur" "accuraci"
[13] "accus" "acknowledg" "across"
[16] "act" "action" "activ"
[19] "actual" "adcom" "add"
[22] "addit" "address" "admit"
[25] "advanc" "advertis" "advic"
[28] "affect" "age" "agent"
[31] "ago" "agre" "air"
[34] "alan" "alicea" "alink"
[37] "aliv" "allah" "allow"
[40] "almost" "alon" "along"
[43] "alreadi" "also" "altern"
[46] "although" "aluminum" "alungmegatestcom"
[49] "alway" "amateur" "america"
[52] "american" "among" "amorc"
[55] "amount" "amp" "amper"
[58] "amplifi" "analog" "ancient"
[61] "andor" "andrew" "angel"
[64] "anim" "anoth" "answer"
[67] "antenna" "anthoni" "anybodi"
[70] "anyon" "anyth" "anyway"
[73] "apart" "apolog" "appar"
[76] "appear" "appl" "appli"
[79] "applic" "appreci" "approach"
[82] "appropri" "apr" "april"
[85] "arcad" "archer" "area"
[88] "arent" "argu" "argument"
[91] "around" "art" "articl"
[94] "articleid" "ask" "aspect"
[97] "assembl" "assert" "assist"
[100] "associ" "assum" "assumpt"
[103] "assur" "athen" "atom"
[106] "attach" "attack" "attempt"
[109] "attent" "aucun" "audio"
[112] "australia" "author" "avail"
[115] "avoid" "awar" "away"
[118] "babak" "babb" "back"
[121] "bad" "balanc" "ball"
[124] "band" "baptist" "bare"
[127] "base" "basi" "basic"
[130] "batf" "batteri" "bbs"
[133] "beam" "bear" "beast"
[136] "becam" "becom" "begin"
[139] "behavior" "behind" "belief"
[142] "believ" "bell" "benefit"
[145] "best" "better" "beyond"
[148] "bibl" "biblic" "big"
[151] "bill" "bit" "black"
[154] "blame" "blank" "blind"
[157] "blood" "board" "bob"
[160] "bodi" "boiler" "book"
[163] "born" "bother" "bottom"
[166] "bought" "box" "branch"
[169] "brand" "break" "breaker"
[172] "brian" "brianlplarizonaedu" "brigham"
[175] "bright" "bring" "british"
[178] "broadcast" "brother" "brought"
[181] "bruce" "bskendignetcomcom" "btw"
[184] "bubblejet" "bug" "build"
[187] "built" "bulb" "bull"
[190] "bunch" "bureau" "buri"
[193] "burn" "bus" "busi"
[196] "button" "buy" "bzawutarlgutaedu"
[199] "cabl" "calcul" "california"
[202] "caligiuri" "call" "cambridg"
[205] "came" "can" "canada"
[208] "canon" "cant" "cap"
[211] "capabl" "capacitor" "car"
[214] "card" "care" "carri"
[217] "carrier" "case" "catalog"
[220] "cathol" "caus" "cec"
[223] "ceccarelli" "center" "centuri"
[226] "certain" "ceux" "challeng"
[229] "chanc" "chang" "channel"
[232] "chapter" "charact" "character"
[235] "charg" "chclevelandfreenetedu" "cheap"
[238] "cheaper" "check" "cheer"
[241] "child" "children" "chip"
[244] "choic" "choos" "chorion"
[247] "chosen" "chris" "christ"
[250] "christian" "christoph" "church"
[253] "circl" "circuit" "cite"
[256] "citi" "claim" "clarif"
[259] "clean" "clear" "cleveland"
[262] "click" "clock" "close"
[265] "cmos" "code" "coil"
[268] "cold" "collect" "colleg"
[271] "color" "colorado" "columbia"
[274] "combin" "come" "comm"
[277] "command" "comment" "commit"
[280] "common" "communic" "communiti"
[283] "compani" "compar" "compass"
[286] "compat" "compet" "complet"
[289] "compon" "comput" "concept"
[292] "concern" "conclud" "conclus"
[295] "concret" "condens" "condit"
[298] "conduct" "conductor" "conduit"
[301] "conform" "confus" "connect"
[304] "connector" "conserv" "consid"
[307] "consider" "consist" "constant"
[310] "construct" "consult" "consum"
[313] "contact" "contain" "content"
[316] "context" "continu" "contradict"
[319] "contrari" "control" "conveni"
[322] "convent" "convert" "convinc"
[325] "cookamunga" "cool" "copi"
[328] "copper" "corp" "corpor"
[331] "correct" "cost" "couldnt"
[334] "count" "countri" "coupl"
[337] "cours" "court" "cover"
[340] "coverag" "covington" "crack"
[343] "creat" "creation" "critic"
[346] "crystal" "csa" "cult"
[349] "cultur" "curious" "current"
[352] "custom" "cut" "cycl"
[355] "cylind" "dale" "damag"
[358] "dan" "danger" "daniel"
[361] "dark" "data" "date"
[364] "dave" "david" "davidian"
[367] "day" "dayton" "dead"
[370] "deal" "dean" "death"
[373] "decenso" "decent" "decid"
[376] "decis" "decod" "defin"
[379] "definit" "degre" "deiti"
[382] "delet" "demand" "demonstr"
[385] "deni" "depart" "depend"
[388] "dept" "deriv" "describ"
[391] "descript" "design" "desir"
[394] "destroy" "detail" "detect"
[397] "detector" "determin" "develop"
[400] "devic" "devil" "diagram"
[403] "dial" "didnt" "die"
[406] "differ" "differenti" "difficult"
[409] "digit" "diod" "direct"
[412] "disagre" "discharg" "disclaim"
[415] "discov" "discuss" "disk"
[418] "display" "distanc" "distinct"
[421] "distribut" "divid" "divin"
[424] "divis" "doctrin" "document"
[427] "doesnt" "done" "dont"
[430] "door" "doubt" "dram"
[433] "draw" "drill" "drive"
[436] "driver" "drop" "dsp"
[439] "dtmedincatbytebingrcom" "due" "duti"
[442] "earli" "earlier" "earth"
[445] "easi" "easier" "easili"
[448] "east" "eat" "edg"
[451] "edit" "educ" "effect"
[454] "either" "electr" "electron"
[457] "element" "els" "email"
[460] "enclos" "end" "energi"
[463] "enforc" "engin" "england"
[466] "english" "enjoy" "enough"
[469] "enter" "entir" "environ"
[472] "equal" "equip" "equival"
[475] "eric" "error" "especi"
[478] "establish" "etc" "etern"
[481] "ethic" "even" "event"
[484] "eventu" "ever" "everi"
[487] "everyon" "everyth" "evid"
[490] "evil" "exact" "exampl"
[493] "except" "excus" "exist"
[496] "expect" "expens" "experi"
[499] "expert" "explain" "explan"
[502] "explicit" "explod" "expos"
[505] "express" "extern" "extra"
[508] "extrem" "eye" "ezekiel"
[511] "face" "fact" "fail"
[514] "failur" "fair" "fait"
[517] "faith" "fall" "fals"
[520] "famili" "fan" "faq"
[523] "far" "fast" "father"
[526] "fault" "fax" "fbi"
[529] "fear" "feder" "feed"
[532] "feel" "feet" "field"
[535] "figur" "file" "fill"
[538] "filter" "final" "find"
[541] "fine" "fire" "first"
[544] "fish" "fit" "five"
[547] "fix" "fixtur" "flame"
[550] "flash" "floor" "flow"
[553] "fluke" "folk" "follow"
[556] "food" "forc" "forget"
[559] "form" "former" "fossil"
[562] "found" "four" "fpu"
[565] "frank" "frankdsuucp" "free"
[568] "freedom" "freemasonri" "frequenc"
[571] "frequent" "fri" "friend"
[574] "front" "fuel" "fulfil"
[577] "full" "function" "fundamentalist"
[580] "fuse" "futur" "gain"
[583] "game" "gas" "gate"
[586] "gave" "gay" "general"
[589] "generat" "gentil" "georgia"
[592] "germani" "get" "gfci"
[595] "gfcis" "give" "given"
[598] "gmt" "god" "goe"
[601] "gone" "good" "gordon"
[604] "gospel" "got" "gotten"
[607] "govern" "grace" "grade"
[610] "grant" "graphic" "great"
[613] "greek" "green" "ground"
[616] "group" "guarante" "guess"
[619] "guest" "guid" "gun"
[622] "guy" "hal" "halat"
[625] "half" "ham" "hand"
[628] "handl" "hang" "happen"
[631] "happi" "hard" "hardwar"
[634] "hare" "harvey" "hate"
[637] "havent" "head" "health"
[640] "hear" "heard" "heart"
[643] "heat" "heaven" "heck"
[646] "held" "hell" "hello"
[649] "help" "henc" "henri"
[652] "henryzootorontoedu" "here" "herring"
[655] "hes" "hewlettpackard" "high"
[658] "higher" "highest" "histor"
[661] "histori" "historian" "hit"
[664] "hold" "hole" "holi"
[667] "home" "homosexu" "honest"
[670] "honor" "hook" "hope"
[673] "host" "hot" "hour"
[676] "hous" "howev" "hudson"
[679] "human" "hundr" "hung"
[682] "huntsvill" "hurt" "hypocrit"
[685] "ibm" "idea" "ideal"
[688] "ident" "ignor" "ill"
[691] "illeg" "imag" "imagin"
[694] "immedi" "immor" "imped"
[697] "implement" "impli" "import"
[700] "impress" "improv" "inc"
[703] "inch" "includ" "increas"
[706] "inde" "independ" "indic"
[709] "individu" "induct" "industri"
[712] "inerr" "info" "inform"
[715] "iniqu" "initi" "ink"
[718] "innoc" "input" "inreplyto"
[721] "insert" "insid" "inspect"
[724] "inspector" "instal" "instanc"
[727] "instead" "institut" "instruct"
[730] "instrument" "insul" "insur"
[733] "intel" "intellig" "intend"
[736] "intent" "interest" "interfac"
[739] "interfer" "intergraph" "intern"
[742] "internet" "interpret" "involv"
[745] "iran" "iranian" "islam"
[748] "isnt" "isol" "israelit"
[751] "issu" "ive" "jack"
[754] "jame" "jason" "jeff"
[757] "jehcmkrnlcom" "jesus" "jew"
[760] "jewish" "jim" "job"
[763] "joel" "john" "johnson"
[766] "join" "jose" "joseph"
[769] "josephus" "joslin" "joslinpogoisppittedu"
[772] "joystick" "juda" "judg"
[775] "jump" "just" "justic"
[778] "justifi" "keep" "keith"
[781] "ken" "kendig" "kent"
[784] "kevin" "key" "keyboard"
[787] "keyword" "khz" "kid"
[790] "kill" "kind" "king"
[793] "kingdom" "kit" "klero"
[796] "knew" "know" "knowledg"
[799] "known" "kolstad" "koresh"
[802] "ksand" "lab" "label"
[805] "laboratori" "lack" "lake"
[808] "lamp" "land" "languag"
[811] "larg" "larri" "laser"
[814] "last" "latch" "late"
[817] "later" "latest" "latter"
[820] "law" "lay" "lds"
[823] "lead" "leader" "learn"
[826] "least" "leav" "led"
[829] "lee" "left" "legal"
[832] "less" "let" "letter"
[835] "level" "lewi" "lexicon"
[838] "lie" "life" "light"
[841] "like" "limit" "line"
[844] "link" "list" "listen"
[847] "liter" "littl" "live"
[850] "load" "local" "locat"
[853] "log" "logic" "long"
[856] "longer" "look" "loop"
[859] "lord" "lose" "lost"
[862] "lot" "loui" "love"
[865] "low" "lower" "lucif"
[868] "luck" "luke" "lung"
[871] "mac" "machin" "made"
[874] "magazin" "magi" "magic"
[877] "mail" "main" "maintain"
[880] "major" "make" "malcolm"
[883] "man" "manag" "mani"
[886] "mankind" "manual" "manufactur"
[889] "mark" "martin" "mason"
[892] "master" "mat" "materi"
[895] "mathew" "matt" "matter"
[898] "matthew" "max" "may"
[901] "mayb" "mayhew" "mcconki"
[904] "mean" "measur" "mechan"
[907] "media" "medicin" "medin"
[910] "meet" "melt" "member"
[913] "memori" "men" "mention"
[916] "mere" "meritt" "messag"
[919] "messeng" "metal" "meter"
[922] "method" "mhz" "michael"
[925] "microcontrol" "microphon" "middl"
[928] "might" "mike" "militari"
[931] "mind" "mine" "ministri"
[934] "minor" "minut" "miracl"
[937] "miss" "mistak" "mit"
[940] "mithra" "mix" "mixer"
[943] "mleepostroyalroadsca" "mmwunixmitreorg" "mobil"
[946] "mode" "model" "modem"
[949] "modern" "modifi" "modul"
[952] "moment" "money" "monitor"
[955] "month" "moral" "mormon"
[958] "most" "mother" "motor"
[961] "motorola" "mount" "move"
[964] "movement" "much" "muhammad"
[967] "multipl" "murder" "muslim"
[970] "must" "mysteri" "name"
[973] "nasa" "nation" "natur"
[976] "nazi" "near" "neat"
[979] "nec" "necessari" "necessarili"
[982] "need" "negat" "neighbor"
[985] "neither" "net" "netcom"
[988] "network" "neutral" "never"
[991] "new" "news" "newsgroup"
[994] "newssoftwar" "next" "nice"
[997] "night" "nntppostinghost" "nois"
[1000] "none"
[ reached getOption("max.print") -- omitted 637 entries ]
findAssocs(sci.rel.dtm, term = "young", corlimit = 0.7)
$young
brigham antip apost
0.90 0.77 0.77
aptitud auxiliari balloney
0.77 0.77 0.77
bequeath carthag categor
0.77 0.77 0.77
claimant cosi cowderi
0.77 0.77 0.77
doityourself dukaki excommun
0.77 0.77 0.77
exot fbis forbad
0.77 0.77 0.77
glacier heir hoover
0.77 0.77 0.77
hyram hyrum intensifi
0.77 0.77 0.77
jail largescal mammon
0.77 0.77 0.77
mcclari mcclarycsnpqkbnetcomcom mcclarynetcomcom
0.77 0.77 0.77
militarist nauvoo penal
0.77 0.77 0.77
personnel pledg plural
0.77 0.77 0.77
preposter pseudographia purif
0.77 0.77 0.77
quorum rekhabit reorgan
0.77 0.77 0.77
resili retort rlds
0.77 0.77 0.77
selfprotect sorenson staf
0.77 0.77 0.77
tabloid tobacco tought
0.77 0.77 0.77
underag undertak unifi
0.77 0.77 0.77
upbring vacanc vikingiastateedu
0.77 0.77 0.77
wedg zdanexnetiastateedu latterday
0.77 0.77 0.75
salt sect isscckvmbyuedu
0.75 0.74 0.73
splinter smith lds
0.73 0.71 0.70
sci.rel.dtm.70 <- removeSparseTerms(sci.rel.dtm, sparse = 0.7)
sci.rel.dtm.70 # or dim(sci.rel.dtm.70)
<<DocumentTermMatrix (documents: 968, terms: 14)>>
Non-/sparse entries: 7191/6361
Sparsity : 47%
Maximal term length: 15
Weighting : term frequency (tf)
# note that the document-term matrix needs to be cast
# to a plain matrix form for the following barplot command
barplot(as.matrix(sci.rel.dtm.70),
xlab = "terms", ylab = "number of occurrences",
main = "Most frequent terms (sparseness=0.7)"
)
sci.rel.dtm.80 <- removeSparseTerms(sci.rel.dtm, sparse = 0.8)
sci.rel.dtm.80
<<DocumentTermMatrix (documents: 968, terms: 32)>>
Non-/sparse entries: 11219/19757
Sparsity : 64%
Maximal term length: 15
Weighting : term frequency (tf)
sci.rel.dtm.90 <- removeSparseTerms(sci.rel.dtm, sparse = 0.9)
sci.rel.dtm.90
<<DocumentTermMatrix (documents: 968, terms: 118)>>
Non-/sparse entries: 22176/92048
Sparsity : 81%
Maximal term length: 15
Weighting : term frequency (tf)
save(list = c("sci.rel.dtm.80", "sci.rel.dtm.90"), file = "../data/sci.rel.dtm.Rdata")
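In a later session, the saved matrices can be recovered with load(), which restores the objects under their original names:
load("../data/sci.rel.dtm.Rdata") # restores sci.rel.dtm.80 and sci.rel.dtm.90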
Different functionalities of R allow converting these matrices to the file formats demanded by other software tools. For example, the write.arff function of the foreign package converts a data matrix or data frame to the well-known ARFF (“attribute-relation file format”) of the WEKA software. Note that a class vector is appended to the document-term matrix, labeling the type of each document. In my case (the totals can be different in your corpus), the first 591 documents cover the “science-electronics” newsgroup.
data <- data.frame(as.matrix(sci.rel.dtm.90)) # convert the document-term matrix to data frame format
type <- c(rep("science", 591), rep("religion", 377)) # create the type vector to be appended
# install the package that provides the conversion function
install.packages("foreign")
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:
https://cran.rstudio.com/bin/windows/Rtools/
Installing package into ‘C:/Users/julet/Documents/R/win-library/4.1’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.1/foreign_0.8-81.zip'
Content type 'application/zip' length 332324 bytes (324 KB)
downloaded 324 KB
package ‘foreign’ successfully unpacked and MD5 sums checked
Warning in install.packages :
cannot remove prior installation of package ‘foreign’
Warning in install.packages :
problem copying C:\Users\julet\Documents\R\win-library\4.1\00LOCK\foreign\libs\x64\foreign.dll to C:\Users\julet\Documents\R\win-library\4.1\foreign\libs\x64\foreign.dll: Permission denied
Warning in install.packages :
restored ‘foreign’
The downloaded binary packages are in
C:\Users\julet\AppData\Local\Temp\RtmpiY4vdf\downloaded_packages
library(foreign)
write.arff(cbind(data, type), file = "../data/term-document-matrix-weka-format.arff")
# you can try to open the new .arff file in WEKA...
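If WEKA is not at hand, the same annotated data frame can be exported with base R functions; for example, to a CSV file (the file name is an arbitrary choice of mine):
write.csv(cbind(data, type), file = "../data/term-document-matrix.csv", row.names = FALSE)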
Before starting to learn the exposed machine learning models, let's build a wordcloud with the following package [3]. Its wordcloud() command needs the list of words and their frequencies as parameters. As the words appear in the columns of the document-term matrix, the colSums command is used to calculate the word frequencies. To complete the needed calculations, note that the document-term matrix needs to be transformed (cast) to matrix form with the as.matrix operator. The cloud is built for the “talk.religion” newsgroup, which covers rows 591 to 968 of the document-term matrix. Play with the options-parameters of the wordcloud function: it offers many possibilities to make the wordcloud more attractive; discover them by yourself.
library(wordcloud)
Loading required package: RColorBrewer
# calculate the frequency of words and sort in descending order.
wordFreqs <- sort(colSums(as.matrix(sci.rel.dtm.90)[591:968, ]), decreasing = TRUE)
wordcloud(words = names(wordFreqs), freq = wordFreqs)
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
one could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
organ could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
can could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
univers could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
make could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
world could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
nntppostinghost could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
thing could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
know could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
articl could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
just could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
point could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
good could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
year could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
consid could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
interest could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
distribut could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
said could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
probabl could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
case could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
mani could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
now could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
john could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
number could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
mean could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
fact could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
think could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
place could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
follow could not be fit on page. It will not be plotted.
Warning in wordcloud(words = names(wordFreqs), freq = wordFreqs) :
peopl could not be fit on page. It will not be plotted.
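As suggested above, wordcloud() accepts many appearance parameters. A sketch with some of them (the concrete values are arbitrary choices of mine; RColorBrewer is already loaded as a dependency of wordcloud):
wordcloud(
  words = names(wordFreqs), freq = wordFreqs,
  max.words = 75,                    # keep only the 75 most frequent terms
  random.order = FALSE,              # plot the most frequent terms in the centre
  scale = c(3, 0.5),                 # shrink the font range so large words fit the page
  colors = brewer.pal(8, "Dark2")    # colour words by frequency
)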
It is also easy to retrieve text in HTML format from the web by means of R functionalities. A simple example is retrieving the “Indemnifications” section of the “Terms of Service” webpage of Google Analytics. The readLines command outputs a vector of character strings, each component storing a line. The text of interest spans lines 302 through 325. The HTML tags are then removed. Now, by means of the tm package, the vector of character strings can be converted into a corpus of text by using the VectorSource and VCorpus functions consecutively.
indemnifications.HTML.page <- readLines("https://www.google.com/analytics/terms/us.html")
length(indemnifications.HTML.page)
[1] 1786
text.data <- indemnifications.HTML.page[302:325] # 24 lines of desired text
text.data <- gsub(pattern = "<p>", replacement = "", x = text.data)
text.data <- gsub(pattern = "</p>", replacement = "", x = text.data)
text.data <- gsub(pattern = "</h2>", replacement = "", x = text.data)
text.data <- gsub(pattern = "<h2>", replacement = "", x = text.data)
text.data
[1] " <li class=\"dd-menu__nav-item\" aria-level=\"2\">"
[2] " <a data-g-event=\"terms: us\" data-g-action="
[3] " \"for enterprises: analytics 360\" data-g-label="
[4] " \"global nav\" href=\"/about/analytics-360/\""
[5] " class=\"dd-menu__nav-item__link\"><span class="
[6] " \"dd-menu__nav-item__title\">Analytics 360</span>"
[7] " <p class=\"dd-menu__nav-item__desc\">"
[8] " Use advanced tools to get a deeper"
[9] " understanding of your customers so you can"
[10] " deliver better experiences."
[11] " </a>"
[12] " </li>"
[13] " <li class=\"dd-menu__nav-item\" aria-level=\"2\">"
[14] " <a data-g-event=\"terms: us\" data-g-action="
[15] " \"for enterprises: data studio\" data-g-label="
[16] " \"global nav\" href=\"/about/data-studio/\" class="
[17] " \"dd-menu__nav-item__link\"><span class="
[18] " \"dd-menu__nav-item__title\">Data Studio</span>"
[19] " <p class=\"dd-menu__nav-item__desc\">"
[20] " Unlock insights from your data with engaging,"
[21] " customizable reports."
[22] " </a>"
[23] " </li>"
[24] " <li class=\"dd-menu__nav-item\" aria-level=\"2\">"
text.data <- VectorSource(text.data) # interpreting each element of the vector as a document
text.data.corpus <- VCorpus(text.data) # from document list to corpus
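Instead of removing each HTML tag with a separate gsub call, a single regular expression can strip any tag; a rough sketch (it ignores HTML entities and malformed markup):
text.data.clean <- gsub(pattern = "<[^>]+>", replacement = "", x = indemnifications.HTML.page[302:325])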
Another common way to retrieve text is through the examples stored by R packages. In this case, when installing the tm package, a set of document collections is stored on disk; among them, a subset of the popular “Reuters-21578” collection. The system.file function points to the subdirectory tree that holds the documents of interest in our package. When reading the text, a specific XML reader developed by the community, known as “readReut21578XMLasPlain”, is needed.
reut21578 <- system.file("texts", "crude", package = "tm")
reut21578
[1] "C:/Users/julet/Documents/R/win-library/4.1/tm/texts/crude"
reuters <- VCorpus(DirSource(reut21578), readerControl = list(reader = readReut21578XMLasPlain))
A fashionable way to retrieve text is via Twitter posts. The twitteR library provides access to Twitter data; Twitter marks its use as the ‘official’ way to download its tweets. (In case you have problems with the twitteR package, I show you an alternative at the end of this subsection.)
Let me first explain the ‘official’ way offered by the twitteR library. The Twitter API requires identification-authentication: follow these instructions: http://thinktostart.com/twitter-authentification-with-r/.
Pay attention: the current Twitter link to create Twitter applications is https://developer.twitter.com/en/apps. You need to be logged in. You then need to “create a new app”: this provides you with a set of 4 items related to the application, called “consumerKey”, “consumerSecret”, “accessToken” and “accessSecret”. Both “accessToken” and “accessSecret” need to be activated after receiving the “consumerKey” and “consumerSecret”. The four parameters are used in the final authentication function call in R, setup_twitter_oauth().
An alternative explanation of the exposed process, by Twitter itself: https://dev.twitter.com/oauth/overview/application-owner-access-tokens also covers Twitter's identification-authentication process, which hangs from the general Twitter Developer Documentation at https://developer.twitter.com/en/docs/authentication/overview.
At first glance, this process can feel cumbersome and complex. Be patient; after several trials you will surely be able to authenticate. If the process fails, try installing the httr and base64enc packages; if the error continues, install the twitteR package from its GitHub repository.
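A sketch of that last resort (the geoffjentry/twitteR repository name is my assumption of where the package lives; remotes can be replaced by devtools):
install.packages(c("httr", "base64enc", "remotes"))
remotes::install_github("geoffjentry/twitteR")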
library(twitteR)
source("twitter_oauth.R")
setup_twitter_oauth(consumer_key, consumer_secret, access_key, access_secret)
[1] "Using direct authentication"
While it is true that you can find many short guides on establishing the connection with Twitter in order to start downloading its text, I just offer you another pointer: https://www.kdnuggets.com/2017/11/extracting-tweets-r.html
Once the authentication is done, tweets of any user or hashtag can be retrieved and converted to a corpus. The functions provided by the twitteR package evolve continuously, and you will surely find interesting functions for your NLP analysis objectives.
Just an idea: to further apply machine learning techniques and learn supervised classifiers, our corpus needs to contain documents with different annotations-types. By confronting two antagonistic users and downloading their tweets, an annotated, binary corpus can be constructed. After preprocessing and converting the corpus to a data frame format, supervised classification techniques can be applied over it.
# consult the different ways to retrieve tweets: from a user or from a hashtag
?userTimeline
?searchTwitteR
LadyGagaTweets <- userTimeline(user = "ladygaga", n = 100)
CristianoRonaldoTweets <- userTimeline(user = "Cristiano", n = 100)
# beerTweets = searchTwitteR(searchString = "#beer", n=100)
length(LadyGagaTweets)
[1] 87
LadyGagaTweets[1:4]
[[1]]
[1] "ladygaga: Last chance to shop the @hauslabs end-of-year sale with promo code BYE2021 💕 https://t.co/16HKLF1qsu https://t.co/E5UHLEaBIa"
[[2]]
[1] "ladygaga: The @hauslabs end-of-year sale is happening now! Use promo code BYE2021 on my favorites like the gel pencil eyeline… https://t.co/heqFIwkCXo"
[[3]]
[1] "ladygaga: My friend Dr. Paul Conti wrote a great book about trauma, and I was proud to write the foreword for him. This book… https://t.co/dB8XqkMJRa"
[[4]]
[1] "ladygaga: Nightmare Alley is an amazing film with an amazing cast, congratulations to @RealGDT. Bradley is spectacular, Cate… https://t.co/DrGVhx7WI5"
# convert each tweet to a data frame and combine them by rows
LadyGagaTweetsDataFrames <- do.call("rbind", lapply(LadyGagaTweets, as.data.frame))
CristianoRonaldoTweetsDataFrames <- do.call("rbind", lapply(CristianoRonaldoTweets, as.data.frame))
# consult the different attributes of these objects using the $ symbol
LadyGagaTweetsDataFrames$text
[1] "Last chance to shop the @hauslabs end-of-year sale with promo code BYE2021 💕 https://t.co/16HKLF1qsu https://t.co/E5UHLEaBIa"
[2] "The @hauslabs end-of-year sale is happening now! Use promo code BYE2021 on my favorites like the gel pencil eyeline… https://t.co/heqFIwkCXo"
[3] "My friend Dr. Paul Conti wrote a great book about trauma, and I was proud to write the foreword for him. This book… https://t.co/dB8XqkMJRa"
[4] "Nightmare Alley is an amazing film with an amazing cast, congratulations to @RealGDT. Bradley is spectacular, Cate… https://t.co/DrGVhx7WI5"
[5] "There's glam, then there's #ItalianGlam 🇮🇹 The @hauslabs Casa Gaga Highlighter, All-Over Rouge & Mini Lipsticks are… https://t.co/dtRznNpVH6"
[6] "🎺🎹🤍 https://t.co/tOsDb2mIWP"
[7] "Thank you @MTV for such a special night celebrating Love For Sale ❤️✨ If you weren’t able to tune in to our… https://t.co/B9IDlWtEfR"
[8] ". @BTWFoundation 💕💕💕https://t.co/Ro58UCkzZd"
[9] "Tune in tonight to celebrate #LoveForSale with @itstonybennett and me during our #MTVUnplugged special! Join us at… https://t.co/iB9GfveFNr"
[10] "#ItalianGlam 🇮🇹❤️ @hauslabs https://t.co/tPWgGhWLLa https://t.co/JEz4HHtw40"
[11] "The brand new Lady Gaga Music Pack is available now on @BeatSaber for @Oculus Quest 2! Experience handmade levels f… https://t.co/vqbSsriKUV"
[12] "Thank you @RadioCity! 🥰❤️✨ @itstonybennett #LoveForSale https://t.co/XP4GZgp8Nt"
[13] "I brought the @hauslabs Tanti Baci Extreme Creme Mini Lipsticks to Italy while filming #HouseofGucci 💋 5 creamy & l… https://t.co/6xUlWNPaNK"
[14] "The limited edition “Love For Sale” cream vinyl is now available, only at Amazon! @itstonybennett @amazonmusic ❤️… https://t.co/B2jr4f1O51"
[15] "@hauslabs @NikkieTutorials 😆"
[16] "Join me in donating to @BTWFoundation for #GivingTuesday. Your donation will help support youth mental wellness and… https://t.co/RWSClpBcIv"
[17] "Last chance to shop the best @hauslabs sale of the year! Every purchase comes with a free gift 💓… https://t.co/HMUp25PxWH"
[18] "My favorite @hauslabs products are on sale now for Cyber Monday 💕 Each purchase comes with a free LE MONSTER MATTE… https://t.co/g6BeRuvglQ"
[19] "The real “Lady” of the hour is Susan Benedetto, Tony’s remarkable wife . Thank you Susan for how you love… https://t.co/oPgO07wXdp"
[20] "Join @itstonybennett and me tonight at 8pm ET/PT on CBS! #OneLastTime 🤍 https://t.co/9EUdIlJr3m"
[21] "I just want to say how grateful I am for every person on earth today, the good the bad the in between. We all need… https://t.co/AeUIoZSzvQ"
[22] "Happy Thanksgiving from the Guccis. #HouseOfGucci is now playing in movie theaters in the US 🎟🎥 Get your tickets at… https://t.co/DoFDhOi0R0"
[23] "Domani, sweeties 💜 Tickets available now at https://t.co/RnqTUU7KOk 🎟 #HouseOfGucci https://t.co/cCO3k99mdJ"
[24] "Tune in tonight at 11:35/10:35c! 🥰 @itstonybennett @colbertlateshow @stephenathome https://t.co/LFmMKuSmVL"
[25] "Thank you ❤️ @RecordingAcad @itstonybennett https://t.co/yXcv9r34c5"
[26] "Pre-order the limited edition “Love For Sale” cream vinyl, available only at Amazon! @amazonmusic 🎷🤍… https://t.co/kc1mFKll0S"
[27] "Experience Italian Glam with the @hauslabs Tutti Gel-Powder All Over Rouge 🇮🇹 Warm up the gel-to-powder formula and… https://t.co/kJqEvEnEPd"
[28] "We designed our @hauslabs ITALIAN GLAM HIGHLIGHTER BRUSH to work expertly with our TUTTI GEL-POWDER HIGHLIGHTER 🥰 S… https://t.co/v21dYgGbZP"
[29] "Systemic oppression is evil and destroys the world #Rittenhouse"
[30] "Final #HouseOfGucci premiere last night in Los Angeles 💖 https://t.co/9jHxkL4cX2"
[31] ".@THR November 2021 https://t.co/Y6rlTxKEj1 https://t.co/aQkw2dD5mn"
[32] "Famiglia. #HouseOfGucci https://t.co/pUTuxzZXHr"
[33] "New York #HouseOfGucci premiere at Jazz at Lincoln Center, a few blocks from where this Italian-American girl grew… https://t.co/NUJGmvk39m"
[34] "Nuova York https://t.co/OsrX8HppR4"
[35] "I cried all day doing press in Milan. I am so grateful and humbled to be in our movie #HouseOfGucci. Coming Thanksg… https://t.co/MBdCVojPQV"
[36] "@ScarletEnvyNYC I love you legend!!!⚡️🚨🚨🚨"
[37] "I have loved @britneyspears her whole career. I looked up to her, admired her strength—she empowered so many people… https://t.co/Qgr4Dvew8f"
[38] "Stream now on Facebook Watch and find ways to channel kindness toward yourself and others at https://t.co/4BRgCgh1jx"
[39] "In honor of World Kindness Day, meet a powerful group of young people as they join @BTWFoundation, @DrAlfiee, and m… https://t.co/n7hPu4DIxs"
[40] "#HouseOfGucci 🇬🇧💜 https://t.co/HSSmRWKjFu"
[41] "WELCOME TO CASA GAGA. New, luxurious formulas designed to celebrate Italian Glam 🇮🇹 The full collection is availabl… https://t.co/6S10Qpmr0M"
[42] "Watch my #LifeInLooks to see 19 of my looks over time with @britishvogue 🤍 https://t.co/6UIj8opTOg https://t.co/XncyZ79SoW"
[43] "Stay tuned Friday for our full conversation, to be shared on my official Facebook page and https://t.co/4BRgCgh1jx … https://t.co/jMr4yHiQJI"
[44] "I had the honor of spending time with young people who were vulnerable with me and each other about their journeys… https://t.co/U0tvIA0ikB"
[45] "Introducing the @hauslabs TUTTI GEL-POWDER ALL OVER ROUGE. A gel to powder formula that feels velvety & weightless… https://t.co/wdueEexyKR"
[46] "The Gucci family had it all. #HouseOfGucci Only in theaters this Thanksgiving. https://t.co/2yr5LyHNEe"
[47] ".@Vogue_Italia November 2021 https://t.co/8AdwYlv3AS https://t.co/7ufW883Z0L"
[48] ".@BritishVogue December 2021 https://t.co/bU9j5PbBdA https://t.co/WYfmtCQpY1"
[49] "Introducing the brand-new, high-performance, luxe formulas of the CASA GAGA ITALIAN GLAM COLLECTION for eyes, lips… https://t.co/kGsLo8hpF1"
[50] ".@itstonybennett and I took over @AppleMusic’s Singer’s Delight playlist! Listen to some of our favorite songs now… https://t.co/mEY1PxS8fB"
[51] "Watch the “Night And Day” music video, out now! 🎺🎹 @itstonybennett https://t.co/uu1nirDKJg https://t.co/eUeArfT0Sz"
[52] "One Last Time: An Evening With Tony Bennett & Lady Gaga 🤍 Join @itstonybennett and me on Sunday, November 28th at 8… https://t.co/MV3o0Amr8V"
[53] "Patrizia always gets what she wants. 🙏🔮 #HouseOfGucci https://t.co/mZoap1wKC8"
[54] "A new trailer, sweeties 💋 #HouseOfGucci – only in theaters This Thanksgiving https://t.co/4zVhazUo2N"
[55] "Something new tomorrow... sogni d'oro 🌙 #HouseOfGucci https://t.co/EPbcihIg4H"
[56] "Backstage last night at Jazz & Piano, wearing the @hauslabs Love For Sale Shadow Palette 🥰🎺🎶 https://t.co/8UmHyFf9EE https://t.co/tb0t7uSHnJ"
[57] "Thank you @MTVEMA! 💓 https://t.co/HIdnvCrEdr https://t.co/Uljyi7J3X2"
[58] "🥰🎺🎹 #GagaVegas #LoveForSale https://t.co/8zO46rZf5t"
[59] "Dream dancing with you ✨ @itstonybennett #LoveForSale https://t.co/PhOwFEOdvJ https://t.co/OeUlRJqM9k"
[60] "Shop “Love For Sale” vinyl, CDs, and cassettes at your local record store to support indie retailers ❤️ Find your n… https://t.co/zsNHfLQpEZ"
[61] "It was an honor to speak with @andersoncooper on @60Minutes about my beloved friend @itstonybennett. The full story… https://t.co/uNAQM1usLI"
[62] "Tune in to @60Minutes tonight at 7pm ET/PT on CBS for my conversation with @andersoncooper about Love For Sale and… https://t.co/wFFy7S5W2J"
[63] "For a limited time, the “Love For Sale” digital album is available on @amazonmusic for $5 😊 https://t.co/PpIRO3cdRJ https://t.co/fmtAnRLmlb"
[64] "Download our new album “Love For Sale” on iTunes! @itstonybennett 🎺🎹🎶 https://t.co/yiM7Bnvr5Q https://t.co/ioAMSvdHfM"
[65] "Shop the limited edition vinyl + CD bundle, only available until Monday! Includes the exclusive yellow “Love For Sa… https://t.co/ywIvLeEKUG"
[66] "An exclusive picture disc vinyl is available only at @Walmart 🎹 @itstonybennett https://t.co/aUGmnG94jU https://t.co/rJyFzwryaP"
[67] "Shop the #LoveForSale vinyl and CD at @Target that come with an exclusive cover 🎶 @itstonybennett… https://t.co/SbgFx2U3U7"
[68] "#LoveForSale is available now at @BNBuzz online and in stores 🥰 @itstonybennett https://t.co/mn536C5g1c https://t.co/eFl2Tflh6s"
[69] "Listen to “I’ve Got You Under My Skin” on @AppleMusic! @itstonybennett https://t.co/4CYme1yowi https://t.co/n6Knfw5DS5"
[70] "#LoveForSale 🎺 @itstonybennett @amazonmusic https://t.co/JzTxjqlGKK https://t.co/BmmvTTYElO"
[71] "#LoveForSale 🥳 @itstonybennett @Spotify https://t.co/UY6ihTzesw https://t.co/wqHztv7Vvf"
[72] "In case you weren’t able to tune in to my performance online or at a Westfield fan zone yesterday, it’s available t… https://t.co/zJzwoHbKHL"
[73] "New “Love For Sale” merch is available in my official store to shop with the album on CD, vinyl, and more 🎺❤… https://t.co/54J7IRGtua"
[74] "“Love For Sale” is available everywhere now! ❤️ And the music video for “I’ve Got You Under My Skin” is out tomorro… https://t.co/YVxtyJC3tD"
[75] "Watch #FirstListenGagaBennett live now on @applemusic! 🥰 @itstonybennett https://t.co/xfWmnfajnL https://t.co/DTJhpTmiqU"
[76] "The fourth and final exclusive “Love For Sale” alternate CD cover is available to pre-order in limited quantities!… https://t.co/29kgyNeYuF"
[77] "#FirstListenGagaBennett hosted by @zanelowe is tonight! I can’t wait for you to hear our new songs from “Love For S… https://t.co/0pCiy9sLXB"
[78] "i love you tony and susan. in just under 5 hrs the world can listen to \"Love For Sale,\" our last album. my greates… https://t.co/LkrpEQiAe3"
[79] "Join me today for my exclusive online performance brought to you by Westfield in celebration of “Love For Sale”! Tu… https://t.co/tJfXwcu4RT"
[80] "This Thanksgiving, Join The Family. #HouseOfGucci https://t.co/BhLTchstXe"
[81] "Less than 2 days until “Love For Sale” is available everywhere!! 🥳🥳🥳 @itstonybennett The third limited-edition alte… https://t.co/PF6Mbc4SNB"
[82] "Pre-add “Love For Sale” on @AppleMusic to listen to the album in Spatial Audio on Friday 🥰 @itstonybennett… https://t.co/v9qXtoZask"
[83] "Join us in celebrating the magic of jazz and iconic glam. I am proud to announce that $1 from every palette sold on… https://t.co/zgpUCix7n3"
[84] "My new limited-edition @hauslabs LOVE FOR SALE SHADOW PALETTE is available now globally on https://t.co/lfX4GLwMiF,… https://t.co/rETXKDoqC3"
[85] "Our new album “Love For Sale” is only 3 days away! 🎺❤️ A full album trailer is out now on my YouTube channel and fe… https://t.co/sRkiZbYnZI"
[86] "Tune in to #FirstListenGagaBennett with @AppleMusic this Thursday hosted by @zanelowe to hear new songs from Love F… https://t.co/Ljva6LtqaK"
[87] "The very special, limited-edition LOVE FOR SALE SHADOW PALETTE is available NOW, exclusively on IG Shop. I am so ex… https://t.co/vZC6t70PNE"
# combine both frames in a single, binary, annotated set
annotatedTweetsDataFrames <- rbind(LadyGagaTweetsDataFrames, CristianoRonaldoTweetsDataFrames)
# interpreting each element of the annotated vector as a document
annotatedDocuments <- VectorSource(annotatedTweetsDataFrames$text)
# convert to a corpus: supervised classification to be applied in future steps
annotatedCorpus <- VCorpus(annotatedDocuments)
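Following the idea above, a hedged sketch of turning that corpus into a labeled data frame ready for a supervised classifier (the preprocessing is deliberately minimal; enrich it with the transformations of the previous sections):
annotatedCorpus <- tm_map(annotatedCorpus, content_transformer(tolower))
annotatedCorpus <- tm_map(annotatedCorpus, removePunctuation)
tweets.dtm <- DocumentTermMatrix(annotatedCorpus)
# class vector: one label per tweet, in the same order used by rbind above
tweetType <- c(
  rep("ladygaga", nrow(LadyGagaTweetsDataFrames)),
  rep("cristiano", nrow(CristianoRonaldoTweetsDataFrames))
)
tweets.data <- cbind(data.frame(as.matrix(tweets.dtm)), type = tweetType)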
In case you had problems with the twitteR package, an attractive and easy-to-use alternative to Twitter's ‘official rules’ is the rtweet package, which seems to be more actively updated. You need a Twitter account (username and password). Its documentation offers easy-to-follow tutorials showing the pipeline that you need.
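A hedged sketch with rtweet (the function names correspond to the rtweet versions I have used; the package's authentication flow has changed across versions, so check its current documentation):
# install.packages("rtweet")
library(rtweet)
# an interactive browser-based authentication is triggered on the first call
gagaTweets <- get_timeline("ladygaga", n = 100) # tweets from a user
beerTweets <- search_tweets("#beer", n = 100)   # tweets matching a hashtag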
[1] Ingo Feinerer. tm: Text Mining Package, 2012. R package version 0.5-7.1.
[2] Ingo Feinerer, Kurt Hornik, and David Meyer. Text mining infrastructure in R. Journal of Statistical Software, 25(5):1–54, 3 2008.
[3] Ian Fellows. wordcloud: Word Clouds, 2014. R package version 2.5.
[4] M. Kuhn and K. Johnson. Applied Predictive Modeling. Springer, 2013.