German lecturer with Ph.D., ..., I started using Python and R, and stumbled upon emojis ...

Friday, October 28, 2022

Punctuation in tweets compared, with R: ! 👏 👏 👏

We might think that in chats the full stop is needed less than in other written texts, since the end of a sentence may coincide with the end of a statement, which is sufficiently marked by the "send" button.

The ellipsis ("..."), by contrast, could be more frequent, as a means of being fast and allusive when communicating within a social group.

First try: looking for tweets with full stops

Searching for tweets with full stops, we might get a first impression of the distribution of punctuation marks. I ran tweet searches (22-28 October 2022, n = 1000) in German, Italian, and Polish.

German

From the general feature frequency table,

textstat_frequency(matrix, n=20)

feature frequency docfreq 

.            783          427 


Does this mean that among 1000 tweets searched with keyword = ".", only 427 documents really contain the mark? Something strange is happening here in the count of punctuation marks (see below, "Technically").


At least one result is clear: the full stop is still there. By far not all the "."s have been absorbed by "...".


The rest of the table gives an impression of the situation. Nearly a fourth of the "." tweets also use the ellipsis sign, usually once per tweet.


 ,         568    354 

 :         502    412 

 rt        350    350 

…       227     225 
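To read the two columns, it may help to recall that in the textstat_frequency output, frequency counts every occurrence of a feature, while docfreq counts the documents in which it occurs at least once. A minimal Python sketch with the German figures from the table above; the variable names are only illustrative:

```python
# (frequency, docfreq) pairs copied from the German table above.
german = {
    ".": (783, 427),
    ",": (568, 354),
    ":": (502, 412),
    "…": (227, 225),
}

# Average number of occurrences per document that contains the feature.
per_doc = {mark: freq / docfreq for mark, (freq, docfreq) in german.items()}

for mark, value in per_doc.items():
    print(mark, round(value, 2))
```

A value close to 1 for the ellipsis fits the observation that it usually appears once per tweet, while the full stop appears almost twice per tweet that contains it.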


In Italian, full stops are less frequent, and so are "...", although the ratio "."/"..." is quite similar (783/227 = 3.45 against 537/195 = 2.75).


.          537     281

:         431     375

 ,        345     220

…       195     183

 !       121     77


In Polish, we have fewer full stops (500) than colons (555), while "…" does not appear among the first twelve.


Obviously, having searched for tweets with ".", we do not get a picture of the real frequency of full stops in tweets.


A little surprise, though, appears when we consider skipgrams (4, 2:4).


In German, the most frequent ones are


1 🫶 🫶 🫶 🫶

2 🌹 🌹 🌻 🌻


In Italian, the first one is

! 👏 👏 👏 194 


And in Polish, we see

🌱 ✨ 💚 ✨


The scene is dominated by emojis. In the Italian result, it may even seem that the exclamation mark has been soaked up by the emojis, becoming an emoji itself.
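For readers unfamiliar with skipgrams, here is a minimal Python sketch of the idea: an n-gram whose neighbouring tokens may be separated by a number of skipped tokens. I assume here that "(4, 2:4)" means four tokens per gram with 2 to 4 tokens skipped between neighbours; the function name and the toy tweet are hypothetical, and the sketch does not claim to reproduce the quanteda output exactly.

```python
from collections import Counter
from itertools import product


def skipgrams(tokens, n, skips):
    """Return all n-token skipgrams of `tokens`.

    Each gap between two neighbouring tokens of a gram skips k tokens,
    where k is taken independently from `skips` (an assumption of this sketch).
    """
    grams = []
    for start in range(len(tokens)):
        for gaps in product(skips, repeat=n - 1):
            positions = [start]
            for k in gaps:
                positions.append(positions[-1] + k + 1)
            # Keep the gram only if its last token is still inside the text.
            if positions[-1] < len(tokens):
                grams.append(" ".join(tokens[i] for i in positions))
    return grams


# Toy example: a made-up tweet ending in a row of clapping hands.
toy_tweet = ["bravo", "!", "👏", "👏", "👏", "👏", "👏", "👏", "👏", "👏", "👏", "👏"]
counts = Counter(skipgrams(toy_tweet, n=4, skips=range(2, 5)))
print(counts.most_common(3))
```

Because a long run of the same emoji produces the same skipgram at many start positions and gap combinations, repeated emojis quickly rise to the top of such a count.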


For further investigation, I will not use R, but Python, because I prefer to control directly what is being counted.


Technically

Search command

fund <- search_tweets(".", n=1000, retryonratelimit = TRUE, include_rts=TRUE, lang="de") 

Punctuation mark count

keeping <- c(".","...",",","!","?",":","-",";")

nmatrix <- tokens_select(fund_toks, keeping, selection = "keep")

schau <- dfm(nmatrix)

textstat_frequency(schau)


kwic search 

kontext1 <- kwic(fund_toks, ".", valuetype = "glob", window = 10)

kontext2 <- kwic(fund_toks, pattern= ":", window=10)

kontext3 <- kwic(fund_toks, pattern="...", window=10)


The last command does not return any results. I posted the question on Stack Overflow, without getting a response.

Stackoverflow post
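One possible explanation, not confirmed by the source: the frequency tables list the feature "…", which may be the single character U+2026 (horizontal ellipsis) rather than three full stops, and three typed dots may also have been split into separate "." tokens. In either case a search for the three-character string "..." would find no matching token. The Python sketch below only shows that the two strings differ; whether this is what happened in the kwic call is an assumption.

```python
import unicodedata

single = "\u2026"  # the one-character ellipsis, as it appears in the frequency table
triple = "..."     # three separate full stops, as typed in the kwic pattern

print(unicodedata.name(single))
print(len(single), len(triple))
print(single == triple)

# Compatibility normalisation (NFKC) maps the single character to three dots.
print(unicodedata.normalize("NFKC", single) == triple)
```

If this is the cause, searching for the single character "…" might be worth a try.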







Thursday, October 20, 2022

Doubling Emojis

In Italian tweets, the signs of love rarely appear alone. Love prefers to show up in bigrams. Even trigrams are quite frequent. To be exact, they dominate the (2:4)-grams in 100 documents. Only from the fourth position onwards do we get some hashtags.

[1] "😍"

[1] 100

feature frequency rank

1 😍_😍 147           1

2 😍_😍_😍 106     2

3 😍_😍_😍_😍 71 3


Looking for the same emoji in German tweets: 

tokens_ngrams(n = 2:4)


         feature frequency rank docfreq group

1          😍_😍        26    1      10   all

2       😍_😍_😍        16    2       9   all

3   guten_morgen        15    3      15   all


Looks like a cultural difference. But, anyway, the most frequent bigrams are still these 😍 couples.
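What such a table counts can be sketched in a few lines of Python. This is a minimal illustration under my own assumptions (tokens joined by "_", n from 2 to 4, toy documents invented for the example), not the quanteda implementation:

```python
from collections import Counter


def ngrams(tokens, sizes):
    """All contiguous n-grams of the given sizes, joined with '_'."""
    return [
        "_".join(tokens[i:i + n])
        for n in sizes
        for i in range(len(tokens) - n + 1)
    ]


def frequency_table(docs, sizes=range(2, 5)):
    """Return (feature, frequency, docfreq) rows, most frequent first."""
    freq = Counter()     # every occurrence counts
    docfreq = Counter()  # each document counts at most once per feature
    for tokens in docs:
        grams = ngrams(tokens, sizes)
        freq.update(grams)
        docfreq.update(set(grams))
    return [(gram, count, docfreq[gram]) for gram, count in freq.most_common()]


# Two invented toy documents, already tokenised.
toy_docs = [["😍", "😍", "😍"], ["ciao", "😍", "😍"]]
for row in frequency_table(toy_docs):
    print(row)
```

A tweet with three 😍 in a row already contributes two 😍_😍 bigrams and one trigram, which is why frequency can be much higher than docfreq.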


The general rule could be: 

Searching tweets for emojis, you will get other emojis as most frequent bigrams. Other examples: 


[1] "🤥"

[1] 100

feature frequency rank docfreq group

1 🤥_🤥 165         1     31 all

2 🤥_🤥_🤥 134     2     22 all

3 🤥_🤥_🤥_🤥 112 3 12 all


[1] "😂"

[1] 100

feature frequency rank docfreq

1 😂_😂 113            1     48

2 😂_😂_😂 65         2     32


Laughter ("Rolling ...") comes alone in only 55% of the cases.

[1] "🤣"

[1] 100

feature frequency rank docfreq group

1 🤣_🤣 143           1         45     all

2 🤣_🤣_🤣 96         2         32 all

3 🤣_🤣_🤣_🤣 62     3     13     all

In 100 documents, there are 246 🤣. Looks like an echo.


Again, in German tweets, the tendency is weaker.

       feature frequency rank docfreq group

1        🤣_🤣        78    1      35   all

2     🤣_🤣_🤣        42    2      28   all

3          ._.        38    3      10   all

12        ?_🤣         5   12       5   all

This circumstance will be explored later. 


Anger seems to be contagious as well. Take a look at

[1] "😡"

[1] 100

feature frequency rank docfreq

1 😡_😡 114         1         49

2 ._.         75         2             20

3 😡_😡_😡 65     3         36


And, uhm 

[1] 100

feature frequency rank docfreq group

1 💩_💩 102         1         35 all

2 ._. 73                 2         21 all

3 💩_💩_💩  67     3         27 all


Washing it away:

[1] "💦"

[1] 100

feature frequency rank docfreq

1 💦_💦 137         1     60

2 💦_💦_💦 77     2     39


A rather strange guy:

[1] "👺"

[1] 100

feature frequency rank

1 👺_👺 88         1

2-4 user names

5 👺_👺_👺 37     2

6 :_👺 20             6


Number six is the combination of the tengu or leprechaun with a colon. Consider "!_😍 131"!


We know that punctuation signs, in text messages, behave strangely. They often appear in couples or triples. We are getting used to phenomena like "!!!!". In the meantime, the simple full stop is weakened. What if punctuation signs lost their grammatical meaning, and became emojis in their own right?


A second hint for further research is "evoking emojis". They are not doubled, but complemented by other emojis, seemingly according to certain rules.


The general rule, for now, would be: 

Searching tweets for emojis, you will get other emojis as most frequent bigrams. These are not always the same emojis. 


[1] "🤍" white heart

[1] 100

feature frequency rank

1 😍_😍 130 1

2 😍_😍_😍 106 2

3 😍_😍_😍_😍 82 3


Weaker:

[1] "💛"

[1] 100

feature frequency rank docfreq

1 💛_❤ 23 1 22




Technically

Searches run on 14/10/2022, with n=100 and lang = "italian", on the first 300 emojis of the Emoji Package, simply with a for loop. Statistical methods from Quanteda.


Monday, October 17, 2022

Historical considerations: Real life emojis

At the Inauguration Ceremony of the Academic Year, we are all sitting in the stalls, row after row, all dressed up in black or dark blue, forming long lines of dark characters. Then, four people enter the room and sit down at a desk on stage: the Rector, the Dean, and two Vice Deans. The first one is wearing a red robe, red gloves, and a red hat; the others are dressed in electric blue.

They are seated in the middle ... surrounded by black lines. I would call them living emojis. 

The point is, as Carl Schmitt noticed in his book about Roman Catholicism, that certain figures represent something otherwise not perceptible: the priest remains an individual person but is, during mass, at the same time God.

Rector and Deans are representations of authority and of Science, I suggest, and that is why they have to look like emojis. 

What do emojis stand for? They are not individuals, but in their combinations they come close to individuality. Are they hence representing something like emotion as such? Human warmth amidst black lines?

Sunday, October 9, 2022

Emoji combinations, creating worlds

Doing some Twitter analysis for 🫥🤬🤷‍♀️, a philosopher from Milan who was doing conspiracy theory studies concerning tweets about an Italian "Big Brother" couple, 🦋🐻, I ended up isolating a few recurring emojis in this material, namely 💞, 🍒, 👅, 🐻, 🦋, 🤦‍♀️, 🌶️, 🍀, 🤪, 😂, 😅. Being typical of certain tweets, would they, used without text, produce a series of similar tweets?

A search_tweets with single emojis did not yield anything notable: the average text cosines were, as a first approximation, below .3, and only the cosines between emojis were above .5, as one would expect.

But what about pairs of emojis? If they are used like words or short expressions, their combinations could be significant. 

I simply built the search queries with a nested loop.
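In Python, such a nested loop could look like the sketch below. It only builds the query strings; the actual search call is left out. The list is a subset of the emojis named above, and whether both orders of a pair were searched, or whether the two emojis were joined without a space, is my assumption.

```python
# A subset of the recurring emojis, for illustration only.
emojis = ["💞", "🍒", "👅", "🐻", "🦋"]

queries = []
for first in emojis:
    for second in emojis:
        # Skip doubled emojis; keep both orders of every pair (an assumption).
        if first != second:
            queries.append(first + second)

print(len(queries))
print(queries[:5])
```

Each string in queries would then be passed to the tweet search in turn.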

As the result of a first try (14/09/2022), with n=100, some of the combinations could be eliminated as rarely used, such as 🍒🌶️ (only twice, probably a question of taste). Most of the others produced a lower text cosine, such as 🍒🦋 (n=70, cosine = .23). But what about the 15 tweets for 🍒🤪, with a text cosine of .58?

Seemingly more interesting was the 👅🦋 pair: 78 documents with a text cosine of .52, and a cosine between emojis of .89. Does this mean that certain pairs of emojis produce a coherent text world? Even a social world?

Running a similar search (rtweet: search_tweets) about a month later (09/10/2022), with n=10000, I get 71 tweets with 👅🦋. The similarity between the tweets seems to be lower. But still, the cosine average for all the texts is 0.3721238, the emoji cosine 0.8074077.
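How such a cosine average can be computed is sketched below in Python. This assumes that "cosine average" means the mean cosine similarity over all pairs of documents, each represented by its token counts; for the emoji cosine, the same function would be applied to the emoji tokens only. The function names and toy documents are hypothetical, and the original figures may have been computed differently.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_pairwise_cosine(docs):
    """Average cosine similarity over all pairs of tokenised documents."""
    vectors = [Counter(tokens) for tokens in docs]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Invented toy documents: the emojis overlap, the words do not.
toy_docs = [
    ["ciao", "👅", "🦋"],
    ["amore", "👅", "🦋", "💦"],
    ["👅", "🦋"],
]
print(round(mean_pairwise_cosine(toy_docs), 3))
```

In the toy example the shared emojis alone keep the average high, which is the pattern described above: a high emoji cosine next to a lower text cosine.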

Among the most frequent features, there are no words at all. From here it looks like a closed emoji world.

textstat_frequency(matrixa, n=20)

feature frequency rank docfreq group

1 👅 177 1 71 all

2 🦋 115 2 71 all

3 💦 56 3 33 all

4 💋 56 3 33 all

5 ❤ 29 5 19 all

6 ♥ 28 6 9 all

7 🔥 21 7 10 all

8 🍆 19 8 17 all


What do they mean?

If we follow certain linguistic approaches to semantics, we might say: We will know an emoji by the company it keeps. This could mean the company of other emojis, but also the company of words. A quick topic analysis (LDA), with k=5, results in themes like love ("amore", "bellezza"), but mainly sex, with vulgar words ("figa", "troia", "patatina"), and some user names similar to "@milf43". This has to do with the meaning of "butterfly" in Italian. A month ago it had been used as a personification, now as a sexual metaphor.


A plot of the underlying (?) communication structure with igraph and actorNetwork:


In October, the social world of 👅🦋 is highly centralised (maybe by a bot or a professional ...?), with little interaction between the secondary participants.

An entire world, no: two of them or three are waiting to be explored.



















Historical considerations: Poetry Albums

When we were kids, many girls used to have a "Poesiealbum", a small book where friends (only friends!) had to write something poetical or morally instructive in verse. The handwriting had to be very accurate; at times you saw remnants of pencil lines below the words ...

These writings were often adorned with small images that kids bought from the stationery shop: flowers or couples of children, hearts ...

This German tradition seems to be extinct. 

We still like decorating our writings with images, though.  


Historical considerations: Handwriting

When we were writing by hand, as kids, our ink was blue, and the writing was seen as an expression of our personal qualities. No need for emojis? Seemingly we were trusting our words. Our written representation of thoughts and feelings, or maybe of doubts about the adequacy of language, looked sufficient.

Nowadays our writing takes the form of printed text. No sign of personal qualities can be seen in words composed of Times New Roman characters.

Our hands do not make swinging, continuous movements anymore, only a kind of monotonous pecking. That is the practice, and the noise, where emojis are born. 

Such a hassle: R ndoc()

Within a for loop in R, my ndoc(corpus) did not produce anything. At first I thought my tweet search had not worked, and tried to correct it ... After some time, I had a brilliant idea.

Within the for loop, for a reason that seemed mysterious to me, I have to write print(ndoc(corpus)): R auto-prints the value of an expression only at top level, not inside a loop.

Lost an entire hour. 

Image flow, image row. Understanding emojis? with Vilém Flusser

Little pictures in between We used to send letters to each other. We used to write down, letter after letter, word by word,  what we hoped w...