How to parse a JSON Twitter dataset and not die trying

In June of 2013 my patience was tested when I lost some datasets to a disk failure on a server at the university. As misfortunes never come alone, the back-up server at home crashed at about the same time, with no possibility of recovering the information. My reaction was not like Job's (Twitter gave, and the disk has taken away; blessed be the name of technology), because patience is not a virtue I own. First I cried (a female advantage to let off steam) until my emotional side calmed down, and then my rational side got to work finding a solution. One of the lost datasets was Eurovision-2013. It hurt twice as much because it was a collaboration with a research team at the Complutense University. To keep that research alive, I asked other researchers who had also collected Eurovision data from Twitter, and they generously sent me their dataset in JSON format. It seemed as trivial as converting JSON to CSV, but I spent a whole day on it, because I found several "stones" along the way:

- The JSON file was huge and my laptop doesn't have enough memory to process it in one go, so the only option was to go tweet by tweet; fortunately the tweets were delimited by line breaks.
- Tweets hide more pitfalls than a Chinese movie: inside the text of a tweet you can find anything...
  - Line breaks that split a tweet's JSON record across several lines
  - The lethal ^M (carriage return) of Microsoft, which trips up any parser
  - The backslash (\), which makes the parser choke whenever it is not a valid Unicode escape; many users love to use it in all sorts of ways: \o/ \m/ \@/ etc.
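One way to handle these pitfalls can be sketched in Python (this is my own minimal illustration, not the exact script I wrote back then): read the file line by line so memory stays flat, strip the ^M characters, and keep accumulating lines until the buffer parses as JSON, which reassembles tweets that were split by embedded line breaks. Passing `strict=False` to `json.loads` lets it accept raw control characters inside strings; the function name `iter_tweets` is just a label I chose.

```python
import json

def iter_tweets(path):
    """Yield one tweet dict at a time from a line-delimited JSON file.

    Lines are buffered until the buffer parses as JSON, so a tweet whose
    text contains a literal line break (splitting the record across
    several lines) is stitched back together.
    """
    buffer = ""
    # newline="" disables universal-newline translation, so the ^M
    # (carriage return) characters survive long enough to be stripped.
    with open(path, encoding="utf-8", newline="") as f:
        for line in f:
            buffer += line.replace("\r", "")  # kill the lethal ^M
            try:
                # strict=False tolerates raw control characters
                # (e.g. embedded newlines) inside JSON strings.
                tweet = json.loads(buffer, strict=False)
            except json.JSONDecodeError:
                continue  # incomplete record: keep accumulating lines
            buffer = ""
            yield tweet
```

Because the parse is attempted after every line, a well-formed one-tweet-per-line file costs one `json.loads` per tweet, and only the broken records pay for the retries.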

After writing an exhaustive treatment of the exceptions, I managed to convert the JSON to CSV, and now I can surprise the research team at the Complutense University.
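The final conversion step can be sketched as follows, assuming one tweet per line after cleanup. The column names (`id_str`, `created_at`, `text`) are common Twitter API fields I picked for illustration; adjust them to whatever your analysis needs. The `csv` module quotes embedded commas and newlines, so messy tweet text lands safely in a single cell.

```python
import csv
import json

def json_to_csv(json_path, csv_path,
                fields=("id_str", "created_at", "text")):
    """Convert a line-delimited JSON tweet file to CSV, one tweet per row.

    Records that fail to parse (e.g. split across lines) are skipped;
    a prior cleanup pass should have repaired or removed them.
    """
    with open(json_path, encoding="utf-8", newline="") as src, \
         open(csv_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(fields)  # header row
        for line in src:
            line = line.replace("\r", "").strip()  # drop ^M and padding
            if not line:
                continue
            try:
                tweet = json.loads(line, strict=False)
            except json.JSONDecodeError:
                continue  # skip records broken beyond repair
            writer.writerow(tweet.get(f, "") for f in fields)
```

Opening the output with `newline=""` is deliberate: the `csv` module does its own line-ending handling, and letting Python translate newlines on top of that reintroduces exactly the ^M problem we just cleaned up.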
