Aug 3, 2018

Creating a Twitter Scraper in Elixir - Part 4

At the end of Part 3 we finally got around to parsing out some data from the HTML we retrieved from Twitter. We’ll continue building out our Tweet struct and aim to have all our fields populated by the end of this post.

I’d encourage you to follow along, but if you want to skip directly to the code, it’s available on GitHub at: https://github.com/riebeekn/elixir-twitter-scraper.

Getting started

If you followed along with Part 3, just continue with the code you created there. If not, and you’d rather jump right into Part 4, you can use a clone of Part 3 as a starting point.

Clone the Repo

If grabbing the code from GitHub instead of continuing along from part 3, the first step is to clone the repo.

Terminal
git clone -b part-03 https://github.com/riebeekn/elixir-twitter-scraper twitter_feed
cd twitter_feed
mix deps.get

OK, whether you’ve gotten the code from GitHub or are using the existing code you created in Part 3, let’s get to parsing!

Parsing the rest of our fields

You’ll recall from part 3 we got to the point where we had parsed the display_name field of our struct.

We have 8 more fields to populate, and we’ll be mucking about pretty much exclusively in the Parser module today. Let’s get at it!

Parsing the user_id

The user_id, like many of the fields we want to capture, is an attribute of the .tweet div.

Let’s put together a test.

/test/twitter_feed/parser_test.exs
defmodule TwitterFeed.ParserTest do
  use ExUnit.Case, async: true

  alias TwitterFeed.Parser

  test "parsing of display_name" do
    html_snippet = "<div class=\"tweet\" data-name=\"lola\"></div>"

    assert Parser.parse_display_name(html_snippet) == "lola"
  end

  test "parsing of user_id" do
    html_snippet = "<div class=\"tweet\" data-user-id=\"2\"></div>"

    assert Parser.parse_user_id(html_snippet) == 2
  end
end

Very simple: we’ve added a new test, parsing of user_id, that contains an HTML snippet with the data-user-id attribute and checks the value of the attribute after parsing.

Now for the implementation.

/lib/twitter_feed/parser.ex
  ...

  ##################### PRIVATE FUNCTIONS #####################
  defp parse_tweet(tweet_html) do
    %Tweet{
      display_name: parse_display_name(tweet_html)
    }
  end

  use Publicist
  defp parse_display_name(tweet_html) do
    tweet_html
    |> Floki.attribute("data-name")
    |> hd()
  end

  defp parse_user_id(tweet_html) do
    tweet_html
    |> Floki.attribute("data-user-id")
    |> hd()
    |> String.to_integer()
  end
end

Very similar to parse_display_name. We’re just using Floki to grab the value of the attribute, and then in this case, we convert it to an integer before returning it.
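
One thing worth knowing about this approach: hd() raises when the attribute is missing, and String.to_integer/1 raises when the value isn't a whole number. That is fine for our purposes, but if you'd rather get nil than a crash, a defensive variant might look something like the sketch below. The AttrHelper module and first_integer/1 are hypothetical names, not part of the project; the function takes the list of strings that Floki.attribute returns.

```elixir
defmodule AttrHelper do
  # Hypothetical helper, not part of the project code.
  # Takes the list of attribute values returned by Floki.attribute/2
  # and returns the first one as an integer, or nil if the list is
  # empty or the value is not a whole number.
  def first_integer([value | _rest]) when is_binary(value) do
    case Integer.parse(value) do
      {int, ""} -> int
      _ -> nil
    end
  end

  def first_integer(_), do: nil
end

IO.inspect(AttrHelper.first_integer(["2"]))   # 2
IO.inspect(AttrHelper.first_integer([]))      # nil
IO.inspect(AttrHelper.first_integer(["abc"])) # nil
```

Whether you want the crash or the nil is a judgment call; for a scraper, failing loudly when the markup changes can be the more useful behaviour.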

With that all in place we should now have a new test that passes.

Terminal
mix test

Looking good! We’ll be following a similar pattern for the rest of the fields. The code should be pretty similar to the 2 fields we’ve completed, so explanations will be kept to a minimum from here on out. Also, I suggest you re-run the tests after each field is added; but I’ll be skipping out on that for the sake of brevity.

Parsing the user_name

The user_name is contained within the .tweet div in a data-screen-name attribute.

Let’s add a test.

/test/twitter_feed/parser_test.exs
test "parsing of user_name" do
  html_snippet = "<div class=\"tweet\" data-screen-name=\"SomeUserName\"></div>"

  assert Parser.parse_user_name(html_snippet) == "SomeUserName"
end

And now the implementation.

/lib/twitter_feed/parser.ex
defp parse_user_name(tweet_html) do
  tweet_html
  |> Floki.attribute("data-screen-name")
  |> hd()
end

Parsing the tweet_id

The tweet_id is contained within the .tweet div in a data-tweet-id attribute.

The test.

/test/twitter_feed/parser_test.exs
test "parsing of tweet_id" do
  html_snippet = "<div class=\"tweet\" data-tweet-id=\"1\"></div>"

  assert Parser.parse_tweet_id(html_snippet) == 1
end

The implementation.

/lib/twitter_feed/parser.ex
defp parse_tweet_id(tweet_html) do
  tweet_html
  |> Floki.attribute("data-tweet-id")
  |> hd()
  |> String.to_integer()
end

Parsing the timestamp

The timestamp, unlike most of our fields, is not a direct attribute of the .tweet div. We need to find a span with a class of _timestamp within the .tweet div and then grab its data-time-ms attribute.

The test.

/test/twitter_feed/parser_test.exs
test "parsing of timestamp" do
  html_snippet = "<span class=\"_timestamp\" data-time-ms=\"1519339506000\"></span>"

  assert Parser.parse_timestamp(html_snippet) == "2018-02-22 22:45:06.000Z"
end

The implementation.

/lib/twitter_feed/parser.ex
defp parse_timestamp(tweet_html) do
  tweet_html
  |> Floki.find("._timestamp")
  |> Floki.attribute("data-time-ms")
  |> hd()
  |> String.to_integer()
  |> DateTime.from_unix!(:millisecond)
  |> DateTime.to_string()
end

Not too tricky. We’ve got a few wacky manipulations to get the timestamp into the format we want, but nothing crazy. If you are curious about what some of these functions do, check out the documentation in iex, for example:

Terminal
iex
Terminal
h DateTime.from_unix
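
To see what each step of the timestamp pipeline produces, here is the same conversion run step by step using only the standard library. The input is the value from the timestamp test; nothing here depends on Floki.

```elixir
# The raw attribute value is a string of milliseconds since the Unix epoch.
ms = String.to_integer("1519339506000")

# from_unix!/2 raises on an invalid value;
# from_unix/2 returns {:ok, datetime} or {:error, reason} instead.
datetime = DateTime.from_unix!(ms, :millisecond)

IO.puts(DateTime.to_string(datetime))   # 2018-02-22 22:45:06.000Z
IO.puts(DateTime.to_iso8601(datetime))  # 2018-02-22T22:45:06.000Z
```

The .000 in the output is there because we told from_unix! the value was in milliseconds, so the DateTime keeps millisecond precision.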

Parsing the tweet text

We’ll derive the text_summary from the Tweet text which we find within a p tag with a class of tweet-text.

The test.

/test/twitter_feed/parser_test.exs
test "parsing of tweet_text" do
  html_snippet = "<p class=\"tweet-text\">some text</p>"

  assert Parser.parse_text(html_snippet) == "some text"
end

The implementation.

/lib/twitter_feed/parser.ex
defp parse_text(tweet_html) do
  tweet_html
  |> Floki.find(".tweet-text")
  |> Floki.text()
  |> String.trim()
end

Nothing we haven’t seen before; we’re applying String.trim() at the end to clean up any extraneous whitespace.

Parsing the image_url

The image_url is nested in a div within the .tweet div. The nested div has a class of .AdaptiveMedia-photoContainer and the URL is specified in the data-image-url attribute.

The test.

/test/twitter_feed/parser_test.exs
test "parsing of tweet_image" do
  html_snippet = """
    <div class=\"AdaptiveMedia-photoContainer\"
         data-image-url=\"https://pbs.twimg.com/media/123.jpg\">
    </div>
    """

  assert Parser.parse_image(html_snippet) == "https://pbs.twimg.com/media/123.jpg"
end

The implementation.

/lib/twitter_feed/parser.ex
defp parse_image(tweet_html) do
  tweet_html
  |> Floki.find(".AdaptiveMedia-photoContainer")
  |> Floki.attribute("data-image-url")
  |> Floki.text()
end

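Again, pretty simple. Note that we finish with Floki.text() rather than hd(), so a Tweet without an image should give us an empty string instead of raising an error.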

Parsing the retweet field

In the case of a retweet, the .tweet div will contain a data-retweeter attribute. So we just need to check whether that attribute exists or not.

We need two tests for this field, one to check the correct functionality when the Tweet is not a retweet and another for when it is.

/test/twitter_feed/parser_test.exs
test "parsing of retweeter when it is a retweet" do
  html_snippet = "<div class=\"tweet\" data-retweeter=\"TorontoComms\"></div>"

  assert Parser.parse_is_retweet(html_snippet) == true
end

test "parsing of retweeter when it is not a retweet" do
  html_snippet = "<div class=\"tweet\"></div>"

  assert Parser.parse_is_retweet(html_snippet) == false
end

The implementation.

/lib/twitter_feed/parser.ex
defp parse_is_retweet(tweet_html) do
  tweet_html
  |> Floki.attribute("data-retweeter")
  |> Enum.count() == 1
end
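
The last line of that function can look odd at first glance. The pipe operator binds more tightly than ==, so the whole pipeline is evaluated first and its result is then compared to 1. Here's a small sketch of the same check on plain lists, which stand in for what Floki.attribute returns. The RetweetCheck module is a made-up name for illustration only.

```elixir
defmodule RetweetCheck do
  # Hypothetical stand-in: `values` plays the role of the list
  # returned by Floki.attribute(tweet_html, "data-retweeter").
  def retweet?(values) do
    # Parsed as (values |> Enum.count()) == 1
    values |> Enum.count() == 1
  end
end

IO.inspect(RetweetCheck.retweet?(["TorontoComms"])) # true
IO.inspect(RetweetCheck.retweet?([]))               # false
```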

If you haven’t been running the tests as we’ve been adding the fields, now would be a good time to check that everything is working!

Terminal
mix test

Populating the Struct

The next step is to populate the Tweet structure with the fields we’ve parsed. This is simple: all we need to do is call into our parse functions.

/lib/twitter_feed/parser.ex
##################### PRIVATE FUNCTIONS #####################
defp parse_tweet(tweet_html) do
  %Tweet{
    user_id: parse_user_id(tweet_html),
    user_name: parse_user_name(tweet_html),
    display_name: parse_display_name(tweet_html),
    tweet_id: parse_tweet_id(tweet_html),
    timestamp: parse_timestamp(tweet_html),
    text_summary: parse_text(tweet_html) |> truncate(),
    image_url: parse_image(tweet_html),
    retweet: parse_is_retweet(tweet_html)
  }
end

Notice we also have a truncate/1 function we are calling to create the text summary. We’ll need to create this function, so let’s add a few more tests and an implementation for this.

/test/twitter_feed/parser_test.exs
test "truncation of text that does not exceed 30 chars" do
  assert Parser.truncate("some text") == "some text"
end

test "truncation of text over 30 chars is truncated" do
  text = "This is some text that is 31 ch"

  assert Parser.truncate(text) == "This is some text that is 31 c..."
end

We can see by the tests that we are expecting the truncate/1 function to truncate any text that exceeds 30 characters. It will add 3 trailing periods to the text to indicate it has been truncated, meaning the truncated text will be 33 characters total.

The implementation.

/lib/twitter_feed/parser.ex
defp truncate(text) do
  if String.length(text) > 30 do
    String.slice(text, 0, 30) <> "..."
  else
    text
  end
end

Not too tricky; we’re just using some existing String functions to accomplish the truncation.
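
One detail that is easy to miss: String.length/1 and String.slice/3 count characters (graphemes), not bytes, so Tweets containing accented characters are not cut mid-character. The sketch below repeats the same logic with a max argument so it can be tried with a shorter limit. Both the TruncateDemo module and the max argument are additions for this illustration and are not part of the project.

```elixir
defmodule TruncateDemo do
  # Same logic as truncate/1 above, with the limit made configurable
  # (the `max` argument is an assumption added for this example).
  def truncate(text, max \\ 30) do
    if String.length(text) > max do
      String.slice(text, 0, max) <> "..."
    else
      text
    end
  end
end

IO.puts(TruncateDemo.truncate("some text"))       # some text
IO.puts(TruncateDemo.truncate("Café au lait", 4)) # Café...
```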

Let’s make sure everything is still passing:

Terminal
mix test

Great! Now let’s see how this looks in iex.

Terminal
iex -S mix
Terminal
TwitterFeed.get_tweets("lolagil")

Perfect, things are looking good. Notice the 4th Tweet in the screenshot above is correctly identified as a retweet. The display_name, user_name and user_id fields are all correctly populated with the values for the original author.

Parsing the handle_id

You’ve probably noticed we have yet to parse the handle_id. That’s because it is a little tricky and actually depends on some of the fields we’ve now parsed.

Regardless of whether a Tweet is a retweet or not, the handle_id should always refer to the current feed we are retrieving data from. In the case when the Tweet is not a retweet, the handle_id is the same as the user_id. When it is a retweet, we can parse out the handle_id from an anchor tag that refers to the user who retweeted the Tweet.

Let’s see some tests, hopefully a concrete example will help to clarify what is going on.

/test/twitter_feed/parser_test.exs
test "parsing of handle_id when it is not a retweet" do
  assert Parser.parse_handle_id(false, 123, "some html") == 123
end

test "parsing of handle_id when it is a retweet" do
  html_snippet = """
  <span class=\"js-retweet-text\">
   <a data-user-id=\"19377913\"><b>City of Toronto</b></a> Retweeted
  </span>
  """

  assert Parser.parse_handle_id(true, 123, html_snippet) == 19377913
end

Looking at the call to the parse_handle_id/3 function in the tests we see it takes in the following parameters:

  • A boolean to indicate whether the Tweet is a retweet or not.
  • An integer representing the user_id of the Tweet.
  • The .tweet div for the current Tweet we are processing, just like all our other parse functions.

We can see in the first test where the Tweet is not a retweet, we expect the handle_id to equal the value passed in to the user_id parameter.

In the second test we’re handling a retweet and need to parse the handle_id out of the anchor tag.

The implementation.

/lib/twitter_feed/parser.ex
defp parse_handle_id(false, user_id, _tweet_html) do
  user_id
end

defp parse_handle_id(true, _user_id, tweet_html) do
  tweet_html
  |> Floki.find(".js-retweet-text > a")
  |> Floki.attribute("data-user-id")
  |> hd()
  |> String.to_integer()
end

We’re using pattern matching to determine which version of parse_handle_id/3 to run. The implementations themselves are straightforward.
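
If multi-clause functions are new to you, here's a stripped-down sketch of the same dispatch that runs without Floki. It is a simplification under stated assumptions: HandleIdDemo and pick/3 are hypothetical names, and the third argument is an already extracted retweeter id rather than the tweet HTML.

```elixir
defmodule HandleIdDemo do
  # Hypothetical simplification of parse_handle_id/3: Elixir tries the
  # clauses top to bottom and runs the first one whose arguments match.
  def pick(false, user_id, _retweeter_id), do: user_id
  def pick(true, _user_id, retweeter_id), do: retweeter_id
end

IO.inspect(HandleIdDemo.pick(false, 123, nil))     # 123
IO.inspect(HandleIdDemo.pick(true, 123, 19377913)) # 19377913

# No clause matches a non-boolean first argument, so this raises.
try do
  HandleIdDemo.pick(nil, 123, 456)
rescue
  FunctionClauseError -> IO.puts("no matching clause")
end
```

The upside of this style over an if/else is that an unexpected value fails loudly instead of silently falling into the else branch.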

We should now have a couple more passing tests.

Terminal
mix test

Now let’s see how we can hook this into the Tweet struct.

/lib/twitter_feed/parser.ex
##################### PRIVATE FUNCTIONS #####################
defp parse_tweet(tweet_html) do
  user_id = parse_user_id(tweet_html)
  is_retweet = parse_is_retweet(tweet_html)

  %Tweet{
    handle_id: parse_handle_id(is_retweet, user_id, tweet_html),
    user_id: user_id,
    user_name: parse_user_name(tweet_html),
    display_name: parse_display_name(tweet_html),
    tweet_id: parse_tweet_id(tweet_html),
    timestamp: parse_timestamp(tweet_html),
    text_summary: parse_text(tweet_html) |> truncate(),
    image_url: parse_image(tweet_html),
    retweet: is_retweet
  }
end

So we’ve pulled out the calls to parse_user_id and parse_is_retweet in order to have their return values available when we call into parse_handle_id. We’ve also replaced the function calls we were making within the struct for user_id and retweet with the user_id and is_retweet variables.

Let’s have another look in iex.

Terminal
iex -S mix
Terminal
TwitterFeed.get_tweets("lolagil")

And there we go, the handle_id is now properly populated, both for original and retweeted Tweets.

Summary

Slowly things are starting to come together: all our fields are parsed and being returned via the Tweet structure. In the next installment we’ll switch back to working on the Scraper module.

Thanks for reading and I hope you enjoyed the post!


