Grabbing Twitter Data Using a ForEachAdoLoop and Biml

gravatar

Jon

Modifying the code posted by John Minkjan, this walkthrough shows how to search Twitter for different terms by using a ForEachAdoLoop.

published 11.13.12

last updated 01.01.13


Share

Tags

  • BIML
  • ForEachAdoLoop
  • mist
  • Twitter

John Minkjan recently posted a snippet entitled "BigData:Reading a twitter feed".

It was a great snippet and really shows the power of Biml! I want to take this space and add some features to his code.

With the code right now, if more than one search term is desired, more than one package needs to be created. With the help of a small ForEachAdoLoop element, his code can be transformed from a single search term resource into a Twitter search powerhouse! Below I have instructions on how to make these changes to his code. Before working through this walkthough, please make sure that you have a working copy of his snippet. It can be found here: http://bimlscript.com/Snippet/Details/56

NOTE: I am working with SQL server 2008 on my local machine. If you work through this walkthrough and just copy and paste my code into the code from John it will NOT work as the connections are different. Any connection in John's code named "CnOleDBAdventureWorks2012" had been renamed to "Staging" in mine.

  1. To start off with, we are going to create a table to hold our search terms. Fire up SQL Server Management Studio and create a new table in the same database that dbo.TwitterLog resides in. You can use the following code to create the database:

    CREATE TABLE dbo.SearchTerms (
        SearchTermId bigint IDENTITY(1,1)PRIMARY KEY,
        SearchTerm nvarchar(140) NULL
    )
    
  2. Add the different search terms you wish to use to the table in the SearchTerm column.

  3. The code from John can be changed to allow looping through the table and searching Twitter for each term.

a. In the connection element remove the one OleDbConnection entry and replace it with the following two entries:

<OleDbConnection Name="Config" ConnectionString="Data Source=localhost;Initial Catalog=TwitterSearch;Provider=SQLNCLI10.1;Integrated Security=SSPI;Connect Timeout=30;"/>
<OleDbConnection Name="Staging" ConnectionString="Data Source=localhost;Initial Catalog=TwitterSearch;Provider=SQLNCLI10.1;Integrated Security=SSPI;Connect Timeout=30;"/>

Explanation: The "Config" connection will be used to connect to the SearchTerm table and grab the terms to be searched. The "Staging" connection will be used to put the data grabbed from Twitter into the TwitterLog table.

b. Next, navigate down in the code to the Twitter.Reader Package element and replace the tag with the following code:

<Variables>
    <Variable Name="SearchTermsCollection" DataType="Object"></Variable>
</Variables>
<Tasks>
    <ExecuteSQL Name="Get Search Terms From Database" ConnectionName="Config" ResultSet="Full">
        <DirectInput>SELECT SearchTerm from SearchTerms</DirectInput>
        <Results>
            <Result Name="0" VariableName="User.SearchTermsCollection" />
        </Results>
    </ExecuteSQL>
    <ForEachAdoLoop ConstraintMode="Linear" Name="Iterate Search Terms" SourceVariableName="User.SearchTermsCollection">
        <Variables>

Explanation: This code adds an object that the search terms from the SearchTerms table are added to. These terms are then looped through in the ForEachAdoLoop element that is near the end of the pasted code.

c. To make the SearchTerm variable pull the search term from the SearchTermsCollection object add the following code after the Variables element that contains the LastTweetNumber and SearchTerm variables:

<VariableMappings>
    <VariableMapping Name="0" VariableName="User.SearchTerm" />
</VariableMappings>

Explanation: The code now continues on as normal by using the SearchTerm variable to query twitter for new tweets.

d. The last thing to add is the closing tags for the ForEachAdoLoop and the new Tasks loop we created. To do this, add the following code before the tag near the end of the code:

    </ForEachAdoLoop>
</Tasks>
  1. Go ahead and generate and build the SSIS package. In a matter of seconds the twitter data related to the search terms in the SearchTerms table will be pulled into the TwtitterLog database.

Biml is a powerful language and can really simplify and shorten complex tasks such as grabbing twitter data using multiple search terms.

In case you are wanting the entire piece of code I used, here it is:

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <Annotations>
        <Annotation>
            File: Twitter.Reader.Looping.Search.biml
            Demo on reading a twitter feed using multiple search terms
            Enviroment:
                DB: MS-SQL2008
                BIML: BIDSHelper v1.6, Mist 2.0, Mist 3.0
            Code copied and modified from original source: John Minkjan biml101.blogspot.com
        </Annotation>
        <Annotation>
            Search Terms Table Script:

            CREATE TABLE [dbo].[SearchTerms] (
                [SearchTermId] [bigint] IDENTITY(1,1)PRIMARY KEY,
                [SearchTerm] [nvarchar](140) NULL
            )

            Twitter Log Table Script:

            CREATE TABLE [dbo].[TwitterLog](
                [TweetNumber] [bigint] NULL,
                [TweetID] [nvarchar](128) NULL,
                [PublishedDateTime] [nvarchar](50) NULL,
                [UpdatedDateTime] [nvarchar](50) NULL,
                [TweetContent] [nvarchar](2048) NULL,
                [TweetTitle] [nvarchar](2048) NULL,
                [TweetURI] [nvarchar](2048) NULL,
                [TweeterName] [nvarchar](2048) NULL,
                [TweetLanguage] [nvarchar](128) NULL,
                [TweetSource] [nvarchar](128) NULL,
                [TweetResultType] [nvarchar](128) NULL,
                [TweetGeoInfo] [nvarchar](128) NULL,
                [TweeterImageLink] [nvarchar](128) NULL,
                [TweetLink] [nvarchar](128) NULL,
                [SearchTerm] [nvarchar](128) NULL
            )
        </Annotation>
    </Annotations>
    <Connections>
        <OleDbConnection Name="Config" ConnectionString="Data Source=localhost;Initial Catalog=TwitterSearch;Provider=SQLNCLI10.1;Integrated Security=SSPI;Connect Timeout=30;"/>
        <OleDbConnection Name="Staging" ConnectionString="Data Source=localhost;Initial Catalog=TwitterSearch;Provider=SQLNCLI10.1;Integrated Security=SSPI;Connect Timeout=30;"/>
    </Connections>
    <ScriptProjects>
        <ScriptComponentProject Name ="SCS_Twitter_Feed">
            <AssemblyReferences>
                <AssemblyReference AssemblyPath="Microsoft.SqlServer.DTSPipelineWrap" />
                <AssemblyReference AssemblyPath="Microsoft.SqlServer.DTSRuntimeWrap" />
                <AssemblyReference AssemblyPath="Microsoft.SqlServer.PipelineHost" />
                <AssemblyReference AssemblyPath="Microsoft.SqlServer.TxScript" />
                <AssemblyReference AssemblyPath="System.dll" />
                <AssemblyReference AssemblyPath="System.AddIn.dll" />
                <AssemblyReference AssemblyPath="System.Data.dll" />
                <AssemblyReference AssemblyPath="System.Xml.dll" />
            </AssemblyReferences>
            <OutputBuffers>
                <OutputBuffer Name="Output0" IsSynchronous="false">
                    <Annotations>
                        <Annotation>
                            IsSynchronous="false" ==>
                            When there are no non-synchronus buffers,
                            the compiler does not emit the CreateNewOutputRows
                            virtual base method (to match the BIDS/SSDT behavior)
                        </Annotation>
                    </Annotations>
                    <Columns>
                        <Column Name="TweetNumber" DataType ="Int64"></Column>
                        <Column Name="TweetID" DataType="String" Length="128"></Column>
                        <Column Name="PublishedDateTime" DataType="String" Length="50"></Column>
                        <Column Name="UpdatedDateTime" DataType="String" Length="50"></Column>
                        <Column Name="TweetContent" DataType="String" Length="2048"></Column>
                        <Column Name="TweetTitle" DataType="String" Length="2048"></Column>
                        <Column Name="TweetURI" DataType="String" Length="2048"></Column>
                        <Column Name="TweeterName" DataType="String" Length="2048"></Column>
                        <Column Name="TweetLanguage" DataType="String" Length="128"></Column>
                        <Column Name="TweetSource" DataType="String" Length="128"></Column>
                        <Column Name="TweetResultType" DataType="String" Length="128"></Column>
                        <Column Name="TweetGeoInfo" DataType="String" Length="128"></Column>
                        <Column Name="TweeterImageLink" DataType="String" Length="128"></Column>
                        <Column Name="TweetLink" DataType="String" Length="128"></Column>
                    </Columns>
                </OutputBuffer>
            </OutputBuffers>
            <ReadOnlyVariables>
                <Variable Namespace="User" VariableName="LastTweetNumber" DataType="Int64"></Variable>
                <Variable Namespace="User" VariableName="SearchTerm" DataType="String"></Variable>
            </ReadOnlyVariables>
            <Files>
                <File Path="AssemblyInfo.cs">
using System.Reflection;
using System.Runtime.CompilerServices;

[assembly: AssemblyTitle("SCS_Twitter_Feed")]
[assembly: AssemblyDescription("")]
[assembly: AssemblyConfiguration("")]
[assembly: AssemblyCompany("")]
[assembly: AssemblyProduct("SCS_Twitter_Feed")]
[assembly: AssemblyCopyright("Copyright @  2012")]
[assembly: AssemblyTrademark("")]
[assembly: AssemblyCulture("")] 

[assembly: AssemblyVersion("1.0.*")]
                </File>
                <File Path="main.cs">
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.Xml;
using System.Web;
using System.Net;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    public override void CreateNewOutputRows()
    {
        System.Xml.XmlDocument xml_doc = new System.Xml.XmlDocument();
        string TwitterUri = "http://search.twitter.com/search.atom?q=%23" 
            + this.Variables.SearchTerm 
            +  <![CDATA["&rpp=100&result_type=recent&since_id="]]>
            + this.Variables.LastTweetNumber ;
        xml_doc = GetResponse(TwitterUri);

        XmlNodeList child_nodes = xml_doc.GetElementsByTagName("entry");
        string[] temp;

        foreach (XmlNode child in child_nodes)
        {
            Output0Buffer.AddRow();
            Output0Buffer.TweetID = child.ChildNodes.Item(0).InnerText;
            temp = child.ChildNodes.Item(0).InnerText.Split(':');
            Output0Buffer.TweetNumber = Convert.ToInt64(temp[2]);
            Output0Buffer.PublishedDateTime= child.ChildNodes.Item(1).InnerText;
            Output0Buffer.TweetLink = child.ChildNodes.Item(2).Attributes["href"].Value.ToString();
            Output0Buffer.TweetTitle = child.ChildNodes.Item(3).InnerText;
            Output0Buffer.TweetContent =child.ChildNodes.Item(4).InnerText;
            Output0Buffer.UpdatedDateTime = child.ChildNodes.Item(5).InnerText;
            Output0Buffer.TweeterImageLink = child.ChildNodes.Item(6).Attributes["href"].Value.ToString();
            Output0Buffer.TweetGeoInfo = child.ChildNodes.Item(7).InnerText;
            Output0Buffer.TweetResultType = child.ChildNodes.Item(8).ChildNodes.Item(0).InnerText;
            Output0Buffer.TweetSource = child.ChildNodes.Item(9).InnerText;
            Output0Buffer.TweetLanguage= child.ChildNodes.Item(10).InnerText;
            Output0Buffer.TweeterName = child.ChildNodes.Item(11).ChildNodes.Item(0).InnerText;
            Output0Buffer.TweetURI = child.ChildNodes.Item(11).ChildNodes.Item(1).InnerText;
        }
    }

    public XmlDocument GetResponse(string uri)
    {
        XmlDocument doc = new XmlDocument();

        WebRequest myRequest = WebRequest.Create(new Uri(uri));
        IWebProxy proxy = myRequest.Proxy;
        if (proxy != null)
        {
            proxy.GetProxy(myRequest.RequestUri);
        }

        doc.Load(myRequest.GetResponse().GetResponseStream());

        return doc;
    }
}
                </File>
            </Files>
        </ScriptComponentProject>
    </ScriptProjects>
    <Packages>
        <Package Name ="Twitter.Reader.Tut" ConstraintMode="Linear">
            <Variables>
                <Variable Name="SearchTermsCollection" DataType="Object"></Variable>
            </Variables>
            <Tasks>
                <ExecuteSQL Name="Get Search Terms From Database" ConnectionName="Config" ResultSet="Full">
                    <DirectInput>SELECT SearchTerm from SearchTerms</DirectInput>
                    <Results>
                        <Result Name="0" VariableName="User.SearchTermsCollection" />
                    </Results>
                </ExecuteSQL>
                <ForEachAdoLoop ConstraintMode="Linear" Name="Iterate Search Terms" SourceVariableName="User.SearchTermsCollection">
                    <Variables>
                        <Variable Name ="LastTweetNumber" DataType="Int64" >0</Variable>
                        <Variable Name="SearchTerm" DataType="String"></Variable>
                    </Variables>
                    <VariableMappings>
                        <VariableMapping Name="0" VariableName="User.SearchTerm" />
                    </VariableMappings>
                    <Tasks>
                        <ExecuteSQL Name ="EST Get Last TweetNumber" ConnectionName="Staging" ResultSet="SingleRow">
                            <DirectInput> SELECT isnull(max([TweetNumber]),0) FROM [dbo].[TwitterLog] where [SearchTerm] =?   </DirectInput>
                            <Parameters>
                                <Parameter Name="0" DataType="String" Length="128" Direction ="Input" VariableName ="User.SearchTerm"></Parameter>
                            </Parameters>
                            <Results>
                                <Result Name="0" VariableName="User.LastTweetNumber"></Result>
                            </Results>
                        </ExecuteSQL>
                        <Dataflow Name ="DFT Get Tweets">
                            <Transformations>
                                <ScriptComponentSource Name="SCS Twitter Feed">
                                    <ScriptComponentProjectReference ScriptComponentProjectName="SCS_Twitter_Feed"></ScriptComponentProjectReference>
                                </ScriptComponentSource>
                                <DerivedColumns Name="DC SearchTerm">
                                    <Columns>
                                        <Column Name="SearchTerm" DataType ="String" Length ="128">@[User::SearchTerm]</Column>
                                    </Columns>
                                </DerivedColumns>
                                <OleDbDestination Name="ODD dbo TwitterLog" ConnectionName="Staging">
                                    <ExternalTableOutput Table="[dbo].[TwitterLog]"></ExternalTableOutput>
                                </OleDbDestination>
                            </Transformations>
                        </Dataflow>
                    </Tasks>
                </ForEachAdoLoop>
            </Tasks>
        </Package>
    </Packages>
</Biml>
You are not authorized to comment. A verification email has been sent to your email address. Please verify your account.

Comments

gravatar

Kenneth

1:01pm 04.18.13

Is it just me or? But i get an access forbidden (403) an second search.

gravatar

Paul S. Waters

4:14pm 04.29.13

Hi Kenneth,

I tried to reproduce your error and I was not able to. I put Biml and SSIS in the Search Terms table and got results back for both terms.

Paul

gravatar

Samuel Vanga

8:49pm 09.25.13

Doesn't work for me too. First, it says "the URL is gone, 410 error" so i changed the TwitterUri to "https://twitter.com/search?q=%23" thinking maybe because of Twitter API changes recently. After changing that, it throws an error about not able to parse the element at line 46.

Any help would be greatly appreciated! Thanks!

gravatar

David Dye1

6:07pm 11.27.13

I originally received the "the URL is gone, 410 error" error also. After a bit of research I modified the script task url to "https://twitter.com/search?q=%40twitterapi". This successfully ran until the doc.Load method. I am not getting an error "An error occurred while parsing EntityName. Line 37, position 111."

gravatar

Scott Currie

2:10am 03.14.14

I think this is all related to the twitter API changes. We'll double back on this sample and update it to work with the latest APIs.

gravatar

Ed

9:46am 11.13.14

This is due to the change to adopt OAuth authentication for its REST API:

(https://dev.twitter.com/rest/public "https://dev.twitter.com/rest/public")